Build a complete ML pipeline that trains a classifier on the Iris dataset, evaluates performance, and saves the model for reuse.

What You’ll Build

A script that:

  1. Loads and explores data
  2. Splits into train/test sets
  3. Builds a preprocessing + model pipeline
  4. Trains and evaluates with cross-validation
  5. Saves the best model to disk

Setup

  pip install scikit-learn pandas matplotlib seaborn joblib
  

Step 1: Load and Explore

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target
df["species_name"] = df["species"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

print(df.head())
print(df["species_name"].value_counts())
print(df.describe())
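
As an alternative to building the DataFrame by hand, `load_iris` accepts `as_frame=True` and returns pandas objects directly. The sketch below assumes a scikit-learn version that supports that argument; the names `iris_frame` and `df_alt` are illustrative and not part of the tutorial script.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris_frame = load_iris(as_frame=True)
df_alt = iris_frame.frame  # four feature columns plus a "target" column

# Map integer labels to names using the dataset's own target_names,
# rather than a hard-coded dictionary
label_to_name = {i: str(name) for i, name in enumerate(iris_frame.target_names)}
df_alt["species_name"] = df_alt["target"].map(label_to_name)

# Quick sanity checks: shape, missing values, class balance
print(df_alt.shape)
print(df_alt.isna().sum().sum())
print(df_alt["species_name"].value_counts())
```

Checking for missing values and class balance up front is cheap, and it tells you whether a stratified split and plain accuracy are reasonable choices later on.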
  

Step 2: Visualize

import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, hue="species_name")
plt.savefig("iris_pairplot.png")
plt.show()
  

Step 3: Build Pipeline

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
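
Before tuning anything, it can help to get a baseline score for the untuned pipeline. The following is a minimal sketch using `cross_val_score` on the training split only; it rebuilds the data and pipeline so that it runs by itself, and `baseline_scores` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Each fold refits the whole pipeline, so the scaler only ever sees
# that fold's training portion and nothing leaks from the held-out part.
baseline_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print(f"Baseline CV accuracy: {baseline_scores.mean():.3f} "
      f"+/- {baseline_scores.std():.3f}")
```

Keeping the scaler inside the pipeline is what makes this safe: scaling the full training set first and then cross-validating would let validation-fold statistics influence the preprocessing.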
  

Step 4: Hyperparameter Tuning

param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 5, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(f"Test score: {grid.score(X_test, y_test):.3f}")
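
The best parameters alone do not show how close the other candidates were. One way to look at the whole search, sketched here under the same grid as above, is to load `grid.cv_results_` into a DataFrame. The block repeats the setup so it is self-contained; `summary` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 5, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# cv_results_ is a dict of arrays with one entry per parameter combination
results = pd.DataFrame(grid.cv_results_)
columns = [
    "param_classifier__n_estimators",
    "param_classifier__max_depth",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

On a dataset this small, several combinations often score within one standard deviation of each other, in which case the simpler model is usually a reasonable pick.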
  

Step 5: Evaluate

from sklearn.metrics import classification_report, confusion_matrix

y_pred = grid.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))
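
To see how the numbers in the classification report relate to the confusion matrix, the sketch below derives per-class precision and recall by hand and checks them against scikit-learn. The label arrays are made up for illustration and are not results from the tutorial's model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

true_positives = np.diag(cm)
# Column sums count how often each class was predicted
precision = true_positives / cm.sum(axis=0)
# Row sums count how many samples truly belong to each class
recall = true_positives / cm.sum(axis=1)

for label, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")

# Cross-check against scikit-learn's own per-class metrics
assert np.allclose(precision, precision_score(y_true, y_hat, average=None))
assert np.allclose(recall, recall_score(y_true, y_hat, average=None))
```

Reading the matrix this way makes it easy to spot which pair of classes is being confused.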
  

Step 6: Save and Load Model

import joblib

joblib.dump(grid.best_estimator_, "iris_model.pkl")

loaded = joblib.load("iris_model.pkl")
sample = [[5.1, 3.5, 1.4, 0.2]]
prediction = loaded.predict(sample)
print(f"Predicted species: {iris.target_names[prediction[0]]}")
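
A pickled model is generally only safe to load with the same scikit-learn version that produced it, and the saved file above does not carry the class names. One possible approach, sketched here, is to save a small bundle alongside the model. The dictionary keys and the file name `iris_bundle.pkl` are illustrative choices, not a scikit-learn convention, and a plain pipeline stands in for `grid.best_estimator_` so the block runs by itself.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Stand-in for grid.best_estimator_ from the tutorial
model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
]).fit(iris.data, iris.target)

# Hypothetical bundle layout: model plus the metadata needed to use it
bundle = {
    "model": model,
    "target_names": [str(name) for name in iris.target_names],
    "feature_names": list(iris.feature_names),
    "sklearn_version": sklearn.__version__,
}

# Written to a temporary directory here to avoid cluttering the project
bundle_path = os.path.join(tempfile.mkdtemp(), "iris_bundle.pkl")
joblib.dump(bundle, bundle_path)

restored = joblib.load(bundle_path)
if restored["sklearn_version"] != sklearn.__version__:
    print("Warning: model was saved with a different scikit-learn version")

pred = restored["model"].predict([[5.1, 3.5, 1.4, 0.2]])[0]
print(f"Predicted species: {restored['target_names'][pred]}")
```

With the names stored in the bundle, code that loads the model no longer needs to call `load_iris()` just to translate a class index into a label.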
  

Step 7: Prediction Function

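def predict_species(sepal_length, sepal_width, petal_length, petal_width):
    """Return the predicted species name for one flower (measurements in cm)."""
    # Reloads the model on every call and relies on `iris` from Step 1 for names
    model = joblib.load("iris_model.pkl")
    features = [[sepal_length, sepal_width, petal_length, petal_width]]
    pred = model.predict(features)[0]
    return iris.target_names[pred]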

print(predict_species(5.1, 3.5, 1.4, 0.2))  # setosa
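
The function above reads the model from disk on every call. A variant that loads it once and also reports the predicted class probability might look like the sketch below. `get_model`, `predict_with_confidence`, and `MODEL_PATH` are hypothetical names, and a stand-in model is trained and saved to a temporary path so the example runs without the earlier steps.

```python
import os
import tempfile
from functools import lru_cache

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# Stand-in for the tutorial's iris_model.pkl, so this block is self-contained
MODEL_PATH = os.path.join(tempfile.mkdtemp(), "iris_model.pkl")
joblib.dump(
    RandomForestClassifier(random_state=42).fit(iris.data, iris.target),
    MODEL_PATH,
)


@lru_cache(maxsize=1)
def get_model():
    """Load the model from disk on first use and reuse it afterwards."""
    return joblib.load(MODEL_PATH)


def predict_with_confidence(sepal_length, sepal_width, petal_length, petal_width):
    """Return (species name, predicted probability) for one flower."""
    features = [[sepal_length, sepal_width, petal_length, petal_width]]
    probabilities = get_model().predict_proba(features)[0]
    best = probabilities.argmax()  # index of the most likely class
    return str(iris.target_names[best]), float(probabilities[best])


name, confidence = predict_with_confidence(5.1, 3.5, 1.4, 0.2)
print(f"{name} ({confidence:.2f})")
```

Caching matters most when the function is called repeatedly, for example from a CLI loop or a web endpoint.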
  

Bonus Challenges

  1. Try different algorithms (SVM, Gradient Boosting) and compare
  2. Use a real-world dataset from Kaggle
  3. Build a CLI that accepts measurements and returns predictions (CLI Apps)
  4. Wrap the model in a FastAPI endpoint (REST API Project)
  5. Add feature importance visualization
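
As a possible starting point for the first challenge, the sketch below cross-validates three classifiers inside the same scaling pipeline. The candidate set and their default hyperparameters are arbitrary choices for illustration; a fair comparison would tune each one.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Untuned candidates, chosen only for illustration
candidates = {
    "random_forest": RandomForestClassifier(random_state=42),
    "svm": SVC(),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

comparison = {}
for name, estimator in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("classifier", estimator)])
    # Compare on the training split only; the test set stays untouched
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
    comparison[name] = scores.mean()

for name, mean_score in sorted(comparison.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>18}: {mean_score:.3f}")
```

Selecting a model on cross-validation scores and evaluating only the winner on the test set keeps the final test score an honest estimate.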

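This project walks through a standard supervised-learning workflow: load and explore the data, split it, build a preprocessing and model pipeline, tune with cross-validation, evaluate on a held-out test set, and save the model for reuse.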