
📖 Pipeline Automation¶

🎯 Goal of This Notebook
🔁 Sklearn Pipelines
🧪 MLflow Basics
📊 Tracking Experiments
🏛️ Model Registry Concepts
🧬 Reproducibility Tips


🎯 Goal of This Notebook¶

Why This Notebook Exists¶

As projects grow, manual steps become bottlenecks. This notebook focuses on making your workflows repeatable, trackable, and production-friendly using pipelines and experiment tracking.

You’ll learn to:

  • Use sklearn.pipeline and ColumnTransformer to cleanly chain steps.
  • Track experiments with tools like MLflow, including metrics and artifacts.
  • Compare different model versions in a structured way.
  • Register models with basic metadata to support staging and promotion.

We’ll keep things lightweight but realistic — aligned with how tools like Azure ML, SageMaker, and Vertex AI expect models to be trained and tracked.

Back to the top


🔁 Sklearn Pipelines¶

🧱 Pipeline and ColumnTransformer¶

Cleanly Chain Your Workflow¶

sklearn.pipeline.Pipeline lets you chain multiple steps — like imputation, scaling, and modeling — into one object.

Use ColumnTransformer to apply different preprocessing to numeric vs categorical columns. This helps:

  • Avoid repetitive code
  • Ensure consistent transformations during train and predict
  • Make your model portable and testable

⚙️ Fit → Transform → Predict Flow¶

What Happens Internally¶

When you call pipeline.fit(X, y):

  1. ColumnTransformer.fit() learns scalers / encoders
  2. transform() is applied to produce the processed matrix
  3. The estimator (e.g. LogisticRegression) is trained

Later, calling pipeline.predict(X_new) will:

  • Apply the exact same transforms from training
  • Use the trained model to generate predictions

This avoids manual mismatches and keeps inference consistent.

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Dummy data
df = pd.DataFrame({
    "age": [25, 32, 47],
    "gender": ["M", "F", "M"],
    "purchased": [0, 1, 1]
})

X = df[["age", "gender"]]
y = df["purchased"]

# Define transformations
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["gender"])
])

# Define pipeline
clf = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression())
])

clf.fit(X, y)
Out[1]:
Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['age']),
                                                 ('cat', OneHotEncoder(),
                                                  ['gender'])])),
                ('classifier', LogisticRegression())])

🧪 Avoiding Leakage (fit vs transform scope)¶

The #1 Pipeline Mistake¶

Never fit preprocessing steps outside the pipeline. If you do:

  • Scalers will learn from the full dataset (including test data)
  • Encoders might see categories that don’t exist in real input
  • Model metrics will look better than they really are

Always fit everything inside the pipeline using only training data. The pipeline will then safely handle unseen data during inference.
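
As a minimal sketch of the difference (reusing the iris data that appears later in this notebook purely for illustration), note that only where fit() happens changes:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=42)

# Leaky pattern (don't do this): fitting the scaler on ALL rows lets
# test-set statistics bleed into training.
# scaler = StandardScaler().fit(X_demo)
# X_train_scaled = scaler.transform(X_train)

# Safe pattern: the scaler lives inside the pipeline and is fit only on
# the training fold; predict/score reuse the learned statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))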

Back to the top


🧪 MLflow Basics¶

🧭 What MLflow Tracks¶

High-Level Tracking Capabilities¶

MLflow helps you track and organize model experiments. Each run can log:

  • Parameters – hyperparameters, configs
  • Metrics – accuracy, loss, AUC, etc.
  • Artifacts – models, plots, files
  • Source info – code version, environment, tags

It gives you a web UI to explore results and compare runs — locally or on managed platforms like Databricks, Azure ML, or Amazon SageMaker.

⚙️ Tracking Parameters, Metrics, Artifacts¶

The Core MLflow Logging API¶

Use mlflow.start_run() to begin tracking, then log everything during training. You can log:

  • mlflow.log_param("learning_rate", 0.01)
  • mlflow.log_metric("accuracy", 0.92)
  • mlflow.sklearn.log_model(model, "model")
  • Any file using mlflow.log_artifact("plot.png")

This creates a reproducible record of each experiment.

In [5]:
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

with mlflow.start_run():
    model.fit(X, y)
    acc = model.score(X, y)
    
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model")
2025/06/30 13:25:17 WARNING mlflow.models.model: `artifact_path` is deprecated. Please use `name` instead.
2025/06/30 13:25:19 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.

🗂️ Local vs Remote Storage Options¶

Where Are Your Runs Saved?¶

By default, MLflow saves runs in a local directory like mlruns/. This is fine for solo work but won’t scale.

Options:

  • Local filesystem – Default, quick and dirty.
  • S3 / GCS / Azure Blob – Use these for artifact storage.
  • Remote tracking server – Use with databases (PostgreSQL, MySQL) to centralize logs.
  • Managed MLflow – Available via Databricks or Azure ML.

Set the MLFLOW_TRACKING_URI environment variable (or call mlflow.set_tracking_uri()) to control where runs are recorded; artifact locations are configured per experiment or via the tracking server's --default-artifact-root.
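
A minimal sketch of pointing the client at a remote setup (the server URL and bucket below are placeholders, not real endpoints):

import mlflow

# Send runs to a remote tracking server instead of ./mlruns (placeholder URL).
mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")

# Artifact locations are set per experiment (or by the server's
# --default-artifact-root); the bucket name here is a placeholder,
# and creating the experiment is only needed once.
mlflow.create_experiment(
    "churn_experiments",
    artifact_location="s3://my-bucket/mlflow-artifacts",
)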

Back to the top


📊 Tracking Experiments¶

🏷️ Naming Runs & Experiments¶

Keep Your Logs Human-Readable¶

Use mlflow.set_experiment() to group related runs under a common name (e.g., "xgboost_baseline").

You can also label individual runs with mlflow.start_run(run_name="run_01") so they’re easy to identify in the UI.

This becomes essential when you’re running dozens of variations across time or teammates.
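
A small sketch of both calls together (the experiment name, run name, and logged values are just examples):

import mlflow

mlflow.set_experiment("xgboost_baseline")      # groups related runs

with mlflow.start_run(run_name="run_01"):      # readable label in the UI
    mlflow.log_param("max_depth", 6)           # placeholder values
    mlflow.log_metric("auc", 0.91)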

🧼 Organizing Output¶

Store the Right Things, Not Everything¶

You don’t need to log your entire disk — just:

  • Final metrics (accuracy, AUC, etc.)
  • Key parameters (learning rate, max_depth, seed)
  • Final model file
  • A summary plot (like a confusion matrix or ROC curve)

If you log manually, keep a simple structure such as artifacts/plots/ for images and artifacts/metrics.json for numbers.
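
For example, mlflow.log_dict() can write a metrics summary straight to an artifact path, and mlflow.log_artifact() accepts an artifact_path argument so plots land in a tidy subfolder (the file name and values below are placeholders):

import mlflow

with mlflow.start_run():
    # Metrics summary logged directly as a JSON artifact.
    mlflow.log_dict({"accuracy": 0.92, "auc": 0.95}, "metrics/metrics.json")
    # A plot saved earlier in the run (assumed to already exist on disk).
    mlflow.log_artifact("roc_curve.png", artifact_path="plots")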

🔍 Comparing Model Results¶

Spot Patterns Across Experiments¶

Use the MLflow UI or search API to:

  • Sort runs by metrics (e.g., accuracy descending)
  • Compare hyperparameter impacts (e.g., max_depth vs AUC)
  • Filter by tags or run name

This lets you move beyond guesswork and select the best-performing, best-documented run for promotion or deployment.
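
The same comparison can be scripted with the search API; a sketch, assuming the experiment name and metric/param keys used above:

import mlflow

runs = mlflow.search_runs(
    experiment_names=["xgboost_baseline"],     # assumed experiment name
    filter_string="metrics.accuracy > 0.9",    # keep only strong runs
    order_by=["metrics.accuracy DESC"],        # best first
)
print(runs[["run_id", "params.max_depth", "metrics.accuracy"]].head())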

Back to the top


🏛️ Model Registry Concepts¶

🔁 Lifecycle States (Staging, Prod, Archived)¶

What a Model Registry Actually Does¶

A model registry helps manage model versions like software:

  • Staging: Under evaluation or testing
  • Production: Actively used for serving
  • Archived: Deprecated or superseded

Instead of using filenames like model_v7_final_FINAL.joblib, use a registry to track:

  • Who trained it
  • What data it used
  • When it was promoted

Available in tools like:

  • MLflow Model Registry
  • SageMaker Model Registry
  • Vertex AI Model Management

🧪 Promoting + Loading Registered Models¶

Go From Best Run → Production Model¶

Once you select the best-performing run:

  1. Register the model:
    mlflow.register_model("runs:/<run_id>/model", "MyModel")

  2. Transition it to a lifecycle stage:
    client.transition_model_version_stage(...)

  3. Load by name + stage (instead of path):
    mlflow.pyfunc.load_model("models:/MyModel/Production")

This avoids hardcoding file paths or manually copying models across environments.
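
Putting the three steps together with MlflowClient (the run id, model name, and version are placeholders, and newer MLflow releases favor aliases over stages, so treat this as a sketch):

import mlflow
from mlflow.tracking import MlflowClient

# 1. Register the best run's model (run id left as a placeholder).
result = mlflow.register_model("runs:/<run_id>/model", "MyModel")

# 2. Promote that version; archive whatever was in Production before.
client = MlflowClient()
client.transition_model_version_stage(
    name="MyModel",
    version=result.version,
    stage="Production",
    archive_existing_versions=True,
)

# 3. Load by name + stage anywhere, with no hardcoded file paths.
model = mlflow.pyfunc.load_model("models:/MyModel/Production")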

Back to the top


🧬 Reproducibility Tips¶

🧪 Seed Setting + Random State¶

Make Results Stable Across Runs¶

Randomness affects model training (e.g., train/test splits, shuffling, weight init). To make your experiments reproducible:

  • Set random_state in sklearn, XGBoost, etc.
  • Use np.random.seed() and random.seed() for general randomness
  • For PyTorch or TensorFlow, use their dedicated seed APIs

This ensures your metrics are repeatable and fair when comparing runs.
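
A small helper along these lines is common (the name set_seeds is just a convention, not a library API):

import random
import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Seed the global RNGs used by Python and NumPy."""
    random.seed(seed)
    np.random.seed(seed)

set_seeds(42)

# sklearn objects take their own random_state for full determinism, e.g.:
# train_test_split(X, y, test_size=0.2, random_state=42)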

🗂️ Artifact Logging for Reuse¶

Log More Than Just the Model¶

Log any file that might help you or others rerun or understand your results:

  • config.yaml used for training
  • Data schema (e.g., input columns, dtypes)
  • Feature importance plots
  • Version info (requirements.txt, conda.yaml)
  • Evaluation summary or confusion matrix

MLflow lets you attach these as artifacts via mlflow.log_artifact(). This enables traceability even 6 months later when someone asks “what did we ship?”
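
A sketch of attaching this context alongside the model from the earlier MLflow cell (config.yaml and requirements.txt are assumed to exist in the working directory):

import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="logreg_with_context"):
    mlflow.sklearn.log_model(model, "model")
    # Environment and config files (assumed to exist locally).
    mlflow.log_artifact("requirements.txt")
    mlflow.log_artifact("config.yaml")
    # Data schema logged directly as a JSON artifact.
    mlflow.log_dict({"columns": ["age", "gender"], "target": "purchased"}, "schema.json")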

Back to the top