As projects grow, manual steps become bottlenecks. This notebook focuses on making your workflows repeatable, trackable, and production-friendly using pipelines and experiment tracking.
You’ll learn to:
- Use sklearn.pipeline.Pipeline and ColumnTransformer to cleanly chain preprocessing and modeling steps
- Track experiments with MLflow so runs are reproducible and comparable

We’ll keep things lightweight but realistic, aligned with how tools like Azure ML, SageMaker, and Vertex AI expect models to be trained and tracked.
Pipeline and ColumnTransformer

sklearn.pipeline.Pipeline lets you chain multiple steps, such as imputation, scaling, and modeling, into one object.
Use ColumnTransformer to apply different preprocessing to numeric vs. categorical columns. This helps:
- Keep every transformation in one object that is fitted together with the model
- Apply the right preprocessing to each column type (e.g., scaling for numeric, encoding for categorical)
When you call pipeline.fit(X, y):
1. ColumnTransformer.fit() learns the scalers / encoders from the training data
2. transform() is applied to produce the processed feature matrix
3. The final estimator (LogisticRegression) is trained on that matrix

Later, calling pipeline.predict(X_new) will:
1. Apply the already-fitted transformations to X_new
2. Pass the result to the trained classifier
This avoids manual mismatches and keeps inference consistent.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
import pandas as pd
# Dummy data
df = pd.DataFrame({
"age": [25, 32, 47],
"gender": ["M", "F", "M"],
"purchased": [0, 1, 1]
})
X = df[["age", "gender"]]
y = df["purchased"]
# Define transformations
preprocessor = ColumnTransformer(transformers=[
("num", StandardScaler(), ["age"]),
("cat", OneHotEncoder(), ["gender"])
])
# Define pipeline
clf = Pipeline(steps=[
("preprocessing", preprocessor),
("classifier", LogisticRegression())
])
clf.fit(X, y)
Pipeline(steps=[('preprocessing', ColumnTransformer(transformers=[('num', StandardScaler(), ['age']), ('cat', OneHotEncoder(), ['gender'])])), ('classifier', LogisticRegression())])
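Once fitted, the pipeline applies the same learned preprocessing to new raw data automatically. A quick sketch, with a made-up new row:

# New raw data in the same format as the training frame
X_new = pd.DataFrame({"age": [30], "gender": ["F"]})

# "age" is scaled and "gender" one-hot encoded with the already-fitted
# transformers before the classifier predicts
print(clf.predict(X_new))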
Never fit preprocessing steps outside the pipeline. If you do:
- Statistics (means, encodings, etc.) may be computed on data the model should never see, leaking test information into training
- Inference data may be transformed differently than training data, causing train/serve skew

Always fit everything inside the pipeline using only training data. The pipeline will then safely handle unseen data during inference.
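A minimal sketch of the difference, reusing the clf pipeline from above (X_demo and y_demo are made-up data for illustration):

from sklearn.model_selection import train_test_split

# Made-up data for illustration
X_demo = pd.DataFrame({
    "age": [25, 32, 47, 51, 29, 38, 45, 23],
    "gender": ["M", "F", "M", "F", "F", "M", "F", "M"],
})
y_demo = pd.Series([0, 1, 1, 1, 0, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# Risky (don't): a scaler fitted on ALL rows leaks test statistics into training
# scaler = StandardScaler().fit(X_demo[["age"]])

# Safe: all fitting happens inside the pipeline, on the training rows only
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))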
MLflow helps you track and organize model experiments. Each run can log:
- Parameters (hyperparameters, config values)
- Metrics (accuracy, loss, etc.)
- Models and other artifacts (plots, data files)
It gives you a web UI to explore results and compare runs — locally or on cloud platforms like Databricks, Azure ML, or SageMaker Studio Lab.
Use mlflow.start_run() to begin tracking, then log everything during training. You can log:
mlflow.log_param("learning_rate", 0.01)
mlflow.log_metric("accuracy", 0.92)
mlflow.sklearn.log_model(model, "model")
mlflow.log_artifact("plot.png")
This creates a reproducible record of each experiment.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
with mlflow.start_run():
    model.fit(X, y)
    acc = model.score(X, y)
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model")
2025/06/30 13:25:17 WARNING mlflow.models.model: `artifact_path` is deprecated. Please use `name` instead.
2025/06/30 13:25:19 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.
By default, MLflow saves runs in a local directory like mlruns/. This is fine for solo work but won’t scale.

Options:
- A shared MLflow tracking server your team can reach
- A managed backend such as Databricks, Azure ML, or SageMaker

Set the MLFLOW_TRACKING_URI and MLFLOW_ARTIFACT_URI env variables to configure where runs go.
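You can also set the destination in code. A minimal sketch, assuming a tracking server is already running at the (hypothetical) URL:

import mlflow

# Point runs at a shared tracking server instead of the local mlruns/ folder
mlflow.set_tracking_uri("http://mlflow.example.com:5000")  # hypothetical URL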
Use mlflow.set_experiment() to group related runs under a common name (e.g., "xgboost_baseline"). You can also label individual runs using with mlflow.start_run(run_name="run_01") so they’re easy to identify in the UI.
This becomes essential when you’re running dozens of variations across time or teammates.
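Putting both together, a minimal sketch (the parameter and metric values are made up):

import mlflow

mlflow.set_experiment("xgboost_baseline")  # groups related runs

with mlflow.start_run(run_name="run_01"):  # labels this run in the UI
    mlflow.log_param("max_depth", 6)       # made-up example values
    mlflow.log_metric("auc", 0.87)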
You don’t need to log your entire disk. Log just:
- The config files and scripts that produced the run
- Key plots and evaluation outputs
- Environment files (requirements.txt, conda.yaml)

Use folders like artifacts/plots/ and artifacts/metrics.json to keep things clean if logging manually.
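For example, assuming those local files and folders already exist:

import mlflow

with mlflow.start_run():
    mlflow.log_artifact("artifacts/metrics.json")                   # single file
    mlflow.log_artifacts("artifacts/plots", artifact_path="plots")  # whole folder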
Use the MLflow UI or search API to:
- Filter runs by parameters or tags
- Sort runs by a metric
- Compare runs side by side

This allows you to move beyond guesswork and select the best-performing, best-documented run for promotion or deployment.
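For example, with the search API (the experiment name and metric threshold here are assumptions):

import mlflow

# Returns a pandas DataFrame, one row per matching run
runs = mlflow.search_runs(
    experiment_names=["xgboost_baseline"],
    filter_string="metrics.accuracy > 0.9",
    order_by=["metrics.accuracy DESC"],
)
print(runs[["run_id", "metrics.accuracy"]].head())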
A model registry helps manage model versions like software:
- Models get numbered versions instead of ad-hoc filenames
- Versions move through lifecycle stages (e.g., Staging, Production, Archived)
- Each version links back to the run that produced it

Instead of using filenames like model_v7_final_FINAL.joblib, use a registry to track:
- Which version is currently in production
- Who registered each version, and when
- The run, metrics, and parameters behind it

Available in tools like:
- MLflow Model Registry
- Azure ML, SageMaker, and Vertex AI
Once you select the best-performing run:

1. Register the model:
   mlflow.register_model("runs:/<run_id>/model", "MyModel")
2. Transition it to a lifecycle stage:
   client.transition_model_version_stage(...)
3. Load by name + stage (instead of path):
   mlflow.pyfunc.load_model("models:/MyModel/Production")
This avoids hardcoding file paths or manually copying models across environments.
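End to end, that flow looks roughly like this (a sketch; <run_id> is a placeholder for the run you want to promote):

import mlflow
from mlflow.tracking import MlflowClient

# 1. Register the logged model under a name
result = mlflow.register_model("runs:/<run_id>/model", "MyModel")

# 2. Move that version to a lifecycle stage
client = MlflowClient()
client.transition_model_version_stage(
    name="MyModel", version=result.version, stage="Production"
)

# 3. Load it anywhere by name + stage, not by file path
model = mlflow.pyfunc.load_model("models:/MyModel/Production")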
Randomness affects model training (e.g., train/test splits, shuffling, weight init). To make your experiments reproducible:
- Set random_state in sklearn, XGBoost, etc.
- Use np.random.seed() and random.seed() for general randomness

This ensures your metrics are repeatable and fair when comparing runs.
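A common pattern is to set every seed once at the top of your script:

import random
import numpy as np

SEED = 42  # any fixed value works

random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy-based shuffling, sampling, etc.

# ...then pass random_state=SEED explicitly to estimators and splitters,
# e.g. LogisticRegression(random_state=SEED), train_test_split(..., random_state=SEED)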
Log any file that might help you or others rerun or understand your results:
- The config.yaml used for training
- Environment files (requirements.txt, conda.yaml)

MLflow lets you attach these as artifacts via mlflow.log_artifact(). This enables traceability even 6 months later when someone asks “what did we ship?”
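For example, assuming these files sit next to your training script:

import mlflow

with mlflow.start_run():
    mlflow.log_artifact("config.yaml")       # training configuration
    mlflow.log_artifact("requirements.txt")  # environment snapshot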