In practice, most models never make it beyond the notebook. The goal here is to bridge the gap between local experimentation and a reusable, portable model artifact that can be shared, deployed, or tested.
This notebook focuses on structuring the project, serializing the model, pinning dependencies, wrapping inference in a CLI script, containerizing with Docker, and smoke-testing the result.
We'll keep it tool-agnostic where possible, but also reference cloud platforms such as SageMaker, Azure ML, and Vertex AI, plus tools like MLflow, where relevant.
A clean folder structure helps collaborators and downstream systems understand where things live and how to run them. Here's a minimal but effective layout:
project_name/
├── data/ # input datasets (gitignored)
├── notebooks/ # exploratory analysis, scratch work
├── src/ # reusable scripts and modules
│ ├── train.py # training logic
│ └── predict.py # inference logic
├── models/ # serialized model files (gitignored)
├── config/ # config files (YAML/JSON)
├── requirements.txt # or environment.yml
├── Dockerfile # container definition (optional)
└── README.md
This layout aligns with what cloud platforms like SageMaker, Azure ML, and Vertex AI expect for upload or integration.
Use .gitignore to Keep the Repo Clean
You don't want to bloat your repository or leak sensitive files. Add the following to your .gitignore:
# Ignore local data and model artifacts
data/
models/
# Ignore virtual environments
.env/
.venv/
# Ignore system files
.DS_Store
.ipynb_checkpoints/
# Ignore large logs or output dumps
logs/
For experiment tracking, artifacts should be logged to external storage (e.g., S3, GCS, or MLflow artifact store), not committed directly to the repo.
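For example, here is a minimal sketch of logging the serialized model to an MLflow run instead of committing it (assumes mlflow is installed; the run name and parameter are illustrative, and the model file is the one saved later in this notebook):
import mlflow

# Log the serialized model file to the MLflow artifact store
# rather than checking it into git (uses the default local ./mlruns store).
with mlflow.start_run(run_name="xgb-baseline"):
    mlflow.log_param("model_type", "xgboost")
    mlflow.log_artifact("models/xgb_model.joblib")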
Once a model is trained, it needs to be saved in a portable format so it can be reused later for prediction. This is known as serialization or model persistence.
Common Python options:
- joblib – optimized for NumPy arrays; preferred for sklearn models.
- pickle – general-purpose but less robust for large numerical objects.
- cloudpickle – supports more complex Python objects (e.g., custom classes, lambdas).

Tool preference:
- joblib for sklearn-style models
- cloudpickle for custom pipelines or advanced workflows

!pip install "numpy<2"
Requirement already satisfied: numpy<2 in /Users/ashrithreddy/anaconda3/lib/python3.11/site-packages (1.26.4)
import os
import joblib
import xgboost as xgb
import numpy as np
# Ensure directory exists
os.makedirs("models", exist_ok=True)
# Generate dummy data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])
# Train XGBoost model
model = xgb.XGBClassifier(use_label_encoder=False, eval_metric="logloss")
model.fit(X, y)
# Save model
joblib.dump(model, "models/xgb_model.joblib")
['models/xgb_model.joblib']
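If the artifact includes objects that plain pickle or joblib handle poorly, cloudpickle is the alternative mentioned above. A small sketch (the lambda preprocessing step and file name are illustrative):
import cloudpickle

# Bundle the trained model with a lambda preprocessing step;
# lambdas are exactly the kind of object plain pickle cannot serialize.
bundle = {"preprocess": lambda x: x + 1, "model": model}

with open("models/custom_pipeline.pkl", "wb") as f:
    cloudpickle.dump(bundle, f)

# Reload to confirm the round trip works
with open("models/custom_pipeline.pkl", "rb") as f:
    restored = cloudpickle.load(f)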
Saving just the .joblib file isn't enough. Always save associated metadata like the model version, the git commit it was trained from, the training date, the feature names, and key evaluation metrics.
Store this in a sidecar JSON file or use a model registry (like MLflow) to manage it.
This becomes critical when multiple models are trained and deployed — especially in tools like Vertex AI Model Registry, Azure ML Model Registry, or SageMaker Model Packages.
import os
import json
# Create the models directory if it doesn't exist
os.makedirs("models", exist_ok=True)
# Metadata dictionary
metadata = {
"model_version": "v1.0",
"git_commit": "a1b2c3d",
"trained_on": "2025-06-29",
"features": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
"accuracy": 0.96
}
# Save to disk
with open("models/logreg_model_metadata.json", "w") as f:
json.dump(metadata, f, indent=2)
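At inference time, the same sidecar file can be read back to sanity-check inputs. A quick sketch using the file written above:
import json

# Load the sidecar metadata and inspect what the model expects
with open("models/logreg_model_metadata.json") as f:
    meta = json.load(f)

print("Model version:", meta["model_version"])
print("Expected features:", meta["features"])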
requirements.txt vs environment.yml
- requirements.txt: used with pip. Simple, flat list of packages.
- environment.yml: used with conda. Supports channels, dependencies, and environment naming.

💡 Cloud platforms like Vertex AI, SageMaker, and Azure ML all accept both formats, but often prefer requirements.txt for Docker-based custom containers.
If you're using pip:
pip freeze > requirements.txt
If you're using conda:
conda env export > environment.yml
Without version pinning, your training and deployment environments may drift over time, leading to subtle bugs.
Best practices:
- Use pip freeze to pin exact versions for production environments.
- Use virtual environments (venv, conda, or poetry) for isolation.
- Use lockfiles (Pipfile.lock, poetry.lock) to enforce deterministic builds.
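For illustration, a pinned requirements.txt for this project might look like the following (the versions shown are examples, not recommendations):
joblib==1.3.2
numpy==1.26.4
pandas==2.1.4
scikit-learn==1.4.2
xgboost==1.7.6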
Notebooks are great for exploration, but production workflows need standalone scripts. Converting your training logic into a train.py script allows it to be run from the command line, scheduled, containerized, and reused outside the notebook. Use tools like nbconvert or manually modularize key logic from the notebook. A minimal structure for train.py is sketched below, followed by the reproducibility and logging snippets you would typically pull out of the notebook.
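Here is a minimal train.py sketch based on the XGBoost example above (the --output argument and the dummy data are placeholders for real loading logic):
import argparse
import os

import joblib
import numpy as np
import xgboost as xgb


def main():
    parser = argparse.ArgumentParser(description="Train an XGBoost classifier.")
    parser.add_argument("--output", type=str, default="models/xgb_model.joblib",
                        help="Where to save the trained model")
    args = parser.parse_args()

    # Replace this dummy data with real feature/label loading
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([0, 0, 1, 1, 1])

    model = xgb.XGBClassifier(eval_metric="logloss")
    model.fit(X, y)

    os.makedirs(os.path.dirname(args.output), exist_ok=True)
    joblib.dump(model, args.output)
    print(f"Model saved to {args.output}")


if __name__ == "__main__":
    main()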
import numpy as np
import random
import os

# Fix the common sources of randomness so training runs are reproducible
def set_seed(seed=42):
    np.random.seed(seed)
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
import logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s | %(levelname)s | %(message)s"
)
logging.info("Training started")
2025-06-30 12:58:26,540 | INFO | Training started
For serious use, consider loggers like:
- logging – Python's built-in module (used above)
- structlog – for structured logs
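As a quick sketch of what structlog usage might look like (assumes structlog is installed; the event name and fields here are illustrative):
import structlog

# Emit key/value pairs instead of free-text messages
log = structlog.get_logger()
log.info("training_started", model="xgboost", n_samples=5)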
argparse Interface
Command-line interfaces (CLIs) make your model usable without needing a notebook or UI. This is useful for batch scoring, automation, and running the model inside containers or cloud jobs.
We'll use Python's built-in argparse to capture input paths and model paths.
import argparse
import sys
# Simulate CLI args
sys.argv = ['predict.py', '--input', 'models/sample.csv', '--model', 'models/xgb_model.joblib']
parser = argparse.ArgumentParser(description="Run inference using a trained model.")
parser.add_argument('--input', type=str, required=True, help="Path to input CSV or JSON")
parser.add_argument('--model', type=str, default='models/logreg_model.joblib', help="Path to model file")
args = parser.parse_args()
The script should parse its arguments, load the model, read the input data, run predictions, and print or write the results.
This structure is usable in most cloud platforms (e.g., GCP AI Platform, SageMaker Processing Jobs).
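Putting those pieces together, a full predict.py might look like the following sketch (same arguments as above; the printed output format is an assumption, not a fixed contract):
import argparse

import joblib
import pandas as pd


def main():
    parser = argparse.ArgumentParser(description="Run inference using a trained model.")
    parser.add_argument("--input", type=str, required=True, help="Path to input CSV")
    parser.add_argument("--model", type=str, default="models/xgb_model.joblib",
                        help="Path to model file")
    args = parser.parse_args()

    # Load the serialized model and the input data, then predict
    model = joblib.load(args.model)
    data = pd.read_csv(args.input)
    preds = model.predict(data)
    print("Predictions:", preds.tolist())


if __name__ == "__main__":
    main()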
import pandas as pd
import os
# Make sure the current directory is writeable and file is saved
sample = pd.DataFrame({"feature": [1, 2, 3, 4, 5]})
sample.to_csv("models/sample.csv", index=False)
# Confirm it's created
assert os.path.exists("models/sample.csv"), "models/sample.csv was not created."
import joblib
import pandas as pd
# Load model
model = joblib.load(args.model)
# Load input data
data = pd.read_csv(args.input)
# Predict
preds = model.predict(data)
# Output predictions
print("Predictions:", preds.tolist())
Predictions: [1, 1, 1, 1, 1]
Docker lets you bundle code, libraries, and dependencies into a single image that runs the same way everywhere — laptop, server, or cloud.
Think of it like a portable box that contains your code, the Python runtime, your dependencies, and your model files.
Cloud tools like AWS SageMaker, Vertex AI, and Azure ML all support Docker images for custom training or inference jobs.
# Dockerfile: save this content in a file named "Dockerfile" at the project root

# Use an official lightweight Python image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Copy files into the image (models/ already contains sample.csv and the .joblib file)
COPY requirements.txt .
COPY predict.py .
COPY models/ models/

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Run prediction script by default
ENTRYPOINT ["python", "predict.py"]
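To build and run the image locally (the image tag model-predict is arbitrary; arguments after the image name are passed to predict.py via the ENTRYPOINT):
docker build -t model-predict .
docker run model-predict --input models/sample.csv --model models/xgb_model.joblib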
Even if your model is correct, your CLI wrapper might break due to issues like bad file paths, missing arguments, or environment mismatches.
Write a basic test script that runs predict.py end-to-end on sample data and checks that it exits cleanly and prints predictions.
These tests can be run with pytest, unittest, or even just shell scripts.
# test_predict.py (example structure)
import subprocess
def test_predict_runs():
result = subprocess.run(
["python", "predict.py", "--input", "models/sample.csv", "--model", "models/xgb_model.joblib"],
capture_output=True,
text=True
)
assert result.returncode == 0
assert "Predictions" in result.stdout
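For example, with pytest installed, the test can be run from the project root:
pytest test_predict.py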
Before shipping the model anywhere, confirm that the saved model loads, the sample input parses, and predictions run end-to-end without errors.
This should be part of your dev workflow before Dockerizing or deploying.
import joblib
import pandas as pd
model = joblib.load("models/xgb_model.joblib")
df = pd.read_csv("models/sample.csv")
print("Prediction output:", model.predict(df).tolist())
Prediction output: [1, 1, 1, 1, 1]