
📖 Model Packaging¶

🎯 Goal of This Notebook
🧱 Project Structure & Folder Hygiene
🧠 Model Serialization
🗂️ Dependency Management
🧪 Reproducible Training Script
🧰 CLI for Prediction
🐳 Introduction to Docker
🧪 Testing the Packaged Model


🎯 Goal of This Notebook¶

In practice, most models never make it beyond the notebook. The goal here is to bridge the gap between local experimentation and a reusable, portable model artifact that can be shared, deployed, or tested.

This notebook focuses on:

  • Structuring your project so it’s clean and production-friendly.
  • Saving models in a consistent, interoperable format.
  • Managing dependencies for reproducibility.
  • Writing simple scripts for training and prediction.
  • Introducing containerization (Docker) for consistency across environments.

We’ll keep it tool-agnostic where possible, but also reference:

  • AWS SageMaker, Google AI Platform, Azure ML — for how they expect model artifacts and inputs to be packaged
  • joblib, cloudpickle, and argparse — for local scripts
  • Docker — for packaging code into containers

Back to the top


🧱 Project Structure & Folder Hygiene¶

📁 Suggested Layout¶

Organizing Your Project for Reuse¶

A clean folder structure helps collaborators and downstream systems understand where things live and how to run them. Here's a minimal but effective layout:

project_name/
├── data/               # input datasets (gitignored)
├── notebooks/          # exploratory analysis, scratch work
├── src/                # reusable scripts and modules
│   ├── train.py        # training logic
│   └── predict.py      # inference logic
├── models/             # serialized model files (gitignored)
├── config/             # config files (YAML/JSON)
├── requirements.txt    # or environment.yml
├── Dockerfile          # container definition (optional)
└── README.md

This layout aligns with what cloud platforms like SageMaker, Azure ML, and Vertex AI expect for upload or integration.

🧼 What to Keep Out of Version Control¶

Use .gitignore to Keep the Repo Clean¶

You don’t want to bloat your repository or leak sensitive files. Add the following to your .gitignore:

# Ignore local data and model artifacts
data/
models/

# Ignore virtual environments
.env/
.venv/

# Ignore system files
.DS_Store
.ipynb_checkpoints/

# Ignore large logs or output dumps
logs/

For experiment tracking, artifacts should be logged to external storage (e.g., S3, GCS, or MLflow artifact store), not committed directly to the repo.
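
As a minimal sketch of that pattern, assuming MLflow is installed and configured (by default it logs to a local ./mlruns directory), logging the serialized model as a run artifact might look like this; the run name and metric value are illustrative:

import mlflow

# Logs the model file to the run's artifact store (local ./mlruns, S3, GCS, ...)
with mlflow.start_run(run_name="xgb-training"):
    mlflow.log_metric("accuracy", 0.96)
    mlflow.log_artifact("models/xgb_model.joblib")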

Back to the top


🧠 Model Serialization¶

💾 Saving Models (joblib, pickle, cloudpickle)¶

Why Model Serialization Matters¶

Once a model is trained, it needs to be saved in a portable format so it can be reused later for prediction. This is known as serialization or model persistence.

Common Python options:

  • joblib – optimized for NumPy arrays; preferred for sklearn models.
  • pickle – general-purpose but less robust for large numerical objects.
  • cloudpickle – supports more complex Python objects (e.g., custom classes, lambdas).

Tool preference:

  • joblib for sklearn-style models
  • cloudpickle for custom pipelines or advanced workflows
In [1]:
!pip install "numpy<2"
Requirement already satisfied: numpy<2 in /Users/ashrithreddy/anaconda3/lib/python3.11/site-packages (1.26.4)
In [2]:
import os
import joblib
import xgboost as xgb
import numpy as np

# Ensure directory exists
os.makedirs("models", exist_ok=True)

# Generate dummy data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])

# Train XGBoost model
model = xgb.XGBClassifier(use_label_encoder=False, eval_metric="logloss")
model.fit(X, y)

# Save model
joblib.dump(model, "models/xgb_model.joblib")
Out[2]:
['models/xgb_model.joblib']
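
The joblib example above covers sklearn-style estimators. When the artifact includes objects that plain pickle handles poorly (lambdas, closures, locally defined classes), cloudpickle is an option. A minimal sketch, bundling the model trained above with a hypothetical lambda preprocessing step:

import cloudpickle

# A lambda-based transform: something plain pickle cannot serialize
preprocess = lambda df: df * 2

# Bundle the preprocessing function and the trained model into one artifact
bundle = {"preprocess": preprocess, "model": model}
with open("models/xgb_bundle.pkl", "wb") as f:
    cloudpickle.dump(bundle, f)

# Later: restore both pieces from the same file
with open("models/xgb_bundle.pkl", "rb") as f:
    loaded = cloudpickle.load(f)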

🧪 Versioning + Metadata Storage¶

Add Context to Every Saved Model¶

Saving just the .joblib file isn’t enough. Always save associated metadata like:

  • Model version
  • Git commit hash
  • Training date
  • Feature schema
  • Evaluation metrics

Store this in a sidecar JSON file or use a model registry (like MLflow) to manage it.

This becomes critical when multiple models are trained and deployed — especially in tools like Vertex AI Model Registry, Azure ML Model Registry, or SageMaker Model Packages.

In [3]:
import os
import json

# Create the models directory if it doesn't exist
os.makedirs("models", exist_ok=True)

# Metadata dictionary (values below are illustrative placeholders)
metadata = {
    "model_version": "v1.0",
    "git_commit": "a1b2c3d",
    "trained_on": "2025-06-29",
    "features": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
    "accuracy": 0.96
}

# Save sidecar metadata next to the serialized model
with open("models/xgb_model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

Back to the top


🗂️ Dependency Management¶

🧾 requirements.txt vs environment.yml¶

Two Styles of Declaring Dependencies¶
  • requirements.txt: Used with pip. Simple, flat list of packages.
  • environment.yml: Used with conda. Supports channels, dependencies, and environment naming.

💡 Cloud platforms like Vertex AI, SageMaker, and Azure ML all accept both formats — but often prefer requirements.txt for Docker-based custom containers.

If you're using pip:

pip freeze > requirements.txt

If you're using conda:

conda env export > environment.yml
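
For illustration, a minimal pinned environment.yml for this project might look like the following (the versions are placeholders; pin whatever your environment actually resolves to):

name: project_name
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy=1.26
  - pandas
  - xgboost
  - joblib
  - pip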

🔐 Deterministic Environments (pip freeze, lock files)¶

Why Just Listing Packages Isn’t Enough¶

Without version pinning, your training and deployment environments may drift over time, leading to subtle bugs.

Best practices:

  • Use pip freeze to pin exact versions for production environments.
  • Use virtual environments (e.g., venv, conda, or poetry) for isolation.
  • Consider lock files (Pipfile.lock, poetry.lock) to enforce deterministic builds.

Tooling options:

  • pip-tools or Poetry if you want extra control (a pip-tools sketch follows this list).
  • In production: use Docker to freeze the entire environment, not just Python dependencies.
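
A sketch of the pip-tools workflow mentioned above, assuming pip-tools is installed: keep a hand-written requirements.in listing only top-level packages, then compile and sync a fully pinned requirements.txt from it.

pip-compile requirements.in -o requirements.txt   # resolves and pins all transitive dependencies
pip-sync requirements.txt                         # makes the active environment match the pinned file exactly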

Back to the top


🧪 Reproducible Training Script¶

📜 Convert Notebook to Script¶

From Experiment to Reusable Code¶

Notebooks are great for exploration, but production workflows need standalone scripts. Converting your training logic into a train.py script allows:

  • Easier automation
  • Integration with CI/CD tools
  • Better version control

Minimal structure for train.py:

  • Load and validate data
  • Preprocess features
  • Train model
  • Evaluate and save model
  • Save metadata (e.g., metrics, timestamp, git hash)

Use a tool like jupyter nbconvert to export the notebook to a script, or manually move the key logic from the notebook into modules.
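
A skeleton along these lines, reusing the XGBoost example from earlier (the file layout, argument names, and metadata fields are illustrative, not prescriptive):

# train.py: minimal reproducible training entry point
import argparse
import datetime
import json

import joblib
import pandas as pd
import xgboost as xgb

def main():
    parser = argparse.ArgumentParser(description="Train and save a model.")
    parser.add_argument("--data", required=True, help="Path to training CSV")
    parser.add_argument("--target", default="label", help="Name of the target column")
    parser.add_argument("--out", default="models/xgb_model.joblib", help="Where to save the model")
    args = parser.parse_args()

    # Load and split data
    df = pd.read_csv(args.data)
    X, y = df.drop(columns=[args.target]), df[args.target]

    # Train and evaluate
    model = xgb.XGBClassifier(eval_metric="logloss")
    model.fit(X, y)
    train_accuracy = float(model.score(X, y))

    # Save the model plus a sidecar metadata file
    joblib.dump(model, args.out)
    metadata = {
        "trained_on": datetime.date.today().isoformat(),
        "n_rows": len(df),
        "train_accuracy": train_accuracy,
    }
    with open(args.out.replace(".joblib", "_metadata.json"), "w") as f:
        json.dump(metadata, f, indent=2)

if __name__ == "__main__":
    main()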

🔁 Add Seed Control + Logging¶

Making Training Repeatable and Traceable¶
In [4]:
import numpy as np
import random
import os

def set_seed(seed=42):
    np.random.seed(seed)
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
In [5]:
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s"
)

logging.info("Training started")
2025-06-30 12:58:26,540 | INFO | Training started

For production use, consider more robust logging options such as:

  • Python’s built-in logging
  • structlog for structured, key-value logs (a short sketch follows this list)
  • Cloud-native options like AWS CloudWatch, Azure Monitor, or GCP Logging
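
A quick sketch of the structlog style, assuming structlog is installed; each keyword argument is emitted as a structured field rather than folded into a free-form message string:

import structlog

log = structlog.get_logger()
# Key-value pairs become structured fields in the log event
log.info("training_started", model_version="v1.0", n_rows=5)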

Back to the top


🧰 CLI for Prediction¶

🧑‍💻 argparse Interface¶

Why Build a CLI?¶

Command-line interfaces (CLIs) make your model usable without needing a notebook or UI. This is useful for:

  • Quick local testing
  • Batch predictions via cron jobs or scripts
  • Integrating with CI/CD workflows

We'll use Python’s built-in argparse to capture input paths and model paths.

In [14]:
import argparse
import sys

# Simulate CLI args
sys.argv = ['predict.py', '--input', 'models/sample.csv', '--model', 'models/xgb_model.joblib']

parser = argparse.ArgumentParser(description="Run inference using a trained model.")
parser.add_argument('--input', type=str, required=True, help="Path to input CSV or JSON")
parser.add_argument('--model', type=str, default='models/xgb_model.joblib', help="Path to model file")

args = parser.parse_args()

📦 Load Model → Accept Input → Return Output¶

A Minimal End-to-End Prediction Script¶

The script should:

  • Load the trained model from disk
  • Accept a structured input file (CSV or JSON)
  • Return predictions to the console or save them to a file

This structure is usable in most cloud platforms (e.g., GCP AI Platform, SageMaker Processing Jobs).

In [15]:
import pandas as pd
import os

# Create a small sample input file for the prediction example
sample = pd.DataFrame({"feature": [1, 2, 3, 4, 5]})
sample.to_csv("models/sample.csv", index=False)

# Confirm it's created
assert os.path.exists("models/sample.csv"), "models/sample.csv was not created."
In [16]:
import joblib
import pandas as pd

# Load model
model = joblib.load(args.model)

# Load input data
data = pd.read_csv(args.input)

# Predict
preds = model.predict(data)

# Output predictions
print("Predictions:", preds.tolist())
Predictions: [1, 1, 1, 1, 1]
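
To cover the "save them to a file" option mentioned above, one small addition continuing from the cell above (the output path is just an example):

# Optionally persist predictions alongside the other artifacts
pd.DataFrame({"prediction": preds}).to_csv("models/predictions.csv", index=False)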

Back to the top


🐳 Introduction to Docker¶

📦 What Is Docker (In Plain English)¶

Why ML Engineers Love It¶

Docker lets you bundle code, libraries, and dependencies into a single image that runs the same way everywhere — laptop, server, or cloud.

Think of it like a portable box that contains:

  • Your Python environment
  • Your model files
  • Any OS-level packages or dependencies
  • Instructions on how to run your code

Cloud tools like AWS SageMaker, Vertex AI, and Azure ML all support Docker images for custom training or inference jobs.

🧱 Writing a Simple Dockerfile¶

Minimal Dockerfile to Serve a Model¶

This Dockerfile assumes you have:

  • A requirements.txt with your Python dependencies
  • A predict.py file for running inference
  • A models/ directory containing the serialized model and the sample.csv created above
In [17]:
# Save the following content in a file named "Dockerfile" at the project root

# # Use an official lightweight Python image
# FROM python:3.10-slim

# # Set working directory
# WORKDIR /app

# # Copy project files into the image
# COPY requirements.txt .
# COPY predict.py .
# COPY models/ models/   # includes the serialized model and sample.csv

# # Install Python dependencies
# RUN pip install --no-cache-dir -r requirements.txt

# # Run prediction script by default
# ENTRYPOINT ["python", "predict.py"]
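
With the Dockerfile saved at the project root, building and running the image looks roughly like this (the tag model-predict is just an example name; arguments after the image name are passed to the ENTRYPOINT):

docker build -t model-predict .
docker run --rm model-predict --input models/sample.csv --model models/xgb_model.joblib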

Back to the top


🧪 Testing the Packaged Model¶

🧪 Unit Test the CLI¶

Why Test the Interface?¶

Even if your model is correct, your CLI wrapper might break due to:

  • Bad input parsing
  • File path issues
  • Incorrect model loading
  • Missing dependencies

Write a basic test script that:

  • Mocks command-line inputs
  • Confirms the script runs end-to-end without crashing
  • Optionally checks output format or types

These tests can be run with pytest, unittest, or even just shell scripts.

In [18]:
# test_predict.py (example structure)
import subprocess

def test_predict_runs():
    result = subprocess.run(
        ["python", "predict.py", "--input", "models/sample.csv", "--model", "models/xgb_model.joblib"],
        capture_output=True,
        text=True
    )
    assert result.returncode == 0
    assert "Predictions" in result.stdout

🧪 Sanity Check After Serialization¶

Quick Load → Predict Roundtrip¶

Before shipping the model anywhere, confirm that:

  1. The saved model file actually loads
  2. It accepts valid input without crashing
  3. The predictions look consistent with training

This should be part of your dev workflow before Dockerizing or deploying.

In [19]:
import joblib
import pandas as pd

model = joblib.load("models/xgb_model.joblib")
df = pd.read_csv("models/sample.csv")
print("Prediction output:", model.predict(df).tolist())
Prediction output: [1, 1, 1, 1, 1]

Back to the top