🎯 Goal of This Notebook¶
This notebook covers the final phase of the MLOps cycle: post-deployment monitoring and continuous integration (CI). Our focus is on:
- Capturing real-world model behavior in production
- Setting up automated tests and alerts
- Addressing model degradation due to data or concept drift
- Enabling rollback, hotfixes, and safe deployments
These practices ensure your deployed models remain trustworthy, maintainable, and auditable over time.
📈 Monitoring Predictions¶
📊 Log Real-Time Inputs + Outputs¶
To monitor live behavior, log the inputs received and the predictions returned by your model.
Logging should include:
- Timestamp of request
- Input features
- Model version used
- Prediction output
These logs can be pushed to:
- Cloud-native logging services (e.g., GCP Cloud Logging, AWS CloudWatch, Azure Monitor)
- File-based or database storage for post-hoc analysis
Avoid logging personally identifiable information (PII).
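As a rough sketch, structured input/output logging in a Python service might look like the following, using the standard `logging` module (the model version tag and field names are illustrative; swap the stream handler for your cloud logging handler):

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("prediction_audit")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())  # replace with a cloud log handler in production

MODEL_VERSION = "v1.3.0"  # hypothetical version tag

def log_prediction(features: dict, prediction, latency_ms: float) -> None:
    """Emit one structured JSON log line per inference call."""
    record = {
        "timestamp": time.time(),
        "request_id": str(uuid.uuid4()),
        "model_version": MODEL_VERSION,
        "features": features,          # assumes PII has already been stripped or hashed upstream
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    logger.info(json.dumps(record))
```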
🧾 Store Metadata for Each Call¶
In addition to prediction outputs, capture contextual metadata for each inference:
- Request ID or session ID
- User ID (if applicable)
- Inference latency
- Model version
- Upstream/downstream service info
Metadata is essential for:
- Debugging failed predictions
- Monitoring system health
- Investigating performance regressions
Cloud-friendly formats: JSON logs, structured logging, or schema-enforced event pipelines.
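One lightweight way to enforce a consistent metadata schema is a dataclass that serializes to JSON; the fields below mirror the list above and are assumptions to adapt to your own service:

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import Any, Optional
import json
import uuid

@dataclass
class InferenceEvent:
    """Schema-enforced metadata for a single inference call (fields are illustrative)."""
    model_version: str
    latency_ms: float
    prediction: Any
    user_id: Optional[str] = None            # hash or drop if this counts as PII
    upstream_service: Optional[str] = None
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self), default=str)

# One event per request, pushed to your structured logging pipeline
event = InferenceEvent(model_version="v1.3.0", latency_ms=12.4,
                       prediction=0.87, upstream_service="checkout-api")
print(event.to_json())
```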
🌊 Data + Concept Drift¶
📉 Data Drift Detection (e.g., PSI, KS tests)¶
Data drift occurs when the distribution of incoming features diverges from the training data.
Common drift detection techniques:
- PSI (Population Stability Index) for continuous variables
- KS test (Kolmogorov–Smirnov) to compare distributions
- Jensen–Shannon Divergence, Hellinger Distance, etc.
Where to use:
- Periodic batch jobs (e.g., daily PSI computation)
- Real-time windowed monitoring (e.g., 24-hour input window vs. training baseline)
Cloud tools: AWS SageMaker Model Monitor, GCP Vertex AI Model Monitoring, Azure ML data drift detection
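A minimal sketch of both checks for a single numeric feature, using NumPy and SciPy (the 0.25 PSI cutoff is a common rule of thumb, not a universal standard):

```python
import numpy as np
from scipy import stats

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `current` relative to the training `baseline`."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)   # avoid log(0) on empty bins
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)   # training baseline
live_feature = rng.normal(0.3, 1.1, 2_000)     # e.g., last 24 hours of traffic

print("PSI:", psi(train_feature, live_feature))            # ~0.25+ is often treated as major drift
print("KS :", stats.ks_2samp(train_feature, live_feature)) # statistic and p-value
```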
🧠 Concept Drift: Label Distribution Shifts¶
Concept drift refers to changes in the relationship between inputs and labels over time.
Symptoms include:
- Stable input distributions, but prediction accuracy drops
- Change in target label distributions (e.g., fraud patterns shift)
Detection methods:
- Performance degradation alerts (if labels available)
- Monitoring class balance over time
- Manual reviews or business feedback loops
Concept drift is harder to detect than data drift, especially in delayed-feedback or label-scarce environments.
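When labels arrive late, a cheap proxy signal is to compare the predicted class balance against the training prior; the sketch below assumes a binary task and an illustrative five-percentage-point threshold:

```python
import numpy as np

def class_balance_shift(train_labels, recent_predictions) -> float:
    """Absolute change in positive rate between training labels and recent predictions."""
    return abs(float(np.mean(recent_predictions)) - float(np.mean(train_labels)))

# Illustrative check: flag possible concept drift if the predicted positive rate
# moves more than 5 percentage points away from the training prior.
train_labels = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])
recent_predictions = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
if class_balance_shift(train_labels, recent_predictions) > 0.05:
    print("possible concept drift: predicted class balance has shifted")
```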
🧪 Model Health Signals¶
This section outlines key indicators to monitor your model’s performance in production.
Unlike offline metrics, production health signals help identify real-time issues such as:
- Unusual prediction patterns
- Degrading confidence
- Invalid input rates
- Post-deployment accuracy decay
These indicators help trigger alerts, re-training, or rollback workflows.
🧠 Prediction Confidence¶
Track the model’s prediction confidence over time to detect:
- Overconfident wrong predictions
- Increased uncertainty, which might signal drift or unfamiliar inputs
Common metrics:
- Mean confidence per class
- Distribution of prediction probabilities
- Entropy of output distribution
Helpful for early warning systems in classification tasks.
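For example, a small helper that summarizes a batch of predicted probabilities (shape `n_samples x n_classes`) into loggable confidence metrics:

```python
import numpy as np

def confidence_metrics(probs: np.ndarray) -> dict:
    """Summarize a batch of predicted probabilities with shape (n_samples, n_classes)."""
    top = probs.max(axis=1)                                        # confidence of the predicted class
    entropy = -(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=1)
    return {
        "mean_confidence": float(top.mean()),
        "p10_confidence": float(np.percentile(top, 10)),           # low tail flags uncertain inputs
        "mean_entropy": float(entropy.mean()),
    }

batch = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
print(confidence_metrics(batch))
```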
🧼 Percent Invalid Inputs / Errors¶
Monitor the percentage of requests that fail due to:
- Malformed or missing input fields
- Schema mismatches
- Data type errors
- Preprocessing failures
Helps detect upstream pipeline issues or external API shifts.
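A simplified sketch of schema checking and invalid-rate tracking; the required fields here are hypothetical and the type check is deliberately minimal (real services often use pydantic or JSON Schema):

```python
REQUIRED_FIELDS = {"age": (int, float), "income": (int, float), "country": str}  # hypothetical schema

def is_valid(payload: dict) -> bool:
    """Minimal presence and type check for a single request payload."""
    return all(
        name in payload and isinstance(payload[name], types)
        for name, types in REQUIRED_FIELDS.items()
    )

def invalid_rate(payloads) -> float:
    payloads = list(payloads)
    return sum(not is_valid(p) for p in payloads) / max(len(payloads), 1)

requests_seen = [{"age": 31, "income": 52000.0, "country": "DE"},
                 {"age": "unknown", "income": 40000.0}]            # malformed request
print(f"invalid input rate: {invalid_rate(requests_seen):.0%}")
```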
📉 Accuracy Decay Over Time¶
Track the model’s actual performance over time (where ground truth is available) and compare it against the offline baseline.
Key metrics:
- Accuracy, AUC, precision/recall trends
- Deviation from expected performance bounds
- Error rate spikes after code or data changes
Requires a feedback loop where true labels arrive with some delay.
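Assuming you join delayed labels back onto logged predictions, a weekly AUC trend can be computed like this (the column names are assumptions):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def weekly_auc(df: pd.DataFrame) -> pd.Series:
    """Weekly AUC from logged scores joined with delayed ground-truth labels.

    Expects columns: `timestamp` (datetime), `y_true`, `y_score`.
    """
    grouped = df.groupby(pd.Grouper(key="timestamp", freq="W"))
    return grouped.apply(
        lambda g: roc_auc_score(g["y_true"], g["y_score"]) if g["y_true"].nunique() > 1 else float("nan")
    )

# Alerting idea: compare each week's AUC against the offline validation AUC minus a tolerance.
```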
📤 Logging + Alerting Basics¶
🗂️ Log Aggregation Tools (e.g., ELK, Prometheus)¶
In production environments, raw logs from your ML service can be aggregated and queried using:
- ELK Stack (Elasticsearch, Logstash, Kibana): Powerful for full-text log search and dashboarding.
- Prometheus + Grafana: Time-series monitoring, often used with Kubernetes. Tracks metrics like latency, error rate, request volume.
- Fluentd + CloudWatch: Common in AWS stacks for unified logging.
These tools help centralize logs from multiple containers/services into a single searchable interface.
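For example, latency and request-count metrics can be exposed with the `prometheus_client` library so Prometheus/Grafana can scrape them; the model call below is a stand-in, since only the metric plumbing matters here:

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Predictions served", ["model_version"])
LATENCY = Histogram("prediction_latency_seconds", "Inference latency in seconds")

def predict(features):
    """Stand-in for the real model call, instrumented with Prometheus metrics."""
    with LATENCY.time():                                  # records latency into the histogram
        time.sleep(random.uniform(0.01, 0.05))            # simulate inference work
        PREDICTIONS.labels(model_version="v1.3.0").inc()
        return sum(features) > 0

if __name__ == "__main__":
    start_http_server(8000)   # metrics exposed at http://localhost:8000/metrics for scraping
    while True:
        predict([random.gauss(0, 1) for _ in range(5)])
```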
📬 Trigger Threshold-Based Alerts¶
Alerts notify teams when monitored signals breach thresholds. Example cases:
- Drop in prediction confidence below a set limit
- Spike in input errors
- Sustained accuracy drop from post-deployment validation
Popular tools:
- PagerDuty or Opsgenie for alert delivery
- Grafana alerting for visual thresholds
- Cloud-native alerts: CloudWatch Alarms (AWS), Google Cloud Monitoring
Good practice: start with low-noise alerts and tune thresholds iteratively.
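A very simple threshold check pushed to a Slack incoming webhook might look like this (the webhook URL and confidence floor are placeholders; managed alerting tools are usually preferable for paging):

```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder incoming-webhook URL
CONFIDENCE_FLOOR = 0.60                                          # illustrative threshold -- tune iteratively

def check_confidence_alert(mean_confidence: float) -> None:
    """Post to Slack when mean prediction confidence drops below the floor."""
    if mean_confidence < CONFIDENCE_FLOOR:
        requests.post(
            SLACK_WEBHOOK,
            json={"text": f":warning: Mean prediction confidence {mean_confidence:.2f} "
                          f"is below the {CONFIDENCE_FLOOR} floor"},
            timeout=5,
        )

check_confidence_alert(0.52)
```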
⚙️ CI for Model Code¶
🧪 Run Tests on Every Commit¶
CI (Continuous Integration) ensures that every code commit triggers a test suite, reducing regressions.
- Use tools like GitHub Actions, GitLab CI, or CircleCI to automate test execution.
- Include unit tests for preprocessing, model logic, and utility functions.
- Example trigger: `on: [push, pull_request]` in a `.github/workflows/ci.yml` file.
This protects against breaking the training/inference flow inadvertently.
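A unit test picked up by that trigger could be as small as the following; the scaler here is a toy stand-in for your real preprocessing code:

```python
# tests/test_preprocessing.py
import numpy as np
import pytest

def scale_features(x: np.ndarray) -> np.ndarray:
    """Standardize a 1-D feature array to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

def test_scale_features_is_zero_mean_unit_std():
    x = np.array([1.0, 2.0, 3.0, 4.0])
    scaled = scale_features(x)
    assert scaled.mean() == pytest.approx(0.0)
    assert scaled.std() == pytest.approx(1.0)
```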
📦 Build + Lint + Validate Artifacts¶
Along with tests, your CI should also:
- Build model packaging scripts or Docker images
- Run linters like `flake8`, `black`, or `pylint` for clean code
- Validate metadata: schema, versioning, file paths
- Check serialized model integrity (e.g., load with `joblib`)
These steps ensure consistent, production-ready deliverables from every push.
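For the integrity check, a small CI script along these lines can fail the build early if the serialized model is missing or broken (the artifact path and the scikit-learn-style `predict` call are assumptions):

```python
# scripts/validate_artifact.py -- fail the CI job early if the serialized model is unusable
import sys
import joblib
import numpy as np

def main(path: str = "artifacts/model.joblib") -> int:
    model = joblib.load(path)                                  # raises if the file is absent or corrupted
    n_features = getattr(model, "n_features_in_", 1)           # scikit-learn-style attribute (assumption)
    model.predict(np.zeros((1, n_features)))                   # smoke test with one well-formed row
    print("artifact OK:", type(model).__name__)
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```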
🔁 GitHub Actions Example¶
GitHub Actions is a popular CI/CD tool natively integrated into GitHub.
It allows you to automate steps like testing, linting, and deployment by writing simple YAML files inside `.github/workflows/`.
For ML projects, common use cases include:
- Running unit tests when code is pushed
- Validating model artifact integrity
- Notifying Slack/email if tests fail
- Automating Docker image builds or cloud deployments
You can trigger actions on events like `push`, `pull_request`, or `schedule` (for cron jobs).
🧾 YAML Setup for Python Project¶
To automate tests and packaging, add a `.github/workflows/ci.yml` file.
Key sections:
- `name`: Name of the workflow
- `on`: Trigger events like `push` or `pull_request`
- `jobs`: Define one or more jobs (e.g., testing, linting)
- `runs-on`: OS environment (e.g., `ubuntu-latest`)
- `steps`: Checkout code, set up Python, install deps, run commands
This sets the foundation for reproducible CI.
🧪 Testing & Notification Workflow¶
A complete workflow includes:
- Checkout code using `actions/checkout`
- Set up Python with `actions/setup-python`
- Install dependencies (`pip install -r requirements.txt`)
- Run unit tests with `pytest`
- Lint code with `flake8`, `black`, or `pylint`
- Optional: Send notifications via Slack or email on failure
This ensures high confidence in every commit to the repo.
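Putting those steps together, a minimal `ci.yml` might look like the sketch below; the action versions, paths, and Python version are assumptions to adapt to your project:

```yaml
# .github/workflows/ci.yml -- minimal sketch
name: ci

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Lint
        run: flake8 src tests
      - name: Run unit tests
        run: pytest -q
```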
🪛 Rollback & Hotfix Strategy¶
🚨 When Model Goes Bad¶
Despite best efforts, model failures can occur due to:
- Data drift or concept drift
- Upstream pipeline changes
- Hidden bugs in code
- Latency spikes or unexpected inputs
It's crucial to detect failures early and have a defined rollback protocol.
🔁 Safe Fallback + Version Control¶
Strategies to reduce blast radius:
- Always keep a known-good model version registered (e.g., “Production” stage in MLflow)
- Version all model artifacts and configs in a registry or Git
- Implement a fallback rule (e.g., revert to previous model if accuracy drops below threshold)
- Rollback should be automated via CI/CD rather than performed manually
Versioning + automation = safer hotfixes.
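As one possible automation, a rollback step using the MLflow model registry could promote a known-good version back to Production (the model name and version number are hypothetical):

```python
# Rollback sketch using the MLflow model registry
from mlflow.tracking import MlflowClient

client = MlflowClient()
MODEL_NAME = "churn-classifier"   # hypothetical registered model

def rollback(previous_version: int) -> None:
    """Promote a known-good registered version back to the Production stage."""
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=previous_version,
        stage="Production",
        archive_existing_versions=True,   # demotes the misbehaving current Production model
    )

# Typically triggered from a CI/CD job when a post-deployment metric breaches its threshold:
# rollback(previous_version=7)
```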