🎯 Goal of This Notebook¶
This notebook covers the final phase of the MLOps cycle: post-deployment monitoring and continuous integration (CI). Our focus is on:
- Capturing real-world model behavior in production
- Setting up automated tests and alerts
- Addressing model degradation due to data or concept drift
- Enabling rollback, hotfixes, and safe deployments
These practices ensure your deployed models remain trustworthy, maintainable, and auditable over time.
📈 Monitoring Predictions¶
📊 Log Real-Time Inputs + Outputs¶
To monitor live behavior, log the inputs received and the predictions returned by your model.
Logging should include:
- Timestamp of request
- Input features
- Model version used
- Prediction output
These logs can be pushed to:
- Cloud-native logging services (e.g., GCP Cloud Logging, AWS CloudWatch, Azure Monitor)
- File-based or database storage for post-hoc analysis
Avoid logging personally identifiable information (PII).
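As a rough sketch, structured input/output logging in a Python service might look like the following, using the standard `logging` module (the model version tag and field names are illustrative; swap the stream handler for your cloud logging handler):

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("prediction_audit")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())  # replace with a cloud log handler in production

MODEL_VERSION = "v1.3.0"  # hypothetical version tag

def log_prediction(features: dict, prediction, latency_ms: float) -> None:
    """Emit one structured JSON log line per inference call."""
    record = {
        "timestamp": time.time(),
        "request_id": str(uuid.uuid4()),
        "model_version": MODEL_VERSION,
        "features": features,          # assumes PII has already been stripped or hashed upstream
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    logger.info(json.dumps(record))
```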
🧾 Store Metadata for Each Call¶
In addition to prediction outputs, capture contextual metadata for each inference:
- Request ID or session ID
- User ID (if applicable)
- Inference latency
- Model version
- Upstream/downstream service info
Metadata is essential for:
- Debugging failed predictions
- Monitoring system health
- Investigating performance regressions
Cloud-friendly formats: JSON logs, structured logging, or schema-enforced event pipelines.
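One lightweight way to enforce a consistent metadata schema is a dataclass that serializes to JSON; the fields below mirror the list above and are assumptions to adapt to your own service:

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import Any, Optional
import json
import uuid

@dataclass
class InferenceEvent:
    """Schema-enforced metadata for a single inference call (fields are illustrative)."""
    model_version: str
    latency_ms: float
    prediction: Any
    user_id: Optional[str] = None            # hash or drop if this counts as PII
    upstream_service: Optional[str] = None
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self), default=str)

# One event per request, pushed to your structured logging pipeline
event = InferenceEvent(model_version="v1.3.0", latency_ms=12.4,
                       prediction=0.87, upstream_service="checkout-api")
print(event.to_json())
```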
🌊 Data + Concept Drift¶
📉 Data Drift Detection (e.g., PSI, KS tests)¶
Data drift occurs when the distribution of incoming features diverges from the training data.
Common drift detection techniques:
- PSI (Population Stability Index) for continuous variables
- KS test (Kolmogorov–Smirnov) to compare distributions
- Jensen–Shannon Divergence, Hellinger Distance, etc.
Where to use:
- Periodic batch jobs (e.g., daily PSI computation)
- Real-time windowed monitoring (e.g., 24-hour input window vs. training baseline)
Cloud tools: AWS SageMaker Model Monitor, GCP Vertex AI Model Monitoring, Azure ML data drift detection
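A minimal sketch of both checks for a single numeric feature, using NumPy and SciPy (the 0.25 PSI cutoff is a common rule of thumb, not a universal standard):

```python
import numpy as np
from scipy import stats

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `current` relative to the training `baseline`."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)   # avoid log(0) on empty bins
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)   # training baseline
live_feature = rng.normal(0.3, 1.1, 2_000)     # e.g., last 24 hours of traffic

print("PSI:", psi(train_feature, live_feature))            # ~0.25+ is often treated as major drift
print("KS :", stats.ks_2samp(train_feature, live_feature)) # statistic and p-value
```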
🧠 Concept Drift: Label Distribution Shifts¶
Concept drift refers to changes in the relationship between inputs and labels over time.
Symptoms include:
- Stable input distributions, but prediction accuracy drops
- Change in target label distributions (e.g., fraud patterns shift)
Detection methods:
- Performance degradation alerts (if labels available)
- Monitoring class balance over time
- Manual reviews or business feedback loops
Concept drift is harder to detect than data drift, especially in delayed-feedback or label-scarce environments.
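When labels arrive late, a cheap proxy signal is to compare the predicted class balance against the training prior; the sketch below assumes a binary task and an illustrative five-percentage-point threshold:

```python
import numpy as np

def class_balance_shift(train_labels, recent_predictions) -> float:
    """Absolute change in positive rate between training labels and recent predictions."""
    return abs(float(np.mean(recent_predictions)) - float(np.mean(train_labels)))

# Illustrative check: flag possible concept drift if the predicted positive rate
# moves more than 5 percentage points away from the training prior.
train_labels = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])
recent_predictions = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
if class_balance_shift(train_labels, recent_predictions) > 0.05:
    print("possible concept drift: predicted class balance has shifted")
```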
🧪 Model Health Signals¶
This section outlines key indicators to monitor your model’s performance in production.
Unlike offline metrics, production health signals help identify real-time issues such as:
- Unusual prediction patterns
- Degrading confidence
- Invalid input rates
- Post-deployment accuracy decay
These indicators help trigger alerts, re-training, or rollback workflows.
🧠 Prediction Confidence¶
Track the model’s prediction confidence over time to detect:
- Overconfident wrong predictions
- Increased uncertainty, which might signal drift or unfamiliar inputs
Common metrics:
- Mean confidence per class
- Distribution of prediction probabilities
- Entropy of output distribution
Helpful for early warning systems in classification tasks.
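For example, a small helper that summarizes a batch of predicted probabilities (shape `n_samples x n_classes`) into loggable confidence metrics:

```python
import numpy as np

def confidence_metrics(probs: np.ndarray) -> dict:
    """Summarize a batch of predicted probabilities with shape (n_samples, n_classes)."""
    top = probs.max(axis=1)                                        # confidence of the predicted class
    entropy = -(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=1)
    return {
        "mean_confidence": float(top.mean()),
        "p10_confidence": float(np.percentile(top, 10)),           # low tail flags uncertain inputs
        "mean_entropy": float(entropy.mean()),
    }

batch = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
print(confidence_metrics(batch))
```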
🧼 Percent Invalid Inputs / Errors¶
Monitor the percentage of requests that fail due to:
- Malformed or missing input fields
- Schema mismatches
- Data type errors
- Preprocessing failures
Helps detect upstream pipeline issues or external API shifts.
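A simplified sketch of schema checking and invalid-rate tracking; the required fields here are hypothetical and the type check is deliberately minimal (real services often use pydantic or JSON Schema):

```python
REQUIRED_FIELDS = {"age": (int, float), "income": (int, float), "country": str}  # hypothetical schema

def is_valid(payload: dict) -> bool:
    """Minimal presence and type check for a single request payload."""
    return all(
        name in payload and isinstance(payload[name], types)
        for name, types in REQUIRED_FIELDS.items()
    )

def invalid_rate(payloads) -> float:
    payloads = list(payloads)
    return sum(not is_valid(p) for p in payloads) / max(len(payloads), 1)

requests_seen = [{"age": 31, "income": 52000.0, "country": "DE"},
                 {"age": "unknown", "income": 40000.0}]            # malformed request
print(f"invalid input rate: {invalid_rate(requests_seen):.0%}")
```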
📉 Accuracy Decay Over Time¶
Track the model’s actual performance over time (where ground truth is available) and compare it against the offline baseline.
Key metrics:
- Accuracy, AUC, precision/recall trends
- Deviation from expected performance bounds
- Error rate spikes after code or data changes
Requires a feedback loop where true labels arrive with some delay.
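Assuming you join delayed labels back onto logged predictions, a weekly AUC trend can be computed like this (the column names are assumptions):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def weekly_auc(df: pd.DataFrame) -> pd.Series:
    """Weekly AUC from logged scores joined with delayed ground-truth labels.

    Expects columns: `timestamp` (datetime), `y_true`, `y_score`.
    """
    grouped = df.groupby(pd.Grouper(key="timestamp", freq="W"))
    return grouped.apply(
        lambda g: roc_auc_score(g["y_true"], g["y_score"]) if g["y_true"].nunique() > 1 else float("nan")
    )

# Alerting idea: compare each week's AUC against the offline validation AUC minus a tolerance.
```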
📤 Logging + Alerting Basics¶
🗂️ Log Aggregation Tools (e.g., ELK, Prometheus)¶
In production environments, raw logs from your ML service can be aggregated and queried using:
- ELK Stack (Elasticsearch, Logstash, Kibana): Powerful for full-text log search and dashboarding.
- Prometheus + Grafana: Time-series monitoring, often used with Kubernetes. Tracks metrics like latency, error rate, request volume.
- Fluentd + CloudWatch: Common in AWS stacks for unified logging.
These tools help centralize logs from multiple containers/services into a single searchable interface.
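For example, latency and request-count metrics can be exposed with the `prometheus_client` library so Prometheus/Grafana can scrape them; the model call below is a stand-in, since only the metric plumbing matters here:

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Predictions served", ["model_version"])
LATENCY = Histogram("prediction_latency_seconds", "Inference latency in seconds")

def predict(features):
    """Stand-in for the real model call, instrumented with Prometheus metrics."""
    with LATENCY.time():                                  # records latency into the histogram
        time.sleep(random.uniform(0.01, 0.05))            # simulate inference work
        PREDICTIONS.labels(model_version="v1.3.0").inc()
        return sum(features) > 0

if __name__ == "__main__":
    start_http_server(8000)   # metrics exposed at http://localhost:8000/metrics for scraping
    while True:
        predict([random.gauss(0, 1) for _ in range(5)])
```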
📬 Trigger Threshold-Based Alerts¶
Alerts notify teams when monitored signals breach thresholds. Example cases:
- Drop in prediction confidence below a set limit
- Spike in input errors
- Sustained accuracy drop from post-deployment validation
Popular tools:
- PagerDuty or Opsgenie for alert delivery
- Grafana alerting for visual thresholds
- Cloud-native alerts: CloudWatch Alarms (AWS), Google Cloud Monitoring
Good practice: start with low-noise alerts and tune thresholds iteratively.
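A very simple threshold check pushed to a Slack incoming webhook might look like this (the webhook URL and confidence floor are placeholders; managed alerting tools are usually preferable for paging):

```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder incoming-webhook URL
CONFIDENCE_FLOOR = 0.60                                          # illustrative threshold -- tune iteratively

def check_confidence_alert(mean_confidence: float) -> None:
    """Post to Slack when mean prediction confidence drops below the floor."""
    if mean_confidence < CONFIDENCE_FLOOR:
        requests.post(
            SLACK_WEBHOOK,
            json={"text": f":warning: Mean prediction confidence {mean_confidence:.2f} "
                          f"is below the {CONFIDENCE_FLOOR} floor"},
            timeout=5,
        )

check_confidence_alert(0.52)
```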
⚙️ CI for Model Code¶
🧪 Run Tests on Every Commit¶
CI (Continuous Integration) ensures that every code commit triggers a test suite, reducing regressions.
- Use tools like GitHub Actions, GitLab CI, or CircleCI to automate test execution.
- Include unit tests for preprocessing, model logic, and utility functions.
- Example trigger: `on: [push, pull_request]` in a `.github/workflows/ci.yml` file.
This protects against breaking the training/inference flow inadvertently.
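A unit test picked up by that trigger could be as small as the following; the scaler here is a toy stand-in for your real preprocessing code:

```python
# tests/test_preprocessing.py
import numpy as np
import pytest

def scale_features(x: np.ndarray) -> np.ndarray:
    """Standardize a 1-D feature array to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

def test_scale_features_is_zero_mean_unit_std():
    x = np.array([1.0, 2.0, 3.0, 4.0])
    scaled = scale_features(x)
    assert scaled.mean() == pytest.approx(0.0)
    assert scaled.std() == pytest.approx(1.0)
```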
📦 Build + Lint + Validate Artifacts¶
Along with tests, your CI should also:
- Build model packaging scripts or Docker images
- Run linters like `flake8`, `black`, or `pylint` for clean code
- Validate metadata: schema, versioning, file paths
- Check serialized model integrity (e.g., load with `joblib`)
These steps ensure consistent, production-ready deliverables from every push.
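For the integrity check, a small CI script along these lines can fail the build early if the serialized model is missing or broken (the artifact path and the scikit-learn-style `predict` call are assumptions):

```python
# scripts/validate_artifact.py -- fail the CI job early if the serialized model is unusable
import sys
import joblib
import numpy as np

def main(path: str = "artifacts/model.joblib") -> int:
    model = joblib.load(path)                                  # raises if the file is absent or corrupted
    n_features = getattr(model, "n_features_in_", 1)           # scikit-learn-style attribute (assumption)
    model.predict(np.zeros((1, n_features)))                   # smoke test with one well-formed row
    print("artifact OK:", type(model).__name__)
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```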
🔁 GitHub Actions Example¶
GitHub Actions is a popular CI/CD tool natively integrated into GitHub.
It allows you to automate steps like testing, linting, and deployment by writing simple YAML files inside `.github/workflows/`.
For ML projects, common use cases include:
- Running unit tests when code is pushed
- Validating model artifact integrity
- Notifying Slack/email if tests fail
- Automating Docker image builds or cloud deployments
You can trigger actions on events like `push`, `pull_request`, or `schedule` (for cron jobs).
🧾 YAML Setup for Python Project¶
To automate tests and packaging, add a `.github/workflows/ci.yml` file.
Key sections:
- `name`: Name of the workflow
- `on`: Trigger events like `push` or `pull_request`
- `jobs`: Define one or more jobs (e.g., testing, linting)
- `runs-on`: OS environment (e.g., `ubuntu-latest`)
- `steps`: Checkout code, set up Python, install deps, run commands
This sets the foundation for reproducible CI.
🧪 Testing & Notification Workflow¶
A complete workflow includes:
- Checkout code using `actions/checkout`
- Set up Python with `actions/setup-python`
- Install dependencies (`pip install -r requirements.txt`)
- Run unit tests with `pytest`
- Lint code with `flake8`, `black`, or `pylint`
- Optional: Send notifications via Slack or email on failure
This ensures high confidence in every commit to the repo.
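Putting those steps together, a minimal `ci.yml` might look like the sketch below; the action versions, paths, and Python version are assumptions to adapt to your project:

```yaml
# .github/workflows/ci.yml -- minimal sketch
name: ci

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Lint
        run: flake8 src tests
      - name: Run unit tests
        run: pytest -q
```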
🪛 Rollback & Hotfix Strategy¶
🚨 When Model Goes Bad¶
Despite best efforts, model failures can occur due to:
- Data drift or concept drift
- Upstream pipeline changes
- Hidden bugs in code
- Latency spikes or unexpected inputs
It's crucial to detect failures early and have a defined rollback protocol.
🔁 Safe Fallback + Version Control¶
Strategies to reduce blast radius:
- Always keep a known-good model version registered (e.g., “Production” stage in MLflow)
- Version all model artifacts and configs in a registry or Git
- Implement a fallback rule (e.g., revert to previous model if accuracy drops below threshold)
- Rollback should be automated via CI/CD rather than performed manually
Versioning + automation = safer hotfixes.
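As one possible automation, a rollback step using the MLflow model registry could promote a known-good version back to Production (the model name and version number are hypothetical):

```python
# Rollback sketch using the MLflow model registry
from mlflow.tracking import MlflowClient

client = MlflowClient()
MODEL_NAME = "churn-classifier"   # hypothetical registered model

def rollback(previous_version: int) -> None:
    """Promote a known-good registered version back to the Production stage."""
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=previous_version,
        stage="Production",
        archive_existing_versions=True,   # demotes the misbehaving current Production model
    )

# Typically triggered from a CI/CD job when a post-deployment metric breaches its threshold:
# rollback(previous_version=7)
```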