
📖 Monitoring & CI¶

🎯 Goal of This Notebook
📈 Monitoring Predictions
🌊 Data + Concept Drift
🧪 Model Health Signals
📤 Logging + Alerting Basics
⚙️ CI for Model Code
🔁 GitHub Actions Example
🪛 Rollback & Hotfix Strategy


🎯 Goal of This Notebook¶

This notebook covers the final phase of the MLOps cycle: post-deployment monitoring and continuous integration (CI). Our focus is on:

  • Capturing real-world model behavior in production
  • Setting up automated tests and alerts
  • Addressing model degradation due to data or concept drift
  • Enabling rollback, hotfixes, and safe deployments

These practices ensure your deployed models remain trustworthy, maintainable, and auditable over time.

Back to the top


📈 Monitoring Predictions¶

📊 Log Real-Time Inputs + Outputs¶

To monitor live behavior, log the inputs received and the predictions returned by your model.

Logging should include:

  • Timestamp of request
  • Input features
  • Model version used
  • Prediction output

These logs can be pushed to:

  • Cloud-native logging services (e.g., GCP Cloud Logging, AWS CloudWatch, Azure Monitor)
  • File-based or database storage for post-hoc analysis

Avoid logging personally identifiable information (PII).
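
A minimal sketch of structured prediction logging with Python's standard `logging` module; the field names and the `MODEL_VERSION` constant are illustrative assumptions, not a prescribed format:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("prediction_log")
logging.basicConfig(level=logging.INFO)

MODEL_VERSION = "1.4.2"  # illustrative version string

def log_prediction(features: dict, prediction) -> None:
    """Emit one structured JSON record per inference call."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": MODEL_VERSION,
        "features": features,  # assumes PII has already been removed or masked
        "prediction": prediction,
    }
    logger.info(json.dumps(record))
```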

🧾 Store Metadata for Each Call¶

In addition to prediction outputs, capture contextual metadata for each inference:

  • Request ID or session ID
  • User ID (if applicable)
  • Inference latency
  • Model version
  • Upstream/downstream service info

Metadata is essential for:

  • Debugging failed predictions
  • Monitoring system health
  • Investigating performance regressions

Cloud-friendly formats: JSON logs, structured logging, or schema-enforced event pipelines.
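
One way to enforce a consistent per-call metadata schema is a small dataclass serialized to JSON; the fields below mirror the list above and are purely illustrative:

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class InferenceMetadata:
    request_id: str
    model_version: str
    latency_ms: float
    user_id: Optional[str] = None
    upstream_service: Optional[str] = None

meta = InferenceMetadata(
    request_id=str(uuid.uuid4()),
    model_version="1.4.2",
    latency_ms=12.7,
)
print(json.dumps(asdict(meta)))  # ship this record alongside the prediction log
```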

Back to the top


🌊 Data + Concept Drift¶

📉 Data Drift Detection (e.g., PSI, KS tests)¶

Data drift occurs when the distribution of incoming features diverges from the training data.

Common drift detection techniques:

  • PSI (Population Stability Index) for continuous variables
  • KS test (Kolmogorov–Smirnov) to compare distributions
  • Jensen–Shannon Divergence, Hellinger Distance, etc.

Where to use:

  • Periodic batch jobs (e.g., daily PSI computation)
  • Real-time windowed monitoring (e.g., 24-hour input window vs. training baseline)

Cloud tools: AWS SageMaker Model Monitor, GCP Vertex AI Model Monitoring, Azure Machine Learning data drift detection
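
A sketch of a batch drift check using NumPy and SciPy; the quantile binning strategy and the commonly cited 0.2 PSI alert threshold are conventions, not fixed rules:

```python
import numpy as np
from scipy import stats

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a training baseline and live inputs."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Clip live values into the baseline range so outliers land in the edge bins
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    expected_pct = np.clip(expected_pct, 1e-6, None)  # avoid log(0) and division by zero
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

baseline = np.random.normal(0, 1, 10_000)  # stand-in for a training feature
live = np.random.normal(0.3, 1, 2_000)     # stand-in for the last 24 hours of inputs

print("PSI:", psi(baseline, live))                           # > 0.2 is often treated as meaningful drift
print("KS p-value:", stats.ks_2samp(baseline, live).pvalue)  # small p-value => distributions differ
```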

🧠 Concept Drift: Label Distribution Shifts¶

Concept drift refers to changes in the relationship between inputs and labels over time.

Symptoms include:

  • Stable input distributions, but prediction accuracy drops
  • Change in target label distributions (e.g., fraud patterns shift)

Detection methods:

  • Performance degradation alerts (if labels available)
  • Monitoring class balance over time
  • Manual reviews or business feedback loops

Harder to detect than data drift, especially in delayed-feedback or label-scarce environments.
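
When labels are delayed, one hedged proxy is to compare the predicted class balance in a recent window against the training-time balance, for example with a chi-square goodness-of-fit test; the window size, class rates, and significance level below are illustrative:

```python
import numpy as np
from scipy import stats

train_class_rates = np.array([0.97, 0.03])  # e.g., non-fraud vs. fraud share at training time
recent_preds = np.random.choice([0, 1], size=5_000, p=[0.93, 0.07])  # stand-in for live predictions

observed = np.bincount(recent_preds, minlength=2)
expected = train_class_rates * len(recent_preds)

chi2, p_value = stats.chisquare(observed, expected)
if p_value < 0.01:
    print(f"Predicted class balance shifted (p={p_value:.4f}) - possible drift, investigate")
```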

Back to the top


🧪 Model Health Signals¶

This section outlines key indicators to monitor your model’s performance in production.

Unlike offline metrics, production health signals help identify real-time issues such as:

  • Unusual prediction patterns
  • Degrading confidence
  • Invalid input rates
  • Post-deployment accuracy decay

These indicators help trigger alerts, re-training, or rollback workflows.

🧠 Prediction Confidence¶

Track the model’s prediction confidence over time to detect:

  • Overconfident wrong predictions
  • Increased uncertainty, which might signal drift or unfamiliar inputs

Common metrics:

  • Mean confidence per class
  • Distribution of prediction probabilities
  • Entropy of output distribution

Helpful for early warning systems in classification tasks.
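
A sketch that computes these confidence signals from a batch of predicted probabilities; the array shape assumes a standard `predict_proba`-style output and the 0.6 cutoff is an illustrative choice:

```python
import numpy as np

def confidence_signals(proba: np.ndarray) -> dict:
    """proba: array of shape (n_samples, n_classes) with predicted probabilities."""
    top_confidence = proba.max(axis=1)
    entropy = -np.sum(proba * np.log(np.clip(proba, 1e-12, None)), axis=1)
    return {
        "mean_confidence": float(top_confidence.mean()),
        "mean_entropy": float(entropy.mean()),
        "low_confidence_rate": float((top_confidence < 0.6).mean()),  # illustrative cutoff
    }

batch = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
print(confidence_signals(batch))
```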

🧼 Percent Invalid Inputs / Errors¶

Monitor the percentage of requests that fail due to:

  • Malformed or missing input fields
  • Schema mismatches
  • Data type errors
  • Preprocessing failures

Helps detect upstream pipeline issues or external API shifts.
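
A minimal validation sketch that tracks the invalid-input rate; the required fields and type checks stand in for your real request schema:

```python
REQUIRED_FIELDS = {"age": (int, float), "income": (int, float), "country": str}  # illustrative schema

def is_valid(payload: dict) -> bool:
    return all(
        field in payload and isinstance(payload[field], expected_type)
        for field, expected_type in REQUIRED_FIELDS.items()
    )

requests = [
    {"age": 42, "income": 55_000.0, "country": "DE"},
    {"age": "forty-two", "income": 55_000.0, "country": "DE"},  # type error
    {"income": 30_000.0, "country": "FR"},                      # missing field
]

invalid_rate = sum(not is_valid(r) for r in requests) / len(requests)
print(f"Invalid input rate: {invalid_rate:.1%}")  # alert if this exceeds your usual baseline
```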

📉 Accuracy Decay Over Time¶

Track the model’s actual performance over time (when ground truth becomes available) and compare it against your offline validation baseline.

Key metrics:

  • Accuracy, AUC, precision/recall trends
  • Deviation from expected performance bounds
  • Error rate spikes after code or data changes

Requires a feedback loop where true labels arrive with some delay.
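
Once delayed labels arrive, a simple sketch is to score each feedback batch and compare it against a lower bound taken from offline validation; the bound and metric choice here are assumptions:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

EXPECTED_ACCURACY_LOWER_BOUND = 0.85  # assumed from offline validation

def check_accuracy_decay(y_true, y_pred, y_score=None) -> None:
    acc = accuracy_score(y_true, y_pred)
    message = f"batch accuracy={acc:.3f}"
    if y_score is not None:
        message += f", AUC={roc_auc_score(y_true, y_score):.3f}"
    print(message)
    if acc < EXPECTED_ACCURACY_LOWER_BOUND:
        print("Accuracy below expected bound - investigate drift or trigger retraining")

check_accuracy_decay([0, 1, 1, 0], [0, 1, 0, 0], y_score=[0.2, 0.8, 0.4, 0.3])
```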

Back to the top


📤 Logging + Alerting Basics¶

🗂️ Log Aggregation Tools (e.g., ELK, Prometheus)¶

In production environments, raw logs from your ML service can be aggregated and queried using:

  • ELK Stack (Elasticsearch, Logstash, Kibana): Powerful for full-text log search and dashboarding.
  • Prometheus + Grafana: Time-series monitoring, often used with Kubernetes. Tracks metrics like latency, error rate, request volume.
  • Fluentd + CloudWatch: Common in AWS stacks for unified logging.

These tools help centralize logs from multiple containers/services into a single searchable interface.
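
A sketch of exposing service metrics from a Python inference service with the `prometheus_client` library, which Prometheus can scrape and Grafana can visualize; the metric names and the simulated inference step are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Number of prediction requests served")
ERRORS = Counter("prediction_errors_total", "Number of failed prediction requests")
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency in seconds")

def handle_request() -> None:
    with LATENCY.time():          # records request duration in the histogram
        PREDICTIONS.inc()
        try:
            time.sleep(random.uniform(0.01, 0.05))  # stand-in for model inference
        except Exception:
            ERRORS.inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)       # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request()
```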

📬 Trigger Threshold-Based Alerts¶

Alerts notify teams when monitored signals breach thresholds. Example cases:

  • Drop in prediction confidence below a set limit
  • Spike in input errors
  • Sustained accuracy drop from post-deployment validation

Popular tools:

  • PagerDuty or Opsgenie for alert delivery
  • Grafana alerting for visual thresholds
  • Cloud-native alerts: CloudWatch Alarms (AWS), Google Cloud Monitoring

Good practice: start with low-noise alerts and tune thresholds iteratively.
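
A hedged sketch of a scheduled threshold check; in practice the notification would be routed through one of the tools above rather than a log message, and the thresholds are placeholders:

```python
import logging

logger = logging.getLogger("alerts")
logging.basicConfig(level=logging.WARNING)

CONFIDENCE_FLOOR = 0.70      # illustrative thresholds
ERROR_RATE_CEILING = 0.05

def evaluate_alerts(mean_confidence: float, error_rate: float) -> None:
    if mean_confidence < CONFIDENCE_FLOOR:
        logger.warning("Mean prediction confidence %.2f fell below %.2f", mean_confidence, CONFIDENCE_FLOOR)
    if error_rate > ERROR_RATE_CEILING:
        logger.warning("Input error rate %.1f%% exceeded %.1f%%", 100 * error_rate, 100 * ERROR_RATE_CEILING)

evaluate_alerts(mean_confidence=0.64, error_rate=0.08)
```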

Back to the top


⚙️ CI for Model Code¶

🧪 Run Tests on Every Commit¶

CI (Continuous Integration) ensures that every code commit triggers a test suite, reducing regressions.

  • Use tools like GitHub Actions, GitLab CI, or CircleCI to automate test execution.
  • Include unit tests for preprocessing, model logic, and utility functions.
  • Example trigger: on: [push, pull_request] in a .github/workflows/ci.yml.

This protects against inadvertently breaking the training or inference flow.
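
A hedged example of the kind of unit test CI runs on every commit; `scale_features` is a hypothetical preprocessing helper standing in for your own code (normally it would be imported from your package):

```python
# tests/test_preprocessing.py
import numpy as np
import pytest

def scale_features(x: np.ndarray) -> np.ndarray:
    """Hypothetical preprocessing helper: min-max scale to [0, 1]."""
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("Cannot scale a constant feature")
    return (x - x.min()) / span

def test_scale_features_range():
    scaled = scale_features(np.array([1.0, 5.0, 9.0]))
    assert scaled.min() == 0.0 and scaled.max() == 1.0

def test_scale_features_rejects_constant_input():
    with pytest.raises(ValueError):
        scale_features(np.array([3.0, 3.0, 3.0]))
```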

📦 Build + Lint + Validate Artifacts¶

Along with tests, your CI should also:

  • Build model packaging scripts or Docker images
  • Run linters and formatters such as flake8, pylint, or black to keep code clean
  • Validate metadata: schema, versioning, file paths
  • Check serialized model integrity (e.g., load with joblib)

These steps ensure consistent, production-ready deliverables from every push.
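
A sketch of a serialized-model integrity check that CI could run after the build step; the artifact path and expected feature count are assumptions about your project layout:

```python
# scripts/validate_artifact.py
import sys

import joblib
import numpy as np

MODEL_PATH = "artifacts/model.joblib"  # assumed artifact location
EXPECTED_N_FEATURES = 4                # assumed input width

def main() -> int:
    model = joblib.load(MODEL_PATH)            # fails loudly if the file is missing or corrupt
    sample = np.zeros((1, EXPECTED_N_FEATURES))
    prediction = model.predict(sample)         # smoke-test the inference path
    print(f"Artifact OK, sample prediction: {prediction}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```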

Back to the top


🔁 GitHub Actions Example¶

GitHub Actions is a popular CI/CD tool natively integrated into GitHub.

It allows you to automate steps like testing, linting, and deployment by writing simple YAML files inside .github/workflows/.

For ML projects, common use cases include:

  • Running unit tests when code is pushed
  • Validating model artifact integrity
  • Notifying Slack/email if tests fail
  • Automating Docker image builds or cloud deployments

You can trigger actions on events like push, pull_request, or schedule (for cron jobs).

🧾 YAML Setup for Python Project¶

To automate tests and packaging, add a .github/workflows/ci.yml file.

Key sections:

  • name: Name of the workflow
  • on: Trigger events like push or pull_request
  • jobs: Define one or more jobs (e.g., testing, linting)
  • runs-on: OS environment (e.g., ubuntu-latest)
  • steps: Checkout code, set up Python, install deps, run commands

This sets the foundation for reproducible CI.
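
A minimal `ci.yml` sketch wiring these sections together; the Python version, action versions, and lint step are illustrative choices, and the individual steps are described in the next subsection:

```yaml
name: ci

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt pytest flake8  # test/lint tools alongside project deps
      - name: Lint
        run: flake8 .
      - name: Run tests
        run: pytest
```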

🧪 Testing & Notification Workflow¶

A complete workflow includes:

  • Checkout code using actions/checkout
  • Set up Python with actions/setup-python
  • Install dependencies (pip install -r requirements.txt)
  • Run unit tests with pytest
  • Lint and format-check code with flake8, pylint, or black
  • Optional: Send notifications via Slack or email on failure

This gives the team confidence in every commit to the repo.

Back to the top


🪛 Rollback & Hotfix Strategy¶

🚨 When Model Goes Bad¶

Despite best efforts, model failures can occur due to:

  • Data drift or concept drift
  • Upstream pipeline changes
  • Hidden bugs in code
  • Latency spikes or unexpected inputs

It's crucial to detect failures early and have a defined rollback protocol.

🔁 Safe Fallback + Version Control¶

Strategies to reduce blast radius:

  • Always keep a known-good model version registered (e.g., “Production” stage in MLflow)
  • Version all model artifacts and configs in a registry or Git
  • Implement a fallback rule (e.g., revert to previous model if accuracy drops below threshold)
  • Automate rollback via CI/CD rather than performing it manually

Versioning + automation = safer hotfixes.
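
A hedged sketch of an automated rollback step against the MLflow model registry; the registered model name and the use of registry stages are assumptions about your setup:

```python
from mlflow.tracking import MlflowClient

MODEL_NAME = "churn-classifier"  # assumed registered model name
client = MlflowClient()

def rollback_to_previous_version(current_version: int) -> None:
    """Archive the bad version and re-promote the last known-good one."""
    client.transition_model_version_stage(MODEL_NAME, str(current_version), stage="Archived")
    candidates = [
        v for v in client.search_model_versions(f"name='{MODEL_NAME}'")
        if int(v.version) < current_version
    ]
    if not candidates:
        raise RuntimeError("No earlier model version available to roll back to")
    previous = max(candidates, key=lambda v: int(v.version))
    client.transition_model_version_stage(MODEL_NAME, previous.version, stage="Production")
    print(f"Rolled back {MODEL_NAME} to version {previous.version}")
```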

Back to the top