Model Monitoring: Detecting Drift Before Disaster
Models Degrade Silently
A production model rarely fails catastrophically; it degrades gradually. The input data distribution shifts (data drift), the relationship between features and target evolves (concept drift), and predictions become less calibrated (prediction drift). Without active monitoring, these degradations go unnoticed until they show up in business metrics, and by then it's too late.
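Concept drift is the hardest of the three to catch from inputs alone. A minimal synthetic sketch (the data, the `model` stand-in, and the flipped labeling rule are all illustrative, not a real pipeline): the feature distribution P(X) stays identical, yet the relationship P(Y|X) flips, so an input-only monitor sees nothing while accuracy collapses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Baseline world: X ~ N(0, 1) and the true rule is y = 1 when x > 0.
x_train = rng.normal(0.0, 1.0, 10_000)

def model(x):
    # Stand-in for a trained model: it learned the baseline rule y = (x > 0).
    return (x > 0).astype(int)

# Concept drift: P(X) is unchanged, but the true rule flips to y = (x < 0).
x_prod = rng.normal(0.0, 1.0, 10_000)
y_prod = (x_prod < 0).astype(int)

# An input-only monitor sees two indistinguishable N(0, 1) samples...
print(abs(x_train.mean() - x_prod.mean()))  # tiny: no data drift signal

# ...while accuracy has silently collapsed, which only labels can reveal.
acc = (model(x_prod) == y_prod).mean()
print(acc)  # near zero
```

This failure mode is exactly why input-distribution monitors alone are insufficient: detecting it requires (possibly delayed) production labels.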
Three Types of Drift and How to Measure Them
PSI (Population Stability Index) compares a feature's distribution between the baseline (training) sample and production. A common rule of thumb: PSI < 0.1 means the distribution is stable, 0.1-0.2 indicates moderate shift, and > 0.2 indicates significant drift. The KS (Kolmogorov-Smirnov) test is more sensitive for continuous distributions. KL divergence measures the information difference between the two distributions. In practice, we combine all three: PSI for quick alerts, the KS test for statistical validation, and KL divergence to quantify the magnitude of the change.
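The three measures above can be sketched on a single numeric feature. A minimal illustration with synthetic data; the quantile-binning scheme and the 10-bin default are assumptions, not a prescribed setup:

```python
import numpy as np
from scipy.stats import ks_2samp, entropy

def binned_fractions(baseline, production, bins=10, eps=1e-6):
    """Histogram both samples on quantile bins derived from the baseline."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    expected = np.histogram(baseline, edges)[0] / len(baseline)
    actual = np.histogram(production, edges)[0] / len(production)
    # Clip away zero bins so the log terms below stay finite.
    return np.clip(expected, eps, None), np.clip(actual, eps, None)

def psi(baseline, production, bins=10):
    """Population Stability Index; > 0.2 is commonly read as significant drift."""
    expected, actual = binned_fractions(baseline, production, bins)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)  # baseline (training) sample
prod = rng.normal(0.5, 1.0, 10_000)   # production sample with a mean shift

print(psi(train, prod))                     # PSI for the quick alert
print(ks_2samp(train, prod).pvalue < 0.01)  # KS test for statistical validation
exp, act = binned_fractions(train, prod)
print(float(entropy(act, exp)))             # KL divergence quantifies magnitude
```

In a real monitor this runs per feature on a schedule, with the baseline histogram frozen at training time rather than recomputed.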
Alerts, Shadow Scoring, and Auto-Retraining
When drift exceeds a threshold, the system should respond as a cascade: (1) alert the team, (2) activate shadow scoring with a candidate model, (3) if the candidate outperforms the production model on business metrics, trigger the retraining pipeline, (4) validate with evaluation gates, (5) canary deploy. All of it automated, with a human in the loop only for critical decisions.
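The cascade above can be sketched as a small orchestration function. Every name here (`alert`, `shadow_score`, `retrain`, `passes_eval_gates`, `canary_deploy`) is a hypothetical stub standing in for a real service, not an actual framework:

```python
from dataclasses import dataclass

@dataclass
class ShadowResult:
    business_metric: float

# Illustrative stubs only; in production these would call real systems.
def alert(msg): print(f"ALERT: {msg}")
def shadow_score(): return ShadowResult(business_metric=0.87)
def production_metric(): return 0.84
def retrain(): return "candidate-v2"
def passes_eval_gates(model): return True
def canary_deploy(model): return f"canary:{model}"

def on_drift(psi_value, threshold=0.2):
    """Run the cascade only when drift crosses the alert threshold."""
    if psi_value <= threshold:
        return "no_action"
    alert("feature drift above threshold")            # (1) notify the team
    shadow = shadow_score()                           # (2) shadow-score a candidate
    if shadow.business_metric <= production_metric():
        return "hold_for_review"                      # human-in-the-loop decision
    model = retrain()                                 # (3) retraining pipeline
    if not passes_eval_gates(model):                  # (4) evaluation gates
        return "blocked_by_gates"
    return canary_deploy(model)                       # (5) canary rollout

print(on_drift(0.05))  # below threshold: nothing happens
print(on_drift(0.31))  # full cascade runs through to canary
```

Each early return is a point where the automation stops and a human can intervene, which is what keeps drift-triggered retraining from being a blind loop.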
Key Takeaways
- Data drift (P(X)) is detectable without labels and should be monitored with PSI, KS test, and KL divergence.
- Concept drift (P(Y|X)) requires labels and is the most dangerous: the model fails silently.
- Auto-retraining must be gated by evaluation suites, not blindly triggered by drift.
