Waiting for AUC to nosedive is like noticing a flat tyre after the rim sparks. We’ll build a 30-minute drop-in drift-detection stack:
Copy-paste Docker Compose, Prometheus rules, and Grafana JSON dashboards included. Works with AWS SageMaker, Vertex AI, or on-prem KFServing.
Concept: Data drift = feature distribution shifts; concept drift = relationship between X and y changes.
Impact: credit denials, ad-spend waste, mis-diagnoses. FinTech client lost $70 k in one day when Covid-era outliers smashed its scoring model—alert fired 12 hours late.Goal: Detect drift within 5 minutes of pattern change, auto-route traffic to safe model, and start retraining.
pgsql
CopyEdit
Client → API Gateway
├── 95 % → Prod Model (v2)
└── 5 % → Shadow Model (v1)
└── Inference-Logger (Fluent Bit)
└─→ Loki
Prometheus ←— Drift-Exporter ←-─ Shadow & Prod logs
└─ PSI, JS div, latency
Grafana ←— Prometheus
Slack Alert ←– Prom Alert-manager
Prometheus + Alertmanager: threshold rules; Grafana dashboard for trending.
js
CopyEdit
function handler(event) {
const req = event.request;
const rand = Math.random();
if (rand < 0.05) { // 5 % shadow
req.headers[‘x-shadow’] = [{ key: ‘x-shadow’, value: ‘true’ }];
}
return req;
}
Downstream Nginx splits:
nginx
CopyEdit
map $http_x_shadow $shadow {
default 0;
“true” 1;
}
upstream prod { server model-v2:8080; }
upstream shadow { server model-v1:8080; }
server {
location /infer {
if ($shadow) {
proxy_pass http://shadow;
}
proxy_pass http://prod;
}
}
Latency overhead: < 2 ms because shadow response is fire-and-forget (async).
Dockerfile
dockerfile
CopyEdit
FROM golang:1.22-alpine
WORKDIR /app
COPY . .
RUN go build -o drift-exporter .
CMD [“./drift-exporter”]
Core logic:
go
CopyEdit
for req := range kafkaConsumer {
feats := extractFeatures(req)
prodVec.Observe(feats) // ring buffer 1 000
shadowVec.Observe(feats)
if prodVec.Len() == 1000 {
psi := calcPSI(prodVec, shadowVec)
js := calcJSDiv(prodVec, shadowVec)
promPSI.Set(psi)
promJS.Set(js)
prodVec.Reset(); shadowVec.Reset()
}
}
Exports psi and js_divergence gauges on /metrics.PSI calculation: 10 equal-width bins; alert when PSI > 0.2.
drift_alerts.yml
yaml
CopyEdit
groups:
– name: drift
rules:
– alert: FeaturePSIHigh
expr: psi > 0.2
for: 5m
labels:
severity: warning
annotations:
summary: “Feature drift PSI {{ $value }}”
description: |
PSI exceeded 0.2 for 5 minutes.
Check retraining pipeline.
– alert: LatencySpike
expr: histogram_quantile(0.95, sum(rate(inference_latency_bucket[5m])) by (le)) > 500
for: 3m
labels:
severity: critical
Alertmanager route → Slack #mlops.
Panels:
Full JSON available in repo grafana/drift_dashboard.json.
Alertmanager → Webhook Lambda:
python
CopyEdit
if alert[‘labels’][‘severity’] == ‘critical’:
ssm.put_parameter(Name=”/model/current”, Value=”v1″, Overwrite=True)
Nginx watches SSM Parameter Store via sidecar; flips traffic 100 % back to stable v1 within 90 s.
Rollback tests:
| Component | Extra Latency | Cost /1 M req |
| Shadow clone | +2 ms | +$0.15 (extra model calls) |
| Drift-Exporter CPU | N/A | $0.03 |
| Prom/Loki/Grafana | N/A | $0.08 |
| Total | ≈ +2 ms | $0.26 |
Less than 1 % of OpenAI token spend for same QPS.
Problem: purchase-fraud model mis-fired after festival sale; false positives ↑ 70 %.
Solution: implemented shadow v2 model + drift-exporter in 3 days.
Outcome:
| Metric | Before | After |
| Manual review queue | 18 k/day | 3 k/day |
| PSI alert time | N/A | 4 m 20 s |
| Revenue loss | $9.2 k | $1.1 k |
Saved $8 k/day during festive surge.
| Pitfall | Fix |
| Shadow model costs doubling tokens | Use log-only shadow for LLMs—skip generation. |
| Mixed tenant data corrupting drift signal | Compute PSI per tenant; Alert only if three tenants red. |
| Loki disk bloat | 30-day retention on shadow label; prod kept 90 days. |
| False positives on low-traffic night hours | for: 5m + condition psi > 0.2 and requests > 100. |
| Hashing PII before logs | Use SHA-256; comply with redaction guide from Post #3. |
| Sprint | Milestone |
| 1 | Deploy drift-exporter sidecar & expose Prom metrics |
| 2 | Clone 1 % shadow traffic; Grafana dashboard live |
| 3 | Slack alerts + manual rollback runbook |
| 4 | Auto-rollback via SSM or Terraform Cloud |
| 5 | Expand to multi-feature PSI & tenant segmentation |