December 20, 2025 / admin

TL;DR

Waiting for AUC to nosedive is like noticing a flat tyre after the rim sparks. We’ll build a 30-minute drop-in drift-detection stack:

  1. Synthetic shadow traffic – clones 1–5 % of production requests to a “shadow” model.
  2. Real-time drift probes – population-stability-index (PSI) & Jensen-Shannon divergence pushed to Prometheus.
  3. Loki + Grafana alerts – Slack ping when PSI > 0.2 for 5 min or latency spikes.

Copy-paste Docker Compose, Prometheus rules, and Grafana JSON dashboards included. Works with AWS SageMaker, Vertex AI, or on-prem KFServing.

Why Drift Kills Startups 

Concept: Data drift = feature distribution shifts; concept drift = relationship between X and y changes.
Impact: credit denials, ad-spend waste, mis-diagnoses. A FinTech client lost $70 k in one day when Covid-era outliers smashed its scoring model; the alert fired 12 hours late.
Goal: Detect drift within 5 minutes of a pattern change, auto-route traffic to a safe model, and kick off retraining.
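A toy illustration of the two failure modes, with synthetic numbers (not from the post): in data drift the feature histogram itself moves; in concept drift the features look the same but the label rule has changed under you.

```python
import random

random.seed(0)

# Data drift: the feature distribution itself moves.
train_x = [random.gauss(0, 1) for _ in range(1000)]  # training-time feature
live_x = [random.gauss(2, 1) for _ in range(1000)]   # production feature, shifted
mean_shift = abs(sum(live_x) / len(live_x) - sum(train_x) / len(train_x))

# Concept drift: X stays put, but the X-to-y rule flips.
old_labels = [int(v > 0) for v in train_x]  # yesterday: y = 1 when x > 0
new_labels = [int(v < 0) for v in train_x]  # today: y = 1 when x < 0
disagreement = sum(a != b for a, b in zip(old_labels, new_labels)) / len(train_x)
```

A distribution monitor catches the first case; only comparing predictions against fresh labels catches the second.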

Architecture at a Glance 

Client → API Gateway
          ├── 95 % → Prod Model (v2)
          └── 5 %  → Shadow Model (v1)
                       └── Inference-Logger (Fluent Bit)
                                 └─→ Loki

Prometheus ← Drift-Exporter ← Shadow & Prod logs
                └─ PSI, JS divergence, latency

Grafana ← Prometheus
Slack Alert ← Prometheus Alertmanager

  • Shadow traffic: the gateway copies the request after auth; with an embedding cache there is no double-billing for LLM tokens.
  • Drift-Exporter: Go sidecar computing PSI & JS divergence over a sliding window (1 000 requests).
  • Prometheus + Alertmanager: threshold rules; Grafana dashboard for trending.
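The Docker Compose promised in the TL;DR could look roughly like this. It is a sketch: the image tags, service names, ports, and mount paths are my assumptions, not the repo's actual file.

```yaml
version: "3.9"
services:
  model-v2:                       # prod model (v2)
    image: registry.local/model:v2
  model-v1:                       # shadow model (v1)
    image: registry.local/model:v1
  gateway:                        # nginx doing the 95/5 split + mirroring
    image: nginx:1.25
    volumes: ["./nginx.conf:/etc/nginx/nginx.conf:ro"]
    ports: ["80:80"]
  drift-exporter:                 # Go sidecar from section below
    build: ./drift-exporter
    ports: ["9100:9100"]          # /metrics scraped by Prometheus
  prometheus:
    image: prom/prometheus:v2.52.0
    volumes: ["./prometheus:/etc/prometheus:ro"]
    ports: ["9090:9090"]
  loki:
    image: grafana/loki:2.9.8
  grafana:
    image: grafana/grafana:10.4.2
    ports: ["3000:3000"]
```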

Cloning Requests (Synthetic Shadow 1–5 %) 

3.1 AWS Lambda@Edge Example

js

exports.handler = async (event) => {
  // Viewer-request trigger: tag ~5 % of requests for the shadow model
  const request = event.Records[0].cf.request;
  if (Math.random() < 0.05) {
    request.headers['x-shadow'] = [{ key: 'x-shadow', value: 'true' }];
  }
  return request;
};

Downstream Nginx splits:

nginx

map $http_x_shadow $shadow {
    default 0;
    "true"  1;
}

upstream prod   { server model-v2:8080; }
upstream shadow { server model-v1:8080; }

server {
  location /infer {
    mirror /shadow;             # async copy; the mirrored response is discarded
    proxy_pass http://prod;     # every client still gets the prod answer
  }

  location = /shadow {
    internal;
    if ($shadow = 0) { return 204; }   # only flagged requests reach v1
    proxy_pass http://shadow$request_uri;
  }
}
Latency overhead: < 2 ms because shadow response is fire-and-forget (async).

Drift-Exporter Sidecar 

Dockerfile

dockerfile

FROM golang:1.22-alpine
WORKDIR /app
COPY . .
RUN go build -o drift-exporter .
CMD ["./drift-exporter"]

Core logic:

go

for req := range kafkaConsumer {
    feats := extractFeatures(req)
    if req.Source == "shadow" {
        shadowVec.Observe(feats) // ring buffer, capacity 1 000
    } else {
        prodVec.Observe(feats)
    }
    if prodVec.Len() == 1000 && shadowVec.Len() > 0 {
        psi := calcPSI(prodVec, shadowVec)
        js := calcJSDiv(prodVec, shadowVec)
        promPSI.Set(psi)
        promJS.Set(js)
        prodVec.Reset()
        shadowVec.Reset()
    }
}

Exports psi and js_divergence gauges on /metrics.
PSI calculation: 10 equal-width bins; alert when PSI > 0.2.
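The exporter's calcPSI and calcJSDiv can be sketched in Python using the same 10 equal-width bins; the 1e-6 floor on bin proportions is my assumption to keep the logarithms finite on empty bins.

```python
import math

def _proportions(values, lo, hi, bins):
    """Share of values per equal-width bin, floored to avoid log(0)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp top edge into last bin
        counts[max(idx, 0)] += 1
    return [max(c / len(values), 1e-6) for c in counts]

def psi(expected, actual, bins=10):
    """Population Stability Index over equal-width bins spanning both samples."""
    lo, hi = min(expected + actual), max(expected + actual)
    e = _proportions(expected, lo, hi, bins)
    a = _proportions(actual, lo, hi, bins)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def js_divergence(expected, actual, bins=10):
    """Jensen-Shannon divergence (natural log, max ln 2) on the same binning."""
    lo, hi = min(expected + actual), max(expected + actual)
    p = _proportions(expected, lo, hi, bins)
    q = _proportions(actual, lo, hi, bins)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    kl = lambda x, y: sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical windows give PSI ≈ 0; a window shifted by half its range blows well past the 0.2 alert threshold.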

Prometheus Alert Rules 

drift_alerts.yml

yaml

groups:
- name: drift
  rules:
  - alert: FeaturePSIHigh
    expr: psi > 0.2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Feature drift PSI {{ $value }}"
      description: |
        PSI exceeded 0.2 for 5 minutes.
        Check retraining pipeline.
  - alert: LatencySpike
    expr: histogram_quantile(0.95, sum(rate(inference_latency_bucket[5m])) by (le)) > 500  # ms
    for: 3m
    labels:
      severity: critical
Alertmanager route → Slack #mlops.

Grafana Dashboard Highlights 

Panels:

  1. PSI gauge (red > 0.2).
  2. JS divergence heatmap per feature.
  3. Latency p95 overlay prod vs shadow.
  4. Traffic split (% shadow).
  5. Auto-rollback toggle (Alertmanager webhook → Terraform Cloud run).

Full JSON available in repo grafana/drift_dashboard.json.

Auto-Rollback Strategy 

Alertmanager → Webhook Lambda:

python

if alert['labels']['severity'] == 'critical':
    ssm.put_parameter(Name="/model/current", Value="v1", Overwrite=True)

Nginx watches SSM Parameter Store via sidecar; flips traffic 100 % back to stable v1 within 90 s.
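One way to implement that watcher is a small polling sidecar. This is a sketch, not the post's actual sidecar: the parameter name /model/current comes from the Lambda above, while the config path, poll interval, and upstream template are assumptions.

```python
import subprocess
import time

def render_upstream(model_version, port=8080):
    """Render the nginx upstream block for the active model version."""
    return f"upstream prod {{ server model-{model_version}:{port}; }}\n"

def watch(ssm, conf_path="/etc/nginx/conf.d/prod.conf", interval=5):
    """Poll SSM and rewrite the upstream whenever /model/current changes.

    `ssm` is a boto3 SSM client; the process needs ssm:GetParameter.
    """
    current = None
    while True:
        value = ssm.get_parameter(Name="/model/current")["Parameter"]["Value"]
        if value != current:
            with open(conf_path, "w") as f:
                f.write(render_upstream(value))
            subprocess.run(["nginx", "-s", "reload"], check=True)  # hot reload
            current = value
        time.sleep(interval)
```

With a 5 s poll plus nginx's graceful reload, a full flip comfortably fits the 90 s budget quoted above.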

Rollback tests:

  • Inject synthetic drift (PSI = 0.35).
  • Alert fired in 2 m 40 s.
  • Traffic shifted; p95 latency recovered from 530 ms to 210 ms.

Time & Cost Benchmarks 

Component          | Extra Latency | Cost / 1 M req
-------------------|---------------|-----------------------------
Shadow clone       | +2 ms         | +$0.15 (extra model calls)
Drift-Exporter CPU | N/A           | $0.03
Prom/Loki/Grafana  | N/A           | $0.08
Total              | ≈ +2 ms       | $0.26

Less than 1 % of OpenAI token spend at the same QPS.

Case Study — Social Commerce Wallet 

Problem: purchase-fraud model mis-fired after festival sale; false positives ↑ 70 %.
Solution: implemented shadow v2 model + drift-exporter in 3 days.
Outcome:

Metric              | Before   | After
--------------------|----------|----------
Manual review queue | 18 k/day | 3 k/day
PSI alert time      | N/A      | 4 m 20 s
Revenue loss        | $9.2 k   | $1.1 k

Saved $8 k/day during festive surge.

Pitfalls & Pro Tips 

Pitfall                                        | Fix
-----------------------------------------------|----------------------------------------------------------
Shadow model doubles token costs               | Use a log-only shadow for LLMs; skip generation.
Mixed tenant data corrupts the drift signal    | Compute PSI per tenant; alert only when three tenants are red.
Loki disk bloat                                | 30-day retention on the shadow label; prod logs kept 90 days.
False positives in low-traffic night hours     | Keep for: 5m and require psi > 0.2 and requests > 100.
PII leaking into logs                          | Hash with SHA-256; follow the redaction guide from Post #3.
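The "three tenants red" rule from the table can be sketched as a pure function; the function names and the default of three tenants are illustrative, mirroring the thresholds used in this post.

```python
def tenants_in_drift(psi_by_tenant, threshold=0.2):
    """Return the tenants whose PSI exceeds the alert threshold."""
    return [t for t, v in psi_by_tenant.items() if v > threshold]

def should_alert(psi_by_tenant, threshold=0.2, min_red_tenants=3):
    """Fire only when at least `min_red_tenants` tenants are red,
    so one noisy tenant cannot page the on-call by itself."""
    return len(tenants_in_drift(psi_by_tenant, threshold)) >= min_red_tenants
```

A single drifting tenant stays quiet; three or more simultaneously red tenants trip the Slack alert.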

Adoption Roadmap 

Sprint | Milestone
-------|---------------------------------------------------------
1      | Deploy drift-exporter sidecar & expose Prom metrics
2      | Clone 1 % shadow traffic; Grafana dashboard live
3      | Slack alerts + manual rollback runbook
4      | Auto-rollback via SSM or Terraform Cloud
5      | Expand to multi-feature PSI & tenant segmentation

Take-Home Checklist 

  1. Clone 1–5 % of traffic to shadow model.
  2. Drop drift-exporter sidecar; emit PSI & JS divergence every 1 000 req.
  3. Alert when PSI > 0.2 for 5 min.
  4. Auto-rollback via feature flag or endpoint switch.
  5. Review Grafana dashboard weekly; schedule retrain on amber trend.