December 20, 2025 / admin

TL;DR (≈ 85 words)

Notebooks rot, hand-clicked MLflow UI tags drift, and “one-off” Bash deploys become snowflakes. We’ll turn that mess into a GitOps pipeline where a single pull request:

  1. Runs unit tests on notebooks with Papermill.
  2. Registers the model in MLflow.
  3. Creates or updates features in a Feast store.
  4. Deploys the model to a live endpoint (SageMaker, Vertex, or on-prem) via GitHub Actions.

End-to-end latency: ≈ 12 minutes; rollback in < 90 seconds. All YAML, Terraform, and Grafana dashboards included.

Why “Throw It Over the Wall” Still Rules ML 

Most data-science orgs still:

  • Export a model.pkl, ping MLOps, and wait days for infra tickets.
  • Copy/paste feature code between Airflow DAGs—guaranteed drift.
  • Discover too late that the “dev” feature uses log1p(x) while prod uses log10(x)—hello, shadow drift.

The solution is Shift-Left MLOps: turn model, feature definitions, deployment infra, and monitoring into code committed with the PR. If the code changes, the pipeline enforces:

  • Tests pass
  • Model and features register
  • Endpoint rolls forward—or reverts

No ticket queues. No stale notebooks. Predictable releases.

Reference Stack 

| Layer | Tool | Why |
|---|---|---|
| Version Control | GitHub (trunk-based) | PR triggers the Action |
| Model Tracking | MLflow 2.9 | REST API + model lineage |
| Feature Store | Feast 0.37 | Decouples online vs offline |
| Training Notebook | Jupyter + Papermill | Parameterised tests |
| CI/CD | GitHub Actions + Terraform Cloud | GitOps, rollback |
| Serving | AWS SageMaker Endpoint | Blue/green & A/B |
| Monitoring | Prometheus / Grafana | Drift & latency alerts |

(Swap SageMaker for Vertex AI or On-Prem KFServing by changing one Terraform module.)

Notebook Unit Tests 

3.1 Papermill Parameter Test

train.ipynb cell:

```python
# Cell tagged "parameters": Papermill overwrites these values at run time.
EPOCHS = 10

# Validation cell: fail fast on bad parameters.
epochs = int(EPOCHS)
assert 1 <= epochs <= 100, "Epochs out of range"
```

GitHub Action step:

```yaml
- name: Execute notebook tests
  run: |
    papermill train.ipynb output.ipynb -p EPOCHS 5
    papermill evaluate.ipynb output_eval.ipynb
```

Failures break the PR early, before expensive GPUs spin up. Median runtime: 2 minutes on a t3.large runner.

Model + Feature Registration 

4.1 MLflow Registration

```python
import os

import mlflow

# RUN_ID is exported by the previous Action step, parsed from Papermill output.
run_id = os.environ["RUN_ID"]

mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="credit_risk_classifier",
)
```

Action parses run ID from Papermill output. Tag model with Git SHA and dataset hash for lineage.
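That parsing step might look like the sketch below; the `extract_run_id` helper is hypothetical, and it assumes the training notebook prints a line of the form `MLFLOW_RUN_ID=<32 hex chars>` from one of its cells:

```python
import json
import re

def extract_run_id(notebook_path: str) -> str:
    """Scan an executed notebook's cell outputs for an MLflow run ID.

    Assumes the notebook prints 'MLFLOW_RUN_ID=<32 hex chars>'
    somewhere in a stream output; raises if no ID is found so the
    Action fails loudly instead of registering nothing.
    """
    with open(notebook_path) as f:
        nb = json.load(f)
    pattern = re.compile(r"MLFLOW_RUN_ID=([0-9a-f]{32})")
    for cell in nb.get("cells", []):
        for out in cell.get("outputs", []):
            text = "".join(out.get("text", []))
            match = pattern.search(text)
            if match:
                return match.group(1)
    raise ValueError("No MLflow run ID found in notebook output")
```

The Action exports the returned ID as `RUN_ID` for the registration step.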

4.2 Feast Apply in CI

features/credit_risk.py

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

customer = Entity(name="customer_id", join_keys=["customer_id"])

# A FeatureView requires a batch source; the parquet path is illustrative.
credit_source = FileSource(
    path="data/credit_features.parquet",
    timestamp_field="event_timestamp",
)

credit_features = FeatureView(
    name="credit_features",
    entities=[customer],
    ttl=timedelta(days=1),
    schema=[
        Field(name="avg_balance_30d", dtype=Float32),
        Field(name="max_txn_amt_30d", dtype=Float32),
    ],
    online=True,
    source=credit_source,
)
```

Action step:

```yaml
- name: Feast apply
  run: feast apply
```

Feast bumps version if schema changes, ensuring online/offline parity.
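The idea behind that version bump can be approximated with a schema fingerprint. This is an illustrative sketch, not Feast's internal mechanism; `schema_fingerprint` is a hypothetical helper:

```python
import hashlib

def schema_fingerprint(fields: list[tuple[str, str]]) -> str:
    """Hash a feature view's (name, dtype) pairs, order-insensitively.

    If the fingerprint changes between two commits, the feature view
    schema changed and a new version is needed to keep the online and
    offline stores in parity.
    """
    canonical = "|".join(f"{name}:{dtype}" for name, dtype in sorted(fields))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

old = schema_fingerprint([("avg_balance_30d", "Float32"),
                          ("max_txn_amt_30d", "Float32")])
new = schema_fingerprint([("avg_balance_30d", "Float32"),
                          ("max_txn_amt_30d", "Float32"),
                          ("txn_count_7d", "Int64")])
needs_version_bump = old != new  # a field was added, so this is True
```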

Automated Deployment via Terraform & GitHub Actions 

```yaml
env:
  MODEL_NAME: credit_risk_classifier
  MODEL_STAGE: Staging

jobs:
  deploy:
    needs: [test, register]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Terraform plan
        run: terraform -chdir=infra plan -input=false
      - name: Terraform apply
        run: terraform -chdir=infra apply -auto-approve -input=false
```

infra/main.tf:

```hcl
module "sagemaker_model" {
  source = "terraform-aws-modules/sagemaker/aws//modules/model"
  name   = var.model_name

  primary_container = {
    image          = "763104351884.dkr.ecr.us-east-1.amazonaws.com/xgboost:1.5-1"
    model_data_url = var.model_s3_path
  }
}

module "sagemaker_endpoint" {
  source = "terraform-aws-modules/sagemaker/aws//modules/endpoint"
  name   = "${var.model_name}-ep"

  variant_weight = 0.10 # blue/green: 10 % of traffic to the new model
}
```

On successful health probes (p95 latency < 300 ms and error rate < 1 %), traffic shifts to 100 %; otherwise Terraform rolls back. Median deploy time: 6 minutes (the model pull from S3 is the largest slice).
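The promote-or-rollback gate reduces to a pure predicate over the probe window. A minimal sketch under the thresholds above; `should_promote` is a hypothetical helper, and p95 uses the nearest-rank method:

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of observed request latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def should_promote(latencies_ms: list[float], errors: int, total: int) -> bool:
    """Shift traffic to 100 % only if p95 < 300 ms and error rate < 1 %."""
    return p95(latencies_ms) < 300.0 and (errors / total) < 0.01
```

In the pipeline this predicate decides whether the variant weight moves from 0.10 to 1.0 or the previous endpoint config is restored.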

Monitoring & Drift Alerts 

A Prometheus agent on the endpoint emits:

  • inference_latency_ms
  • prediction_drift_psi (population stability index)
  • feature_null_ratio
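For reference, `prediction_drift_psi` follows the standard PSI formula, Σ (actualᵢ − expectedᵢ) · ln(actualᵢ / expectedᵢ) over score buckets. A stdlib-only sketch; the bucket edges are illustrative:

```python
import math

def psi(expected: list[float], actual: list[float], edges: list[float]) -> float:
    """Population stability index between baseline and live score samples.

    Buckets with zero mass are floored at a small epsilon to keep the
    log defined. PSI > 0.2 is the common 'significant drift' cutoff.
    """
    eps = 1e-4

    def proportions(scores: list[float]) -> list[float]:
        counts = [0] * (len(edges) - 1)
        for s in scores:
            for i in range(len(edges) - 1):
                # Half-open buckets; the last edge is inclusive.
                if edges[i] <= s < edges[i + 1] or (i == len(edges) - 2 and s == edges[-1]):
                    counts[i] += 1
                    break
        total = max(len(scores), 1)
        return [max(c / total, eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```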

Grafana threshold panel:

  • Red light if prediction_drift_psi > 0.2 for 3 consecutive minutes.

Alerts go to Slack (#mlops-critical); if the panel stays red for more than 10 minutes, a targeted rollback plan (`terraform apply -target=module.sagemaker_endpoint`) fires.

Time & Cost Benchmarks 

| Stage | Median Time | AWS Cost / run |
|---|---|---|
| Notebook tests | 2 min | $0.02 |
| Build + push model | 3 min | $0.05 |
| Feature apply | 1 min | $0.004 |
| Terraform deploy | 6 min | $0.12 |
| **Total** | **12 min** | **$0.19** |

Rollback (blue/green revert) costs $0.03 and takes 80 s.

Real-World Impact (FinTech Credit-Risk Model) 

Before the pipeline: model refreshes every 3 months, manual drift alerts, rollbacks taking 4 hours.
After CI/CD:

  • Weekly model refresh → credit-risk AUC +4 pp.
  • p95 latency held steady at 220 ms.
  • Rollback tested live — production revert in 80 s, zero user impact.
  • Compliance audit passed 1st try—full lineage via MLflow tags.

Pitfalls & Pro Tips 

| Pitfall | Fix |
|---|---|
| MLflow UI tag drift | Enforce tags via `mlflow.set_tags()` inside the notebook; fail the Action if any are missing. |
| Feast online/offline skew | Schedule hourly `feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")`. |
| Terraform apply timeout (30 min) | Use an EFS-backed model; warm-container shortcut. |
| Feature DAG race | Serialize Airflow tasks that mutate the same entity using a task-level mutex. |
| CI bill shock | Self-host a GitHub runner spot fleet; cost drops ~60 %. |
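The first fix is easy to enforce with a required-tag check in the Action. A sketch; `check_required_tags` is a hypothetical helper, and the tag names are the Git SHA and dataset hash suggested earlier for lineage:

```python
def check_required_tags(tags: dict[str, str]) -> list[str]:
    """Return the lineage tags that are missing or empty.

    The CI step fails the PR if this list is non-empty, so a model can
    never reach the registry without a Git SHA and dataset hash.
    """
    required = ("git_sha", "dataset_hash")
    return [t for t in required if not tags.get(t)]

missing = check_required_tags({"git_sha": "abc123"})
# a non-empty result (here the dataset hash) should fail the build
```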

Adoption Roadmap 

| Sprint | Milestone |
|---|---|
| 1 | Add Papermill tests + MLflow tracking |
| 2 | Feast offline + online stores, `feast apply` in CI |
| 3 | Terraform SageMaker blue/green deploy |
| 4 | Prometheus drift metrics + auto-rollback |
| 5 | Merge notebooks into repo trunk; freeze ad-hoc JupyterHub |

Take-Home Checklist 

  1. Parameter-test notebooks with Papermill.
  2. Register models and features in the same PR.
  3. Deploy via Terraform blue/green; rollback on health fail.
  4. Monitor drift & latency in Grafana; auto-alert Slack.
  5. Audit lineage with MLflow tags & Git SHA.