December 20, 2025 / admin

TL;DR (≈ 85 words)

Notebooks rot, hand-clicked MLflow UI tags drift, and “one-off” Bash deploys become snowflakes. We’ll turn that mess into a GitOps pipeline where a single pull request:

  1. Runs unit tests on notebooks with Papermill.
  2. Registers the model in MLflow.
  3. Creates or updates features in a Feast store.
  4. Deploys the model to a live endpoint (SageMaker, Vertex, or on-prem) via GitHub Actions.

End-to-end latency: ≈ 12 minutes; rollback in < 90 seconds. All YAML, Terraform, and Grafana dashboards included.

Why “Throw It Over the Wall” Still Rules ML 

Most data-science orgs still:

  • Export a model.pkl, ping MLOps, and wait days for infra tickets.
  • Copy/paste feature code between Airflow DAGs—guaranteed drift.
  • Discover too late that the “dev” feature uses log1p(x) while prod uses log10(x)—hello, shadow drift.

The solution is Shift-Left MLOps: turn model, feature definitions, deployment infra, and monitoring into code committed with the PR. If the code changes, the pipeline enforces:

  • Tests pass
  • Model and features register
  • Endpoint rolls forward—or reverts

No ticket queues. No stale notebooks. Predictable releases.

Reference Stack 

| Layer | Tool | Why |
|---|---|---|
| Version Control | GitHub (trunk-based) | PR triggers the Action |
| Model Tracking | MLflow 2.9 | REST API + model lineage |
| Feature Store | Feast 0.37 | Decouples online vs offline |
| Training Notebook | Jupyter + Papermill | Parameterised tests |
| CI/CD | GitHub Actions + Terraform Cloud | GitOps, rollback |
| Serving | AWS SageMaker Endpoint | Blue/green & A/B |
| Monitoring | Prometheus / Grafana | Drift & latency alerts |

(Swap SageMaker for Vertex AI or On-Prem KFServing by changing one Terraform module.)

Notebook Unit Tests 

3.1 Papermill Parameter Test

train.ipynb cell:

```python
# Cell tagged "parameters": Papermill overwrites these values at run time.
EPOCHS = 10

# Validation cell: fail fast on bad parameters.
epochs = int(EPOCHS)
assert 1 <= epochs <= 100, "Epochs out of range"
```

GitHub Action step:

```yaml
- name: Execute notebook tests
  run: |
    papermill train.ipynb output.ipynb -p EPOCHS 5
    papermill evaluate.ipynb output_eval.ipynb
```

Failures break the PR early, before expensive GPUs spin up. Median runtime: 2 minutes on a t3.large runner.

Model + Feature Registration 

4.1 MLflow Registration

```python
import os

import mlflow

# RUN_ID is exported by the previous Action step, parsed from Papermill output.
run_id = os.environ["RUN_ID"]

mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="credit_risk_classifier",
)
```

Action parses run ID from Papermill output. Tag model with Git SHA and dataset hash for lineage.
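That parsing step might look like the sketch below; the `extract_run_id` helper is hypothetical, and it assumes the training notebook prints a line of the form `MLFLOW_RUN_ID=<32 hex chars>` from one of its cells:

```python
import json
import re

def extract_run_id(notebook_path: str) -> str:
    """Scan an executed notebook's cell outputs for an MLflow run ID.

    Assumes the notebook prints 'MLFLOW_RUN_ID=<32 hex chars>'
    somewhere in a stream output; raises if no ID is found so the
    Action fails loudly instead of registering nothing.
    """
    with open(notebook_path) as f:
        nb = json.load(f)
    pattern = re.compile(r"MLFLOW_RUN_ID=([0-9a-f]{32})")
    for cell in nb.get("cells", []):
        for out in cell.get("outputs", []):
            text = "".join(out.get("text", []))
            match = pattern.search(text)
            if match:
                return match.group(1)
    raise ValueError("No MLflow run ID found in notebook output")
```

The Action exports the returned ID as `RUN_ID` for the registration step.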

4.2 Feast Apply in CI

features/credit_risk.py

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

customer = Entity(name="customer_id", join_keys=["customer_id"])

# A FeatureView requires a batch source; the parquet path is illustrative.
credit_source = FileSource(
    path="data/credit_features.parquet",
    timestamp_field="event_timestamp",
)

credit_features = FeatureView(
    name="credit_features",
    entities=[customer],
    ttl=timedelta(days=1),
    schema=[
        Field(name="avg_balance_30d", dtype=Float32),
        Field(name="max_txn_amt_30d", dtype=Float32),
    ],
    online=True,
    source=credit_source,
)
```

Action step:

```yaml
- name: Feast apply
  run: feast apply
```

Feast bumps version if schema changes, ensuring online/offline parity.
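The idea behind that version bump can be approximated with a schema fingerprint. This is an illustrative sketch, not Feast's internal mechanism; `schema_fingerprint` is a hypothetical helper:

```python
import hashlib

def schema_fingerprint(fields: list[tuple[str, str]]) -> str:
    """Hash a feature view's (name, dtype) pairs, order-insensitively.

    If the fingerprint changes between two commits, the feature view
    schema changed and a new version is needed to keep the online and
    offline stores in parity.
    """
    canonical = "|".join(f"{name}:{dtype}" for name, dtype in sorted(fields))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

old = schema_fingerprint([("avg_balance_30d", "Float32"),
                          ("max_txn_amt_30d", "Float32")])
new = schema_fingerprint([("avg_balance_30d", "Float32"),
                          ("max_txn_amt_30d", "Float32"),
                          ("txn_count_7d", "Int64")])
needs_version_bump = old != new  # a field was added, so this is True
```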

Automated Deployment via Terraform & GitHub Actions 

```yaml
env:
  MODEL_NAME: credit_risk_classifier
  MODEL_STAGE: Staging

jobs:
  deploy:
    needs: [test, register]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Terraform plan
        run: terraform -chdir=infra plan -input=false
      - name: Terraform apply
        run: terraform -chdir=infra apply -auto-approve -input=false
```

infra/main.tf:

```hcl
module "sagemaker_model" {
  source = "terraform-aws-modules/sagemaker/aws//modules/model"
  name   = var.model_name

  primary_container = {
    image          = "763104351884.dkr.ecr.us-east-1.amazonaws.com/xgboost:1.5-1"
    model_data_url = var.model_s3_path
  }
}

module "sagemaker_endpoint" {
  source = "terraform-aws-modules/sagemaker/aws//modules/endpoint"
  name   = "${var.model_name}-ep"

  variant_weight = 0.10 # blue/green: 10 % of traffic to the new model
}
```

On successful health probes (p95 latency < 300 ms and error rate < 1 %), traffic shifts to 100 %; otherwise Terraform rolls back. Median deploy time: 6 minutes (the model pull from S3 is the largest slice).
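The promote-or-rollback gate reduces to a pure predicate over the probe window. A minimal sketch under the thresholds above; `should_promote` is a hypothetical helper, and p95 uses the nearest-rank method:

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of observed request latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def should_promote(latencies_ms: list[float], errors: int, total: int) -> bool:
    """Shift traffic to 100 % only if p95 < 300 ms and error rate < 1 %."""
    return p95(latencies_ms) < 300.0 and (errors / total) < 0.01
```

In the pipeline this predicate decides whether the variant weight moves from 0.10 to 1.0 or the previous endpoint config is restored.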

Monitoring & Drift Alerts 

A Prometheus agent on the endpoint emits:

  • inference_latency_ms
  • prediction_drift_psi (population stability index)
  • feature_null_ratio
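For reference, `prediction_drift_psi` follows the standard PSI formula, Σ (actualᵢ − expectedᵢ) · ln(actualᵢ / expectedᵢ) over score buckets. A stdlib-only sketch; the bucket edges are illustrative:

```python
import math

def psi(expected: list[float], actual: list[float], edges: list[float]) -> float:
    """Population stability index between baseline and live score samples.

    Buckets with zero mass are floored at a small epsilon to keep the
    log defined. PSI > 0.2 is the common 'significant drift' cutoff.
    """
    eps = 1e-4

    def proportions(scores: list[float]) -> list[float]:
        counts = [0] * (len(edges) - 1)
        for s in scores:
            for i in range(len(edges) - 1):
                # Half-open buckets; the last edge is inclusive.
                if edges[i] <= s < edges[i + 1] or (i == len(edges) - 2 and s == edges[-1]):
                    counts[i] += 1
                    break
        total = max(len(scores), 1)
        return [max(c / total, eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```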

Grafana threshold panel:

  • Red light if prediction_drift_psi > 0.2 for 3 consecutive minutes.

Alerts go to Slack (#mlops-critical); if the panel stays red for more than 10 minutes, a targeted rollback plan (`terraform apply -target=module.sagemaker_endpoint`) fires.

Time & Cost Benchmarks 

| Stage | Median Time | AWS Cost / run |
|---|---|---|
| Notebook tests | 2 min | $0.02 |
| Build + push model | 3 min | $0.05 |
| Feature apply | 1 min | $0.004 |
| Terraform deploy | 6 min | $0.12 |
| **Total** | **12 min** | **$0.19** |

Rollback (blue/green revert) costs $0.03 and takes 80 s.

Real-World Impact (FinTech Credit-Risk Model) 

Before the pipeline: model refreshes every 3 months, manual drift alerts, rollbacks taking 4 hours.
After CI/CD:

  • Weekly model refresh → credit-risk AUC +4 pp.
  • p95 latency held steady at 220 ms.
  • Rollback tested live — production revert in 80 s, zero user impact.
  • Compliance audit passed 1st try—full lineage via MLflow tags.

Pitfalls & Pro Tips 

| Pitfall | Fix |
|---|---|
| MLflow UI tag drift | Enforce tags via `mlflow.set_tags()` inside the notebook; fail the Action if any are missing. |
| Feast online/offline skew | Schedule hourly `feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")`. |
| Terraform apply timeout (30 min) | Use an EFS-backed model; warm-container shortcut. |
| Feature DAG race | Serialize Airflow tasks that mutate the same entity using a task-level mutex. |
| CI bill shock | Self-host a GitHub runner spot fleet; cost drops ~60 %. |
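The first fix is easy to enforce with a required-tag check in the Action. A sketch; `check_required_tags` is a hypothetical helper, and the tag names are the Git SHA and dataset hash suggested earlier for lineage:

```python
def check_required_tags(tags: dict[str, str]) -> list[str]:
    """Return the lineage tags that are missing or empty.

    The CI step fails the PR if this list is non-empty, so a model can
    never reach the registry without a Git SHA and dataset hash.
    """
    required = ("git_sha", "dataset_hash")
    return [t for t in required if not tags.get(t)]

missing = check_required_tags({"git_sha": "abc123"})
# a non-empty result (here the dataset hash) should fail the build
```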

Adoption Roadmap 

| Sprint | Milestone |
|---|---|
| 1 | Add Papermill tests + MLflow tracking |
| 2 | Feast offline + online stores, `feast apply` in CI |
| 3 | Terraform SageMaker blue/green deploy |
| 4 | Prometheus drift metrics + auto-rollback |
| 5 | Merge notebooks into repo trunk; freeze ad-hoc JupyterHub |

Take-Home Checklist 

  1. Parameter-test notebooks with Papermill.
  2. Register models and features in the same PR.
  3. Deploy via Terraform blue/green; rollback on health fail.
  4. Monitor drift & latency in Grafana; auto-alert Slack.
  5. Audit lineage with MLflow tags & Git SHA.