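Notebooks rot, hand-clicked MLflow UI tags drift, and “one-off” Bash deploys become snowflakes. We’ll turn that mess into a GitOps pipeline where a single pull request runs the notebook tests, registers the model, applies the feature definitions, deploys through Terraform, and wires up monitoring.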
End-to-end latency: ≈ 12 minutes; rollback in < 90 seconds. All YAML, Terraform, and Grafana dashboards included.
Most data-science orgs still:
The solution is Shift-Left MLOps: turn model, feature definitions, deployment infra, and monitoring into code committed with the PR. If the code changes, the pipeline enforces:
No ticket queues. No stale notebooks. Predictable releases.
| Layer | Tool | Why |
| --- | --- | --- |
| Version Control | GitHub – trunk-based | PR triggers Action |
| Model Tracking | MLflow 2.9 | REST API + model lineage |
| Feature Store | Feast 0.37 | Decouples online vs offline |
| Training Notebook | Jupyter + Papermill | Parameterised tests |
| CI/CD | GitHub Actions + Terraform Cloud | Git-ops, rollback |
| Serving | AWS SageMaker Endpoint | Blue/green & A/B |
| Monitoring | Prometheus / Grafana | Drift & latency alerts |
(Swap SageMaker for Vertex AI or on-prem KServe (formerly KFServing) by changing one Terraform module.)
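train.ipynb cell:
```python
EPOCHS = 10  # lives in the cell tagged "parameters"; Papermill injects -p overrides after it
epochs = int(EPOCHS)  # next cell: validate the value now in effect
assert 1 <= epochs <= 100, "Epochs out of range"
```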
GitHub Action step:
```yaml
- name: Execute notebook tests
  run: |
    papermill train.ipynb output.ipynb -p EPOCHS 5
    papermill evaluate.ipynb output_eval.ipynb
```
Failures break the PR early, before expensive GPUs spin up. Median runtime: 2 minutes on a t3.large runner.
```bash
# Register the run's model through the MLflow Python API
python -c "import mlflow; mlflow.register_model('runs:/${RUN_ID}/model', 'credit_risk_classifier')"
```
The Action parses the run ID from the Papermill output notebook, then tags the registered model with the Git SHA and dataset hash for lineage.
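The run-ID parsing and lineage tagging could be sketched as follows. This is a minimal sketch, assuming the training notebook prints a line of the form MLFLOW_RUN_ID=<id> to stdout (a hypothetical convention, not something Papermill or MLflow does on its own); the function names and tag keys are illustrative as well.

```python
import hashlib
import json
import re

# Assumed convention: some notebook cell prints "MLFLOW_RUN_ID=<hex id>".
RUN_ID_PATTERN = re.compile(r"MLFLOW_RUN_ID=([0-9a-f]+)")


def extract_run_id(notebook_json: str) -> str:
    """Return the first run ID found in the stream outputs of an executed notebook."""
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        for output in cell.get("outputs", []):
            if output.get("output_type") != "stream":
                continue
            text = output.get("text", "")
            # Notebook JSON stores stream text as a string or a list of strings.
            if isinstance(text, list):
                text = "".join(text)
            match = RUN_ID_PATTERN.search(text)
            if match:
                return match.group(1)
    raise ValueError("no MLFLOW_RUN_ID line found in notebook output")


def lineage_tags(git_sha: str, dataset_bytes: bytes) -> dict:
    """Build the tags to attach to the registered model (keys are illustrative)."""
    return {
        "git_sha": git_sha,
        "dataset_hash": hashlib.sha256(dataset_bytes).hexdigest(),
    }
```

In a workflow, the output of extract_run_id would feed the RUN_ID variable used by the registration step, and the tag dictionary would be passed to whichever MLflow tagging call the team standardises on.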
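features/credit_risk.py
```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32

# Entity keyed on the integer customer ID column
customer = Entity(name="customer_id", join_keys=["customer_id"], value_type=ValueType.INT64)

# Hypothetical offline source: the path and timestamp column are assumptions
credit_source = FileSource(
    path="data/credit_features.parquet",
    timestamp_field="event_timestamp",
)

credit_features = FeatureView(
    name="credit_features",
    entities=[customer],
    ttl=timedelta(days=1),  # 86400 seconds
    schema=[
        Field(name="avg_balance_30d", dtype=Float32),
        Field(name="max_txn_amt_30d", dtype=Float32),
    ],
    source=credit_source,
    online=True,
)
```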
Action step:
```yaml
- name: Feast apply
  run: feast apply
```
feast apply updates the registry whenever the schema changes, keeping the online and offline feature definitions in parity.
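To make a schema change visible in CI logs, one option is to fingerprint the feature schema and compare it with the previous commit's value. The sketch below is independent of Feast and purely illustrative: the function names and the plain name-to-dtype mapping are assumptions, not part of the Feast API.

```python
import hashlib
import json


def schema_fingerprint(fields: dict) -> str:
    """Hash a {feature_name: dtype_name} mapping, independent of key order."""
    canonical = json.dumps(sorted(fields.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def schema_changed(previous: dict, current: dict) -> bool:
    """True when the two mappings differ in any feature name or dtype."""
    return schema_fingerprint(previous) != schema_fingerprint(current)


# Example: the two features defined in features/credit_risk.py
current_schema = {"avg_balance_30d": "Float32", "max_txn_amt_30d": "Float32"}
print(schema_fingerprint(current_schema)[:12])
```

Sorting the items before hashing means reordering fields in the file does not count as a change, while renaming a feature or changing its dtype does.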
```yaml
env:
  MODEL_NAME: credit_risk_classifier
  MODEL_STAGE: Staging
jobs:
  deploy:
    needs: [test, register]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # A fresh checkout has no providers or modules, so init must run before plan
      - name: Terraform init
        run: terraform -chdir=infra init -input=false
      - name: Terraform plan
        run: terraform -chdir=infra plan -input=false
      - name: Terraform apply
        run: terraform -chdir=infra apply -auto-approve -input=false
```
infra/main.tf:
```hcl
module "sagemaker_model" {
  source = "terraform-aws-modules/sagemaker/aws//modules/model"
  name   = var.model_name

  primary_container = {
    image          = "763104351884.dkr.ecr.us-east-1.amazonaws.com/xgboost:1.5-1"
    model_data_url = var.model_s3_path
  }
}

module "sagemaker_endpoint" {
  source         = "terraform-aws-modules/sagemaker/aws//modules/endpoint"
  name           = "${var.model_name}-ep"
  variant_weight = 0.10 # blue/green: 10% to new model
}
```
On successful health probes (p95 latency < 300 ms and 0 ≤ error_rate < 1 %), traffic shifts to 100 %; otherwise Terraform rolls back. Median deploy time: 6 minutes (the model pull from S3 is the largest slice).
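The promotion gate can be sketched in a few lines. This is an illustration under stated assumptions rather than the pipeline's actual probe: the function names are hypothetical, the percentile uses the nearest-rank method, and the latency samples and request counts are assumed to come from whatever metrics source the endpoint exposes.

```python
P95_LATENCY_LIMIT_MS = 300.0  # threshold from the text
ERROR_RATE_LIMIT = 0.01       # 1 %


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_promote(latencies_ms, error_count, request_count):
    """Return True when the new variant may take 100 % of traffic."""
    if request_count == 0 or not latencies_ms:
        return False  # no evidence yet, so do not promote
    error_rate = error_count / request_count
    return p95(latencies_ms) < P95_LATENCY_LIMIT_MS and error_rate < ERROR_RATE_LIMIT
```

Returning False on empty input is a deliberate choice: a variant that received no probe traffic stays at its current weight rather than being promoted by default.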
Prometheus agent on endpoint emits:
Grafana threshold panel:
A Slack alert goes to #mlops-critical; if the panel stays red for more than 10 minutes, a Terraform rollback plan (-target=endpoint) fires.
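The "red for more than 10 minutes" rule amounts to checking how long the latest unbroken run of breached samples has lasted. Here is a minimal sketch, assuming the alert state is available as a time-ordered list of (timestamp, is_red) samples; the function name and data shape are hypothetical.

```python
from datetime import datetime, timedelta

RED_WINDOW = timedelta(minutes=10)  # "red > 10 min" from the text


def should_roll_back(samples, now):
    """Decide whether the alert has been red long enough to trigger rollback.

    samples: list of (timestamp, is_red) tuples, oldest first.
    Returns True when the most recent unbroken run of red samples
    started more than RED_WINDOW before `now`.
    """
    red_since = None
    for timestamp, is_red in samples:
        if is_red:
            if red_since is None:
                red_since = timestamp  # start of a new red run
        else:
            red_since = None  # any green sample resets the run
    return red_since is not None and now - red_since > RED_WINDOW
```

Resetting on any green sample keeps a flapping metric from triggering a rollback; only a sustained breach does.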
| Stage | Median Time | AWS Cost / run |
| --- | --- | --- |
| Notebook tests | 2 m | $0.02 |
| Build + push model | 3 m | $0.05 |
| Feature apply | 1 m | $0.004 |
| Terraform deploy | 6 m | $0.12 |
| Total | 12 m | $0.19 |
Rollback (blue/green revert) costs $0.03 & takes 80 s.
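As a quick check of the arithmetic, the totals follow directly from the per-stage figures in the table. The snippet below only re-adds those published numbers; the variable names are illustrative.

```python
# (median minutes, AWS cost in USD per run), copied from the table above
stages = {
    "Notebook tests": (2, 0.02),
    "Build + push model": (3, 0.05),
    "Feature apply": (1, 0.004),
    "Terraform deploy": (6, 0.12),
}

total_minutes = sum(minutes for minutes, _ in stages.values())
total_cost = sum(cost for _, cost in stages.values())

print(f"{total_minutes} m, ${total_cost:.2f}")
```

The unrounded cost is $0.194, which the table reports as $0.19.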
Before pipeline: Model refresh every 3 months, drift alerts manual, rollbacks 4 hours.
After CI/CD:
| Pitfall | Fix |
| --- | --- |
| “MLflow UI tag drift” | Enforce tags via mlflow.set_tags() inside notebook; fail Action if missing. |
| Feast online/offline skew | Schedule hourly feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S") |
| Terraform apply timeout 30 m | Use EFS-backed model; warm container shortcut. |
| Feature DAG race | Serialize Airflow tasks that mutate same entity using task-level mutex. |
| CI bill shock | Self-host GitHub runner spot fleet; cost drops 60 %. |
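The tag-enforcement fix in the first row can be reduced to a small check that a CI step runs against the tags recorded for the run. This is a sketch only: the required keys and the function name are assumptions, and fetching the tags from MLflow is left out.

```python
REQUIRED_TAGS = ("git_sha", "dataset_hash")  # illustrative key names


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank."""
    return [key for key in required if not str(tags.get(key, "")).strip()]


# Example: a run that forgot to record the dataset hash
problems = missing_tags({"git_sha": "3f2a9c1"})
print(problems)
```

A workflow step could exit with a non-zero status whenever the returned list is non-empty, which fails the Action as the table suggests.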
| Sprint | Milestone |
| --- | --- |
| 1 | Add Papermill tests + MLflow tracking |
| 2 | Feast offline + online stores, feast apply in CI |
| 3 | Terraform SageMaker blue/green deploy |
| 4 | Prometheus drift metrics + auto-rollback |
| 5 | Merge notebooks into repo trunk; freeze ad-hoc JupyterHub |