{"id":77,"date":"2025-12-20T06:52:05","date_gmt":"2025-12-20T06:52:05","guid":{"rendered":"https:\/\/steadyrabbit.in\/blogs\/?p=77"},"modified":"2025-12-20T06:52:50","modified_gmt":"2025-12-20T06:52:50","slug":"from-notebook-to-prod-in-one-commit-mlflow-feature-store-ci-cd","status":"publish","type":"post","link":"https:\/\/steadyrabbit.in\/blogs\/from-notebook-to-prod-in-one-commit-mlflow-feature-store-ci-cd\/","title":{"rendered":"From Notebook to Prod in One Commit: MLflow + Feature Store CI\/CD"},"content":{"rendered":"\n<h4 class=\"wp-block-heading\">TL;DR (\u2248 85 words)<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Notebooks rot, hand-clicked MLflow UI tags drift, and \u201cone-off\u201d Bash deploys become snowflakes. We\u2019ll turn that mess into a <strong>Git-ops pipeline<\/strong> where a single pull-request:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Runs unit tests on notebooks with Papermill.<br><\/li>\n\n\n\n<li>Registers the model in MLflow.<br><\/li>\n\n\n\n<li>Creates or updates features in a Feast store.<br><\/li>\n\n\n\n<li>Deploys the model to a live endpoint (SageMaker, Vertex, or on-prem) via GitHub Actions.<br><\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">End-to-end latency: <strong>\u2248 12 minutes<\/strong>; rollback in &lt; 90 seconds. All YAML, Terraform, and Grafana dashboards included.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Why \u201cThrow It Over the Wall\u201d Still Rules ML\u00a0<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Most data-science orgs still:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Export a model.pkl, ping MLOps, and wait days for infra tickets.<br><\/li>\n\n\n\n<li>Copy\/paste feature code between Airflow DAGs\u2014guaranteed drift.<br><\/li>\n\n\n\n<li>Discover too late that the \u201cdev\u201d feature uses log1p(x) while prod uses log10(x)\u2014hello, shadow drift.<br><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The solution is <strong>Shift-Left MLOps<\/strong>: turn model, feature definitions, deployment infra, and monitoring into <strong>code committed with the PR<\/strong>. If the code changes, the pipeline enforces:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Tests pass<br><\/li>\n\n\n\n<li>Model and features register<br><\/li>\n\n\n\n<li>Endpoint rolls forward\u2014or reverts<br><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">No ticket queues. No stale notebooks. Predictable releases.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Reference Stack\u00a0<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Layer<\/strong><\/td><td><strong>Tool<\/strong><\/td><td><strong>Why<\/strong><\/td><\/tr><tr><td><strong>Version Control<\/strong><\/td><td>GitHub \u2013 trunk-based<\/td><td>PR triggers Action<\/td><\/tr><tr><td><strong>Model Tracking<\/strong><\/td><td>MLflow 2.9<\/td><td>REST API + model lineage<\/td><\/tr><tr><td><strong>Feature Store<\/strong><\/td><td>Feast 0.37<\/td><td>Decouples online vs offline<\/td><\/tr><tr><td><strong>Training Notebook<\/strong><\/td><td>Jupyter + Papermill<\/td><td>Parameterised tests<\/td><\/tr><tr><td><strong>CI\/CD<\/strong><\/td><td>GitHub Actions + Terraform Cloud<\/td><td>Git-ops, rollback<\/td><\/tr><tr><td><strong>Serving<\/strong><\/td><td>AWS SageMaker Endpoint<\/td><td>Blue\/green &amp; A\/B<\/td><\/tr><tr><td><strong>Monitoring<\/strong><\/td><td>Prometheus \/ Grafana<\/td><td>Drift &amp; latency alerts<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>(Swap SageMaker for Vertex AI or On-Prem KFServing by changing one Terraform module.)<\/em><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Notebook Unit Tests\u00a0<\/h4>\n\n\n\n<h5 class=\"wp-block-heading\">3.1 Papermill Parameter Test<\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">train.ipynb cell:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">python<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">CopyEdit<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">epochs = int(params.get(&#8220;EPOCHS&#8221;, 10))<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">assert 1 &lt;= epochs &lt;= 100, &#8220;Epochs out of range&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">GitHub Action step:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">yaml<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">CopyEdit<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&#8211; name: Execute notebook tests<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;run: |<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;papermill train.ipynb output.ipynb -p EPOCHS 5<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;papermill evaluate.ipynb output_eval.ipynb<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Failures break the PR early\u2014before expensive GPUs spin.<em>Median runtime:<\/em> 2 minutes on t3.large runner.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Model + Feature Registration\u00a0<\/h4>\n\n\n\n<h5 class=\"wp-block-heading\">4.1 MLflow Registration<\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">bash<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">CopyEdit<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">mlflow models register \\<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&#8211;model-uri &#8220;runs:\/${RUN_ID}\/model&#8221; \\<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&#8211;name &#8220;credit_risk_classifier&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Action parses run ID from Papermill output. Tag model with Git SHA and dataset hash for lineage.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">4.2 Feast Apply in CI<\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">features\/credit_risk.py<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">python<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">CopyEdit<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">from feast import FeatureView, Entity, Field<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">from feast.types import Float32, Int64<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">customer = Entity(name=&#8221;customer_id&#8221;, value_type=Int64)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">credit_features = FeatureView(<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;name=&#8221;credit_features&#8221;,<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;entities=[customer],<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;ttl=86400,<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;schema=[<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Field(name=&#8221;avg_balance_30d&#8221;, dtype=Float32),<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Field(name=&#8221;max_txn_amt_30d&#8221;, dtype=Float32),<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;],<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;online=True,<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Action step:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">yaml<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">CopyEdit<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&#8211; name: Feast apply<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;run: feast apply<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Feast bumps version if schema changes, ensuring online\/offline parity.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Automated Deployment via Terraform &amp; GitHub Actions\u00a0<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">yaml<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">CopyEdit<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">env:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;MODEL_NAME: credit_risk_classifier<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;MODEL_STAGE: Staging<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">jobs:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;deploy:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;needs: [test, register]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;runs-on: ubuntu-latest<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;steps:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#8211; uses: actions\/checkout@v4<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#8211; name: Terraform plan<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;run: terraform -chdir=infra plan -input=false<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#8211; name: Terraform apply<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;run: terraform -chdir=infra apply -auto-approve -input=false<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">infra\/main.tf:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">hcl<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">CopyEdit<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">module &#8220;sagemaker_model&#8221; {<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;source &nbsp; &nbsp; = &#8220;terraform-aws-modules\/sagemaker\/aws\/\/modules\/model&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;name &nbsp; &nbsp; &nbsp; = var.model_name<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;primary_container = {<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;image = &#8220;763104351884.dkr.ecr.us-east-1.amazonaws.com\/xgboost:1.5-1&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;&nbsp;&nbsp;model_data_url = &#8220;${var.model_s3_path}&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">module &#8220;sagemaker_endpoint&#8221; {<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;source = &#8220;terraform-aws-modules\/sagemaker\/aws\/\/modules\/endpoint&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;name &nbsp; = &#8220;${var.model_name}-ep&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&nbsp;&nbsp;variant_weight = 0.10 &nbsp; # blue\/green\u201410 % to new model<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">On successful health probes (p95 latency &lt; 300 ms &amp; 0 \u2265 error_rate &lt; 1 %), traffic shifts to 100 %; else Terraform rollbacks.<em>Median deploy time:<\/em> 6 minutes (model S3 pull largest slice).<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Monitoring &amp; Drift Alerts\u00a0<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Prometheus agent on endpoint emits:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>inference_latency_ms<br><\/li>\n\n\n\n<li>prediction_drift_psi (population stability index)<br><\/li>\n\n\n\n<li>feature_null_ratio<br><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Grafana threshold panel:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Red light if prediction_drift_psi > 0.2 for 3 consecutive minutes.<br><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Slack alert to #mlops-critical; Terraform -target=endpoint rollback plan fires if red &gt; 10 min.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Time &amp; Cost Benchmarks\u00a0<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Stage<\/strong><\/td><td><strong>Median Time<\/strong><\/td><td><strong>AWS Cost \/ run<\/strong><\/td><\/tr><tr><td>Notebook tests<\/td><td>2 m<\/td><td>$0.02<\/td><\/tr><tr><td>Build + push model<\/td><td>3 m<\/td><td>$0.05<\/td><\/tr><tr><td>Feature apply<\/td><td>1 m<\/td><td>$0.004<\/td><\/tr><tr><td>Terraform deploy<\/td><td>6 m<\/td><td>$0.12<\/td><\/tr><tr><td><strong>Total<\/strong><\/td><td><strong>12 m<\/strong><\/td><td><strong>$0.19<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Rollback (blue\/green revert) costs $0.03 &amp; takes 80 s.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Real-World Impact (FinTech Credit-Risk Model)\u00a0<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Before pipeline<\/em>: Model refresh every 3 months, drift alerts manual, rollbacks 4 hours.<br><em>After CI\/CD<\/em>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Weekly model refresh \u2192 credit-risk AUC +4 pp.<br><\/li>\n\n\n\n<li>p95 latency held steady at 220 ms.<br><\/li>\n\n\n\n<li>Rollback tested live \u2014 production revert in 80 s, zero user impact.<br><\/li>\n\n\n\n<li>Compliance audit passed 1st try\u2014full lineage via MLflow tags.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Pitfalls &amp; Pro Tips\u00a0<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Pitfall<\/strong><\/td><td><strong>Fix<\/strong><\/td><\/tr><tr><td>\u201cMLflow UI tag drift\u201d<\/td><td>Enforce tags via mlflow.set_tags() inside notebook; fail Action if missing.<\/td><\/tr><tr><td>Feast online\/offline skew<\/td><td>Schedule hourly feast materialize-incremental $(date +%s)<\/td><\/tr><tr><td>Terraform apply timeout 30 m<\/td><td>Use EFS-backed model; warm container shortcut.<\/td><\/tr><tr><td>Feature DAG race<\/td><td>Serialize Airflow tasks that mutate same entity using task-level mutex.<\/td><\/tr><tr><td>CI bill shock<\/td><td>Self-host GitHub runner spot fleet; cost drops 60 %.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Adoption Roadmap\u00a0<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Sprint<\/strong><\/td><td><strong>Milestone<\/strong><\/td><\/tr><tr><td>1<\/td><td>Add Papermill tests + MLflow tracking<\/td><\/tr><tr><td>2<\/td><td>Feast offline + online stores, feast apply in CI<\/td><\/tr><tr><td>3<\/td><td>Terraform SageMaker blue\/green deploy<\/td><\/tr><tr><td>4<\/td><td>Prometheus drift metrics + auto-rollback<\/td><\/tr><tr><td>5<\/td><td>Merge notebooks into repo trunk; freeze ad-hoc JupyterHub<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h4 class=\"wp-block-heading\">Take-Home Checklist\u00a0<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Parameter-test notebooks with Papermill.<br><\/li>\n\n\n\n<li>Register models <strong>and<\/strong> features in the same PR.<br><\/li>\n\n\n\n<li>Deploy via Terraform blue\/green; rollback on health fail.<br><\/li>\n\n\n\n<li>Monitor drift &amp; latency in Grafana; auto-alert Slack.<br><\/li>\n\n\n\n<li>Audit lineage with MLflow tags &amp; Git SHA.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>TL;DR (\u2248 85 words) Notebooks rot, hand-clicked MLflow UI tags drift, and \u201cone-off\u201d Bash deploys become snowflakes. We\u2019ll turn that mess into a Git-ops pipeline where a single pull-request: End-to-end latency: \u2248 12 minutes; rollback in &lt; 90 seconds. All YAML, Terraform, and Grafana dashboards included. Why \u201cThrow It Over the Wall\u201d Still Rules ML\u00a0 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":20,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-77","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml-best-practices"],"_links":{"self":[{"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/posts\/77","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/comments?post=77"}],"version-history":[{"count":2,"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/posts\/77\/revisions"}],"predecessor-version":[{"id":79,"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/posts\/77\/revisions\/79"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/media\/20"}],"wp:attachment":[{"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/media?parent=77"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/categories?post=77"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/steadyrabbit.in\/blogs\/wp-json\/wp\/v2\/tags?post=77"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}