
MLOps in Practice: From Notebook to Production

By Juan Francisco Lebrero · 8 min read

Key takeaways

  • The notebook-to-production gap is mostly organizational and procedural, not technical. The tooling has been commoditized; the missing pieces are ownership, SLAs, and change control.
  • A minimum viable MLOps stack has six parts: model registry, feature store (or equivalent), pipeline orchestrator, serving layer, monitoring, and evaluation harness. Every part must be in place before you ship a model you intend to support.
  • CI/CD for models is not CI/CD for code with extra steps. Models have data inputs, training nondeterminism, and a separate retraining cadence that makes traditional DevOps pipelines insufficient on their own.
  • Observability for ML covers four axes: data drift, concept drift, performance decay, and cost drift. Teams that instrument only latency and error rate discover problems weeks after they matter.
  • Canary deployments with automated rollback gates, owned by the business process team with ML as second-line, are the operating model that makes production ML sustainable over years.

The notebook-to-production gap

A model that reaches 91 percent accuracy in a notebook is the beginning of the project, not the end. The distance from that notebook to a supported production service is measured not in model performance but in the nine artifacts the team has not yet built: a reproducible training pipeline, a model registry, a deployment mechanism, a monitoring job, a rollback path, on-call ownership, an evaluation harness, a retraining trigger, and a clear contract with the business process the model supports.

Teams underestimate this gap because the notebook work is the visible work. A promising demo gets shared in a Monday standup and leadership assumes the hard part is done. Six months later, the same team is still wiring up the feature pipeline and explaining to finance why the system cannot be supported on weekends.

This post covers the stack, the process, and the organizational pieces that close that gap. It assumes you have a model worth deploying and focuses on everything that happens between "the model works" and "the model is part of the business."

The MLOps maturity ladder

We use a five-level ladder in engagements to locate a team, set expectations, and pick the right next investment. No team should skip more than one level at a time.

MLOps maturity levels — from manual to fully automated
| Level | Name | What it looks like | Primary risk |
| --- | --- | --- | --- |
| 0 | Manual | Notebook in production, no reproducibility, deploys happen by hand. | Model breaks silently, no one notices for weeks. |
| 1 | Scripted | Training and deployment scripts exist but are not triggered automatically. | Drift between dev and prod environments; tribal knowledge. |
| 2 | CI-automated | CI runs tests on code changes; deployments go through a pipeline. | Training remains manual; models retrained on an ad-hoc schedule. |
| 3 | CT-automated | Continuous training pipelines triggered by time, data, or performance. | Automated retraining without automated evaluation can silently degrade quality. |
| 4 | Fully automated | Drift detection triggers retraining and gated canary deploys with auto-rollback. | Governance debt; audit and explainability become the hard problem. |

Most mid-market teams we work with are honestly at level 1 when we arrive. The fastest valuable move is usually level 1 to level 2 plus basic monitoring, which gives the team the confidence to ship changes without fear of silent regressions. Level 3 and 4 are worth pursuing only once the model portfolio justifies the investment — typically at three or more production models per business unit.

The MLOps stack we deploy

The specific tools vary by client (we are deliberately tool-agnostic), but the component roles are fixed. A production-ready stack has six layers, and gaps in any of them show up as incidents later:

  • Model registry. Versioned store of trained model artifacts with lineage to training data and code. The source of truth for what is deployable.
  • Feature store (or pragmatic equivalent). Shared definitions of features, computed consistently across training and serving. Full Feast/Tecton deployments are often overkill; a well-designed feature table plus a small online lookup is the 80/20.
  • Pipeline orchestrator. Airflow, Dagster, or Argo Workflows. Owns training, evaluation, and data validation as code.
  • Serving layer. REST or gRPC service, batch job, or streaming worker, depending on the use case. Sits behind the same gateway and observability as your other services.
  • Monitoring. Metrics for data drift, prediction drift, performance decay, and cost. Alerting wired to the on-call rotation.
  • Evaluation harness. Reproducible offline evaluation plus a lightweight online or shadow evaluation for production. Required on every deploy.
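To make the registry's role concrete, here is a minimal sketch of what a registry record ties together. The fields and names are illustrative, not any specific product's schema; the point is lineage (data hash, code commit) plus an explicit lifecycle status.

```python
from dataclasses import dataclass

@dataclass
class ModelRecord:
    """Illustrative registry entry: lineage plus an explicit status."""
    name: str
    version: str
    training_data_hash: str    # pins the exact dataset the model saw
    code_commit: str           # pins the training code
    metrics: dict              # offline evaluation results
    status: str = "candidate"  # candidate -> staged -> production -> retired

record = ModelRecord(
    name="churn",
    version="2024.06.1",
    training_data_hash="sha256:placeholder",  # illustrative value
    code_commit="9f3e2c1",
    metrics={"auc": 0.87},
)
```

Whatever tool you pick, a record that cannot answer "which data and which code produced this artifact" is not a registry; it is a file share.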

Here is the canonical data flow we set up. Nothing exotic, but every arrow must be instrumented:

Ingest → Validate → Features → Train → Evaluate → Registry → Deploy → Serve → Monitor
                                                                           │
                                                                           ▼
                                                                       Feedback
                                                                           │
                                            ┌───── retraining trigger ────┘
                                            ▼
                                     back to Features / Train

Every arrow emits telemetry. Every box has an owner. Every failure has a runbook.

The goal is not elegance. The goal is that on a Tuesday at 2am, the on-call engineer can look at the system, identify the broken arrow, and act. Our post on data engineering foundations for ML teams goes deeper on the ingestion and validation steps on the left side of this flow.

CI/CD for models — what changes vs software CI/CD

CI/CD for software validates that the code does what the code says. CI/CD for models has to validate two additional things: the code trains a model that meets a quality bar, and the deployed artifact matches the evaluated one. Three specific differences matter in practice.

Tests are multi-layered. Unit tests on transformations. Integration tests on the training pipeline end-to-end. Evaluation tests on held-out metrics. Regression tests on the deployed endpoint against a fixed test set. Skip any layer and you will ship a silent regression within the quarter.
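The four layers can be sketched as plain assertions. Everything below is illustrative: the transformation, the pipeline stub, and the thresholds stand in for your real code, and the endpoint call is a stub rather than a live request.

```python
# 1. Unit test on a transformation: exact and deterministic.
def normalize(x, mean, std):
    return (x - mean) / std

assert normalize(10.0, 10.0, 2.0) == 0.0

# 2. Integration test: the training pipeline runs end-to-end on a tiny fixture.
def run_training_pipeline(rows):
    return {"model": "stub", "n_rows": len(rows)}  # stand-in for the real DAG

assert run_training_pipeline([{"x": 1}, {"x": 2}])["n_rows"] == 2

# 3. Evaluation test: the held-out metric clears a floor, not an exact value.
held_out_auc = 0.86  # would come from the evaluation harness
assert held_out_auc >= 0.80

# 4. Regression test: the deployed endpoint scored against a fixed test set,
#    allowed a small agreed tolerance versus offline evaluation.
def score_endpoint(fixed_test_set):
    return 0.85  # stand-in for calling the live endpoint

assert score_endpoint([]) >= held_out_auc - 0.02
```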

Data is an input. A code change that does not touch the model can still break it if training data shifted under it. The CI pipeline has to either pin the training data or explicitly detect and flag data changes. Running CI against stale data is worse than no CI because it produces false confidence.
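One lightweight way to pin training data in CI, sketched below under the assumption that the dataset (or its manifest) can be serialized deterministically: hash it, record the hash at the last known-good run, and fail loudly when it changes. Names and the failure message are illustrative.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Hash a stable serialization of the dataset (or a manifest of it)."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Recorded at the last known-good training run:
pinned = dataset_fingerprint([{"id": 1, "churned": 0}, {"id": 2, "churned": 1}])

# Recomputed at CI time against the data the pipeline is about to use:
current = dataset_fingerprint([{"id": 1, "churned": 0}, {"id": 2, "churned": 1}])

if current != pinned:
    raise RuntimeError("training data changed under CI; re-pin explicitly")
```

For large datasets, hash a manifest (file paths, row counts, partition timestamps) rather than the rows themselves; the discipline is the same.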

Promotion is gated on evaluation. The artifact that passes CI is not automatically production-eligible. It enters the registry with a status. A separate promotion step compares the candidate against the current production model on a fixed benchmark, and only promotes when it meets or exceeds the incumbent. Human sign-off is optional; the evaluation gate is not.
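A minimal sketch of that promotion gate, with illustrative metric names and an optional agreed tolerance: the candidate must meet or beat the incumbent on every tracked metric before its registry status changes.

```python
def promote(candidate: dict, incumbent: dict, tolerance: float = 0.0) -> bool:
    """True only if candidate meets or beats incumbent on every metric."""
    return all(candidate[m] >= incumbent[m] - tolerance for m in incumbent)

incumbent = {"auc": 0.86, "calibration": 0.91}
candidate = {"auc": 0.88, "calibration": 0.90}

# Better AUC but slightly worse calibration: fails the strict gate,
# passes only if the tolerance was agreed up front.
assert promote(candidate, incumbent) is False
assert promote(candidate, incumbent, tolerance=0.02) is True
```

The tolerance belongs in version control next to the pipeline, not in someone's head, for the same reason canary gate criteria are set before the deploy starts.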

Reuse existing DevOps tooling where you can. GitHub Actions, GitLab CI, Argo, Buildkite: they all work. The places you need ML-specific tools are the registry, the feature store interaction, and the ML-aware monitoring. The CI runner itself is a solved problem.

Observability: data drift, concept drift, performance decay, cost drift

Model observability is harder than service observability because failures are usually quiet. The endpoint returns 200s, the latency is normal, no exceptions fire, and the predictions are wrong. Four axes cover the ways a model degrades:

Data drift. Inputs look different from training data. Detect with per-feature distribution tests (Kolmogorov-Smirnov, PSI, or simple binned KL divergence). Alert on statistically significant changes in the features that most influence predictions. Data drift is often the first warning that something upstream — a data source, a definition, a sampling frequency — has changed.
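PSI is the easiest of these to implement from scratch. A sketch on pre-binned counts follows; the conventional thresholds (below 0.1 stable, 0.1 to 0.25 investigate, above 0.25 alert) are rules of thumb to tune per feature, and the binning strategy is up to you.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over matched histogram bins."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # floor to avoid log(0) on empty bins
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

baseline = [100, 200, 400, 200, 100]          # training-time histogram
assert psi(baseline, baseline) < 1e-9          # identical distribution: ~0
assert psi(baseline, [300, 300, 200, 100, 100]) > 0.25  # clear shift: alert
```

Run this per feature on a schedule, weighted toward the features with the most predictive influence, and emit the score as a regular metric so your existing alerting stack handles the rest.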

Concept drift. The relationship between inputs and outputs has changed. Harder to detect, because it requires ground truth or a proxy. For models with eventual labels, monitor rolling performance. For models without, use proxy business metrics (e.g., downstream approval rates, manual override frequency) and run periodic shadow evaluations against a labeled sample.

Performance decay. Model metrics like AUC, F1, MAE, or calibration slipping against a held-out reference. Can happen without drift if the model was overfit to training conditions that no longer hold. Alert on relative, not absolute, thresholds so noise does not swamp signal.
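A relative threshold is a one-liner, sketched here with an illustrative 5 percent guardrail: compare the rolling metric to a reference window instead of a fixed absolute floor.

```python
def performance_alert(reference_auc, rolling_auc, max_relative_drop=0.05):
    """Alert when the rolling metric drops more than an agreed fraction
    below the reference window. The 5 percent default is illustrative."""
    drop = (reference_auc - rolling_auc) / reference_auc
    return drop > max_relative_drop

assert performance_alert(0.86, 0.85) is False  # ~1.2 percent drop: noise
assert performance_alert(0.86, 0.80) is True   # ~7 percent drop: page someone
```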

Cost drift. Per-request cost, GPU utilization, cache hit rate, and retraining frequency. For GenAI and high-volume systems, cost drift is often the first constraint teams hit. A model that is 20 percent less cache-friendly after a deploy can double your infrastructure bill before anyone notices in the product metrics.

Instrument all four. Alert on two: data drift (early warning) and performance decay or a performance proxy (late warning). Review cost and concept drift in the weekly ML operations review, not via pager.

Rollback and canary strategies that actually work

A rollback strategy is not a design document; it is a button that works on a bad day. Two patterns cover almost every production ML deployment we ship.

Canary with automated gates. Serve the new model to one to five percent of traffic. Compare business and model metrics against the incumbent on the same traffic shape, over a fixed evaluation window. If the candidate underperforms on any guardrail, it is rolled back automatically. If it passes, traffic ramps on a schedule — 5, 25, 50, 100 percent — with the same gate at each step.
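The ramp loop above can be sketched in a few lines. The guardrail check is a stub here; in practice it compares candidate and incumbent business and model metrics on the same traffic shape over the evaluation window.

```python
RAMP_STEPS = [5, 25, 50, 100]  # percent of traffic at each gate

def run_canary(passes_guardrails):
    """Ramp traffic step by step; roll back on the first failed gate."""
    served = []
    for pct in RAMP_STEPS:
        served.append(pct)
        if not passes_guardrails(pct):
            return {"promoted": False, "rolled_back_at": pct, "served": served}
    return {"promoted": True, "rolled_back_at": None, "served": served}

# A candidate that regresses once traffic ramps past 25 percent:
result = run_canary(lambda pct: pct <= 25)
assert result == {"promoted": False, "rolled_back_at": 50, "served": [5, 25, 50]}
```

The real version lives in your deployment tooling rather than application code, but the shape is the same: fixed steps, a gate at every step, and rollback as the default path.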

Shadow mode before canary. For high-stakes systems (fraud, pricing, medical), run the new model in shadow for days before exposing it to real traffic. Shadow means the model makes a prediction that is logged but not acted on. Compare shadow predictions to the incumbent's decisions and to the ground truth. Shadow mode catches the kind of failure that only becomes visible on real-world distribution.
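Shadow mode is mostly a logging discipline. A minimal sketch, with all names illustrative: the candidate scores the same request, its output is logged but never acted on, and agreement is reviewed offline.

```python
shadow_log = []

def handle_request(features, incumbent, candidate):
    """Only the incumbent's decision acts; the shadow output is logged."""
    decision = incumbent(features)
    shadow_log.append({
        "features": features,
        "incumbent": decision,
        "shadow": candidate(features),  # recorded, never returned
    })
    return decision

incumbent = lambda f: f["score"] > 0.5   # stand-in models: a threshold shift
candidate = lambda f: f["score"] > 0.6

handle_request({"score": 0.55}, incumbent, candidate)

# Offline review: where do the two models disagree, and who was right?
agreement = sum(e["incumbent"] == e["shadow"] for e in shadow_log) / len(shadow_log)
assert shadow_log[0]["incumbent"] is True and shadow_log[0]["shadow"] is False
assert agreement == 0.0
```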

Two rules we hold firm on. First, the gate criteria are set before the canary starts, not after. If you negotiate the thresholds in the middle of a deploy, you are not rolling back; you are rationalizing. Second, a rollback is never a failure; it is the system working. Teams that treat rollbacks as embarrassments stop canarying, which is the real failure.

Guardrails and automated evaluation tie directly into the broader agent engineering patterns our team ships in building AI agents that work in production — the same mental model applies.

The org changes that make MLOps stick

Tooling gets a team to level 2. Organization gets them to level 3 and beyond. Four structural changes separate teams that sustain production ML from teams that keep rebuilding it.

Clear model ownership. Every model in production has a named owner on the business side and a named owner on the ML side. The business owner signs off on the KPI and the rollout. The ML owner maintains the system. Unowned models rot.

On-call rotation that includes ML. Initially, the ML team takes pages for their models. Over six to twelve months, ownership moves to the platform or service team that operates the business process, with ML as second-line escalation. The transition forces documentation and runbooks into existence.

SLAs and SLOs for models. Not just latency and availability, but also data freshness, prediction staleness, and retraining cadence. Written, agreed with the business owner, reviewed quarterly. SLOs make the cost of poor MLOps legible.

Change control that treats models as production services. Retraining is a deploy. A feature pipeline change is a deploy. Threshold tuning is a deploy. All go through the same approval and canary process. Teams that give models exceptions get incidents.

A real migration: notebook → production in 8 weeks

A representative engagement shape, drawn from patterns we have seen repeatedly across our engagements rather than any single client. A mid-market B2B services company had a churn model in a notebook that had been "in production" via weekly manual scoring for over a year. The team wanted proper automation and monitoring.

Week 1. Discovery and scoping. Mapped the current notebook, the manual scoring cadence, and the downstream CS workflow that consumed the scores. Defined the KPI (net revenue retention on scored accounts) and the operating SLOs. For measurement discipline on the KPI side, we applied the patterns from our measuring AI ROI framework.

Week 2. Data foundation. Refactored the notebook feature logic into a reproducible pipeline backed by the existing warehouse. Added validation checks on input distributions and row counts. Published the training dataset as a versioned artifact.

Weeks 3 to 4. Training pipeline and registry. Wrapped the training logic in an orchestrator DAG, wired the registry, and added offline evaluation gates. The first model promoted was the incumbent notebook model, retrained on the cleaned pipeline as a sanity check.

Weeks 5 to 6. Serving and monitoring. Deployed the scoring as a nightly batch job writing to the CRM with proper retries and alerting. Instrumented data drift on top features and rolling performance checks on a quarterly label window.

Weeks 7 to 8. Canary, runbook, handoff. Ran the new pipeline in shadow for two weeks against the manual process, compared scores on the same accounts, promoted after meeting the gate. Wrote runbooks for three common incidents (feature pipeline failure, drift alert, score distribution change). Handed on-call to the platform team with ML as second-line.

Eight weeks, one model, level 0 to level 2 with monitoring. The team was not at level 4 at the end, and they should not have been. The goal of a migration like this is not full automation. It is a supportable system that the business can rely on and that the ML team can improve without heroics.

Frequently asked questions

Do we need a feature store from day one?

No. For a first model in production, a well-organized offline feature table and a small online feature service are enough. A full feature store becomes valuable once you have three or more models sharing features, or once training-serving skew starts causing silent quality regressions.

What is the minimum viable MLOps stack?

A model registry, a reproducible training pipeline, a deployment mechanism with rollback, basic data and prediction logging, and a monitoring job that alerts on performance decay. You can stand this up in four to six weeks with off-the-shelf tools.

How do we handle models that do not have ground truth labels in real time?

Monitor leading indicators: input drift, prediction drift, and proxy business metrics. Run periodic shadow evaluations against a labeled sample. Set alerts on relative changes in these signals, and treat them as triggers to collect labels or investigate.

Who should be on-call for models in production?

Initially, the team that trained the model. As the system stabilizes, ownership should move to the platform or service team that owns the business process the model supports, with the ML team on second-line escalation. Split ownership with clear escalation paths works better than full transfer.

How do we do canary deployments for models?

Serve the new model to a small traffic slice (typically one to five percent) while the old model handles the rest. Compare business and model metrics on the same traffic shape. Promote on success, roll back instantly on regression. Canary gates should be automated, not human-reviewed in the critical path.

Should MLOps teams reuse existing DevOps tooling?

Yes, where possible. GitHub Actions, GitLab CI, and Argo all work for ML pipelines. Container registries, secret managers, and observability stacks carry over. The places you need ML-specific tooling are model registry, feature store, and ML-aware monitoring — not the CI runner.

Planning AI work this quarter?

Book a 30-minute strategy call and we'll stress-test your use case before you commit. We will audit your MLOps maturity and map the fastest path to a supportable production system.