Introduction to ML Systems Design

Business and ML Objectives

In 2009, Netflix awarded $1 million to a team that had improved their movie rating prediction by 10%. The company never deployed it. The model was more accurate at predicting what star rating you would give a film — but Netflix didn't make money from accurate ratings. They made money from hours watched. And nothing about predicting your stars guaranteed you'd actually press play.

This is the gap at the heart of every ML system: the metric the model optimizes is rarely the metric the business cares about. Recommender models chase click-through rate; the business needs retention. Ad-ranking models chase prediction accuracy; the business needs revenue per impression. Even within a single team, what counts as "success" fractures the moment ML metrics meet a P&L statement.

Most ML projects fail not because the model was bad, but because nobody agreed what the model was for.

Figure 2.1 — Where ML teams spend their time. Adapted from Algorithmia's 2020 State of Enterprise ML survey, as cited in DMLS Chapter 2.

You can see the gap reflected in how organizations spend their time. The chart above is the punchline of Algorithmia's 2020 enterprise ML survey: model training — the part that gets all the academic attention — is where teams spend the least of their time. Most of the work is upstream (getting data into shape) and downstream (deployment, monitoring, governance). And if the deployed model doesn't move the business metric, all of that work delivers zero ROI.

There are two failure modes. The first is choosing the wrong ML metric — optimizing accuracy when watch-time is what you sell. The second is optimizing the right ML metric so well that you tank the business one — the recommender that ranks movies you'll rate highly but never finish. Both happen all the time, and both are invisible until the model is live.

The simulator below makes this visible. Pick a model and a user population, then run a session. Watch what happens to the ML metric (NDCG, normalized discounted cumulative gain, a standard ranking-quality score) and the business metric (cumulative watch-time) side by side. They will not always agree.
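To make the two numbers concrete, here is a minimal sketch of how each could be computed for one ranked list. It is not the simulator's code. The catalog values are invented for illustration, and `dcg`, `ndcg`, `expected_watch_minutes`, and the `attention` decay factor are hypothetical names and assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1), ranks starting at 1."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering of the same items."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def expected_watch_minutes(minutes, attention=0.7):
    """Toy business metric. Assumption: the slot at position r is seen with probability attention**r."""
    return sum(m * attention ** pos for pos, m in enumerate(minutes))


# Hypothetical catalog: title -> (predicted star rating used as relevance, minutes actually watched)
items = {
    "prestige_drama": (5, 10),
    "documentary": (4, 20),
    "sitcom": (3, 90),
}

by_rating = sorted(items, key=lambda k: items[k][0], reverse=True)
by_watch = sorted(items, key=lambda k: items[k][1], reverse=True)

for name, order in [("ranked by rating", by_rating), ("ranked by watch-time", by_watch)]:
    rels = [items[k][0] for k in order]
    mins = [items[k][1] for k in order]
    print(name, round(ndcg(rels), 3), round(expected_watch_minutes(mins), 1))
```

With these invented numbers, the rating-sorted list earns a perfect NDCG yet produces fewer expected watch minutes than the watch-sorted list. That is the disagreement described above, reduced to three items.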

Requirements for ML Systems

Once the objective is settled, the next question is what the system around the model has to do. ML systems carry the same requirements as any production software — they have to keep working, they have to scale, someone has to be able to maintain them — plus one that traditional software mostly avoids: they have to adapt to a world that won't sit still. The data they were trained on stops looking like the data they see in production, and they have to keep up.

Chip Huyen organizes these into four requirements: reliability, scalability, maintainability, and adaptability. Each of them looks deceptively familiar. Each of them is harder for ML systems than for the traditional software engineering they inherit the words from.

Reliability

A reliable system keeps doing the right thing under load, under bad input, under partial failures, and under the slow accumulation of weird edge cases nobody anticipated. For traditional software this is hard but tractable — when something breaks, it breaks loudly. A service throws a 500. A test goes red. A null-pointer exception lands in the logs. You can alert on it.

ML systems break quietly. A miscalibrated recommender doesn't crash — it just starts recommending the same five products to everyone. A drifted classifier doesn't throw exceptions — it just labels more and more inputs incorrectly with high confidence. The model still runs. The endpoint still returns 200s. The dashboards still tick. And the business slowly bleeds revenue while everyone congratulates themselves on a stable deployment.

This is why ML reliability is fundamentally a measurement problem. You cannot rely on the system to tell you when it's broken. You have to build the instrumentation that asks the question.
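One way to "ask the question" is to measure the output distribution directly. The sketch below checks whether a handful of items dominates everything the recommender serves in a window of traffic. The function names, the window contents, and the 0.8 threshold are assumptions for illustration, not values from the source.

```python
from collections import Counter


def top_k_share(recommended_ids, k=5):
    """Fraction of all served recommendations taken up by the k most-served items."""
    counts = Counter(recommended_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


def is_collapsed(recommended_ids, k=5, threshold=0.8):
    """Flag a traffic window in which a few items dominate. The threshold is an assumed value."""
    return top_k_share(recommended_ids, k) >= threshold


# Synthetic traffic windows (item IDs are made up).
healthy = [f"item_{i % 50}" for i in range(1000)]   # spread evenly over 50 items
collapsed = [f"item_{i % 5}" for i in range(1000)]  # the same five items for everyone

print(top_k_share(healthy), is_collapsed(healthy))
print(top_k_share(collapsed), is_collapsed(collapsed))
```

The endpoint would return 200s for both windows. Only a check like this one separates them.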

Scalability

Scalability has two faces in ML. The first is the one every web service knows: traffic grows, and a single machine stops being enough. You shard, you replicate, you load-balance. ML adds wrinkles — your request handlers carry hundreds of megabytes of model weights, and you cannot just spin up another container without paying for the GPUs to host them — but the playbook is recognizable.

The second face is unique to ML: the model itself grows. A model that fit comfortably in 8 GB of GPU memory at launch may need to be split across multiple devices a year later, after retraining on ten times the data. The serving stack you designed for the small model won't survive the big one. Scalability for ML is not just "how do I handle more requests" — it's "how do I handle a request when the model has outgrown a single machine."
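The "outgrown a single machine" question is mostly arithmetic. The following back-of-the-envelope sketch assumes 2 bytes per parameter (fp16 or bf16 weights) and a flat 25% overhead for activations and buffers. Both figures, and the parameter counts in the example, are illustrative assumptions.

```python
import math


def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (10**9 bytes). 2 bytes per parameter assumes fp16/bf16."""
    return n_params * bytes_per_param / 1e9


def devices_needed(n_params, device_memory_gb, bytes_per_param=2, overhead=1.25):
    """Minimum device count to hold the model, with an assumed 25% overhead on top of the weights."""
    required_gb = weight_memory_gb(n_params, bytes_per_param) * overhead
    return math.ceil(required_gb / device_memory_gb)


# Hypothetical model sizes on 8 GB devices.
print(devices_needed(3_000_000_000, device_memory_gb=8))    # model at launch
print(devices_needed(30_000_000_000, device_memory_gb=8))   # a 10x larger successor
```

Under these assumptions the launch model fits on one 8 GB device and the larger successor needs ten. The serving stack then has to handle model parallelism, not just replication.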

Maintainability

An ML system is maintained by a wider cast than a regular service. Data engineers own the pipelines that feed it. ML engineers own the training and serving code. Subject-matter experts own the labels. DevOps owns the infrastructure. Product owns the objective. When any of them leaves and the model breaks, somebody has to figure out which of these surfaces is the problem — often without the original author around to explain.

Maintainability for ML is the discipline of leaving behind a system that someone else can understand, debug, and retrain six months from now. It is opposed by every shortcut that feels reasonable in the moment: the notebook with hardcoded paths, the dataset version that lives only on someone's laptop, the magic number in the loss function with no comment. These don't break the system today. They make it unmaintainable tomorrow.
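As a minimal sketch of the alternative to those shortcuts, a training run can carry a small manifest that names its dataset version, its magic numbers, and its owner. Every field name and value below is hypothetical. The bucket path and the hash string are placeholders, not real artifacts.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TrainingManifest:
    dataset_uri: str        # shared location, not a path on someone's laptop
    dataset_sha256: str     # content hash that pins the exact dataset version
    loss_weight: float      # the former "magic number", now named
    loss_weight_reason: str  # the comment that was missing
    seed: int
    owner: str


def fingerprint(manifest: TrainingManifest) -> str:
    """Short, stable ID derived from the configuration, so runs can be compared by config."""
    payload = json.dumps(asdict(manifest), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


manifest = TrainingManifest(
    dataset_uri="s3://example-bucket/ratings/v3/",    # hypothetical path
    dataset_sha256="<hash of the dataset snapshot>",  # placeholder, not a real hash
    loss_weight=0.3,
    loss_weight_reason="down-weights implicit feedback relative to explicit ratings",
    seed=42,
    owner="ml-platform-team",
)

print(fingerprint(manifest))
print(fingerprint(replace(manifest, seed=43)))  # any config change yields a different ID
```

Storing the manifest next to the model artifact gives whoever inherits the system a starting point for retraining it six months later.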

Adaptability

This is the requirement traditional software engineering doesn't really have. A correctly-written sorting function will sort correctly ten years from now. A correctly-trained recommender, deployed a year ago, may already be wrong — because user behavior has shifted, because the catalog has changed, because a global event (a pandemic, a viral trend, a competitor's launch) rewrote the patterns the model learned from.

Adaptability is the system's ability to notice this drift and respond to it: to detect that the world has moved, to retrain on fresher data, to roll the new model out without breaking anything, and to roll back if it's worse. It's the requirement that turns ML systems into living systems — they don't ship and stay shipped. They ship and keep shipping.
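The "notice" step can be sketched with the population stability index (PSI), one common drift score among several. This is an illustrative choice, not a method named by the source. The binning scheme, the `eps` floor, and the 0.25 retraining threshold (a widely quoted rule of thumb) are all assumptions.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between a reference sample and a live sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # degenerate reference sample: everything falls into one bin

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values to the edge bins
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_retraining(score, threshold=0.25):
    """Assumed policy: trigger retraining when drift exceeds a rule-of-thumb threshold."""
    return score > threshold


# Synthetic feature values: training-time reference versus shifted production traffic.
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

print(needs_retraining(psi(reference, reference)))
print(needs_retraining(psi(reference, shifted)))
```

Detection is only the first step. Retraining, rollout, and rollback still need their own machinery, and a score like this one only decides when to start.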

Most of the rest of Designing Machine Learning Systems is the engineering of these four requirements. Chapters 3 through 5 build the data plumbing that makes reliability and adaptability possible. Chapters 6 and 7 cover the model development and deployment patterns that scalability demands. Chapters 8 and 9 are about detecting drift and continually learning — adaptability in production. Chapter 10 is the infrastructure that keeps any of this maintainable.

If the previous section gave you the what of the system (the business outcome), this section gave you the how well (the non-functional bar). The next section turns to the how — the iterative process of actually building one.

Iterative Process for ML Development

There is a temptation to think of building an ML system as a one-time event: gather some data, train a model, deploy it, done. This is a recipe for projects that look brilliant in the demo and disappointing in production. ML systems are not built so much as cultivated. They go through a cycle, and the cycle never finishes.

Chip Huyen breaks the cycle into six stages: project scoping, data engineering, ML model development, deployment, monitoring, and business analysis. The output of each feeds the next. The output of the last feeds back to the first. You do not run this cycle once and stop — you run it again, with what you learned last time, and again, and again, for as long as the system is in production.

Figure 2.2 — The iterative cycle of ML development. The output of each stage feeds the next; the output of the last feeds back to the first.

The arrows are the important part of the diagram. They are not decorative. A lazy choice in scoping ("we'll figure out the metric later") becomes a noisy dataset in data engineering, which becomes an ambiguous loss function in model development, which becomes a model nobody can evaluate in deployment, which becomes a silent failure in monitoring, which becomes a project the business writes off six months later — at which point you go back to scoping and try again, with budget you no longer have.

The early stages cost the least. Later stages cost the most. And the expensive failures in later stages are almost always rooted in cheap mistakes in earlier ones.
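The compounding can be illustrated with a deliberately crude toy model. It assumes each stage passes on only a fraction of the value it receives, and that each loop's lesson is spent on the weakest stage. The quality scores, the multiplicative rule, and the `improvement` step are invented for illustration and do not describe any real project or the logic of any simulator.

```python
STAGES = [
    "scoping", "data_engineering", "model_development",
    "deployment", "monitoring", "business_analysis",
]


def run_cycle(quality):
    """Toy cascade: the outcome is the product of per-stage quality scores in (0, 1]."""
    value = 1.0
    for stage in STAGES:
        value *= quality[stage]
    return value


def iterate(quality, n_iterations=3, improvement=0.1):
    """Run the cycle repeatedly, improving the weakest stage after each loop (capped at 1.0)."""
    quality = dict(quality)  # copy so the caller's dict is not modified
    history = []
    for _ in range(n_iterations):
        history.append(run_cycle(quality))
        weakest = min(STAGES, key=lambda s: quality[s])
        quality[weakest] = min(1.0, quality[weakest] + improvement)
    return history


careful = {stage: 0.9 for stage in STAGES}
lazy_scoping = dict(careful, scoping=0.5)  # "we'll figure out the metric later"

print([round(v, 3) for v in iterate(careful)])
print([round(v, 3) for v in iterate(lazy_scoping)])
```

In this toy, one weak early stage caps the outcome of every loop until it is repaired, however good the later stages are.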

The simulator below makes the cascade tangible. Pick a scoping stance, a data-quality budget, a model complexity, and a deployment strategy. Watch what comes out the other end.

The simulator illustrates two things at once. First, the cascade: a single choice early in the cycle — say, ambitious scoping with a thin data budget — compounds through every subsequent stage. By the time you reach business analysis, the model is unrecoverable, and the business outcome shows it.

Second, the iteration: running the cycle once is not the experiment. The experiment is what happens on the second loop, when you carry forward what you learned. Try setting n_iterations to 3 or 5 and watch how the system either compounds value (good early decisions) or compounds debt (bad ones).

    import math


    def bce(y: int, p: float) -> float:
        # tiny eps avoids log(0) when the model is confident-and-wrong
        eps = 1e-15
        p = max(min(p, 1 - eps), eps)
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))

Mind versus Data

There's an old argument in machine learning about where the gains actually come from. One camp — model-centric — argues that progress comes from better algorithms, smarter architectures, cleverer training tricks: build a better model and the rest follows. The other camp — data-centric — argues the opposite: progress comes from more and better data; the model is a detail, the data is the work.

For most of ML's history, the model camp dominated the headlines. New architectures (CNNs, transformers, diffusion models) made the news. But over the last decade the empirical evidence has tilted hard toward data. The single biggest driver of large-language-model progress has been training corpus size, not architectural innovation. The single biggest fix for a broken production model is, more often than not, fixing the labels — not swapping the architecture.

The implication for system design is uncomfortable for engineers who like elegance: most of your platform's complexity should sit below the model, not at it.

Figure 2.7 — The data stack. Each layer depends on the one below it. The model — the part that gets the most attention — is the smallest layer.

The pyramid encodes a kind of architectural humility. Whatever model you train at the top sits on top of feature engineering, which sits on top of cleaned data, which sits on top of an ETL pipeline, which sits on top of whatever instrumentation captured the events in the first place. Break any lower layer and everything above it inherits the breakage.

A subtle bug in collection (say, a button that fails to fire its impression event 5% of the time) propagates into ETL silently, slips past cleaning and exploration, becomes a feature with a slight bias, and ends up as a model that systematically misranks the affected content. Nothing in the upper layers can recover the missing data. The pyramid is, in effect, the order in which mistakes get harder to fix.
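The size of that bias is easy to work out. The sketch below assumes impressions are dropped at random, clicks are always logged, and only one item's button is affected. The counts and item names are hypothetical.

```python
def observed_ctr(clicks, true_impressions, drop_rate):
    """Click-through rate as the pipeline sees it when a share of impression events never fires."""
    logged_impressions = true_impressions * (1 - drop_rate)
    return clicks / logged_impressions


# Item A sits behind the buggy button; item B is logged correctly.
true_ctr_a = 50 / 1000
true_ctr_b = 52 / 1000

seen_ctr_a = observed_ctr(clicks=50, true_impressions=1000, drop_rate=0.05)
seen_ctr_b = observed_ctr(clicks=52, true_impressions=1000, drop_rate=0.0)

inflation = seen_ctr_a / true_ctr_a - 1
print(f"item A's CTR is inflated by {inflation:.2%}")
print("true order:  B above A ->", true_ctr_b > true_ctr_a)
print("logged order: A above B ->", seen_ctr_a > seen_ctr_b)
```

A 5% loss of impressions inflates the logged rate by 1/0.95 - 1, a little over 5%. That is enough here to flip the order of two items whose true rates are close, and no downstream layer can tell that the denominator is wrong.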

Figure 2.8 — Language model training corpora over time, on a log scale. Each generation has been, roughly, an order of magnitude larger than the last.

Figure 2.8 is the empirical case for the data-centric view. Each new generation of language models has been trained on a corpus roughly an order of magnitude larger than the last. That is not because the architectures stopped improving (they kept improving) but because data was the binding constraint. The architectures only mattered to the extent that they could absorb more data without losing their grip on it.

The lesson for building ML systems is not that algorithms don't matter. It's that almost every team's marginal hour is better spent on the data side than the model side. Better labels. More coverage of edge cases. Cleaner ETL. More instrumentation. The model is downstream of all of it.
