Overview of Machine Learning Systems

What is an ML System?

In 2016, Google replaced its phrase-based translation system with a neural machine translation model. Almost overnight, the quality of Google Translate reportedly improved more than it had in the previous ten years combined. It was a dramatic demonstration of what ML could do, but the model itself was only one piece of a much larger system.

Behind that model was a data pipeline ingesting billions of translated sentence pairs, a training infrastructure distributing computation across thousands of machines, a serving system handling millions of requests per second with strict latency requirements, and a monitoring system detecting when translation quality degraded.

This is the central insight of Designing Machine Learning Systems: the ML algorithm is a small part of the overall system. The data, infrastructure, deployment, monitoring, and human processes around it are where most of the complexity — and most of the failures — live.

ML Systems Overview

An ML system is much more than just the model. It includes data pipelines, feature engineering, training infrastructure, evaluation frameworks, deployment platforms, and monitoring systems. The model itself is often a small fraction of the overall codebase.
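To make the proportions concrete, here is a minimal Python sketch of the components listed above. Everything in it is an illustrative assumption rather than anything from the book: the function names (ingest, featurize, evaluate, serve), the MeanRatioModel and Monitor classes, and the toy rows are all hypothetical. The point is only that the "model" takes a few lines while the surrounding pieces make up the rest.

```python
from dataclasses import dataclass, field
from statistics import mean


def ingest(raw_rows):
    """Data pipeline (hypothetical): drop rows with missing values."""
    return [r for r in raw_rows if r["x"] is not None and r["y"] is not None]


def featurize(row):
    """Feature engineering (hypothetical): raw row -> numeric feature."""
    return float(row["x"])


@dataclass
class MeanRatioModel:
    """The 'model': predicts y = ratio * x, with ratio fit from data."""
    ratio: float = 0.0

    def fit(self, xs, ys):
        self.ratio = mean(ys) / mean(xs)
        return self

    def predict(self, x):
        return self.ratio * x


def evaluate(model, xs, ys):
    """Evaluation: mean absolute error on held-out data."""
    return mean(abs(model.predict(x) - y) for x, y in zip(xs, ys))


@dataclass
class Monitor:
    """Monitoring: count served requests and record out-of-range inputs."""
    lo: float
    hi: float
    served: int = 0
    alerts: list = field(default_factory=list)

    def observe(self, x):
        self.served += 1
        if not (self.lo <= x <= self.hi):
            self.alerts.append(x)


def serve(model, monitor, row):
    """Deployment: featurize a request, record it, return a prediction."""
    x = featurize(row)
    monitor.observe(x)
    return model.predict(x)


# Toy, made-up rows used only to exercise the pieces end to end.
raw = [{"x": 1, "y": 2}, {"x": 2, "y": 4}, {"x": None, "y": 1},
       {"x": 3, "y": 6}, {"x": 4, "y": 8}]
rows = ingest(raw)
train, held_out = rows[:3], rows[3:]
xs = [featurize(r) for r in train]
ys = [r["y"] for r in train]

model = MeanRatioModel().fit(xs, ys)
mae = evaluate(model, [featurize(r) for r in held_out], [r["y"] for r in held_out])

# The monitor treats the training range as "normal" input (an assumption).
monitor = Monitor(lo=min(xs), hi=max(xs))
prediction = serve(model, monitor, {"x": 10})
```

In this sketch the request with x = 10 falls outside the range seen in training, so the monitor records it even though the model still returns a prediction. A production system would replace each function with real infrastructure, but the division of responsibilities stays the same.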

Explore the map below to see how all the components fit together and which chapters cover each one.

When to Use Machine Learning

ML is powerful but not always the right tool. Not every problem benefits from a learned model — some are better solved with simple rules, lookup tables, or human judgment. The key is knowing when each approach fits.
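One hedged way to check whether a learned model earns its place is to compare it with a simple rule on held-out data. The Python sketch below does that under stated assumptions: the transaction-flagging scenario, the fixed limit of 1000, the toy data, and the 0.05 accuracy margin are all made up for illustration and do not come from the source.

```python
def rule_predict(amount, limit=1000.0):
    """Hand-written rule (hypothetical): flag amounts above a fixed limit."""
    return amount > limit


def fit_threshold(amounts, labels):
    """Learn a cutoff: pick the candidate with the highest training accuracy."""
    def accuracy_at(cutoff):
        hits = sum((a > cutoff) == y for a, y in zip(amounts, labels))
        return hits / len(labels)
    return max(sorted(set(amounts)), key=accuracy_at)


def accuracy(predict, amounts, labels):
    """Fraction of examples where the prediction matches the label."""
    return sum(predict(a) == y for a, y in zip(amounts, labels)) / len(labels)


# Toy, made-up data: transaction amounts and whether each should be flagged.
train_amounts = [50.0, 200.0, 400.0, 700.0, 900.0, 1500.0]
train_labels = [False, False, False, True, True, True]
test_amounts = [100.0, 300.0, 800.0, 2000.0]
test_labels = [False, False, True, True]

cutoff = fit_threshold(train_amounts, train_labels)
rule_acc = accuracy(rule_predict, test_amounts, test_labels)
learned_acc = accuracy(lambda a: a > cutoff, test_amounts, test_labels)

# Assumed decision rule: adopt the learned model only if it beats the
# hand-written rule by a margin that justifies the added complexity.
use_ml = learned_acc - rule_acc >= 0.05
```

On this toy data the learned cutoff does better than the fixed rule, so the comparison favors ML. On a problem where the rule already matches the labels, the same check would favor keeping the rule.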
