Where data comes from, what shape it sits in, and how it moves
Based on Chapter 3 of Designing Machine Learning Systems by Chip Huyen (O'Reilly, 2022).
Data Sources
In 2008, Google launched Flu Trends — a model that predicted regional flu outbreaks from search-query data, often beating the CDC by weeks. It was a celebrated success: a brand-new data source had outperformed the gold standard. Five years later, the model had drifted so badly it was overestimating flu cases by a factor of two.
The problem was not the algorithm. The problem was the source. Search-query patterns shifted as Google's autocomplete changed, as users learned what to type, as the news cycle nudged people toward different terms. The model had treated query volume as a stable measurement, but the source itself was being silently rewritten under it.
Every data source has its own physics. Some are reliable but slow. Some are fast but lie. Some are cheap but borrowed. Some are precise but locked behind legal contracts. This chapter is about telling them apart — because if you cannot, you are designing a system that can fail in ways its model cannot detect.
Loading diagram...
Figure 3.1 — Data flow through an application. The request path runs left to right; the analytics path collects, stores, and feeds training; auxiliary sources fan in from below. Adapted from Chapter 3 of DMLS.
Read the diagram top to bottom. The top lane is what the user sees: a request goes into the application, the application asks the ML model, and a prediction comes back. Underneath, two slower lanes are doing all the work that lets that prediction exist at all. The middle lane is the analytics path — transactional writes get extracted, transformed, and loaded into a warehouse where models can actually train on them. The bottom lane is everything the system did not generate itself: what the user typed, the logs your services emit, the contracts you bought.
Chip Huyen organizes data sources into a small number of categories. Each one comes with a different operating economics — different volume, different freshness, different schema discipline, different trust, different cost — and the system around the model has to respect those differences or pay for them later.
User-input data
What the user types, taps, uploads, ticks, photographs. This is often the data you most want, because it is closest to the labels the model needs — a star rating, a yes/no, a corrected translation. It is also the data you have the least of: every row costs a human second of attention, and humans are slow, distracted, and inconsistent.
User-input data is noisy by default. Free-text fields will arrive with typos, mixed languages, accidental pastes, and the occasional essay where you expected one word. Multiple-choice fields will be skipped or clicked-through. Photos will be blurry, sideways, or of the wrong subject entirely. The validation cost per row is the highest of any source you will use, and skipping that validation is the most common way models silently learn the wrong thing.
System-generated data
Logs, click streams, server metrics, error traces, request metadata. Cheap to generate at petabyte scale; expensive to make sense of. The schema is whatever the engineer wrote yesterday, and it changes the next time someone refactors the emitter. Two services logging "the same" event will disagree about what to call the fields and what to put in them.
System-generated data is fast and dirty. It is what you reach for when you need to know what users actually did (clicks, dwell time, abandonment) instead of what they said they did. The trade-off is that you inherit every emitter's bugs — phantom events, missing events, events emitted twice, events emitted in the wrong order. Real-time monitoring depends on this source. So does most of the feature engineering you will read about in Chapter 5.
Internal databases
The transactional tables that actually run the business: orders, accounts, inventory, the catalog. Highest schema discipline of any source — these tables exist because somebody had to bet money on their structure being right. Highest trust, too: if the orders table is wrong, the company loses revenue immediately, so it tends not to be wrong.
The catch is access cost. The transactional database is the system that lets users place orders right now; if your training job scans it, you have just slowed down the order-placement page for everyone else. The whole reason §4 of this chapter exists — OLTP versus OLAP — is to let analytical workloads use the same data without competing with the transactional ones for the same row locks.
First- and third-party data
Data you did not collect yourself. First-party means a partner you have a direct contract with (a payment processor, a logistics provider, a data co-op). Third-party means someone further removed (a broker, a public dataset, a scraped corpus). Buys you breadth fast — demographic enrichment, weather, geocodes, holiday calendars, anything you would never collect yourself.
Provenance is the failure mode. You do not control how this data was collected, how often it is refreshed, or whether the vendor's definition of "unique user" matches yours. When the vendor's pipeline silently breaks, your model drifts and you do not find out from your monitoring; you find out from a quarterly vendor-relations call. Privacy and regulatory exposure travel with this data too, often invisibly.
Pulling the four sources side by side surfaces the trade-offs faster than the prose does:
Source
Volume
Freshness
Schema discipline
Trust
Cost economics
User-input
Low
Slow (per action)
Loose, noisy
Variable per user
High per row (human attention)
System-generated
Very high
Real-time
Engineer-defined
Internal, debuggable
Cheap to emit, costly to clean
Internal DB
Medium
Real-time
Tight (production)
High
High to access without perturbing prod
Third-party
Medium-high
Vendor-defined
Their schema, fixed
Contract-defined
Highest dollars-per-byte
No source dominates the others. A serious ML system will use all four — and the engineering job is making each one carry its own weight without letting the weakest one set the ceiling for the whole system.
The rest of the chapter takes the sources as given and follows the data downstream. §2 looks at the shape it sits in once it lands — row-major versus columnar layouts, and the serialization formats (JSON, CSV, Parquet, and friends) that move bytes between systems. §3 asks how to model a domain — relational, document, graph, time-series — and what each model makes easy or expensive. §4 draws the line between OLTP and OLAP and shows why you can't run both workloads on the same engine. §5 covers how services hand data to each other (synchronous, message bus, batch transfer). §6 closes with the batch-versus-stream divide that runs through every modern data platform.
Each of those sections has at least one sandbox where the trade-off is something you can run rather than something you have to take on trust. By the end of the chapter the data engineering layer should stop being a wall of acronyms and start looking like a small set of real choices, each with measurable consequences.
Data Formats
Once data has a source, it has a body. Some bytes have been written somewhere — into a buffer, onto disk, into a network packet — and the order in which those bytes are arranged is one of the more consequential choices a system makes about itself.
Two questions hide inside this. The first is the in-memory layout question: when a million rows live on disk, are they written one row at a time or one column at a time? The second is the serialization question: when those bytes have to leave one system and enter another, what shape do they wear on the wire? Both choices are invisible from the application code but loud in the latency budget. This section opens both.
Row-major versus column-major
A table is a two-dimensional thing. Storage is a one-dimensional thing — bytes in some order on a disk or in some order in RAM. There are two natural ways to flatten the table.
Row-major writes one row, then the next, then the next. Every row's columns sit next to each other in memory. This is what most people picture when they think of a database table; it is what PostgreSQL and MySQL use by default; it is what struct-of-fields produces when you write it to a file. The win here is locality for whole-row operations: if you want to load a single user's record, every byte you need is contiguous and one read brings it all in.
Column-major does the opposite. It writes all of column A, then all of column B, then all of column C. The bytes for any one column are contiguous; the bytes for any one row are scattered. This is what columnar systems like Parquet, Apache Arrow, and warehouses like Snowflake or BigQuery use under the hood. The win is locality for whole-column operations: aggregating, filtering, or scanning a single column reads only the bytes that column owns and skips everything else.
Loading diagram...
Figure 3.2 — Same twelve cells, two physical orders. The same query — "read column C" — costs four times as many touches under row-major as under column-major, and the gap grows linearly with the number of columns in the table.
The figure makes a small example concrete; the same shape holds at scale. A point lookup — fetch the order with id = 42 — wants every column for one row, and is best served by row-major: one disk seek, one read, done. An aggregation — average price across ten million rows — wants one column for every row, and is best served by column-major: one column-block fetched, all the irrelevant columns skipped entirely.
There is no universal winner, only mismatched workloads. A row-store running an analytical aggregation will read every column it does not need and quietly thrash the cache. A column-store running a single-row lookup will reassemble the row from N separate column blocks. Either of these can be 10–100× slower than the right layout for the same query — which is the whole reason §4 of this chapter (OLTP versus OLAP) exists.
The sandbox below makes the trade-off small enough to feel. Pick a layout, pick a query, and watch which physical bytes get touched.
Serialization formats
In-memory layout is one half of the story. Once bytes have to leave the system that produced them — moved to another service, another process, another machine, another company — they need a serialization format: a self-contained way to write the data down so that something on the other end can read it back. The format you pick controls three things at once: the size of the bytes on the wire, the speed of writing and reading them, and what happens when the schema on either end drifts.
Three formats cover most of what an ML system runs into in practice.
JSON is text. It is universal, debuggable in a terminal, and self-describing — every field carries its name. The trade-off is that self-describing means expensive: every row repeats every field name, integers get written as decimal text, floats get rounded into strings. JSON is the right answer for low-volume, loosely-typed payloads (API responses, config files, log lines). It is the wrong answer for any dataset large enough that you would care about its size.
CSV is also text, but stripped down. No nesting, no field names per row, just rows of comma-separated values with a single header line. Smaller than JSON, parses almost as fast, still readable in a terminal. The hidden tax is that CSV has no real schema — every value is a string until you decide otherwise — so every reader rewrites the type-inference code, often incorrectly. (07053 is a postal code, not a number; the date 02/03/04 could be three different days.)
Parquet is binary, columnar, and schema-enforced. It writes each column in its own block, with type metadata once at the top, and compresses each block separately. The result is roughly an order of magnitude smaller than JSON for the same data, an order of magnitude faster to scan a single column, and the schema is part of the file rather than part of the documentation. The trade-off is opacity: you cannot cat a Parquet file or eyeball it on a server, and a producer that writes broken Parquet will fail loudly far away from where the bug actually is.
The sandbox below runs real serialize-and-deserialize timings across the three formats on synthetic data, so you can see what the trade-offs cost in milliseconds and bytes. The ordering of winners is rarely a surprise; the size of the gap usually is.
Putting both halves of the section side by side:
Format
Encoding
Schema
Typical size
Typical read speed
When to pick it
JSON
Text
None
Largest
Slowest
Small payloads, debuggability matters
CSV
Text
None
Medium-large
Slow
Tabular hand-off, human-readable export
Parquet
Binary
Embedded
Smallest
Fastest (columnar)
Bulk data, analytical scans, archival
The pattern is the same one the layout question raised: the format that is most flexible (JSON) is also the most expensive in bytes and parse time, and the format that is most efficient (Parquet) trades flexibility for scale-friendly opacity. Real systems tend to use JSON at the edges where humans look, CSV in the middle where spreadsheets and ad-hoc tools meet, and Parquet at the bottom where the data warehouse actually lives.
Layout and format together answer how the bytes sit. The next question is how the bytes are organized into a domain — tables versus documents versus graphs versus time series. Different domains want different shapes, and choosing the wrong shape produces query patterns that no amount of layout or format tuning can rescue. §3 walks the four canonical models and shows how each one makes some operations almost free and others structurally expensive.
Data Models
Storage layout decides how bytes sit. A data model decides what those bytes mean — what counts as an entity, what counts as a relationship, and which operations the system makes easy or expensive. Same domain, four very different mental shapes:
Relational — tables and foreign keys.
Document — nested objects, often JSON-shaped.
Graph — nodes and labeled edges.
Time-series — timestamped, append-only points.
These are not interchangeable wrappers around the same data. Each one bakes a particular set of assumptions into the schema, into the query language, and into the migration path you will have to walk five years from now. The model you pick is harder to change than any individual row in it — which is why this section sits between layout and storage engines, and not after them.
Relational
The relational model is the workhorse — old enough to vote, new enough to still run most production systems. Data lives in tables: each table has a fixed set of columns, each row is an instance, and relationships are expressed by foreign keys pointing from one table to another. Constraints (NOT NULL, UNIQUE, CHECK) are enforced by the database itself, not by the application code that reads it.
The big idea is normalization: each fact lives in exactly one place, and queries reassemble the facts they need with JOINs. A users table holds users; an orders table holds orders with a user_id foreign key; an order_items table holds the individual line items. Updating a user's address touches one row in one table, and every order ever placed by that user automatically reflects the new value.
What this gets right: integrity. Every relationship is schema-enforced, every fact is canonical, every query is expressible in the same boring SQL. What it costs: every interesting query is a JOIN, and JOINs across many tables are where relational systems start to struggle at scale.
Document
A document model gives up on tables. Each record is a self-contained, often deeply nested object — typically JSON-shaped — and the database stores those documents as units. MongoDB, CouchDB, DynamoDB, and most so-called NoSQL stores fall into this family. There is no schema in the relational sense; documents in the same collection can have different shapes and the database will not complain.
A user document might carry the user's profile and the array of their last twenty orders and the address book — all in one blob, retrieved with a single read by user id. This is denormalization turned into the default: you trade JOINs for redundancy. The same address may live in five different user documents, and updating it means writing to all five.
What this gets right: read locality and schema flexibility — a single fetch returns everything one screen needs, and the schema can evolve per-document without coordinated migrations. What it costs: integrity is now your problem. The database will not stop you from writing two users whose email field disagrees about its own type, and cross-document queries ("find all users who ever ordered this product") become expensive scans.
Graph
A graph model elevates relationships to the same status as entities. Data lives as nodes (entities) and edges (relationships), and edges carry their own labels and properties. Alice FOLLOWS Bob is a single edge of type FOLLOWS between two User nodes; Bob WROTE Review-7 is a single edge of type WROTE. Neo4j, Neptune, and TigerGraph are graph-native; many relational systems also bolt on graph extensions.
The diagram below shows a small e-commerce-ish slice — users follow each other, place orders that contain products, and write reviews that point back at products.
Loading diagram...
Figure 3.8 — A small graph slice. Each node carries an entity type; each edge carries a relationship type. Walking edges is a constant-cost operation, which is what makes graph models the right shape for relationship-heavy domains (social, recommendations, fraud).
What graphs get right: walking relationships is cheap. "Friends of friends who bought this product" is two edge hops in a graph; in a relational model the same question is a self-join across a table that grows with the square of the user count. What graphs cost: bulk operations are awkward — "average revenue per user this quarter" is structurally harder in a graph than in a single relational table.
Time-series
Time-series data is a special case of relational where the schema is fixed and the workload is append-mostly. Each row is a (timestamp, series_id, value) triple — a CPU usage reading, a stock tick, a sensor sample, a click event. InfluxDB, TimescaleDB, Prometheus, and Apache Druid all fit here. The schema is small, the row count is enormous, and every query has a time predicate ("the last hour", "this week", "between these two dates").
Time-series engines optimize for that shape. Data is often stored column-major within a time bucket, old data ages out into cheaper storage automatically, and the query language is rich about windowing ("5-minute rolling average over the last hour") and weak about everything else. The trade-off is sharp: if your workload genuinely is timestamps and values, this is the right shape; if it has any sustained relational character, you will be reaching back to the relational model within a quarter.
Stepping back, the four models trade off the same axes in different directions:
Model
Best at
Awkward at
Schema discipline
Typical pick when…
Relational
Integrity, ad-hoc queries
Many-table JOINs at scale
Strong, enforced
Default unless you have a reason
Document
Read locality, schema flexibility
Cross-document queries
Weak, per-document
One-screen fetch dominates the workload
Graph
Walking relationships, n-hop
Bulk aggregation
Per-edge typed
Relationships are the product
Time-series
Append, time-windowed reads
Anything not time-bucketed
Tight + minimal
Telemetry, metrics, logs, market data
The decision is rarely one or the other. Most production systems use two or three of these in different layers — a relational store for the source of truth, a document cache for fast reads, a time-series engine for monitoring, occasionally a graph layer for the relationship-heavy parts. The architecture diagram of a serious ML system tends to show one of each.
Choosing a model tells you what the data is. The next question — §4 — is what kind of storage engine the model lives on. The same logical schema can run on an OLTP engine tuned for short, point-shaped transactions or on an OLAP engine tuned for long, scan-shaped analytical queries. The two engines do not coexist gracefully on the same hardware, and the boundary between them is one of the load-bearing seams of every modern data platform.
Storage Engines
A relational table is a logical shape. The physical shape underneath it — how the rows are laid out on disk, what indexes they carry, what kind of locks the engine takes — is decided by the storage engine, and the same logical schema can run on two engines that look almost nothing alike at runtime.
Real systems split storage into two camps. OLTP engines are tuned for the request path: many short transactions, point lookups, single-row updates, low millisecond latency, thousands of concurrent connections. OLAP engines are tuned for the analytics path: a small number of long-running queries that scan millions of rows, often touching only a few columns, and tolerate seconds-to-minutes latency in exchange for not interfering with the live application.
These two camps are not a refinement of each other. They make opposite trade-offs at every level of the stack.
OLTP — online transaction processing
OLTP is what runs the user-facing application. Postgres, MySQL, SQL Server, Oracle. Data lives in row-major storage so that fetching one entire row — "give me everything about order id = 42" — is a single contiguous read. Every column the application might want is sitting next to the others; one disk seek brings the row in.
Layered on top is a B-tree index (or several) per table. An index turns the cost of finding a row by id from "scan all rows" into "walk a tree of depth four" — and once you have that, point lookups become so cheap that the engine can serve thousands per second on a single machine. Updates write a new version of the row and adjust the index in place; the database guarantees these changes are ACID, so a half-applied transaction is never visible to anyone else.
OLTP engines are optimized for throughput of small things. Their workload is millions of one-row touches per second. Their anti-pattern is one query that touches a million rows.
OLAP — online analytical processing
OLAP is what runs the dashboard, the ad-hoc analyst query, the training job. Snowflake, BigQuery, Redshift, ClickHouse, DuckDB. Data lives in column-major storage (§2 has the diagram), so that scanning one column across ten million rows reads only the bytes that column owns. Aggregations like AVG(price) or SUM(revenue) GROUP BY region become sequential, cache-friendly scans.
Indexes are mostly absent. The engine's bet is that you are going to read a lot of rows anyway, so a B-tree would just be overhead. Instead the engine relies on zone maps (per-block min/max statistics so it can skip whole blocks that cannot contain matching rows) and aggressive compression (values in a column are typically similar to each other, so they compress an order of magnitude better than the same values scattered across rows).
OLAP engines are optimized for throughput of big things. Their workload is a small number of long scans. Their anti-pattern is one row at a time, which is exactly what OLTP is for.
Loading diagram...
Figure 3.9 — OLTP and OLAP make opposite trade-offs at every level. The ETL or CDC bridge is the only place they meet — and it is asynchronous on purpose, so the analytical workload cannot perturb the transactional one.
The two engines do not share storage, and the reason is competitive interference. If an analytical query scans ten million rows on the same machine that is serving the order-placement page, every row it reads competes with the OLTP engine for buffer pool, lock manager, CPU, and I/O bandwidth. The dashboard finishes; the user-facing latency p99 doubled for the duration. This is why §1's architecture diagram has a separate transactional database and data warehouse: the ETL (or CDC — change data capture) bridge is what keeps them out of each other's way.
The ETL bridge is asynchronous on purpose. A few minutes to a few hours of lag is a feature: it means the warehouse can absorb a slow query, restage data into a more aggregation-friendly shape, and give the analyst a stable copy to work against. The cost is freshness — your dashboard is always looking at slightly-old data — and most of the time, that is an acceptable trade.
The sandbox below makes the gap concrete. Pick a workload (point lookups versus aggregations versus mixed), pick an engine, and watch the latency distributions diverge.
Side by side, the two engines look like opposites:
Dimension
OLTP
OLAP
Storage layout
Row-major
Column-major
Indexes
Heavy (B-tree per access pattern)
Sparse (zone maps, no row index)
Compression
Modest (per-row)
Aggressive (per-column block)
Concurrency
10³–10⁴ connections
~10¹ long-running queries
Latency target
Single-digit ms
Seconds to minutes
Tx model
ACID, row-level locks
Snapshot reads, no row locks
Best at
Point lookup, single-row update
Wide scan, aggregation, GROUP BY
Worst at
Million-row scan
Single-row fetch by id
Reading the table top to bottom is one way to make the trade-off list memorable. Reading it left to right makes a different point: every row is one decision the engine made, and that decision implies the next one. Once you commit to row-major, B-tree indexes are basically free; once you commit to column-major, they stop helping. The engines are coherent positions, not a la carte menus.
OLTP and OLAP are where the data lives. The next chapter section — §5 — asks how it gets between systems in the first place. Two services talking over a synchronous REST call have very different failure modes from two services talking over a message bus, and the choice between those modes shapes how much load the OLTP/OLAP split has to absorb in practice.
Modes of Dataflow
An ML system is not one process; it is many. The training pipeline runs somewhere; the feature store runs somewhere else; the prediction service runs in a third place; the monitoring system reads from all of them. Sooner or later, two of those processes have to talk to each other — and how they talk is one of the more consequential decisions the architecture makes about itself.
Chip Huyen organizes inter-service dataflow into three modes. They differ in who owns the bytes, who waits on whom, and what happens when something on the other side breaks. The same logical communication — "service A needs data that lives with service B" — produces wildly different operational shapes depending on which mode you pick.
Mode 1 — Through the database
The oldest and quietest mode. Service A writes to a table; service B reads from the same table. There is no direct wire between them; the database is the contract. This is what happens by default when an organization grows from one service to two without anyone explicitly designing the communication.
What this gets right: it is cheap. No new RPC layer, no new message bus, no new failure domain. Both services already knew how to talk to the database. What it costs: every consumer is now coupled to the schema of the producer, implicitly. A migration that adds a column is fine; a migration that renames a column is a multi-team coordination problem. And a slow query on either side shows up as tail latency on the other — they are quietly competing for the same buffer pool, the same locks, the same I/O bandwidth.
Mode 2 — Service-to-service (RPC)
Service A makes an explicit request to service B. REST, gRPC, GraphQL — the protocol varies, but the shape is the same: synchronous, request-and-reply, A blocks until B answers. The database stays out of it; B owns its own storage and exposes a typed API.
What this gets right: the contract is now an API, not a schema. B can change its storage layout without breaking A, as long as the API still returns the same shape. What it costs: failure cascades. If B is slow, A is slow. If B is down, A is down. If A calls B which calls C which calls D, then a 1% failure rate at each hop compounds to a 4% failure rate end-to-end, and a 100ms tail at each hop adds to a 400ms tail at the user. RPC chains are simple to reason about and unforgiving when any link in the chain wobbles.
Mode 3 — Through a message bus
Service A publishes events to a topic; service B subscribes to the topic and processes events at its own pace. Kafka, RabbitMQ, AWS SQS, Google Pub/Sub. The bus sits between the two services and stores the events durably; producers and consumers do not know about each other directly.
What this gets right: decoupling. A does not wait for B; if B is slow, events queue up at the topic and B catches up later. If B is down, the bus retains the events; once B comes back, it resumes from where it left off. New consumers can subscribe to the same topic without the producer changing at all. What it costs: consistency is now eventual. The moment A publishes, B has not yet seen the event; for some window of milliseconds-to-seconds, the system is in an in-between state. For workloads that tolerate that window (analytics, training pipelines, notification fan-out), the bus is a near-perfect fit. For workloads that don't (a payment confirmation; a fraud-check approval) it is the wrong tool.
Loading diagram...
Figure 3.11 — Same two services, three contracts. The choice is rarely about latency in the happy case — it is about which failure mode the system pays for when something breaks.
Reading the three panels at a glance, an obvious question: if the message bus is so much more forgiving under failure, why doesn't every system use it?
The answer is that decoupling is not free. The bus adds a new piece of infrastructure to operate (Kafka clusters are famously not zero-ops). It adds a window where producer and consumer disagree about reality, and any logic that needs an immediate, consistent answer has to be reworked. It changes how you debug — instead of following a single stack trace through an RPC chain, you correlate event ids across consumer logs, sometimes long after the producer moved on.
RPC pays a different price. Every additional service in a synchronous chain compounds the failure rate; a five-hop chain at 99% per-hop reliability is only 95% reliable end-to-end. The flip side is that RPC's failure mode is loud: when something breaks, a 500 lands somewhere, and the trace points at the offender. Bus failures are quieter — events lag, queues grow, monitors that nobody is watching tick over a threshold — and an unmonitored bus is one of the more durable ways to lose data without anyone noticing.
The sandbox below puts the trade-off on the same axes. Pick a mode, dial in a failure rate, and watch what happens to end-to-end latency, error rate, and throughput. The bus wins in some regimes and the chain wins in others — the interesting question is where the cross-over sits.
Putting the three modes side by side:
Mode
Coupling
Latency
Failure mode
Best at
Database
Schema (implicit)
DB-bound
Slow query → tail latency
Cheap default, single-team systems
Service / RPC
API (explicit)
Sum of hops
Cascade — any hop fails ⇒ fail
Hot-path requests needing consistency
Message bus
Topic schema (decoupled)
Bus-buffered
Lag, eventual consistency
Async fan-out, analytics, ML pipelines
The pattern most production systems converge on is to use all three, in different layers. Hot-path requests go over RPC because the user is waiting and the system can't lie. Cross-team analytics flows over a bus because lag is tolerable and decoupling is worth a lot. Long-lived shared state still lives in databases, because that is where the schema actually belongs.
What you don't want — and what badly-designed systems accidentally end up with — is one physical communication carrying two logical contracts at once: an RPC that really wanted to be an event, or a database table that is secretly being used as a queue. Those are the ones that break in production for reasons nobody can quite explain.
§5 covered how services hand individual messages to each other. §6 zooms out one level. The same two services communicating in batches of millions of records — the nightly ETL job, the hourly retraining run — have a very different operating shape from the real-time stream of individual events. The batch-versus-stream divide runs through every modern data platform, and the two extremes trade throughput for latency in opposite directions.
Batch and Stream Processing
Once data has a place to live (§4) and a way to move between systems (§5), two more questions remain. Where in the pipeline does the transformation happen — before the data lands in storage, or after? And how often does the pipeline run — once a night against the day's accumulated data, or continuously against a stream of single events as they arrive?
The first question is ETL versus ELT. The second is batch versus stream. Both questions sound abstract until you watch the trade-offs ripple through latency budgets, infrastructure costs, and the kinds of mistakes a system can quietly make. This section walks both.
ETL — transform before load
ETL stands for Extract → Transform → Load. It is the older pattern, and the one most data-warehouse documentation from the 1990s and 2000s assumes. A pipeline reads from the source (Extract), reshapes the data into the form the warehouse wants (Transform — joining, cleaning, aggregating, deduplicating), then writes the transformed result into the warehouse (Load). The warehouse only ever sees data that has already been put into shape.
The win: data quality at the warehouse boundary. By the time anything lands in the analytical store, the messy edge cases have been resolved upstream. Schema drift, type confusion, duplicate rows, malformed timestamps — all of those are the transform's problem, and downstream consumers can trust the result. The cost: the transform stage is the bottleneck. If the transformation logic changes, you re-run the pipeline. If a downstream team needs a slightly different shape of the same source data, they need a new transform.
ELT — load raw, transform later
ELT flips the last two letters: Extract → Load → Transform. The pipeline still reads from the source, but instead of reshaping the data on the way, it lands the raw rows directly in the warehouse and lets the warehouse handle the transformation in-place — typically as a SQL view or a materialized table that other queries depend on.
The win: flexibility. The warehouse holds the unmodified source data, so different consumers can shape it differently without anyone having to rebuild the pipeline. Adding a new downstream view is a CREATE VIEW statement, not a Spark job rewrite. The cost: the warehouse pays for every transformation, every time it runs. ELT was operationally absurd until columnar warehouses got fast enough and cheap enough that running aggregations on raw data became comparable to running them on pre-shaped data. With Snowflake, BigQuery, and DuckDB, that point arrived around 2018, and ELT has been the new default since.
Loading diagram...
Figure 3.10 — Same five stages. Move the transform box left of Load and you get ETL; move it right of Load and you get ELT. The stages haven't moved — the contract has.
When does each pattern win? ETL is still the right answer when the warehouse is not the bottleneck — the source data is small, the transformation is expensive, and you want the result cached cleanly. It is also the right answer when downstream consumers must not see anything ungroomed: regulated environments, audit pipelines, anywhere a malformed row would do real damage.
ELT wins when the warehouse is fast and elastic, when the shape of the downstream queries is going to change, and when the cost of pre-computing the wrong transform is higher than the cost of running the right transform on demand. Most modern data stacks default to ELT for this reason — it is cheaper to be wrong with raw data than with transformed data, because the raw form is recoverable and the transformed one isn't.
The sandbox below puts the trade-off on a real timer. Pick a transform complexity, pick a warehouse type, pick the pattern, and watch the per-stage timings.
ETL versus ELT decides where the transform sits in the order. The next question is how often any of those stages run.
Batch processing
Batch processing waits for a window of data to accumulate, then processes the whole window in one shot. Nightly ETL jobs that crunch the day's transactions, weekly retraining runs that rebuild the model from the week's new labels, hourly aggregations that update the dashboards — all batch. The window is fixed, the data is bounded, and the job runs once per window.
Batch's lever is throughput. A nightly job that runs for two hours has access to enormous parallelism, can amortize startup costs across millions of rows, and can use the most efficient algorithms for the given size of data because the size is known up front. The price is latency: an event arriving at 9:01 AM does not show up in any output until that night's batch runs. The model trained on yesterday's data does not yet know about today's drift.
Stream processing
Stream processing reverses the trade-off. Events are processed one at a time (or in tiny micro-batches of milliseconds), as they arrive. The pipeline runs continuously and never finishes. Apache Flink, Apache Beam, Kafka Streams, Spark Structured Streaming all sit in this family.
Stream's lever is latency. The output reflects the input within seconds — a fraud-detection model can flag a transaction before it completes; a recommendation system can react to the user's most recent click. The price is operating cost: the pipeline is running 24/7 whether or not events are arriving, the per-event overhead is real, and the engineering machinery for handling windowing, watermarks, and out-of-order events makes stream code structurally harder than the equivalent batch logic.
The sandbox below makes the trade-off measurable. Pick a mode, pick a window, dial in the event rate, and watch throughput and end-to-end latency move in opposite directions.
Putting both questions on the same page:
Question
Option A
Option B
What's traded
Where does transform sit?
ETL (before load)
ELT (after load)
Pipeline rigidity vs. warehouse cost
How often does it run?
Batch
Stream
Latency vs. operating cost
The four corners of those two questions describe most real data architectures.
Batch + ETL is the classic warehouse pipeline: nightly jobs that clean and shape data on the way in. Stable, predictable, slow.
Batch + ELT is the modern default for analytics: dump raw data into the warehouse, let SQL views shape it on demand. Cheap to evolve, expensive to scan.
Stream + ETL is the real-time feature pipeline: an event arrives, gets transformed in-flight, lands in the feature store ready for the model to read. Fast and rigid; schema changes are painful.
Stream + ELT is the recently-viable corner: events land raw in a streaming warehouse (Materialize, RisingWave, Kafka + ksqlDB), and continuously-recomputed views shape them. Powerful, operationally hairy, and the part of the stack that most production systems are still figuring out.
The batch-vs-stream sandbox above treats each window as a single flush event. The next sandbox zooms in on the window itself — what counts as a "window", and how the three canonical types (tumbling, sliding, session) carve up the same event stream in different ways. Every stream-processing engine in production today implements all three; choosing between them is mostly a question of what temporal coupling the downstream consumer needs.
§6 closes the technical tour of the chapter. Sources, formats, models, storage engines, dataflow modes, and the batch-versus-stream divide — each one a small set of real choices with measurable consequences. The summary section that follows pulls the threads together and frames how the rest of the book builds on this layer.
Summary
This chapter walked the data engineering layer end to end. You started with the four data sources and the economics that sets each one apart — user-input is slow and noisy, system-generated is fast and dirty, internal databases are trustworthy but expensive to access, third-party data is flexible but provenance-fragile. You looked at how the bytes sit on disk — row-major versus column-major — and what JSON, CSV, and Parquet trade off when those bytes move between systems. You compared four data models against the same domain and saw why model choice is harder to change than the data inside it. You drew the line between OLTP and OLAP and watched the same logical schema produce wildly different latency profiles depending on which side of the line ran it. You picked between three modes of inter-service dataflow and saw that the choice is rarely about latency in the happy case — it is about which failure mode the system pays for. You closed with the batch-versus-stream divide and the four corners of ETL × ELT × batch × stream that describe most real architectures.
If there is a single thesis, it is that the data infrastructure is not a layer below the model — it is the substrate the model lives in. Every choice in this chapter constrains what the chapters after it can be: training data only exists if you can ship it; features only update if a pipeline updates them; predictions only land if a serving stack reads them. Most production ML failures are downstream of decisions made in this chapter, often years before the model that surfaces them was even trained.
Chapter 4 picks up where this one ends — the actual training data. How is it sampled? How is it labeled? What goes wrong when the labels disagree, the sampling drifts, or the dataset gets too large to look at? Coming soon.