Mary Fung
essay · April 7, 2026

Dataset vs data product

A dataset is a file. A data product is a contract about what questions it can answer, who owns the answers, and what happens when those answers change.

The same CSV gets used to train a fraud model on Tuesday and to brief a board on Thursday. By the next quarter, no one remembers which version was used for which, the model is producing predictions the business doesn't trust, and the analyst who built the briefing has moved teams. The CSV is fine. It is the same bytes it was. The problem is that nobody wrote down what it was for, so it ended up being used for everything, which is the same as being used for nothing.

This is the most common failure mode I see in enterprise data work. It does not happen because anyone is careless. It happens because the field treats data as the deliverable when the deliverable is actually a contract about the data.

The distinction

A dataset is bytes. A schema, some rows, maybe a README. It tells you what is in the file. It does not tell you what the file is for.

A data product is a dataset plus an explicit contract that specifies:

What questions the data is meant to answer, and which uses are out of scope.
Who owns the data and answers for it when it changes.
How fresh the data is guaranteed to be (its staleness bound).
How the schema is allowed to evolve, and with how much warning.
Who the known consumers are and what they depend on.

Notice that none of those are properties of the bytes. They are properties of the relationship between the producer and the consumer of the data. The data itself is mute on all of them.

Why the distinction matters more in the AI era

In a pre-AI world, the consumer of an enterprise dataset was usually a person — an analyst, a controller, a business reviewer — who would look at the data, notice if something seemed wrong, and ask. Bad data caused noisy meetings. Embarrassing, but bounded.

In an AI world, the consumer is increasingly a model. Models do not notice that something seems wrong. They produce predictions indifferently across the entire distribution they were trained on, including the parts of the distribution that were never meant to be in scope. A schema change that an analyst would have caught silently breaks a model's calibration. A staleness bound that a controller would have honored gets violated by a pipeline that doesn't know one exists.

The contract was always implicit. It used to be implicit because humans were holding it in their heads. Now the consumers don't have heads, and the contract has to be written down or it doesn't exist.

What the contract actually contains

I write data product contracts as plain Markdown documents that live next to the generation code. They tend to look like this in shape: a purpose section with explicit out-of-scope uses, an owner, a freshness guarantee, a schema change policy, and a list of known consumers.

The document is two to four pages. It is not exotic. The exotic part is that almost no one writes it.
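The same shape can also be captured as structured data, so that pipelines can read the contract as well as people. A sketch in Python; the field names and example values are my assumption, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    """A minimal data product contract. Field names are illustrative."""
    name: str
    purpose: str                 # the questions this data is meant to answer
    out_of_scope: list[str]      # uses the producer explicitly disclaims
    owner: str                   # who answers for the data when it changes
    staleness_bound_hours: int   # how old the data is allowed to be
    schema_change_policy: str    # how fields may be added or removed
    consumers: list[str] = field(default_factory=list)  # known downstream users

# A hypothetical contract for the fraud-model example.
contract = DataContract(
    name="transactions_daily",
    purpose="Train the fraud model on settled card transactions.",
    out_of_scope=["board-level revenue reporting"],
    owner="payments-data@example.com",
    staleness_bound_hours=24,
    schema_change_policy="additive only; removals need 30 days notice",
    consumers=["fraud-model-training"],
)
```

Whether it lives as Markdown or as a record like this matters less than that every field is filled in and versioned next to the code.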

What goes wrong without the contract

Three failure modes, in roughly the order I see them.

Silent scope creep. A dataset built for one purpose gets quietly adopted for another. Both purposes work for a while. Then a producer makes a change appropriate for the original purpose that breaks the secondary one. No one knows the secondary consumer existed, so no one warns them. The breakage shows up downstream as a model whose predictions started drifting on a Tuesday for no apparent reason.
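One way a written contract catches scope creep early: a reader checks itself against the contract's consumer list before touching the data, so a secondary consumer announces itself instead of staying invisible. A sketch, with an assumed registry shape:

```python
# Known consumers, copied from the written contract (illustrative).
KNOWN_CONSUMERS = {"fraud-model-training"}

def register_read(consumer: str) -> bool:
    """Return True if this consumer is covered by the contract.

    An unlisted consumer gets a loud warning instead of silently
    becoming a dependency no one knows about.
    """
    if consumer not in KNOWN_CONSUMERS:
        print(f"WARNING: '{consumer}' is not a known consumer of this dataset; "
              "this use is outside the written contract")
        return False
    return True
```

The warning does not block the secondary use; it makes it visible, which is what the contract is for.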

Schema brittleness as politeness. Producers refuse to evolve the schema because they have no idea who depends on what. The dataset becomes a museum piece — internally inconsistent, bloated with deprecated fields, painful to add to. The team that owns it is afraid to touch it, and the team that uses it is increasingly afraid to trust it.

Decision laundering. A consumer makes a high-stakes decision based on the data, the decision goes wrong, and the producer correctly points out that the data was never meant to support that decision. The producer is right. The consumer is also right that no one told them. The contract was never written, so no one is accountable, and the institution learns nothing.

The lightest-weight version that works

Nobody is going to read a forty-page data governance manual. The version of contract-writing that survives in the wild is the shortest one that is still true:

One page, living next to the code that generates the data.
The purpose, and what is explicitly out of scope.
The owner.
The staleness bound.
The schema change policy.
The known consumers.

That's the shape. It is not enough to solve every problem. It is enough to make the problems visible, which is most of the improvement.

What I still don't know

The unifying observation is the boring one. The interesting work in enterprise data is not inside the dataset. It is at the boundary between what the dataset claims and what its consumers assume. The contract makes that boundary explicit. Programs that treat the contract as the deliverable scale. Programs that treat the data as the deliverable scale until they don't, and then they break in ways that take years to undo.
