Mary Fung

A dataset is a file. A data product is a contract about what questions it can answer, who owns the answers, and what happens when those answers change.

The same CSV gets used to train a fraud model on Tuesday and to brief a board on Thursday. By the next quarter, no one remembers which version was used for which, the model is producing predictions the business doesn't trust, and the analyst who built the briefing has moved teams. The CSV is fine. It is the same bytes it was. The problem is that nobody wrote down what it was for, so it ended up being used for everything, which is the same as being used for nothing.

This is the most common failure mode I see in enterprise data work. It does not happen because anyone is careless. It happens because the field treats data as the deliverable when the deliverable is actually a contract about the data.

The distinction

A dataset is bytes. A schema, some rows, maybe a README. It tells you what is in the file. It does not tell you what the file is for.

A data product is a dataset plus an explicit contract that specifies:

What questions this data can answer — the legitimate use cases
What questions it should refuse to answer — the misuses to be defended against
Who owns the answers — the team accountable when something goes wrong
How fresh the data is — the cadence of updates and the staleness bound
What guarantees the schema makes — which fields will not change without notice, which might
Who the consumers are — and what they have agreed to assume

Notice that none of those are properties of the bytes. They are properties of the relationship between the producer and the consumer of the data. The data itself is mute on all of them.

Why the distinction matters more in the AI era

In a pre-AI world, the consumer of an enterprise dataset was usually a person — an analyst, a controller, a business reviewer — who would look at the data, notice if something seemed wrong, and ask. Bad data caused noisy meetings. Embarrassing, but bounded.

In an AI world, the consumer is increasingly a model. Models do not notice that something seems wrong. They produce predictions indifferently across the entire distribution they were trained on, including the parts of the distribution that were never meant to be in scope. A schema change that an analyst would have caught silently breaks a model's calibration. A staleness bound that a controller would have honored gets violated by a pipeline that doesn't know one exists.

The contract was always implicit. It used to be implicit because humans were holding it in their heads. Now the consumers don't have heads, and the contract has to be written down or it doesn't exist.

What the contract actually contains

I write data product contracts as plain Markdown documents that live next to the generation code. They tend to look like this in shape:

Purpose. One paragraph. This product exists to support X. It is not appropriate for Y. The Y matters as much as the X.
Producer. A named team and a named on-call rotation. Not a shared inbox.
Consumers. A list of known downstream systems and the team responsible for each. New consumers must register before they rely on the product.
Schema guarantees. A versioned spec of which fields are stable, which may change with notice, and which may change at any time. Stability is the deliverable, not the column count.
Freshness and SLA. When the data updates, how late an update can be before consumers are notified, and what the consumer should do during a stale window.
Known limitations. The biases, the gaps, the populations the data underrepresents. Stated in plain language. If a downstream consumer is going to make a decision the data won't support, this is where they get warned.
Change protocol. The process by which any of the above changes. Who has to be told. How much notice. What kind of change triggers a version bump.

The document is two to four pages. It is not exotic. The exotic part is that almost no one writes it.

What goes wrong without the contract

Three failure modes, in roughly the order I see them.

Silent scope creep. A dataset built for one purpose gets quietly adopted for another. Both purposes work for a while. Then a producer makes a change appropriate for the original purpose that breaks the secondary one. No one knows the secondary consumer existed, so no one warns them. The breakage shows up downstream as a model whose predictions started drifting on a Tuesday for no apparent reason.

Schema brittleness as politeness. Producers refuse to evolve the schema because they have no idea who depends on what. The dataset becomes a museum piece — internally inconsistent, bloated with deprecated fields, painful to add to. The team that owns it is afraid to touch it, and the team that uses it is increasingly afraid to trust it.

Decision laundering. A consumer makes a high-stakes decision based on the data, the decision goes wrong, and the producer correctly points out that the data was never meant to support that decision. The producer is right. The consumer is also right that no one told them. The contract was never written, so no one is accountable, and the institution learns nothing.

The lightest-weight version that works

Nobody is going to read a forty-page data governance manual. The version of contract-writing that survives in the wild is the shortest one that is still true:

One page per data product.
Living document in version control next to the generation code.
Reviewed when the schema changes or a new consumer registers.
Owned by the producing team, not by a central data-governance function.
Linked from the dataset itself, so any consumer who finds the data finds the contract.

That's the shape. It is not enough to solve every problem. It is enough to make the problems visible, which is most of the improvement.

What I still don't know

How to enforce the contract automatically. Schema linters and freshness alerts catch the easy cases. The harder cases — consumer X started using this product for purpose Y it isn't meant to support — are not solved by tooling I know of.
How to retire data products gracefully. Most of them outlive their usefulness and become zombie assets. I do not have a clean pattern for sunsetting one without breaking consumers I didn't know existed.
How small the contract can get before it stops working. I have seen one-page contracts succeed and ten-page contracts fail. The variable is not length. I think it is whether the producer actually believed the contract before they wrote it down.

The unifying observation is the boring one. The interesting work in enterprise data is not at the dataset. It is at the boundary between what the dataset claims and what its consumers assume. The contract makes that boundary explicit. Programs that treat the contract as the deliverable scale. Programs that treat the data as the deliverable scale until they don't, and then they break in ways that take years to undo.