Mary Fung
essay · March 3, 2026

Zero-to-low sample generation

A handful of real records, a regulator who won't release more, and a system that has to be tested anyway. Most synthetic data writing is for a different problem.

Suppose you're handed a dozen real records — a vendor list, a sample of insurance claims, a slice of patient encounters. The downstream system that needs them will eventually process millions. You cannot use the real data on the production system; privacy review hasn't cleared it. You cannot ask for more; privacy review took weeks to release the dozen.

What do you do?

Most enterprise AI work begins inside that question. Almost none of the synthetic data literature is written for it.

Why the standard answers don't work

The literature on synthetic data is dominated by a handful of generators — CTGAN, TVAE, GaussianCopula, and a growing family of diffusion-based models. They are statistical machines: feed in a million rows of real data, they learn the shape of it, and they produce new rows that look like it. They are good at this. The active research questions are about which scoring metric to publish.

Run any of them on a dozen rows and the math breaks. There isn't enough signal to learn a shape. The "synthetic" output is a ventriloquism of the original twelve — same outliers, same correlations, same names if you squint. You haven't generated anything new. You've laundered the original.

So the trick has to come from somewhere else. The trick comes from asking what the synthetic data is for.

The reframe

Take the vendor list as the canonical example: twelve payees, two countries, three currencies, two payment terms. The downstream pipeline doesn't care that one vendor is named Shaw and another is named Tristar. It cares whether a German vendor on euro Net 30 flows through a different exception path than a US vendor on dollar Net 45. The test is about combinations of categories, not about specific entities.

That observation is the whole insight. The dozen rows are not a miniature copy of the underlying population. They are a vocabulary — a list of the categorical features that show up in the world the system has to handle. The job is to generate plausible records that exercise the combinations of that vocabulary, not to mimic the distribution of who paid whom.

Once you see the data as vocabulary instead of sample, the technique follows. You write down the legal value sets — twenty countries, six currencies, five payment-term buckets — and you sample combinations subject to the schema's constraints. Now your test data covers cases the dozen never showed you. A statistical generator could not have done this; it would faithfully reproduce the two countries it saw and never invent a third, because the third had zero probability in twelve rows.
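
A minimal sketch of that sampler. Every value set and constraint below is invented for illustration; a real program takes them from the schema owner, not from the twelve rows.

```python
import itertools
import random

# Value sets written down from the schema and domain review, not learned
# from the sample. All sets here are assumptions for the sketch.
COUNTRIES = ["US", "DE", "FR", "GB", "JP", "IN"]
PAYMENT_TERMS = ["Net 15", "Net 30", "Net 45", "Net 60", "Due on receipt"]

# Schema constraint, also assumed: which currencies are legal per country.
CURRENCIES_FOR = {
    "US": ["USD"], "DE": ["EUR", "CHF"], "FR": ["EUR"],
    "GB": ["GBP", "EUR"], "JP": ["JPY"], "IN": ["INR", "USD"],
}

def legal_combinations():
    """Enumerate every (country, currency, terms) triple the schema allows."""
    for country in COUNTRIES:
        for currency in CURRENCIES_FOR[country]:
            for terms in PAYMENT_TERMS:
                yield country, currency, terms

def sample_vendors(n, seed=0):
    """Cover the categorical space before repeating any combination."""
    rng = random.Random(seed)
    combos = list(legal_combinations())
    rng.shuffle(combos)
    cycled = itertools.islice(itertools.cycle(combos), n)
    return [
        {"vendor_id": f"V{i:04d}", "country": c, "currency": cur, "terms": t}
        for i, (c, cur, t) in enumerate(cycled)
    ]

print(sample_vendors(3))
```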

This vocabulary-not-sample move is most of the enterprise synthetic data work I do.

Three families, in the order I reach for them

Three techniques cover most of the cases I run into. They are not exclusive — a real program uses all three at different layers — but the order matters. Pick the cheapest one that answers the question you actually have.

Templates with structured noise. A template is a recipe: this field is one of these twenty values, this field is a date inside this range, this field must reference a row in that other table. You sample combinations under those rules. The generator is a domain model, not a statistical one — closer to a Mad Libs book than a neural network. The output is correct by construction: every invoice references a plausible vendor, every payment lands within the contract terms, every record passes the production system's validations. What you get is full coverage of the categorical space. What you don't get is anything about the shape of the underlying distribution. For most enterprise testing — pipeline correctness, agent evaluation, demo environments — that's the right trade. The model isn't being trained on this data. It's being tested against it.
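
A sketch of one such template, against a toy invoice schema; the vendor table, field names, and term buckets are assumptions. Nothing here is learned: every rule is a schema rule, so the output validates by construction.

```python
import random
from datetime import date, timedelta

rng = random.Random(1)

# Hypothetical vendor table; in practice this comes from a vocabulary
# sampler like the one above.
vendors = [
    {"vendor_id": f"V{i:04d}", "terms_days": rng.choice([15, 30, 45])}
    for i in range(20)
]

def make_invoice(inv_id):
    """Instantiate the template once: correct by construction."""
    vendor = rng.choice(vendors)  # foreign key: must reference a real row
    issued = date(2025, 1, 1) + timedelta(days=rng.randrange(365))
    due = issued + timedelta(days=vendor["terms_days"])
    # Payment lands somewhere inside the contract terms.
    paid = issued + timedelta(days=rng.randrange(vendor["terms_days"] + 1))
    return {
        "invoice_id": f"INV{inv_id:06d}",
        "vendor_id": vendor["vendor_id"],
        "issued": issued.isoformat(),
        "due": due.isoformat(),
        "paid": paid.isoformat(),
    }

invoices = [make_invoice(i) for i in range(1_000)]
# The production system's date validation should pass trivially.
assert all(inv["paid"] <= inv["due"] for inv in invoices)
```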

Single-distribution simulators. Sometimes the domain expert can give you one good distribution — a histogram of typical transaction sizes, a frequency table of common error types, a representative day's order volume. You build a simulator that respects that distribution and lets the rest emerge. Workforce composition follows public benchmarks. Transaction amounts follow a plausible log-normal (a curve where most values cluster small with a long tail of large ones — the shape almost every economic quantity takes). Supplier concentration follows the power law most supply chains exhibit (a few vendors account for most of the volume). The honest version of this writes down which distributions came from real data and which came from a domain expert's intuition. The dishonest version publishes the data without disclosure. I have watched the dishonest version cause real harm.
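
A sketch of the honest version. The log-normal and power-law parameters below are placeholders, not fits, and the provenance ledger says so explicitly.

```python
import random

rng = random.Random(2)

# The honest version keeps a provenance ledger next to the generator.
# Everything marked "assumed" is a placeholder, not a fit to real data.
PROVENANCE = {
    "amount": "log-normal; shape from economic convention, parameters assumed",
    "supplier": "power law over rank; domain intuition, exponent assumed",
}

def transaction_amount():
    # Log-normal: most values cluster small, with a long tail of large ones.
    return round(rng.lognormvariate(4.0, 1.2), 2)  # mu, sigma assumed

def supplier_rank(n_suppliers=50, alpha=1.3):
    # Power law over supplier rank: a few vendors carry most of the volume.
    weights = [rank ** -alpha for rank in range(1, n_suppliers + 1)]
    return rng.choices(range(1, n_suppliers + 1), weights=weights, k=1)[0]

txns = [
    {"supplier": supplier_rank(), "amount": transaction_amount()}
    for _ in range(10_000)
]
```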

Agent-based generation. When the data is a byproduct of behavior — support tickets, audit trails, sales calls — the generator is a population of agents rather than a statistical model. Agent A wants a refund. Agent B is a service representative with a script and a quota. They have a conversation. The ticket falls out as a side effect. This is the most expensive technique and the most realistic when it works. It is also the most likely to produce convincing-looking nonsense, because the population is only as real as the persona library, and persona libraries are notoriously brittle. Use it last and validate it adversarially.
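
A rule-based sketch of the shape, with a deliberately tiny persona library. Real versions put a language model behind each persona, which is exactly where the brittleness enters; everything below is invented for illustration.

```python
import random
from dataclasses import dataclass, field

rng = random.Random(3)

# Hypothetical persona library, the weakest link, as noted above.
CUSTOMER_GOALS = ["refund", "cancellation", "billing question"]
REP_SCRIPT = {
    "refund": "offer store credit first",
    "cancellation": "offer a retention discount",
    "billing question": "explain the charge",
}

@dataclass
class Ticket:
    goal: str
    turns: list = field(default_factory=list)
    resolved: bool = False

def run_conversation(max_turns=6):
    """Two scripted agents talk; the ticket falls out as a side effect."""
    ticket = Ticket(goal=rng.choice(CUSTOMER_GOALS))
    for _ in range(max_turns):
        ticket.turns.append(("customer", f"I'm calling about my {ticket.goal}."))
        ticket.turns.append(("rep", REP_SCRIPT[ticket.goal]))
        if rng.random() < 0.4:  # customer accepts the rep's offer this turn
            ticket.resolved = True
            break
    return ticket

tickets = [run_conversation() for _ in range(100)]
```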

The contract is the deliverable

If a director asks me whether a synthetic dataset is "good," I refuse to answer the question. The right question is: good for what? A dataset that exercises a reconciliation pipeline is not a dataset suitable for training a fraud model. A dataset built to demo a tariff scenario to a board is not a dataset suitable for benchmark publication. Conflating those categories is the most common failure mode I see, and it is rarely caught because the synthetic data looks fine.

So the deliverable is not the data. The deliverable is the contract: this generator is appropriate for X, inappropriate for Y, here is the reasoning, here is who to ask if you disagree. The data is a side effect of the contract. Most of what I do day-to-day is write that contract. The rest is engineering.
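
One hypothetical shape for that contract, as a machine-readable artifact rather than a wiki page; the field names and values are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataContract:
    """Illustrative contract schema: fit-for-purpose claims travel with the data."""
    generator: str
    appropriate_for: tuple
    inappropriate_for: tuple
    reasoning: str
    owner: str  # who to ask if you disagree

vendor_contract = DataContract(
    generator="vendor-vocabulary-sampler",
    appropriate_for=("pipeline correctness tests", "demo environments"),
    inappropriate_for=("fraud-model training", "benchmark publication"),
    reasoning="Categorical coverage only; no distributional fidelity claimed.",
    owner="synthetic-data@example.com",  # placeholder contact
)
```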

This is also why most synthetic data programs fail. Not because the generators are bad. Because the contract was implicit, the downstream consumer made assumptions the upstream team never intended, and by the time anyone noticed, a model was in production.

What I still don't know

I'll state the intuitions plainly rather than hide behind humility.

The unifying observation is the boring one. The interesting work in synthetic data is not at the generator. It is at the boundary between what the generator produces and what the consumer assumes about it. Programs that respect that boundary scale. Programs that don't respect it cause damage quietly and never show up on a dashboard.
