Mary Fung
note · April 16, 2026

Synthetic data quality is not a number

There's no general score, the absence is structural, and the test that matters is the one you run on your own use case.

A question I get often: what's the standard metric for synthetic data quality? The honest answer is more interesting than people expect.

There isn't one. The field has converged on a vocabulary, not a number — a three-axis frame of fidelity (does the synthetic data match the source distribution), utility (does it work for the downstream task), and privacy (does it leak information about the source). Fidelity and privacy are reasonably portable. Utility isn't, and that's the part that breaks any attempt at a single cross-domain score.
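
To make the portable axes concrete, here is a minimal Python sketch, assuming tabular numeric data: per-column fidelity via a two-sample KS statistic, and a crude memorization proxy via nearest-neighbor distances. The function names are mine, the choices are illustrative rather than a standard, and real fidelity work also has to look at joint structure, not just marginals.

# Minimal sketch of the two "portable" axes on tabular numeric data.
# Metric choices are illustrative, not a standard.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.neighbors import NearestNeighbors

def fidelity_report(real: np.ndarray, synthetic: np.ndarray) -> list[float]:
    """Per-column two-sample KS statistic: 0 = identical marginals, 1 = disjoint."""
    return [ks_2samp(real[:, j], synthetic[:, j]).statistic
            for j in range(real.shape[1])]

def privacy_proxy(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Crude memorization check: mean synthetic-to-real nearest-neighbor distance,
    divided by the real-to-real baseline. Values near 0 suggest copied records."""
    nn_real = NearestNeighbors(n_neighbors=2).fit(real)
    real_to_real = nn_real.kneighbors(real)[0][:, 1].mean()  # column 1 skips the self-match
    syn_to_real = nn_real.kneighbors(synthetic, n_neighbors=1)[0][:, 0].mean()
    return syn_to_real / real_to_real

Even this toy version shows the asymmetry: both functions run the same way on any table, which is part of why fidelity and privacy travel better than utility does.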

The vendor "quality scores" — the ones with proprietary aggregate numbers in the marketing — are doing fidelity work and calling it quality. Inside a vendor, comparing two of their own synthetic datasets, the score is meaningful. Across vendors, it isn't. The docs say so themselves.

A synthetic dataset that is excellent for training a fraud model can be useless for stress-testing a credit risk policy. Quality is local. A general score is averaging across incompatible questions and producing a number that is precise about nothing.

So when someone offers you "high-quality synthetic data," the only useful response is: which axis, against what use case, with what holdout? If they hand you a headline number and stop, you have a vendor report. If they can walk you through the three axes against your specific task — and name what the eval doesn't catch — you have an evaluation.

Picking those metrics is itself the project. Which fidelity tests fit the data shape. Which utility setup mirrors the real workflow. Which privacy attacks match the actual threat model. That selection isn't a step before the work; on any serious problem it is weeks of the work.
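
For the utility leg, the usual shape of the check is train-on-synthetic, test-on-real, compared against a train-on-real baseline on the same holdout. A minimal sketch, assuming a binary classification task; the model and metric below are placeholders for whatever your actual workflow uses:

# Sketch of the "local" axis: train on synthetic, test on a real holdout,
# against a train-on-real baseline. Model and metric are stand-ins for
# whatever the real workflow uses.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def utility_gap(X_syn, y_syn, X_real_train, y_real_train, X_holdout, y_holdout):
    """Holdout AUC when trained on real minus holdout AUC when trained on synthetic.
    A small gap means the synthetic data carries what this task needs; a large gap means it doesn't."""
    real_model = GradientBoostingClassifier().fit(X_real_train, y_real_train)
    syn_model = GradientBoostingClassifier().fit(X_syn, y_syn)
    auc_real = roc_auc_score(y_holdout, real_model.predict_proba(X_holdout)[:, 1])
    auc_syn = roc_auc_score(y_holdout, syn_model.predict_proba(X_holdout)[:, 1])
    return auc_real - auc_syn

The number that comes out is only meaningful for this task, this model, and this holdout, which is the whole point.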

The score is what they sold. The recomputation on your own use case is what you'd buy.
