Mary Fung
essay · March 24, 2026

What 20 correlations taught me about consumer health data

Twenty typed correlations, a sample-size gate per phase, and one confounder that broke half of them. The half that survived are the only ones I trust to surface to a user.

Most consumer health apps work the same way under the hood. Take a collection of signals — heart rate, steps, sleep, mood entries, food logs — run pairwise correlations across them, surface the statistically significant ones to the user. We noticed your sleep is lower on days you exercise less. It is a tempting product loop. Personalized, data-driven, and entirely cheap to compute.

It is also, most of the time, wrong.

I've spent the last year on the opposite approach in Fyll: deliberately small, deliberately disciplined, and conditioned on the variable most of the field ignores. Twenty correlations. Each with a sample-size gate. Each conditioned on cycle phase. By the end of the work, ten of the original twenty survived. The other ten were artifacts of timing or noise.

This essay is about what the surviving ten have in common, and what the discarded ten taught me.

Why the standard approach fails

If you take twenty pairs of variables and run a correlation test on three months of self-reported data from a user, odds are good that at least one "significant" relationship will appear by chance. This is not a hot take. It is what the math says. The conventional cutoff for statistical significance, p < 0.05, means that one in twenty pairs with no real relationship will clear the bar through luck alone, on average. Run a hundred such pairs and you should expect about five false positives.
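To make the arithmetic concrete, here is a small simulation. None of this is Fyll's code and the signal counts are stand-ins, but it shows how reliably pure noise produces "insights" under an uncorrected p < 0.05 cutoff.

```python
# Sketch: how many "significant" correlations appear when every pair of
# signals is, by construction, unrelated noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_days = 90       # three months of daily logs
n_signals = 15    # heart rate, steps, sleep, mood, ...

signals = rng.normal(size=(n_signals, n_days))  # independent by construction

false_positives = 0
n_pairs = 0
for i in range(n_signals):
    for j in range(i + 1, n_signals):
        r, p = stats.pearsonr(signals[i], signals[j])
        n_pairs += 1
        if p < 0.05:
            false_positives += 1

print(f"{false_positives} of {n_pairs} unrelated pairs look 'significant' at p < 0.05")
# Roughly 5% of 105 pairs, i.e. about five spurious "insights" from nothing.
```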

Consumer health apps tend to run more than a hundred pairs. They correlate everything against everything, surface whatever survives, and call the surviving ones "insights." Most of those insights are the statistical equivalent of a lottery winner being asked for stock tips.

The standard fix from statistics is false discovery rate correction: adjustments that tighten the significance threshold when many hypotheses have been tested, so that only a small, controlled fraction of the findings you accept are expected to be false. These corrections help, yet they are rarely applied in consumer apps. And even where they are, they don't address the deeper problem, which is that the correlations being tested were never well-typed to begin with.
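For the curious, this is roughly what the most common such correction, Benjamini–Hochberg, looks like. It is a sketch of the textbook procedure, not anything Fyll ships.

```python
# Minimal Benjamini-Hochberg sketch: given p-values from many tested pairs,
# keep only those below an adaptive threshold so the expected share of
# false discoveries among the kept findings stays at or below q.
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of which hypotheses survive FDR control at level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest rank k with p_(k) <= (k/m) * q; keep everything up to it.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    keep = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])
        keep[order[: cutoff + 1]] = True
    return keep

# Example: twenty tested correlations, most of them noise.
p_vals = [0.001, 0.004, 0.03, 0.04, 0.2, 0.5] + [0.7] * 14
print(benjamini_hochberg(p_vals))  # only the strongest p-values survive
```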

What "typed correlations" means

A typed correlation is one where you've stated, before looking at the data, what biological mechanism would have to be true for the relationship to make sense. Sleep duration and recovery — typed, because the literature has a clear story for how rest restores. Steps and resting heart rate — typed, because the cardiovascular mechanism is well-characterized. Caffeine timing and sleep onset — typed, because the half-life of caffeine in the bloodstream is known.

Cross-correlating every signal against every other signal, with no theory connecting them, is untyped. It is fishing. Most of what shows up will be coincidence.

I started with twenty hypotheses, each of which had a plausible mechanism I could write down in a sentence. Twenty is not a magic number. It was the most I could think through carefully without losing track of what each one was supposed to mean.
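If it helps to see the idea as a structure rather than a sentence, the sketch below is roughly what "typed" means in practice: the mechanism is written down before any data is touched, and nothing outside the list is ever tested. The field names are hypothetical, not Fyll's schema.

```python
# Illustrative only: a correlation is "typed" when it cannot be computed
# without a declared mechanism and an expected direction, both stated up front.
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedCorrelation:
    signal_a: str
    signal_b: str
    mechanism: str           # the one-sentence story written before seeing data
    expected_direction: int  # +1 or -1, also declared in advance

HYPOTHESES = [
    TypedCorrelation("sleep_duration", "next_day_fatigue",
                     "Less sleep leaves less time for restorative processes.", -1),
    TypedCorrelation("caffeine_timing", "sleep_onset_delay",
                     "Caffeine's multi-hour half-life delays sleep onset.", +1),
    TypedCorrelation("daily_steps", "resting_heart_rate",
                     "Sustained aerobic load lowers resting heart rate.", -1),
]
# Anything not in this list is never tested, however tempting the data looks.
```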

Sample-size gates

A correlation between two variables computed across, say, fifteen days of data is not a finding. It is a guess that happens to come with a confidence interval the user shouldn't take seriously. The math gives you a coefficient regardless of how thin the evidence is — that is its job. The judgment about whether to trust the coefficient sits outside the math.
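To put a number on how thin fifteen days is, here is a back-of-the-envelope check using the standard Fisher transform for a correlation's confidence interval. The data is simulated; the width of the interval is the point.

```python
# Sketch: the math happily returns a coefficient on fifteen days of data;
# the confidence interval is what reveals how little that coefficient means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 15
x = rng.normal(size=n)
y = rng.normal(size=n)

r, _ = stats.pearsonr(x, y)

# Fisher z-transform interval: its width depends almost entirely on n.
z = np.arctanh(r)
se = 1.0 / np.sqrt(n - 3)
low, high = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"r = {r:.2f}, 95% CI roughly {low:.2f} to {high:.2f}")
# With n = 15 the interval is typically a unit wide or close to it:
# a coefficient, yes, but not one worth acting on.
```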

For each of the twenty correlations, I set a minimum number of paired observations per cycle phase before the correlation would be computed at all. Below that gate, the app shows nothing. The user sees a count of how many more days are needed. This is unfashionable in product circles because empty states are seen as failures of engagement. I think they are the most honest screen in the app.

The gate values are different for different correlations — some mechanisms produce larger effects and need fewer observations to detect, others are subtle and need many more. The unifying principle is that no correlation surfaces below the gate, ever, regardless of how interesting the user might find it. The cost of a wrong answer is a user who changes a real behavior on the basis of noise. That cost is asymmetric and it falls on the user, not on the app.
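A minimal sketch of the gate, assuming each hypothesis carries its own threshold. The numbers and names are invented for illustration, not the ones Fyll uses.

```python
# Per-correlation, per-phase sample-size gate: below the gate, nothing is
# computed and the only thing shown is how many more days are needed.
from scipy import stats

MIN_PAIRED_OBS = {
    ("sleep_duration", "next_day_fatigue"): 21,  # larger effect, smaller gate
    ("hydration", "perceived_energy"): 45,       # subtler effect, larger gate
}

def gated_correlation(pairs, key):
    """pairs: list of (x, y) observations already restricted to one cycle phase."""
    needed = MIN_PAIRED_OBS[key]
    if len(pairs) < needed:
        # Below the gate: no coefficient, no teaser, only the remaining count.
        return {"status": "waiting", "days_remaining": needed - len(pairs)}
    xs, ys = zip(*pairs)
    r, p = stats.pearsonr(xs, ys)
    return {"status": "computed", "r": r, "p": p}
```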

The confounder that broke ten of them

The original cycle essay makes the case in detail. The short version: hormonal phase shifts resting heart rate, body temperature, sleep architecture, glucose response, mood, appetite, and inflammation in systematic ways. If you pool data across phases, those shifts contaminate any correlation that touches one of those variables.

When I conditioned each of the twenty correlations on cycle phase, roughly half lost statistical signal entirely. The "relationship" between, for instance, certain food patterns and energy levels turned out to be a function of where in the cycle the user happened to log — the food pattern wasn't doing the work, the luteal-phase fatigue was. Once I stratified by phase, the food–energy correlation collapsed.
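In code, the conditioning step is nothing more exotic than a group-by. The sketch below assumes a daily table with a phase label; the column names are illustrative.

```python
# Compute the correlation within each cycle phase separately instead of pooling.
import pandas as pd
from scipy import stats

def correlation_by_phase(df, x, y, min_n=20):
    """df has one row per day with columns x, y, and 'phase'."""
    results = {}
    for phase, group in df.groupby("phase"):
        group = group.dropna(subset=[x, y])
        if len(group) < min_n:
            results[phase] = None  # below the gate, report nothing
            continue
        r, p = stats.pearsonr(group[x], group[y])
        results[phase] = (r, p)
    return results

# A pooled correlation that vanishes within every phase was never about x and y;
# it was about the phase itself.
```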

The ones that survived were the relationships where the underlying mechanism was robust to hormonal state — sleep duration and next-day fatigue, hydration and perceived energy, caffeine timing and sleep onset. These are not exotic findings. They are the boring, well-replicated relationships from the literature. That is the point. The boring relationships are the ones that hold up under honest conditioning. The exciting ones mostly don't.

What this changed about how I build

Three operating principles fell out of the work. First, no correlation gets computed without a mechanism written down in advance; if I can't state the story in a sentence, it isn't tested. Second, nothing surfaces below the sample-size gate, no matter how interesting the early numbers look. Third, nothing surfaces without being conditioned on cycle phase, because pooled data in this domain is contaminated by default.

What I still don't know

The unifying observation is that consumer health is a domain where the temptation to ship a finding always exceeds the evidence for it. The discipline that matters is not statistical sophistication. It is the willingness to say nothing when nothing has been earned.
