Mary Fung
question · April 20, 2026

What does an honest eval look like for a system that learns from feedback?

Once the loop closes, last month's benchmark is part of the training distribution. So what are we measuring?

Open question. The cleanest evals I've seen are also the ones most likely to be obsolete by the next release. The robust ones are messier, and the numbers are harder to defend in a meeting.
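
One concrete version of "messier", as a sketch: before trusting a score, fingerprint the eval prompts against the feedback log the system actually trained on, and report the contamination rate next to the number. Everything below is illustrative (the names, the hash-based normalization); it's a minimal check under those assumptions, not a real dedup pipeline.

```python
import hashlib


def fingerprint(text: str) -> str:
    # Hash after light normalization so trivial variants
    # (case, whitespace) still collide.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def contamination_report(
    eval_set: list[str], feedback_log: list[str]
) -> tuple[list[str], float]:
    # Split the eval set into items the system has / hasn't trained on.
    seen = {fingerprint(p) for p in feedback_log}
    clean = [p for p in eval_set if fingerprint(p) not in seen]
    rate = (len(eval_set) - len(clean)) / len(eval_set) if eval_set else 0.0
    return clean, rate


if __name__ == "__main__":
    # Hypothetical data: prompts the system learned from last month,
    # and the benchmark we were about to reuse.
    feedback_log = ["Summarize this contract.", "Translate: bonjour"]
    eval_set = ["summarize this contract.", "Write a haiku about rain."]
    clean, rate = contamination_report(eval_set, feedback_log)
    print(f"contamination rate: {rate:.0%}; still-clean items: {clean}")
```

A nonzero contamination rate doesn't invalidate the score, but it tells you how much of last month's benchmark has already leaked into the loop, which is exactly the part the headline number hides.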
