I launched the first version of Listen2RE without a systematic way to evaluate the LLM output, and in the first month inconsistent output quality hurt early retention. That mistake taught me the rule I now apply to every AI feature: the eval harness is pre-launch infrastructure, not a phase-two nice-to-have.
Why "ship and watch" fails for AI
For a deterministic feature, "ship it and watch the dashboards" is fine, because the behavior is fixed. For an LLM feature, the behavior is a distribution: the same prompt can produce a great answer, a mediocre one, and a confidently wrong one across three runs. Without a harness, you are not measuring quality; you are sampling vibes.
The cost shows up as silent retention loss. Users hit a bad output, lose trust, and churn without filing a complaint. You never see the cause in your funnel.
What an eval harness actually is
Three parts:
- A reference dataset. A frozen set of representative inputs with known-good expectations. Not all of your traffic: a curated set of 100 to 300 cases that covers the common path, the edge cases, and the known failure modes.
- Scorers. Automated graders that turn each output into a number or a label. LLM-as-a-judge for subjective quality, exact-match or regex for structured fields, and cheap heuristics where they apply.
- A gate. A threshold the feature must clear before it ships, and a number you keep watching after launch.
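To make the three parts concrete, here is a minimal sketch in Python. Everything in it is illustrative rather than the Listen2RE implementation: `Case`, `exact_match`, `regex_scorer`, `run_eval`, and the `fake_generate` stand-in for the model call are hypothetical names, and the rule that a case passes only when every scorer accepts it is one assumption among several reasonable ones.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    """One frozen reference case: an input plus its known-good expectation."""
    case_id: str
    prompt: str
    expected: str


# A scorer looks at a case and the model output and returns pass/fail.
Scorer = Callable[[Case, str], bool]


def exact_match(case: Case, output: str) -> bool:
    """Structured-field scorer: output must equal the expectation."""
    return output.strip() == case.expected.strip()


def regex_scorer(pattern: str) -> Scorer:
    """Build a cheap heuristic scorer from a regular expression."""
    compiled = re.compile(pattern)

    def score(case: Case, output: str) -> bool:
        return compiled.search(output) is not None

    return score


def run_eval(
    cases: List[Case],
    generate: Callable[[str], str],
    scorers: List[Scorer],
    threshold: float,
) -> Dict[str, object]:
    """Run every case through the model and apply the gate."""
    if not cases:
        raise ValueError("reference dataset is empty")
    failed = []
    for case in cases:
        output = generate(case.prompt)
        # Assumption: a case passes only if all scorers accept the output.
        if not all(scorer(case, output) for scorer in scorers):
            failed.append(case.case_id)
    pass_rate = (len(cases) - len(failed)) / len(cases)
    return {"pass_rate": pass_rate, "failed": failed, "ship": pass_rate >= threshold}


def fake_generate(prompt: str) -> str:
    """Stand-in for the real model call (hypothetical): uppercases the prompt."""
    return prompt.upper()


cases = [
    Case("c1", "abc", "ABC"),
    Case("c2", "hello", "HELLO"),
    Case("c3", "x", "y"),  # deliberately failing case
    Case("c4", "ok", "OK"),
]
report = run_eval(cases, fake_generate, [exact_match], threshold=0.7)
print(report)
```

In a real harness the dataset would live in a versioned file and `generate` would call the model. The list of failed case IDs matters as much as the pass rate, because it tells you where to look.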
The LLM-as-a-judge pattern, done responsibly
LLM-as-a-judge means a second model scores the first model's output against a rubric. It works, but only if you treat the judge like any other model: validate it. On Listen2RE the judge ran as a pre-publish filter at an 85 percent pass threshold; anything below routed to a human spot-check. We also sampled judge decisions against human labels to confirm the judge agreed with us roughly 9 times out of 10. A judge you have not calibrated is just another unmeasured model.
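The routing and calibration steps can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the production filter: it assumes the 85 percent threshold applies to a per-output rubric score between 0 and 1, and the names `route` and `agreement_rate` and the sample data are hypothetical.

```python
from typing import List, Tuple

# ASSUMPTION: the judge returns a rubric score in [0, 1] for each output and
# the 85 percent threshold from the text is applied to that score.
PASS_THRESHOLD = 0.85


def route(judge_score: float, threshold: float = PASS_THRESHOLD) -> str:
    """Publish outputs at or above the threshold; send the rest to a human."""
    return "publish" if judge_score >= threshold else "human_review"


def agreement_rate(pairs: List[Tuple[bool, bool]]) -> float:
    """Fraction of sampled items where the judge verdict matches the human label.

    Each pair is (judge_passed, human_passed).
    """
    if not pairs:
        raise ValueError("no labelled samples to compare")
    matches = sum(1 for judge, human in pairs if judge == human)
    return matches / len(pairs)


# Made-up sample: ten judged outputs that a human also labelled.
sample = [
    (True, True), (True, True), (False, False), (True, True), (True, False),
    (False, False), (True, True), (True, True), (False, False), (True, True),
]
print(route(0.91), route(0.60), agreement_rate(sample))
```

Raw agreement is the simplest calibration check. Tracking it over time, on a fresh sample each time, is what keeps the judge from drifting unnoticed after a prompt or model change.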
Where the leverage is
Once the harness exists, three things change:
- Prompt changes become measurable. You stop arguing about whether a new prompt is "better" and run it through the set.
- Regressions get caught. A model upgrade that improves the common path can quietly break an edge case. The set catches it before users do.
- Cost drops with confidence. We cut cost per session by 40 percent over six months because the harness let us aggressively optimize prompts and caching without fear of silently degrading quality.
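One way to make prompt changes and regressions measurable is to compare per-case results from two runs over the same reference set, not only the aggregate pass rate. The sketch below assumes each run has already been reduced to a mapping from case ID to pass/fail; `compare_runs` and the case IDs are hypothetical.

```python
from typing import Dict


def compare_runs(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> Dict[str, object]:
    """Compare two eval runs case by case over the cases they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs have no cases in common")
    return {
        "baseline_pass_rate": sum(baseline[c] for c in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[c] for c in shared) / len(shared),
        # Passed before, fails now: what an aggregate number can hide.
        "regressions": [c for c in shared if baseline[c] and not candidate[c]],
        # Failed before, passes now.
        "fixes": [c for c in shared if not baseline[c] and candidate[c]],
    }


# Made-up results: the candidate improves the common path but breaks an edge case.
baseline = {"common-1": True, "common-2": False, "common-3": False, "edge-1": True}
candidate = {"common-1": True, "common-2": True, "common-3": True, "edge-1": False}

diff = compare_runs(baseline, candidate)
print(diff)
```

Here the candidate's pass rate rises from 0.5 to 0.75, yet `edge-1` regresses. Whether to block a release on any regression or only on a lower aggregate is a product decision that belongs in the gate.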
The PM's job here
This is not "the ML team's problem." The PM owns the rubric, because the rubric is a product decision: what does good mean for this feature, for this user, in this context? Engineers can build the scorer. Only the PM can define what it should reward.
Start before you ship. The harness you build in week one is the thing that lets you move fast for the next two years.