Most AI product dashboards I see make the same mistake: they measure the model and forget the product. Token counts, latency, model accuracy — all real, none of them the point. The user does not care that an LLM generated the answer. They care whether their problem got solved. Here is how I structure metrics so the model numbers serve the product outcome instead of replacing it.
Two scorecards, kept separate
AI products need two distinct sets of metrics, and conflating them is how teams get lost:
Product health answers "is this product working for users?" — activation, completion, retention, the north star. These would matter even if there were no AI inside.
AI quality answers "is the model output good?" — factual accuracy, the LLM-as-a-judge pass rate, human flag rate, cost per call. These are inputs to product health, not substitutes for it.
The trap is reporting AI quality and calling it product success. An 87 percent quality pass rate tells you nothing about product success if users are not completing sessions. Keep the scorecards separate and connect them deliberately.
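One way to connect the two scorecards deliberately is to log both signals on the same session and compare completion for sessions that passed the quality check against those that failed. The sketch below is only an illustration: the `Session` record and its field names are my assumptions, not a real schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Session:
    # Hypothetical record: one row per session, carrying one signal
    # from each scorecard.
    judge_passed: bool  # AI quality: did the output pass the LLM-as-a-judge check?
    completed: bool     # product health: did the user finish the session?


def rate(flags: Iterable[bool]) -> Optional[float]:
    """Share of True values, or None when there is nothing to measure."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def connect_scorecards(sessions: list[Session]) -> dict:
    """Report both scorecards side by side, plus the link between them."""
    return {
        "ai_quality_pass_rate": rate(s.judge_passed for s in sessions),
        "completion_rate": rate(s.completed for s in sessions),
        "completion_when_passed": rate(s.completed for s in sessions if s.judge_passed),
        "completion_when_failed": rate(s.completed for s in sessions if not s.judge_passed),
    }


sessions = [
    Session(judge_passed=True, completed=True),
    Session(judge_passed=True, completed=False),
    Session(judge_passed=False, completed=False),
    Session(judge_passed=True, completed=True),
]
report = connect_scorecards(sessions)
```

If completion barely differs between passed and failed sessions, the quality check is probably not measuring what users care about, and that is worth knowing before you report it.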
Build a north-star tree
A north star alone is not actionable — you cannot directly move "completed sessions per week." So I decompose it into a tree of input metrics, each with an owner and a lever:
- North star: completed sessions per learner per week.
- Branch: activation (completed a session in the first 48 hours).
- Branch: completion rate (driven by push timing, session length, voice quality).
- Branch: return rate (driven by streaks, topic relevance).
- Branch: content supply health (whether a session is available for the user's current topic).
When a leaf moves, you know which lever to pull. The tree turns a vague goal into a set of things you can actually act on.
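The tree is simple enough to keep as data, which also makes it easy to audit for branches that lack an owner or a lever. Here is a minimal sketch. The `MetricNode` class and the owner names are hypothetical placeholders; the levers are only the ones listed above, and branches where I did not name a lever are left empty rather than guessed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetricNode:
    name: str
    owner: str | None = None  # hypothetical team names below
    levers: list[str] = field(default_factory=list)
    children: list[MetricNode] = field(default_factory=list)

    def leaves(self) -> list[MetricNode]:
        """Return the input metrics at the bottom of the tree."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def missing_owner_or_lever(root: MetricNode) -> list[str]:
    """Names of leaves nobody owns or nobody knows how to move."""
    return [n.name for n in root.leaves() if n.owner is None or not n.levers]


tree = MetricNode(
    "completed sessions per learner per week",
    owner="product",
    children=[
        MetricNode("activation", owner="growth"),
        MetricNode("completion rate", owner="core-loop",
                   levers=["push timing", "session length", "voice quality"]),
        MetricNode("return rate", owner="retention",
                   levers=["streaks", "topic relevance"]),
        MetricNode("content supply health", owner="content"),
    ],
)

gaps = missing_owner_or_lever(tree)
```

Running the audit on this tree flags the branches that still need a named lever before anyone can act on them.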
The anti-metrics I refuse to optimize
Naming what you will not optimize is as important as your north star:
- Downloads and signups. Vanity. They say nothing about whether anyone learned anything.
- Time in app. For a commute-learning product, less time for the same progress is a win. Optimizing time-in-app would push me to make the product worse.
- Library size / feature count. Breadth competes with the daily loop. More is often less.
- Raw model accuracy. Accuracy on a benchmark is not accuracy on your users' real distribution. Measure on your traffic, not a leaderboard.
Instrument the outcome, not the output
The single most useful reframe: measure what the feature was supposed to achieve, not what it produced. A summarization feature's metric is not "summaries generated" — it is whether users acted on the summary instead of reading the source. An AI feature ships with the metric it must move, defined before launch, or it ships blind.
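To make the summarization example concrete, here is a rough sketch that computes the output count and the outcome rate from the same event log. The event names (`summary_generated`, `acted_on_summary`, `opened_source`) and the `(summary_id, event_type)` tuple shape are assumptions for illustration; what counts as "acted on" depends on the product.

```python
from collections import defaultdict


def summary_metrics(events: list[tuple[str, str]]) -> dict:
    """Compare the output metric with the outcome metric for a summary feature.

    `events` is a list of (summary_id, event_type) pairs using the
    hypothetical event names described above.
    """
    seen = defaultdict(set)
    for summary_id, event_type in events:
        seen[summary_id].add(event_type)

    generated = [sid for sid, types in seen.items() if "summary_generated" in types]
    # Outcome: the user acted on the summary and never opened the source.
    acted_without_source = [
        sid for sid in generated
        if "acted_on_summary" in seen[sid] and "opened_source" not in seen[sid]
    ]
    return {
        "output_summaries_generated": len(generated),
        "outcome_acted_without_source_rate": (
            len(acted_without_source) / len(generated) if generated else None
        ),
    }


events = [
    ("s1", "summary_generated"), ("s1", "acted_on_summary"),
    ("s2", "summary_generated"), ("s2", "opened_source"), ("s2", "acted_on_summary"),
    ("s3", "summary_generated"),
]
metrics = summary_metrics(events)
```

The output count only goes up as usage grows. The outcome rate is the number a launch review should look at.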
The takeaway
Good AI metrics are boringly product-first. Two scorecards, kept separate and connected by a north-star tree. A short list of anti-metrics you defend. And every feature instrumented on the outcome it promised, not the output it emitted. The model numbers are real and they matter — as inputs. The product is still the point.