This is different from the confidence an investigation shows in the channel while an incident is live. That confidence is the investigation’s own real-time read on how well-evidenced its current theory is, covered in Building conviction. Accuracy is measured afterwards, against what turned out to be true.
How we grade accuracy
Once an incident is closed and its real cause is known, we compare what the investigation concluded against the ground truth (the cause and the key findings your responders established). We grade by simulation: if a responder mid-incident had taken the investigation’s headline diagnosis and acted on it, where would they have ended up, compared with what actually needed fixing? That gives a score on a four-point scale:

| Score | Grade | What it means |
|---|---|---|
| 100% | Bullseye | Acting on the diagnosis leads straight to the fix — right component, right mechanism, nothing the ground truth rules out. |
| 65% | One hop short | Right place, but responders still have to discover one thing it missed before they reach the fix. |
| 35% | Right area, wrong diagnosis | Names the right area, but acting on it would set responders to work that wouldn’t have fixed the incident. |
| 0% | Wrong area | Points responders at a different part of the system entirely. |
Why these thresholds
The scale measures one thing: how useful the diagnosis would have been to a responder who acted on it. Each grade answers a question, and the grades build on each other from the bottom up.

Did it point at the right part of the system? — 0%
If the diagnosis sends responders somewhere unrelated to the actual fault, nothing else about it can help. A wrong
location, whatever its reasoning, scores 0%.
Given the right area, would acting on it move toward the fix, or away from it? — 35%
Naming the right area isn’t enough on its own. A diagnosis can identify the right component but explain it in a way
that sets responders to work that wouldn’t have resolved the incident — reverting the wrong change, remediating the
wrong layer. The right area with the wrong work to do scores 35%.
If it does move toward the fix, is anything still missing? — 65%
A diagnosis can lead to the fix while leaving responders one step to find themselves — a trigger it didn’t identify,
a final hop in the chain. They reach the fix, just not immediately. Right place, one piece missing, scores 65%.
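
The three questions above can be read as a decision ladder, where the first "no" fixes the grade. The sketch below is only an illustration of that ordering: the names `Grade` and `grade_diagnosis`, and the idea of reducing each question to a boolean, are assumptions made for this example and not a description of how grading is actually implemented.

```python
from enum import Enum


class Grade(Enum):
    """The four-point scale from the table above (values are percentages)."""
    BULLSEYE = 100
    ONE_HOP_SHORT = 65
    RIGHT_AREA_WRONG_DIAGNOSIS = 35
    WRONG_AREA = 0


def grade_diagnosis(right_area: bool, moves_toward_fix: bool, nothing_missing: bool) -> Grade:
    """Hypothetical helper: answer the questions bottom-up, stop at the first 'no'."""
    if not right_area:
        # Pointed at a different part of the system entirely.
        return Grade.WRONG_AREA
    if not moves_toward_fix:
        # Right component, but the work it suggests would not fix the incident.
        return Grade.RIGHT_AREA_WRONG_DIAGNOSIS
    if not nothing_missing:
        # Leads to the fix, but responders must find one more thing themselves.
        return Grade.ONE_HOP_SHORT
    return Grade.BULLSEYE


# Example: right area, right direction, but the trigger was not identified.
print(grade_diagnosis(True, True, False).value)  # 65
```

Ordering the checks this way mirrors the point made above: a wrong location scores 0% whatever the quality of the reasoning behind it.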
Why accuracy comes first
Accuracy matters because it’s the precondition for everything else, though it isn’t the whole of the value, and the difference is worth being precise about. An AI SRE product earns its keep through engagement: responders reading what it surfaces, checking what it suggests, steering it, and acting on it. None of that happens if the information can’t be trusted, because nobody engages with a system that’s usually wrong. Accuracy is what makes engaging with an investigation worth a responder’s time.

In our experience, 70% is the inflection point: below it, responders treat an investigation as a maybe and re-check everything themselves; above it, they find it’s almost always telling them something useful (if not the exact answer, then a genuine head start) and engagement takes hold.

From there, a lot of the value comes from the moments an investigation creates, rather than from being right end-to-end. A heads-up that turns out not to be the cause can still be the thing that reminds a responder to check something they hadn’t, and that advances the incident. We therefore pay as much attention to how teams actually engage with investigations as we do to the accuracy score itself, because that engagement is where the return ultimately shows up.

When you evaluate an investigation or AI SRE system, ours or anyone’s, treat accuracy as the entry bar rather than the finish line. Ask how it’s measured, on which incidents, and whether it clears the threshold that makes engagement worthwhile. A system below the bar never earns the engagement that creates value; one above it should be judged on the value it goes on to create.

Beyond accuracy
Accuracy is the headline, but not the whole picture. We track a few supporting metrics so we can see why an investigation scored the way it did.

Reach
Whether the investigation actually got to the evidence it needed. Reach separates a reasoning problem from an access problem: a low accuracy score paired with low reach usually means the answer was somewhere the investigation couldn’t get to, not that it thought poorly. For example, if the real cause was connection-pool exhaustion visible only in a database the investigation wasn’t connected to, it can reason perfectly and still miss: its reach was the limit, not its judgement. That tells us connecting that database would do more for this account than any change to the model.

Findings quality
How well the investigation’s individual findings match the ones your responders established, measured as precision and recall:

- Precision: of the findings it surfaced, how many were real. If it raised four findings and three held up while one was a red herring, that’s 75% precision.
- Recall: of the findings that mattered, how many it found. If five findings were key to the incident and it surfaced three of them, that’s 60% recall.
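
The arithmetic behind those two examples can be sketched in a few lines. This is a minimal illustration and not the grading implementation: it assumes findings can be compared as exact string labels, and the finding names and the helpers `precision` and `recall` are made up for the example. Matching real findings against ground truth is likely to be fuzzier than a set intersection.

```python
from __future__ import annotations


def precision(surfaced: set[str], ground_truth: set[str]) -> float:
    """Share of surfaced findings that appear in the ground truth."""
    if not surfaced:
        return 0.0  # assumption: nothing surfaced counts as zero precision
    return len(surfaced & ground_truth) / len(surfaced)


def recall(surfaced: set[str], ground_truth: set[str]) -> float:
    """Share of ground-truth findings that the investigation surfaced."""
    if not ground_truth:
        return 0.0  # assumption: no established findings counts as zero recall
    return len(surfaced & ground_truth) / len(ground_truth)


# Hypothetical incident: four findings surfaced, five established by responders.
surfaced = {"pool-exhausted", "bad-deploy", "retry-storm", "dns-flap"}
ground_truth = {"pool-exhausted", "bad-deploy", "retry-storm", "missing-index", "queue-backlog"}

print(f"precision = {precision(surfaced, ground_truth):.0%}")  # precision = 75%
print(f"recall    = {recall(surfaced, ground_truth):.0%}")     # recall    = 60%
```

Here `dns-flap` is the red herring that lowers precision, while `missing-index` and `queue-backlog` are the key findings that were missed and lower recall.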
Overall signal
A single measure that combines accuracy and findings quality, so we can track the trend for an account in one number: whether investigations are getting better or worse month to month, and whether a backtested change moved things in the right direction overall.

Scoring every investigation
Every investigation on a closed incident is graded this way, automatically, so we, and you, have a continuous, up-to-date picture of how investigations are performing in your account, rather than a one-off sample. Those scores aren’t just a report card; they’re how we find and fix weak spots:

- Spotting opportunities. Patterns in the scores tell us where investigations in your account underperform: a telemetry source it isn’t querying well, a kind of incident it consistently struggles with, evidence it keeps missing.
- Driving improvements. Those opportunities feed changes to the things that move accuracy: the prompts, the telemetry guidance and memory that teach investigations your stack, the models, and internal tuning.
- Specific to your systems. Because grading uses your incidents and your ground truth, the opportunities we find are specific to your environment — not a generic average.
Backtesting
Continual scoring tells us how investigations are doing today. Backtesting lets us ask “what if?” by re-running investigations across a set of your past incidents, where the real cause is already known, and scoring the results the same way. We use it for two things.

Catching regressions
We don’t ship a change to investigations and hope. As we evolve the system (new models, new prompts, internal tuning), we backtest against historical incidents and compare the scores against the previous run, with each incident investigated several times so run-to-run variation doesn’t skew the result. A change that would lower accuracy shows up here, in a backtest, rather than in your live incidents. Run regularly, this is how we keep pushing accuracy up while making sure investigation performance never quietly degrades as the system changes underneath it.

Testing a change before you make it
Backtesting works just as well for changes on your side. If you’re weighing whether to connect a new telemetry source, or to improve the runbooks you’ve already connected, we can re-run investigations over your past incidents, with and without the change, and show you the difference in accuracy. You see how we’d have performed on real incidents you’ve already lived through, before committing to anything.

Backtests aren’t self-serve: get in touch and we’ll run one for you. Backtest investigations run entirely in the background, and never post anything into your incident channels.

FAQs
What counts as ground truth?
The cause and the key findings your responders established for an incident — what actually turned out to be true.
Because humans sometimes stop digging once an incident is mitigated, the ground truth can be incomplete; where it’s
silent, we don’t penalise the investigation for committing to a plausible explanation.
Does a low score mean the investigation failed?
Not on its own. A low accuracy score with low reach usually means the investigation couldn’t get to the evidence it
needed — often a missing or under-connected data source — rather than that it reasoned badly. That’s exactly the
kind of opportunity these scores surface.
Is my data used to grade other customers?
No. Grading and backtesting use your incidents and your ground truth, scoped to your account.
How is this different from the confidence shown during an incident?
The confidence you see in the channel is the investigation’s real-time assessment of how well its current theory is
evidenced — see Building conviction. Accuracy is
measured after the fact, against the cause that was eventually established.
Related
How investigations work
The process behind a result, and how an investigation builds conviction in real time.
Trust and safety
How investigations stay under your control, auditable, and honest about what they know.
Telemetry memory
How investigations learn to query your stack better over time.