We hold investigations to the same standard you would: would acting on this have actually fixed the incident? To answer that at scale, we grade past investigations against the cause your responders established, score them for accuracy and a handful of supporting metrics, and use the results to find where investigations in your account can improve.
This is different from the confidence an investigation shows in the channel while an incident is live — that’s its own real-time read on how well-evidenced its current theory is, covered in Building conviction. Accuracy is measured afterwards, against what turned out to be true.

How we grade accuracy

Once an incident is closed and its real cause is known, we compare what the investigation concluded against the ground truth — the cause and the key findings your responders established. We grade by simulation: if a responder mid-incident had taken the investigation’s headline diagnosis and acted on it, where would they have ended up, compared with what actually needed fixing? That gives a score on a four-point scale:
Score | Grade | What it means
100% | Bullseye | Acting on the diagnosis leads straight to the fix: right component, right mechanism, nothing the ground truth rules out.
65% | One hop short | Right place, but responders still have to discover one thing it missed before they reach the fix.
35% | Right area, wrong diagnosis | Names the right area, but acting on it would set responders to work that wouldn’t have fixed the incident.
0% | Wrong area | Points responders at a different part of the system entirely.

Why these thresholds

The scale measures one thing: how useful the diagnosis would have been to a responder who acted on it. Each grade answers a question, and the grades build on each other from the bottom up.
1. Did it point at the right part of the system? (0%)

If the diagnosis sends responders somewhere unrelated to the actual fault, nothing else about it can help. A wrong location, whatever its reasoning, scores 0%.
2. Given the right area, would acting on it move toward the fix, or away from it? (35%)

Naming the right area isn’t enough on its own. A diagnosis can identify the right component but explain it in a way that sets responders to work that wouldn’t have resolved the incident — reverting the wrong change, remediating the wrong layer. The right area with the wrong work to do scores 35%.
3. If it does move toward the fix, is anything still missing? (65%)

A diagnosis can lead to the fix while leaving responders one step to find themselves — a trigger it didn’t identify, a final hop in the chain. They reach the fix, just not immediately. Right place, one piece missing, scores 65%.
4. And when nothing’s missing (100%)

The right component, the right mechanism, and nothing in the diagnosis that the ground truth rules out: acting on it leads straight to the fix. That’s a bullseye, 100%.
The line that matters most sits between 65% and 35%. Both look close to the answer, but only a 65% actually leads a responder to the fix; there, the cost is time. A 35% leads them to work that wouldn’t have resolved the incident. It’s a small gap in score, but it marks a real gap in usefulness: the difference between a delay and a wrong turn.

Grading is also deliberately fair to the investigation. A claim only counts against it if the ground truth actually rules it out; where the ground truth is silent, the investigation gets the benefit of the doubt. And a wrong file name or an imperfect fix suggestion never lowers the score when the diagnosed cause itself is right: we grade the diagnosis, not the prescription.
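The four questions above read naturally as a decision ladder. The sketch below is only an illustration of that reading, not the grader itself: the `Judgement` fields, the function names, and the verdict labels are all hypothetical, and in practice each answer comes from comparing the diagnosis against the ground truth.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Judgement:
    """Hypothetical answers to the grading questions for one investigation."""
    right_area: bool        # did it point at the right part of the system?
    leads_toward_fix: bool  # would acting on it move responders toward the fix?
    nothing_missing: bool   # is there no step left for responders to find?


def accuracy_score(j: Judgement) -> Tuple[int, str]:
    """Walk the questions bottom-up and return (score in percent, grade)."""
    if not j.right_area:
        return 0, "Wrong area"
    if not j.leads_toward_fix:
        return 35, "Right area, wrong diagnosis"
    if not j.nothing_missing:
        return 65, "One hop short"
    return 100, "Bullseye"


def counts_against(ground_truth_verdict: str) -> bool:
    """A claim counts against the diagnosis only if the ground truth rules it out.

    The verdict labels ("supported", "ruled_out", "silent") are assumed names.
    """
    return ground_truth_verdict == "ruled_out"


# A diagnosis that names the right area and leads to the fix,
# but leaves the trigger for responders to find themselves.
print(accuracy_score(Judgement(True, True, False)))
```

Because each question is only asked once the previous one has passed, a diagnosis in the wrong area scores 0% however sound the rest of its reasoning is.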

Why accuracy comes first

Accuracy matters because it’s the precondition for everything else, though it isn’t the whole of the value, and the difference is worth being precise about. An AI SRE product earns its keep through engagement: responders reading what it surfaces, checking what it suggests, steering it, and acting on it. None of that happens if the information can’t be trusted; nobody engages with a system that’s usually wrong. Accuracy is what makes engaging with an investigation worth a responder’s time.

In our experience, 70% is the inflection point. Below it, responders treat an investigation as a maybe and re-check everything themselves. Above it, they find it’s almost always telling them something useful (if not the exact answer, then a genuine head start), and engagement takes hold.

From there, a lot of the value comes from the moments an investigation creates, rather than from being right end-to-end. A heads-up that turns out not to be the cause can still be the thing that reminds a responder to check something they hadn’t, and so advances the incident. That is why we pay as much attention to how teams actually engage with investigations as we do to the accuracy score itself: that engagement is where the return ultimately shows up.

When you evaluate an investigation or AI SRE system, ours or anyone’s, treat accuracy as the entry bar rather than the finish line. Ask how it’s measured, on which incidents, and whether it clears the threshold that makes engagement worthwhile. A system below the bar never earns the engagement that creates value; one above it should be judged on the value it goes on to create.

Beyond accuracy

Accuracy is the headline, but not the whole picture. We track a few supporting metrics so we can see why an investigation scored the way it did.

Reach

Whether the investigation actually got to the evidence it needed. Reach separates a reasoning problem from an access problem: a low accuracy score paired with low reach usually means the answer was somewhere the investigation couldn’t get to, not that it thought poorly. For example, if the real cause was connection-pool exhaustion visible only in a database the investigation wasn’t connected to, it can reason perfectly and still miss — its reach was the limit, not its judgement. That tells us connecting that database would do more for this account than any change to the model.
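As a rough illustration of how reach changes the reading of a low score, the sketch below triages a single investigation. It assumes both metrics are expressed as fractions between 0 and 1, and the `low` cut-off is an arbitrary placeholder; neither the scale nor the threshold is stated above, and the function name is hypothetical.

```python
def likely_limit(accuracy: float, reach: float, low: float = 0.5) -> str:
    """Rough triage of one investigation's scores.

    `accuracy` and `reach` are assumed to be fractions in [0, 1];
    `low` is an illustrative cut-off, not a documented threshold.
    """
    if accuracy >= low:
        return "no obvious problem"
    if reach < low:
        # Low accuracy with low reach: the evidence was probably out of reach.
        return "access"
    # Low accuracy despite good reach: the evidence was there to be used.
    return "reasoning"


# The connection-pool example: sound reasoning, but the database was not connected.
print(likely_limit(accuracy=0.0, reach=0.2))  # prints "access"
```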

Findings quality

How well the investigation’s individual findings match the ones your responders established, measured as precision and recall:
  • Precision — of the findings it surfaced, how many were real. If it raised four findings and three held up while one was a red herring, that’s 75% precision.
  • Recall — of the findings that mattered, how many it found. If five findings were key to the incident and it surfaced three of them, that’s 60% recall.
The two pull against each other: an investigation that lists every possible factor scores high on recall but low on precision — it buries the real findings in noise — while one that only commits to its single surest finding scores high on precision but low on recall, because it misses things. We combine them into an F1 score, which only rewards doing both well: finding what matters without padding it with what doesn’t.
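To make the arithmetic concrete, here is a minimal sketch that combines the two examples above into one investigation: four findings surfaced, five key findings established, three in common. It uses the standard F1 definition (the harmonic mean of precision and recall); the function name and the finding labels are made up for illustration, and matching findings by exact label is a simplification.

```python
from typing import Dict, Set


def findings_quality(surfaced: Set[str], established: Set[str]) -> Dict[str, float]:
    """Precision, recall and F1 for one investigation's findings."""
    matched = surfaced & established
    precision = len(matched) / len(surfaced) if surfaced else 0.0
    recall = len(matched) / len(established) if established else 0.0
    total = precision + recall
    # Harmonic mean: high only when precision and recall are both high.
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only: three real findings plus one red herring,
# against five findings the responders established.
surfaced = {"pool exhausted", "retry storm", "bad deploy", "dns blip"}
established = {"pool exhausted", "retry storm", "bad deploy", "missing alert", "slow query"}

for name, value in findings_quality(surfaced, established).items():
    print(f"{name}: {value:.2f}")
```

Here precision is 3/4 and recall is 3/5, so F1 comes out at about 0.67, lower than the simple average of the two because the harmonic mean leans toward the weaker value.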

Overall signal

A single measure that combines accuracy and findings quality, so we can track the trend for an account in one number — whether investigations are getting better or worse month to month, and whether a backtested change moved things in the right direction overall.

Scoring every investigation

Every investigation on a closed incident is graded this way, automatically — so we, and you, have a continuous, up-to-date picture of how investigations are performing in your account, rather than a one-off sample. Those scores aren’t just a report card — they’re how we find and fix weak spots:
  • Spotting opportunities. Patterns in the scores tell us where investigations in your account underperform — a telemetry source it isn’t querying well, a kind of incident it consistently struggles with, evidence it keeps missing.
  • Driving improvements. Those opportunities feed changes to the things that move accuracy: the prompts, the telemetry guidance and memory that teach investigations your stack, the models, and internal tuning.
  • Specific to your systems. Because grading uses your incidents and your ground truth, the opportunities we find are specific to your environment — not a generic average.

Backtesting

Continual scoring tells us how investigations are doing today. Backtesting lets us ask “what if?” — by re-running investigations across a set of your past incidents, where the real cause is already known, and scoring the results the same way. We use it for two things.

Catching regressions

We don’t ship a change to investigations and hope. As we evolve the system — new models, new prompts, internal tuning — we backtest against historical incidents and compare the scores against the previous run, with each incident investigated several times so run-to-run variation doesn’t skew the result. A change that would lower accuracy shows up here, in a backtest, rather than in your live incidents. Run regularly, this is how we keep pushing accuracy up while making sure investigation performance never quietly degrades as the system changes underneath it.
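The comparison described above can be sketched as follows: average each incident's accuracy over its repeated runs, then flag incidents whose average dropped between the previous run and the candidate change. This is an assumed shape for the comparison, not the actual pipeline; the function names, the `tolerance` parameter, the incident IDs, and the scores are all invented for illustration.

```python
from statistics import mean
from typing import Dict, List

# Incident id -> accuracy scores (in percent) from repeated backtest runs.
Runs = Dict[str, List[float]]


def mean_per_incident(runs: Runs) -> Dict[str, float]:
    """Average the repeated runs so run-to-run variation is smoothed out."""
    return {incident: mean(scores) for incident, scores in runs.items()}


def find_regressions(baseline: Runs, candidate: Runs, tolerance: float = 0.0) -> Dict[str, float]:
    """Return incidents whose mean accuracy fell by more than `tolerance` points."""
    base = mean_per_incident(baseline)
    cand = mean_per_incident(candidate)
    deltas = {inc: cand[inc] - base[inc] for inc in base if inc in cand}
    return {inc: delta for inc, delta in deltas.items() if delta < -tolerance}


# Made-up data: each incident investigated three times per configuration.
baseline = {"INC-1": [100, 65, 100], "INC-2": [35, 35, 65]}
candidate = {"INC-1": [65, 65, 100], "INC-2": [65, 65, 65]}

print(find_regressions(baseline, candidate))
```

In this invented data the candidate improves INC-2 but loses ground on INC-1, so only INC-1 is reported. A per-incident view like this keeps a gain on one incident from hiding a loss on another.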

Testing a change before you make it

Backtesting works just as well for changes on your side. If you’re weighing whether to connect a new telemetry source, or to improve the runbooks you’ve already connected, we can re-run investigations over your past incidents, with and without the change, and show you the difference in accuracy. You see how we’d have performed on real incidents you’ve already lived through, before committing to anything.

Backtests aren’t self-serve: get in touch and we’ll run one for you. Backtest investigations run entirely in the background and never post anything into your incident channels.

FAQs

Ground truth is the cause and the key findings your responders established for an incident: what actually turned out to be true. Because humans sometimes stop digging once an incident is mitigated, the ground truth can be incomplete; where it’s silent, we don’t penalise the investigation for committing to a plausible explanation.
A low accuracy score on its own doesn’t show that an investigation reasoned badly. Paired with low reach, it usually means the investigation couldn’t get to the evidence it needed, often because of a missing or under-connected data source. That’s exactly the kind of opportunity these scores surface.
Grading and backtesting use your incidents and your ground truth, scoped to your account.
The confidence you see in the channel is the investigation’s real-time assessment of how well its current theory is evidenced; see Building conviction. Accuracy is measured after the fact, against the cause that was eventually established.

How investigations work

The process behind a result, and how an investigation builds conviction in real time.

Trust and safety

How investigations stay under your control, auditable, and honest about what they know.

Telemetry memory

How investigations learn to query your stack better over time.