Measuring accuracy

We hold investigations to the same standard you would: would acting on this have actually fixed the incident? To answer that at scale, we grade past investigations against the cause your responders established, score them for accuracy and a handful of supporting metrics, and use the results to find where investigations in your account can improve.

This is different from the confidence an investigation shows in the channel while an incident is live. That confidence is the investigation’s own real-time read on how well-evidenced its current theory is, and is covered in Building conviction. Accuracy is measured afterwards, against what turned out to be true.

How we grade accuracy

Once an incident is closed and its real cause is known, we compare what the investigation concluded against the ground truth: the cause and the key findings your responders established. We grade by simulation: if a responder mid-incident had taken the investigation’s headline diagnosis and acted on it, where would they have ended up, compared with what actually needed fixing? That gives a score on a four-point scale:

Score	Grade	What it means
100%	Bullseye	Correct diagnosis. Acting on it would directly fix the incident.
65%	On target	Correct, but incomplete. Responders would need to dig further to fix the incident.
35%	Miss	Names the right area, but acting on the diagnosis wouldn’t have fixed the incident.
0%	Nowhere near	Points responders at a different part of the system entirely.

Why these thresholds

The scale measures how useful the diagnosis would have been to a responder who acted on it. Each grade answers a question, and the grades build on each other from the bottom up.

Did it point at the right part of the system? (0%)

If the diagnosis sends responders somewhere unrelated to the actual fault, nothing else about it can help. A wrong location, whatever its reasoning, scores 0%.

Given the right area, would acting on it move toward the fix, or away from it? (35%)

Naming the right area isn’t enough on its own. A diagnosis can identify the right component but explain it in a way that sets responders to work that wouldn’t have resolved the incident, like reverting the wrong change or remediating the wrong layer. The right area with the wrong course of action scores 35%.

If it does move toward the fix, is anything still missing? (65%)

A diagnosis can lead to the fix while leaving responders one step to find themselves: a trigger it didn’t identify, a final hop in the chain. They reach the fix, just not immediately. Right place but one piece missing scores 65%.

And when nothing’s missing (100%)

When the diagnosis names the right component and the right mechanism, and contains nothing that the ground truth rules out, acting on it leads straight to the fix. That’s a bullseye, 100%.

What makes an investigation good is the jump from 35% to 65%. Both look close to the answer, but only a 65% actually leads responders to the fix: they still have to spend time finding the final piece, but it gets them there. A 35%, on the other hand, leads them to work that wouldn’t have resolved the incident. It’s a small gap in score but a large gap in usefulness: the difference between a delay and a wrong turn.

Grading is also deliberately fair to the investigation. A claim only counts against it if the ground truth actually rules it out; where the ground truth is silent, the investigation gets the benefit of the doubt. And a wrong file name or an imperfect fix suggestion never lowers the score when the diagnosed cause itself is right: we grade the diagnosis, not the prescription.
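The ladder of questions above can be read as a short decision procedure. The sketch below is illustrative only: the function name, the three boolean inputs, and the `Grade` enum are assumptions made for this example, and in practice the answers come from grading against the ground truth rather than from hand-set flags.

```python
from enum import Enum


class Grade(Enum):
    """The four grades from the scale, with their percentage scores."""
    BULLSEYE = 100
    ON_TARGET = 65
    MISS = 35
    NOWHERE_NEAR = 0


def grade_diagnosis(right_area: bool, leads_toward_fix: bool, nothing_missing: bool) -> Grade:
    """Walk the questions bottom-up; the first 'no' decides the grade.

    All three parameters are hypothetical inputs for illustration.
    """
    if not right_area:
        return Grade.NOWHERE_NEAR  # wrong part of the system
    if not leads_toward_fix:
        return Grade.MISS          # right area, wrong course of action
    if not nothing_missing:
        return Grade.ON_TARGET     # leads to the fix, one piece left to find
    return Grade.BULLSEYE          # acting on it leads straight to the fix


# Right component, right direction, but the trigger was not identified.
print(grade_diagnosis(True, True, False).value)
```

Because each question is only asked once the previous one has passed, a diagnosis that points at the wrong part of the system scores 0% however good the rest of its reasoning is.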

Why accuracy comes first

Accuracy matters because it’s the precondition for everything else, though it isn’t the whole of the value, and the difference is worth being precise about. An AI SRE product provides value through engagement: responders reading what it surfaces, checking what it suggests, steering it, and acting on it. None of that happens if the information can’t be trusted; nobody engages with a system that’s usually wrong. Accuracy is what makes engaging with an investigation worth a responder’s time.

In our experience, 65% is the inflection point: below it, responders treat an investigation as a maybe and re-check everything themselves; above it, they find it’s almost always telling them something useful (if not the exact answer, then a genuine head start) and engagement takes hold.

From there, a lot of the value comes from the moments an investigation creates, rather than from being right end-to-end. A heads-up that turns out not to be the cause can still remind a responder to check something they hadn’t, and so advance the incident. That is why we pay as much attention to how teams actually engage with investigations as we do to the accuracy score itself: that engagement is where the return ultimately shows up.

When you evaluate an investigation or AI SRE system, whether ours or anyone else’s, treat accuracy as the entry bar rather than the finish line. Ask how it’s measured, on which incidents, and whether it clears the threshold that makes it worth engaging with. Any system that doesn’t meet that bar won’t see the engagement that drives value; one that clears it should be judged on the value it goes on to create.

Beyond accuracy

Accuracy is the headline, but not the whole picture. We track a few supporting metrics so we can see why an investigation scored the way it did.

Reach

Whether the investigation actually got to the evidence it needed. Reach separates a reasoning problem from an access problem: a low accuracy score paired with low reach usually means the answer was somewhere the investigation couldn’t get to, not that it thought poorly. For example, if the real cause was connection-pool exhaustion visible only in a database the investigation wasn’t connected to, it can reason perfectly and still miss, purely due to lack of reach. That tells us connecting that database would do more for your account than any change to the model.
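One way to picture how reach and accuracy are read together is the small triage sketch below. It is an assumption-laden illustration: expressing reach as a percentage, the 50% cut-off, and the function name are all hypothetical, and the labels it returns are a first guess to investigate, not a verdict.

```python
def triage(accuracy: float, reach: float, low: float = 50.0) -> str:
    """Suggest where to look first for a graded investigation.

    accuracy and reach are percentages (0-100). Treating reach as a
    percentage and using 50 as the cut-off are assumptions for this sketch.
    """
    if accuracy >= low:
        return "no action suggested"
    if reach < low:
        # The evidence was probably somewhere the investigation could not get to.
        return "likely access problem"
    # The evidence was reachable, so the reasoning deserves a closer look.
    return "likely reasoning problem"


# A miss where the key evidence lived in an unconnected database.
print(triage(accuracy=35, reach=20))
```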

Findings quality

How well the investigation’s individual findings match the ones your responders established, measured as precision and recall:

Precision: of the findings it surfaced, how many were real. If it raised four findings and three held up while one was a red herring, that’s 75% precision.
Recall: of the findings that mattered, how many it found. If five findings were key to the incident and it surfaced three of them, that’s 60% recall.

The two pull against each other: an investigation that lists every possible factor scores high on recall but low on precision as it buries real findings in noise. Conversely, one that only commits to its single surest finding scores high on precision but low on recall, because it misses things. We combine them into an F1 score, which only rewards doing both well: finding what matters, without padding it with noise.
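As a worked example, the sketch below computes precision, recall, and the standard F1 score (the harmonic mean of the two). It reuses the numbers from the examples above and assumes, purely for illustration, that both describe the same investigation: four findings surfaced, five key findings, three in common. The function and parameter names are made up for this example.

```python
def findings_quality(matched: int, surfaced: int, key_findings: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from simple counts.

    matched      -- surfaced findings that also appear in the ground truth
    surfaced     -- all findings the investigation raised
    key_findings -- all findings the responders established
    """
    precision = matched / surfaced if surfaced else 0.0
    recall = matched / key_findings if key_findings else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = findings_quality(matched=3, surfaced=4, key_findings=5)
print(f"precision={precision:.0%} recall={recall:.0%} f1={f1:.0%}")
```

With these counts precision is 75%, recall is 60%, and F1 comes out at roughly 67%. Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why padding a result with noise or committing to a single finding both cost score.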

Overall signal

A single measure that combines accuracy and findings quality, so we can track the trend for an account in one number: whether investigations are getting better or worse month to month, and whether a backtested change moved things in the right direction overall.

Scoring every investigation

Every investigation on a closed incident is graded this way, automatically, so we, and you, have a continuous, up-to-date picture of how investigations are performing in your account, rather than a one-off sample. Those scores aren’t just a report card; they’re how we find and fix weak spots:

Spotting opportunities. Patterns in the scores tell us where investigations in your account underperform: a telemetry source it isn’t querying well, a kind of incident it consistently struggles with, evidence it keeps missing.
Driving improvements. Those opportunities feed changes to the things that drive accuracy: the prompts, the telemetry guidance and memory that teach investigations your stack, the models, and internal tuning.
Specific to your systems. Because grading uses your incidents and your ground truth, the opportunities we find are specific to your environment rather than a generic average.
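To make "patterns in the scores" concrete, here is a minimal sketch that groups graded investigations by incident kind and ranks the groups from weakest to strongest. The incident kinds, the scores, the `min_count` cut-off, and the function name are all invented for illustration and are not taken from a real account or from the actual analysis.

```python
from collections import defaultdict
from statistics import mean


def weakest_groups(graded: list[tuple[str, float]], min_count: int = 3) -> list[tuple[str, float]]:
    """Average accuracy per incident kind, lowest first.

    Kinds with fewer than min_count graded investigations are skipped,
    so a single bad result is not mistaken for a pattern.
    """
    by_kind: dict[str, list[float]] = defaultdict(list)
    for kind, score in graded:
        by_kind[kind].append(score)
    averages = {k: mean(v) for k, v in by_kind.items() if len(v) >= min_count}
    return sorted(averages.items(), key=lambda item: item[1])


# Made-up scores on the 0/35/65/100 scale.
graded = [
    ("database", 35), ("database", 0), ("database", 65),
    ("deploy", 100), ("deploy", 65), ("deploy", 100),
    ("dns", 0),
]
for kind, average in weakest_groups(graded):
    print(f"{kind}: {average:.0f}%")
```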

Backtesting

Continual scoring tells us how investigations are doing today. Backtesting lets us ask “what if?” We re-run investigations across a set of your past incidents where the real cause is already known, and score the results the same way. We use it for two things.

Catching regressions

We don’t ship a change to investigations and hope. As we evolve the system through new models, new prompts, and internal tuning, we backtest against historical incidents. We compare the scores against the previous run, with each incident investigated several times so run-to-run variation doesn’t skew the result. A change that would lower accuracy shows up here, in a backtest, rather than in your live incidents. Run regularly, this is how we keep pushing accuracy up while making sure investigation performance never quietly degrades as the system evolves.
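The comparison described above can be sketched as follows. This is a simplified illustration, not the actual backtesting pipeline: the incident IDs and scores are invented, and averaging per incident first and then applying a fixed tolerance is an assumption about one reasonable way to keep run-to-run variation from skewing the result.

```python
from statistics import mean


def mean_accuracy(runs: dict[str, list[float]]) -> float:
    """Average each incident's repeated runs first, then average across
    incidents, so every incident counts equally however often it was run."""
    return mean(mean(scores) for scores in runs.values())


def is_regression(baseline: dict[str, list[float]],
                  candidate: dict[str, list[float]],
                  tolerance: float = 1.0) -> bool:
    """Flag the candidate if its mean accuracy drops by more than the
    tolerance (in percentage points; a hypothetical threshold)."""
    return mean_accuracy(candidate) < mean_accuracy(baseline) - tolerance


# Made-up scores: three runs per incident, before and after a change.
baseline = {"inc-1": [100, 65, 100], "inc-2": [35, 65, 35]}
candidate = {"inc-1": [65, 65, 100], "inc-2": [35, 35, 0]}

print(is_regression(baseline, candidate))
```

Averaging repeated runs per incident means a single unlucky run cannot flip the result on its own; only a consistent drop across runs shows up as a regression.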

Testing a change before you make it

Backtesting works just as well for changes on your side. If you’re weighing whether to connect a new telemetry source, or to improve the runbooks you’ve already connected, we can re-run investigations over your past incidents, with and without the change, and show you the difference in accuracy. You see how we’d have performed on real incidents you’ve already handled, before committing to anything. Backtests aren’t self-serve, so get in touch and we’ll run one for you. Backtest investigations run entirely in the background, and never post anything into your incident channels.

FAQs

What counts as ground truth?

The real cause and the key findings your responders established for an incident. Because humans sometimes stop digging once an incident is mitigated, the ground truth can be incomplete; where it’s silent, we don’t penalise the investigation for committing to a plausible explanation.

Does a low score mean the investigation failed?

Not on its own. A low accuracy score with low reach usually means the investigation couldn’t get to the evidence it needed (often a missing or under-connected data source) rather than that it reasoned badly. Low scores often highlight such opportunities.

Is my data used to grade other customers?

No. Grading and backtesting use your incidents and your ground truth, scoped to your account.

How is this different from the confidence shown during an incident?

The confidence you see in the channel is the investigation’s real-time assessment of how well its current theory is evidenced. See Building conviction. Accuracy is measured after the fact, against the cause that was eventually established.

Related

How investigations work

The process behind a result, and how an investigation builds conviction in real time.

Measuring engagement

How we measure the way responders actually use an investigation, from reading it to acting on it.

Trust and safety

How investigations stay under your control, auditable, and honest about what they know.

Telemetry memory

How investigations learn to query your stack better over time.
