The summary message
When an investigation starts, it posts a single summary message to the channel and keeps it up to date in place. It’s kept short and scannable, with headers for the things a responder most wants to know:
- What’s going on? The situation in plain language.
- What caused it? The current best hypothesis, with its confidence shown inline.
- What can I do next? Concrete next steps, linked to the evidence behind them.
What’s going on? The Redis instance in production is sustaining CPU above 50% (measured at 55.3%), which triggered an operational alert.
What caused it? (medium confidence) The elevated Redis CPU is plausibly linked to increased worker queue load, particularly from the statuspage-worker process, a pattern seen in past incidents.
What can I do next?
- Correlate the Redis CPU spike with statuspage-worker queue metrics in Grafana for the incident window
- If they line up, temporarily gate the event subscriptions driving Redis load
- Keep watching Redis CPU and workload metrics over the next 30 minutes
Progress in the thread
As the investigation works, it posts updates into the thread beneath the summary, so you can watch what it’s doing as it does it. You don’t need to read these to follow along, since the summary always reflects the latest, but they’re there when you want the detail. You’ll see things like:
- Hypothesis updates when its thinking changes, labeled so you can see the shift at a glance, such as “New hypothesis”, “Hypothesis strengthened”, or “Hypothesis weakened”.
- Check results as each piece of work completes: a short summary of what it found, with links to the source.
New hypothesis
- I’m now looking at upstream rate limiting from the payments API — sustained 429s with flat database metrics
- Shifted away from the earlier database contention theory
- Next: checking whether the payments API quota was changed recently
Querying telemetry
Links: Grafana dashboard, PR #54586
- A core deploy shipped 11 minutes before the first error (PR #54586), and the build SHA matches
- 107 successful responses vs 4 server errors over 4 hours, with no latency spike
Heads-up messages
The summary and thread are there whenever you choose to look. But sometimes the investigation works out something important that you probably don’t know yet, and waiting for you to check back isn’t good enough. In that case it posts a heads-up message to the channel, with the detail in a thread.
Heads up: I think this could be database connection pool exhaustion. The auth-gateway connection pool hit saturation (50/50) at 14:23 UTC, exactly when errors started spiking.
Heads-ups are deliberately quiet. The investigation only posts one when there’s a genuine shift worth your attention (a code change that explains the error, a past incident with the same fingerprint, a third-party outage), so they read as progress, not noise. The thread carries the supporting evidence and links, including any similar past incidents.
Ask and steer with @incident
The investigation isn’t a one-way broadcast. At any point you can talk to it in the channel by tagging @incident.
Ask about the investigation
Ask questions about what it’s found or why it thinks what it thinks, and it answers from everything the investigation knows.
@incident why do you think this is a Redis problem and not Postgres?
@incident has anything like this happened before?
Steer it
If you know something the investigation doesn’t (the real cause, a misleading signal, or a wrong turn it’s taking), tell it what to focus on instead. It feeds that in and re-assesses its hypothesis within a few minutes, and your input is attributed in the channel so everyone can see where the change in direction came from.
@incident the investigation is wrong: it’s Redis, not Postgres. Connections have been at 100% since 14:32.
@incident focus on the 14:30 deploy of payment-service v2.3.1 — the errors started right after it. The integration warnings are unrelated noise.
Related
How investigations work
The process behind what you see in the channel.
Chatbot
Everything else you can ask @incident during an incident.