Most of an investigation happens in your incident channel, in Slack or Microsoft Teams. This page covers what you’ll see there — the summary message, the progress updates, and the heads-ups — and how to ask questions or steer the investigation as it runs.

The summary message

When an investigation starts, it posts a single summary message to the channel and keeps it up to date in place. It’s kept short and scannable, with headers for the things a responder most wants to know:
  • What’s going on? — the situation in plain language.
  • What caused it? — the current best hypothesis, with its confidence shown inline.
  • What can I do next? — concrete next steps, linked to the evidence behind them.
While the investigation is still working, the message shows its progress and what it’s still checking. As it learns more, the same message updates — so the channel always shows current thinking, never a stale first guess.
What’s going on?
The Redis instance in production is sustaining CPU above 50% (measured at 55.3%), which triggered an operational alert.
What caused it? (medium confidence)
The elevated Redis CPU is plausibly linked to increased worker queue load, particularly from the statuspage-worker process, a pattern seen in past incidents.
What can I do next?
  • Correlate the Redis CPU spike with statuspage-worker queue metrics in Grafana for the incident window
  • If they line up, temporarily gate the event subscriptions driving Redis load
  • Keep watching Redis CPU and workload metrics over the next 30 minutes
Confidence is shown right next to the hypothesis — for example “(medium confidence)” or “(medium confidence, still investigating)” — so you can weigh a suggestion before acting on it.
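To make the shape of the summary concrete, here is a minimal sketch in Python of how such a message could be modeled and rendered as text. It is illustrative only: the `Summary` class, its field names, and the `render` and `confidence_label` helpers are hypothetical and not part of any published API. It shows the three headers and how the confidence label sits next to the hypothesis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Summary:
    """Hypothetical model of the summary message (not a real product type)."""
    situation: str                      # answers "What's going on?"
    hypothesis: str                     # answers "What caused it?"
    confidence: str                     # e.g. "medium"
    next_steps: List[str] = field(default_factory=list)
    still_investigating: bool = False   # assumed flag for in-progress work


def confidence_label(summary: Summary) -> str:
    """Build the inline label, e.g. "(medium confidence)"."""
    suffix = ", still investigating" if summary.still_investigating else ""
    return f"({summary.confidence} confidence{suffix})"


def render(summary: Summary) -> str:
    """Render the three headers in the order a responder reads them."""
    lines = [
        "What's going on?",
        summary.situation,
        f"What caused it? {confidence_label(summary)}",
        summary.hypothesis,
        "What can I do next?",
    ]
    lines += [f"  • {step}" for step in summary.next_steps]
    return "\n".join(lines)
```

Re-rendering the same object after its fields change mirrors the idea that one message is updated in place rather than a new one being posted each time.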

Progress in the thread

As the investigation works, it posts updates into the thread beneath the summary, so you can watch what it’s doing as it does it. You don’t need to read these to follow along — the summary always reflects the latest — but they’re there when you want the detail. You’ll see things like:
  • Hypothesis updates when its thinking changes — labeled so you can see the shift at a glance, such as “New hypothesis”, “Hypothesis strengthened”, or “Hypothesis weakened”.
  • Check results as each piece of work completes — a short summary of what it found, with links to the source.
New hypothesis
  • I’m now looking at upstream rate limiting from the payments API — sustained 429s with flat database metrics
  • Shifted away from the earlier database contention theory
  • Next: checking whether the payments API quota was changed recently
Querying telemetry
  • A core deploy shipped 11 minutes before the first error (PR #54586), and the build SHA matches
  • 107 successful responses vs 4 server errors over 4 hours, with no latency spike
Links: Grafana dashboard, PR #54586
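One way to think about the hypothesis labels is as a comparison between the previous and current state of the investigation. The sketch below is an assumption about how such labeling could work, not a description of the actual implementation: the `update_label` function and the three-step `low`/`medium`/`high` confidence scale are hypothetical (this page only shows "medium").

```python
from typing import Optional

# Assumed ordering of confidence levels; only "medium" appears on this page.
CONFIDENCE_ORDER = ["low", "medium", "high"]


def update_label(
    old_hypothesis: Optional[str],
    new_hypothesis: str,
    old_confidence: str,
    new_confidence: str,
) -> Optional[str]:
    """Pick a thread label for a hypothesis update, or None if nothing changed."""
    # A different hypothesis (or the first one) is always a new hypothesis.
    if old_hypothesis != new_hypothesis:
        return "New hypothesis"
    # Same hypothesis: compare confidence positions on the assumed scale.
    delta = CONFIDENCE_ORDER.index(new_confidence) - CONFIDENCE_ORDER.index(old_confidence)
    if delta > 0:
        return "Hypothesis strengthened"
    if delta < 0:
        return "Hypothesis weakened"
    return None
```

For instance, moving from a database contention theory to upstream rate limiting would yield "New hypothesis", while the same theory moving from medium to high confidence would yield "Hypothesis strengthened".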

Heads-up messages

The summary and thread are there whenever you choose to look. But sometimes the investigation works out something important that you probably don’t know yet — and waiting for you to check back isn’t good enough. In that case it posts a heads-up message to the channel, with the detail in a thread.
Heads up: I think this could be database connection pool exhaustion. The auth-gateway connection pool hit saturation (50/50) at 14:23 UTC — exactly when errors started spiking.
Heads-ups are deliberately rare. The investigation posts one only when there’s a genuine shift worth your attention (a code change that explains the error, a past incident with the same fingerprint, a third-party outage), so they read as progress, not noise. The thread carries the supporting evidence and links, including any similar past incidents.

Ask and steer with @incident

The investigation isn’t a one-way broadcast. At any point you can talk to it in the channel by tagging @incident.

Ask about the investigation

Ask questions about what it’s found or why it thinks what it thinks, and it answers from everything the investigation knows.
@incident why do you think this is a Redis problem and not Postgres?
@incident has anything like this happened before?

Steer it

If you know something the investigation doesn’t — the real cause, a misleading signal, or a wrong turn it’s taking — tell it what to focus on instead. It feeds that in and re-assesses its hypothesis within a few minutes, and your input is attributed in the channel so everyone can see where the change in direction came from.
@incident the investigation is wrong — it’s Redis, not Postgres. Connections have been at 100% since 14:32.
@incident focus on the 14:30 deploy of payment-service v2.3.1 — the errors started right after it. The integration warnings are unrelated noise.
You can also steer from the investigation message directly using Provide guidance, or from the incident in the dashboard. Engineers working in a local coding agent can steer it too — see Investigate alongside the agent.

How investigations work

The process behind what you see in the channel.

Chatbot

Everything else you can ask @incident during an incident.