Write the On-Call Handoff Before You Put Down the Pager

The outgoing engineer types "all clear" into Slack and logs off. The incoming engineer picks up the pager fifteen minutes later, looks at the same channel, and has no idea whether "all clear" means nothing happened or nothing resolved.

Handoffs fail when they rely on memory. A verbal "it was a quiet week" at shift changeover is not a handoff. It is an invitation for the incoming engineer to rediscover everything the outgoing engineer already knew.

This is the specific thing AI can fix - and largely hasn't yet.

The structural problem is well understood. The biggest conceptual mistake in on-call handoff design is assuming the purpose is to tell the next engineer what happened during the previous shift. That information can be useful, but it is not the center of the problem. The real question is more urgent: what can hurt us next, and how prepared is the incoming engineer to see it early and respond correctly?

A handoff document that chronicles the last eight hours of quiet is almost useless at 4 a.m. What matters is the half-resolved incident from Tuesday, the deploy that went out at 6 p.m. that the team hasn't stress-tested under load, and the known flaky service that pages every third weekend.

Handoffs that only cover active incidents miss broader operational context. Incoming engineers need to understand the full operational picture, not just current fires.

The gap widens on follow-the-sun teams. Distributed teams face higher coordination complexity, and every shift transition magnifies the risk of missing critical context. The APAC engineer taking over from Europe doesn't share office space, can't catch a colleague in the hallway, and is working from whatever text was left in a channel.

Right now AI is landing in two places: alert noise reduction and postmortem drafting.

The difference between 200 alerts and 3 meaningful alerts is the difference between panic and focus. incident.io's AI SRE cuts downtime by starting investigations instantly - instead of spending 15 minutes context-switching between Datadog, GitHub, and Slack to correlate a deployment with an error spike, the AI surfaces that correlation in 30 seconds.

That's real. So is automated postmortem drafting. Most teams write postmortems by hand. Most postmortems are late, short, and read by no one. The reason is unsentimental: writing a good postmortem takes hours of reconstruction work, on top of an incident that has already drained the on-call's day. Tools like Rootly, incident.io, and FireHydrant now generate draft postmortems from the incident channel, timeline, and linked alerts. AI changes the authoring cost, not the purpose. The blameless postmortem discipline Google and Etsy codified still applies; it just starts with a draft instead of a blank page.

But postmortem generation solves the past. The handoff problem is about the near future.

In practice, a good handoff message is often shorter than a bad one because it is structured around decisions instead of chronology. Chronological storytelling tends to expand while becoming less useful. The incoming engineer does not need a minute-by-minute reconstruction unless the incident itself is under formal review. They need the minimum context that allows competent action.

This is where a Slack-native tool - something like Beagle, or the incident-channel integrations in Rootly - has a natural role: watching what was said, what was escalated, and what was left unresolved, then drafting a structured handoff note that surfaces the three things that actually matter rather than the full thread scroll.
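One way to picture a handoff note structured around decisions rather than chronology is the minimal Python sketch below. It is not how Beagle or Rootly work; the `HandoffNote` and `OpenItem` names, their fields, and the default cap of three items are all hypothetical, chosen only to illustrate the shape such a note might take.

```python
from dataclasses import dataclass, field


@dataclass
class OpenItem:
    """One thing the incoming engineer may have to act on (hypothetical shape)."""
    summary: str      # what is unresolved or risky
    decision: str     # what the outgoing engineer did or decided
    reasoning: str    # why they decided it; logs will not record this
    next_action: str  # what to do if it flares up again


@dataclass
class HandoffNote:
    outgoing: str
    incoming: str
    open_items: list[OpenItem] = field(default_factory=list)

    def render(self, limit: int = 3) -> str:
        """Render the note as plain text, listing at most `limit` items in order."""
        lines = [f"On-call handoff: {self.outgoing} -> {self.incoming}"]
        if not self.open_items:
            # Say explicitly what "all clear" means.
            lines.append("No open items. Nothing unresolved, nothing pending.")
            return "\n".join(lines)
        for i, item in enumerate(self.open_items[:limit], start=1):
            lines.append(f"{i}. {item.summary}")
            lines.append(f"   Decision: {item.decision}")
            lines.append(f"   Why: {item.reasoning}")
            lines.append(f"   If it recurs: {item.next_action}")
        hidden = len(self.open_items) - limit
        if hidden > 0:
            lines.append(f"(+{hidden} lower-priority items in the thread)")
        return "\n".join(lines)


if __name__ == "__main__":
    note = HandoffNote(
        outgoing="ana",
        incoming="raj",
        open_items=[
            OpenItem(
                summary="Checkout latency still elevated after rollback",
                decision="Rolled back the 6 p.m. deploy",
                reasoning="Suspected the cache, not the deploy",
                next_action="Flush the cache before rolling back again",
            )
        ],
    )
    print(note.render())
```

The design choice worth noting is that every item carries its reasoning and a next action, and an empty note states outright that nothing is unresolved, so "all clear" is never ambiguous.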

The alert fatigue problem compounds everything. A 2025 study by Splunk showed 73% of organisations experienced outages linked to ignored alerts. When engineers start treating pages as background noise because a significant portion are not actionable, you have created the conditions for a missed P1.

Toil rose to a median of 30% of engineers' time in 2025, the first increase in five years. The irony is that more tooling contributed to it. Teams added monitoring layers, each with its own alerting logic, and nobody deleted the alerts that stopped being actionable. The Google SRE Workbook recommends a maximum of two actionable incidents per shift as a sustainable baseline. If your team is consistently seeing eight to ten, you don't have an on-call problem - you have an alerting problem.
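A rough way to check a shift against that baseline is sketched below in Python. The input format (a list of alert-name and was-it-actionable pairs) and the `review_shift` function are assumptions made for illustration; a real team would pull this from its paging tool's export.

```python
from collections import Counter

# Baseline of two actionable incidents per shift, as cited in the text.
MAX_ACTIONABLE_PER_SHIFT = 2


def review_shift(pages):
    """Summarize one shift.

    pages: list of (alert_name, was_actionable) tuples (hypothetical format).
    """
    actionable = sum(1 for _, ok in pages if ok)
    actionable_names = {name for name, ok in pages if ok}
    noise = Counter(name for name, ok in pages if not ok)
    # Alerts that paged but were never actionable this shift are
    # candidates for deletion or retuning.
    deletion_candidates = sorted(n for n in noise if n not in actionable_names)
    return {
        "actionable": actionable,
        "over_baseline": actionable > MAX_ACTIONABLE_PER_SHIFT,
        "deletion_candidates": deletion_candidates,
    }
```

Run over several rotations, the `deletion_candidates` list is the part that matters: it names the alerts nobody deleted after they stopped being actionable.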

AI that clusters and de-duplicates alerts addresses this directly. PagerDuty says its AI-powered alert grouping reduces alert noise by up to 98%, using automatic pattern recognition to group related alerts so on-call engineers receive fewer notifications. Whether those numbers hold in practice varies by stack, but the direction is right.
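The simplest version of the idea needs no machine learning at all. The Python sketch below groups alerts that share a fingerprint and arrive close together in time. It is not PagerDuty's algorithm; the `Alert` fields, the `(service, check)` fingerprint, and the five-minute window are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp: float  # seconds since epoch


def group_alerts(alerts, window_seconds=300.0):
    """Group alerts with the same (service, check) fingerprint.

    An alert joins the most recent group for its fingerprint if it arrives
    within window_seconds of that group's last alert; otherwise it starts
    a new group.
    """
    groups = []
    latest = {}  # fingerprint -> index of that fingerprint's newest group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        idx = latest.get(key)
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alert)
        else:
            groups.append([alert])
            latest[key] = len(groups) - 1
    return groups
```

Paging once per group instead of once per alert is what turns a storm of repeats into a handful of notifications. Production tools go further by correlating across services, which this sketch does not attempt.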

There's an honest limitation worth naming. IBM Research's ITBench benchmark, published as an ICML 2025 spotlight paper, tested 94 real-world IT automation scenarios across SRE, FinOps, and CISO domains. State-of-the-art models resolved only 13.8% of SRE scenarios autonomously. The AI SRE that fully replaces human judgment in production isn't here. What is here is AI that reduces the cognitive setup cost at shift start - summarizing what happened, flagging what's uncertain, and noting what still needs eyes.

Teams should be careful about where handoffs live, because the location determines whether the reasoning behind a decision survives the shift change. The incoming engineer should be free to disagree with earlier choices, but they should not have to reverse-engineer them from logs and vibes.

That last phrase is exact. Logs are complete records of what the system did. They are not records of what the on-call engineer was thinking when they made a call at midnight. The decision context - "we rolled back because we suspected the cache, not the deploy" - lives in Slack or in nobody's head. Structured handoffs surface patterns across rotations, such as recurring alerts, repeated escalations, and knowledge gaps, that would otherwise stay invisible until they cause a major incident.

The platform landscape is also shifting. Atlassian ended OpsGenie new sales on June 4, 2025, with full shutdown scheduled for April 5, 2027. Teams migrating now are picking between tools with different philosophies: PagerDuty as the deep-integration alerting engine; incident.io as the strongest Slack-native on-call experience, with AI-generated postmortems; and newer entrants like Rootly, the AI-first pick for fast-growing mid-market and enterprise teams that live in Slack or Teams.

The migration is an opportunity. Whichever tool replaces OpsGenie, the handoff workflow shouldn't be rebuilt to match the old one. The old one was probably a Confluence page updated once a week by whoever had the energy.

Build the handoff first. Then pick the tool that makes it automatic.