The rotation spreadsheet is not the problem. Every SRE team eventually figures this out the hard way: they redesign the schedule, add a secondary, move to follow-the-sun, shorten the shifts — and the burnout continues. The pager keeps going off. People keep leaving.
The actual problem is the alerts themselves.
Most on-call problems are signal quality problems wearing the disguise of scheduling problems. Redesigning your rotation without fixing your alerts is rearranging the furniture. That framing is uncomfortable because schedule problems feel solvable. You can draw a Gantt chart. You can argue about weekends fairly. Alert quality is harder — it requires admitting that a lot of what your monitoring stack has been screaming about for months shouldn't require a human at all.
The numbers make this embarrassing to ignore.
PagerDuty's 2025 State of Digital Operations report found that the average on-call engineer receives roughly 50 alerts per week, but only 2–5% of those require human intervention. That is not a monitoring strategy. That is a system that has learned to cry wolf, and the engineers responsible for answering it have learned to stop believing it. Once a team is desensitized by noise, a critical alert can easily be overlooked, and a minor issue can escalate into a severe, customer-impacting outage.
The human cost is not abstract. Engineers talk about "sleeping with one eye open," waking multiple times a night to check alerts that resolve themselves by morning. Stress builds. Over time, people leave. And when they leave, they are usually the engineers who have accumulated the most institutional knowledge about the systems being monitored — which makes the next on-call rotation worse.
The Catchpoint SRE Report 2025 found that nearly 70% of SREs say on-call stress has impacted burnout and attrition on their teams. A widely cited industry estimate puts the cost of unplanned downtime at an average of $5,600 per minute. So there is real money attached to letting this drift.
Where AI is actually helping
The application of AI here is not glamorous. It is not an agent that resolves incidents autonomously while engineers sleep. It is, mostly, correlation and suppression — taking the flood of events and collapsing it into something a person can actually read.
High alert volumes actively degrade reliability. When engineers see 500–1,200 alerts per day, they start tuning out. AIOps platforms address this by grouping related events, identifying which alerts are symptoms of the same root cause, and suppressing the redundant ones before they hit a pager. According to vendor reports and case studies, AI-powered observability platforms are cutting alert volumes by up to 95% and reducing mean time to resolution by 40–58%. Even at the lower end of those ranges, the effect on trust is meaningful. An engineer who gets paged five times a week and finds a real problem four of those times is a different person than one who gets paged fifty times and finds a real problem twice.
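The grouping-and-suppression step described above can be sketched in a few lines. This is a minimal illustration, not how any particular AIOps product works: it assumes alerts carry a service name and a timestamp, and it treats "same service, within ten minutes of the previous alert" as a crude stand-in for "same root cause." The `Alert` fields, the `collapse` function, and the ten-minute window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str        # hypothetical field: the service that emitted the alert
    check: str          # hypothetical field: name of the failing check
    fired_at: datetime


def collapse(alerts, window=timedelta(minutes=10)):
    """Group alerts from the same service that fire within `window` of the
    previous alert in that group. Returns a list of groups (lists of alerts).
    Only the first alert in each group would reach a pager."""
    groups = []
    open_group = {}  # service -> the group currently accepting alerts
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        group = open_group.get(alert.service)
        if group is not None and alert.fired_at - group[-1].fired_at <= window:
            group.append(alert)   # suppressed: folded into an existing group
        else:
            group = [alert]       # new incident candidate: this one pages
            groups.append(group)
            open_group[alert.service] = group
    return groups


# Example: five raw alerts, three of them symptoms of one checkout problem.
t0 = datetime(2025, 1, 1, 3, 0)
raw = [
    Alert("checkout", "latency_p99", t0),
    Alert("checkout", "error_rate", t0 + timedelta(minutes=2)),
    Alert("checkout", "queue_depth", t0 + timedelta(minutes=5)),
    Alert("search", "latency_p99", t0 + timedelta(minutes=6)),
    Alert("checkout", "latency_p99", t0 + timedelta(hours=2)),
]
pages = [group[0] for group in collapse(raw)]  # 3 pages instead of 5
```

Real platforms correlate on far richer signals (topology, deploy events, learned patterns), but the shape is the same: many events in, few pages out.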
Beyond suppression, the smarter systems are starting to route differently. By analyzing an alert's payload — including its source, service tags, and severity — the platform can bypass the default schedule and route the incident directly to subject matter experts. A critical P0 incident can automatically page a senior engineer and manager, while a low-priority P2 issue can create a ticket for review during business hours. That distinction — interrupt someone's sleep versus queue it for morning — is something a calendar-based rotation cannot make. The rotation doesn't know what kind of alert it is. The AI does.
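To make the page-versus-ticket distinction concrete, here is a rule-based sketch of payload routing. It is a simplification: a real platform would infer severity and ownership from data rather than a hard-coded table. The `EXPERTS` map, the target names, and the decision to page on P1 are assumptions made for illustration; only the P0 and P2 behavior comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str
    severity: str                                   # "P0", "P1", "P2" (assumed labels)
    service_tags: list = field(default_factory=list)


# Hypothetical ownership map: service tag -> subject matter expert to page.
EXPERTS = {"payments": "payments-oncall", "database": "dba-oncall"}

# Assumption: P0 and P1 interrupt a human; everything else becomes a ticket.
PAGING_SEVERITIES = {"P0", "P1"}


def route(alert, default_oncall="primary-oncall"):
    """Return (action, targets) for an alert based on its payload."""
    if alert.severity not in PAGING_SEVERITIES:
        # Low priority: nobody is woken up, the issue waits for business hours.
        return ("ticket", ["triage-queue"])
    # Prefer owners matched by service tag over the default schedule.
    targets = [EXPERTS[tag] for tag in alert.service_tags if tag in EXPERTS]
    if not targets:
        targets = [default_oncall]
    if alert.severity == "P0":
        # Critical: also bring in a senior engineer and a manager.
        targets += ["senior-engineer", "engineering-manager"]
    return ("page", targets)
```

With these rules, a P0 tagged `payments` pages the payments owner plus the senior engineer and manager, while a P2 with the same tags only opens a ticket.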
SRE teams spend more time debating on-call scheduling than almost any other operational decision, while often overlooking a bigger cause of burnout: the time spent at 3 AM toggling between PagerDuty, Datadog, Slack, and a Google Doc just to assemble the team. A teammate like Beagle, sitting inside Slack, can at least reduce the context-gathering overhead once someone is paged — surfacing recent incident threads, prior postmortem notes, or who last touched a service — so the person who woke up isn't also responsible for remembering everything.
What hasn't changed
AI can reduce noise volume. It can route smarter. It cannot fix the upstream decisions that created the noise in the first place — services that alert on conditions that don't actually represent user impact, thresholds set conservatively years ago and never revisited, monitoring that was added "just in case" and never cleaned up.
When an engineer is paged at 3 AM for something that did not need a human, the rotation schedule did not cause it. An alert fired that should never have existed, and a human being lost two hours of sleep because nobody had ever stopped to ask: does this alert represent a real customer-impacting problem, or is it noise we have learned to live with?
The Catchpoint SRE Report 2025 found that the median share of time spent on operations activities rose to 30% in 2025, up from 25% in 2024. That's time not spent on reliability engineering, automation, or building better systems. It's reactive firefighting instead of proactive engineering. AI that suppresses bad alerts is useful. But it is a second-best answer to the question of why the bad alerts existed in the first place.
The honest version of this conversation is not "how do we use AI to handle our alert volume?" but "why do we have alert volume that requires AI to manage it?" Those are different conversations, and most teams are only having the first one.
Before reaching for an AIOps layer, spend one sprint auditing the last thirty alerts that paged someone after midnight. Ask: what would have had to be true for this to never fire? The answers usually point to a small set of services that account for most of the noise.
Tracking cumulative on-call hours per engineer across every rotation they belong to — and making that visible before it becomes a retention problem — is at least a starting point for the equity side of the equation. But equity in a broken system is still a broken system.
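A cumulative tally of this kind is simple to compute once shifts from every rotation are in one place. The sketch below assumes shifts are available as `(engineer, rotation, start, end)` tuples. It sums raw shift length, so overlapping shifts on two rotations count twice. The function names and the threshold are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def oncall_hours(shifts):
    """shifts: iterable of (engineer, rotation, start: datetime, end: datetime).
    Returns {engineer: total hours} summed across every rotation."""
    totals = defaultdict(float)
    for engineer, _rotation, start, end in shifts:
        totals[engineer] += (end - start).total_seconds() / 3600
    return dict(totals)


def over_threshold(totals, limit_hours):
    """Engineers whose cumulative hours exceed `limit_hours`, heaviest load first."""
    flagged = [engineer for engineer, hours in totals.items() if hours > limit_hours]
    return sorted(flagged, key=lambda engineer: totals[engineer], reverse=True)


# Example with made-up shifts: one engineer sits on two rotations.
shifts = [
    ("ana", "platform", datetime(2025, 3, 1, 9), datetime(2025, 3, 2, 9)),
    ("ana", "payments", datetime(2025, 3, 3, 9), datetime(2025, 3, 3, 21)),
    ("raj", "platform", datetime(2025, 3, 2, 9), datetime(2025, 3, 3, 9)),
]
totals = oncall_hours(shifts)
flagged = over_threshold(totals, limit_hours=30)
```

Looking at one rotation at a time would show both engineers carrying a single 24-hour shift; only the cross-rotation total shows the imbalance.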
The teams making real progress are fixing both sides: they are using AI to reduce the noise they cannot immediately eliminate, while running down the root causes of that noise systematically. The schedule is the last thing they touch, because by the time they get there, the pager has gotten quiet enough that the rotation almost doesn't matter.
That is what sustainable on-call actually looks like. It is not a clever spreadsheet. It is fewer things that shouldn't have fired in the first place.