It is 2:47 PM on a Thursday. An alert fires, a dedicated incident channel spins up, and eight people start typing. Among them is Maya, three weeks into the job, added to the channel because she is on the shadow rotation. She reads every message. She has no idea what she is allowed to do, what she is supposed to be learning, or where the runbook lives. An hour later the incident closes. She learned that Datadog can be very loud.
This is the default state of shadow on-call for most engineering teams. The practice exists: before any engineer carries the pager independently, they should shadow an experienced colleague through real incidents, and that shadow layer should be added to the rotation schedule explicitly, as a structured step in on-call readiness rather than as informal mentorship. But the structure stops at the schedule. Nobody tells Maya what to watch for, and nobody follows up afterward.
The gap between "being in the channel" and "learning incident response" is almost entirely a coordination problem. It does not require more senior-engineer time. It requires someone to do three small things: brief the shadow before the shift, give them a job during the incident, and debrief them when it is over. That someone is usually nobody, which is why the practice fails.
Before the shift starts
Joining a rotation cold is like starting a film in the third act. The new hire cannot page the right team at 11 PM when the escalation path is undocumented or hard to find. Tool sprawl makes it worse: each switch between PagerDuty, Datadog, Slack, and Jira costs orientation time. And the anxiety compounds: every fumbled alert makes the next shift harder to approach calmly, which in turn tends to lengthen time to resolution (MTTR).
The briefing before a shadow shift does not need to be long. Fifteen minutes, or a written note sent the night before, covering four things:
- What services are likely to page this week and why
- Where the relevant runbooks live (specific URLs, not "check Confluence")
- Who the primary is and what the shadow's role actually is - read-only, or allowed to ask questions in-thread
- One previous incident from the past month to read as context
That last item matters more than it sounds. When an incident is run in a dedicated channel, every decision, command output, and status update lives in a single thread that can be scrolled later for the root cause analysis (RCA), with no bouncing between monitoring dashboards, email, and spreadsheets to piece the story together. A new engineer who has read one real incident thread before sitting in on a live one will orient roughly three times faster.
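The four briefing items can be captured as a small template so that none of them gets dropped under load. The following is a minimal sketch, not a prescribed format: the `ShadowBriefing` class, its field names, and the example service, names, and URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShadowBriefing:
    """The four items a pre-shift briefing covers (field names are hypothetical)."""
    likely_pages: dict[str, str]   # service -> why it may page this week
    runbook_urls: list[str]        # specific URLs, not "check Confluence"
    primary: str                   # who carries the pager
    shadow_role: str               # e.g. "read-only" or "may ask questions in-thread"
    past_incident_url: str         # one incident from the past month to read


def render_briefing(b: ShadowBriefing) -> str:
    """Render the briefing as a plain-text note to send the night before."""
    if not b.runbook_urls:
        # A briefing without concrete links repeats the "check Confluence" problem.
        raise ValueError("briefing needs at least one specific runbook URL")
    lines = ["Shadow shift briefing", "", "Likely to page this week:"]
    lines += [f"- {service}: {why}" for service, why in b.likely_pages.items()]
    lines += ["", "Runbooks:"]
    lines += [f"- {url}" for url in b.runbook_urls]
    lines += [
        "",
        f"Primary: {b.primary}",
        f"Your role: {b.shadow_role}",
        "",
        f"Read before the shift: {b.past_incident_url}",
    ]
    return "\n".join(lines)


# Example with made-up values.
briefing = ShadowBriefing(
    likely_pages={"checkout-api": "new release went out Tuesday"},
    runbook_urls=["https://wiki.example.com/runbooks/checkout-api"],
    primary="Sam",
    shadow_role="read-only, questions welcome in-thread",
    past_incident_url="https://example.com/incidents/last-month",
)
print(render_briefing(briefing))
```

Refusing to render a briefing with no runbook links is a deliberate choice in this sketch: it forces whoever writes the note to find the specific URLs up front.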
During the incident
Give the shadow a job. Not a critical one - but a real one.
A shadow shift is when the new engineer observes an experienced colleague handling a real incident without taking any action themselves. A reverse shadow shift flips the dynamic: the new engineer drives the response while the senior colleague observes and guides without taking over. That progression is sound, but even in a pure shadow shift, "no action" is too vague. The shadow ends up scrolling and feeling useless.
A better framing: the shadow owns the running notes. Not the incident doc - that belongs to the incident commander. But a private scratchpad, or a thread reply, logging their own working understanding in real time: what triggered this, what is the current hypothesis, what just changed. This forces active observation instead of passive spectating. It also produces something useful for the debrief.
A second option that works well for P2 and lower incidents: let the shadow post the internal status update - the message to #incidents-status that tells non-technical stakeholders what is happening and when the next update will come. It is low-stakes, it requires understanding the incident well enough to summarize it plainly, and it is a skill they will use constantly once they are on primary rotation.
The debrief nobody does
Most teams wait until a 90-day review to ask new hires how something went. By then the details are fuzzy and the window for improvement has closed. The same logic applies to shadow shifts, on a much shorter timeline: incident details decay fast.
A debrief does not need to be a meeting. A three-question message sent within 24 hours is enough:
- What was the thing you did not understand at the time that you understand now?
- What is the thing you still do not understand?
- If this incident happened again and you were primary, what would you do differently?
The answers inform the next briefing and, over several shifts, build a map of exactly where the new engineer's knowledge is thin. That map is more useful than any onboarding doc.
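One way to turn those answers into a map is to tag each "still do not understand" answer with a short topic and count how often topics recur across shifts. The sketch below assumes someone (the team lead, say) does that tagging by hand; the function names and topic labels are hypothetical.

```python
from collections import Counter

DEBRIEF_QUESTIONS = (
    "What was the thing you did not understand at the time that you understand now?",
    "What is the thing you still do not understand?",
    "If this incident happened again and you were primary, what would you do differently?",
)


def debrief_message(incident_id: str) -> str:
    """Build the three-question debrief prompt for one incident."""
    header = f"Debrief for {incident_id} (a few lines per answer is plenty):"
    body = [f"{n}. {q}" for n, q in enumerate(DEBRIEF_QUESTIONS, start=1)]
    return "\n".join([header, *body])


def gap_map(open_topics_per_shift: list[list[str]]) -> list[tuple[str, int]]:
    """Count how often each still-unclear topic recurs, most frequent first.

    Each inner list holds the hand-tagged topics from one shift's answer
    to the second question.
    """
    counts = Counter(
        topic.strip().lower()
        for shift_topics in open_topics_per_shift
        for topic in shift_topics
    )
    return counts.most_common()


# Example with made-up topics from three shadow shifts.
print(debrief_message("INC-EXAMPLE"))
print(gap_map([["Kafka lag", "escalation path"], ["kafka lag"], ["feature flags"]]))
```

A topic that shows up in several consecutive debriefs is a good candidate for the next briefing note.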
Structured on-call programs reduce the new-hire mentoring burden, returning senior SRE capacity to proactive reliability work instead of hand-holding during incidents. The irony is that teams skip the structure because they think it costs senior time. It costs far less senior time than an unstructured shadow who goes primary too early and pages the wrong person at midnight.
Where an AI teammate fits
The briefing and the debrief are both templated, repeatable, and easy to forget under load. A teammate like Beagle can watch for the pattern - new engineer added to the on-call shadow layer, shift starting tomorrow - and send the briefing note automatically, pulling the relevant runbook links and the most recent closed incident from the right channel. After the incident closes, it sends the debrief prompt and logs the response somewhere the team lead can see it.
None of that is complex. All of it is currently manual, which is why it does not happen.
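To show how small the trigger logic is, here is a minimal sketch of the two checks involved: who needs a briefing tonight, and whether a debrief prompt is still within its window. It is not Beagle's API or any scheduling tool's API. The `Shift` record, the `"shadow"` layer label, and both function names are hypothetical, and the data is assumed to come from whatever export your rotation tool provides.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta


@dataclass
class Shift:
    """One entry from an on-call schedule export (hypothetical shape)."""
    engineer: str
    layer: str        # e.g. "primary", "secondary", "shadow"
    start: datetime


def shadows_starting_tomorrow(shifts: list[Shift], today: date) -> list[Shift]:
    """Return shadow-layer shifts that begin on the day after `today`."""
    tomorrow = today + timedelta(days=1)
    return [s for s in shifts if s.layer == "shadow" and s.start.date() == tomorrow]


def debrief_still_fresh(closed_at: datetime, now: datetime) -> bool:
    """True while the incident closed no more than 24 hours ago."""
    return timedelta(0) <= now - closed_at <= timedelta(hours=24)


# Example with made-up schedule data.
schedule = [
    Shift("maya", "shadow", datetime(2024, 5, 17, 9, 0)),
    Shift("sam", "primary", datetime(2024, 5, 17, 9, 0)),
    Shift("lee", "shadow", datetime(2024, 5, 24, 9, 0)),
]
for shift in shadows_starting_tomorrow(schedule, today=date(2024, 5, 16)):
    print(f"send briefing to {shift.engineer}")
```

Everything else (fetching the schedule, posting the message, logging the reply) depends on the tools in use and is left out here.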
Some scheduling tools now support a dedicated shadow layer: new engineers observe real incidents, build familiarity with runbooks, and develop confidence before carrying the pager solo. If your rotation tool supports this, use it - it makes the shadow enrollment automatic and removes the manual step of adding the right person to the right channel.
The incident channel is one of the densest learning environments on an engineering team. A new engineer who goes through three or four well-structured shadow shifts arrives at their first primary rotation with a real mental model of how the system fails. One who just watches eight people type fast arrives knowing only that Datadog can be very loud.