Keeping on-call calm
Being on-call can be stressful and disruptive. We need to have a reasonable on-call process that balances the needs of the business with the well-being of our team.
At Bellroy, we have found that the key elements of our on-call process are:
We treat our team with kindness
We rotate on-call responsibilities to distribute the workload. We run a two-week shift with at least a month between the end of one shift and the start of the next. Team members get extra time off as compensation for the imposition of taking an on-call shift. If an out-of-hours response is necessary, extra time off is also allocated. Never expect someone who’s been up half the night dealing with fires to complete a full day of work the next day.
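The rest gap in a simple round-robin rotation follows directly from the team size, so it can be checked before a roster is published. The sketch below is illustrative only: it assumes "a month" means 30 days, and the names `rest_between_shifts`, `build_rotation` and the roster are hypothetical, not part of our tooling.

```python
from datetime import date, timedelta
from itertools import cycle, islice

SHIFT_LENGTH = timedelta(weeks=2)   # two-week shift, as described above
MIN_REST = timedelta(days=30)       # assumption: "a month" taken as 30 days


def rest_between_shifts(team_size, shift_length=SHIFT_LENGTH):
    """Time off between one person's shifts in a round-robin rotation."""
    # While one person rests, every other team member takes one shift.
    return (team_size - 1) * shift_length


def build_rotation(team, first_shift_start, shifts):
    """Return (start_date, person) pairs, refusing rosters that rest too little."""
    if rest_between_shifts(len(team)) < MIN_REST:
        raise ValueError("team is too small to give everyone a month off between shifts")
    people = islice(cycle(team), shifts)
    return [(first_shift_start + i * SHIFT_LENGTH, person)
            for i, person in enumerate(people)]


if __name__ == "__main__":
    # Hypothetical four-person roster: each person rests 42 days between shifts.
    for start, person in build_rotation(["Ana", "Ben", "Caz", "Dee"], date(2024, 1, 1), 5):
        print(start.isoformat(), person)
```

Under these assumptions, three people only get 28 days between shifts, so a two-week rotation with a full month of rest needs at least four people.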
Our alerts are actually critical enough to disturb someone
Ensure that a critical-severity alert is actually critical. Nobody should be woken at 3am for a failed background process that can wait until the next day. Paging someone should be reserved for a problem that is causing (or could cause) significant business disruption. Every calculated alarm should have some fault tolerance built in; a single bad data point should not trigger a page. Alarms should be clear, informative and actionable by the team member who receives them.
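One common way to build fault tolerance into an alarm is to page only when several of the most recent data points breach the threshold, rather than any single one. The following is a minimal sketch of that idea, not a description of our monitoring setup; the class name `FaultTolerantAlarm` and the default values are assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FaultTolerantAlarm:
    """Pages only when enough of the most recent data points breach the threshold."""
    threshold: float
    datapoints_to_alarm: int = 3   # hypothetical default: breaches needed to page
    evaluation_periods: int = 5    # hypothetical default: size of the sliding window
    _window: deque = field(init=False, repr=False)

    def __post_init__(self):
        if not 1 <= self.datapoints_to_alarm <= self.evaluation_periods:
            raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
        # Old data points fall out of the window automatically.
        self._window = deque(maxlen=self.evaluation_periods)

    def observe(self, value: float) -> bool:
        """Record one data point and return True if a page should be sent."""
        self._window.append(value > self.threshold)
        return sum(self._window) >= self.datapoints_to_alarm


if __name__ == "__main__":
    alarm = FaultTolerantAlarm(threshold=0.05)
    for error_rate in [0.01, 0.90, 0.01, 0.01]:
        # The single spike to 0.90 breaches the threshold but does not page.
        print(error_rate, alarm.observe(error_rate))
```

With these settings, one spike is absorbed, while three breaching data points within the last five observations trigger a page.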
Our documentation is accessible and up to date
Team members should have access to clear and up-to-date documentation so they can quickly identify and resolve issues. We use Backstage, and we’ve found that storing the documentation in the Git repository alongside the code helps. Keeping documentation up to date is a struggle, as it tends to drift from reality, so we include reviewing and updating documentation in our regular dependency update cycle. We also maintain a knowledge base of incidents and alerts that we’ve seen more than once.
We have clear expectations and escalation paths
Team members should understand their responsibilities and know how long they have to respond to an incident. If an incident is not acknowledged, or can’t be solved by the on-call person, there should be a clear escalation path.
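An escalation path is easiest to follow when it is written down as data rather than left to memory. The sketch below shows one possible shape for such a policy. The contacts, the timings and the names `EscalationStep`, `POLICY` and `contacts_to_notify` are all hypothetical placeholders, not our actual policy.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    notify_after: timedelta  # how long the page may stay unacknowledged first


# Hypothetical policy: contacts and delays are placeholders for illustration.
POLICY = [
    EscalationStep("primary on-call", timedelta(minutes=0)),
    EscalationStep("secondary on-call", timedelta(minutes=15)),
    EscalationStep("engineering manager", timedelta(minutes=30)),
]


def contacts_to_notify(policy, unacknowledged_for):
    """Everyone who should have been paged by now for an unacknowledged incident."""
    return [step.contact for step in policy if unacknowledged_for >= step.notify_after]


if __name__ == "__main__":
    for minutes in (5, 15, 45):
        print(minutes, contacts_to_notify(POLICY, timedelta(minutes=minutes)))
```

Writing the response windows into the policy also makes the expected acknowledgement time explicit for whoever is on call.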
We train our team
Make sure all team members have access to adequate training so they are able to actually fix problems as they arise.
We work towards continuous improvement
No process is perfect and there is always room for improvement. After each significant incident we write a post-mortem to record the resolution and root cause. Then we make recommendations on how to mitigate similar events in the future.
On-call is a necessary evil, but we can make it much less of a burden. Like most online businesses we need support 24/7 to minimise the impact of incidents on our systems. By implementing these elements we have developed a fair process that ensures support for our critical systems and a healthy work-life balance for our team.