Beyond the beep and saving sleep: optimizing the On-Call experience

Behind the curtains of any application there’s a team of on-call engineers keeping the lights on. A bad on-call experience for any team could become a strong driver for employee unhappiness and churn, and result in increased downtime.

Here are some lessons I’ve learned in the trenches, at small start-ups and in larger engineering teams. They should help you improve your on-call shifts, shorten remediation time for production issues, and make sure on-call effort goes where it has the most impact.

Identifying the signal and the noise

In an effective alerting system, every alert paging your on-call engineer should be:

  • Actionable: it’s apparent what action needs to be taken, and a manual action is warranted.
  • Not noisy: its notification level and frequency match the urgency, and it doesn’t introduce alert fatigue.
  • Easy to mute for a specific issue: when the issue is known and resolution is already in progress, notifying again within a certain time window potentially adds no value.
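As a minimal sketch of the muting idea, the snippet below suppresses repeat pages for the same issue inside a time window. The class name `PageMuter`, the issue keys and the 30-minute window are all hypothetical; most alerting tools offer their own grouping or silencing features, and those would be the first thing to reach for.

```python
from datetime import datetime, timedelta


class PageMuter:
    """Suppress repeat pages for the same issue inside a mute window."""

    def __init__(self, mute_window: timedelta):
        self.mute_window = mute_window
        self._last_paged = {}  # issue key -> time of the last page that went out

    def should_page(self, issue_key: str, now: datetime) -> bool:
        last = self._last_paged.get(issue_key)
        if last is not None and now - last < self.mute_window:
            return False  # already paged for this issue recently
        self._last_paged[issue_key] = now
        return True


muter = PageMuter(mute_window=timedelta(minutes=30))
start = datetime(2024, 1, 1, 2, 0)
first = muter.should_page("checkout-errors", start)
repeat = muter.should_page("checkout-errors", start + timedelta(minutes=10))
print(first, repeat)
```

The window is counted from the last page that was actually sent, so an issue that is still ongoing pages again once the window has passed rather than staying silent forever.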

The chapter on monitoring distributed systems in Google’s free SRE book is a fantastic resource on these concepts. It also introduces the ‘four golden signals’ (latency, traffic, errors, saturation) that you can use as a starting point for creating alerts.
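To make the four measurements named above concrete, here is a rough sketch that summarises one time window of requests. The `Request` record, the worker-based notion of saturation and the choice of the 95th percentile are assumptions for illustration; in practice your metrics system would compute these for you.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests, window_seconds, busy_workers, total_workers):
    """Summarise one time window into error rate, latency, traffic and saturation."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile; integer maths avoids float rounding surprises.
    p95 = latencies[-(-95 * total // 100) - 1] if total else 0.0
    return {
        "error_rate": errors / total if total else 0.0,
        "latency_p95_ms": p95,
        "traffic_rps": total / window_seconds,
        # Saturation here is "share of workers busy"; pick whatever resource limits you.
        "saturation": busy_workers / total_workers,
    }


sample = [Request(latency_ms=10.0 * (i + 1), ok=(i != 0)) for i in range(20)]
print(golden_signals(sample, window_seconds=10, busy_workers=3, total_workers=4))
```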

Ideally your team spots issues before your customers are impacted, or before the impact grows. At one end of the pendulum’s swing, you’re not raising the alerts that would make you aware of issues affecting your customers, and you need to track the signal better. In post-mortems, pay close attention to questions such as ‘how could we have noticed this issue sooner?’ and to which alerts were missing or did not work well.

If you find yourself at the other end of the swing, acting mostly on noisy, non-actionable events, you probably need to alert more on impact to business flows and less on ‘infrastructure events’. An infrastructure event is a great indicator of an issue only if you know the system cannot recover from it automatically. When making adjustments, measure closely so that you don’t swing the pendulum too far back the other way.

Measuring business impact

At Customaite, the main source of alerts began as the equivalent of ‘a wild error log has appeared in the logs’, which demanded the same urgency for a breaking bug as for a single service request time-out. This works well in the early stage of a product, but it fails to scale: too much attention is spent on noise, which introduces alert fatigue. Instead of alerting on any (low-level) error, track when an operation important to a business flow has failed. This might be an API failure rate, or it might require more custom metrics to be set up. Blackbox monitoring is another useful technique: it checks whether a flow still works end-to-end by exercising it in production, in a way that’s hidden from other users.
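A blackbox check can be as small as a function that runs the flow and reports whether it finished, and how fast. The sketch below is only an outline: `run_probe`, the flow callables and the time limit are hypothetical names, and a real probe would drive your production flow with a dedicated test account on a schedule.

```python
import time
from typing import Callable


def run_probe(name: str, flow: Callable[[], None], max_seconds: float) -> dict:
    """Run one end-to-end check and report whether the flow still works."""
    started = time.monotonic()
    try:
        flow()  # e.g. log in as a test account, submit a test item, check the result
        error = None
    except Exception as exc:  # any failure counts: a real user would hit it too
        error = repr(exc)
    elapsed = time.monotonic() - started
    return {
        "probe": name,
        "ok": error is None and elapsed <= max_seconds,
        "seconds": elapsed,
        "error": error,
    }


def healthy_flow() -> None:
    pass  # stand-in for a real end-to-end scenario


def broken_flow() -> None:
    raise RuntimeError("checkout failed")


print(run_probe("checkout", healthy_flow, max_seconds=5.0)["ok"])
print(run_probe("checkout", broken_flow, max_seconds=5.0)["ok"])
```

Treating a slow run as a failure is a design choice: from the user’s side, a flow that takes too long is often as broken as one that errors.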

To help define what’s important to alert on, set and periodically review your Service Level Objectives (SLOs) with the product stakeholders. What are the availability and performance targets for the flows you deem critical? Measure your current state, translate the targets into minutes of downtime and numbers of failing flows, and adjust as necessary.
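The translation from a percentage to something tangible is simple arithmetic, sketched here. The function names and the 30-day window are assumptions; the target is passed as a string so that `Fraction` can parse it exactly and avoid floating-point rounding.

```python
from fractions import Fraction


def error_budget(slo_percent: str) -> Fraction:
    """Fraction of time (or requests) allowed to fail under the target."""
    return 1 - Fraction(slo_percent) / 100


def allowed_downtime_minutes(slo_percent: str, window_days: int = 30) -> float:
    return float(error_budget(slo_percent) * window_days * 24 * 60)


def allowed_failures(slo_percent: str, total_requests: int) -> int:
    return int(error_budget(slo_percent) * total_requests)


for target in ("99", "99.9", "99.95"):
    print(target, allowed_downtime_minutes(target), allowed_failures(target, 1_000_000))
```

A 99.9% target over 30 days, for example, leaves 43.2 minutes of downtime, or 1,000 failures per million requests. Numbers like these are usually easier to discuss with stakeholders than the percentage itself.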

Treat any breach of these SLOs as a high-priority bug. Issues that stay within the targets consume from the ‘error budget’, and as long as budget remains you can keep your focus on developing the product. Start with targets for the main flows, and work down from there to what they imply for the APIs those flows depend on. Often a flow is too tightly coupled to other APIs (breaking when a dependency breaks) when it could degrade gracefully instead (such as by omitting certain fields or components).
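Graceful degradation often comes down to deciding which dependencies are optional and catching their failures at the edge. The sketch below is illustrative only: the order view, the recommendations dependency and the `degraded` flag are made-up names, not part of any system described here.

```python
from typing import Callable


def build_order_view(order: dict, fetch_recommendations: Callable[[str], list]) -> dict:
    """Build a response, leaving out the optional part if its dependency fails."""
    view = {"id": order["id"], "total": order["total"]}
    try:
        view["recommendations"] = fetch_recommendations(order["id"])
    except Exception:
        # Optional dependency is down: omit the field rather than fail the whole flow.
        view["recommendations"] = None
        view["degraded"] = True  # lets you count degraded responses as a metric
    return view


def recommendations_down(order_id: str) -> list:
    raise TimeoutError("recommendation service timed out")


print(build_order_view({"id": "o-1", "total": 42}, recommendations_down))
```

The core flow still succeeds here, so an outage of the optional dependency eats into its own targets instead of paging someone for the main flow.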

On my previous team at Uber, the error-rate alert thresholds were left mostly at their default settings and were not updated for a couple of years. This sometimes resulted in too much attention going to the error rates of some APIs, especially ones that were outdated and no longer actively supported. For an API with low throughput, a single error can be enough to miss the target and raise an alert during the night. Regularly review your SLOs, and ensure the availability targets match what’s expected.
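The low-throughput problem is easy to see with numbers: one failure in 50 requests is a 98% success rate, which breaches a 99.9% target on its own. One possible mitigation, shown here as an assumption rather than a recommendation from any particular tool, is to skip the ratio check when the window has too few requests.

```python
def should_alert(failed: int, total: int, target_success: float, min_requests: int) -> bool:
    """Alert on success ratio, but only when there is enough traffic to judge it."""
    if total < min_requests:
        return False  # too few requests for the ratio to mean much
    return (total - failed) / total < target_success


# One error out of 50 requests against a 99.9% target:
print(should_alert(failed=1, total=50, target_success=0.999, min_requests=0))
print(should_alert(failed=1, total=50, target_success=0.999, min_requests=1000))
```

The guard has a cost: a low-traffic API that fails completely would also stay silent, so it may need a separate check (such as a blackbox probe) or a longer evaluation window.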

Getting out of a bad on-call experience

If your team scores low on on-call topics during team health retros, or team members are often woken up at night by noisy errors or are busy chasing red herrings during the day, you may be suffering from chronic bad on-call-itis. With the critical flows defined in SLOs, and guidelines agreed in your team on how to tune alerts, you’re ready to start turning the ship around.

Your observability setup will need constant tuning to keep up with changes to the application and with shifts in how your users behave. The on-call engineer should therefore have dedicated time for fixing the ‘run’ of systems: tuning alert descriptions and thresholds, removing or adding alerts, applying fixes and reliability improvements to the systems … If they are also given feature work, ensure that continuous improvement work does not disappear in the face of feature deadlines that did not account for a structural on-call budget.

Measure regularly how your on-call experience evolves, for example by counting non-actionable pages during the night. On-call shifts are experienced differently (some may find those noisy wake-ups acceptable, or have grown used to them), so I’ve had mixed results with recording a general mood indicator as part of the on-call hand-over. Because of the rotational nature of the job, improvements may also not be visible to the rest of the team. Highlight what has changed, based on the numbers and on the effort from teammates.
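Counting night-time pages that needed no action takes only a few lines once pages are recorded with a timestamp and an ‘actionable’ flag. The `Page` record and the 22:00 to 07:00 definition of night are assumptions for this sketch; use whatever your paging tool exports and whatever hours match your team.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    fired_at: datetime  # local time of the engineer who was paged
    actionable: bool    # filled in at hand-over: did this page need manual action?


def night_noise(pages, night_start=22, night_end=7):
    """Count pages during the night and how many of them needed no action."""
    night = [p for p in pages
             if p.fired_at.hour >= night_start or p.fired_at.hour < night_end]
    noisy = [p for p in night if not p.actionable]
    return {"night_pages": len(night), "non_actionable_night_pages": len(noisy)}


shift = [
    Page(datetime(2024, 1, 1, 23, 15), actionable=False),
    Page(datetime(2024, 1, 2, 3, 40), actionable=True),
    Page(datetime(2024, 1, 2, 14, 5), actionable=False),
]
print(night_noise(shift))
```

Tracking this per shift gives the team a number to point at when showing that tuning work is paying off.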

To gain some traction on the backlog of alerts, it may be beneficial to start off with a reduced rotation. This comes at a cost for those engineers, especially if they are following up on other projects at the same time. On the other hand, it may be the quickest way to make progress on this important topic and to set the right example.

In conclusion

Don’t give up, and don’t accept a less optimal situation just because it’s always been done that way. Keep iterating, and you should be able to lower alert fatigue, improve incident response times, and foster a healthier work environment.

Thanks for reading! If you liked this article, you may also like one of the most popular posts: How to make software architecture trade-off decisions, or How to get started with Threat Modeling, before you get hacked.