Welcome to the real world, where things don’t always go your way. You’ve designed your systems to be highly available, scalable, and resilient, and yet sometimes they fail anyway. These failures, if used correctly, can be a powerful lever for gaining a deep understanding of how your systems actually work, and a tool for spreading knowledge through your engineering community. In this session, we will cover some of AWS’s favourite techniques for defining and reviewing metrics – watching the systems before they fail – as well as how to run an effective post-mortem that drives both learning and meaningful improvement.
Becky Weiss, Senior Principal Engineer, Amazon Web Services