Chaos Engineering: Falling over without falling over

As applications move online and automation extends to control more of the world around us, software failures have an increasing impact on business outcomes and safety. We need to develop more resilient systems, and that can’t be left as an operational concern. Resilience needs to be architected into the application code, and operability is one of the most important attributes of a resilient system. We’ve seen many examples of failures escalating: a small initial problem causes poorly designed and poorly tested error handling code and procedures to fail in ways that magnify the problem and take out the whole system. What can we do about this? To start with, it’s a shared responsibility to build and operate systems that are observable, controllable, and resilient. With the integration of roles from DevOps practices and the automation provided by cloud providers, we need to adapt the common concepts and terminology that already exist in resilient systems design to cloud native architectures.

Adrian Cockcroft, VP Cloud Architecture Strategy, Amazon Web Services