Amazon Web Services delivered a detailed postmortem on its widespread outage this week, and perhaps the biggest takeaway is that automation can go wrong.

On Monday, many websites and apps were down or severely hampered by a DynamoDB bug that caused connection failures across AWS. Automation both caused the outage and made it harder to recover from, even though automation is normally what keeps AWS' complex systems running.

The lessons here are likely to be valuable given that the tech sector is bullish on the promise of AI agents, which will presumably make decisions autonomously and continually optimize processes. What happens when those automations go bad and amplify problems instead of fixing them?

According to AWS, the root cause of the outage was the cloud provider's automated DNS management system for DynamoDB. AWS has a system called DNS Planner and Enactor to automatically publish DNS records for internal and external endpoints.

A "race condition," a bug in which the outcome depends on the timing of concurrent operations, caused the automation to write an empty or incorrect DNS record for DynamoDB's main endpoint in the US-East-1 region. The system that was supposed to notice the issue didn't trigger because of the same automation logic.
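To make the failure mode concrete, here is a minimal sketch of one common guard against this kind of race: a writer that refuses to apply a plan older than the one already in place, and refuses to publish an empty record set. It is an illustration only, not AWS' implementation. The names `DnsPlan`, `EndpointRecordStore`, `StalePlanError` and `EmptyPlanError` are hypothetical, as is the assumption that each plan carries an increasing version number.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsPlan:
    """Hypothetical plan: a version number plus the records to publish."""
    version: int
    records: tuple[str, ...]


class StalePlanError(Exception):
    """Raised when a plan is not newer than the one already applied."""


class EmptyPlanError(Exception):
    """Raised when a plan would publish no records at all."""


class EndpointRecordStore:
    """Holds the live records for one endpoint (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.applied_version = 0
        self.records: tuple[str, ...] = ()

    def apply(self, plan: DnsPlan) -> None:
        # Never publish an empty record set for a live endpoint.
        if not plan.records:
            raise EmptyPlanError(f"plan {plan.version} has no records")
        # The version check and the write happen under one lock, so a slow
        # writer holding an old plan cannot overwrite a newer one.
        with self._lock:
            if plan.version <= self.applied_version:
                raise StalePlanError(
                    f"plan {plan.version} is not newer than "
                    f"{self.applied_version}"
                )
            self.records = plan.records
            self.applied_version = plan.version


store = EndpointRecordStore()
store.apply(DnsPlan(version=2, records=("10.0.0.2",)))
try:
    # A delayed writer arrives late with an older plan.
    store.apply(DnsPlan(version=1, records=("10.0.0.1",)))
except StalePlanError as err:
    print("rejected:", err)
```

The point of the sketch is that the check and the write are a single atomic step. Checking freshness first and writing later leaves a window in which a newer plan can land in between.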

DynamoDB then went down for the count. The outage brought down multiple services, including AWS Lambda, CloudFormation and EC2. Load balancers designed to automatically replace faulty EC2 instances failed, and further automated recovery attempts made the load issues worse. In short, self-healing did just the opposite.

AWS was forced to stop automation jobs to recover. Automation typically keeps AWS running seamlessly, but in this case it amplified the problems because the underlying logic was wrong.

For now, AWS has disabled the DNS Planner and Enactor automation until it adds safeguards. AWS is also building in manual reviews and adding rate limiters to prevent automated systems from changing too much too fast. EC2 will get an additional test suite to augment AWS' existing scale testing.
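As a rough illustration of what "rate limiting" an automated system can mean, the sketch below uses a token bucket to cap how many changes automation may make in a given window. It is one common technique, not a description of what AWS is building. The class name `ChangeRateLimiter` and its parameters are hypothetical, and the clock is injectable only so the example can be exercised without waiting.

```python
import time
from typing import Callable


class ChangeRateLimiter:
    """Token bucket: allow at most `max_changes` per `per_seconds` window."""

    def __init__(
        self,
        max_changes: int,
        per_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = float(max_changes)
        self.refill_rate = max_changes / per_seconds  # tokens per second
        self.tokens = float(max_changes)              # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True if one more automated change may proceed now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


limiter = ChangeRateLimiter(max_changes=2, per_seconds=10.0)
for change in ("update-a", "update-b", "update-c"):
    if limiter.allow():
        print("applying", change)
    else:
        # Assumption: a denied change is held for a human to review.
        print("holding", change, "for manual review")
```

A denied change does not have to be dropped. Routing it to a person is one way to combine a rate limit with the manual reviews AWS says it is adding.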

Lessons learned:

  • Start planning for what happens if automation goes bad.
  • Design for redundancy.
  • Think through automation resiliency.
  • The AWS outage may turn out to be just the first time we notice the downside of automation.
