AWS delivers outage post mortem: When automation bites back

Published October 24, 2025

Amazon Web Services has published a detailed post mortem on this week's widespread outage, and perhaps the biggest takeaway is that automation can go wrong.

On Monday, many websites and apps were down or severely hampered by a DynamoDB bug that caused connection failures across AWS. Simply put, automation caused the outage and then made recovery harder. Normally, automation is what keeps AWS' complex systems running.

The lessons here are likely to be valuable given that the tech sector is bullish on the promise of AI agents, which will presumably make decisions autonomously and continually optimize processes. What happens when those automations go bad and amplify problems instead of fixing them?

According to AWS, the root cause of the outage was the cloud provider's automated DNS management system for DynamoDB. That system is built around a DNS Planner and a DNS Enactor, which automatically publish DNS records for internal and external endpoints.

A race condition in the automation caused the system to write an incorrect, empty DNS record for DynamoDB's main endpoint in the US-EAST-1 region. The check that was supposed to notice the issue didn't trigger because of the same automation logic.
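One common guard against this kind of race is to version every plan and refuse to apply anything stale or empty. The sketch below is illustrative only and is not AWS' implementation: `RecordStore`, `apply_plan` and the addresses are hypothetical names and values, and a real system would need the check and the write to happen atomically.

```python
from dataclasses import dataclass, field


@dataclass
class RecordStore:
    """Hypothetical in-memory stand-in for a DNS record set (not an AWS API)."""
    version: int = 0
    addresses: list = field(default_factory=list)

    def apply_plan(self, plan_version: int, addresses: list) -> bool:
        """Apply a plan only if it is newer than the current one and non-empty."""
        if plan_version <= self.version:
            return False  # stale plan from a delayed worker: reject it
        if not addresses:
            return False  # never publish an empty record set
        self.version = plan_version
        self.addresses = list(addresses)
        return True


store = RecordStore()
applied_new = store.apply_plan(2, ["10.0.0.2"])    # fast worker applies plan 2
applied_stale = store.apply_plan(1, ["10.0.0.1"])  # slow worker arrives late with plan 1
applied_empty = store.apply_plan(3, [])            # a bad plan with no addresses
```

Here only the first call changes the store. The late, older plan and the empty plan are both refused, so the record keeps its last known-good addresses.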

DynamoDB then went down for the count. The outage brought down multiple services including AWS Lambda, CloudFormation and EC2. Load balancers designed to automatically replace faulty EC2 instances failed. More automation made load issues worse. Simply put, self-healing did just the opposite.

AWS was forced to stop automation jobs to recover. Automation typically keeps AWS running seamlessly, but in this case it amplified the problems because the underlying logic was wrong.

For now, AWS has disabled the DNS Planner and Enactor automation until it adds safeguards. AWS is also adding rate limiters to prevent automated systems from changing too much too fast and building in manual reviews. EC2 will get an additional test suite to augment AWS' existing scale testing.
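A rate limiter for automated changes can be as simple as a token bucket. The following is a minimal sketch under assumed parameters, not a description of what AWS is building: `ChangeRateLimiter`, its capacity and its refill rate are hypothetical, and the clock is injected so the behavior can be checked without waiting.

```python
import time


class ChangeRateLimiter:
    """Token bucket: at most `capacity` changes in a burst, refilled at `rate_per_sec`."""

    def __init__(self, capacity: int, rate_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate_per_sec = rate_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if one more automated change may proceed right now."""
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# A fake clock keeps the example deterministic.
fake_now = [0.0]
limiter = ChangeRateLimiter(capacity=2, rate_per_sec=1.0, clock=lambda: fake_now[0])
burst = [limiter.allow() for _ in range(3)]  # third change in the same instant is refused
fake_now[0] = 1.0                            # one second later, one token has refilled
after_wait = limiter.allow()
```

An automation job would call `allow()` before each change and back off, or escalate to a human, when it returns False.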

Lessons learned:

Start planning for what happens if automation goes bad.
Design for redundancy.
Think through automation resiliency.
The AWS outage may wind up being just the first time we notice the downside of automation.

Larry Dignan

Editor in Chief of Constellation Insights
Constellation Research

Larry Dignan is Editor in Chief of Constellation Insights at Constellation Research, where he leads editorial coverage focused on enterprise technology, digital transformation, and emerging trends shaping the future of business. He oversees research-driven news, analysis, interviews, and event coverage designed to help technology buyers and vendors navigate complex markets with clarity and context. ...
