Simple DNS Error Caused Major AWS Outage


AWS has shared more details about last week's outage, which had global repercussions. A single error in DynamoDB's DNS management system set off an unfortunate chain reaction.

On October 20, a large portion of the internet became unreachable due to an outage at AWS. Many popular websites and applications such as Roblox, Reddit, Asana, and Signal were impacted, as was the web store of parent company Amazon. Although the outage originated in a single U.S. region (us-east-1), the effects were felt globally.

AWS promised to share more information and has since released an incident report. The cause of the outage was remarkably simple: a single software error in a DNS management system was enough to temporarily cripple multiple AWS services. The outage affected DynamoDB, EC2, Lambda, Redshift, and the AWS Support Center, among others.

The cause was a race condition in DynamoDB's automated DNS management system, AWS writes. The race led to an outdated DNS plan being applied, after which an automated cleanup process deleted the active plan, leaving the regional endpoint without a valid DNS record. As a result, customers and other AWS services could no longer connect to DynamoDB in the affected region.
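AWS has not published the code involved, but the failure mode it describes reads like a textbook last-writer-wins race. The Python sketch below is a toy model under assumed names (apply_plan, cleanup, and records are invented for illustration, not AWS's system): a slow "enactor" overwrites a newer plan with a stale one, and a cleanup pass that trusts generation numbers then deletes the only record that exists.

```python
import threading
import time

# Hypothetical, simplified model of the failure mode in the report. The
# names (apply_plan, cleanup, records) are invented; this is not AWS code.

records = {}  # the "live" DNS table: endpoint -> plan generation

def apply_plan(generation, delay):
    """An 'enactor' applies a DNS plan after some processing delay."""
    time.sleep(delay)
    # BUG: no check whether a newer generation is already live, so a slow
    # enactor can overwrite a fresh plan with a stale one.
    records["dynamodb.us-east-1"] = generation
    print(f"applied plan generation {generation}")

def cleanup(latest_known):
    """A cleanup pass deletes plans it believes are obsolete. Because the
    stale plan was just re-applied, it deletes the only (active) record,
    leaving the endpoint with no DNS entry at all."""
    if records.get("dynamodb.us-east-1", latest_known) < latest_known:
        del records["dynamodb.us-east-1"]
        print("cleanup removed the active record -> empty DNS answer")

fast = threading.Thread(target=apply_plan, args=(2, 0.0))  # current plan
slow = threading.Thread(target=apply_plan, args=(1, 0.2))  # stale plan, delayed
fast.start(); slow.start()
fast.join(); slow.join()

cleanup(latest_known=2)     # sees generation 1 < 2 and deletes it
print("records:", records)  # {} -- the regional endpoint has vanished
```

The usual guard against this class of bug is a compare-and-swap on the plan generation before writing, so a stale enactor's write is rejected rather than applied.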

Domino Effect

The outage had a domino effect on other AWS services that was felt worldwide. EC2 instances that were already running continued to operate, but new launches failed because the DropletWorkflow Manager (DWFM), which relies on DynamoDB, could no longer maintain its leases on droplets, the physical servers that host EC2 instances.

After DynamoDB was restored, DWFM had to re-establish leases with thousands of droplets at once. Re-establishing each lease took so long that leases began timing out before the work completed, stalling the system in a state of congestive collapse that was only resolved by restarting DWFM hosts.
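The report does not describe DWFM's lease protocol in detail, so the sketch below is only a schematic model of congestive collapse, with invented parameters (LEASE_TIMEOUT, RENEW_COST) and a hypothetical drain function: once the renewal backlog takes longer to process than a lease remains valid, leases renewed early in a pass have already expired again by the time the pass ends.

```python
# A toy model of lease re-establishment after the outage. DWFM's actual
# protocol is not public; LEASE_TIMEOUT, RENEW_COST, and drain() are
# assumptions made purely for illustration.

LEASE_TIMEOUT = 1.0  # seconds a lease stays valid after renewal (assumed)
RENEW_COST = 0.01    # assumed time to renew a single droplet lease

def drain(backlog):
    """Renew every lease in the backlog, one at a time, on a simulated
    clock. If the whole pass takes longer than LEASE_TIMEOUT, the leases
    renewed early in the pass have already expired again by the end."""
    clock = 0.0
    renewals = []
    for droplet in backlog:
        clock += RENEW_COST
        renewals.append((droplet, clock))  # droplet renewed at this time
    # leases renewed more than LEASE_TIMEOUT before the pass ended are stale
    return [d for d, t in renewals if clock - t > LEASE_TIMEOUT]

print("re-expired:", len(drain(range(50))))    # normal load -> 0
print("re-expired:", len(drain(range(5000))))  # outage backlog -> ~4900
```

In this toy model the small backlog drains cleanly while the outage-sized one re-expires almost everything, which mirrors why restarting the hosts, and thereby shedding the queued work, broke the cycle.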

At the same time, delays in propagating network state caused issues with the Network Load Balancer (NLB). New EC2 instances could not be correctly registered with the NLB, causing their health checks to fail. This led to erroneous failovers and increased connection errors.
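The interplay between registration delays and health checks can be made concrete with a minimal model. Nothing below reflects NLB internals; the Target class, the propagated flag, and the failure threshold are assumptions chosen for illustration.

```python
# A simplified picture of the health-check interaction. The real NLB logic
# is more involved; Target, 'propagated', and FAIL_THRESHOLD are assumptions.

FAIL_THRESHOLD = 3  # consecutive failed checks before a target is removed

class Target:
    def __init__(self, name, propagated):
        self.name = name
        self.propagated = propagated  # has network state reached this host?
        self.failures = 0

    def health_check(self):
        # A freshly launched instance only answers health checks once its
        # network state has propagated; until then the check times out.
        return self.propagated

def run_checks(targets, rounds):
    """Run 'rounds' of health checks and return the targets that would be
    pulled from rotation (erroneously, if they are merely slow to wire up)."""
    for _ in range(rounds):
        for t in targets:
            t.failures = 0 if t.health_check() else t.failures + 1
    return [t.name for t in targets if t.failures >= FAIL_THRESHOLD]

targets = [Target("old-1", True), Target("new-1", False), Target("new-2", False)]
print("removed from rotation:", run_checks(targets, rounds=3))
# -> ['new-1', 'new-2']: traffic concentrates on the one remaining target
```

The point of the sketch is the amplification: instances that are actually healthy but not yet reachable get ejected, pushing more traffic onto fewer targets.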

Other services such as Lambda, Redshift, ECS, EKS, and Fargate also experienced disruptions. Redshift clusters could not authenticate IAM users, and some clusters became unusable due to failed recovery actions. ECS, EKS, and Fargate saw delays in launching containers, and the AWS Support Center became temporarily unreachable due to erroneous metadata.

One Bug, Big Impact

The incident highlights how essential the services of major cloud providers have become. Even an outage in a single regional data center can have global consequences. According to early expert estimates, the financial impact of the outage will run into billions of dollars: downtime costs money.

AWS has disabled the affected DNS automation worldwide and is working on a structural fix. It also promises additional measures for the other affected services to further improve the resilience and recovery speed of its infrastructure.