When DynamoDB Sneezed, the Cloud Caught a Cold: A Deep Dive Inside the AWS Outage of October 20, 2025


DNS Caused It!

A ripple turned into a wave when Amazon Web Services (AWS), the backbone for much of the modern internet, suffered a massive outage. For hours, users worldwide couldn’t access go-to apps like Netflix, Snapchat, Reddit, and Duolingo, or operate smart home devices like Ring and Alexa. The incident exposed a stark truth: the infrastructure that supports our digital lives is far more fragile and centralized than many of us realize.

AWS systems started reporting increased error rates, particularly around the DynamoDB API endpoints in its us-east-1 (Northern Virginia) region. As DNS resolution failed, many client services could not locate essential AWS internal endpoints, triggering cascading failures across multiple AWS subsystems.
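To make the failure mode concrete, here’s a minimal sketch (illustrative only, not taken from AWS’s post-mortem) of how a client observes this kind of DNS failure: if the regional endpoint hostname stops resolving, every call that targets it fails before a request ever reaches DynamoDB.

```python
import socket

# The public regional endpoint that SDK clients resolve before opening a connection.
DYNAMODB_ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def can_resolve(hostname: str) -> bool:
    """Return True if DNS resolves the hostname to at least one address."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror as err:
        # During the outage, clients hit failures like this before any
        # request ever reached the DynamoDB service itself.
        print(f"DNS resolution failed for {hostname}: {err}")
        return False

if __name__ == "__main__":
    print("resolvable" if can_resolve(DYNAMODB_ENDPOINT) else "unreachable")
```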

What Went Wrong?

The trigger was a DNS resolution failure for regional DynamoDB service endpoints in us-east-1.

Amazon DynamoDB is a fully managed, distributed NoSQL database service. AWS doesn’t just offer DynamoDB to customers; it also uses DynamoDB internally as a foundational data store for many of its own services.

In simple terms: DynamoDB is the “internal memory” of many AWS systems.

Across AWS, hundreds of services use DynamoDB as a control-plane data store to track configurations, session state, tokens, routing data, workloads, and user metadata. When a service starts, scales, or interacts with another service, it often needs to read or write to DynamoDB. For instance, when you start an EC2 instance, the request metadata (AMI, region, key pairs, security groups) passes through control-plane APIs that query DynamoDB. When you invoke a Lambda function, the system checks configuration and version data stored in DynamoDB.
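To illustrate the pattern (the table and key names here are hypothetical; AWS’s internal control-plane schemas are not public), a control-plane-style configuration read against DynamoDB with boto3 looks roughly like this. Many workflows perform reads in this spirit before they can act:

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical table and key names; used here only to show the dependency.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
config_table = dynamodb.Table("service-config")

def load_service_config(service_name: str):
    """Fetch a service's configuration record before it can start, scale, or deploy."""
    try:
        response = config_table.get_item(Key={"service_name": service_name})
        return response.get("Item")  # None if no record exists
    except EndpointConnectionError:
        # If the DynamoDB endpoint cannot be resolved or reached,
        # the dependent workflow stalls right here.
        return None
    except ClientError as err:
        raise RuntimeError(f"Config lookup failed: {err}") from err
```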

If DynamoDB’s API or DNS endpoints become unavailable, those control-plane operations stall across dozens of dependent services.

Why It Becomes a Single Point of Failure

us-east-1 dominance: historically, this region has been AWS’s control hub, where global metadata for many services is housed (for legacy and cost reasons). So when DynamoDB falters in us-east-1, the ripple effect can be global.

DNS (for endpoint resolution) + DynamoDB (for service state) form an interlinked core. A failure in either can trigger a chain reaction.

Tight coupling at the metadata level: many AWS services can’t boot, deploy, or scale without reading their configuration state from DynamoDB.


How This Played Out

The root cause was a DNS resolution failure that made DynamoDB endpoints in us-east-1 unreachable.

Many internal AWS services relying on DynamoDB for metadata and orchestration suddenly couldn’t read or write their configuration data.


This meant: 

  • Lambda couldn’t fetch invocation configs.
  • API Gateway couldn’t serve routing rules.
  • CloudFormation stacks paused mid-deployment.
  • CloudWatch alarms froze.
  • ECS / EKS couldn’t synchronize container state.

Even unrelated public apps (Snapchat, Venmo, etc.) went down because their backends depended on those AWS services.
That’s why a single outage in DynamoDB caused such a massive internet-wide cascade. When it sneezes, much of AWS catches a cold.

Lessons & Risks Revealed

This outage underscores many themes that tech organizations, businesses, and societies must reckon with:

Single points / chokepoints of failure
Even though AWS is distributed, its heavy concentration in us-east-1 and its dependency on a few internal services (DNS, DynamoDB) mean that a single failure there can have large global ripple effects.

Dependency / transitive risk
Many companies don’t manage, or even fully understand, their chain of dependencies: you rely on X, which relies on AWS, which relies on Y. A failure in an underlying cloud service can cascade across all of those layers.

Overcentralization of digital infrastructure
A handful of cloud providers now power large swathes of the internet. Failures at these providers affect many “independent” services simultaneously.

Need for better isolation, blast radius control, redundancy
Architectural strategies (circuit breakers, microservice isolation, multi-region fallback, graceful degradation) become critical; a minimal fallback sketch follows below.
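As one sketch of multi-region fallback (the table name and the secondary region are placeholders, and this assumes the data is replicated across regions, e.g. as a DynamoDB global table), a client can fail fast on the primary endpoint and retry against a second region:

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

# Placeholder table name; assumes the data is replicated to both regions.
TABLE_NAME = "service-config"
REGIONS = ("us-east-1", "us-west-2")  # primary first, fallback second

# Fail fast instead of hanging on an endpoint that will never answer.
FAST_TIMEOUTS = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1})

def get_item_with_fallback(key: dict):
    """Try the primary region first; on endpoint or service errors, try the fallback."""
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region, config=FAST_TIMEOUTS)
        try:
            response = client.get_item(TableName=TABLE_NAME, Key=key)
            return response.get("Item")
        except (EndpointConnectionError, ClientError) as err:
            print(f"{region} unavailable ({err}); trying next region")
    return None

# Example usage (low-level attribute-value format):
# get_item_with_fallback({"service_name": {"S": "checkout"}})
```

The short timeouts matter as much as the second region: during a DNS or endpoint failure, the worst outcome is threads hanging indefinitely on connections that will never complete.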

Chaos engineering
Having tooling to detect anomalies early, run “what if” failure scenarios, simulate outages, and verify graceful fallback is essential; a minimal drill sketch follows below.
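One lightweight way to rehearse this exact failure is to inject it in a test environment. The sketch below is hypothetical (it uses Python’s unittest.mock rather than any particular chaos-engineering tool) and simply breaks DNS resolution for the DynamoDB endpoint so you can verify that a given code path degrades gracefully instead of crashing or hanging:

```python
import socket
from unittest import mock

# Hostname whose resolution we deliberately break during the drill.
FAULT_HOST = "dynamodb.us-east-1.amazonaws.com"
_real_getaddrinfo = socket.getaddrinfo

def _broken_getaddrinfo(host, *args, **kwargs):
    """Simulate the outage: DNS fails only for the targeted endpoint."""
    if isinstance(host, str) and FAULT_HOST in host:
        raise socket.gaierror("simulated DNS failure (game day drill)")
    return _real_getaddrinfo(host, *args, **kwargs)

def run_dns_outage_drill(check_fn) -> bool:
    """Run check_fn with DynamoDB DNS broken; it should degrade gracefully,
    not raise or hang."""
    with mock.patch("socket.getaddrinfo", side_effect=_broken_getaddrinfo):
        try:
            check_fn()
            return True   # the code path handled the simulated failure
        except Exception as err:
            print(f"Drill failed: unhandled {type(err).__name__}: {err}")
            return False
```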

What Should AWS & the Cloud Ecosystem Do?

  • Encourage customers to adopt multi-region, multi-cloud strategies and provide tooling to ease cross-cloud portability.

  • Use more aggressive failover and fallback routes (e.g. alternate DNS resolvers, redundant endpoints).

  • Improve change management and configuration correctness via more rigorous review, testing, and “canary” deployments.

  • Support hybrid / edge / decentralized models to reduce dependency on a few mega cloud hubs.

  • Invest in stronger isolation between core subsystems so that cascading failures are harder to trigger (e.g. decoupling DNS / routing from the data plane).

This AWS outage was a brutal reminder that, no matter how robust or ubiquitous it seems, even the world’s largest cloud infrastructure is vulnerable. For millions globally, it revealed how dependent our digital lives are on a handful of systems and providers. But rather than despair, we can treat this moment as a catalyst for change, pushing architects, businesses, and cloud operators to build more resilient, decentralized, and transparent systems.

For any team that relies on cloud services: revisit your dependencies, test your failure modes, and never underestimate a “small” config change in a critical system. The next outage may not give you a second chance.


