AWS Outages 2023 and 2025: When the Internet Backbone Faltered

October 25, 2025

Amazon Web Services (AWS) powers an extraordinary portion of the internet's infrastructure. From streaming services and social media platforms to banking systems and government portals, AWS quietly underpins the digital experiences of billions of users worldwide. It's the invisible engine that keeps modern computing running—until it doesn't.

In the span of just over two years, AWS experienced two major outages that exposed the fragility lurking beneath the cloud's promised reliability. The first, in June 2023, stemmed from a software defect that had never been triggered before. The second, beginning late on October 19, 2025 and extending through October 20, resulted from a race condition between automated systems. Both disrupted millions of users and reminded the world of a critical truth: even the most sophisticated infrastructure can fail, and when it does, the ripple effects are global.

This is the story of what happened, what went wrong, and what it means for the future of cloud computing.


June 13, 2023: The Lambda Capacity Crisis

What Happened

On June 13, 2023, at 11:49 AM PDT, AWS's US-EAST-1 region in Northern Virginia began experiencing a severe service disruption. Lambda function invocations did not start returning to normal until 1:45 PM PDT, and all affected services had fully recovered by 3:37 PM PDT — roughly 3 hours and 48 minutes after the incident began.1 AWS identified a Lambda capacity-management subsystem as the source, with cascading impact across more than 100 AWS services per third-party analysis (ThousandEyes).2

Major organizations and apps reported as impacted by news outlets and AWS-customer postings included:

  • The Boston Globe — digital publishing operations disrupted (around 2:45 PM EDT)2
  • New York MTA — transit information systems affected2
  • The Associated Press — news-distribution workflows interrupted
  • McDonald's, Taco Bell, and Burger King mobile apps — order processing failed
  • Delta Air Lines — website and app degraded
  • Capital One, Fortnite, IMDB, Crunchyroll, 1Password, and others — service degradation reported

The Real Cause: A Hidden Software Defect

Contrary to early speculation about DNS issues, the root cause was far more subtle and technical. AWS's official post-incident report revealed that the outage stemmed from a latent software defect in AWS Lambda's capacity management subsystem.

Here's what happened under the hood:

Lambda's Frontend fleet is responsible for allocating execution environments for customer functions. As usage grew throughout the morning, the fleet reached an unprecedented capacity threshold—a level that had "never been reached within a single cell" in Lambda's operational history. When this threshold was crossed, a dormant bug activated.

The defect caused the system to allocate execution environments without properly utilizing them. Think of it like a restaurant that keeps seating customers at tables but never sends waiters to serve them. The resources existed, but the coordination system broke down. This created a cascading resource exhaustion that rippled through Lambda and into dependent services.

The bug had existed in the codebase for an unknown period, quietly waiting for the right conditions to manifest. It was a time bomb that finally went off when Lambda's growth trajectory intersected with a specific capacity threshold.

The Cascade Effect

What made this outage particularly severe was the interconnected nature of AWS services. When Lambda struggled, it affected:

  • API Gateway - Unable to trigger Lambda functions
  • DynamoDB - Stream processing failures (which initially caused confusion about DNS)
  • S3 - Event notifications delayed or failed
  • Step Functions - Workflow orchestration disrupted
  • CloudWatch - Monitoring and logging impaired

This is the reality of modern microservices architecture: failures don't stay isolated. They propagate.

AWS's Response

AWS engineers identified the issue, and per the official post-mortem, Lambda function invocations began returning to normal at 1:45 PM PDT.1 The fix involved:

  1. Implementing immediate throttling to prevent new Lambda invocations from hitting the buggy code path
  2. Deploying emergency capacity management logic
  3. Gradually draining affected execution environments
  4. Rolling out a permanent fix to prevent recurrence
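The first of those steps, emergency throttling, is essentially admission control: shed incoming work before it reaches the damaged code path. AWS has not published its implementation; the token-bucket sketch below (class name, rates, and burst size are illustrative, not AWS's actual mechanism) shows the core idea of bounding how fast new invocations are admitted.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: admit work at a sustained rate,
    with a bounded burst, and shed everything beyond that."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec          # tokens added per second
        self.capacity = burst             # maximum stored tokens
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True                   # admit this request
        return False                      # throttle this request

bucket = TokenBucket(rate_per_sec=100, burst=10)
accepted = sum(1 for _ in range(50) if bucket.allow())
# Only about the burst size is admitted instantaneously; the rest are shed.
```

The same shape appears in most production throttles: the knobs are the sustained rate and the burst allowance, and everything above them fails fast instead of piling onto the struggling subsystem.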

Full service restoration — including downstream services that had been hit by the cascade — was achieved by 3:37 PM PDT, nearly four hours after the initial incident.1


October 19-20, 2025: The DNS Race Condition

What Happened

The outage began at 11:48 PM PDT on October 19, 2025 (07:48 UTC on October 20) in AWS's US-EAST-1 region.3 Primary DynamoDB recovery occurred by 2:40 AM PDT on October 20, but EC2-related effects extended through 1:50 PM PDT, and AWS declared that "all services returned to normal operations" later that afternoon — a window of roughly 14-15 hours depending on the service.3 Downdetector recorded over 6.5 million user reports globally, with some industry trackers reporting totals between 11 and 17 million across the full window.

Services affected included:

  • Reddit - Complete service unavailability
  • Snapchat - Messaging and content delivery failed
  • Canva - Design platform inaccessible
  • UK banks - Including Lloyds, Halifax, and others experiencing transaction processing issues
  • Alexa - Voice assistant functionality degraded
  • Ring - Video doorbell services disrupted
  • Amazon's own retail site - Intermittent availability issues

The Real Cause: A DynamoDB DNS Race Condition

The October 2025 outage had a different technical root cause, though DNS was indeed involved—but not in the way many initial reports suggested.

The problem originated in DynamoDB's internal DNS-management automation, which is implemented through Route 53 but is operationally separate from customer-facing Route 53 hosted zones. Per AWS's post-mortem, the system uses two redundant components — a DNS Planner (creates plans based on load-balancer health) and DNS Enactors (apply plans across multiple Availability Zones).3 One Enactor, slowed by unusual processing latencies, applied an outdated plan to the regional endpoint while another Enactor's cleanup automation deleted what it thought was an obsolete plan. The result: every IP address for the DynamoDB regional endpoint was removed.3

The result? An empty DNS record.
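AWS's remediation is not public as code, but the missing safeguard is a classic one: a stale writer must not be allowed to overwrite newer state. The toy sketch below (all class and variable names are illustrative, not AWS's design) shows the kind of version guard that rejects an outdated plan instead of applying it.

```python
import threading

class DnsEndpoint:
    """Toy DNS record store with a monotonic plan version.

    A plan may only be applied if it is strictly newer than the one
    already in effect, so a slow writer holding an old plan is fenced out."""

    def __init__(self):
        self.ips = ["10.0.0.1"]
        self.applied_version = 0
        self._lock = threading.Lock()

    def apply_plan(self, version, ips):
        with self._lock:
            if version <= self.applied_version:
                return False              # stale plan: reject, don't overwrite
            self.applied_version = version
            self.ips = ips
            return True

endpoint = DnsEndpoint()
endpoint.apply_plan(2, ["10.0.0.2", "10.0.0.3"])  # fresh plan: applied
ok = endpoint.apply_plan(1, [])                    # slow writer with an old, empty plan
# ok is False and the live IP addresses survive
```

With this guard in place, the delayed Enactor's outdated (and in the real incident, eventually empty) plan would simply lose the race rather than blanking the endpoint.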

When services tried to connect to DynamoDB, they received no address information. DynamoDB is foundational to many AWS services, so this single point of failure cascaded through the infrastructure like dominoes:

  • Services couldn't reach their data stores
  • Health checks failed across the board
  • Automated recovery systems kicked in, but had nowhere to route traffic
  • Core AWS services including DynamoDB, EC2, Lambda, ECS, EKS, Fargate, Amazon Connect, STS, Redshift, and Network Load Balancer were directly impacted, with downstream effects on roughly 1,000 third-party platforms3 4

The Recovery Complication

What turned this from a bad outage into a catastrophic one was what happened during recovery. EC2's DropletWorkflow Manager (DWFM) — the subsystem that manages the physical hosts ("droplets") underneath EC2 instances and tracks their leases via DynamoDB — entered what AWS described as "congestive collapse" once DynamoDB came back online: it tried to re-establish leases across many thousands of droplets simultaneously, but each batch of work timed out before it could complete.3

This is the classic "thundering herd" pattern—imagine a stadium full of people all trying to exit through a single door at once. AWS engineers had to manually throttle incoming work and selectively restart DWFM hosts to clear the backlog before normal operations could resume.3
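A standard client-side defense against this pattern is jittered exponential backoff: each recovering client waits a randomized, exponentially growing delay before retrying, so the fleet spreads out instead of arriving in lockstep. A minimal sketch (parameter values are illustrative):

```python
import random

def backoff_delays(attempt_count, base=0.5, cap=30.0, seed=None):
    """Full-jitter exponential backoff: each retry waits a random time in
    [0, min(cap, base * 2**attempt)], so simultaneous retriers decorrelate
    instead of stampeding the recovering service together."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempt_count):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays

delays = backoff_delays(5, seed=42)
# Each delay is bounded by the exponential ceiling but randomized,
# so a fleet of recovering clients doesn't form a single thundering herd.
```

Server-side, the complement is what AWS engineers did manually during the incident: throttle incoming work so the backlog drains in batches small enough to complete before timing out.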

Lessons in Complexity

The October 2025 outage revealed how complex distributed systems can fail in unexpected ways:

  1. Automation can amplify failures: The race condition occurred between two automated systems, neither of which had proper conflict resolution.

  2. Recovery can be harder than mitigation: Fixing the DNS issue was relatively quick; safely re-establishing leases across the EC2 host fleet under throttled, controlled conditions took hours.

  3. Single points of failure persist: Despite redundancy, DynamoDB's internal DNS became a critical chokepoint.


The Pattern: Centralized Fragility

Both outages expose the same fundamental challenge: over-centralization in cloud infrastructure.

The US-EAST-1 Problem

The US-EAST-1 region (Northern Virginia) is AWS's oldest and most trafficked region. It handles an extraordinary volume of:

  • DNS requests
  • Compute instances
  • Inter-service API calls
  • Legacy workloads that haven't been migrated

Many organizations route mission-critical workloads through US-EAST-1 due to:

  • Legacy configurations - Systems built years ago when fewer regions existed
  • Latency optimization - Proximity to major internet exchange points
  • Regional service dependencies - Some AWS services launched in US-EAST-1 first

When this region experiences issues, the impact is disproportionately global.

The "It's Always DNS" Syndrome

The October 2025 outage reinforced the industry cliché: "It's always DNS."

DNS acts as the internet's address book. When DNS fails:

  • Applications can't find their databases
  • Services can't locate their dependencies
  • Traffic can't route to healthy instances
  • Even functioning servers become unreachable

It doesn't matter if your application code is perfect, your servers are running, and your data is intact. If DNS can't resolve your endpoints, you're offline.
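A quick way to verify that last point in practice is to check resolution through the same OS resolver your applications use. If this returns False for a critical hostname, nothing else about your stack matters:

```python
import socket

def dns_resolves(hostname):
    """Return True if the hostname resolves to at least one address.

    Uses the operating system's resolver via getaddrinfo, which is the
    same path applications take, so a failure here means clients are
    effectively offline no matter how healthy the servers are."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False

print(dns_resolves("localhost"))
```

A check like this belongs in external monitoring (run from outside your own network), since an outage in the provider's DNS automation is exactly the case where in-platform health checks can lie to you.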


What AWS Has Actually Done to Improve Resilience

Between 2023 and 2025, AWS made genuine investments in infrastructure resilience. Here's what actually happened (with correct names and dates):

1. Geographic Expansion (Verified)

AWS expanded from 26 regions in 2021 to 33+ regions by late 2025:

  • Malaysia Region — Launched August 22, 2024 (US$6.2 billion investment through 2038)5
  • Thailand Region — Launched January 8, 2025 (US$5 billion+ investment over 15 years)6
  • New Zealand Region — Launched September 1, 2025 (NZ$7.5 billion investment)7
  • Spain Region — Launched November 15, 2022 (predates the 2023-2025 timeframe but still part of the broader expansion arc)8

These new regions provide geographic redundancy and reduce dependency on US-EAST-1 for international customers.

2. Route 53 Profiles (Not "Multi-Network DNS")

In 2024, AWS announced Route 53 Profiles, which unifies DNS management across Virtual Private Clouds (VPCs) and accounts. This simplifies multi-region DNS configurations and reduces configuration errors—though it wouldn't have prevented the October 2025 outage, which occurred in internal infrastructure.

3. Enhanced Health Dashboard (Actually from 2022)

AWS unified its Service Health Dashboard and Personal Health Dashboard in February 2022, providing better visibility into service status and personalized impact assessments. This wasn't a 2023-2025 improvement, but it has helped customers respond faster to outages.

4. Generative AI Observability for CloudWatch (Not "AI-Assisted Monitoring")

In October 2025, AWS announced Generative AI observability for Amazon CloudWatch. It helps teams monitor AI/ML applications; it does not use AI to assist with general infrastructure monitoring. It's a valuable tool for a specific use case, but not the broad "AI-Assisted Monitoring" sometimes described.

5. Ongoing Infrastructure Improvements

AWS has invested in:

  • Automated deployment guardrails to prevent configuration errors
  • Enhanced chaos engineering testing
  • Improved isolation between service control planes
  • Better throttling mechanisms to prevent cascading failures

These improvements are real and meaningful, even if they haven't prevented all outages.


Regulatory Scrutiny: The World Takes Notice

The repeated AWS outages have intensified regulatory examination of cloud concentration risks.

United States: FTC Investigation (Verified)

On March 22, 2023, the Federal Trade Commission issued an official Request for Information examining cloud computing business practices.9 The FTC specifically investigated:

  • The impact of enterprise reliance on a small number of cloud providers
  • Competitive dynamics in cloud computing
  • Potential security risks from concentration
  • Single points of failure in critical infrastructure

The FTC received 102 public comments and published a "What we heard and learned" summary in November 2023.10 The Commission continues examining:

  • Software licensing practices
  • Egress fees that lock customers in
  • Minimum spend contracts
  • Systemic risk to digital commerce and national security

United Kingdom: Sovereign Cloud Push (Verified)

The UK has been particularly active in addressing cloud dependency:

  • September 2024: UK designated data centers as Critical National Infrastructure
  • 31 July 2025: The Competition and Markets Authority published the final decision of its Cloud Services Market Investigation, recommending Microsoft and AWS be subject to "targeted interventions" via Strategic Market Status (SMS) designation11
  • August 2025: Microsoft admitted it cannot guarantee sovereignty of Office 365 data stored in UK data centers, acknowledging that personnel from 105 countries (including China) can access it

One industry survey found that 83% of UK IT leaders worry about geopolitical impacts on data access. The government is exploring options for government-specific cloud infrastructure.

European Union: Cloud Sovereignty Concerns (Verified)

The EU has taken meaningful steps on cloud dependency:

  • 2025 Strategic Foresight Report (June 2025) flags cloud and data infrastructure as one of the EU's persistent strategic dependencies, noting roughly 70% of the EU's cloud market is controlled by three US firms12
  • Cloud and AI Development Act (CADA) — Commission consultation closed June 4, 2025; legislative proposal scheduled for Q1 2026 per the 2026 Commission work programme13
  • Ongoing emphasis on digital sovereignty and data residency

The EU's broader digital strategy includes provisions to reduce dependency on non-European cloud providers, though implementation remains gradual.

The Global Context

Beyond the US, UK, and EU:

  • China enforces data localization through the Data Security Law
  • India requires payment data and certain government data to remain within national borders
  • Australia has strengthened Critical Infrastructure Protection regulations
  • Brazil is developing sovereign cloud requirements

The pattern is clear: governments worldwide are reconsidering their dependence on a handful of global cloud providers.


The Market Reality: Concentration Continues

Despite concerns, cloud market concentration remains high. According to multiple industry sources:

  • AWS, Microsoft Azure, and Google Cloud together control roughly 63-72% of the global cloud market, depending on how it is segmented:
    • 63% of all cloud infrastructure services (Synergy Research Group)
    • 68% of public cloud IaaS/PaaS (Synergy Research Group)
    • 72% of IaaS alone (Gartner)

This concentration creates inherent systemic risk. When one provider experiences issues, millions of organizations and billions of users feel the impact.


Practical Lessons: Building for Resilience

For organizations depending on cloud infrastructure, these outages provide critical lessons.

1. Multi-Region Is Not Optional

Critical workloads must span multiple AWS regions—or even multiple cloud providers. AWS provides tools to support this:

  • Route 53 Health Checks: Automatically route traffic away from unhealthy endpoints
  • Amazon RDS Multi-AZ: Synchronous replication across availability zones
  • S3 Cross-Region Replication: Automatic data replication for disaster recovery
  • AWS Backup: Centralized backup across regions

Example architecture:

Primary:      US-EAST-1 (Virginia)
Secondary:    US-WEST-2 (Oregon)
Tertiary:     EU-WEST-1 (Ireland)
Failover:     Automatic via Route 53 health checks
Data Sync:    Continuous via S3 CRR and RDS replication
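With Route 53, the failover layer of an architecture like this is expressed as a pair of records sharing one name: a PRIMARY served while its health check passes, and a SECONDARY served when it fails. The sketch below only builds the ChangeBatch dictionary (the hostname, IPs, and health-check ID are placeholders); in a real setup you would pass it to boto3's route53.change_resource_record_sets for your hosted zone.

```python
def failover_change_batch(name, primary_ip, secondary_ip,
                          health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch for active-passive DNS failover.

    Route 53 answers with the PRIMARY record while its health check
    passes and automatically fails over to SECONDARY when it doesn't."""
    def record(set_id, role, ip, hc=None):
        r = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,   # distinguishes records sharing a name
            "Failover": role,          # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                # keep low so failover propagates fast
            "ResourceRecords": [{"Value": ip}],
        }
        if hc:
            r["HealthCheckId"] = hc    # only the primary needs a health check
        return r

    return {
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": record("primary", "PRIMARY",
                                         primary_ip, health_check_id)},
            {"Action": "UPSERT",
             "ResourceRecordSet": record("secondary", "SECONDARY",
                                         secondary_ip)},
        ]
    }

batch = failover_change_batch("api.example.com", "203.0.113.10",
                              "203.0.113.20", "hc-1234")
```

Keeping the TTL short is the design choice that matters here: clients cache DNS answers, so a long TTL delays failover no matter how quickly Route 53 itself reacts.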

2. Multi-Cloud Strategies Are Growing

Many enterprises now distribute workloads across multiple providers:

  • AWS for compute and storage
  • Microsoft Azure for enterprise applications and Active Directory integration
  • Google Cloud Platform for data analytics and AI/ML workloads

This approach is expensive and complex—managing multiple clouds requires:

  • Different APIs and tooling
  • Multiple vendor relationships
  • Diverse security models
  • Cross-cloud networking

But incidents like these justify the investment. When AWS goes down, your Azure workloads keep running.

3. Test Your DNS Dependencies

The October 2025 outage proved that DNS failures are uniquely destructive. Organizations should:

  • Map all DNS dependencies in their architecture
  • Implement DNS health monitoring
  • Configure multiple DNS providers (e.g., Route 53 + Cloudflare)
  • Test DNS failover regularly
  • Use DNS caching strategically

Python example for DNS monitoring:

import boto3

def alert_oncall(message):
    """Placeholder: wire up to SNS, PagerDuty, Slack, etc."""
    print(f"ALERT: {message}")

def monitor_dns_health():
    """Check that Route 53 still answers for critical record names."""
    route53 = boto3.client('route53')

    critical_endpoints = [
        'api.yourcompany.com',
        'db.yourcompany.com',
        'auth.yourcompany.com'
    ]

    for endpoint in critical_endpoints:
        try:
            # Ask Route 53 how it would answer a query for this record
            answer = route53.test_dns_answer(
                HostedZoneId='YOUR_ZONE_ID',
                RecordName=endpoint,
                RecordType='A'
            )

            # NOERROR with no record data means the name resolves to
            # nothing -- exactly the failure mode of the October 2025 incident
            if answer['ResponseCode'] != 'NOERROR' or not answer['RecordData']:
                alert_oncall(f"DNS failure for {endpoint}")

        except Exception as e:
            alert_oncall(f"DNS check failed for {endpoint}: {e}")

monitor_dns_health()

4. Practice Chaos Engineering

Chaos engineering helps expose hidden dependencies before they cause outages. AWS provides tools:

  • AWS Fault Injection Simulator: Inject controlled failures into your infrastructure
  • AWS Resilience Hub: Assess and improve application resilience
  • Third-party tools: Gremlin, Chaos Monkey, LitmusChaos

Example chaos experiment (an illustrative sketch: AWS FIS uses its own JSON experiment-template format, and the action name below is hypothetical):

# Simulate DynamoDB unavailability
experiment:
  name: "DynamoDB Outage Simulation"
  actions:
    - type: "aws:dynamodb:deny-access"
      targets:
        - table: "critical-data-table"
      duration: "PT10M"  # 10 minutes
  
  hypothesis: "Application gracefully degrades with cached data"
  
  success_criteria:
    - "Error rate < 5%"
    - "P99 latency < 2000ms"
    - "No cascading failures to dependent services"

5. Implement Graceful Degradation

Applications should continue functioning (in reduced capacity) when dependencies fail:

  • Circuit breakers: Stop calling failing services
  • Fallback strategies: Use cached data when databases are unavailable
  • Feature flags: Disable non-critical features during incidents
  • Queue-based async processing: Defer work that can wait

Example circuit breaker pattern:

import requests
from circuitbreaker import circuit, CircuitBreakerError

def get_from_cache():
    """Placeholder: return the last known-good payload from a local cache."""
    return {"source": "cache"}

@circuit(failure_threshold=5, recovery_timeout=60)
def call_external_api():
    """Call external service with circuit breaker protection"""
    response = requests.get('https://api.partner.com/data', timeout=5)

    if response.status_code != 200:
        raise Exception("API call failed")

    return response.json()

def get_data_with_fallback():
    """Get data with fallback to cache"""
    try:
        return call_external_api()
    except CircuitBreakerError:
        # Circuit is open: skip the call entirely and use cached data
        return get_from_cache()
    except Exception:
        # Call failed but the circuit is still closed; fall back anyway
        return get_from_cache()

6. Invest in Observability

You can't fix what you can't see. Modern observability requires:

  • Real-time monitoring: CloudWatch, Datadog, Grafana, Prometheus
  • Distributed tracing: AWS X-Ray, Jaeger, Honeycomb
  • Log aggregation: CloudWatch Logs Insights, Elasticsearch, Splunk
  • Custom metrics: Track business KPIs, not just infrastructure metrics

Example: AWS Health API integration

import boto3

def send_alert(message):
    """Placeholder: route to your paging or chat tooling."""
    print(f"ALERT: {message}")

def check_aws_service_health():
    """Monitor AWS service health in real time.

    Note: the AWS Health API requires a Business, Enterprise On-Ramp,
    or Enterprise Support plan, and is served from us-east-1."""
    health = boto3.client('health', region_name='us-east-1')

    response = health.describe_events(
        filter={
            'regions': ['us-east-1', 'us-west-2'],
            'services': ['EC2', 'RDS', 'LAMBDA', 'DYNAMODB'],
            'eventStatusCodes': ['open', 'upcoming']
        }
    )

    for event in response.get('events', []):
        category = event.get('eventTypeCategory')
        service = event.get('service')
        status = event.get('statusCode')

        if category == 'issue':
            send_alert(f"AWS {service} issue detected: {status}")

        print(f"{event['eventTypeCode']} - {status}")

# Run on a schedule (e.g., every 5 minutes)
check_aws_service_health()

7. Document and Test Your Disaster Recovery Plan

Everyone has a disaster recovery plan until disaster strikes. Regular testing reveals gaps:

  • Monthly tabletop exercises: Walk through scenarios
  • Quarterly DR drills: Actually fail over to secondary regions
  • Annual full-scale tests: Simulate complete regional failure
  • Post-incident reviews: Learn from real outages

DR Plan Checklist:

  • RTO (Recovery Time Objective) defined for each service
  • RPO (Recovery Point Objective) defined for each data store
  • Runbooks documented and accessible
  • Automated failover tested
  • Manual failover procedures validated
  • Communication plan established
  • Third-party dependencies identified
  • Data restoration tested
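Checklist items like RTO and RPO only mean something when drill results are measured against them. A small sketch of that comparison (service names, field names, and numbers are made up for illustration):

```python
def dr_gaps(targets, last_drill):
    """Compare measured drill results against declared RTO/RPO targets.

    `targets` maps service -> {"rto_min": ..., "rpo_min": ...};
    `last_drill` maps service -> {"recovery_min": ..., "data_loss_min": ...}.
    Returns services whose drill results missed their objectives, the gaps
    a tabletop exercise alone won't surface."""
    gaps = {}
    for svc, t in targets.items():
        measured = last_drill.get(svc)
        if measured is None:
            gaps[svc] = "never tested"
            continue
        misses = []
        if measured["recovery_min"] > t["rto_min"]:
            misses.append("RTO missed")
        if measured["data_loss_min"] > t["rpo_min"]:
            misses.append("RPO missed")
        if misses:
            gaps[svc] = ", ".join(misses)
    return gaps

targets = {"checkout": {"rto_min": 30, "rpo_min": 5},
           "search":   {"rto_min": 60, "rpo_min": 15}}
drill = {"checkout": {"recovery_min": 45, "data_loss_min": 2}}
result = dr_gaps(targets, drill)
# checkout missed its RTO; search was never tested at all
```

Feeding real drill timings through a check like this after every quarterly exercise turns the checklist above from a document into a regression test.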

The Bigger Picture: What This Means for the Future

Cloud Concentration Isn't Decreasing

Despite regulatory scrutiny and public concern, cloud market concentration continues to increase. Why?

  1. Network effects: The more services AWS offers, the stickier it becomes
  2. Switching costs: Migrating from AWS is expensive and risky
  3. Innovation pace: The hyperscalers outpace smaller competitors
  4. Price competition: Volume discounts favor large deployments

The Resilience Paradox

Cloud providers promise resilience through redundancy, but create fragility through concentration. This is the fundamental paradox of modern infrastructure.

Traditional architecture:

  • Many small single points of failure
  • Localized impact when failures occur
  • Difficult to manage at scale

Cloud architecture:

  • Few large single points of failure
  • Global impact when failures occur
  • Easy to manage until failure happens

The Path Forward

The future of cloud resilience likely involves:

  1. Hybrid and multi-cloud becoming standard: Not just for risk mitigation, but as operational best practice

  2. Edge computing reducing dependence: Processing data closer to users reduces reliance on centralized cloud regions

  3. Open source alternatives gaining traction: Projects like Kubernetes enable cloud-agnostic architectures

  4. Regulatory requirements forcing diversification: Government workloads may be required to use multiple providers

  5. Cloud providers investing in resilience: Competition and regulation will drive continued infrastructure improvements


Conclusion: Embracing Realistic Expectations

The AWS outages of 2023 and 2025 weren't failures of technology—they were revelations of reality. Complex distributed systems fail. Scale introduces emergent problems. Automation can amplify issues as quickly as it solves them.

The lesson isn't that cloud computing is fundamentally flawed. It's that resilience isn't automatic—it must be engineered, tested, and continuously improved.

For organizations depending on cloud infrastructure:

  • Accept that outages will happen
  • Design systems that gracefully degrade
  • Distribute risk across regions and providers
  • Invest in observability and chaos engineering
  • Test disaster recovery plans regularly

For cloud providers:

  • Continue improving isolation between services
  • Invest in chaos engineering at scale
  • Enhance transparency during incidents
  • Design for graceful degradation by default

For regulators:

  • Balance innovation with systemic risk concerns
  • Require transparency without stifling competition
  • Encourage multi-cloud architectures for critical infrastructure

The internet's backbone is stronger than ever, but it's not unbreakable. Understanding its limitations is the first step toward building systems that can withstand them.


Additional Resources

Industry Reports:

  • Synergy Research Group - Cloud Market Analysis
  • Gartner Cloud Infrastructure Market Share Reports
  • ThousandEyes AWS Outage Analysis (June 2023, October 2025)


References

Footnotes

  1. AWS Service Health, "Summary of the AWS Lambda Service Event in Northern Virginia (US-EAST-1) Region" — official post-event summary, June 13, 2023. https://aws.amazon.com/message/061323/

  2. ThousandEyes Internet Report, "AWS Outage Analysis: June 13, 2023." Identified more than 100 AWS services experiencing impact and named affected customer-facing services including The Boston Globe and the New York MTA.

  3. AWS Service Health, "Summary of the Amazon DynamoDB Service Disruption in the Northern Virginia (US-EAST-1) Region" — official post-event summary, October 19-20, 2025. https://aws.amazon.com/message/101925/

  4. ThousandEyes / The Register / Pragmatic Engineer analyses (October 22-25, 2025) document the cascading impact across roughly 1,000 third-party platforms tracked by Downdetector during the incident.

  5. AWS press release, "AWS Launches Infrastructure Region in Malaysia," August 22, 2024. https://press.aboutamazon.com/2024/8/aws-launches-infrastructure-region-in-malaysia

  6. AWS press release, "AWS Launches Infrastructure Region in Thailand," January 8, 2025.

  7. AWS Blog, "Now Open — AWS Asia Pacific (New Zealand) Region," September 1, 2025. https://aws.amazon.com/blogs/aws/now-open-aws-asia-pacific-new-zealand-region/

  8. AWS Blog, "Now Open — AWS Region in Spain," November 15, 2022.

  9. U.S. Federal Trade Commission press release, "FTC Seeks Comment on Business Practices of Cloud Computing Providers that Could Impact Competition and Data Security," March 22, 2023. https://www.ftc.gov/news-events/news/press-releases/2023/03/ftc-seeks-comment-business-practices-cloud-computing-providers-could-impact-competition-data

  10. U.S. Federal Trade Commission, Tech@FTC, "Cloud Computing RFI: What we heard and learned," November 2023. https://www.ftc.gov/policy/advocacy-research/tech-at-ftc/2023/11/cloud-computing-rfi-what-we-heard-learned

  11. UK Competition and Markets Authority, "Cloud services market investigation — Summary of final decision," 31 July 2025. https://www.gov.uk/cma-cases/cloud-services-market-investigation

  12. European Commission, "2025 Strategic Foresight Report — Choosing the European Way to Resilience 2.0" (June 2025). https://commission.europa.eu/strategy-and-policy/strategic-foresight/2025-strategic-foresight-report_en

  13. European Commission, Call for Evidence and consultation on the Cloud and AI Development Act (April-June 2025); Commission Work Programme 2026 indicates a Q1 2026 legislative proposal.

