AWS Outages 2023 and 2025: When the Internet Backbone Faltered
October 25, 2025
Amazon Web Services (AWS) powers an extraordinary portion of the internet's infrastructure. From streaming services and social media platforms to banking systems and government portals, AWS quietly underpins the digital experiences of billions of users worldwide. It's the invisible engine that keeps modern computing running—until it doesn't.
In the span of just over two years, AWS experienced two major outages that exposed the fragility lurking beneath the cloud's promised reliability. The first, in June 2023, stemmed from a software defect that had never been triggered before. The second, in October 2025, resulted from a race condition between automated systems. Both disrupted millions of users and reminded the world of a critical truth: even the most sophisticated infrastructure can fail, and when it does, the ripple effects are global.
This is the story of what happened, what went wrong, and what it means for the future of cloud computing.
June 13, 2023: The Lambda Capacity Crisis
What Happened
On June 13, 2023, at 11:49 AM PDT, AWS's US-EAST-1 region in Northern Virginia experienced a catastrophic failure that lasted 3 hours and 48 minutes. The outage affected over 104 AWS services, creating a cascading failure that disrupted major platforms and services across the internet.
Major organizations impacted included:
- The Boston Globe - Unable to publish digital content
- Southwest Airlines - Flight operations disrupted
- McDonald's mobile app - Order processing failed
- Taco Bell app - Service unavailable
- New York MTA - Transit information systems affected
- The Associated Press - News distribution interrupted
The Real Cause: A Hidden Software Defect
Contrary to early speculation about DNS issues, the root cause was far more subtle and technical. AWS's official post-incident report revealed that the outage stemmed from a latent software defect in AWS Lambda's capacity management subsystem.
Here's what happened under the hood:
Lambda's Frontend fleet is responsible for allocating execution environments for customer functions. As usage grew throughout the morning, the fleet reached an unprecedented capacity threshold—a level that had "never been reached within a single cell" in Lambda's operational history. When this threshold was crossed, a dormant bug activated.
The defect caused the system to allocate execution environments without properly utilizing them. Think of it like a restaurant that keeps seating customers at tables but never sends waiters to serve them. The resources existed, but the coordination system broke down. This created a cascading resource exhaustion that rippled through Lambda and into dependent services.
The bug had existed in the codebase for an unknown period, quietly waiting for the right conditions to manifest. It was a time bomb that finally went off when Lambda's growth trajectory intersected with a specific capacity threshold.
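To make that failure mode concrete, here is a deliberately simplified toy model of this class of bug; it is purely illustrative and not AWS's actual code. Past an untested threshold, environments keep getting allocated but are never handed to the dispatcher, so capacity is consumed without serving anyone.
UNTESTED_THRESHOLD = 1000  # a capacity level this code path had never crossed before

class ToyCapacityManager:
    """Toy model only: allocation and utilization drift apart past a threshold."""
    def __init__(self):
        self.allocated = []  # environments reserved from the fleet
        self.ready = []      # environments the dispatcher can actually use

    def allocate_environment(self, env_id):
        self.allocated.append(env_id)
        if len(self.allocated) <= UNTESTED_THRESHOLD:
            self.ready.append(env_id)
        # Latent defect: beyond the threshold the hand-off silently never
        # happens, so allocation keeps growing while usable capacity stalls.

mgr = ToyCapacityManager()
for i in range(1500):
    mgr.allocate_environment(i)

print(f"allocated={len(mgr.allocated)}, usable={len(mgr.ready)}")
# allocated=1500, usable=1000: the resources exist, but requests still starve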
The Cascade Effect
What made this outage particularly severe was the interconnected nature of AWS services. When Lambda struggled, it affected:
- API Gateway - Unable to trigger Lambda functions
- DynamoDB - Stream processing failures (which initially caused confusion about DNS)
- S3 - Event notifications delayed or failed
- Step Functions - Workflow orchestration disrupted
- CloudWatch - Monitoring and logging impaired
This is the reality of modern microservices architecture: failures don't stay isolated. They propagate.
AWS's Response
AWS engineers identified the issue within the first hour and implemented emergency mitigation by 2:45 PM PDT. The fix involved:
- Implementing immediate throttling to prevent new Lambda invocations from hitting the buggy code path
- Deploying emergency capacity management logic
- Gradually draining affected execution environments
- Rolling out a permanent fix to prevent recurrence
Full service restoration was achieved by 3:37 PM PDT, nearly four hours after the initial incident.
October 20, 2025: The DNS Race Condition
What Happened
On October 20, 2025, AWS suffered another major outage in US-EAST-1, this time lasting between 7 and 15 hours depending on the service. The disruption was even more widespread than in 2023, generating 6.5 million outage reports on Downdetector.
Services affected included:
- Reddit - Complete service unavailability
- Snapchat - Messaging and content delivery failed
- Canva - Design platform inaccessible
- UK banks - Including Lloyds, Halifax, and others experiencing transaction processing issues
- Alexa - Voice assistant functionality degraded
- Ring - Video doorbell services disrupted
- Amazon's own retail site - Intermittent availability issues
The Real Cause: A DynamoDB DNS Race Condition
The October 2025 outage had a different technical root cause. DNS was indeed involved, but not in the way many initial reports suggested.
The problem originated in DynamoDB's internal infrastructure, not in Route 53 (AWS's customer-facing DNS service). Two automated systems simultaneously attempted to update the same internal DNS entry for DynamoDB API endpoints. This created a race condition where both systems thought they were authoritative for the update.
The result? An empty DNS record.
When services tried to connect to DynamoDB, they received no address information. DynamoDB is foundational to many AWS services, so this single point of failure cascaded through the infrastructure like dominoes:
- Services couldn't reach their data stores
- Health checks failed across the board
- Automated recovery systems kicked in, but had nowhere to route traffic
- 113 AWS services were impacted either directly or indirectly
The Recovery Complication
What turned this from a bad outage into a catastrophic one was what happened during recovery. When AWS engineers resolved the DNS race condition and DynamoDB came back online, EC2 (Elastic Compute Cloud) attempted to restart all affected instances simultaneously.
This created a "thundering herd" problem—imagine a stadium full of people all trying to exit through a single door at once. The sudden load overwhelmed DynamoDB again, extending the outage significantly. Engineers had to implement gradual, throttled restarts to bring services back safely.
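The standard defense against a thundering herd is to bring load back in small batches, with pauses and jitter so recovering clients don't stampede the dependency again. A minimal sketch of that idea (illustrative only; restart_instance is a hypothetical hook, not an AWS API):
import random
import time

def throttled_restart(instance_ids, batch_size=50, delay_seconds=30):
    """Restart instances in small batches with jittered pauses so the
    recovering dependency isn't hit by everything at once."""
    for i in range(0, len(instance_ids), batch_size):
        batch = instance_ids[i:i + batch_size]
        for instance_id in batch:
            restart_instance(instance_id)  # hypothetical restart hook
        # Jitter spreads the batches out so they don't line up into a new herd
        time.sleep(delay_seconds + random.uniform(0, delay_seconds / 2))

def restart_instance(instance_id):
    print(f"restarting {instance_id}")

throttled_restart([f"i-{n:05d}" for n in range(200)], batch_size=50, delay_seconds=1)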
Lessons in Complexity
The October 2025 outage revealed how complex distributed systems can fail in unexpected ways:
- Automation can amplify failures: The race condition occurred between two automated systems, neither of which had proper conflict resolution.
- Recovery can be harder than mitigation: Fixing the DNS issue was quick; safely restarting millions of instances took hours.
- Single points of failure persist: Despite redundancy, DynamoDB's internal DNS became a critical chokepoint.
The Pattern: Centralized Fragility
Both outages expose the same fundamental challenge: over-centralization in cloud infrastructure.
The US-EAST-1 Problem
The US-EAST-1 region (Northern Virginia) is AWS's oldest and most trafficked region. It handles an extraordinary volume of:
- DNS requests
- Compute instances
- Inter-service API calls
- Legacy workloads that haven't been migrated
Many organizations route mission-critical workloads through US-EAST-1 due to:
- Legacy configurations - Systems built years ago when fewer regions existed
- Latency optimization - Proximity to major internet exchange points
- Regional service dependencies - Some AWS services launched in US-EAST-1 first
When this region experiences issues, the impact is disproportionately global.
The "It's Always DNS" Syndrome
The October 2025 outage reinforced the industry cliché: "It's always DNS."
DNS acts as the internet's address book. When DNS fails:
- Applications can't find their databases
- Services can't locate their dependencies
- Traffic can't route to healthy instances
- Even functioning servers become unreachable
It doesn't matter if your application code is perfect, your servers are running, and your data is intact. If DNS can't resolve your endpoints, you're offline.
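Even a trivial resolution check illustrates the point; the hostname below is just an example.
import socket

def can_resolve(hostname):
    """Return True only if DNS hands back at least one address for the host."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        # The server behind this name may be perfectly healthy,
        # but without an address no client can reach it.
        return False

print(can_resolve("dynamodb.us-east-1.amazonaws.com"))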
What AWS Has Actually Done to Improve Resilience
Between 2023 and 2025, AWS made genuine investments in infrastructure resilience. Here's what actually happened (with correct names and dates):
1. Geographic Expansion (Verified)
AWS expanded from 26 regions in 2021 to 33+ regions by late 2025:
- Malaysia Region - Launched August 22, 2024 ($6.2 billion investment)
- Thailand Region - Launched January 8, 2025 ($5 billion investment)
- New Zealand Region - Launched August 29, 2025 (NZ$7.5 billion investment)
- Spain Region - Launched November 15, 2022 (predating the 2023-2025 window)
These new regions provide geographic redundancy and reduce dependency on US-EAST-1 for international customers.
2. Route 53 Profiles (Not "Multi-Network DNS")
In 2024, AWS announced Route 53 Profiles, which unifies DNS management across Virtual Private Clouds (VPCs) and accounts. This simplifies multi-region DNS configurations and reduces configuration errors—though it wouldn't have prevented the October 2025 outage, which occurred in internal infrastructure.
3. Enhanced Health Dashboard (Actually from 2022)
AWS unified its Service Health Dashboard and Personal Health Dashboard in February 2022, providing better visibility into service status and personalized impact assessments. This wasn't a 2023-2025 improvement, but it has helped customers respond faster to outages.
4. Generative AI Observability for CloudWatch (Not "AI-Assisted Monitoring")
In October 2025, AWS announced Generative AI observability for Amazon CloudWatch. This is tooling for monitoring AI/ML applications, not AI that assists with general infrastructure monitoring. It's a valuable tool for a specific use case, but not the broad "AI-Assisted Monitoring" sometimes described.
5. Ongoing Infrastructure Improvements
AWS has invested in:
- Automated deployment guardrails to prevent configuration errors
- Enhanced chaos engineering testing
- Improved isolation between service control planes
- Better throttling mechanisms to prevent cascading failures
These improvements are real and meaningful, even if they haven't prevented all outages.
Regulatory Scrutiny: The World Takes Notice
The repeated AWS outages have intensified regulatory examination of cloud concentration risks.
United States: FTC Investigation (Verified)
On March 22, 2023, the Federal Trade Commission issued an official Request for Information examining cloud computing business practices. The FTC specifically investigated:
- The impact of enterprise reliance on a small number of cloud providers
- Competitive dynamics in cloud computing
- Potential security risks from concentration
- Single points of failure in critical infrastructure
The FTC received 102 public comments and published findings in November 2023. The investigation continues, with ongoing examination of:
- Software licensing practices
- Egress fees that lock customers in
- Minimum spend contracts
- Systemic risk to digital commerce and national security
United Kingdom: Sovereign Cloud Push (Verified)
The UK has been particularly active in addressing cloud dependency:
- September 2024: UK designated data centers as Critical National Infrastructure
- July 2025: Competition and Markets Authority concluded that Microsoft and AWS require "targeted interventions"
- August 2025: Microsoft admitted it cannot guarantee sovereignty of Office 365 data stored in UK data centers, acknowledging that personnel from 105 countries (including China) can access it
A survey found that 83% of UK IT leaders worry about geopolitical impacts on data access. The government is exploring options for government-specific cloud infrastructure.
European Union: Cloud Sovereignty Concerns (Partially Verified)
While no "2025 Cloud Infrastructure Resilience Report" from the European Commission exists, the EU has taken meaningful steps:
- 2025 Strategic Foresight Report addresses cloud dependency as a strategic risk
- Cloud and AI Development Act in preparation (expected Q4 2025/Q1 2026)
- Ongoing emphasis on digital sovereignty and data residency
The EU's Digital Single Market strategy includes provisions to reduce dependency on non-European cloud providers, though implementation remains gradual.
The Global Context
Beyond the US, UK, and EU:
- China enforces data localization through the Data Security Law
- India requires payment data and certain government data to remain within national borders
- Australia has strengthened Critical Infrastructure Protection regulations
- Brazil is developing sovereign cloud requirements
The pattern is clear: governments worldwide are reconsidering their dependence on a handful of global cloud providers.
The Market Reality: Concentration Continues
Despite concerns, cloud market concentration remains high. According to multiple industry sources:
- Synergy Research Group: AWS, Microsoft Azure, and Google Cloud control 63-68% of the global cloud market, depending on segment:
  - 63% of all cloud infrastructure services
  - 68% of public cloud IaaS/PaaS
- Gartner: 72% of the IaaS-only market
This concentration creates inherent systemic risk. When one provider experiences issues, millions of organizations and billions of users feel the impact.
Practical Lessons: Building for Resilience
For organizations depending on cloud infrastructure, these outages provide critical lessons.
1. Multi-Region Is Not Optional
Critical workloads must span multiple AWS regions—or even multiple cloud providers. AWS provides tools to support this:
- Route 53 Health Checks: Automatically route traffic away from unhealthy endpoints
- Amazon RDS Multi-AZ: Synchronous replication across availability zones
- S3 Cross-Region Replication: Automatic data replication for disaster recovery
- AWS Backup: Centralized backup across regions
Example architecture:
Primary: US-EAST-1 (Virginia)
Secondary: US-WEST-2 (Oregon)
Tertiary: EU-WEST-1 (Ireland)
Failover: Automatic via Route 53 health checks
Data Sync: Continuous via S3 CRR and RDS replication
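As a sketch of the failover piece, the snippet below creates PRIMARY and SECONDARY failover A records via the Route 53 API; the zone ID, health check ID, and IP addresses are placeholders you would supply.
import boto3

route53 = boto3.client('route53')

def create_failover_records(zone_id, domain, primary_ip, secondary_ip, health_check_id):
    """Create PRIMARY/SECONDARY failover A records so Route 53 routes
    traffic to the secondary region when the primary health check fails."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            'Changes': [
                {
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': domain,
                        'Type': 'A',
                        'SetIdentifier': 'primary-us-east-1',
                        'Failover': 'PRIMARY',
                        'TTL': 60,
                        'ResourceRecords': [{'Value': primary_ip}],
                        'HealthCheckId': health_check_id,
                    },
                },
                {
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': domain,
                        'Type': 'A',
                        'SetIdentifier': 'secondary-us-west-2',
                        'Failover': 'SECONDARY',
                        'TTL': 60,
                        'ResourceRecords': [{'Value': secondary_ip}],
                    },
                },
            ]
        },
    )

create_failover_records('ZONE_ID', 'api.yourcompany.com', '203.0.113.10',
                        '198.51.100.20', 'HEALTH_CHECK_ID')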
2. Multi-Cloud Strategies Are Growing
Many enterprises now distribute workloads across multiple providers:
- AWS for compute and storage
- Microsoft Azure for enterprise applications and Active Directory integration
- Google Cloud Platform for data analytics and AI/ML workloads
This approach is expensive and complex—managing multiple clouds requires:
- Different APIs and tooling
- Multiple vendor relationships
- Diverse security models
- Cross-cloud networking
But incidents like these justify the investment. When AWS goes down, your Azure workloads keep running.
3. Test Your DNS Dependencies
The October 2025 outage proved that DNS failures are uniquely destructive. Organizations should:
- Map all DNS dependencies in their architecture
- Implement DNS health monitoring
- Configure multiple DNS providers (e.g., Route 53 + Cloudflare)
- Test DNS failover regularly
- Use DNS caching strategically
Python example for DNS monitoring:
import boto3

def monitor_dns_health():
    """Monitor DNS resolution for critical endpoints"""
    route53 = boto3.client('route53')

    critical_endpoints = [
        'api.yourcompany.com',
        'db.yourcompany.com',
        'auth.yourcompany.com'
    ]

    for endpoint in critical_endpoints:
        try:
            # Ask Route 53 how it would answer a query for this record
            answer = route53.test_dns_answer(
                HostedZoneId='YOUR_ZONE_ID',
                RecordName=endpoint,
                RecordType='A'
            )
            # A non-NOERROR response or an empty record set means resolution is broken
            if answer['ResponseCode'] != 'NOERROR' or not answer['RecordData']:
                alert_oncall(f"DNS failure for {endpoint}")  # your paging hook
        except Exception as e:
            log_error(f"DNS check failed for {endpoint}: {e}")  # your logging hook

monitor_dns_health()
4. Practice Chaos Engineering
Chaos engineering helps expose hidden dependencies before they cause outages. AWS provides tools:
- AWS Fault Injection Simulator: Inject controlled failures into your infrastructure
- AWS Resilience Hub: Assess and improve application resilience
- Third-party tools: Gremlin, Chaos Monkey, LitmusChaos
Example chaos experiment (illustrative pseudo-configuration, not literal AWS FIS syntax):
# Simulate DynamoDB unavailability
experiment:
  name: "DynamoDB Outage Simulation"
  actions:
    - type: "aws:dynamodb:deny-access"  # illustrative action name
      targets:
        - table: "critical-data-table"
      duration: "PT10M"  # 10 minutes
  hypothesis: "Application gracefully degrades with cached data"
  success_criteria:
    - "Error rate < 5%"
    - "P99 latency < 2000ms"
    - "No cascading failures to dependent services"
5. Implement Graceful Degradation
Applications should continue functioning (in reduced capacity) when dependencies fail:
- Circuit breakers: Stop calling failing services
- Fallback strategies: Use cached data when databases are unavailable
- Feature flags: Disable non-critical features during incidents
- Queue-based async processing: Defer work that can wait
Example circuit breaker pattern:
from circuitbreaker import circuit, CircuitBreakerError
import requests

@circuit(failure_threshold=5, recovery_timeout=60)
def call_external_api():
    """Call external service with circuit breaker protection"""
    response = requests.get('https://api.partner.com/data')
    if response.status_code != 200:
        raise Exception("API call failed")
    return response.json()

def get_data_with_fallback():
    """Get data with fallback to cache"""
    try:
        return call_external_api()
    except CircuitBreakerError:
        # Circuit is open, use cached data
        return get_from_cache()  # your cache lookup
    except Exception as e:
        log_error(f"API call failed: {e}")  # your logging hook
        return get_from_cache()
6. Invest in Observability
You can't fix what you can't see. Modern observability requires:
- Real-time monitoring: CloudWatch, Datadog, Grafana, Prometheus
- Distributed tracing: AWS X-Ray, Jaeger, Honeycomb
- Log aggregation: CloudWatch Logs Insights, Elasticsearch, Splunk
- Custom metrics: Track business KPIs, not just infrastructure metrics
Example: AWS Health API integration
import boto3

def check_aws_service_health():
    """Monitor AWS service health in real-time"""
    # The AWS Health API uses a global endpoint in us-east-1 and requires a
    # Business, Enterprise On-Ramp, or Enterprise support plan.
    health = boto3.client('health', region_name='us-east-1')

    response = health.describe_events(
        filter={
            'regions': ['us-east-1', 'us-west-2'],
            'services': ['EC2', 'RDS', 'LAMBDA', 'DYNAMODB'],
            'eventStatusCodes': ['open', 'upcoming']
        }
    )

    for event in response.get('events', []):
        category = event.get('eventTypeCategory')
        service = event.get('service')
        status = event.get('statusCode')

        if category == 'issue':
            # send_alert() is a stand-in for your paging/notification hook
            send_alert(f"AWS {service} issue detected: {status}")

        print(f"{event['eventTypeCode']} - {status}")

# Run every 5 minutes (e.g., from a scheduled job)
check_aws_service_health()
7. Document and Test Your Disaster Recovery Plan
Everyone has a disaster recovery plan until disaster strikes. Regular testing reveals gaps:
- Monthly tabletop exercises: Walk through scenarios
- Quarterly DR drills: Actually fail over to secondary regions
- Annual full-scale tests: Simulate complete regional failure
- Post-incident reviews: Learn from real outages
DR Plan Checklist:
- RTO (Recovery Time Objective) defined for each service
- RPO (Recovery Point Objective) defined for each data store
- Runbooks documented and accessible
- Automated failover tested
- Manual failover procedures validated
- Communication plan established
- Third-party dependencies identified
- Data restoration tested
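One way to make "automated failover tested" concrete during a quarterly drill: temporarily invert the primary Route 53 health check so a healthy endpoint is reported as unhealthy, watch traffic shift to the secondary region, then revert. The health check ID below is a placeholder; run this inside a planned maintenance window.
import boto3

route53 = boto3.client('route53')

def force_failover(health_check_id, inverted):
    """Toggle 'Inverted' on the primary health check so Route 53 treats a
    healthy endpoint as unhealthy, forcing a controlled failover drill."""
    route53.update_health_check(
        HealthCheckId=health_check_id,
        Inverted=inverted,
    )

# Start the drill: primary appears unhealthy, traffic shifts to secondary
force_failover('PRIMARY_HEALTH_CHECK_ID', inverted=True)
# ... observe dashboards and verify the application in the secondary region ...
# End the drill: restore normal health evaluation
force_failover('PRIMARY_HEALTH_CHECK_ID', inverted=False)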
The Bigger Picture: What This Means for the Future
Cloud Concentration Isn't Decreasing
Despite regulatory scrutiny and public concern, cloud market concentration continues to increase. Why?
- Network effects: The more services AWS offers, the stickier it becomes
- Switching costs: Migrating from AWS is expensive and risky
- Innovation pace: The hyperscalers outpace smaller competitors
- Price competition: Volume discounts favor large deployments
The Resilience Paradox
Cloud providers promise resilience through redundancy, but create fragility through concentration. This is the fundamental paradox of modern infrastructure.
Traditional architecture:
- Many small single points of failure
- Localized impact when failures occur
- Difficult to manage at scale
Cloud architecture:
- Few large single points of failure
- Global impact when failures occur
- Easy to manage until failure happens
The Path Forward
The future of cloud resilience likely involves:
- Hybrid and multi-cloud becoming standard: Not just for risk mitigation, but as operational best practice
- Edge computing reducing dependence: Processing data closer to users reduces reliance on centralized cloud regions
- Open source alternatives gaining traction: Projects like Kubernetes enable cloud-agnostic architectures
- Regulatory requirements forcing diversification: Government workloads may be required to use multiple providers
- Cloud providers investing in resilience: Competition and regulation will drive continued infrastructure improvements
Conclusion: Embracing Realistic Expectations
The AWS outages of 2023 and 2025 weren't failures of technology—they were revelations of reality. Complex distributed systems fail. Scale introduces emergent problems. Automation can amplify issues as quickly as it solves them.
The lesson isn't that cloud computing is fundamentally flawed. It's that resilience isn't automatic—it must be engineered, tested, and continuously improved.
For organizations depending on cloud infrastructure:
- Accept that outages will happen
- Design systems that gracefully degrade
- Distribute risk across regions and providers
- Invest in observability and chaos engineering
- Test disaster recovery plans regularly
For cloud providers:
- Continue improving isolation between services
- Invest in chaos engineering at scale
- Enhance transparency during incidents
- Design for graceful degradation by default
For regulators:
- Balance innovation with systemic risk concerns
- Require transparency without stifling competition
- Encourage multi-cloud architectures for critical infrastructure
The internet's backbone is stronger than ever, but it's not unbreakable. Understanding its limitations is the first step toward building systems that can withstand them.
Additional Resources
AWS Official Documentation:
- AWS Post-Event Summaries
- AWS Well-Architected Framework - Reliability Pillar
- AWS Resilience Hub Documentation
Industry Reports:
- Synergy Research Group - Cloud Market Analysis
- Gartner Cloud Infrastructure Market Share Reports
- ThousandEyes AWS Outage Analysis (June 2023, October 2025)
Regulatory Documents:
- FTC Cloud Computing RFI (March 2023)
- UK CMA Cloud Services Market Investigation
- EU Digital Markets Act and Cloud Strategy
Technical Deep Dives:
- AWS Lambda June 2023 Post-Incident Report
- CNN - October 2025 AWS Outage Technical Analysis
- Various third-party postmortem analyses