AWS Outages 2023 and 2025: When the Internet Backbone Faltered
October 25, 2025
Amazon Web Services (AWS) powers an extraordinary portion of the internet's infrastructure. From streaming services and social media platforms to banking systems and government portals, AWS quietly underpins the digital experiences of billions of users worldwide. It's the invisible engine that keeps modern computing running—until it doesn't.
In the span of just over two years, AWS experienced two major outages that exposed the fragility lurking beneath the cloud's promised reliability. The first, in June 2023, stemmed from a software defect that had never been triggered before. The second, in October 2025, resulted from a race condition between automated systems. Both disrupted millions of users and reminded the world of a critical truth: even the most sophisticated infrastructure can fail, and when it does, the ripple effects are global.
This is the story of what happened, what went wrong, and what it means for the future of cloud computing.
June 13, 2023: The Lambda Capacity Crisis
What Happened
On June 13, 2023, at 11:49 AM PDT, AWS's US-EAST-1 region in Northern Virginia experienced a catastrophic failure that lasted 3 hours and 48 minutes. The outage affected over 104 AWS services, creating a cascading failure that disrupted major platforms and services across the internet.
Major organizations impacted included:
- The Boston Globe - Unable to publish digital content
- Southwest Airlines - Flight operations disrupted
- McDonald's mobile app - Order processing failed
- Taco Bell app - Service unavailable
- New York MTA - Transit information systems affected
- The Associated Press - News distribution interrupted
The Real Cause: A Hidden Software Defect
Contrary to early speculation about DNS issues, the root cause was far more subtle and technical. AWS's official post-incident report revealed that the outage stemmed from a latent software defect in AWS Lambda's capacity management subsystem.
Here's what happened under the hood:
Lambda's Frontend fleet is responsible for allocating execution environments for customer functions. As usage grew throughout the morning, the fleet reached an unprecedented capacity threshold—a level that had "never been reached within a single cell" in Lambda's operational history. When this threshold was crossed, a dormant bug activated.
The defect caused the system to allocate execution environments without properly utilizing them. Think of it like a restaurant that keeps seating customers at tables but never sends waiters to serve them. The resources existed, but the coordination system broke down. This created a cascading resource exhaustion that rippled through Lambda and into dependent services.
The bug had existed in the codebase for an unknown period, quietly waiting for the right conditions to manifest. It was a time bomb that finally went off when Lambda's growth trajectory intersected with a specific capacity threshold.
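To make that failure mode concrete, here is a deliberately simplified toy model of this class of bug; it is purely illustrative and not AWS's actual code. Past an untested threshold, environments keep getting allocated but are never handed to the dispatcher, so capacity is consumed without serving anyone.
UNTESTED_THRESHOLD = 1000  # a capacity level this code path had never crossed before

class ToyCapacityManager:
    """Toy model only: allocation and utilization drift apart past a threshold."""
    def __init__(self):
        self.allocated = []  # environments reserved from the fleet
        self.ready = []      # environments the dispatcher can actually use

    def allocate_environment(self, env_id):
        self.allocated.append(env_id)
        if len(self.allocated) <= UNTESTED_THRESHOLD:
            self.ready.append(env_id)
        # Latent defect: beyond the threshold the hand-off silently never
        # happens, so allocation keeps growing while usable capacity stalls.

mgr = ToyCapacityManager()
for i in range(1500):
    mgr.allocate_environment(i)

print(f"allocated={len(mgr.allocated)}, usable={len(mgr.ready)}")
# allocated=1500, usable=1000: the resources exist, but requests still starve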
The Cascade Effect
What made this outage particularly severe was the interconnected nature of AWS services. When Lambda struggled, it affected:
- API Gateway - Unable to trigger Lambda functions
- DynamoDB - Stream processing failures (which initially caused confusion about DNS)
- S3 - Event notifications delayed or failed
- Step Functions - Workflow orchestration disrupted
- CloudWatch - Monitoring and logging impaired
This is the reality of modern microservices architecture: failures don't stay isolated. They propagate.
AWS's Response
AWS engineers identified the issue within the first hour and implemented emergency mitigation by 2:45 PM PDT. The fix involved:
- Implementing immediate throttling to prevent new Lambda invocations from hitting the buggy code path
- Deploying emergency capacity management logic
- Gradually draining affected execution environments
- Rolling out a permanent fix to prevent recurrence
Full service restoration was achieved by 3:37 PM PDT, nearly four hours after the initial incident.
October 20, 2025: The DNS Race Condition
What Happened
On October 20, 2025, AWS suffered another major outage in US-EAST-1, this time lasting between 7 and 15 hours depending on the service. The disruption was even more widespread than in 2023, generating 6.5 million outage reports on Downdetector.
Services affected included:
- Reddit - Complete service unavailability
- Snapchat - Messaging and content delivery failed
- Canva - Design platform inaccessible
- UK banks - Including Lloyds, Halifax, and others experiencing transaction processing issues
- Alexa - Voice assistant functionality degraded
- Ring - Video doorbell services disrupted
- Amazon's own retail site - Intermittent availability issues
The Real Cause: A DynamoDB DNS Race Condition
The October 2025 outage had a different technical root cause. DNS was indeed involved, but not in the way many initial reports suggested.
The problem originated in DynamoDB's internal infrastructure, not in Route 53 (AWS's customer-facing DNS service). Two automated systems simultaneously attempted to update the same internal DNS entry for DynamoDB API endpoints. This created a race condition where both systems thought they were authoritative for the update.
The result? An empty DNS record.
When services tried to connect to DynamoDB, they received no address information. DynamoDB is foundational to many AWS services, so this single point of failure cascaded through the infrastructure like dominoes:
- Services couldn't reach their data stores
- Health checks failed across the board
- Automated recovery systems kicked in, but had nowhere to route traffic
- 113 AWS services were impacted either directly or indirectly
The Recovery Complication
What turned this from a bad outage into a catastrophic one was what happened during recovery. When AWS engineers resolved the DNS race condition and DynamoDB came back online, EC2 (Elastic Compute Cloud) attempted to restart all affected instances simultaneously.
This created a "thundering herd" problem—imagine a stadium full of people all trying to exit through a single door at once. The sudden load overwhelmed DynamoDB again, extending the outage significantly. Engineers had to implement gradual, throttled restarts to bring services back safely.
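The standard defense against a thundering herd is to bring load back in small batches, with pauses and jitter so recovering clients don't stampede the dependency again. A minimal sketch of that idea (illustrative only; restart_instance is a hypothetical hook, not an AWS API):
import random
import time

def throttled_restart(instance_ids, batch_size=50, delay_seconds=30):
    """Restart instances in small batches with jittered pauses so the
    recovering dependency isn't hit by everything at once."""
    for i in range(0, len(instance_ids), batch_size):
        batch = instance_ids[i:i + batch_size]
        for instance_id in batch:
            restart_instance(instance_id)  # hypothetical restart hook
        # Jitter spreads the batches out so they don't line up into a new herd
        time.sleep(delay_seconds + random.uniform(0, delay_seconds / 2))

def restart_instance(instance_id):
    print(f"restarting {instance_id}")

throttled_restart([f"i-{n:05d}" for n in range(200)], batch_size=50, delay_seconds=1)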
Lessons in Complexity
The October 2025 outage revealed how complex distributed systems can fail in unexpected ways:
- Automation can amplify failures: The race condition occurred between two automated systems, neither of which had proper conflict resolution.
- Recovery can be harder than mitigation: Fixing the DNS issue was quick; safely restarting millions of instances took hours.
- Single points of failure persist: Despite redundancy, DynamoDB's internal DNS became a critical chokepoint.
The Pattern: Centralized Fragility
Both outages expose the same fundamental challenge: over-centralization in cloud infrastructure.
The US-EAST-1 Problem
The US-EAST-1 region (Northern Virginia) is AWS's oldest and most trafficked region. It handles an extraordinary volume of:
- DNS requests
- Compute instances
- Inter-service API calls
- Legacy workloads that haven't been migrated
Many organizations route mission-critical workloads through US-EAST-1 due to:
- Legacy configurations - Systems built years ago when fewer regions existed
- Latency optimization - Proximity to major internet exchange points
- Regional service dependencies - Some AWS services launched in US-EAST-1 first
When this region experiences issues, the impact is disproportionately global.
The "It's Always DNS" Syndrome
The October 2025 outage reinforced the industry cliché: "It's always DNS."
DNS acts as the internet's address book. When DNS fails:
- Applications can't find their databases
- Services can't locate their dependencies
- Traffic can't route to healthy instances
- Even functioning servers become unreachable
It doesn't matter if your application code is perfect, your servers are running, and your data is intact. If DNS can't resolve your endpoints, you're offline.
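Even a trivial resolution check illustrates the point; the hostname below is just an example.
import socket

def can_resolve(hostname):
    """Return True only if DNS hands back at least one address for the host."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        # The server behind this name may be perfectly healthy,
        # but without an address no client can reach it.
        return False

print(can_resolve("dynamodb.us-east-1.amazonaws.com"))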
What AWS Has Actually Done to Improve Resilience
Between 2023 and 2025, AWS made genuine investments in infrastructure resilience. Here's what actually happened (with correct names and dates):
1. Geographic Expansion (Verified)
AWS expanded from 26 regions in 2021 to 33+ regions by late 2025:
- Malaysia Region - Launched August 22, 2024 ($6.2 billion investment)
- Thailand Region - Launched January 8, 2025 ($5 billion investment)
- New Zealand Region - Launched August 29, 2025 (NZ$7.5 billion investment)
- Spain Region - Launched November 15, 2022 (predating the 2023-2025 window)
These new regions provide geographic redundancy and reduce dependency on US-EAST-1 for international customers.
2. Route 53 Profiles (Not "Multi-Network DNS")
In 2024, AWS announced Route 53 Profiles, which unifies DNS management across Virtual Private Clouds (VPCs) and accounts. This simplifies multi-region DNS configurations and reduces configuration errors—though it wouldn't have prevented the October 2025 outage, which occurred in internal infrastructure.
3. Enhanced Health Dashboard (Actually from 2022)
AWS unified its Service Health Dashboard and Personal Health Dashboard in February 2022, providing better visibility into service status and personalized impact assessments. This wasn't a 2023-2025 improvement, but it has helped customers respond faster to outages.
4. Generative AI Observability for CloudWatch (Not "AI-Assisted Monitoring")
In October 2025, AWS announced Generative AI observability for Amazon CloudWatch. This is tooling for monitoring AI/ML applications, not AI that assists with general infrastructure monitoring. It's a valuable tool for a specific use case, but not the broad "AI-Assisted Monitoring" sometimes described.
5. Ongoing Infrastructure Improvements
AWS has invested in:
- Automated deployment guardrails to prevent configuration errors
- Enhanced chaos engineering testing
- Improved isolation between service control planes
- Better throttling mechanisms to prevent cascading failures
These improvements are real and meaningful, even if they haven't prevented all outages.
Regulatory Scrutiny: The World Takes Notice
The repeated AWS outages have intensified regulatory examination of cloud concentration risks.
United States: FTC Investigation (Verified)
On March 22, 2023, the Federal Trade Commission issued an official Request for Information examining cloud computing business practices. The FTC specifically investigated:
- The impact of enterprise reliance on a small number of cloud providers
- Competitive dynamics in cloud computing
- Potential security risks from concentration
- Single points of failure in critical infrastructure
The FTC received 102 public comments and published findings in November 2023. The investigation continues, with ongoing examination of:
- Software licensing practices
- Egress fees that lock customers in
- Minimum spend contracts
- Systemic risk to digital commerce and national security
United Kingdom: Sovereign Cloud Push (Verified)
The UK has been particularly active in addressing cloud dependency:
- September 2024: UK designated data centers as Critical National Infrastructure
- July 2025: Competition and Markets Authority concluded that Microsoft and AWS require "targeted interventions"
- August 2025: Microsoft admitted it cannot guarantee sovereignty of Office 365 data stored in UK data centers, acknowledging that personnel from 105 countries (including China) can access it
A survey found that 83% of UK IT leaders worry about geopolitical impacts on data access. The government is exploring options for government-specific cloud infrastructure.
European Union: Cloud Sovereignty Concerns (Partially Verified)
While no "2025 Cloud Infrastructure Resilience Report" from the European Commission exists, the EU has taken meaningful steps:
- 2025 Strategic Foresight Report addresses cloud dependency as a strategic risk
- Cloud and AI Development Act in preparation (expected Q4 2025/Q1 2026)
- Ongoing emphasis on digital sovereignty and data residency
The EU's Digital Single Market strategy includes provisions to reduce dependency on non-European cloud providers, though implementation remains gradual.
The Global Context
Beyond the US, UK, and EU:
- China enforces data localization through the Data Security Law
- India requires payment data and certain government data to remain within national borders
- Australia has strengthened Critical Infrastructure Protection regulations
- Brazil is developing sovereign cloud requirements
The pattern is clear: governments worldwide are reconsidering their dependence on a handful of global cloud providers.
The Market Reality: Concentration Continues
Despite concerns, cloud market concentration remains high. According to multiple industry sources:
- Synergy Research Group: AWS, Microsoft Azure, and Google Cloud control 63-68% of the global cloud market, depending on segment:
  - 63% of all cloud infrastructure services
  - 68% of public cloud IaaS/PaaS
- Gartner: 72% of the IaaS-only market
This concentration creates inherent systemic risk. When one provider experiences issues, millions of organizations and billions of users feel the impact.
Practical Lessons: Building for Resilience
For organizations depending on cloud infrastructure, these outages provide critical lessons.
1. Multi-Region Is Not Optional
Critical workloads must span multiple AWS regions—or even multiple cloud providers. AWS provides tools to support this:
- Route 53 Health Checks: Automatically route traffic away from unhealthy endpoints
- Amazon RDS Multi-AZ: Synchronous replication across availability zones
- S3 Cross-Region Replication: Automatic data replication for disaster recovery
- AWS Backup: Centralized backup across regions
Example architecture:
Primary: US-EAST-1 (Virginia)
Secondary: US-WEST-2 (Oregon)
Tertiary: EU-WEST-1 (Ireland)
Failover: Automatic via Route 53 health checks
Data Sync: Continuous via S3 CRR and RDS replication
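As a sketch of the failover piece, the snippet below creates PRIMARY and SECONDARY failover A records via the Route 53 API; the zone ID, health check ID, and IP addresses are placeholders you would supply.
import boto3

route53 = boto3.client('route53')

def create_failover_records(zone_id, domain, primary_ip, secondary_ip, health_check_id):
    """Create PRIMARY/SECONDARY failover A records so Route 53 routes
    traffic to the secondary region when the primary health check fails."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            'Changes': [
                {
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': domain,
                        'Type': 'A',
                        'SetIdentifier': 'primary-us-east-1',
                        'Failover': 'PRIMARY',
                        'TTL': 60,
                        'ResourceRecords': [{'Value': primary_ip}],
                        'HealthCheckId': health_check_id,
                    },
                },
                {
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': domain,
                        'Type': 'A',
                        'SetIdentifier': 'secondary-us-west-2',
                        'Failover': 'SECONDARY',
                        'TTL': 60,
                        'ResourceRecords': [{'Value': secondary_ip}],
                    },
                },
            ]
        },
    )

create_failover_records('ZONE_ID', 'api.yourcompany.com', '203.0.113.10',
                        '198.51.100.20', 'HEALTH_CHECK_ID')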
2. Multi-Cloud Strategies Are Growing
Many enterprises now distribute workloads across multiple providers:
- AWS for compute and storage
- Microsoft Azure for enterprise applications and Active Directory integration
- Google Cloud Platform for data analytics and AI/ML workloads
This approach is expensive and complex—managing multiple clouds requires:
- Different APIs and tooling
- Multiple vendor relationships
- Diverse security models
- Cross-cloud networking
But incidents like these justify the investment. When AWS goes down, your Azure workloads keep running.
3. Test Your DNS Dependencies
The October 2025 outage proved that DNS failures are uniquely destructive. Organizations should:
- Map all DNS dependencies in their architecture
- Implement DNS health monitoring
- Configure multiple DNS providers (e.g., Route 53 + Cloudflare)
- Test DNS failover regularly
- Use DNS caching strategically
Python example for DNS monitoring:
import boto3

def monitor_dns_health():
    """Monitor DNS resolution for critical endpoints"""
    route53 = boto3.client('route53')

    critical_endpoints = [
        'api.yourcompany.com',
        'db.yourcompany.com',
        'auth.yourcompany.com'
    ]

    for endpoint in critical_endpoints:
        try:
            # Ask Route 53 how it would answer a query for this record
            answer = route53.test_dns_answer(
                HostedZoneId='YOUR_ZONE_ID',
                RecordName=endpoint,
                RecordType='A'
            )
            # A non-NOERROR response or an empty record set means resolution is broken
            if answer['ResponseCode'] != 'NOERROR' or not answer['RecordData']:
                alert_oncall(f"DNS failure for {endpoint}")  # your paging hook
        except Exception as e:
            log_error(f"DNS check failed for {endpoint}: {e}")  # your logging hook

monitor_dns_health()
4. Practice Chaos Engineering
Chaos engineering helps expose hidden dependencies before they cause outages. AWS provides tools:
- AWS Fault Injection Simulator: Inject controlled failures into your infrastructure
- AWS Resilience Hub: Assess and improve application resilience
- Third-party tools: Gremlin, Chaos Monkey, LitmusChaos
Example chaos experiment (illustrative pseudo-configuration, not literal AWS FIS syntax):
# Simulate DynamoDB unavailability
experiment:
  name: "DynamoDB Outage Simulation"
  actions:
    - type: "aws:dynamodb:deny-access"  # illustrative action name
      targets:
        - table: "critical-data-table"
      duration: "PT10M"  # 10 minutes
  hypothesis: "Application gracefully degrades with cached data"
  success_criteria:
    - "Error rate < 5%"
    - "P99 latency < 2000ms"
    - "No cascading failures to dependent services"
5. Implement Graceful Degradation
Applications should continue functioning (in reduced capacity) when dependencies fail:
- Circuit breakers: Stop calling failing services
- Fallback strategies: Use cached data when databases are unavailable
- Feature flags: Disable non-critical features during incidents
- Queue-based async processing: Defer work that can wait
Example circuit breaker pattern:
from circuitbreaker import circuit, CircuitBreakerError
import requests

@circuit(failure_threshold=5, recovery_timeout=60)
def call_external_api():
    """Call external service with circuit breaker protection"""
    response = requests.get('https://api.partner.com/data')
    if response.status_code != 200:
        raise Exception("API call failed")
    return response.json()

def get_data_with_fallback():
    """Get data with fallback to cache"""
    try:
        return call_external_api()
    except CircuitBreakerError:
        # Circuit is open, use cached data
        return get_from_cache()  # your cache lookup
    except Exception as e:
        log_error(f"API call failed: {e}")  # your logging hook
        return get_from_cache()
6. Invest in Observability
You can't fix what you can't see. Modern observability requires:
- Real-time monitoring: CloudWatch, Datadog, Grafana, Prometheus
- Distributed tracing: AWS X-Ray, Jaeger, Honeycomb
- Log aggregation: CloudWatch Logs Insights, Elasticsearch, Splunk
- Custom metrics: Track business KPIs, not just infrastructure metrics
Example: AWS Health API integration
import boto3

def check_aws_service_health():
    """Monitor AWS service health in real-time"""
    # The AWS Health API uses a global endpoint in us-east-1 and requires a
    # Business, Enterprise On-Ramp, or Enterprise support plan.
    health = boto3.client('health', region_name='us-east-1')

    response = health.describe_events(
        filter={
            'regions': ['us-east-1', 'us-west-2'],
            'services': ['EC2', 'RDS', 'LAMBDA', 'DYNAMODB'],
            'eventStatusCodes': ['open', 'upcoming']
        }
    )

    for event in response.get('events', []):
        category = event.get('eventTypeCategory')
        service = event.get('service')
        status = event.get('statusCode')

        if category == 'issue':
            # send_alert() is a stand-in for your paging/notification hook
            send_alert(f"AWS {service} issue detected: {status}")

        print(f"{event['eventTypeCode']} - {status}")

# Run every 5 minutes (e.g., from a scheduled job)
check_aws_service_health()
7. Document and Test Your Disaster Recovery Plan
Everyone has a disaster recovery plan until disaster strikes. Regular testing reveals gaps:
- Monthly tabletop exercises: Walk through scenarios
- Quarterly DR drills: Actually fail over to secondary regions
- Annual full-scale tests: Simulate complete regional failure
- Post-incident reviews: Learn from real outages
DR Plan Checklist:
- RTO (Recovery Time Objective) defined for each service
- RPO (Recovery Point Objective) defined for each data store
- Runbooks documented and accessible
- Automated failover tested
- Manual failover procedures validated
- Communication plan established
- Third-party dependencies identified
- Data restoration tested
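One way to make "automated failover tested" concrete during a quarterly drill: temporarily invert the primary Route 53 health check so a healthy endpoint is reported as unhealthy, watch traffic shift to the secondary region, then revert. The health check ID below is a placeholder; run this inside a planned maintenance window.
import boto3

route53 = boto3.client('route53')

def force_failover(health_check_id, inverted):
    """Toggle 'Inverted' on the primary health check so Route 53 treats a
    healthy endpoint as unhealthy, forcing a controlled failover drill."""
    route53.update_health_check(
        HealthCheckId=health_check_id,
        Inverted=inverted,
    )

# Start the drill: primary appears unhealthy, traffic shifts to secondary
force_failover('PRIMARY_HEALTH_CHECK_ID', inverted=True)
# ... observe dashboards and verify the application in the secondary region ...
# End the drill: restore normal health evaluation
force_failover('PRIMARY_HEALTH_CHECK_ID', inverted=False)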
The Bigger Picture: What This Means for the Future
Cloud Concentration Isn't Decreasing
Despite regulatory scrutiny and public concern, cloud market concentration continues to increase. Why?
- Network effects: The more services AWS offers, the stickier it becomes
- Switching costs: Migrating from AWS is expensive and risky
- Innovation pace: The hyperscalers outpace smaller competitors
- Price competition: Volume discounts favor large deployments
The Resilience Paradox
Cloud providers promise resilience through redundancy, but create fragility through concentration. This is the fundamental paradox of modern infrastructure.
Traditional architecture:
- Many small single points of failure
- Localized impact when failures occur
- Difficult to manage at scale
Cloud architecture:
- Few large single points of failure
- Global impact when failures occur
- Easy to manage until failure happens
The Path Forward
The future of cloud resilience likely involves:
- Hybrid and multi-cloud becoming standard: Not just for risk mitigation, but as operational best practice
- Edge computing reducing dependence: Processing data closer to users reduces reliance on centralized cloud regions
- Open source alternatives gaining traction: Projects like Kubernetes enable cloud-agnostic architectures
- Regulatory requirements forcing diversification: Government workloads may be required to use multiple providers
- Cloud providers investing in resilience: Competition and regulation will drive continued infrastructure improvements
Conclusion: Embracing Realistic Expectations
The AWS outages of 2023 and 2025 weren't failures of technology—they were revelations of reality. Complex distributed systems fail. Scale introduces emergent problems. Automation can amplify issues as quickly as it solves them.
The lesson isn't that cloud computing is fundamentally flawed. It's that resilience isn't automatic—it must be engineered, tested, and continuously improved.
For organizations depending on cloud infrastructure:
- Accept that outages will happen
- Design systems that gracefully degrade
- Distribute risk across regions and providers
- Invest in observability and chaos engineering
- Test disaster recovery plans regularly
For cloud providers:
- Continue improving isolation between services
- Invest in chaos engineering at scale
- Enhance transparency during incidents
- Design for graceful degradation by default
For regulators:
- Balance innovation with systemic risk concerns
- Require transparency without stifling competition
- Encourage multi-cloud architectures for critical infrastructure
The internet's backbone is stronger than ever, but it's not unbreakable. Understanding its limitations is the first step toward building systems that can withstand them.
Additional Resources
AWS Official Documentation:
- AWS Post-Event Summaries
- AWS Well-Architected Framework - Reliability Pillar
- AWS Resilience Hub Documentation
Industry Reports:
- Synergy Research Group - Cloud Market Analysis
- Gartner Cloud Infrastructure Market Share Reports
- ThousandEyes AWS Outage Analysis (June 2023, October 2025)
Regulatory Documents:
- FTC Cloud Computing RFI (March 2023)
- UK CMA Cloud Services Market Investigation
- EU Digital Markets Act and Cloud Strategy
Technical Deep Dives:
- AWS Lambda June 2023 Post-Incident Report
- CNN - October 2025 AWS Outage Technical Analysis
- Various third-party postmortem analyses