Architecture Patterns & System Design
High Availability & Disaster Recovery Patterns
Designing for failure is a core competency for cloud architects. This lesson covers patterns for building resilient systems that meet availability and recovery objectives.
Availability Concepts
Measuring Availability
| Nines | Availability | Downtime/Year | Downtime/Month |
|---|---|---|---|
| Two 9s | 99% | 3.65 days | 7.3 hours |
| Three 9s | 99.9% | 8.77 hours | 43.8 minutes |
| Four 9s | 99.99% | 52.6 minutes | 4.4 minutes |
| Five 9s | 99.999% | 5.26 minutes | 26 seconds |
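The yearly figures follow directly from the availability fraction. A quick Python sketch (assuming a 365.25-day year) reproduces the table's downtime-per-year column:

# Downtime implied by an availability target (matches the table above).
for availability in (0.99, 0.999, 0.9999, 0.99999):
    minutes_per_year = (1 - availability) * 365.25 * 24 * 60
    print(f"{availability:.3%}: {minutes_per_year:.2f} minutes/year")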
SLA Composition
For services in series:
Total Availability = Service1 × Service2 × ... × ServiceN
Example:
Load Balancer (99.99%) × App Server (99.9%) × Database (99.95%)
= 0.9999 × 0.999 × 0.9995
= 99.84%
For services in parallel:
Total Availability = 1 - (1 - Service1) × (1 - Service2)
Example: Two databases (99.95% each):
= 1 - (1 - 0.9995)²
= 1 - 0.00000025
= 99.999975%
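The same composition rules are easy to check in code. A minimal sketch; the SLA figures are just the examples above:

from functools import reduce

def serial_availability(*availabilities):
    # Services in series: multiply the individual availabilities.
    return reduce(lambda total, a: total * a, availabilities, 1.0)

def parallel_availability(*availabilities):
    # Services in parallel: unavailable only if every instance is down.
    return 1 - reduce(lambda total, a: total * (1 - a), availabilities, 1.0)

print(serial_availability(0.9999, 0.999, 0.9995))   # ~0.99840 -> 99.84%
print(parallel_availability(0.9995, 0.9995))        # 0.99999975 -> 99.999975%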
Interview Question: Achieving 99.99%
Q: "How would you design a web application to achieve 99.99% availability?"
A: Multi-AZ architecture with redundancy at every layer:
Architecture:
Route 53 (100% SLA)
↓
CloudFront (99.9%)
↓
ALB - Multi-AZ (99.99%)
↓
EC2 ASG - Multi-AZ (99.99% with redundancy)
↓
Aurora Multi-AZ (99.99%)
Key Practices:
1. No single points of failure
2. Auto-scaling for capacity
3. Health checks and auto-replacement
4. Stateless application tier
5. Database failover < 30 seconds
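To make the health-check practice (item 3) concrete, here is a hedged boto3 sketch that creates a Route 53 health check against the application's public endpoint; the domain name and /health path are hypothetical examples:

import uuid
import boto3

route53 = boto3.client('route53')

# Hypothetical example: an HTTPS health check polling the ALB every 10 seconds.
route53.create_health_check(
    CallerReference=str(uuid.uuid4()),   # idempotency token
    HealthCheckConfig={
        'Type': 'HTTPS',
        'FullyQualifiedDomainName': 'app.example.com',
        'ResourcePath': '/health',
        'Port': 443,
        'RequestInterval': 10,           # seconds between checks
        'FailureThreshold': 3,           # ~30 seconds to mark unhealthy
    },
)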
High Availability Patterns
Active-Active
Multiple instances serve traffic simultaneously.
┌─────────────────┐
│ Load Balancer │
└────────┬────────┘
┌───────────┼───────────┐
▼ ▼ ▼
[App AZ-a] [App AZ-b] [App AZ-c]
│ │ │
└───────────┴───────────┘
▼
[Database Cluster]
Characteristics:
- All instances handle requests
- Load distributed across instances
- Immediate failover (already serving)
- Higher cost (all resources active)
Active-Passive
Primary handles traffic; standby waits.
[Active Region - us-east-1] [Passive Region - us-west-2]
│ │
[Application Servers] [Standby Servers]
│ │
[Primary Database] ──── Replication ───► [Replica]
Characteristics:
- Standby consumes resources but doesn't serve traffic until failover
- Failover requires promotion (minutes)
- Lower cost than active-active
- RPO depends on replication lag
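Because RPO is bounded by replication lag, it is worth measuring and alerting on the lag itself. A hedged sketch using the RDS ReplicaLag metric in CloudWatch; the replica identifier is a hypothetical example:

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-west-2')

# Average replication lag (seconds) over the last 10 minutes for the standby.
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/RDS',
    MetricName='ReplicaLag',
    Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'app-db-replica'}],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=10),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=['Average'],
)
for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'], 'seconds behind primary')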
Interview Question: Active-Active vs Active-Passive
Q: "When would you choose active-passive over active-active?"
A: Consider these factors:
| Factor | Active-Active | Active-Passive |
|---|---|---|
| RTO Required | < 1 minute | Minutes acceptable |
| Cost Sensitivity | Can afford 2x capacity | Need cost optimization |
| Data Consistency | Can handle replication lag | Need strong consistency |
| Traffic Distribution | Benefits from geographic distribution | Centralized traffic OK |
| Complexity Tolerance | Can manage multi-region state | Prefer simplicity |
Disaster Recovery Strategies
DR Classification
| Strategy | RTO | RPO | Cost | Example |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Restore from S3 |
| Pilot Light | Minutes | Minutes | $$ | Minimal running infrastructure |
| Warm Standby | Minutes | Seconds | $$$ | Scaled-down duplicate |
| Hot Standby | Seconds | Near zero | $$$$ | Full duplicate |
Pilot Light Pattern
Keep critical systems running minimally:
Primary Region: DR Region:
├── Full App Servers ├── (None - scale from zero)
├── Full Database ├── Database Replica (running)
└── Full Cache └── AMIs ready to launch
Recovery Steps:
- Database replica is already synced
- Launch app servers from AMIs
- Scale to required capacity
- Update DNS
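Those steps map naturally onto a small automation script. A hedged sketch with boto3; the replica, Auto Scaling group, hosted zone, and DNS names are hypothetical examples:

import boto3

rds = boto3.client('rds', region_name='us-west-2')
autoscaling = boto3.client('autoscaling', region_name='us-west-2')
route53 = boto3.client('route53')

# 1. Promote the already-synced read replica to a standalone primary.
rds.promote_read_replica(DBInstanceIdentifier='app-db-replica')

# 2. Scale the dormant Auto Scaling group (built from pre-baked AMIs) from zero.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='app-asg-dr',
    MinSize=2, DesiredCapacity=4, MaxSize=10,
)

# 3. Point DNS at the DR load balancer.
route53.change_resource_record_sets(
    HostedZoneId='Z123EXAMPLE',
    ChangeBatch={'Changes': [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': 'app.example.com', 'Type': 'CNAME', 'TTL': 60,
            'ResourceRecords': [{'Value': 'dr-alb.us-west-2.elb.amazonaws.com'}],
        },
    }]},
)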
Warm Standby Pattern
Reduced-capacity duplicate:
Primary (100% capacity): DR (20% capacity):
├── 10 App Servers ├── 2 App Servers
├── db.r6g.4xlarge ├── db.r6g.large
└── Full infrastructure └── Minimal infrastructure
Recovery Steps:
- Scale up DR app servers
- Promote database replica (or resize)
- Redirect traffic (DNS or Route 53 health check)
Interview Question: DR Strategy Selection
Q: "A healthcare company needs DR for their patient records system. RTO: 15 minutes, RPO: 5 minutes. Which strategy?"
A: Warm Standby is appropriate:
Why:
- 15-minute RTO allows time for scaling
- 5-minute RPO achievable with async replication (< 1 min lag typical)
- The 15-minute RTO means instant failover isn't required (unlike real-time trading)
- Cost-effective for the requirements
Implementation:
Primary (us-east-1):
- Aurora Primary (Multi-AZ)
- ECS Fargate (10 tasks)
DR (us-west-2):
- Aurora Read Replica (promoted on failover)
- ECS Fargate (2 tasks, scaled on failover)
Monitoring:
- Route 53 health checks
- Automated runbook in Systems Manager
- Regular DR testing (quarterly)
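The failover runbook itself can be a short script, in practice wrapped in a Systems Manager automation document. A hedged sketch; the cluster, service, and replica names are hypothetical:

import boto3

rds = boto3.client('rds', region_name='us-west-2')
ecs = boto3.client('ecs', region_name='us-west-2')

# Promote the cross-region Aurora replica cluster to a writable primary.
rds.promote_read_replica_db_cluster(DBClusterIdentifier='patient-records-dr')

# Scale the warm Fargate service from 2 tasks to full production capacity.
ecs.update_service(cluster='patient-records-dr',
                   service='records-api',
                   desiredCount=10)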
Resilience Patterns
Circuit Breaker
Prevent cascading failures:
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected immediately."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = timeout            # seconds to stay open before probing
        self.state = 'CLOSED'             # CLOSED, OPEN, HALF_OPEN
        self.opened_at = None

    def call(self, func):
        if self.state == 'OPEN':
            if time.time() - self.opened_at > self.timeout:
                self.state = 'HALF_OPEN'  # allow a single trial call
            else:
                raise CircuitOpenError()
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def on_failure(self):
        self.failure_count += 1
        if self.state == 'HALF_OPEN' or self.failure_count >= self.threshold:
            self.state = 'OPEN'
            self.opened_at = time.time()
Bulkhead Pattern
Isolate failures to prevent total system failure:
Thread Pool Bulkhead:
├── Pool 1: Payment Service (10 threads)
├── Pool 2: Inventory Service (10 threads)
└── Pool 3: Shipping Service (5 threads)
If the Payment Service exhausts its threads, Inventory and Shipping continue functioning.
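In Python, the same idea can be sketched with one bounded ThreadPoolExecutor per downstream dependency; pool sizes mirror the example above, and the service functions passed in are placeholders:

from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream service.
pools = {
    'payment':   ThreadPoolExecutor(max_workers=10),
    'inventory': ThreadPoolExecutor(max_workers=10),
    'shipping':  ThreadPoolExecutor(max_workers=5),
}

def call_in_bulkhead(service, func, *args, timeout=5):
    # A slow or failing dependency can only exhaust its own pool;
    # the other pools keep accepting work.
    future = pools[service].submit(func, *args)
    return future.result(timeout=timeout)   # also bound the caller's wait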
Retry with Exponential Backoff
Handle transient failures:
import random
import time

class TransientError(Exception):
    """Placeholder for errors worth retrying (e.g., throttling, timeouts)."""

def retry_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise                      # retries exhausted
            # Exponential backoff plus jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
Interview Question: Resilience Strategy
Q: "Your service calls three downstream APIs. How do you make it resilient?"
A: Implement layered resilience:
1. Timeouts: Fail fast (1-5 seconds)
2. Circuit Breaker: Stop calling failing services
3. Retry: With exponential backoff + jitter
4. Bulkhead: Isolated thread/connection pools
5. Fallback: Graceful degradation
Configuration Example:
Payment API:
- Timeout: 3s
- Circuit: Open after 5 failures, 30s recovery
- Retry: 3 attempts, exponential backoff
- Fallback: Queue for later processing
Inventory API:
- Timeout: 2s
- Circuit: Open after 3 failures, 15s recovery
- Retry: 2 attempts
- Fallback: Return cached inventory
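Putting the layers together for the Payment API above, a hedged sketch: the endpoint URL and payload are hypothetical, requests stands in for your HTTP client, and CircuitBreaker, CircuitOpenError, TransientError, and retry_with_backoff are the pieces defined earlier:

import queue
import requests

payment_breaker = CircuitBreaker(failure_threshold=5, timeout=30)
deferred_payments = queue.Queue()   # stand-in for a durable queue such as SQS

def charge(payload):
    def do_request():
        try:
            resp = requests.post('https://payments.internal/charge',
                                 json=payload, timeout=3)   # fail fast: 3s timeout
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            raise TransientError() from exc                  # mark as retryable

    try:
        # Circuit breaker wraps each attempt; retries use backoff + jitter.
        return retry_with_backoff(lambda: payment_breaker.call(do_request),
                                  max_retries=3)
    except (TransientError, CircuitOpenError):
        deferred_payments.put(payload)                       # fallback: queue for later
        return {'status': 'pending'}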
Architecture Principle: Design for failure. Every external dependency will eventually fail. Your system's resilience determines user experience during failures.
This concludes the Architecture Patterns module. Test your knowledge with the module quiz.