Architecture Patterns & System Design

High Availability & Disaster Recovery Patterns

Designing for failure is a core competency for cloud architects. This lesson covers patterns for building resilient systems that meet availability and recovery objectives.

Availability Concepts

Measuring Availability

Nines      Availability   Downtime/Year   Downtime/Month
Two 9s     99%            3.65 days       7.3 hours
Three 9s   99.9%          8.77 hours      43.8 minutes
Four 9s    99.99%         52.6 minutes    4.4 minutes
Five 9s    99.999%        5.26 minutes    26 seconds
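
These figures follow directly from the availability percentage. A quick sketch for reproducing them (the downtime_hours helper below is illustrative, not a library function):

def downtime_hours(availability_pct, period_hours):
    # Fraction of the period the service is allowed to be unavailable.
    return (1 - availability_pct / 100) * period_hours

for pct in (99, 99.9, 99.99, 99.999):
    yearly = downtime_hours(pct, 365 * 24)     # hours per year
    monthly = downtime_hours(pct, 730)         # hours per ~30.4-day month
    print(f"{pct}%: {yearly:.2f} h/year, {monthly * 60:.1f} min/month")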

SLA Composition

For services in series:

Total Availability = Service1 × Service2 × ... × ServiceN

Example:
  Load Balancer (99.99%) × App Server (99.9%) × Database (99.95%)
  = 0.9999 × 0.999 × 0.9995
  = 99.84%

For services in parallel:

Total Availability = 1 - (1 - Service1) × (1 - Service2)

Example: Two databases (99.95% each):
  = 1 - (1 - 0.9995)²
  = 1 - 0.00000025
  = 99.999975%
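
Both formulas are easy to sanity-check in a few lines of Python (a minimal sketch; the function names are illustrative):

def series(*availabilities):
    # Dependencies in series: every component must be up.
    total = 1.0
    for a in availabilities:
        total *= a
    return total

def parallel(*availabilities):
    # Redundant components in parallel: down only if every copy is down.
    down = 1.0
    for a in availabilities:
        down *= (1 - a)
    return 1 - down

print(series(0.9999, 0.999, 0.9995))   # ~0.9984     -> 99.84%
print(parallel(0.9995, 0.9995))        # 0.99999975  -> 99.999975%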

Interview Question: Achieving 99.99%

Q: "How would you design a web application to achieve 99.99% availability?"

A: Multi-AZ architecture with redundancy at every layer:

Architecture:
  Route 53 (100% SLA)
  CloudFront (99.9%)
  ALB - Multi-AZ (99.99%)
  EC2 ASG - Multi-AZ (99.99% with redundancy)
  Aurora Multi-AZ (99.99%)

Key Practices:
1. No single points of failure
2. Auto-scaling for capacity
3. Health checks and auto-replacement
4. Stateless application tier
5. Database failover < 30 seconds

High Availability Patterns

Active-Active

Multiple instances serve traffic simultaneously.

          ┌─────────────────┐
          │  Load Balancer  │
          └────────┬────────┘
       ┌───────────┼───────────┐
       ▼           ▼           ▼
   [App AZ-a]  [App AZ-b]  [App AZ-c]
       │           │           │
       └───────────┴───────────┘
           [Database Cluster]

Characteristics:

  • All instances handle requests
  • Load distributed across instances
  • Immediate failover (already serving)
  • Higher cost (all resources active)

Active-Passive

Primary handles traffic; standby waits.

   [Active Region - us-east-1]       [Passive Region - us-west-2]
          │                                    │
   [Application Servers]              [Standby Servers]
          │                                    │
   [Primary Database] ──── Replication ───► [Replica]

Characteristics:

  • Standby consumes resources but doesn't serve
  • Failover requires promotion (minutes)
  • Lower cost than active-active
  • RPO depends on replication lag

Interview Question: Active-Active vs Active-Passive

Q: "When would you choose active-passive over active-active?"

A: Consider these factors:

Factor                 Active-Active                           Active-Passive
RTO Required           < 1 minute                              Minutes acceptable
Cost Sensitivity       Can afford 2x capacity                  Need cost optimization
Data Consistency       Can handle replication lag              Need strong consistency
Traffic Distribution   Benefits from geographic distribution   Centralized traffic OK
Complexity Tolerance   Can manage multi-region state           Prefer simplicity

Disaster Recovery Strategies

DR Classification

Strategy           RTO       RPO       Cost    Example
Backup & Restore   Hours     Hours     $       Restore from S3
Pilot Light        Minutes   Minutes   $$      Minimal running infrastructure
Warm Standby       Minutes   Seconds   $$$     Scaled-down duplicate
Hot Standby        Seconds   Zero      $$$$    Full duplicate

Pilot Light Pattern

Keep only the most critical core (typically the data layer) running in the DR region; everything else is provisioned on demand:

Primary Region:                 DR Region:
  ├── Full App Servers            ├── (None - scale from zero)
  ├── Full Database               ├── Database Replica (running)
  └── Full Cache                  └── AMIs ready to launch

Recovery Steps:

  1. Database replica is already synced
  2. Launch app servers from AMIs
  3. Scale to required capacity
  4. Update DNS
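
Steps 2 and 4 can be scripted ahead of time. A hedged boto3 sketch, assuming a pre-baked AMI and a Route 53 hosted zone (every ID and name below is a hypothetical placeholder):

import boto3

# Step 2: launch app servers in the DR region from the pre-baked AMI.
ec2 = boto3.client('ec2', region_name='us-west-2')
ec2.run_instances(
    ImageId='ami-0123456789abcdef0',   # hypothetical application AMI
    InstanceType='m5.large',
    MinCount=2,
    MaxCount=2,
)

# Step 4: repoint DNS at the DR load balancer.
route53 = boto3.client('route53')
route53.change_resource_record_sets(
    HostedZoneId='Z0123456789ABC',     # hypothetical hosted zone
    ChangeBatch={'Changes': [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': 'app.example.com',
            'Type': 'CNAME',
            'TTL': 60,
            'ResourceRecords': [{'Value': 'dr-alb.us-west-2.elb.amazonaws.com'}],
        },
    }]},
)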

Warm Standby Pattern

Reduced-capacity duplicate:

Primary (100% capacity):        DR (20% capacity):
  ├── 10 App Servers              ├── 2 App Servers
  ├── db.r6g.4xlarge              ├── db.r6g.large
  └── Full infrastructure         └── Minimal infrastructure

Recovery Steps:

  1. Scale up DR app servers
  2. Promote database replica (or resize)
  3. Redirect traffic (DNS or Route 53 health check)
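
Steps 1 and 2 map to two API calls. A hedged sketch assuming an ECS service and a standard RDS read replica (Aurora global databases use a different promotion mechanism); all names are placeholders:

import boto3

# Step 1: scale the DR application tier from 2 standby tasks to full capacity.
ecs = boto3.client('ecs', region_name='us-west-2')
ecs.update_service(
    cluster='dr-cluster',
    service='web-service',
    desiredCount=10,
)

# Step 2: promote the read replica to a standalone, writable instance.
rds = boto3.client('rds', region_name='us-west-2')
rds.promote_read_replica(DBInstanceIdentifier='app-db-replica')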

Interview Question: DR Strategy Selection

Q: "A healthcare company needs DR for their patient records system. RTO: 15 minutes, RPO: 5 minutes. Which strategy?"

A: Warm Standby is appropriate:

Why:

  • 15-minute RTO allows time for scaling
  • 5-minute RPO achievable with async replication (< 1 min lag typical)
  • Patient record access doesn't require instant failover (unlike real-time trading)
  • Cost-effective for the requirements

Implementation:

Primary (us-east-1):
  - Aurora Primary (Multi-AZ)
  - ECS Fargate (10 tasks)

DR (us-west-2):
  - Aurora Read Replica (promoted on failover)
  - ECS Fargate (2 tasks, scaled on failover)

Monitoring:
  - Route 53 health checks
  - Automated runbook in Systems Manager
  - Regular DR testing (quarterly)

Resilience Patterns

Circuit Breaker

Prevent cascading failures:

import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected while the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = timeout        # seconds before an OPEN circuit allows a probe
        self.state = 'CLOSED'         # CLOSED, OPEN, HALF_OPEN
        self.opened_at = None         # time the circuit last opened

    def call(self, func):
        if self.state == 'OPEN':
            if time.time() - self.opened_at > self.timeout:
                self.state = 'HALF_OPEN'   # let one probe call through
            else:
                raise CircuitOpenError()

        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        # A successful call closes the circuit and resets the counter.
        self.failure_count = 0
        self.state = 'CLOSED'

    def on_failure(self):
        # Too many consecutive failures open the circuit.
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.state = 'OPEN'
            self.opened_at = time.time()
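
A brief usage sketch (the endpoint and fetch_profile function are hypothetical):

import requests

breaker = CircuitBreaker(failure_threshold=5, timeout=30)

def fetch_profile():
    return requests.get('https://api.example.com/profile', timeout=3).json()

try:
    profile = breaker.call(fetch_profile)
except (CircuitOpenError, requests.RequestException):
    profile = None   # degrade gracefully, e.g. serve a cached or default profile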

Bulkhead Pattern

Give each dependency its own isolated resource pool so a failure in one cannot exhaust resources needed by the others:

Thread Pool Bulkhead:
  ├── Pool 1: Payment Service (10 threads)
  ├── Pool 2: Inventory Service (10 threads)
  └── Pool 3: Shipping Service (5 threads)

If Payment Service exhausts threads,
Inventory and Shipping continue functioning.
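
In Python, one way to approximate thread-pool bulkheads is a dedicated executor per dependency (a sketch; call_payment is a hypothetical client function):

from concurrent.futures import ThreadPoolExecutor

# One executor per downstream dependency caps how much concurrency it can consume.
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=10)
shipping_pool = ThreadPoolExecutor(max_workers=5)

def call_payment(order_id):
    # Placeholder for an HTTP call to the payment service.
    return f"charged {order_id}"

# A slow payment service can tie up at most payment_pool's 10 threads;
# inventory and shipping submissions still have workers available.
result = payment_pool.submit(call_payment, "order-42").result()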

Retry with Exponential Backoff

Handle transient failures:

import random
import time

class TransientError(Exception):
    """Placeholder for retryable errors (e.g. throttling or timeouts)."""

def retry_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise                                   # out of retries
            # Exponential backoff plus jitter to avoid synchronized retry storms
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Interview Question: Resilience Strategy

Q: "Your service calls three downstream APIs. How do you make it resilient?"

A: Implement layered resilience:

1. Timeouts: Fail fast (1-5 seconds)
2. Circuit Breaker: Stop calling failing services
3. Retry: With exponential backoff + jitter
4. Bulkhead: Isolated thread/connection pools
5. Fallback: Graceful degradation

Configuration Example:
  Payment API:
    - Timeout: 3s
    - Circuit: Open after 5 failures, 30s recovery
    - Retry: 3 attempts, exponential backoff
    - Fallback: Queue for later processing

  Inventory API:
    - Timeout: 2s
    - Circuit: Open after 3 failures, 15s recovery
    - Retry: 2 attempts
    - Fallback: Return cached inventory
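
Putting the layers together for the Payment API, a sketch that reuses the CircuitBreaker and retry_with_backoff helpers defined above (the endpoint and queue_for_later are hypothetical):

import requests

payment_breaker = CircuitBreaker(failure_threshold=5, timeout=30)

def queue_for_later(order_id):
    # Placeholder fallback: persist the charge request for asynchronous processing.
    print(f"queued {order_id} for later processing")

def charge(order_id):
    try:
        # Timeout layer: fail fast after 3 seconds.
        resp = requests.post('https://payments.example.com/charge',
                             json={'order_id': order_id}, timeout=3)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException as exc:
        raise TransientError(str(exc))    # mark as retryable

def charge_with_resilience(order_id):
    try:
        # Retry layer wraps the circuit breaker, which wraps the timed call.
        return retry_with_backoff(
            lambda: payment_breaker.call(lambda: charge(order_id)),
            max_retries=3)
    except (TransientError, CircuitOpenError):
        queue_for_later(order_id)         # fallback: degrade instead of failing the order
        return {'status': 'pending'}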

Architecture Principle: Design for failure. Every external dependency will eventually fail. Your system's resilience determines user experience during failures.

This concludes the Architecture Patterns module. Test your knowledge with the module quiz.
