Architecture Patterns & System Design

High Availability & Disaster Recovery Patterns

Designing for failure is a core competency for cloud architects. This lesson covers patterns for building resilient systems that meet availability and recovery objectives.

Availability Concepts

Measuring Availability

Nines      Availability   Downtime/Year   Downtime/Month
Two 9s     99%            3.65 days       7.3 hours
Three 9s   99.9%          8.77 hours      43.8 minutes
Four 9s    99.99%         52.6 minutes    4.4 minutes
Five 9s    99.999%        5.26 minutes    26 seconds
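
These figures follow directly from the availability percentage. A quick sketch for reproducing them (the downtime_hours helper below is illustrative, not a library function):

def downtime_hours(availability_pct, period_hours):
    # Fraction of the period the service is allowed to be unavailable.
    return (1 - availability_pct / 100) * period_hours

for pct in (99, 99.9, 99.99, 99.999):
    yearly = downtime_hours(pct, 365 * 24)     # hours per year
    monthly = downtime_hours(pct, 730)         # hours per ~30.4-day month
    print(f"{pct}%: {yearly:.2f} h/year, {monthly * 60:.1f} min/month")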

SLA Composition

For services in series:

Total Availability = Service1 × Service2 × ... × ServiceN

Example:
  Load Balancer (99.99%) × App Server (99.9%) × Database (99.95%)
  = 0.9999 × 0.999 × 0.9995
  = 99.84%

For services in parallel:

Total Availability = 1 - (1 - Service1) × (1 - Service2)

Example: Two databases (99.95% each):
  = 1 - (1 - 0.9995)²
  = 1 - 0.00000025
  = 99.999975%
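
Both formulas are easy to sanity-check in a few lines of Python (a minimal sketch; the function names are illustrative):

def series(*availabilities):
    # Dependencies in series: every component must be up.
    total = 1.0
    for a in availabilities:
        total *= a
    return total

def parallel(*availabilities):
    # Redundant components in parallel: down only if every copy is down.
    down = 1.0
    for a in availabilities:
        down *= (1 - a)
    return 1 - down

print(series(0.9999, 0.999, 0.9995))   # ~0.9984     -> 99.84%
print(parallel(0.9995, 0.9995))        # 0.99999975  -> 99.999975%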

Interview Question: Achieving 99.99%

Q: "How would you design a web application to achieve 99.99% availability?"

A: Multi-AZ architecture with redundancy at every layer:

Architecture:
  Route 53 (100% SLA)
  CloudFront (99.9%)
  ALB - Multi-AZ (99.99%)
  EC2 ASG - Multi-AZ (99.99% with redundancy)
  Aurora Multi-AZ (99.99%)

Key Practices:
1. No single points of failure
2. Auto-scaling for capacity
3. Health checks and auto-replacement
4. Stateless application tier
5. Database failover < 30 seconds

High Availability Patterns

Active-Active

Multiple instances serve traffic simultaneously.

          ┌─────────────────┐
          │  Load Balancer  │
          └────────┬────────┘
       ┌───────────┼───────────┐
       ▼           ▼           ▼
   [App AZ-a]  [App AZ-b]  [App AZ-c]
       │           │           │
       └───────────┴───────────┘
           [Database Cluster]

Characteristics:

  • All instances handle requests
  • Load distributed across instances
  • Immediate failover (already serving)
  • Higher cost (all resources active)

Active-Passive

Primary handles traffic; standby waits.

   [Active Region - us-east-1]       [Passive Region - us-west-2]
          │                                    │
   [Application Servers]              [Standby Servers]
          │                                    │
   [Primary Database] ──── Replication ───► [Replica]

Characteristics:

  • Standby consumes resources but doesn't serve
  • Failover requires promotion (minutes)
  • Lower cost than active-active
  • RPO depends on replication lag

Interview Question: Active-Active vs Active-Passive

Q: "When would you choose active-passive over active-active?"

A: Consider these factors:

Factor                 Active-Active                           Active-Passive
RTO Required           < 1 minute                              Minutes acceptable
Cost Sensitivity       Can afford 2x capacity                  Need cost optimization
Data Consistency       Can handle replication lag              Need strong consistency
Traffic Distribution   Benefits from geographic distribution   Centralized traffic OK
Complexity Tolerance   Can manage multi-region state           Prefer simplicity

Disaster Recovery Strategies

DR Classification

Strategy           RTO       RPO       Cost    Example
Backup & Restore   Hours     Hours     $       Restore from S3
Pilot Light        Minutes   Minutes   $$      Minimal running infrastructure
Warm Standby       Minutes   Seconds   $$$     Scaled-down duplicate
Hot Standby        Seconds   Zero      $$$$    Full duplicate

Pilot Light Pattern

Keep only the most critical core (typically the data layer) running in the DR region; everything else is provisioned on demand:

Primary Region:                 DR Region:
  ├── Full App Servers            ├── (None - scale from zero)
  ├── Full Database               ├── Database Replica (running)
  └── Full Cache                  └── AMIs ready to launch

Recovery Steps:

  1. Database replica is already synced
  2. Launch app servers from AMIs
  3. Scale to required capacity
  4. Update DNS
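
Steps 2 and 4 can be scripted ahead of time. A hedged boto3 sketch, assuming a pre-baked AMI and a Route 53 hosted zone (every ID and name below is a hypothetical placeholder):

import boto3

# Step 2: launch app servers in the DR region from the pre-baked AMI.
ec2 = boto3.client('ec2', region_name='us-west-2')
ec2.run_instances(
    ImageId='ami-0123456789abcdef0',   # hypothetical application AMI
    InstanceType='m5.large',
    MinCount=2,
    MaxCount=2,
)

# Step 4: repoint DNS at the DR load balancer.
route53 = boto3.client('route53')
route53.change_resource_record_sets(
    HostedZoneId='Z0123456789ABC',     # hypothetical hosted zone
    ChangeBatch={'Changes': [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': 'app.example.com',
            'Type': 'CNAME',
            'TTL': 60,
            'ResourceRecords': [{'Value': 'dr-alb.us-west-2.elb.amazonaws.com'}],
        },
    }]},
)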

Warm Standby Pattern

Reduced-capacity duplicate:

Primary (100% capacity):        DR (20% capacity):
  ├── 10 App Servers              ├── 2 App Servers
  ├── db.r6g.4xlarge              ├── db.r6g.large
  └── Full infrastructure         └── Minimal infrastructure

Recovery Steps:

  1. Scale up DR app servers
  2. Promote database replica (or resize)
  3. Redirect traffic (DNS or Route 53 health check)
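
Steps 1 and 2 map to two API calls. A hedged sketch assuming an ECS service and a standard RDS read replica (Aurora global databases use a different promotion mechanism); all names are placeholders:

import boto3

# Step 1: scale the DR application tier from 2 standby tasks to full capacity.
ecs = boto3.client('ecs', region_name='us-west-2')
ecs.update_service(
    cluster='dr-cluster',
    service='web-service',
    desiredCount=10,
)

# Step 2: promote the read replica to a standalone, writable instance.
rds = boto3.client('rds', region_name='us-west-2')
rds.promote_read_replica(DBInstanceIdentifier='app-db-replica')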

Interview Question: DR Strategy Selection

Q: "A healthcare company needs DR for their patient records system. RTO: 15 minutes, RPO: 5 minutes. Which strategy?"

A: Warm Standby is appropriate:

Why:

  • 15-minute RTO allows time for scaling
  • 5-minute RPO achievable with async replication (< 1 min lag typical)
  • Patient record access doesn't require instant failover (unlike real-time trading)
  • Cost-effective for the requirements

Implementation:

Primary (us-east-1):
  - Aurora Primary (Multi-AZ)
  - ECS Fargate (10 tasks)

DR (us-west-2):
  - Aurora Read Replica (promoted on failover)
  - ECS Fargate (2 tasks, scaled on failover)

Monitoring:
  - Route 53 health checks
  - Automated runbook in Systems Manager
  - Regular DR testing (quarterly)

Resilience Patterns

Circuit Breaker

Prevent cascading failures:

import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected while the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = timeout        # seconds before an OPEN circuit allows a probe
        self.state = 'CLOSED'         # CLOSED, OPEN, HALF_OPEN
        self.opened_at = None         # time the circuit last opened

    def call(self, func):
        if self.state == 'OPEN':
            if time.time() - self.opened_at > self.timeout:
                self.state = 'HALF_OPEN'   # let one probe call through
            else:
                raise CircuitOpenError()

        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        # A successful call closes the circuit and resets the counter.
        self.failure_count = 0
        self.state = 'CLOSED'

    def on_failure(self):
        # Too many consecutive failures open the circuit.
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.state = 'OPEN'
            self.opened_at = time.time()
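
A brief usage sketch (the endpoint and fetch_profile function are hypothetical):

import requests

breaker = CircuitBreaker(failure_threshold=5, timeout=30)

def fetch_profile():
    return requests.get('https://api.example.com/profile', timeout=3).json()

try:
    profile = breaker.call(fetch_profile)
except (CircuitOpenError, requests.RequestException):
    profile = None   # degrade gracefully, e.g. serve a cached or default profile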

Bulkhead Pattern

Give each dependency its own isolated resource pool so a failure in one cannot exhaust resources needed by the others:

Thread Pool Bulkhead:
  ├── Pool 1: Payment Service (10 threads)
  ├── Pool 2: Inventory Service (10 threads)
  └── Pool 3: Shipping Service (5 threads)

If Payment Service exhausts threads,
Inventory and Shipping continue functioning.
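
In Python, one way to approximate thread-pool bulkheads is a dedicated executor per dependency (a sketch; call_payment is a hypothetical client function):

from concurrent.futures import ThreadPoolExecutor

# One executor per downstream dependency caps how much concurrency it can consume.
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=10)
shipping_pool = ThreadPoolExecutor(max_workers=5)

def call_payment(order_id):
    # Placeholder for an HTTP call to the payment service.
    return f"charged {order_id}"

# A slow payment service can tie up at most payment_pool's 10 threads;
# inventory and shipping submissions still have workers available.
result = payment_pool.submit(call_payment, "order-42").result()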

Retry with Exponential Backoff

Handle transient failures:

import random
import time

class TransientError(Exception):
    """Placeholder for retryable errors (e.g. throttling or timeouts)."""

def retry_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise                                   # out of retries
            # Exponential backoff plus jitter to avoid synchronized retry storms
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Interview Question: Resilience Strategy

Q: "Your service calls three downstream APIs. How do you make it resilient?"

A: Implement layered resilience:

1. Timeouts: Fail fast (1-5 seconds)
2. Circuit Breaker: Stop calling failing services
3. Retry: With exponential backoff + jitter
4. Bulkhead: Isolated thread/connection pools
5. Fallback: Graceful degradation

Configuration Example:
  Payment API:
    - Timeout: 3s
    - Circuit: Open after 5 failures, 30s recovery
    - Retry: 3 attempts, exponential backoff
    - Fallback: Queue for later processing

  Inventory API:
    - Timeout: 2s
    - Circuit: Open after 3 failures, 15s recovery
    - Retry: 2 attempts
    - Fallback: Return cached inventory
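
Putting the layers together for the Payment API, a sketch that reuses the CircuitBreaker and retry_with_backoff helpers defined above (the endpoint and queue_for_later are hypothetical):

import requests

payment_breaker = CircuitBreaker(failure_threshold=5, timeout=30)

def queue_for_later(order_id):
    # Placeholder fallback: persist the charge request for asynchronous processing.
    print(f"queued {order_id} for later processing")

def charge(order_id):
    try:
        # Timeout layer: fail fast after 3 seconds.
        resp = requests.post('https://payments.example.com/charge',
                             json={'order_id': order_id}, timeout=3)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException as exc:
        raise TransientError(str(exc))    # mark as retryable

def charge_with_resilience(order_id):
    try:
        # Retry layer wraps the circuit breaker, which wraps the timed call.
        return retry_with_backoff(
            lambda: payment_breaker.call(lambda: charge(order_id)),
            max_retries=3)
    except (TransientError, CircuitOpenError):
        queue_for_later(order_id)         # fallback: degrade instead of failing the order
        return {'status': 'pending'}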

Architecture Principle: Design for failure. Every external dependency will eventually fail. Your system's resilience determines user experience during failures.

This concludes the Architecture Patterns module. Test your knowledge with the module quiz.
