Architecture Patterns & System Design

High Availability & Disaster Recovery Patterns

Designing for failure is a core competency for cloud architects. This lesson covers patterns for building resilient systems that meet availability and recovery objectives.

Availability Concepts

Measuring Availability

  Nines      Availability   Downtime/Year   Downtime/Month
  Two 9s     99%            3.65 days       7.3 hours
  Three 9s   99.9%          8.77 hours      43.8 minutes
  Four 9s    99.99%         52.6 minutes    4.4 minutes
  Five 9s    99.999%        5.26 minutes    26 seconds
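The table values follow directly from the availability percentage. A minimal sketch (the function name is illustrative):

```python
def downtime_per_year_minutes(availability_pct):
    """Allowed downtime per year, in minutes, for a given availability %."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - availability_pct / 100)

# downtime_per_year_minutes(99.99) is roughly 52.6 minutes -- "four 9s"
```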

SLA Composition

For services in series:

Total Availability = Service1 × Service2 × ... × ServiceN

Example:
  Load Balancer (99.99%) × App Server (99.9%) × Database (99.95%)
  = 0.9999 × 0.999 × 0.9995
  = 99.84%

For services in parallel:

Total Availability = 1 - (1 - Service1) × (1 - Service2)

Example: Two databases (99.95% each):
  = 1 - (1 - 0.9995)²
  = 1 - 0.00000025
  = 99.999975%
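Both composition rules are easy to sketch (the helper names are my own):

```python
from functools import reduce

def series_availability(*availabilities):
    """All components must be up: multiply the availabilities."""
    return reduce(lambda a, b: a * b, availabilities)

def parallel_availability(*availabilities):
    """Only one redundant component must be up:
    multiply the failure probabilities, then invert."""
    combined_failure = reduce(lambda a, b: a * b, (1 - x for x in availabilities))
    return 1 - combined_failure
```

`series_availability(0.9999, 0.999, 0.9995)` reproduces the 99.84% figure, and `parallel_availability(0.9995, 0.9995)` the 99.999975% figure above.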

Interview Question: Achieving 99.99%

Q: "How would you design a web application to achieve 99.99% availability?"

A: Multi-AZ architecture with redundancy at every layer:

Architecture:
  Route 53 (100% SLA)
  CloudFront (99.9%)
  ALB - Multi-AZ (99.99%)
  EC2 ASG - Multi-AZ (99.99% with redundancy)
  Aurora Multi-AZ (99.99%)

Key Practices:
1. No single points of failure
2. Auto-scaling for capacity
3. Health checks and auto-replacement
4. Stateless application tier
5. Database failover < 30 seconds

High Availability Patterns

Active-Active

Multiple instances serve traffic simultaneously.

          ┌─────────────────┐
          │  Load Balancer  │
          └────────┬────────┘
       ┌───────────┼───────────┐
       ▼           ▼           ▼
   [App AZ-a]  [App AZ-b]  [App AZ-c]
       │           │           │
       └───────────┴───────────┘
           [Database Cluster]

Characteristics:

  • All instances handle requests
  • Load distributed across instances
  • Immediate failover (already serving)
  • Higher cost (all resources active)

Active-Passive

Primary handles traffic; standby waits.

   [Active Region - us-east-1]       [Passive Region - us-west-2]
          │                                    │
   [Application Servers]              [Standby Servers]
          │                                    │
   [Primary Database] ──── Replication ───► [Replica]

Characteristics:

  • Standby consumes resources but doesn't serve
  • Failover requires promotion (minutes)
  • Lower cost than active-active
  • RPO depends on replication lag

Interview Question: Active-Active vs Active-Passive

Q: "When would you choose active-passive over active-active?"

A: Consider these factors:

  Factor                  Active-Active                           Active-Passive
  RTO Required            < 1 minute                              Minutes acceptable
  Cost Sensitivity        Can afford 2x capacity                  Need cost optimization
  Data Consistency        Can handle replication lag              Need strong consistency
  Traffic Distribution    Benefits from geographic distribution   Centralized traffic OK
  Complexity Tolerance    Can manage multi-region state           Prefer simplicity

Disaster Recovery Strategies

DR Classification

  Strategy           RTO       RPO       Cost    Example
  Backup & Restore   Hours     Hours     $       Restore from S3
  Pilot Light        Minutes   Minutes   $$      Minimal running infrastructure
  Warm Standby       Minutes   Seconds   $$$     Scaled-down duplicate
  Hot Standby        Seconds   Zero      $$$$    Full duplicate
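One way to internalize the classification is as a decision function. The thresholds below are illustrative choices for this lesson, not official guidance:

```python
def choose_dr_strategy(rto_minutes, rpo_minutes):
    """Pick the cheapest DR strategy that can meet the stated objectives.
    Threshold values are illustrative, not prescriptive."""
    if rto_minutes < 1:
        return 'Hot Standby'        # full duplicate, near-instant failover
    if rto_minutes <= 15 and rpo_minutes <= 5:
        return 'Warm Standby'       # scaled-down duplicate, minutes to scale up
    if rto_minutes <= 60 and rpo_minutes <= 60:
        return 'Pilot Light'        # replica running, app launched on demand
    return 'Backup & Restore'       # hours to restore from backups
```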

Pilot Light Pattern

Keep critical systems running minimally:

Primary Region:                 DR Region:
  ├── Full App Servers            ├── (None - scale from zero)
  ├── Full Database               ├── Database Replica (running)
  └── Full Cache                  └── AMIs ready to launch

Recovery Steps:

  1. Database replica is already synced
  2. Launch app servers from AMIs
  3. Scale to required capacity
  4. Update DNS

Warm Standby Pattern

Reduced-capacity duplicate:

Primary (100% capacity):        DR (20% capacity):
  ├── 10 App Servers              ├── 2 App Servers
  ├── db.r6g.4xlarge              ├── db.r6g.large
  └── Full infrastructure         └── Minimal infrastructure

Recovery Steps:

  1. Scale up DR app servers
  2. Promote database replica (or resize)
  3. Redirect traffic (DNS or Route 53 health check)

Interview Question: DR Strategy Selection

Q: "A healthcare company needs DR for their patient records system. RTO: 15 minutes, RPO: 5 minutes. Which strategy?"

A: Warm Standby is appropriate:

Why:

  • 15-minute RTO allows time for scaling
  • 5-minute RPO achievable with async replication (< 1 min lag typical)
  • Healthcare doesn't need instant failover (not real-time trading)
  • Cost-effective for the requirements

Implementation:

Primary (us-east-1):
  - Aurora Primary (Multi-AZ)
  - ECS Fargate (10 tasks)

DR (us-west-2):
  - Aurora Read Replica (promoted on failover)
  - ECS Fargate (2 tasks, scaled on failover)

Monitoring:
  - Route 53 health checks
  - Automated runbook in Systems Manager
  - Regular DR testing (quarterly)

Resilience Patterns

Circuit Breaker

Prevent cascading failures:

import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected fast."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = timeout       # seconds to wait before probing again
        self.opened_at = None
        self.state = 'CLOSED'        # CLOSED, OPEN, HALF_OPEN

    def call(self, func):
        if self.state == 'OPEN':
            if time.time() - self.opened_at > self.timeout:
                self.state = 'HALF_OPEN'  # allow one probe request through
            else:
                raise CircuitOpenError()

        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def on_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.state = 'OPEN'
            self.opened_at = time.time()

Bulkhead Pattern

Isolate failures to prevent total system failure:

Thread Pool Bulkhead:
  ├── Pool 1: Payment Service (10 threads)
  ├── Pool 2: Inventory Service (10 threads)
  └── Pool 3: Shipping Service (5 threads)

If Payment Service exhausts threads,
Inventory and Shipping continue functioning.
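In Python, a thread-pool bulkhead along these lines might look like the following sketch (the service names and pool sizes mirror the diagram; the `call_service` API is my own):

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency: if the payment pool is
# saturated, inventory and shipping work is unaffected.
POOLS = {
    'payment':   ThreadPoolExecutor(max_workers=10),
    'inventory': ThreadPoolExecutor(max_workers=10),
    'shipping':  ThreadPoolExecutor(max_workers=5),
}

def call_service(service, func, *args):
    """Run `func` on the pool dedicated to `service` and return a Future."""
    return POOLS[service].submit(func, *args)
```

Each pool caps the concurrency a single dependency can consume, which is the isolation the diagram describes.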

Retry with Exponential Backoff

Handle transient failures:

import random
import time

class TransientError(Exception):
    """Stands in for whatever retryable exception your client raises."""

def retry_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error
            # Exponential backoff plus jitter to avoid synchronized retry storms
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Interview Question: Resilience Strategy

Q: "Your service calls three downstream APIs. How do you make it resilient?"

A: Implement layered resilience:

1. Timeouts: Fail fast (1-5 seconds)
2. Circuit Breaker: Stop calling failing services
3. Retry: With exponential backoff + jitter
4. Bulkhead: Isolated thread/connection pools
5. Fallback: Graceful degradation

Configuration Example:
  Payment API:
    - Timeout: 3s
    - Circuit: Open after 5 failures, 30s recovery
    - Retry: 3 attempts, exponential backoff
    - Fallback: Queue for later processing

  Inventory API:
    - Timeout: 2s
    - Circuit: Open after 3 failures, 15s recovery
    - Retry: 2 attempts
    - Fallback: Return cached inventory
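The layering above can be sketched in a few lines. `call_with_fallback` and its parameters are illustrative, and a real timeout needs a thread, signal, or async deadline rather than a plain function call:

```python
import random
import time

def call_with_fallback(func, fallback, max_attempts=3, base_delay=0.1):
    """Retry with exponential backoff + jitter; degrade gracefully on exhaustion."""
    for attempt in range(max_attempts):
        try:
            return func()          # timeout + circuit breaker would wrap this call
        except Exception:
            if attempt == max_attempts - 1:
                return fallback()  # e.g. cached inventory, or queue for later
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))
```

The fallback turns a hard failure into degraded-but-working behavior, which is usually the difference users notice.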

Architecture Principle: Design for failure. Every external dependency will eventually fail. Your system's resilience determines user experience during failures.

This concludes the Architecture Patterns module. Test your knowledge with the module quiz.

No spam. Unsubscribe anytime.