CI/CD & Infrastructure as Code
Deployment Strategies: Blue-Green, Canary, and Rolling
Deployment strategy questions test your understanding of production reliability. Let's master each approach.
Deployment Strategy Comparison
| Strategy | Risk | Rollback Speed | Resource Cost | Complexity |
|---|---|---|---|---|
| Rolling | Medium | Slow | Low | Low |
| Blue-Green | Low | Instant | High (2x) | Medium |
| Canary | Lowest | Fast | Medium | High |
| Shadow | Lowest | N/A | High | Highest |
Rolling Deployment
Updates instances incrementally:
```
Time 0: [v1] [v1] [v1] [v1]
Time 1: [v2] [v1] [v1] [v1]
Time 2: [v2] [v2] [v1] [v1]
Time 3: [v2] [v2] [v2] [v1]
Time 4: [v2] [v2] [v2] [v2]
```
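The timeline above can be sketched as a tiny simulation. This is a hypothetical helper for illustration only (not a Kubernetes API); it replaces one instance per step, as in the diagram:

```python
# Sketch: simulate a rolling update that replaces one instance per step,
# yielding the fleet state after each step.
def rolling_update(replicas: int):
    pods = ["v1"] * replicas
    yield pods.copy()                 # Time 0: all old version
    for i in range(replicas):
        pods[i] = "v2"                # replace exactly one instance
        yield pods.copy()

for step, state in enumerate(rolling_update(4)):
    print(f"Time {step}: {state}")
```

Note that at every intermediate step both v1 and v2 are serving traffic at once, which is why rolling deployments require backward-compatible changes.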
Kubernetes Rolling Update
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # max extra pods during the update
      maxUnavailable: 1    # max pods unavailable during the update
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web           # must match spec.selector.matchLabels
    spec:
      containers:
        - name: web
          image: app:v2
          readinessProbe:  # gates traffic until the new pod is healthy
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```
Pros: simple, low resource overhead.
Cons: slow rollback; old and new versions serve traffic simultaneously during the rollout.
Blue-Green Deployment
Maintain two identical environments:
```
        ┌─────────────┐
        │    Load     │
        │  Balancer   │
        └──────┬──────┘
               │
   ┌───────────┴───────────┐
   │                       │
┌────▼────┐           ┌────▼────┐
│  Blue   │           │  Green  │
│  (v1)   │           │  (v2)   │
│ ACTIVE  │           │ STANDBY │
└─────────┘           └─────────┘
```
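The key property of blue-green is that cutover (and rollback) is a single pointer flip at the load balancer, not a redeploy. A minimal sketch with hypothetical names, illustrating the idea only:

```python
# Sketch: blue-green routing as an atomic pointer flip.
# "envs" stands in for two fully provisioned environments.
class BlueGreenRouter:
    def __init__(self):
        self.envs = {"blue": "v1", "green": "v2"}
        self.active = "blue"                      # blue serves production

    def route(self) -> str:
        """All traffic goes to whichever environment is active."""
        return self.envs[self.active]

    def switch(self) -> None:
        """Cutover or rollback: one flip, no redeploy needed."""
        self.active = "green" if self.active == "blue" else "blue"

router = BlueGreenRouter()
print(router.route())   # v1 (blue active)
router.switch()         # cut over to green
print(router.route())   # v2
router.switch()         # instant rollback
print(router.route())   # v1 again
```

Because the standby environment is fully provisioned at all times, rollback is instant, which is exactly what the 2x resource cost buys you.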
AWS Blue-Green with ALB
```hcl
# Target groups for blue and green
resource "aws_lb_target_group" "blue" {
  name     = "blue-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id
}

resource "aws_lb_target_group" "green" {
  name     = "green-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id
}

# Listener rule: switch between blue and green by changing var.active_color
resource "aws_lb_listener_rule" "main" {
  listener_arn = aws_lb_listener.main.arn

  action {
    type             = "forward"
    target_group_arn = var.active_color == "blue" ? aws_lb_target_group.blue.arn : aws_lb_target_group.green.arn
  }

  condition {
    path_pattern {
      values = ["/*"]
    }
  }
}
```
Pros: instant rollback, zero-downtime deployment.
Cons: double the infrastructure cost; database migrations become complex (both environments must work against the same schema).
Canary Deployment
Gradually shift traffic to new version:
```
Stage 1: [v1: 95%] ────► [v2:  5%]   # test with 5% of traffic
Stage 2: [v1: 80%] ────► [v2: 20%]   # increase if healthy
Stage 3: [v1: 50%] ────► [v2: 50%]   # half and half
Stage 4: [v1:  0%] ────► [v2: 100%]  # full rollout
```
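In production the traffic split is enforced by the load balancer or service mesh, but the mechanism is just weighted random routing. A minimal sketch (hypothetical helper, illustration only):

```python
import random

# Sketch: route each request to "canary" with probability canary_weight,
# mirroring what a weighted load-balancer rule does per request.
def pick_version(canary_weight: float) -> str:
    return "canary" if random.random() < canary_weight else "stable"

random.seed(42)  # deterministic for the demo
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_version(0.05)] += 1   # Stage 1: 5% canary

print(counts)   # canary count lands near 5% of 10,000
```

Raising the rollout stage is just raising `canary_weight`; the mesh config below expresses the same 90/10 split declaratively.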
Kubernetes Canary with Istio
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
    - web-app
  http:
    - route:
        - destination:
            host: web-app
            subset: stable
          weight: 90
        - destination:
            host: web-app
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-app
spec:
  host: web-app
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
```
Canary Analysis
```yaml
# Argo Rollouts canary with automated analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"2.."}[5m])) /
            sum(rate(http_requests_total[5m]))
```
Pros: lowest risk, data-driven promotion decisions.
Cons: complex setup; requires good observability to be meaningful.
Interview Questions
Q: "Your canary shows 2% error rate vs 0.5% for stable. What do you do?"
Answer:
- Don't panic: collect more data before acting
- Check statistical significance: at 5% of traffic, the canary sample may be too small to trust
- Examine the error types: are they new errors introduced by v2, or pre-existing ones?
- Compare resource metrics: latency, CPU, and memory of the canary pods versus stable
- If the regression is confirmed, roll back (automatically via analysis gates, or manually)
- Root-cause the failure before attempting the rollout again
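The significance check in the steps above can be made concrete with a two-proportion z-test. A sketch, using hypothetical request counts (a canary at 5% of traffic sees far fewer requests than stable, so the raw rates alone can mislead):

```python
from math import sqrt

# Sketch: two-proportion z-test comparing canary vs stable error rates.
# Sample sizes below are hypothetical, chosen to match a 5% canary split.
def two_proportion_z(errors_a: int, total_a: int,
                     errors_b: int, total_b: int) -> float:
    p_a, p_b = errors_a / total_a, errors_b / total_b
    p = (errors_a + errors_b) / (total_a + total_b)    # pooled error rate
    se = sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# Canary: 1,000 requests, 20 errors (2%).
# Stable: 19,000 requests, 95 errors (0.5%).
z = two_proportion_z(20, 1_000, 95, 19_000)
print(f"z = {z:.2f}")   # |z| > 1.96 means significant at the 95% level
```

With these numbers the difference is well past the 1.96 threshold, so the 2% rate is a real regression, not sampling noise; with much smaller samples the same rates could fail the test.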
Q: "How do you handle database migrations in blue-green?"
Answer:
| Approach | Description |
|---|---|
| Expand-Contract | Add new schema alongside old, migrate, then remove old |
| Feature flags | Deploy code that handles both schemas |
| Read replicas | Blue reads from primary, green from replica during migration |
| Backward compatible | Ensure v2 schema works with v1 code |
```sql
-- Expand phase (schema works with both versions)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT false;

-- Contract phase (only after v1 is fully retired)
ALTER TABLE users DROP COLUMN old_column;
```
Q: "A rolling deployment is stuck at 50%. How do you troubleshoot?"
```bash
# Check deployment status
kubectl rollout status deployment/web-app

# Check pod status
kubectl get pods -l app=web-app
kubectl describe pod <stuck-pod>

# Check events
kubectl get events --sort-by='.lastTimestamp'

# Common issues:
# - Readiness probe failing
# - Image pull errors
# - Resource limits (pods stuck Pending)
# - PodDisruptionBudget blocking eviction
```
You've mastered CI/CD and IaC. Next module: Kubernetes and container orchestration, the heart of modern infrastructure.