Service Mesh & Networking for ML
High Availability & Disaster Recovery
3 min read
Production ML services require high availability to meet SLAs and disaster recovery capabilities for business continuity. This lesson covers multi-zone deployments, failover strategies, and backup procedures for ML platforms.
HA Architecture for ML
┌─────────────────────────────────────────────────────────────────────┐
│                    Multi-Region ML Architecture                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                    Global Load Balancer                     │   │
│   │               (DNS-based or Anycast routing)                │   │
│   └─────────────────────────────────────────────────────────────┘   │
│               │                                 │                   │
│   ┌───────────────────────┐         ┌───────────────────────┐       │
│   │    Region: US-East    │         │    Region: EU-West    │       │
│   │  ┌─────────────────┐  │         │  ┌─────────────────┐  │       │
│   │  │ Zone A │ Zone B │  │         │  │ Zone A │ Zone B │  │       │
│   │  │ [GPU]  │  [GPU] │  │         │  │ [GPU]  │  [GPU] │  │       │
│   │  └─────────────────┘  │         │  └─────────────────┘  │       │
│   │  ┌─────────────────┐  │         │  ┌─────────────────┐  │       │
│   │  │   Model Cache   │  │ ←─────→ │  │   Model Cache   │  │       │
│   │  │   (Regional)    │  │         │  │   (Regional)    │  │       │
│   │  └─────────────────┘  │         │  └─────────────────┘  │       │
│   └───────────────────────┘         └───────────────────────┘       │
│               │                                 │                   │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │              Replicated Model Storage (S3/GCS)              │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
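The global layer in this diagram is usually implemented with DNS failover records backed by health checks. A minimal sketch using the Route53 CLI; the domain and endpoint path are illustrative:

```bash
# Create a health check against the public inference endpoint; Route53
# failover records can then reference its ID to steer traffic away from
# an unhealthy region
aws route53 create-health-check \
  --caller-reference inference-$(date +%s) \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "inference.example.com",
    "ResourcePath": "/v2/health/ready",
    "Port": 443,
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'
```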
Multi-Zone Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
  namespace: ml-serving
spec:
  replicas: 6
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      # Spread replicas evenly across availability zones
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: inference
      # Prefer one replica per node, so losing a single GPU host
      # costs at most one replica
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: inference
              topologyKey: kubernetes.io/hostname
      containers:
      - name: inference
        image: inference:v1
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 15
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
  namespace: ml-serving
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: inference
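Once the deployment is rolled out, confirm that the spread constraint actually placed replicas across zones. A quick check, assuming nodes carry the standard topology.kubernetes.io/zone label:

```bash
# Show which node (and therefore zone) each replica landed on
kubectl get pods -n ml-serving -l app=inference -o wide

# Map nodes to zones; with maxSkew: 1 each zone should hold
# roughly replicas / zones pods
kubectl get nodes -L topology.kubernetes.io/zone
```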
Failover Configuration
# Istio locality-aware load balancing
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inference-failover
  namespace: ml-serving
spec:
  host: inference-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        http2MaxRequests: 1000
    # Outlier detection is required for locality failover to trigger
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    loadBalancer:
      simple: ROUND_ROBIN
      localityLbSetting:
        enabled: true
        failover:
        - from: us-east1
          to: europe-west1
        - from: europe-west1
          to: us-east1
---
# Service with external traffic policy
apiVersion: v1
kind: Service
metadata:
  name: inference-service
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: inference
  ports:
  - port: 8080
    targetPort: 8080
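To confirm outlier detection and locality routing are taking effect, inspect Envoy's endpoint table from any sidecar-injected pod. A sketch, assuming istioctl is installed and the Envoy cluster name follows the usual outbound naming for this service on port 8080:

```bash
# Pick one inference pod to query its Envoy sidecar
POD=$(kubectl get pod -n ml-serving -l app=inference \
  -o jsonpath='{.items[0].metadata.name}')

# List endpoints with health status; ejected endpoints show a
# non-HEALTHY status in the output
istioctl proxy-config endpoints "$POD" -n ml-serving \
  --cluster "outbound|8080||inference-service.ml-serving.svc.cluster.local"
```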
Model Backup and Recovery
# Velero backup for ML artifacts
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: ml-backup-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
    - ml-serving
    - ml-training
    - mlflow
    includedResources:
    - persistentvolumeclaims
    - configmaps
    - secrets
    labelSelector:
      matchLabels:
        backup: enabled
    snapshotVolumes: true
    storageLocation: default
    volumeSnapshotLocations:
    - default
    ttl: 720h  # 30 days
---
# Backup for model registry
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-registry-backup
  namespace: ml-serving
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: amazon/aws-cli:2.15
            command:
            - /bin/sh
            - -c
            - |
              aws s3 sync s3://model-registry s3://model-registry-backup \
                --storage-class GLACIER_IR
              aws s3 cp s3://model-registry-backup/manifest.json \
                s3://model-registry-backup/manifest-$(date +%Y%m%d).json
            env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-backup-creds
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-backup-creds
                  key: secret-key
          restartPolicy: OnFailure
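Backups that are never exercised tend to fail exactly when you need them, so run the CronJob once on demand after creating it. A quick smoke test using kubectl's --from flag:

```bash
# Kick off a one-off run of the backup job and follow its output
kubectl create job --from=cronjob/model-registry-backup \
  model-registry-backup-manual -n ml-serving
kubectl logs -n ml-serving job/model-registry-backup-manual -f

# Confirm objects arrived in the backup bucket
aws s3 ls s3://model-registry-backup/ --summarize | tail -2
```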
Disaster Recovery Runbook
# DR ConfigMap with recovery procedures
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-runbook
  namespace: ml-serving
data:
  recovery-steps.md: |
    # ML Platform Disaster Recovery

    ## Targets: RTO 30 minutes (time to restore service), RPO 1 hour (maximum data loss)

    ### Step 1: Assess Impact
    ```bash
    kubectl get nodes -o wide
    kubectl get pods -n ml-serving -o wide
    kubectl top nodes
    ```

    ### Step 2: Failover to DR Region
    ```bash
    # Update DNS to point to the DR region
    aws route53 change-resource-record-sets \
      --hosted-zone-id Z123 \
      --change-batch file://failover-dns.json

    # Scale up the DR region
    kubectl --context=dr-cluster scale deployment inference-service \
      --replicas=10 -n ml-serving
    ```

    ### Step 3: Restore from Backup (if needed)
    ```bash
    # List available backups
    velero backup get

    # Restore the ML namespaces
    velero restore create --from-backup ml-backup-daily-20240115

    # Verify models are accessible
    kubectl exec -it deploy/inference-service -n ml-serving -- \
      ls -la /models/
    ```

    ### Step 4: Validate Services
    ```bash
    # Health check
    curl https://inference-dr.example.com/v2/health/ready

    # Test inference
    curl -X POST https://inference-dr.example.com/v1/models/default:predict \
      -H 'Content-Type: application/json' \
      -d '{"instances": [[1,2,3,4]]}'
    ```
---
# Automated failover with Argo Events
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: failover-trigger
spec:
  dependencies:
  - name: health-check-failure
    eventSourceName: prometheus-alerts
    eventName: inference-down
  triggers:
  - template:
      name: trigger-failover
      k8s:
        operation: patch
        source:
          resource:
            apiVersion: v1
            kind: Service
            metadata:
              name: inference-service
        parameters:
        - src:
            dependencyName: health-check-failure
            dataKey: body.alerts[0].labels.region
          dest: metadata.annotations.failover-region
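After applying the sensor, verify it is actually receiving alert events before an incident forces the question. A hedged check, assuming the Argo Events convention of labeling sensor pods with sensor-name (adjust the namespace to wherever Argo Events runs):

```bash
# Confirm the sensor resource was admitted
kubectl get sensor failover-trigger -o yaml

# Tail the sensor pod; a fired Prometheus alert should show the
# health-check-failure dependency being resolved
kubectl logs -l sensor-name=failover-trigger -f
```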
Health Monitoring
# Comprehensive health checks
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-ha-alerts
spec:
  groups:
  - name: high-availability
    rules:
    - alert: InferenceServiceDown
      expr: |
        sum(up{job="inference-service"}) < 2
      for: 2m
      labels:
        severity: critical
        runbook: dr-runbook
      annotations:
        summary: "Inference service replicas below minimum"
    # Sums the 0/1 Ready conditions per zone, so 0 means no Ready nodes
    # remain; assumes the zone label is attached to this metric (it is
    # not by default, so a join with kube_node_labels may be needed)
    - alert: ZoneFailure
      expr: |
        sum(kube_node_status_condition{condition="Ready",status="true"}) by (topology_kubernetes_io_zone) < 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Zone {{ $labels.topology_kubernetes_io_zone }} has no ready nodes"
    - alert: ModelStorageUnavailable
      expr: |
        probe_success{job="model-storage-probe"} == 0
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "Model storage is unreachable"
Next module: Production operations and GitOps for ML platforms.