Cloud ML Infrastructure
Cloud-specific ML services are core interview topics. Know the trade-offs between managed services and self-managed Kubernetes.
Managed vs Self-Managed Comparison
| Aspect | AWS SageMaker | GCP Vertex AI | Self-Managed K8s |
|---|---|---|---|
| Setup time | Hours | Hours | Days/Weeks |
| Cost at scale | Higher | Higher | Lower |
| Customization | Limited | Limited | Full control |
| GPU availability | On-demand | On-demand | Reserved instances |
| Vendor lock-in | High | High | Low |
| Best for | Quick POCs | Quick POCs | Production at scale |
Interview Question: Build vs Buy
Question: "When would you use SageMaker vs self-managed Kubernetes?"
Framework Answer:
def choose_infrastructure(context):
    # Managed services (SageMaker / Vertex AI) suit small teams that need speed.
    use_managed = (
        context["team_size"] < 5 and
        context["ml_models"] < 10 and
        context["budget"] > context["engineer_cost"] * 2 and
        context["time_to_market"] == "urgent"
    )
    # Self-managed Kubernetes suits larger teams, many models, multi-cloud,
    # or compliance regimes that require custom controls.
    use_kubernetes = (
        context["team_size"] >= 5 or
        context["ml_models"] >= 10 or
        context["multi_cloud"] is True or
        (context["compliance"] in ["HIPAA", "PCI", "SOC2"] and
         context["requires_custom_controls"])
    )
    if use_managed and not use_kubernetes:
        return {"training": "managed", "serving": "managed", "experimentation": "managed"}

    # Hybrid is often the answer
    return {
        "training": "managed" if context["data_stays_in_cloud"] else "k8s",
        "serving": "kubernetes",       # Lower latency, better scaling
        "experimentation": "managed",  # Faster iteration
    }
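A quick usage sketch; the context values below are hypothetical and exist only to exercise the framework above:

example_context = {
    "team_size": 3,
    "ml_models": 4,
    "budget": 500_000,
    "engineer_cost": 200_000,
    "time_to_market": "urgent",
    "multi_cloud": False,
    "compliance": "SOC2",
    "requires_custom_controls": False,
    "data_stays_in_cloud": True,
}

print(choose_infrastructure(example_context))
# Small urgent team, few models -> {'training': 'managed', 'serving': 'managed', 'experimentation': 'managed'}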
AWS SageMaker Deep Dive
Key Components to Know:
# SageMaker interview topics
sagemaker_components:
  training:
    - Managed spot training (up to 90% cost reduction)
    - Distributed training with parameter servers
    - SageMaker Debugger for training insights
  inference:
    - Real-time endpoints (synchronous)
    - Batch Transform (async, large datasets)
    - Multi-model endpoints (cost sharing)
    - Serverless inference (pay per request)
  mlops:
    - SageMaker Pipelines (orchestration)
    - Model Registry (versioning)
    - Model Monitor (drift detection)
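As a concrete example of managed spot training, here is a minimal sketch using the SageMaker Python SDK; the image URI, IAM role, and S3 paths are placeholders you would replace:

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<training-image-uri>",     # placeholder: your training container
    role="<execution-role-arn>",          # placeholder: SageMaker execution role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://<bucket>/output/",  # placeholder bucket
    use_spot_instances=True,              # request managed spot capacity
    max_run=3600,                         # max training time in seconds
    max_wait=7200,                        # max wait for spot capacity, including interruptions
    sagemaker_session=session,
)

estimator.fit({"train": "s3://<bucket>/train/"})  # placeholder training data channel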
Interview Question: "How do you reduce SageMaker inference costs?"
Answer:
- Multi-model endpoints: Load hundreds of models on a single endpoint
- Serverless inference: Pay only per request, but expect cold-start latency
- Autoscaling: Scale in during off-hours; asynchronous and serverless inference can scale to zero
- Spot instances: For training only (not inference), up to 90% savings
- Inference optimization: Compile models with the AWS Neuron SDK for Inferentia chips
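To make the multi-model endpoint option concrete, here is a minimal boto3 invocation sketch; the endpoint and model artifact names are placeholders. Each request names the artifact to serve, so hundreds of models can share one endpoint's instances:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# TargetModel selects which artifact (under the endpoint's S3 prefix) handles this request;
# SageMaker loads it on demand and caches it on the shared instances.
response = runtime.invoke_endpoint(
    EndpointName="multi-model-endpoint",     # placeholder endpoint name
    TargetModel="customer-churn-v3.tar.gz",  # placeholder model artifact
    ContentType="application/json",
    Body=json.dumps({"features": [0.2, 1.7, 3.1]}),
)

print(json.loads(response["Body"].read()))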
GCP Vertex AI Deep Dive
Key Differentiators:
# Vertex AI interview topics
vertex_components:
  unique_features:
    - AutoML for no-code model training
    - Feature Store (native integration)
    - Vizier for hyperparameter tuning
    - Matching Engine (now Vertex AI Vector Search) for vector search
  training:
    - Custom containers on Vertex Training
    - TPU support (v4 pods available)
    - Distributed training with Reduction Server
  serving:
    - Online prediction (real-time)
    - Batch prediction (large scale)
    - Private endpoints (VPC-native)
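Below is a sketch of custom-container training with the google-cloud-aiplatform SDK; the project ID, staging bucket, and container image URI are placeholders:

from google.cloud import aiplatform

aiplatform.init(
    project="my-gcp-project",                 # placeholder project ID
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",  # placeholder bucket
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="custom-training-demo",
    container_uri="us-docker.pkg.dev/my-project/train/image:latest",  # placeholder image
)

# Runs the container on a single GPU worker; swap machine/accelerator types as needed.
job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)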
Interview Question: "When would you use TPUs vs GPUs?"
Answer:
- TPUs: Large-scale transformer training, models Google has optimized for TPUs (BERT, T5), high-throughput batch processing
- GPUs: Inference, PyTorch-heavy workloads, custom architectures, real-time serving
- Cost comparison: TPU v4 pods can be 3x more cost-efficient for training at scale
Multi-Cloud and Hybrid Patterns
# Multi-cloud ML architecture discussion
multi_cloud_reasons = [
    "GPU availability during shortages",
    "Best-of-breed services (Vertex AutoML + AWS Endpoints)",
    "Regulatory requirements (data residency)",
    "Vendor negotiation leverage",
]

# Key technologies for multi-cloud
multi_cloud_stack = {
    "orchestration": "Kubeflow Pipelines (cloud-agnostic)",
    "model_registry": "MLflow (portable)",
    "serving": "Seldon Core or KServe",
    "monitoring": "Prometheus + Grafana",
    "infrastructure": "Terraform with cloud-specific modules",
}
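To ground the "MLflow (portable)" point, here is a minimal sketch that logs and registers a model against a self-hosted tracking server; the tracking URI, experiment, and model names are placeholders:

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Point at a self-hosted tracking server so the registry is not tied to any one cloud.
mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder URI
mlflow.set_experiment("churn-model")                    # placeholder experiment name

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

with mlflow.start_run():
    model = LogisticRegression().fit(X, y)
    # registered_model_name also creates/updates a registry entry, which
    # K8s-based servers (Seldon Core, KServe) can pull from on any cloud.
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-classifier")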
Interview Insight: Companies increasingly ask about multi-cloud due to GPU shortages and cost optimization. Show you understand both managed services AND self-managed Kubernetes.
Next module covers ML Pipelines & Orchestration interview questions.