Introduction to MLOps
Infrastructure Patterns
ML systems have distinct infrastructure needs for training versus serving. Understanding these patterns helps you design efficient, cost-effective systems.
Training vs Serving Infrastructure
| Aspect | Training | Serving |
|---|---|---|
| Goal | Learn from data | Make predictions |
| Compute | GPU clusters, high memory | CPU/GPU, low latency |
| Data access | Batch processing | Real-time lookup |
| Scaling | Scale up (bigger machines) | Scale out (more replicas) |
| Cost model | Burst compute (pay per run) | Always-on (pay per uptime) |
Training Infrastructure Pattern
┌──────────────────────────────────────────────────────┐
│ Training Infrastructure │
├──────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ Data │────▶│ Compute │────▶│ Artifact │ │
│ │ Lake │ │ Cluster │ │ Store │ │
│ └─────────┘ │ (GPU/TPU) │ └──────────┘ │
│ │ └──────────────┘ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Orchestrator (Kubeflow) │ │
│ └──────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────┘
Key components:
- Data lake: S3, GCS, Azure Blob for raw data
- Compute cluster: Kubernetes with GPU nodes
- Artifact store: Model registry, experiment logs
- Orchestrator: Kubeflow, Airflow for pipeline management
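To make the orchestrator's job concrete, here is a minimal sketch of a training pipeline defined with the Kubeflow Pipelines SDK (kfp v2). The component body, pipeline name, and data path are illustrative placeholders, not part of a specific stack.
from kfp import compiler, dsl

@dsl.component
def train_model(data_path: str) -> str:
    # Placeholder training step: a real component would read from the
    # data lake, fit a model on the compute cluster, and log artifacts
    return f"trained on {data_path}"

@dsl.pipeline(name="training-pipeline")
def training_pipeline(data_path: str = "s3://bucket/raw/"):
    train_model(data_path=data_path)

# Compile to a spec the orchestrator can schedule and run
compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")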
Serving Infrastructure Pattern
┌──────────────────────────────────────────────────────┐
│ Serving Infrastructure │
├──────────────────────────────────────────────────────┤
│ │
│ Client Request │
│ │ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Load │────▶│ Model │────▶│ Feature │ │
│ │ Balancer │ │ Server │ │ Store │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Monitoring │ │
│ └──────────────────┘ │
│ │
└──────────────────────────────────────────────────────┘
Key components:
- Load balancer: Distribute requests across replicas
- Model server: BentoML, TF Serving, TorchServe
- Feature store: Feast for real-time feature lookup
- Monitoring: Latency, throughput, drift detection
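For the feature-store piece, here is a minimal sketch of an online lookup with Feast; the feature names and entity key are assumptions for illustration.
from feast import FeatureStore

# Points at a Feast repo config (feature_store.yaml) in the current directory
store = FeatureStore(repo_path=".")

# Low-latency lookup of precomputed features for a single user
features = store.get_online_features(
    features=["user_stats:txn_count_7d", "user_stats:avg_order_value"],
    entity_rows=[{"user_id": 1001}],
).to_dict()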
Batch vs Real-Time Inference
| Pattern | Use Case | Latency | Tools |
|---|---|---|---|
| Batch | Nightly recommendations | Minutes-hours | Spark, Kubeflow |
| Real-time | Search ranking | Milliseconds | BentoML, KServe |
| Near real-time | Fraud detection | Seconds | Kafka + model consumer |
Batch Inference
# Example: Batch prediction with Spark
from pyspark.sql import SparkSession
import mlflow

spark = SparkSession.builder.appName("batch_predict").getOrCreate()

# Input dataset to score (path is illustrative)
df = spark.read.parquet("s3://bucket/features/")

# Load the model from the MLflow registry as a Spark UDF
model = mlflow.pyfunc.spark_udf(spark, "models:/churn_model/production")

# Apply the model to the whole dataset in parallel
predictions = df.withColumn("prediction", model("features"))
predictions.write.parquet("s3://bucket/predictions/")
Real-Time Inference
# Example: Real-time serving with BentoML (1.2+ service API)
import bentoml

@bentoml.service
class FraudDetector:
    # Reference to a model in the local BentoML model store
    model_ref = bentoml.models.get("fraud_model:latest")

    def __init__(self):
        # Load the model with the loader matching how it was saved
        # (sklearn shown as an example)
        self.model = bentoml.sklearn.load_model(self.model_ref)
        # Assumed: an online feature-store client defined elsewhere,
        # exposing an async get_features(user_id) method
        self.feature_store = OnlineFeatureStoreClient()

    @bentoml.api
    async def predict(self, transaction: dict) -> dict:
        # Get precomputed features for this user from the online store
        features = await self.feature_store.get_features(
            transaction["user_id"]
        )
        # Score the transaction
        score = self.model.predict([features])[0]
        return {"fraud_score": float(score)}
Cost Optimization Patterns
| Pattern | Description | Savings |
|---|---|---|
| Spot instances | Use preemptible VMs for training | 60-80% |
| Auto-scaling | Scale serving replicas by demand | 40-60% |
| Model caching | Cache predictions for repeated inputs | Variable |
| Right-sizing | Match instance size to workload | 30-50% |
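As one concrete example, the model-caching row can be as simple as memoizing scores for repeated inputs. A minimal sketch, with a stand-in for the real model call:
from functools import lru_cache

def run_model(features: tuple) -> float:
    # Stand-in for a real model call (e.g. model.predict)
    return sum(features) / len(features)

@lru_cache(maxsize=10_000)
def cached_predict(features: tuple) -> float:
    # Features must be hashable (tuple, not list) to serve as a cache key;
    # repeated inputs return the memoized score without re-running the model
    return run_model(features)

score = cached_predict((0.3, 1.2, 5.0))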
Hybrid Architectures
Many production systems combine patterns:
┌────────────────────────────────────────────────────┐
│ Hybrid Architecture │
├────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Real-Time Path (Critical) │ │
│ │ Request → Model A → Response (p99 < 50ms) │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Batch Path (Background) │ │
│ │ Data → Model B → Store → Pre-compute │ │
│ └─────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────┘
Example: an e-commerce site might use:
- Real-time: Search ranking (must be fast)
- Batch: Recommendation emails (can wait)
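A common way to wire the two paths together is for the real-time path to read batch pre-computed results from a low-latency store, falling back to a live model on a miss. A minimal sketch with Redis; the key scheme and fallback function are assumptions for illustration.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def fallback_recommendations(user_id: str) -> list:
    # Stand-in for a live model call on the real-time path
    return ["popular_item_1", "popular_item_2"]

def recommend(user_id: str) -> list:
    # Batch path: a nightly job pre-computes results and stores them
    precomputed = cache.get(f"recs:{user_id}")
    if precomputed is not None:
        return json.loads(precomputed)
    # Real-time path: compute on the fly when no pre-computed entry exists
    return fallback_recommendations(user_id)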
Key insight: Choose infrastructure based on latency requirements and cost constraints, not just model complexity.
In the next module, we'll dive into data and model versioning with DVC.