Introduction to MLOps

Infrastructure Patterns

ML systems have distinct infrastructure needs for training versus serving. Understanding these patterns helps you design efficient, cost-effective systems.

Training vs Serving Infrastructure

Aspect        Training                      Serving
──────────────────────────────────────────────────────────────────
Goal          Learn from data               Make predictions
Compute       GPU clusters, high memory     CPU/GPU, low latency
Data access   Batch processing              Real-time lookup
Scaling       Scale up (bigger machines)    Scale out (more replicas)
Cost model    Burst compute (pay per run)   Always-on (pay per uptime)

Training Infrastructure Pattern

┌──────────────────────────────────────────────────────┐
│               Training Infrastructure                │
├──────────────────────────────────────────────────────┤
│                                                      │
│  ┌─────────┐     ┌──────────────┐     ┌──────────┐   │
│  │  Data   │────▶│   Compute    │────▶│ Artifact │   │
│  │  Lake   │     │   Cluster    │     │  Store   │   │
│  └─────────┘     │  (GPU/TPU)   │     └──────────┘   │
│       │          └──────────────┘          │         │
│       │                 │                  │         │
│       ▼                 ▼                  ▼         │
│  ┌──────────────────────────────────────────────┐    │
│  │           Orchestrator (Kubeflow)            │    │
│  └──────────────────────────────────────────────┘    │
│                                                      │
└──────────────────────────────────────────────────────┘

Key components:

  • Data lake: S3, GCS, Azure Blob for raw data
  • Compute cluster: Kubernetes with GPU nodes
  • Artifact store: Model registry, experiment logs
  • Orchestrator: Kubeflow or Airflow for pipeline management (see the sketch below)
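
The orchestrator is what turns these components into a repeatable pipeline. Below is a minimal sketch using the Kubeflow Pipelines (KFP v2) SDK; the component body, pipeline name, and data path are illustrative placeholders, not a full training job.

# Example: Minimal training pipeline with Kubeflow Pipelines (KFP v2)
# Component body, pipeline name, and paths are illustrative placeholders.
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def train(data_uri: str) -> str:
    # A real component would read from the data lake and fit a model here
    print(f"Training on {data_uri}")
    return "churn_model_v1"

@dsl.pipeline(name="training-pipeline")
def training_pipeline(data_uri: str = "s3://bucket/raw/"):
    train(data_uri=data_uri)

# Compile to a spec the orchestrator can schedule on the compute cluster
compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")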

Serving Infrastructure Pattern

┌──────────────────────────────────────────────────────┐
│                Serving Infrastructure                │
├──────────────────────────────────────────────────────┤
│                                                      │
│  Client Request                                      │
│       │                                              │
│       ▼                                              │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐      │
│  │   Load   │────▶│  Model   │────▶│ Feature  │      │
│  │ Balancer │     │  Server  │     │  Store   │      │
│  └──────────┘     └──────────┘     └──────────┘      │
│                        │                             │
│                        ▼                             │
│              ┌──────────────────┐                    │
│              │    Monitoring    │                    │
│              └──────────────────┘                    │
│                                                      │
└──────────────────────────────────────────────────────┘

Key components:

  • Load balancer: Distribute requests across replicas
  • Model server: BentoML, TF Serving, TorchServe
  • Feature store: Feast for real-time feature lookup
  • Monitoring: Latency, throughput, and drift detection (a metrics sketch follows)
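
For the monitoring component, a common pattern is exposing latency and throughput metrics for a scraper such as Prometheus. Here is a minimal sketch with the prometheus_client library; the metric names and port are assumptions, not a required convention.

# Example: Exposing serving metrics with prometheus_client
# Metric names and the port are illustrative assumptions.
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("predict_latency_seconds", "Prediction latency")
REQUEST_COUNT = Counter("predict_requests_total", "Total prediction requests")

@REQUEST_LATENCY.time()  # records wall-clock time of each call
def predict(features):
    REQUEST_COUNT.inc()
    ...  # call the model server here

start_http_server(9090)  # metrics served at :9090/metrics for scraping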

Batch vs Real-Time Inference

Pattern          Use Case                  Latency            Tools
─────────────────────────────────────────────────────────────────────────────
Batch            Nightly recommendations   Minutes to hours   Spark, Kubeflow
Real-time        Search ranking            Milliseconds       BentoML, KServe
Near real-time   Fraud detection           Seconds            Kafka + model server
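
The near real-time row is typically a stream consumer that scores events as they arrive. Here is a minimal sketch with kafka-python and an MLflow model; the topic, broker address, model URI, and message fields are assumptions.

# Example: Near real-time scoring from a Kafka stream
# Topic, broker, model URI, and message fields are illustrative assumptions.
import json
import mlflow
from kafka import KafkaConsumer

model = mlflow.pyfunc.load_model("models:/fraud_model/1")
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    txn = message.value
    # Each event is scored within seconds of arrival
    score = model.predict([txn["features"]])[0]
    if score > 0.9:
        print(f"Flagging transaction {txn['id']}")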

Batch Inference

# Example: Batch prediction with Spark
from pyspark.sql import SparkSession
import mlflow

spark = SparkSession.builder.appName("batch_predict").getOrCreate()

# Load the dataset to score (input path is illustrative)
df = spark.read.parquet("s3://bucket/features/")

# Load the production model from the registry as a Spark UDF
model = mlflow.pyfunc.spark_udf(spark, "models:/churn_model/production")

# Apply the model to the large dataset in parallel, then persist results
predictions = df.withColumn("prediction", model("features"))
predictions.write.parquet("s3://bucket/predictions/")

Real-Time Inference

# Example: Real-time serving with BentoML and Feast
# A sketch: the model tag, Feast repo path, and feature names are assumptions.
import bentoml
from feast import FeatureStore

@bentoml.service
class FraudDetector:
    def __init__(self):
        # Load the latest model version from the local BentoML model store
        self.model = bentoml.sklearn.load_model("fraud_model:latest")
        # Client for the online feature store (Feast repo in working dir)
        self.feature_store = FeatureStore(repo_path=".")

    @bentoml.api
    def predict(self, transaction: dict) -> dict:
        # Look up precomputed features for this user from the online store
        features = self.feature_store.get_online_features(
            features=["user_stats:txn_count_7d", "user_stats:avg_amount"],
            entity_rows=[{"user_id": transaction["user_id"]}],
        ).to_df().drop(columns=["user_id"])
        # Score; feature column order must match how the model was trained
        score = float(self.model.predict(features)[0])
        return {"fraud_score": score}

Cost Optimization Patterns

Pattern          Description                             Typical savings
─────────────────────────────────────────────────────────────────────────
Spot instances   Use preemptible VMs for training        60-80%
Auto-scaling     Scale serving replicas with demand      40-60%
Model caching    Cache predictions for repeated inputs   Variable
Right-sizing     Match instance size to workload         30-50%
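
Of these patterns, model caching is the easiest to sketch in application code: memoize scores so repeated inputs skip the model entirely. A minimal sketch with functools; the model URI, cache size, and tuple-encoded features are assumptions.

# Example: Caching predictions for repeated inputs
# The model URI, cache size, and tuple encoding are illustrative choices.
from functools import lru_cache

import mlflow

# Load once at startup
model = mlflow.pyfunc.load_model("models:/churn_model/1")

@lru_cache(maxsize=10_000)
def cached_predict(features: tuple) -> float:
    # lru_cache needs hashable arguments, so features arrive as a tuple
    return float(model.predict([list(features)])[0])

score = cached_predict((0.3, 1.2, 5.0))  # repeated inputs hit the cache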

Hybrid Architectures

Many production systems combine patterns:

┌────────────────────────────────────────────────────┐
│                 Hybrid Architecture                │
├────────────────────────────────────────────────────┤
│                                                    │
│  ┌─────────────────────────────────────────────┐   │
│  │         Real-Time Path (Critical)           │   │
│  │  Request → Model A → Response (p99 < 50ms)  │   │
│  └─────────────────────────────────────────────┘   │
│                                                    │
│  ┌─────────────────────────────────────────────┐   │
│  │          Batch Path (Background)            │   │
│  │  Data → Model B → Store → Pre-compute       │   │
│  └─────────────────────────────────────────────┘   │
│                                                    │
└────────────────────────────────────────────────────┘

Example: An e-commerce site might use:

  • Real-time: Search ranking (must be fast)
  • Batch: Recommendation emails (can wait; see the wiring sketch below)
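
One way to wire the two paths together: the batch path pre-computes results into a low-latency store overnight, and the real-time path reads them with a fast fallback on a miss. A minimal sketch using Redis as the store; the key scheme and fallback items are assumptions.

# Example: Real-time path reading batch pre-computed results
# The Redis key scheme and fallback items are illustrative assumptions.
import redis

store = redis.Redis(host="localhost", port=6379)

def get_recommendations(user_id: str) -> list[str]:
    # The batch path (Model B) wrote these overnight; reads are fast
    cached = store.get(f"recs:{user_id}")
    if cached is not None:
        return cached.decode().split(",")
    # Fallback keeps the real-time path within its latency budget
    return ["popular_item_1", "popular_item_2"]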

Key insight: Choose infrastructure based on latency requirements and cost constraints, not just model complexity.

In the next module, we'll dive into data and model versioning with DVC.
