Introduction to MLOps

Infrastructure Patterns

ML systems have distinct infrastructure needs for training versus serving. Understanding these patterns helps you design efficient, cost-effective systems.

Training vs Serving Infrastructure

Aspect        Training                      Serving
──────────────────────────────────────────────────────────────────
Goal          Learn from data               Make predictions
Compute       GPU clusters, high memory     CPU/GPU, low latency
Data access   Batch processing              Real-time lookup
Scaling       Scale up (bigger machines)    Scale out (more replicas)
Cost model    Burst compute (pay per run)   Always-on (pay per uptime)

Training Infrastructure Pattern

┌──────────────────────────────────────────────────────┐
│               Training Infrastructure                │
├──────────────────────────────────────────────────────┤
│                                                      │
│  ┌─────────┐     ┌──────────────┐     ┌──────────┐   │
│  │  Data   │────▶│   Compute    │────▶│ Artifact │   │
│  │  Lake   │     │   Cluster    │     │  Store   │   │
│  └─────────┘     │  (GPU/TPU)   │     └──────────┘   │
│       │          └──────────────┘          │         │
│       │                 │                  │         │
│       ▼                 ▼                  ▼         │
│  ┌──────────────────────────────────────────────┐    │
│  │           Orchestrator (Kubeflow)            │    │
│  └──────────────────────────────────────────────┘    │
│                                                      │
└──────────────────────────────────────────────────────┘

Key components:

  • Data lake: S3, GCS, Azure Blob for raw data
  • Compute cluster: Kubernetes with GPU nodes
  • Artifact store: Model registry, experiment logs
  • Orchestrator: Kubeflow or Airflow for pipeline management (see the sketch below)
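
The orchestrator is what turns these components into a repeatable pipeline. Below is a minimal sketch using the Kubeflow Pipelines (KFP v2) SDK; the component body, pipeline name, and data path are illustrative placeholders, not a full training job.

# Example: Minimal training pipeline with Kubeflow Pipelines (KFP v2)
# Component body, pipeline name, and paths are illustrative placeholders.
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def train(data_uri: str) -> str:
    # A real component would read from the data lake and fit a model here
    print(f"Training on {data_uri}")
    return "churn_model_v1"

@dsl.pipeline(name="training-pipeline")
def training_pipeline(data_uri: str = "s3://bucket/raw/"):
    train(data_uri=data_uri)

# Compile to a spec the orchestrator can schedule on the compute cluster
compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")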

Serving Infrastructure Pattern

┌──────────────────────────────────────────────────────┐
│                Serving Infrastructure                │
├──────────────────────────────────────────────────────┤
│                                                      │
│  Client Request                                      │
│       │                                              │
│       ▼                                              │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐      │
│  │   Load   │────▶│  Model   │────▶│ Feature  │      │
│  │ Balancer │     │  Server  │     │  Store   │      │
│  └──────────┘     └──────────┘     └──────────┘      │
│                        │                             │
│                        ▼                             │
│              ┌──────────────────┐                    │
│              │    Monitoring    │                    │
│              └──────────────────┘                    │
│                                                      │
└──────────────────────────────────────────────────────┘

Key components:

  • Load balancer: Distribute requests across replicas
  • Model server: BentoML, TF Serving, TorchServe
  • Feature store: Feast for real-time feature lookup
  • Monitoring: Latency, throughput, and drift detection (a metrics sketch follows)
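
For the monitoring component, a common pattern is exposing latency and throughput metrics for a scraper such as Prometheus. Here is a minimal sketch with the prometheus_client library; the metric names and port are assumptions, not a required convention.

# Example: Exposing serving metrics with prometheus_client
# Metric names and the port are illustrative assumptions.
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("predict_latency_seconds", "Prediction latency")
REQUEST_COUNT = Counter("predict_requests_total", "Total prediction requests")

@REQUEST_LATENCY.time()  # records wall-clock time of each call
def predict(features):
    REQUEST_COUNT.inc()
    ...  # call the model server here

start_http_server(9090)  # metrics served at :9090/metrics for scraping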

Batch vs Real-Time Inference

Pattern          Use Case                  Latency            Tools
─────────────────────────────────────────────────────────────────────────────
Batch            Nightly recommendations   Minutes to hours   Spark, Kubeflow
Real-time        Search ranking            Milliseconds       BentoML, KServe
Near real-time   Fraud detection           Seconds            Kafka + model server
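
The near real-time row is typically a stream consumer that scores events as they arrive. Here is a minimal sketch with kafka-python and an MLflow model; the topic, broker address, model URI, and message fields are assumptions.

# Example: Near real-time scoring from a Kafka stream
# Topic, broker, model URI, and message fields are illustrative assumptions.
import json
import mlflow
from kafka import KafkaConsumer

model = mlflow.pyfunc.load_model("models:/fraud_model/1")
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    txn = message.value
    # Each event is scored within seconds of arrival
    score = model.predict([txn["features"]])[0]
    if score > 0.9:
        print(f"Flagging transaction {txn['id']}")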

Batch Inference

# Example: Batch prediction with Spark
from pyspark.sql import SparkSession
import mlflow

spark = SparkSession.builder.appName("batch_predict").getOrCreate()

# Load the dataset to score (input path is illustrative)
df = spark.read.parquet("s3://bucket/features/")

# Load the production model from the registry as a Spark UDF
model = mlflow.pyfunc.spark_udf(spark, "models:/churn_model/production")

# Apply the model to the large dataset in parallel, then persist results
predictions = df.withColumn("prediction", model("features"))
predictions.write.parquet("s3://bucket/predictions/")

Real-Time Inference

# Example: Real-time serving with BentoML and Feast
# A sketch: the model tag, Feast repo path, and feature names are assumptions.
import bentoml
from feast import FeatureStore

@bentoml.service
class FraudDetector:
    def __init__(self):
        # Load the latest model version from the local BentoML model store
        self.model = bentoml.sklearn.load_model("fraud_model:latest")
        # Client for the online feature store (Feast repo in working dir)
        self.feature_store = FeatureStore(repo_path=".")

    @bentoml.api
    def predict(self, transaction: dict) -> dict:
        # Look up precomputed features for this user from the online store
        features = self.feature_store.get_online_features(
            features=["user_stats:txn_count_7d", "user_stats:avg_amount"],
            entity_rows=[{"user_id": transaction["user_id"]}],
        ).to_df().drop(columns=["user_id"])
        # Score; feature column order must match how the model was trained
        score = float(self.model.predict(features)[0])
        return {"fraud_score": score}

Cost Optimization Patterns

Pattern          Description                             Typical savings
─────────────────────────────────────────────────────────────────────────
Spot instances   Use preemptible VMs for training        60-80%
Auto-scaling     Scale serving replicas with demand      40-60%
Model caching    Cache predictions for repeated inputs   Variable
Right-sizing     Match instance size to workload         30-50%
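
Of these patterns, model caching is the easiest to sketch in application code: memoize scores so repeated inputs skip the model entirely. A minimal sketch with functools; the model URI, cache size, and tuple-encoded features are assumptions.

# Example: Caching predictions for repeated inputs
# The model URI, cache size, and tuple encoding are illustrative choices.
from functools import lru_cache

import mlflow

# Load once at startup
model = mlflow.pyfunc.load_model("models:/churn_model/1")

@lru_cache(maxsize=10_000)
def cached_predict(features: tuple) -> float:
    # lru_cache needs hashable arguments, so features arrive as a tuple
    return float(model.predict([list(features)])[0])

score = cached_predict((0.3, 1.2, 5.0))  # repeated inputs hit the cache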

Hybrid Architectures

Many production systems combine patterns:

┌────────────────────────────────────────────────────┐
│                 Hybrid Architecture                │
├────────────────────────────────────────────────────┤
│                                                    │
│  ┌─────────────────────────────────────────────┐   │
│  │         Real-Time Path (Critical)           │   │
│  │  Request → Model A → Response (p99 < 50ms)  │   │
│  └─────────────────────────────────────────────┘   │
│                                                    │
│  ┌─────────────────────────────────────────────┐   │
│  │          Batch Path (Background)            │   │
│  │  Data → Model B → Store → Pre-compute       │   │
│  └─────────────────────────────────────────────┘   │
│                                                    │
└────────────────────────────────────────────────────┘

Example: An e-commerce site might use:

  • Real-time: Search ranking (must be fast)
  • Batch: Recommendation emails (can wait; see the wiring sketch below)
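
One way to wire the two paths together: the batch path pre-computes results into a low-latency store overnight, and the real-time path reads them with a fast fallback on a miss. A minimal sketch using Redis as the store; the key scheme and fallback items are assumptions.

# Example: Real-time path reading batch pre-computed results
# The Redis key scheme and fallback items are illustrative assumptions.
import redis

store = redis.Redis(host="localhost", port=6379)

def get_recommendations(user_id: str) -> list[str]:
    # The batch path (Model B) wrote these overnight; reads are fast
    cached = store.get(f"recs:{user_id}")
    if cached is not None:
        return cached.decode().split(",")
    # Fallback keeps the real-time path within its latency budget
    return ["popular_item_1", "popular_item_2"]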

Key insight: Choose infrastructure based on latency requirements and cost constraints, not just model complexity.

In the next module, we'll dive into data and model versioning with DVC.
