Introduction to MLOps

Infrastructure Patterns


ML systems have distinct infrastructure needs for training versus serving. Understanding these patterns helps you design efficient, cost-effective systems.

Training vs Serving Infrastructure

Aspect      | Training                    | Serving
Goal        | Learn from data             | Make predictions
Compute     | GPU clusters, high memory   | CPU/GPU, low latency
Data access | Batch processing            | Real-time lookup
Scaling     | Scale up (bigger machines)  | Scale out (more replicas)
Cost model  | Burst compute (pay per run) | Always-on (pay per uptime)

Training Infrastructure Pattern

┌──────────────────────────────────────────────────────┐
│                Training Infrastructure                │
├──────────────────────────────────────────────────────┤
│                                                       │
│  ┌─────────┐     ┌──────────────┐     ┌──────────┐  │
│  │  Data   │────▶│   Compute    │────▶│  Artifact │  │
│  │  Lake   │     │   Cluster    │     │   Store   │  │
│  └─────────┘     │  (GPU/TPU)   │     └──────────┘  │
│       │          └──────────────┘          │         │
│       │                 │                  │         │
│       ▼                 ▼                  ▼         │
│  ┌──────────────────────────────────────────────┐   │
│  │           Orchestrator (Kubeflow)             │   │
│  └──────────────────────────────────────────────┘   │
│                                                       │
└──────────────────────────────────────────────────────┘

Key components:

  • Data lake: S3, GCS, Azure Blob for raw data
  • Compute cluster: Kubernetes with GPU nodes
  • Artifact store: Model registry, experiment logs
  • Orchestrator: Kubeflow, Airflow for pipeline management
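The orchestrator's job is to run these stages in dependency order: pull data from the lake, train on the cluster, and write the result to the artifact store. A minimal sketch of that idea in plain Python (a real Kubeflow or Airflow DAG adds scheduling, retries, and distributed execution; the stage functions and model URI here are illustrative stand-ins):

```python
# Toy training pipeline: three stages run in dependency order,
# mirroring data lake -> compute cluster -> artifact store.

def ingest() -> list[float]:
    # Stand-in for reading raw data from the data lake
    return [1.0, 2.0, 3.0, 4.0]

def train(data: list[float]) -> dict:
    # Stand-in for a training job on the compute cluster
    return {"weights": sum(data) / len(data)}

def register(model: dict) -> str:
    # Stand-in for writing the model to the artifact store / registry
    return f"models:/demo_model/1 (weights={model['weights']})"

def run_pipeline() -> str:
    # The orchestrator enforces this ordering: ingest -> train -> register
    data = ingest()
    model = train(data)
    return register(model)

print(run_pipeline())
```

In a real pipeline each function becomes a containerized step, and the orchestrator tracks inputs and outputs between them.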

Serving Infrastructure Pattern

┌──────────────────────────────────────────────────────┐
│                Serving Infrastructure                 │
├──────────────────────────────────────────────────────┤
│                                                       │
│  Client Request                                       │
│       │                                               │
│       ▼                                               │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     │
│  │   Load   │────▶│  Model   │────▶│ Feature  │     │
│  │ Balancer │     │  Server  │     │  Store   │     │
│  └──────────┘     └──────────┘     └──────────┘     │
│                        │                              │
│                        ▼                              │
│              ┌──────────────────┐                    │
│              │    Monitoring    │                    │
│              └──────────────────┘                    │
│                                                       │
└──────────────────────────────────────────────────────┘

Key components:

  • Load balancer: Distribute requests across replicas
  • Model server: BentoML, TF Serving, TorchServe
  • Feature store: Feast for real-time feature lookup
  • Monitoring: Latency, throughput, drift detection
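Latency monitoring usually means tracking tail percentiles, since serving SLOs are stated as p95/p99 rather than averages. A small stdlib-only sketch of a p99 tracker (the nearest-rank method and window size are illustrative choices):

```python
# Minimal latency monitor: records request durations and reports the
# p99 tail latency over the recorded window.

class LatencyMonitor:
    def __init__(self) -> None:
        self.samples_ms: list[float] = []

    def record(self, duration_ms: float) -> None:
        self.samples_ms.append(duration_ms)

    def p99(self) -> float:
        # Nearest-rank percentile over recorded samples
        ordered = sorted(self.samples_ms)
        rank = max(0, int(0.99 * len(ordered)) - 1)
        return ordered[rank]

monitor = LatencyMonitor()
for ms in range(1, 101):       # simulate 100 requests: 1ms .. 100ms
    monitor.record(float(ms))
print(monitor.p99())
```

Production systems typically use a metrics stack (e.g., Prometheus histograms) instead of in-process lists, but the quantity tracked is the same.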

Batch vs Real-Time Inference

Pattern        | Use Case                | Latency       | Tools
Batch          | Nightly recommendations | Minutes-hours | Spark, Kubeflow
Real-time      | Search ranking          | Milliseconds  | BentoML, KServe
Near real-time | Fraud detection         | Seconds       | Kafka + Model
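The near real-time row pairs a message stream with a model consumer: events arrive on a topic and are scored within seconds, not in a nightly batch. A toy sketch with a stdlib queue standing in for a Kafka topic (the scoring rule is made up for illustration):

```python
import queue

# Toy near-real-time loop: a queue stands in for a Kafka topic; the
# consumer scores each event as it arrives.

def score(txn: dict) -> float:
    # Hypothetical fraud rule: flag large amounts (illustration only)
    return 1.0 if txn["amount"] > 1000 else 0.0

topic: queue.Queue = queue.Queue()
for txn in [{"id": 1, "amount": 50}, {"id": 2, "amount": 5000}]:
    topic.put(txn)          # producer side: events land on the topic

results = []
while not topic.empty():
    msg = topic.get()       # consumer side: pull the next event
    results.append((msg["id"], score(msg)))

print(results)
```

With real Kafka the loop becomes a consumer group polling a broker, but the shape — pull event, score, emit — is the same.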

Batch Inference

# Example: Batch prediction with Spark
from pyspark.sql import SparkSession
import mlflow

spark = SparkSession.builder.appName("batch_predict").getOrCreate()

# Load model from the MLflow registry as a Spark UDF
model = mlflow.pyfunc.spark_udf(spark, "models:/churn_model/production")

# Load the input dataset and apply the model at scale
df = spark.read.parquet("s3://bucket/features/")
predictions = df.withColumn("prediction", model("features"))
predictions.write.parquet("s3://bucket/predictions/")

Real-Time Inference

# Example: Real-time with BentoML
import bentoml

@bentoml.service
class FraudDetector:
    model = bentoml.models.get("fraud_model:latest")

    @bentoml.api
    async def predict(self, transaction: dict) -> dict:
        # Fetch features from the online store
        # (self.feature_store is assumed to be configured in __init__,
        # e.g., a Feast online-store client)
        features = await self.feature_store.get_features(
            transaction["user_id"]
        )
        # Score the transaction
        score = self.model.predict([features])[0]
        return {"fraud_score": score}

Cost Optimization Patterns

Pattern        | Description                           | Savings
Spot instances | Use preemptible VMs for training      | 60-80%
Auto-scaling   | Scale serving replicas by demand      | 40-60%
Model caching  | Cache predictions for repeated inputs | Variable
Right-sizing   | Match instance size to workload       | 30-50%
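Of these, model caching is the simplest to sketch: memoize predictions so identical inputs skip the model entirely. A minimal stdlib version (the model function is a stand-in; note the features must be hashable, hence the tuple):

```python
from functools import lru_cache

CALLS = 0  # counts how often the underlying model actually runs

@lru_cache(maxsize=1024)
def predict(features: tuple) -> float:
    # Stand-in for an expensive model call
    global CALLS
    CALLS += 1
    return sum(features) / len(features)

predict((1.0, 2.0, 3.0))   # miss: model runs
predict((1.0, 2.0, 3.0))   # hit: served from the cache
print(CALLS)               # → 1
```

In a multi-replica deployment the cache would live in a shared layer such as Redis rather than in-process, and cached entries need a TTL so predictions refresh after model updates.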

Hybrid Architectures

Many production systems combine patterns:

┌────────────────────────────────────────────────────┐
│                 Hybrid Architecture                 │
├────────────────────────────────────────────────────┤
│                                                     │
│  ┌─────────────────────────────────────────────┐   │
│  │          Real-Time Path (Critical)           │   │
│  │  Request → Model A → Response (p99 < 50ms)   │   │
│  └─────────────────────────────────────────────┘   │
│                                                     │
│  ┌─────────────────────────────────────────────┐   │
│  │          Batch Path (Background)             │   │
│  │  Data → Model B → Store → Pre-compute       │   │
│  └─────────────────────────────────────────────┘   │
│                                                     │
└────────────────────────────────────────────────────┘

Example: E-commerce might use:

  • Real-time: Search ranking (must be fast)
  • Batch: Recommendation emails (can wait)
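One common way to wire the two paths together is read-through: the real-time path first checks a store of batch-precomputed results and only invokes the online model on a miss. A sketch with a dict standing in for the store (all names and values are illustrative):

```python
# Hybrid serving sketch: the batch path pre-computes results into a
# store; the real-time path reads the store and falls back to the
# online model only on a miss.

precomputed = {"user_42": [101, 205, 307]}  # written by the batch job

def online_model(user_id: str) -> list[int]:
    # Stand-in for the slower real-time model call
    return [1, 2, 3]

def recommend(user_id: str) -> list[int]:
    hit = precomputed.get(user_id)
    return hit if hit is not None else online_model(user_id)

print(recommend("user_42"))   # served from the batch store
print(recommend("user_99"))   # falls back to the online model
```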

Key insight: Choose infrastructure based on latency requirements and cost constraints, not just model complexity.

In the next module, we'll dive into data and model versioning with DVC.
