Model Registry & Serving
Model Registry Concepts
A model registry is a centralized hub for managing the lifecycle of ML models—from experimentation to production. It brings version control, governance, and collaboration to model management.
Why Model Registry?
Without a registry:
Models scattered across:
├── /home/alice/models/best_model_v2_final_FINAL.pkl
├── /home/bob/experiments/model_2025_01_15.h5
├── s3://bucket/models/classifier/
├── /mnt/shared/archived_models/
└── "I think the production model is in Slack somewhere..."
With a registry:
Model Registry
├── fraud-detector
│ ├── Version 1 (Staging)
│ ├── Version 2 (Production) ← Current
│ └── Version 3 (Development)
├── recommendation-engine
│ ├── Version 1 (Archived)
│ └── Version 2 (Production)
└── churn-predictor
└── Version 1 (Production)
Core Concepts
Model
A trained ML model ready for deployment:
```python
# What gets registered
model = {
    "name": "fraud-detector",
    "version": 3,
    "artifacts": {
        "model.pkl": "s3://bucket/models/fraud/v3/model.pkl",
        "preprocessor.pkl": "s3://bucket/models/fraud/v3/preprocessor.pkl"
    },
    "metrics": {
        "accuracy": 0.95,
        "f1_score": 0.93,
        "auc_roc": 0.98
    },
    "parameters": {
        "n_estimators": 100,
        "max_depth": 10
    },
    "tags": {
        "team": "risk",
        "use_case": "real-time fraud detection"
    }
}
```
Model Version
Each training run produces a new version:
fraud-detector
├── v1: accuracy=0.85, created=2025-01-01
├── v2: accuracy=0.90, created=2025-01-15
└── v3: accuracy=0.95, created=2025-01-20 ← Latest
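Because every version carries immutable metadata, consumers can resolve "latest" or "best" programmatically instead of guessing from filenames. A minimal sketch of that resolution (the version records below are illustrative, not a real registry API):

```python
# Immutable per-version metadata, mirroring the fraud-detector history above.
versions = [
    {"version": 1, "accuracy": 0.85, "created": "2025-01-01"},
    {"version": 2, "accuracy": 0.90, "created": "2025-01-15"},
    {"version": 3, "accuracy": 0.95, "created": "2025-01-20"},
]

# Resolve "latest" by version number and "best" by a tracked metric.
latest = max(versions, key=lambda v: v["version"])
best = max(versions, key=lambda v: v["accuracy"])

print(latest["version"])  # 3
print(best["accuracy"])   # 0.95
```

Here "latest" and "best" happen to coincide; when they don't, the metadata makes the trade-off explicit instead of hiding it in a filename.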
Model Stage
Stages track where a model is in its lifecycle:
| Stage | Description | Who Can Access |
|---|---|---|
| Development | Experimental, not tested | Data scientists |
| Staging | Under testing/validation | QA team |
| Production | Live, serving traffic | Production systems |
| Archived | Deprecated, kept for audit | Compliance |
Development ──▶ Staging ──▶ Production
│
▼
Archived
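The lifecycle above is effectively a small state machine: forward promotions only, plus an "archive from anywhere" edge. A hedged sketch of that rule (stage names follow the table above; the function itself is illustrative, not part of any registry's API):

```python
# Allowed forward promotions; any stage may additionally move to Archived.
PROMOTIONS = {
    "Development": {"Staging"},
    "Staging": {"Production"},
    "Production": set(),
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving current -> target respects the lifecycle."""
    if target == "Archived":
        return True  # models can be archived from any stage for audit
    return target in PROMOTIONS.get(current, set())
```

Under this rule, `can_transition("Staging", "Production")` succeeds, while skipping straight from Development to Production is rejected.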
Model Registry Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Model Registry │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Metadata Store │ │
│ │ • Model name, version, stage │ │
│ │ • Training parameters │ │
│ │ • Metrics and tags │ │
│ │ • Lineage (data, code, experiment) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Artifact Store │ │
│ │ • Model files (pkl, pt, onnx, savedmodel) │ │
│ │ • Preprocessing pipelines │ │
│ │ • Configuration files │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
Training Serving CI/CD
Pipeline System Pipeline
Key Features
1. Version Control
```python
# Register multiple versions under the same registered model name
mlflow.register_model("runs:/abc123/model", "fraud-detector")  # v1
mlflow.register_model("runs:/def456/model", "fraud-detector")  # v2
mlflow.register_model("runs:/ghi789/model", "fraud-detector")  # v3
```
2. Stage Transitions
```python
# Promote model to production
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Production"
)
```
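The payoff of stage transitions is on the serving side: deployments can load a model by stage rather than by version number, so a promotion takes effect without redeploying. MLflow expresses this with a `models:/` URI; the small helper below is illustrative, not part of MLflow's API:

```python
def registry_uri(name: str, stage_or_version: str) -> str:
    """Build an MLflow model-registry URI, e.g. models:/fraud-detector/Production.

    The result is what mlflow.pyfunc.load_model() consumes:
        model = mlflow.pyfunc.load_model(registry_uri("fraud-detector", "Production"))
    """
    return f"models:/{name}/{stage_or_version}"

print(registry_uri("fraud-detector", "Production"))  # models:/fraud-detector/Production
print(registry_uri("fraud-detector", "3"))           # models:/fraud-detector/3
```

A service that always loads the `Production` URI picks up v3 the moment the transition above completes.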
3. Model Lineage
Model: fraud-detector v3
├── Training Run: experiment_123/run_456
├── Dataset: s3://bucket/data/train_2025_01.parquet
├── Code: git@github.com:org/repo.git@commit_abc
├── Environment: python=3.11, sklearn=1.4.0
└── Parent Model: fraud-detector v2
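Lineage like the tree above is typically captured as tags or metadata at registration time, not reconstructed later. A hedged sketch of assembling such a record (field names mirror the tree; this is not a specific registry's schema):

```python
def lineage_record(run_id, dataset_uri, code_ref, environment, parent=None):
    """Bundle lineage facts into one dict to attach as tags when registering."""
    record = {
        "training_run": run_id,
        "dataset": dataset_uri,
        "code": code_ref,
        "environment": environment,
    }
    if parent:
        record["parent_model"] = parent
    return record

tags = lineage_record(
    run_id="experiment_123/run_456",
    dataset_uri="s3://bucket/data/train_2025_01.parquet",
    code_ref="git@github.com:org/repo.git@commit_abc",
    environment="python=3.11, sklearn=1.4.0",
    parent="fraud-detector v2",
)
```

With these tags attached, any production incident can be traced back to the exact data, code, and environment that produced the model.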
4. Access Control
| Role | Permissions |
|---|---|
| Data Scientist | Create, read models |
| ML Engineer | Promote to staging |
| DevOps | Promote to production |
| Admin | Delete, archive models |
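Registries enforce tables like this as role-based checks on each operation. A minimal sketch (role and action names mirror the table; the assumption that broader roles inherit narrower permissions is mine, and real systems usually delegate this to IAM):

```python
# Role -> allowed actions; higher roles inherit lower-role permissions here.
PERMISSIONS = {
    "data_scientist": {"create", "read"},
    "ml_engineer": {"create", "read", "promote_staging"},
    "devops": {"read", "promote_production"},
    "admin": {"create", "read", "promote_staging", "promote_production",
              "delete", "archive"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True if the role may perform the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role, set())
```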
Model Metadata
What to Track
| Category | Examples |
|---|---|
| Identity | Name, version, aliases |
| Performance | Accuracy, latency, throughput |
| Training | Hyperparameters, dataset version |
| Lineage | Experiment ID, code commit |
| Operational | Owner, team, SLA requirements |
Example Metadata
```yaml
model:
  name: fraud-detector
  version: 3
  stage: Production
  metrics:
    accuracy: 0.95
    f1_score: 0.93
    latency_p99_ms: 15
    throughput_qps: 1000
  training:
    experiment_id: exp_123
    run_id: run_456
    dataset_version: v2.1
    training_date: "2025-01-20"
  parameters:
    algorithm: XGBoost
    n_estimators: 100
    max_depth: 10
    learning_rate: 0.1
  tags:
    team: risk
    owner: alice@company.com
    compliance: SOC2
```
Model Registry Options
| Tool | Type | Best For |
|---|---|---|
| MLflow | Open-source | General purpose |
| Weights & Biases | Managed | Experiment tracking + registry |
| Neptune | Managed | MLOps teams |
| SageMaker | Cloud | AWS ecosystem |
| Vertex AI | Cloud | GCP ecosystem |
Best Practices
| Practice | Why |
|---|---|
| One model per use case | Clear ownership and versioning |
| Meaningful version descriptions | Know what changed |
| Automate stage transitions | Reduce human error |
| Enforce approval workflows | Governance and compliance |
| Track all metadata | Full reproducibility |
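"Automate stage transitions" and "enforce approval workflows" often combine into a promotion gate: CI compares the candidate version's metrics against the current Production model and only then calls the registry's transition API. A hedged sketch of the gating logic (the metric values and zero-lift threshold are illustrative):

```python
def passes_gate(candidate: dict, production: dict, min_lift: float = 0.0) -> bool:
    """Approve promotion only if the candidate matches or beats Production
    on every metric the Production model reports."""
    return all(candidate[m] >= production[m] + min_lift for m in production)

candidate = {"accuracy": 0.95, "f1_score": 0.93}
production = {"accuracy": 0.90, "f1_score": 0.91}

if passes_gate(candidate, production):
    # In MLflow, this is where CI would call
    # client.transition_model_version_stage(name=..., version=..., stage="Production")
    print("promote")  # prints "promote"
```

Keeping the gate in code means the approval criteria are versioned and auditable, rather than living in someone's head.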
Key insight: A model registry transforms model management from ad-hoc file sharing to a governed, auditable process—essential for production ML at scale.
Next, we'll explore MLflow Model Registry in depth.