Model Serving with BentoML
BentoML simplifies turning ML models into production-ready APIs. It handles packaging, containerization, and deployment with minimal boilerplate.
Installation
pip install bentoml
# Verify installation
bentoml --version
Saving Models to BentoML
Save from Training
import bentoml
from sklearn.ensemble import RandomForestClassifier
# Train a model (X_train and y_train come from your own data pipeline)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Save to BentoML model store
saved_model = bentoml.sklearn.save_model(
"fraud_detector",
model,
signatures={"predict": {"batchable": True}},
labels={"team": "risk", "version": "1.0"},
metadata={"accuracy": 0.95}
)
print(f"Saved: {saved_model.tag}")
# Output: fraud_detector:abc123xyz
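Saved models land in the local model store, which you can manage from the CLI (the tag below matches the example output above):

# List all models in the local store
bentoml models list
# Show details for a specific model
bentoml models get fraud_detector:latest
# Delete a model version you no longer need
bentoml models delete fraud_detector:abc123xyz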
Framework Support
# PyTorch
import bentoml
import torch
bentoml.pytorch.save_model("my_pytorch_model", pytorch_model)
# TensorFlow/Keras
bentoml.keras.save_model("my_keras_model", keras_model)
# XGBoost
bentoml.xgboost.save_model("my_xgb_model", xgb_model)
# ONNX
bentoml.onnx.save_model("my_onnx_model", onnx_model)
# Custom models
bentoml.picklable_model.save_model("custom_model", my_model)
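Each framework namespace pairs save_model with a matching load_model, so a model goes back out through the same integration it came in with. A minimal sketch, using the tags saved above:

# Load back through the same framework namespace used to save
pytorch_model = bentoml.pytorch.load_model("my_pytorch_model:latest")
xgb_model = bentoml.xgboost.load_model("my_xgb_model:latest")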
Creating a Service
Basic Service Definition
# service.py
import bentoml
import numpy as np
@bentoml.service(
resources={"cpu": "2", "memory": "4Gi"},
traffic={"timeout": 30}
)
class FraudDetector:
def __init__(self):
# Load model from BentoML store
self.model = bentoml.sklearn.load_model("fraud_detector:latest")
@bentoml.api
def predict(self, features: np.ndarray) -> np.ndarray:
"""Predict fraud probability."""
return self.model.predict_proba(features)
@bentoml.api
def classify(self, features: np.ndarray) -> list[int]:
"""Classify as fraud or not."""
return self.model.predict(features).tolist()
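Request bodies map JSON keys to the API's parameter names, so a raw HTTP call to this service would look roughly like the following (a sketch; the feature values are illustrative and assume the dev server from the next section is running):

curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [[1500.0, 5, 3, 1]]}'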
With Pydantic Validation
# service.py
import bentoml
from pydantic import BaseModel
import numpy as np
class TransactionInput(BaseModel):
amount: float
merchant_category: int
hour_of_day: int
is_international: bool
class PredictionOutput(BaseModel):
is_fraud: bool
confidence: float
@bentoml.service(
resources={"cpu": "2", "memory": "4Gi"}
)
class FraudDetector:
def __init__(self):
self.model = bentoml.sklearn.load_model("fraud_detector:latest")
@bentoml.api
def predict(self, transaction: TransactionInput) -> PredictionOutput:
features = np.array([[
transaction.amount,
transaction.merchant_category,
transaction.hour_of_day,
int(transaction.is_international)
]])
proba = self.model.predict_proba(features)[0]
        # Cast the numpy bool_ to a plain bool so Pydantic validation accepts it
        is_fraud = bool(proba[1] > 0.5)
        return PredictionOutput(
            is_fraud=is_fraud,
            confidence=float(proba[1]) if is_fraud else float(proba[0])
        )
Running the Service
Development Mode
# Start development server
bentoml serve service:FraudDetector --reload
# Access the API
# Access the API (the JSON body keys on the parameter name, "transaction")
curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '{"transaction": {"amount": 1500.0, "merchant_category": 5, "hour_of_day": 3, "is_international": true}}'
API Documentation
BentoML auto-generates OpenAPI docs at http://localhost:3000/docs.
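You can also call a running service from Python with BentoML's HTTP client; endpoints are exposed as client methods. A sketch, assuming the dev server above is still running:

import bentoml

# Connect to the locally running service
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    result = client.predict(
        transaction={
            "amount": 1500.0,
            "merchant_category": 5,
            "hour_of_day": 3,
            "is_international": True,
        }
    )
    print(result)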
Building a Bento
Create bentofile.yaml
# bentofile.yaml
service: "service:FraudDetector"
labels:
team: risk
environment: production
include:
- "*.py"
python:
packages:
- scikit-learn==1.4.0
- numpy>=1.24.0
- pydantic>=2.0.0
Build the Bento
# Build packaged bento
bentoml build
# List bentos
bentoml list
# Output:
# Tag Size Created
# fraud_detector:v1_abc123 45.2 MiB 2025-01-15
Containerization
Build Docker Image
# Build container from bento
bentoml containerize fraud_detector:latest
# Tag for registry
bentoml containerize fraud_detector:latest -t myregistry/fraud-detector:v1
# Run the container (use the image tag printed by containerize)
docker run -p 3000:3000 fraud_detector:v1_abc123
Customizing the Docker Image
# bentofile.yaml
service: "service:FraudDetector"
docker:
python_version: "3.11"
system_packages:
- libgomp1
env:
- name: MODEL_TIMEOUT
value: "30"
python:
packages:
- scikit-learn==1.4.0
Deployment Options
Deploy to BentoCloud
# Login to BentoCloud
bentoml cloud login
# Deploy
bentoml deploy fraud_detector:latest
# Check status
bentoml deployment list
Kubernetes Deployment
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: fraud-detector
spec:
replicas: 3
selector:
matchLabels:
app: fraud-detector
template:
metadata:
labels:
app: fraud-detector
spec:
containers:
- name: fraud-detector
image: myregistry/fraud-detector:v1
ports:
- containerPort: 3000
resources:
requests:
cpu: "1"
memory: "2Gi"
limits:
cpu: "2"
memory: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
name: fraud-detector-svc
spec:
selector:
app: fraud-detector
ports:
- port: 80
targetPort: 3000
type: LoadBalancer
kubectl apply -f deployment.yaml
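The BentoML server also exposes health endpoints (/healthz, /livez, and /readyz on recent 1.x releases) that you can wire into the container spec above, so traffic only reaches replicas whose model has finished loading. A sketch:

        livenessProbe:
          httpGet:
            path: /livez
            port: 3000
        readinessProbe:
          httpGet:
            path: /readyz
            port: 3000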
Batching and Performance
Adaptive Batching
@bentoml.service(
    traffic={"timeout": 60}
)
class FraudDetector:
    def __init__(self):
        self.model = bentoml.sklearn.load_model("fraud_detector:latest")

    # Batching parameters live on the API decorator, not the service
    @bentoml.api(batchable=True, max_batch_size=100, max_latency_ms=500)
    def predict_batch(self, features: np.ndarray) -> np.ndarray:
        """Process batched requests for efficiency."""
        return self.model.predict_proba(features)
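Batching happens server-side: concurrent single requests are grouped into one model call. A quick way to exercise it is to fire many requests at once (a sketch, assuming the service is running locally; the feature values are illustrative):

import concurrent.futures
import bentoml

def call_once(i: int):
    # Each thread sends an individual request; the server batches them
    with bentoml.SyncHTTPClient("http://localhost:3000") as client:
        return client.predict_batch(features=[[1500.0, 5, 3, 1]])

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(call_once, range(100)))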
Async Endpoints
import asyncio
@bentoml.service
class AsyncFraudDetector:
    def __init__(self):
        self.model = bentoml.sklearn.load_model("fraud_detector:latest")

    @bentoml.api
    async def predict(self, features: np.ndarray) -> np.ndarray:
        # Run the blocking model call in a worker thread so the
        # event loop stays free for other requests
        return await asyncio.to_thread(self.model.predict_proba, features)
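Async endpoints pair naturally with BentoML's async client, which mirrors the sync client with awaitable calls (a sketch, assuming the service is running locally):

import asyncio
import bentoml

async def main():
    async with bentoml.AsyncHTTPClient("http://localhost:3000") as client:
        result = await client.predict(features=[[1500.0, 5, 3, 1]])
        print(result)

asyncio.run(main())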
Multi-Model Services
@bentoml.service
class EnsemblePredictor:
    def __init__(self):
        self.rf_model = bentoml.sklearn.load_model("random_forest:latest")
        # predict_proba assumes the XGBoost model was saved via its sklearn API
        self.xgb_model = bentoml.xgboost.load_model("xgboost:latest")
@bentoml.api
def predict(self, features: np.ndarray) -> dict:
rf_pred = self.rf_model.predict_proba(features)
xgb_pred = self.xgb_model.predict_proba(features)
# Ensemble average
ensemble = (rf_pred + xgb_pred) / 2
return {
"ensemble": ensemble.tolist(),
"random_forest": rf_pred.tolist(),
"xgboost": xgb_pred.tolist()
}
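Both models must already exist in the model store under the names the service loads. A sketch, where rf_classifier and xgb_classifier stand in for your trained models:

# Save each trained model under the name the service expects
bentoml.sklearn.save_model("random_forest", rf_classifier)
bentoml.xgboost.save_model("xgboost", xgb_classifier)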
Comparison: BentoML vs Alternatives
| Feature | BentoML | FastAPI + Docker | TorchServe |
|---|---|---|---|
| Setup complexity | Low | Medium | Medium |
| Framework support | All major | Manual | PyTorch only |
| Batching | Built-in | Manual | Built-in |
| Containerization | One command | Manual | Manual |
| Cloud deploy | BentoCloud | DIY | AWS SageMaker |
Key insight: BentoML bridges the gap between model training and production deployment with minimal code changes: from model.predict() to a scalable API in minutes.
Next, we'll explore canary deployments and A/B testing strategies.