LLM Gateways & Routing
LiteLLM: Unified LLM Gateway
LiteLLM is the leading open-source LLM gateway, providing a unified OpenAI-compatible interface for 100+ LLM providers. In 2025, LiteLLM introduced the Agent Gateway (A2A - Agent-to-Agent) protocol for inter-agent communication and achieved 8ms P95 latency at 1,000 requests per second.
Core Features
┌──────────────────────────────────────────────────────────┐
│                     LiteLLM Gateway                      │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Unified API                                             │
│  ┌──────────────────────────────────────────────────┐    │
│  │ /v1/chat/completions     → All chat models       │    │
│  │ /v1/embeddings           → All embedding models  │    │
│  │ /v1/images/generations   → Image models          │    │
│  │ /v1/audio/transcriptions → Speech models         │    │
│  │ /v1/search               → Unified search (2025) │    │
│  └──────────────────────────────────────────────────┘    │
│                                                          │
│  100+ Providers Supported:                               │
│  OpenAI, Anthropic, Azure, AWS Bedrock, Google Vertex,   │
│  Cohere, Together, Replicate, Ollama, vLLM, and more...  │
│                                                          │
└──────────────────────────────────────────────────────────┘
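The non-chat endpoints follow the same pattern. As a minimal sketch of the unified embeddings call (the model name here is illustrative, and the provider key is read from the environment):

from litellm import embedding

# Same unified interface for embedding models; the provider is inferred
# from the model name (illustrative model shown here).
response = embedding(
    model="text-embedding-3-small",
    input=["LiteLLM provides a unified interface."]
)
vector = response.data[0]["embedding"]  # OpenAI-style response shape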
Quick Start
Installation
pip install litellm
# Or with proxy server
pip install 'litellm[proxy]'
SDK Usage
from litellm import completion
# OpenAI
response = completion(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}]
)
# Anthropic - same interface
response = completion(
model="claude-sonnet-4-20250514",
messages=[{"role": "user", "content": "Hello!"}]
)
# Azure OpenAI
response = completion(
model="azure/gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
api_base="https://your-resource.openai.azure.com",
api_key="your-azure-key"
)
# AWS Bedrock
response = completion(
model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
messages=[{"role": "user", "content": "Hello!"}]
)
# Local Ollama
response = completion(
model="ollama/llama3.2",
messages=[{"role": "user", "content": "Hello!"}],
api_base="http://localhost:11434"
)
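Streaming uses the same interface across providers: pass stream=True and iterate over OpenAI-style chunks. A minimal sketch:

from litellm import completion

# Streaming: identical call shape, any provider
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about gateways."}],
    stream=True
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")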
Async Support
import asyncio
from litellm import acompletion
async def generate_responses():
tasks = [
acompletion(model="gpt-4o", messages=[{"role": "user", "content": f"Question {i}"}])
for i in range(10)
]
responses = await asyncio.gather(*tasks)
return responses
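Continuing the snippet above, the batch can then be driven from synchronous code:

# Run the ten requests concurrently and collect the results
responses = asyncio.run(generate_responses())
print(f"{len(responses)} responses received")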
Proxy Server Deployment
Configuration
# litellm_config.yaml
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-20250514
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: gpt-4o
litellm_params:
model: azure/gpt-4o-deployment
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
# Fallback configuration
router_settings:
routing_strategy: "latency-based"
retry_policy:
num_retries: 3
fallbacks: ["gpt-4o", "claude-sonnet"]
# Budget management
litellm_settings:
max_budget: 1000 # USD per month
budget_duration: monthly
Running the Proxy
# Start proxy server
litellm --config litellm_config.yaml --port 4000
# Or with Docker
docker run -p 4000:4000 \
-v $(pwd)/litellm_config.yaml:/app/config.yaml \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml
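Once the proxy is running, a quick liveness check confirms it is reachable. This assumes the proxy's default liveliness route; the authenticated /health route additionally checks the configured models:

import requests

# Unauthenticated liveliness probe (assumed default route)
r = requests.get("http://localhost:4000/health/liveliness")
print(r.status_code, r.text)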
Using the Proxy
from openai import OpenAI
# Point to LiteLLM proxy
client = OpenAI(
base_url="http://localhost:4000/v1",
api_key="your-litellm-key"
)
# Use any configured model
response = client.chat.completions.create(
model="gpt-4o", # Routed by LiteLLM
messages=[{"role": "user", "content": "Hello!"}]
)
Model Routing Strategies
Latency-Based Routing
router_settings:
routing_strategy: "latency-based"
# Route to lowest-latency deployment
Cost-Based Routing
router_settings:
routing_strategy: "cost-based"
# Route to cheapest available option
Load Balancing
model_list:
# Multiple deployments of same model
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
rpm: 1000 # Rate limit
model_info:
weight: 0.5
- model_name: gpt-4o
litellm_params:
model: azure/gpt-4o
rpm: 2000
model_info:
weight: 0.5
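The same load balancing and routing behavior is also available directly in the SDK through litellm's Router, without running the proxy. A minimal sketch; the exact routing_strategy string and the Azure parameters shown are assumptions to adapt to your deployment:

from litellm import Router

# Two deployments registered under one public model name;
# the Router load-balances and retries across them.
router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {"model": "openai/gpt-4o"},
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o-deployment",
                "api_base": "https://your-resource.openai.azure.com",
                "api_key": "your-azure-key",
            },
        },
    ],
    routing_strategy="latency-based-routing",  # assumption: strategy name may vary by version
    num_retries=3,
)

response = router.completion(
    model="gpt-4o",  # resolved to one of the deployments above
    messages=[{"role": "user", "content": "Hello!"}],
)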
Agent Gateway (A2A Protocol)
The Agent Gateway enables agent-to-agent communication with standardized protocols:
from litellm import acompletion, Agent, AgentGateway
# Define an agent
class ResearchAgent(Agent):
name = "research_agent"
description = "Researches topics and provides summaries"
async def process(self, query: str) -> str:
response = await acompletion(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a research assistant."},
{"role": "user", "content": query}
]
)
return response.choices[0].message.content
# Register with gateway
gateway = AgentGateway()
gateway.register(ResearchAgent())
# Other agents can discover and call it (from within an async context)
result = await gateway.call(
agent="research_agent",
query="What are the latest advances in quantum computing?"
)
Virtual Keys and Team Management
# Per-team budget and access
general_settings:
master_key: "sk-master-key"
team_settings:
- team_id: "engineering"
max_budget: 500
models: ["gpt-4o", "claude-sonnet"]
- team_id: "data-science"
max_budget: 1000
models: ["gpt-4o", "claude-sonnet", "embedding-*"]
# Generate team-specific keys
import requests
response = requests.post(
"http://localhost:4000/key/generate",
json={
"team_id": "engineering",
"key_alias": "eng-dev-key",
"max_budget": 100,
"duration": "30d"
},
headers={"Authorization": "Bearer sk-master-key"}
)
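The generated virtual key can then be used like any OpenAI API key against the proxy; the "key" field name below is an assumption about the /key/generate response body:

from openai import OpenAI

# Continues from the /key/generate call above
virtual_key = response.json()["key"]  # assumed field name in the response

client = OpenAI(base_url="http://localhost:4000/v1", api_key=virtual_key)
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from the engineering team!"}]
)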
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: litellm-proxy
spec:
replicas: 3
selector:
matchLabels:
app: litellm
template:
metadata:
labels:
app: litellm
spec:
containers:
- name: litellm
        image: ghcr.io/berriai/litellm:main-latest
        args: ["--config", "/app/config.yaml"]
ports:
- containerPort: 4000
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: llm-secrets
key: openai-key
volumeMounts:
- name: config
mountPath: /app/config.yaml
subPath: config.yaml
volumes:
- name: config
configMap:
name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
name: litellm
spec:
selector:
app: litellm
ports:
- port: 4000
targetPort: 4000
Performance Benchmarks (2025)
| Metric | Value |
|---|---|
| P50 Latency | 3ms |
| P95 Latency | 8ms |
| P99 Latency | 15ms |
| Throughput | 1,000+ RPS per instance |
| Memory | ~200MB base |