LLM Gateways & Routing
LiteLLM: Unified LLM Gateway
LiteLLM is the leading open-source LLM gateway, providing a unified OpenAI-compatible interface for 100+ LLM providers. In 2025, LiteLLM introduced the Agent Gateway (A2A - Agent-to-Agent) protocol for inter-agent communication and achieved 8ms P95 latency at 1,000 requests per second.
Core Features
┌──────────────────────────────────────────────────────────┐
│                     LiteLLM Gateway                      │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Unified API                                             │
│  ┌──────────────────────────────────────────────────┐    │
│  │ /v1/chat/completions     → All chat models       │    │
│  │ /v1/embeddings           → All embedding models  │    │
│  │ /v1/images/generations   → Image models          │    │
│  │ /v1/audio/transcriptions → Speech models         │    │
│  │ /v1/search               → Unified search (2025) │    │
│  └──────────────────────────────────────────────────┘    │
│                                                          │
│  100+ Providers Supported:                               │
│  OpenAI, Anthropic, Azure, AWS Bedrock, Google Vertex,   │
│  Cohere, Together, Replicate, Ollama, vLLM, and more...  │
│                                                          │
└──────────────────────────────────────────────────────────┘
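The non-chat endpoints follow the same pattern. As a minimal sketch of the unified embeddings call (the model name here is illustrative, and the provider key is read from the environment):

from litellm import embedding

# Same unified interface for embedding models; the provider is inferred
# from the model name (illustrative model shown here).
response = embedding(
    model="text-embedding-3-small",
    input=["LiteLLM provides a unified interface."]
)
vector = response.data[0]["embedding"]  # OpenAI-style response shape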
Quick Start
Installation
pip install litellm
# Or with proxy server
pip install 'litellm[proxy]'
SDK Usage
from litellm import completion
# OpenAI
response = completion(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}]
)
# Anthropic - same interface
response = completion(
model="claude-sonnet-4-20250514",
messages=[{"role": "user", "content": "Hello!"}]
)
# Azure OpenAI
response = completion(
model="azure/gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
api_base="https://your-resource.openai.azure.com",
api_key="your-azure-key"
)
# AWS Bedrock
response = completion(
model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
messages=[{"role": "user", "content": "Hello!"}]
)
# Local Ollama
response = completion(
model="ollama/llama3.2",
messages=[{"role": "user", "content": "Hello!"}],
api_base="http://localhost:11434"
)
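Streaming uses the same interface across providers: pass stream=True and iterate over OpenAI-style chunks. A minimal sketch:

from litellm import completion

# Streaming: identical call shape, any provider
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about gateways."}],
    stream=True
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")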
Async Support
import asyncio
from litellm import acompletion
async def generate_responses():
tasks = [
acompletion(model="gpt-4o", messages=[{"role": "user", "content": f"Question {i}"}])
for i in range(10)
]
responses = await asyncio.gather(*tasks)
return responses
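Continuing the snippet above, the batch can then be driven from synchronous code:

# Run the ten requests concurrently and collect the results
responses = asyncio.run(generate_responses())
print(f"{len(responses)} responses received")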
Proxy Server Deployment
Configuration
# litellm_config.yaml
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-20250514
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: gpt-4o
litellm_params:
model: azure/gpt-4o-deployment
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
# Fallback configuration
router_settings:
routing_strategy: "latency-based"
retry_policy:
num_retries: 3
fallbacks: ["gpt-4o", "claude-sonnet"]
# Budget management
litellm_settings:
max_budget: 1000 # USD per month
budget_duration: monthly
Running the Proxy
# Start proxy server
litellm --config litellm_config.yaml --port 4000
# Or with Docker
docker run -p 4000:4000 \
-v $(pwd)/litellm_config.yaml:/app/config.yaml \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml
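Once the proxy is running, a quick liveness check confirms it is reachable. This assumes the proxy's default liveliness route; the authenticated /health route additionally checks the configured models:

import requests

# Unauthenticated liveliness probe (assumed default route)
r = requests.get("http://localhost:4000/health/liveliness")
print(r.status_code, r.text)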
Using the Proxy
from openai import OpenAI
# Point to LiteLLM proxy
client = OpenAI(
base_url="http://localhost:4000/v1",
api_key="your-litellm-key"
)
# Use any configured model
response = client.chat.completions.create(
model="gpt-4o", # Routed by LiteLLM
messages=[{"role": "user", "content": "Hello!"}]
)
Model Routing Strategies
Latency-Based Routing
router_settings:
routing_strategy: "latency-based"
# Route to lowest-latency deployment
Cost-Based Routing
router_settings:
routing_strategy: "cost-based"
# Route to cheapest available option
Load Balancing
model_list:
# Multiple deployments of same model
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
rpm: 1000 # Rate limit
model_info:
weight: 0.5
- model_name: gpt-4o
litellm_params:
model: azure/gpt-4o
rpm: 2000
model_info:
weight: 0.5
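The same load balancing and routing behavior is also available directly in the SDK through litellm's Router, without running the proxy. A minimal sketch; the exact routing_strategy string and the Azure parameters shown are assumptions to adapt to your deployment:

from litellm import Router

# Two deployments registered under one public model name;
# the Router load-balances and retries across them.
router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {"model": "openai/gpt-4o"},
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o-deployment",
                "api_base": "https://your-resource.openai.azure.com",
                "api_key": "your-azure-key",
            },
        },
    ],
    routing_strategy="latency-based-routing",  # assumption: strategy name may vary by version
    num_retries=3,
)

response = router.completion(
    model="gpt-4o",  # resolved to one of the deployments above
    messages=[{"role": "user", "content": "Hello!"}],
)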
Agent Gateway (A2A Protocol)
The Agent Gateway enables agent-to-agent communication with standardized protocols:
from litellm import acompletion, Agent, AgentGateway
# Define an agent
class ResearchAgent(Agent):
name = "research_agent"
description = "Researches topics and provides summaries"
async def process(self, query: str) -> str:
response = await acompletion(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a research assistant."},
{"role": "user", "content": query}
]
)
return response.choices[0].message.content
# Register with gateway
gateway = AgentGateway()
gateway.register(ResearchAgent())
# Other agents can discover and call it (from within an async context)
result = await gateway.call(
agent="research_agent",
query="What are the latest advances in quantum computing?"
)
Virtual Keys and Team Management
# Per-team budget and access
general_settings:
master_key: "sk-master-key"
team_settings:
- team_id: "engineering"
max_budget: 500
models: ["gpt-4o", "claude-sonnet"]
- team_id: "data-science"
max_budget: 1000
models: ["gpt-4o", "claude-sonnet", "embedding-*"]
# Generate team-specific keys
import requests
response = requests.post(
"http://localhost:4000/key/generate",
json={
"team_id": "engineering",
"key_alias": "eng-dev-key",
"max_budget": 100,
"duration": "30d"
},
headers={"Authorization": "Bearer sk-master-key"}
)
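The generated virtual key can then be used like any OpenAI API key against the proxy; the "key" field name below is an assumption about the /key/generate response body:

from openai import OpenAI

# Continues from the /key/generate call above
virtual_key = response.json()["key"]  # assumed field name in the response

client = OpenAI(base_url="http://localhost:4000/v1", api_key=virtual_key)
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from the engineering team!"}]
)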
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: litellm-proxy
spec:
replicas: 3
selector:
matchLabels:
app: litellm
template:
metadata:
labels:
app: litellm
spec:
containers:
- name: litellm
        image: ghcr.io/berriai/litellm:main-latest
        args: ["--config", "/app/config.yaml"]
ports:
- containerPort: 4000
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: llm-secrets
key: openai-key
volumeMounts:
- name: config
mountPath: /app/config.yaml
subPath: config.yaml
volumes:
- name: config
configMap:
name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
name: litellm
spec:
selector:
app: litellm
ports:
- port: 4000
targetPort: 4000
Performance Benchmarks (2025)
| Metric | Value |
|---|---|
| P50 Latency | 3ms |
| P95 Latency | 8ms |
| P99 Latency | 15ms |
| Throughput | 1,000+ RPS per instance |
| Memory | ~200MB base |