Production Observability
Helicone: High-Performance LLM Proxy
Helicone is a production-grade LLM observability platform built for scale. Its Rust-based proxy architecture adds only 8ms of P50 latency overhead while providing comprehensive logging, caching, rate limiting, and retries. Helicone is SOC 2 Type II and GDPR compliant.
Architecture
┌───────────────────────────────────────────────────────────┐
│                   Helicone Architecture                   │
├───────────────────────────────────────────────────────────┤
│                                                           │
│   Your Application                                        │
│           │                                               │
│           ▼                                               │
│   ┌───────────────────────────────────────────────────┐   │
│   │               Helicone Proxy (Rust)               │   │
│   │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐  │   │
│   │  │ Logging │ │ Caching │ │  Rate   │ │  Retry  │  │   │
│   │  │         │ │         │ │  Limit  │ │  Logic  │  │   │
│   │  └─────────┘ └─────────┘ └─────────┘ └─────────┘  │   │
│   │                                                   │   │
│   │  Performance: 8ms P50 | 15ms P95 | 99.99% uptime  │   │
│   └───────────────────────────────────────────────────┘   │
│           │                                               │
│           ▼                                               │
│   ┌───────────────────────────────────────────────────┐   │
│   │                   LLM Providers                   │   │
│   │  ┌──────┐ ┌─────────┐ ┌──────┐ ┌────────┐         │   │
│   │  │OpenAI│ │Anthropic│ │Azure │ │Together│  ...    │   │
│   │  └──────┘ └─────────┘ └──────┘ └────────┘         │   │
│   └───────────────────────────────────────────────────┘   │
│                                                           │
└───────────────────────────────────────────────────────────┘
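Because the proxy is a drop-in layer in front of the provider, any HTTP client can route through it. The sketch below shows the pattern with plain requests against the standard OpenAI chat completions path; the OPENAI_API_KEY and HELICONE_API_KEY environment variable names are illustrative, not names mandated by Helicone.

# Minimal sketch of the proxy pattern without an SDK; env var names are illustrative
import os
import requests

resp = requests.post(
    "https://oai.helicone.ai/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",    # provider key
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",  # Helicone key
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])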
Zero-Code Integration
The simplest integration requires only changing your API base URL and adding the Helicone auth header:
OpenAI Python SDK
from openai import OpenAI

# Just change the base URL and add header
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer sk-helicone-..."
    },
)

# All requests are now logged through Helicone
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
Anthropic SDK
from anthropic import Anthropic

client = Anthropic(
    base_url="https://anthropic.helicone.ai",
    default_headers={
        "Helicone-Auth": "Bearer sk-helicone-..."
    },
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
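If several services need the same setup, it can help to centralize client construction. The helper below is a hypothetical sketch: make_observed_client and the environment variable names are illustrative, not part of Helicone or the OpenAI SDK.

import os
from openai import OpenAI

def make_observed_client() -> OpenAI:
    """Build an OpenAI client whose requests all flow through Helicone (illustrative helper)."""
    return OpenAI(
        api_key=os.environ["OPENAI_API_KEY"],
        base_url="https://oai.helicone.ai/v1",
        default_headers={
            "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        },
    )

client = make_observed_client()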
Request Tagging and Metadata
Add rich metadata to requests for filtering and analysis:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_message}],
    extra_headers={
        # User and session tracking
        "Helicone-User-Id": user_id,
        "Helicone-Session-Id": session_id,
        # Custom properties for filtering
        "Helicone-Property-Environment": "production",
        "Helicone-Property-Feature": "customer_support",
        "Helicone-Property-Version": "v2.1.0",
        # Request naming for easier identification
        "Helicone-Request-Name": "support_ticket_response",
    },
)
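To keep tagging consistent across call sites, the metadata headers can be built in one place. The header names below come from the example above; the helicone_metadata helper itself is an illustrative sketch, not a Helicone API.

def helicone_metadata(user_id: str, session_id: str, feature: str) -> dict:
    """Standard Helicone tagging headers for this application (illustrative helper)."""
    return {
        "Helicone-User-Id": user_id,
        "Helicone-Session-Id": session_id,
        "Helicone-Property-Environment": "production",
        "Helicone-Property-Feature": feature,
        "Helicone-Property-Version": "v2.1.0",
    }

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_message}],
    extra_headers=helicone_metadata(user_id, session_id, "customer_support"),
)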
Response Caching
Helicone provides intelligent caching to reduce costs and latency:
# Enable caching for this request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        # Cache settings
        "Helicone-Cache-Enabled": "true",
        "Helicone-Cache-Bucket-Max-Size": "1000",  # Max cached responses
        "Helicone-Cache-Seed": "user-123",         # Cache key seed
    },
)

# Cache hits are reported via a response header ("Helicone-Cache-Hit": "true");
# see the raw-response sketch below for how to read it from the OpenAI SDK.
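The OpenAI Python SDK does not expose response headers on the parsed completion object, so inspecting the cache header takes the SDK's with_raw_response wrapper. A minimal sketch, assuming the cache-hit header name shown above:

# Read the cache header from the raw HTTP response (OpenAI SDK v1+)
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_headers={"Helicone-Cache-Enabled": "true"},
)
print(raw.headers.get("Helicone-Cache-Hit"))  # e.g. "true" on a cache hit
response = raw.parse()  # the usual ChatCompletion object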
Cache Configuration Options
| Header | Description | Values |
|---|---|---|
| Helicone-Cache-Enabled | Enable caching | true/false |
| Helicone-Cache-Bucket-Max-Size | Max cached entries | Integer |
| Helicone-Cache-Seed | Cache key seed | String |
Rate Limiting
Protect your application and manage costs:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": message}],
    extra_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        # Rate limit: 100 requests per hour, segmented per user
        "Helicone-RateLimit-Policy": "100;w=3600;s=user",
        "Helicone-User-Id": user_id,
    },
)
Rate Limit Policy Syntax
{quota};w={window_seconds};u={unit};s={segment}
The unit (u) is optional and defaults to "request" (use "cents" to limit spend); the segment (s) is optional and can be "user", a custom property name, or omitted for a global limit.
Examples:
- "100;w=3600;s=user"    # 100 requests per hour per user
- "1000;w=86400;s=org"   # 1,000 requests per day per "org" custom property
- "10;w=60"              # 10 requests per minute globally
Retry and Fallback
Automatic retry logic with exponential backoff:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": message}],
    extra_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        # Retry configuration
        "Helicone-Retry-Enabled": "true",
        "Helicone-Retry-Num": "3",     # Max retries
        "Helicone-Retry-Factor": "2",  # Exponential backoff factor
        # Optional: Fallback to a different model on failure
        "Helicone-Fallback": '[{"model": "gpt-4o-mini"}]',
    },
)
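The retry factor controls how the wait grows between attempts. A quick illustration of the schedule implied by a factor of 2; the 1-second base delay is an assumption for the example, not a documented default.

base_delay = 1.0   # assumed starting delay in seconds (illustrative)
factor = 2         # Helicone-Retry-Factor from the request above
retries = 3        # Helicone-Retry-Num from the request above
delays = [base_delay * factor ** attempt for attempt in range(retries)]
print(delays)  # [1.0, 2.0, 4.0] -> wait before retries 1, 2, and 3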
Prompt Management
Store and version prompts directly in Helicone:
# Use a managed prompt
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "{{user_input}}"}],
extra_headers={
"Helicone-Auth": "Bearer sk-helicone-...",
"Helicone-Prompt-Id": "customer-support-v2",
"Helicone-Prompt-Variables": '{"user_input": "How do I reset my password?"}',
}
)
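Helicone substitutes the {{user_input}} placeholder from Helicone-Prompt-Variables on its side. If you also want to render the same template locally (for tests or a non-proxied path), the sketch below shows the substitution; render_prompt is illustrative and not part of Helicone's tooling.

import re

def render_prompt(template: str, variables: dict) -> str:
    """Fill {{name}} placeholders from a variables dict, leaving unknown ones intact."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )

print(render_prompt("{{user_input}}", {"user_input": "How do I reset my password?"}))
# -> How do I reset my password?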
Dashboard Features
┌───────────────────────────────────────────────────────────┐
│                    Helicone Dashboard                     │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  Request Explorer                                         │
│  ├── Filter by user, session, property, model             │
│  ├── Full request/response inspection                     │
│  └── Latency and token breakdown                          │
│                                                           │
│  Analytics                                                │
│  ├── Cost tracking by model, user, feature                │
│  ├── Latency percentiles (P50, P95, P99)                  │
│  ├── Request volume over time                             │
│  └── Error rate monitoring                                │
│                                                           │
│  Alerts                                                   │
│  ├── Cost threshold alerts                                │
│  ├── Error rate spikes                                    │
│  └── Latency degradation                                  │
│                                                           │
└───────────────────────────────────────────────────────────┘
Self-Hosting Option
# Clone and deploy with Docker
git clone https://github.com/Helicone/helicone.git
cd helicone
docker-compose up -d
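A quick liveness check once the containers are up; the port is an assumption, so confirm the mapping in the repo's docker-compose.yml and adjust accordingly.

import requests

# Assumes the web UI is mapped to localhost:3000; adjust to your compose file
r = requests.get("http://localhost:3000", timeout=5)
print(r.status_code)  # 200 means the dashboard is reachable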
Key Differentiators
| Feature | Helicone Advantage |
|---|---|
| Latency | 8ms P50 (Rust-based proxy) |
| Scale | Handles billions of requests |
| Compliance | SOC 2 Type II, GDPR, HIPAA ready |
| Caching | Built-in response caching |
| Zero-code | Just change base URL |