Production Observability

Helicone: High-Performance LLM Proxy

Helicone is a production-grade LLM observability platform built for scale. Its Rust-based proxy architecture delivers 8ms P50 latency overhead while providing comprehensive logging, caching, and rate limiting. Helicone is SOC 2 Type II and GDPR compliant.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Helicone Architecture                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Your Application                                           │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Helicone Proxy (Rust)                   │   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐    │   │
│  │  │ Logging │ │ Caching │ │  Rate   │ │ Retry   │    │   │
│  │  │         │ │         │ │ Limit   │ │ Logic   │    │   │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘    │   │
│  │                                                      │   │
│  │  Performance: 8ms P50 | 15ms P95 | 99.99% uptime    │   │
│  └─────────────────────────────────────────────────────┘   │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              LLM Providers                           │   │
│  │  ┌──────┐ ┌─────────┐ ┌──────┐ ┌────────┐          │   │
│  │  │OpenAI│ │Anthropic│ │Azure │ │Together│ ...      │   │
│  │  └──────┘ └─────────┘ └──────┘ └────────┘          │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Zero-Code Integration

The simplest integration requires only changing your API base URL and adding a Helicone auth header:

OpenAI Python SDK

from openai import OpenAI

# Just change the base URL and add the Helicone auth header
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer sk-helicone-..."
    }
)

# All requests are now logged through Helicone
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Anthropic SDK

from anthropic import Anthropic

client = Anthropic(
    base_url="https://anthropic.helicone.ai",
    default_headers={
        "Helicone-Auth": "Bearer sk-helicone-..."
    }
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)

Request Tagging and Metadata

Add rich metadata to requests for filtering and analysis:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_message}],
    extra_headers={
        # User and session tracking
        "Helicone-User-Id": user_id,
        "Helicone-Session-Id": session_id,

        # Custom properties for filtering
        "Helicone-Property-Environment": "production",
        "Helicone-Property-Feature": "customer_support",
        "Helicone-Property-Version": "v2.1.0",

        # Request naming for easier identification
        "Helicone-Request-Name": "support_ticket_response",
    }
)
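
If several call sites tag requests the same way, it helps to centralize header construction. The helper below is a hypothetical convenience function (not part of any Helicone SDK); it only assembles the header names shown above.

# Hypothetical helper: builds the Helicone metadata headers used above
def helicone_headers(user_id, session_id=None, request_name=None, **properties):
    headers = {"Helicone-User-Id": user_id}
    if session_id:
        headers["Helicone-Session-Id"] = session_id
    if request_name:
        headers["Helicone-Request-Name"] = request_name
    for key, value in properties.items():
        # e.g. environment="production" -> Helicone-Property-Environment
        headers[f"Helicone-Property-{key.replace('_', '-').title()}"] = str(value)
    return headers

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_message}],
    extra_headers=helicone_headers(
        user_id,
        session_id=session_id,
        request_name="support_ticket_response",
        environment="production",
        feature="customer_support",
    ),
)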

Response Caching

Helicone can cache responses at the proxy layer, so repeated identical requests are served from the cache instead of the provider, cutting both cost and latency:

# Enable caching for this request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        # Cache settings
        "Helicone-Cache-Enabled": "true",
        "Helicone-Cache-Bucket-Max-Size": "1000",  # Max cached responses
        "Helicone-Cache-Seed": "user-123",  # Cache key seed
    }
)

# The proxy reports cache status in the Helicone-Cache-Hit response header;
# see the sketch below for reading it from the raw HTTP response
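
The parsed completion object does not expose HTTP headers, so checking the cache header takes the OpenAI SDK's raw-response accessor. A minimal sketch, assuming the proxy reports cache status in a Helicone-Cache-Hit response header as noted above:

# Read the proxy's response headers via the SDK's raw-response accessor
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_headers={
        "Helicone-Cache-Enabled": "true",
        "Helicone-Cache-Seed": "user-123",
    },
)

cache_hit = raw.headers.get("Helicone-Cache-Hit") == "true"
completion = raw.parse()  # the usual ChatCompletion object
print(f"cache hit: {cache_hit}")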

Cache Configuration Options

Header                           Description          Values
Helicone-Cache-Enabled           Enable caching       true / false
Helicone-Cache-Bucket-Max-Size   Max cached entries   Integer
Helicone-Cache-Seed              Cache key seed       String

Rate Limiting

Protect your application and manage costs:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": message}],
    extra_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        # Rate limit by user
        "Helicone-RateLimit-Policy": "100;w=3600;u=user",  # 100 req/hour per user
        "Helicone-User-Id": user_id,
    }
)

Rate Limit Policy Syntax

{quota};w={window_seconds}[;u={unit}][;s={segment}]

- quota: number of requests (or cents, when u=cents) allowed in the window
- w: window length in seconds
- u: unit being limited, "request" (default) or "cents"
- s: segment the limit applies to, e.g. "user" or a custom property name

Examples:
- "100;w=3600;s=user"     # 100 requests per hour per user
- "1000;w=86400"          # 1,000 requests per day across all traffic
- "500;w=3600;u=cents"    # $5.00 of spend per hour across all traffic

Retry and Fallback

Automatic retry logic with exponential backoff:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": message}],
    extra_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        # Retry configuration
        "Helicone-Retry-Enabled": "true",
        "Helicone-Retry-Num": "3",  # Max retries
        "Helicone-Retry-Factor": "2",  # Exponential backoff factor

        # Optional: Fallback to different model on failure
        "Helicone-Fallback": '[{"model": "gpt-4o-mini"}]',
    }
)
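
If the proxy exhausts its retries, the failure still reaches your application, so an application-level fallback is a useful complement. A minimal sketch using the OpenAI SDK's error types; the fallback model choice is illustrative:

import openai

def complete_with_fallback(message):
    try:
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": message}],
            extra_headers={
                "Helicone-Retry-Enabled": "true",
                "Helicone-Retry-Num": "3",
            },
        )
    except openai.APIStatusError:
        # Retries exhausted upstream; fall back to a cheaper model in app code
        return client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": message}],
        )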

Prompt Management

Store and version prompts directly in Helicone:

# Use a managed prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "{{user_input}}"}],
    extra_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        "Helicone-Prompt-Id": "customer-support-v2",
        "Helicone-Prompt-Variables": '{"user_input": "How do I reset my password?"}',
    }
)
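
Because the variables header carries a JSON string, it is usually built with json.dumps rather than written by hand:

import json

# user_question would come from your application
variables = {"user_input": user_question}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "{{user_input}}"}],
    extra_headers={
        "Helicone-Prompt-Id": "customer-support-v2",
        "Helicone-Prompt-Variables": json.dumps(variables),
    },
)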

Dashboard Features

┌─────────────────────────────────────────────────────────────┐
│                  Helicone Dashboard                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Request Explorer                                           │
│  ├── Filter by user, session, property, model               │
│  ├── Full request/response inspection                       │
│  └── Latency and token breakdown                            │
│                                                             │
│  Analytics                                                  │
│  ├── Cost tracking by model, user, feature                  │
│  ├── Latency percentiles (P50, P95, P99)                    │
│  ├── Request volume over time                               │
│  └── Error rate monitoring                                  │
│                                                             │
│  Alerts                                                     │
│  ├── Cost threshold alerts                                  │
│  ├── Error rate spikes                                      │
│  └── Latency degradation                                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Self-Hosting Option

# Clone and deploy with Docker
git clone https://github.com/Helicone/helicone.git
cd helicone
docker-compose up -d
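
Once the stack is running, point the SDK at your own deployment instead of the hosted proxy. The URL below is a placeholder; substitute whatever host and port your compose setup exposes for the proxy service:

from openai import OpenAI

# Placeholder URL for a self-hosted Helicone proxy; adjust to your deployment
client = OpenAI(
    base_url="http://localhost:8787/v1",
    default_headers={"Helicone-Auth": "Bearer sk-helicone-..."},
)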

Key Differentiators

Feature      Helicone Advantage
Latency      8ms P50 (Rust-based proxy)
Scale        Handles billions of requests
Compliance   SOC 2 Type II, GDPR, HIPAA ready
Caching      Built-in response caching
Zero-code    Just change the base URL
