LiteLLM Proxy Production Tutorial: LLM Gateway in 2026
May 19, 2026
TL;DR. This tutorial deploys a production-grade LiteLLM Proxy with Docker Compose, Postgres, virtual keys, per-team budgets, and automatic fallback routing across Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro. Every code block is runnable as written. We pin the container image to v1.85.0 and explain exactly why — the March 2026 supply chain incident is the kind of footnote that turns a tutorial from "I copied a blog" into "this is what we run on Monday."12 Total time: ~40 minutes.
LiteLLM Proxy is an MIT-licensed self-hosted LLM gateway that exposes 100+ provider APIs (OpenAI, Anthropic, Google, AWS Bedrock, Azure, vLLM, Ollama, and more) behind one OpenAI-compatible endpoint, with virtual keys, per-team budgets, cost tracking, automatic fallbacks, and an admin UI built in.3 You run it on your own infrastructure: prompts go only to the model providers you route to, and virtual-key holders never see your provider credentials.
What you'll learn
- How to deploy LiteLLM Proxy with Docker Compose and Postgres, pinned to a signed, immutable image tag.
- How to write a config.yaml that exposes Claude Sonnet 4.6, GPT-5.4, GPT-5.4 mini, and Gemini 2.5 Pro through one OpenAI-shaped endpoint.
- How to create virtual keys with budgets, rate limits, and model allowlists via the /key/generate API.
- How to configure fallback routing so requests survive a provider outage or a context-window overrun.
- How to read cost tracking out of the /global/spend/report endpoint and the /ui admin dashboard.
- How LITELLM_MASTER_KEY and LITELLM_SALT_KEY differ and which one you can never rotate.
- How to verify the container's signature and avoid the rolling tags that recently shipped a backdoor.
Prerequisites
- Docker Desktop 4.30+ or a Linux host with Docker Engine 25+ and the Compose v2 plugin.
- openssl (preinstalled on macOS and most Linux distros).
- curl and jq, which most of the commands in this tutorial use.
- An Anthropic API key, an OpenAI API key, and a Google AI Studio API key. You don't strictly need all three to follow along: calls to providers whose keys are missing will return 401s, but the proxy starts up either way and any model whose key is present will work. (The fallbacks step has the most to show when you have at least Anthropic plus one of the other two.)
- ~2 GB of free RAM on the host (Postgres + LiteLLM are light, but headroom matters when the proxy is logging).
- 40 minutes.
Step 1 — Create the project directory and generate keys
mkdir -p ~/llm-gateway && cd ~/llm-gateway
Generate the two cryptographic keys LiteLLM needs. The master key authorizes admin endpoints and starts with the literal prefix sk-. The salt key encrypts the provider API keys you store in Postgres.4
echo "LITELLM_MASTER_KEY=sk-$(openssl rand -hex 24)" > .env
echo "LITELLM_SALT_KEY=$(openssl rand -base64 32)" >> .env
Verify the file:
cat .env
# LITELLM_MASTER_KEY=sk-9f6c... (48 hex chars after the sk- prefix)
# LITELLM_SALT_KEY=<base64-encoded 32 random bytes>
Warning — read this once. The LITELLM_SALT_KEY is the key used to encrypt every provider credential the proxy writes to Postgres. If you change it after adding even one model, every encrypted row becomes garbage and the proxy will fail to start with Error decrypting value.5 Pick it once, back it up to your password manager, and never rotate it for the lifetime of this database. The master key, by contrast, can be rotated and there is a documented rotation flow.
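Because the salt key cannot be changed later, it can be worth checking the shape of both values before the first boot. The sketch below is a hypothetical helper (the names check_master_key and check_salt_key are ours, not part of LiteLLM). It only confirms that the strings look like the output of the two openssl commands above, and it assumes a POSIX shell with grep -E available.

```shell
# Hypothetical helpers, not part of LiteLLM: sanity-check the generated keys.

# Master key: "sk-" followed by 48 lowercase hex characters (openssl rand -hex 24).
check_master_key() {
  printf '%s' "$1" | grep -Eq '^sk-[0-9a-f]{48}$'
}

# Salt key: 32 random bytes encode to 44 base64 characters, the last one "=".
check_salt_key() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9+/]{43}=$'
}

# Usage (from the project directory, after generating .env):
#   . ./.env
#   check_master_key "$LITELLM_MASTER_KEY" && check_salt_key "$LITELLM_SALT_KEY" \
#     && echo "keys look well-formed"
```

A passing check says nothing about whether the keys are backed up; that part is still on you.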
Append the three provider API keys (replace placeholders with real values; if you only have one provider's key, leave the other two empty — calls routed to those models will return an authentication error, but the proxy still starts and serves the providers you do have):
cat <<'EOF' >> .env
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-proj-...
GEMINI_API_KEY=AIza...
EOF
Step 2 — Write config.yaml with multi-provider routing
LiteLLM Proxy reads a YAML config that declares which models the gateway exposes, how they map to upstream providers, and how routing/fallbacks/limits behave. Create config.yaml in the same directory:
# config.yaml (LiteLLM Proxy v1.85.0)
model_list:
  # ---- Anthropic ----
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  # ---- OpenAI ----
  - model_name: gpt-5.4
    litellm_params:
      model: openai/gpt-5.4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-5.4-mini
    litellm_params:
      model: openai/gpt-5.4-mini
      api_key: os.environ/OPENAI_API_KEY
  # ---- Google Gemini (AI Studio, NOT Vertex) ----
  - model_name: gemini-2.5-pro
    litellm_params:
      # The "gemini/" prefix routes to AI Studio with GEMINI_API_KEY.
      # A bare "gemini-2.5-pro" would be auto-detected as Vertex AI
      # and the proxy would expect GCP application-default credentials.
      model: gemini/gemini-2.5-pro
      api_key: os.environ/GEMINI_API_KEY
general_settings:
  # The proxy also reads LITELLM_MASTER_KEY and DATABASE_URL straight from
  # the environment, so we don't repeat them here. Setting them in BOTH
  # config.yaml and .env risks divergence if one drifts.
  store_model_in_db: true
router_settings:
  routing_strategy: simple-shuffle  # the default; explicit for clarity
  allowed_fails: 3                  # cooldown threshold per 1-min window
  cooldown_time: 60                 # seconds a failed deployment is parked
  # On the FIRST failure of gpt-5.4, the router walks this list and tries
  # claude-sonnet-4-6, then gemini-2.5-pro. allowed_fails above controls
  # how soon gpt-5.4 gets parked, NOT when fallbacks fire.
  fallbacks:
    - gpt-5.4: ["claude-sonnet-4-6", "gemini-2.5-pro"]
  # If a request blows the gpt-5.4-mini window, escalate to full gpt-5.4.
  context_window_fallbacks:
    - gpt-5.4-mini: ["gpt-5.4"]
  # If Claude refuses on policy, route to gpt-5.4 (different policy stack).
  content_policy_fallbacks:
    - claude-sonnet-4-6: ["gpt-5.4"]
Two non-obvious details worth knowing before you ship:
- anthropic/claude-sonnet-4-6 has no date suffix. Starting with the Claude 4.6 generation, Anthropic moved off model-name-YYYYMMDD IDs; the dateless string is itself a pinned snapshot, not an evergreen alias.6 If you're migrating a config from Claude 3.5 Sonnet, this change will surprise you.
- The gemini/ prefix is mandatory for the AI Studio path. Without it, LiteLLM routes the request to Vertex AI and the proxy will demand GCP service-account credentials it doesn't have.7 This is the most common 401 in first-time Gemini configs.
Step 3 — Write the Compose file with a pinned, signed image
Production Docker tags are not a place to be casual. We pin to ghcr.io/berriai/litellm-database:v1.85.0 — starting with v1.84.0, LiteLLM dropped the older -stable suffix and stable releases now use plain SemVer 2.0 tags like v1.85.0 that are immutable and signed.89 Avoid :latest, :main-latest, and :main-stable in production: those rolling tags can change between deployments, and in March 2026 the project itself shipped two compromised PyPI releases via a tainted Trivy dependency in CI.2
Create docker-compose.yml:
# docker-compose.yml: Compose v2 (no `version:` key at top)
services:
  db:
    image: postgres:18-alpine
    restart: always
    environment:
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: changeme_in_prod
      POSTGRES_DB: litellm
    volumes:
      # Postgres 18+ images keep data under /var/lib/postgresql/<major>/docker,
      # so mount the parent directory rather than the pre-18 .../data path.
      - pgdata:/var/lib/postgresql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm -d litellm"]
      interval: 5s
      retries: 10
  litellm:
    # The "litellm-database" variant ships pre-generated Prisma binaries
    # cached in /root/.cache, so the schema migration runs on first boot
    # even when /app/.cache is volume-mounted. The bare "litellm" image
    # works for in-memory deployments; for any DB-backed setup use this one.
    image: ghcr.io/berriai/litellm-database:v1.85.0
    restart: always
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    command: ["--config", "/app/config.yaml", "--port", "4000", "--num_workers", "4"]
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      LITELLM_SALT_KEY: ${LITELLM_SALT_KEY}
      DATABASE_URL: "postgresql://litellm:changeme_in_prod@db:5432/litellm"
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:-}
      OPENAI_API_KEY: ${OPENAI_API_KEY:-}
      GEMINI_API_KEY: ${GEMINI_API_KEY:-}
volumes:
  pgdata:
A few production tweaks the official quick-start glosses over:
- The Postgres health check + condition: service_healthy prevents the proxy from racing the DB on first boot. Without it, Prisma migrations sometimes fail with a connection refused on cold starts.
- --num_workers 4 runs four Uvicorn workers behind the FastAPI app; bump this up on multi-core hosts.
- ${VAR:-} (the :- fallback) substitutes an empty string when a provider key is missing from .env, so Compose starts the stack without emitting a "variable is not set" warning for each absent key.
Optional but recommended: verify the image signature before you start. LiteLLM signs every image published to GHCR with cosign, starting from v1.83.0 — the same release that closed out the March supply chain incident. The public key is checked into the repository at cosign.pub and is referenced from a pinned commit hash so the verification key itself cannot be substituted by a future repo compromise:8
# One-time install of cosign
brew install cosign # macOS
# or (Go 1.20+): go install github.com/sigstore/cosign/v3/cmd/cosign@latest
# Verify against the pinned-commit public key (recommended form per LiteLLM docs).
# The same key signs all four image variants (litellm, litellm-database,
# litellm-non_root, litellm-spend_logs).
cosign verify \
--key https://raw.githubusercontent.com/BerriAI/litellm/0112e53046018d726492c814b3644b7d376029d0/cosign.pub \
ghcr.io/berriai/litellm-database:v1.85.0
A clean run prints:
The following checks were performed on each of these signatures:
- The cosign claims were validated
- The signatures were verified against the specified public key
If the verify step fails, do not start the container — pull a different tag and re-verify before going further. The supply chain incident's malicious wheel passed casual eyeballing; cosign would have caught it.
Step 4 — Start the stack and verify the boot
docker compose up -d
docker compose logs -f litellm
Wait for the line Application startup complete. (typically ~5 seconds after Postgres is ready). The proxy is now listening on http://localhost:4000.
Quick health check:
curl -s http://localhost:4000/health/liveliness
# "I'm alive!"
Note the endpoint spelling: LiteLLM uses /health/liveliness (double "l"), not the Kubernetes-standard /health/liveness. Its sibling is /health/readiness, which adds a database round-trip and reports db: "connected" when Postgres is reachable.10
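Rather than watching the logs or sleeping for a fixed interval, a script can poll the readiness endpoint until it answers. The following is a minimal sketch of such a loop. wait_until is a hypothetical helper name, and the retry count and delay in the usage comment are arbitrary choices rather than recommended values.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  max_tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_tries" ]; do
    # Success on any attempt ends the loop immediately.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  # Every attempt failed.
  return 1
}

# Example: poll the proxy for up to ~60 seconds (curl -f fails on HTTP errors).
#   wait_until 30 2 curl -fs http://localhost:4000/health/readiness \
#     && echo "proxy is ready"
```

The same helper can stand in for any fixed sleep after a docker compose restart later in the tutorial.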
Inspect Postgres to confirm the proxy created its schema — Prisma will have run migrations on first boot:
docker compose exec db psql -U litellm -d litellm -c "\dt" | head -20
You should see a couple dozen tables prefixed with LiteLLM_ — LiteLLM_VerificationToken (virtual key storage), LiteLLM_SpendLogs (per-call spend ledger), LiteLLM_TeamTable, LiteLLM_UserTable, LiteLLM_BudgetTable, plus audit-log and config tables. The schema is defined in litellm/proxy/schema.prisma.
Step 5 — Make your first call with the master key
The master key authenticates against any endpoint — admin routes like /key/generate and chat routes like /chat/completions — but you should never ship it to applications. We're just smoke-testing the routing layer here.
source .env
curl -s http://localhost:4000/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-6",
"messages": [{"role": "user", "content": "Say hi in 5 words."}]
}' | jq '.choices[0].message.content, .usage'
Expected output (model wording will vary):
"Hi there, friend, today!"
{
"prompt_tokens": 14,
"completion_tokens": 9,
"total_tokens": 23
}
Try the same call against gpt-5.4 and gemini-2.5-pro — the response shape is identical because the proxy normalizes everything to the OpenAI Chat Completions format. That uniformity is the entire reason a gateway exists.
Step 6 — Create a virtual key with a budget and rate limit
Virtual keys are the keys your applications and your teammates actually use. Each one is independently revocable, has its own budget that resets on a schedule, has its own RPM/TPM limits, and can restrict which models the caller is allowed to hit.11 They live in the LiteLLM_VerificationToken table in Postgres, where the proxy stores a hash of each key rather than the plaintext, so copy the key when it is generated.
Generate a key for a hypothetical "growth-team" with a $50/month budget, 60 requests/minute cap, and access to only the two cheaper models:
curl -s http://localhost:4000/key/generate \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"models": ["gpt-5.4-mini", "gemini-2.5-pro"],
"max_budget": 50,
"budget_duration": "30d",
"rpm_limit": 60,
"tpm_limit": 200000,
"metadata": {"team": "growth", "owner": "alice@example.com"}
}' | jq
Expected response (per the official /key/generate spec):11
{
"key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxx"
}
The response always contains the new key field. The budget, model allowlist, and rate limits you sent in the request are persisted server-side; query them back with GET /key/info?key=<key> whenever you need to inspect what's stored. Recent versions may echo a few extra fields directly in /key/generate for convenience, but the official spec only guarantees key.
Now call the proxy with the new key — and try to hit a model the key is not allowed to use:
GROWTH_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxx # paste from previous response
# Allowed — uses gpt-5.4-mini
curl -s http://localhost:4000/chat/completions \
-H "Authorization: Bearer $GROWTH_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-5.4-mini",
"messages": [{"role": "user", "content": "ping"}]
}' | jq '.choices[0].message.content'
# Forbidden — gpt-5.4 is not in the key's models list
curl -s http://localhost:4000/chat/completions \
-H "Authorization: Bearer $GROWTH_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-5.4",
"messages": [{"role": "user", "content": "ping"}]
}' | jq '.error.message'
# "key not allowed to access model. This key can only access models=['gpt-5.4-mini', 'gemini-2.5-pro']..."
Two production notes:
- Team budgets sit alongside key budgets, and supersede user budgets. When a virtual key has a team_id set, the key's own max_budget is still enforced, and the team's shared budget is enforced on top of it. The user-level budget of the key's user_id is bypassed in this configuration. This is the documented behavior and the easy way to give an entire team one shared wallet without per-engineer budgets fighting it.12
- Budgets reset on budget_duration. Accepted units include seconds ("30s"), minutes ("30m"), hours ("30h"), and days ("30d"). If you omit budget_duration, the budget is one-shot and never resets.12
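To make the budget_duration units concrete, here is a small sketch that converts a duration string into seconds. It is an illustration only: duration_to_seconds is a hypothetical function, it is not LiteLLM's own parser, and it handles just the four suffixes listed above.

```shell
# Hypothetical helper: convert "30s", "30m", "30h", or "30d" to seconds.
duration_to_seconds() {
  value=${1%?}         # everything except the final character
  unit=${1#"$value"}   # the final character (the unit suffix)
  case $unit in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    d) echo $((value * 86400)) ;;
    *) echo "unsupported unit: $unit" >&2; return 1 ;;
  esac
}

duration_to_seconds 30d   # prints 2592000
duration_to_seconds 30m   # prints 1800
```

Note that "30m" means thirty minutes here, which is an easy mistake to make when you intended a monthly reset.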
Step 7 — Watch the fallbacks work
Fallbacks fire on the first failure of a model-group deployment — allowed_fails: 3 doesn't gate the fallback; it gates the cooldown (how many failures a deployment can accumulate in a one-minute window before the router temporarily removes it from rotation for future requests).13 To prove the fallback end-to-end without a real provider outage, point one model at a deliberately broken upstream (a fake API key) and watch the fallback take over on the first call.
Temporarily edit config.yaml — change the gpt-5.4 block to use a guaranteed-bad key:
  - model_name: gpt-5.4
    litellm_params:
      model: openai/gpt-5.4
      api_key: "sk-proj-INVALID-this-will-401"
Restart and call gpt-5.4:
docker compose restart litellm
sleep 3
curl -s http://localhost:4000/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-5.4",
"messages": [{"role": "user", "content": "Reply with the literal word: ok"}]
}' | jq '.choices[0].message.content, .model'
Expected output: a successful response, with .model set to claude-sonnet-4-6 rather than gpt-5.4. The router saw the bad OpenAI deployment return a 401, immediately walked the fallbacks array, and answered from Claude.13 Set the OpenAI key back when you're done:
# Restore the env-var reference
sed -i.bak 's|"sk-proj-INVALID-this-will-401"|os.environ/OPENAI_API_KEY|' config.yaml
docker compose restart litellm
The context_window_fallbacks rule works the same way: when a request exceeds the active model's context window, the proxy escalates to the next model in the list instead of returning a 400 to the caller. This single config line is the difference between "our agent crashes on long PDFs" and "our agent transparently switches to a bigger model and we eat the bill."
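The first-failure behavior in this step can be pictured as a simple ordered walk. The sketch below is a toy simulation, not LiteLLM's router code: try_model is a stand-in for one upstream call (hard-coded so that gpt-5.4 always fails, as it does with the invalid key), and route_with_fallbacks is a hypothetical name for the walk over the primary model followed by its fallbacks list.

```shell
# Stand-in for a single upstream call: gpt-5.4 "fails", everything else "succeeds".
try_model() {
  [ "$1" != "gpt-5.4" ]
}

# Try the requested model first, then each fallback in order.
# Prints the first model that answers; returns 1 if all of them fail.
route_with_fallbacks() {
  for model in "$@"; do
    if try_model "$model"; then
      echo "$model"
      return 0
    fi
  done
  return 1
}

# Mirrors the config: gpt-5.4 falls back to claude-sonnet-4-6, then gemini-2.5-pro.
route_with_fallbacks gpt-5.4 claude-sonnet-4-6 gemini-2.5-pro   # prints claude-sonnet-4-6
```

The simulation deliberately leaves out the cooldown counter: allowed_fails and cooldown_time decide whether a deployment is skipped on later requests, while the walk above is what happens within a single request.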
Step 8 — Read the spend report
Every successful completion writes a row to LiteLLM_SpendLogs with the token counts, the model, the key ID, and the dollar cost computed against the proxy's pricing table.14 Cost tracking covers 100+ providers automatically — the proxy ships with a model-pricing JSON and updates it on every release.
Pull a spend report for the last 30 days. The endpoint takes start_date and end_date in plain YYYY-MM-DD form — no time component — and returns an array grouped by day, with a nested teams array per day:14
# GNU date (Linux) or BSD date (macOS) — both supported via the || fallback
START=$(date -u -d '30 days ago' +%Y-%m-%d 2>/dev/null || date -u -v-30d +%Y-%m-%d)
END=$(date -u +%Y-%m-%d)
curl -s "http://localhost:4000/global/spend/report?start_date=$START&end_date=$END" \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
| jq '.[] | {day: .group_by_day, teams: (.teams | map({team_name, total_spend}))}'
For per-key spend you can also call /key/info directly:
curl -s "http://localhost:4000/key/info?key=$GROWTH_KEY" \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.info.spend, .info.models'
If you'd rather click than curl, open http://localhost:4000/ui in a browser and log in with the master key (or the UI_USERNAME / UI_PASSWORD env vars if you set them). The dashboard is a single-page web app served by the proxy itself — you can add models, generate keys, and inspect token usage entirely from the UI.15
Step 9 — Reference pricing the gateway uses
These are the per-million-token rates LiteLLM applies when it stamps a cost on each spend-log row, verified against the official provider pricing pages on 2026-05-19. They will drift; treat them as a snapshot.
| Model | Input ($/M) | Output ($/M) | Notes |
|---|---|---|---|
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M-token context window, no long-context surcharge16 |
| GPT-5.4 (≤272K input) | $2.50 | $15.00 | Standard tier on inputs up to 272K tokens17 |
| GPT-5.4 (>272K input) | $5.00 | $22.50 | 2× input / 1.5× output surcharge above 272K, up to ~1M17 |
| GPT-5.4 mini | $0.75 | $4.50 | Cheaper sibling for high-volume work17 |
| Gemini 2.5 Pro (≤200K input) | $1.25 | $10.00 | Tier 1 pricing18 |
| Gemini 2.5 Pro (>200K input) | $2.50 | $15.00 | Long-context surcharge18 |
For a back-of-envelope estimate: a single call to GPT-5.4 with 4k input tokens and 1k output tokens costs roughly $0.025 (4k × $2.50/M + 1k × $15.00/M = $0.010 + $0.015). Doing the same 100 times a day for 30 days is about $75/month per workload, which would exhaust a max_budget: 50 cap about two-thirds of the way through a 30-day window. That is exactly the kind of overrun a virtual-key budget exists to stop.
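The arithmetic above is easy to script when you want to size a budget for a different model or call volume. This is a minimal sketch using awk for the floating-point math; cost_usd and monthly_usd are hypothetical helper names, and the rates passed in are the snapshot values from the table, not live prices.

```shell
# cost_usd INPUT_TOKENS OUTPUT_TOKENS INPUT_RATE OUTPUT_RATE
# Rates are in dollars per million tokens; prints the cost of one call.
cost_usd() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_rate="$3" -v out_rate="$4" \
    'BEGIN { printf "%.4f\n", (in_tok * in_rate + out_tok * out_rate) / 1000000 }'
}

# monthly_usd PER_CALL_USD CALLS_PER_DAY DAYS
monthly_usd() {
  awk -v per_call="$1" -v calls="$2" -v days="$3" \
    'BEGIN { printf "%.2f\n", per_call * calls * days }'
}

cost_usd 4000 1000 2.50 15.00    # GPT-5.4: prints 0.0250
monthly_usd 0.0250 100 30        # prints 75.00
cost_usd 4000 1000 0.75 4.50     # GPT-5.4 mini: prints 0.0075
```

At the GPT-5.4 mini rates the same workload comes to about $22.50 a month, which fits comfortably inside a $50 budget.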
Verification
Run the four checks below; all should pass before you treat the proxy as healthy.
# 1. Liveness — endpoint is /health/liveliness (double "l"), returns the
# literal string "I'm alive!"
test "$(curl -s http://localhost:4000/health/liveliness)" = '"I'\''m alive!"'
# 2. Readiness — adds a DB round-trip; db field becomes "connected" when ready
curl -s http://localhost:4000/health/readiness | jq '.db, .status'
# Expect: "connected" "connected"
# 3. Postgres schema present
docker compose exec -T db psql -U litellm -d litellm -tAc \
"SELECT count(*) FROM information_schema.tables WHERE table_name LIKE 'LiteLLM_%';"
# Expect: >= 10
# 4. A chat completion round-trips
curl -s http://localhost:4000/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gemini-2.5-pro","messages":[{"role":"user","content":"reply with: ok"}]}' \
| jq -r '.choices[0].message.content'
# Expect: a non-empty reply
Troubleshooting
Error decrypting value on startup, immediately after a config change. You changed LITELLM_SALT_KEY after rows were already encrypted with the previous one. If you still have the previous salt key (for example in your password manager), restore it and restart. If it is gone, there is no recovery short of dropping the encrypted rows and re-adding the models; on a dev box, docker compose down -v wipes the volume so you can start clean.5
api_key client option must be set for Anthropic, even though .env has the key. The proxy reads provider keys via the os.environ/<NAME> indirection in config.yaml; if you wrote the literal key into the YAML you bypassed the salt-key encryption and any subsequent restart re-reads from env and finds nothing. Move the key to .env, change api_key: sk-ant-... to api_key: os.environ/ANTHROPIC_API_KEY, and docker compose restart litellm.
Gemini calls 401 with "Could not find application default credentials." You omitted the gemini/ prefix on the model field in litellm_params. Without it, LiteLLM auto-detects the bare gemini-2.5-pro identifier as Vertex AI and looks for GCP service-account creds it doesn't have.7 Add the prefix and restart.
Fallbacks don't trigger; the call returns the original error. Fallbacks fire on the first failure of the targeted deployment — allowed_fails only controls cooldown for future requests, not the first-call fallback. If the original error is passing through, the most common causes are: (a) the failure is happening at LiteLLM's auth/rate-limit layer (key over budget, RPM exceeded) rather than the upstream provider — those errors aren't subject to model fallbacks; (b) the fallback target model is also failing; or (c) fallbacks isn't matching the model name in the request (verify with docker compose logs litellm and look for the routing decision line).
docker compose up complains the :v1.85.0 tag isn't found. Check two things: (1) the registry path is ghcr.io/berriai/litellm-database, not docker.io/berriai/litellm-database; (2) you didn't append the legacy -stable suffix — starting with v1.84.0, stable images are plain :v1.85.0, and :v1.85.0-stable does not exist.89
The proxy starts but the admin UI returns a blank page at /ui. Either you set DISABLE_ADMIN_UI=True somewhere, or your reverse proxy is stripping the JWT cookie that /ui uses for auth.15 Hit /ui directly without your reverse proxy first; if it works, the issue is upstream.
Next steps
- Vector retrieval next to your gateway. Pair LiteLLM with a vector index for retrieval-augmented prompts — see our pgvector HNSW production tuning tutorial for the recall/latency knobs that matter.
- Tool-using agents through the proxy. Once you're routing chat completions, the next layer is structured tool calls; our TypeScript MCP server with OAuth and streamable HTTP shows the agent-side wiring that pairs nicely with this gateway.
- Scale the Postgres backing. When your spend-log table grows past a million rows or the proxy's connection count is hammering the DB, switch to a pooler — the Supabase Supavisor / PgBouncer pooling tutorial covers transaction pooling sizing.
- Tail the proxy from a log aggregator. Workers and Lambda-shaped runtimes are the most common ingress to a gateway like this — our Cloudflare Workers observability with Sentry tutorial gives you the log-and-trace shape to wire up downstream.
External references for production operators:
- LiteLLM Best Practices for Production — Redis cross-instance state, num_workers tuning, DB deadlock notes.
- LiteLLM Docker Image Security Guide — cosign verification, tag selection.
- Security Update: Suspected Supply Chain Incident — March 2026 timeline and post-incident hardening.
Footnotes
1. BerriAI/litellm releases. v1.85.0 published 2026-05-17. https://github.com/BerriAI/litellm/releases
2. "Security Update: Suspected Supply Chain Incident," LiteLLM blog, March 2026. Describes the compromise of v1.82.7 and v1.82.8 on PyPI for ~40 minutes on 2026-03-24, the Trivy-CI-token exfiltration vector, and the v1.83.0 hardened pipeline. https://docs.litellm.ai/blog/security-update-march-2026
3. BerriAI/litellm README: "Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging." https://github.com/BerriAI/litellm
4. LiteLLM Self-Hosted Security & Encryption FAQ. Explains how LITELLM_SALT_KEY encrypts credentials in the DB and the irreversibility of changing it. https://docs.litellm.ai/docs/proxy/security_encryption_faq
5. Same source as footnote 4. The FAQ explicitly states the salt key cannot be changed after models have been added because it is used to encrypt/decrypt the stored API key credentials.
6. Anthropic, "Models overview." Confirms claude-sonnet-4-6 is a dateless pinned ID for the Sonnet 4.6 generation. https://platform.claude.com/docs/en/about-claude/models/overview
7. LiteLLM, "Gemini - Google AI Studio." Provider documentation showing that model identifiers must be prefixed gemini/ to route to the AI Studio API rather than Vertex AI. https://docs.litellm.ai/docs/providers/gemini
8. LiteLLM Docker Image Security Guide. Covers cosign verification, the recommendation to pin to immutable version tags such as v1.85.0 (or digest-pin via @sha256:…), and the warning against rolling tags like latest / main-stable in production. Older releases still carry the legacy -stable suffix (e.g. v1.83.0-stable); from v1.84.0 onward the suffix is dropped. https://docs.litellm.ai/docs/proxy/docker_image_security
9. LiteLLM blog, "LiteLLM release versioning is changing: standard names, MINOR for weekly, PATCH for hotfixes." Explains that starting with v1.84.0 the -stable suffix is dropped and stable releases use plain SemVer 2.0 tags (v1.84.0, v1.85.0, …). MINOR bumps weekly; PATCH is reserved for hotfixes. https://docs.litellm.ai/blog/cleaner-release-versions
10. LiteLLM, "Health Checks." Endpoint names /health, /health/readiness, /health/liveliness (double "l"), /health/services. The liveliness response is the literal string "I'm alive!"; readiness returns JSON with db and status fields. https://docs.litellm.ai/docs/proxy/health
11. LiteLLM, "Virtual Keys." /key/generate payload schema including models, max_budget, budget_duration, rpm_limit, tpm_limit, and metadata. https://docs.litellm.ai/docs/proxy/virtual_keys
12. LiteLLM, "Budgets, Rate Limits." Describes team-budget supersession and the budget_duration units. https://docs.litellm.ai/docs/proxy/users
13. LiteLLM, "Fallbacks." By default the router walks the fallbacks list on the first failure of a model-group deployment. allowed_fails is a separate, parallel mechanism: it controls how many failures a deployment can accumulate in a rolling one-minute window before that deployment is parked (cooled down) and skipped for subsequent requests. The two mechanisms coexist; fallbacks do not wait for allowed_fails. https://docs.litellm.ai/docs/proxy/reliability
14. LiteLLM, "Spend Tracking." /global/spend/report accepts start_date / end_date (plain YYYY-MM-DD) and returns an array of objects with group_by_day plus a nested teams array (team_name, total_spend, metadata). https://docs.litellm.ai/docs/proxy/cost_tracking
15. LiteLLM, "Quick Start (UI)." Admin dashboard at /ui, JWT cookie auth, disable with DISABLE_ADMIN_UI=True. https://docs.litellm.ai/docs/proxy/ui
16. Anthropic API pricing page, Claude Sonnet 4.6 listed at $3.00 / $15.00 per million tokens. https://platform.claude.com/docs/en/about-claude/pricing
17. OpenAI API pricing, GPT-5.4 family rates. https://developers.openai.com/api/docs/pricing
18. Google Gemini Developer API pricing. 2.5 Pro tiered at $1.25 / $10.00 ≤200k context and $2.50 / $15.00 above. https://ai.google.dev/gemini-api/docs/pricing