vLLM & Open-Source Inference Engines

Alternative Inference Engines: TGI, OpenLLM & SGLang

3 min read

While vLLM is the default choice for most production deployments, other engines excel in specific scenarios. Understanding the landscape helps you choose the right tool.

Text Generation Inference (TGI)

Hugging Face's production inference server:

# TGI Docker deployment
docker run --gpus all \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.3-70B-Instruct \
    --num-shard 4

Strengths:

  • Native Hugging Face Hub integration
  • Built-in quantization (GPTQ, AWQ, GGUF)
  • Excellent for Hugging Face ecosystem users
  • Grammar-constrained generation

Use cases:

# TGI client usage
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Standard generation
response = client.text_generation(
    "Explain machine learning:",
    max_new_tokens=500,
    temperature=0.7,
)

# Grammar-constrained generation (the grammar value is a JSON Schema)
response = client.text_generation(
    "Generate a user profile:",
    max_new_tokens=200,
    grammar={
        "type": "json",
        "value": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
        },
    },
)
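
The InferenceClient is a thin wrapper over TGI's REST API. As a minimal sketch (assuming the Docker deployment above is listening on localhost:8080), you can call the /generate route directly:

# Raw REST call to TGI's /generate endpoint (sketch)
import requests

payload = {
    "inputs": "Explain machine learning:",
    "parameters": {"max_new_tokens": 200, "temperature": 0.7},
}

# /generate returns the completed text; /generate_stream streams SSE events
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])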

SGLang

High-performance engine with RadixAttention for prefix caching:

# SGLang installation
pip install "sglang[all]"

# Server launch
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.3-70B-Instruct \
    --tp 4 \
    --port 30000

Key innovation - RadixAttention:

┌─────────────────────────────────────────────────┐
│          RadixAttention (SGLang)                │
├─────────────────────────────────────────────────┤
│                                                 │
│  Radix Tree for prefix caching:                │
│                                                 │
│         "System: You are"                      │
│              /          \                       │
│    "helpful"             "expert"              │
│       /    \                 \                  │
│  "assistant" "bot"        "programmer"         │
│                                                 │
│  Automatic prefix matching & sharing           │
│  No manual prefix management needed            │
│  2-5x speedup for repeated prefixes           │
│                                                 │
└─────────────────────────────────────────────────┘
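
To make the prefix-reuse idea concrete, here is a deliberately simplified sketch of the matching step over token sequences. It is a toy illustration only, not SGLang's actual radix tree, which operates on KV-cache blocks inside the engine:

# Toy sketch of prefix reuse (illustration only, not SGLang internals)
from typing import Dict, List, Tuple

class PrefixCache:
    """Maps previously seen token prefixes to cached KV handles."""

    def __init__(self) -> None:
        # A real radix tree stores these prefixes compactly and matches them
        # in a single tree walk; a dict of every prefix just shows the idea.
        self._cached: Dict[Tuple[int, ...], str] = {}

    def insert(self, tokens: List[int]) -> None:
        for length in range(1, len(tokens) + 1):
            self._cached[tuple(tokens[:length])] = f"kv-block-{length}"

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV state."""
        for length in range(len(tokens), 0, -1):
            if tuple(tokens[:length]) in self._cached:
                return length
        return 0

cache = PrefixCache()
system_prefix = [101, 7592, 2017, 2024]       # "System: You are" (fake token ids)
request_a = system_prefix + [14044, 6749]     # "... helpful assistant"
request_b = system_prefix + [6739, 28530]     # "... expert programmer"

cache.insert(request_a)
reused = cache.longest_cached_prefix(request_b)
print(f"Request B recomputes only {len(request_b) - reused} of {len(request_b)} tokens")

SGLang performs this matching automatically for every incoming request, which is where the speedup for shared system prompts comes from.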

Frontend DSL:

import sglang as sgl

# Connect the frontend DSL to the server launched above
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn_chat(s, user_messages):
    s += sgl.system("You are a helpful assistant.")

    for msg in user_messages:
        s += sgl.user(msg)
        s += sgl.assistant(sgl.gen("response", max_tokens=256))

    return s

# Automatic KV cache reuse across calls
result = multi_turn_chat.run(
    user_messages=["Hello!", "What's the weather?"]
)
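
The same decorated function can also be dispatched as a batch. A sketch assuming SGLang's run_batch helper: conversations that share the system prompt reuse its cached prefix automatically.

# Batch execution (sketch): the shared system prompt is computed once
states = multi_turn_chat.run_batch(
    [
        {"user_messages": ["Hello!"]},
        {"user_messages": ["Summarize RadixAttention in one line."]},
        {"user_messages": ["Translate 'cache' into French."]},
    ]
)

for state in states:
    # Each state holds the text captured by sgl.gen("response", ...)
    print(state["response"])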

OpenLLM

BentoML's LLM serving framework:

# OpenLLM installation
pip install openllm

# Serve model
openllm serve meta-llama/Llama-3.3-70B-Instruct \
    --backend vllm \
    --quantize int4

Strengths:

  • Unified API across backends (vLLM, TGI, GGML)
  • Easy model versioning and deployment
  • BentoML ecosystem integration
  • Multi-model serving

Python API:

# OpenLLM Python API
import openllm

llm = openllm.LLM("meta-llama/Llama-3.3-70B-Instruct")

# Sync generation
response = llm.generate("Hello, how are you?")

# Async streaming
async for chunk in llm.generate_stream("Write a story:"):
    print(chunk, end="")

Engine Comparison

| Feature              | vLLM               | TGI             | SGLang        | OpenLLM       |
|----------------------|--------------------|-----------------|---------------|---------------|
| PagedAttention       | ✅ Native          | ✅ Yes          | ✅ RadixAttn  | Via backend   |
| Continuous batching  | ✅ Yes             | ✅ Yes          | ✅ Yes        | Via backend   |
| Speculative decoding | ✅ Yes             | ✅ Yes          | ✅ Yes        | Limited       |
| OpenAI API           | ✅ Yes             | ✅ Yes          | ✅ Yes        | ✅ Yes        |
| Multi-modal          | ✅ Good            | ✅ Good         | ✅ Basic      | ✅ Basic      |
| Quantization         | FP8, AWQ, GPTQ     | GPTQ, AWQ, GGUF | FP8, AWQ      | All           |
| Prefix caching       | ✅ Manual          | ❌ Limited      | ✅ Automatic  | Via backend   |
| Grammar constraints  | ❌ No              | ✅ Yes          | ✅ Yes        | ✅ Yes        |
| Best for             | General production | HF ecosystem    | Prefix-heavy  | Multi-backend |
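
Since every engine in the table speaks an OpenAI-style API, client code is largely portable; only the base URL changes. A minimal sketch, with ports taken from the launch commands in this lesson (OpenLLM's default port depends on your version, so it is left as a comment):

# Same client code, different engine: only base_url changes (sketch)
from openai import OpenAI

ENDPOINTS = {
    "vllm": "http://localhost:8000/v1",
    "tgi": "http://localhost:8080/v1",
    "sglang": "http://localhost:30000/v1",
    # "openllm": "http://localhost:3000/v1",  # check your version's default port
}

def ask(engine: str, prompt: str) -> str:
    client = OpenAI(base_url=ENDPOINTS[engine], api_key="not-needed")
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",  # TGI also accepts "tgi" here
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

print(ask("sglang", "Give a one-sentence summary of PagedAttention."))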

Decision Framework

┌─────────────────────────────────────────────────────────────┐
│              WHICH ENGINE TO CHOOSE?                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  START HERE                                                 │
│      │                                                      │
│      ▼                                                      │
│  Need maximum throughput?                                   │
│      │                                                      │
│      ├── YES ──► vLLM (default choice)                     │
│      │                                                      │
│      └── NO ──► Continue...                                │
│                   │                                         │
│                   ▼                                         │
│           Lots of repeated prefixes?                        │
│                   │                                         │
│                   ├── YES ──► SGLang (RadixAttention)      │
│                   │                                         │
│                   └── NO ──► Continue...                   │
│                                │                            │
│                                ▼                            │
│                    Need JSON/grammar output?                │
│                                │                            │
│                                ├── YES ──► TGI             │
│                                │                            │
│                                └── NO ──► Continue...      │
│                                             │               │
│                                             ▼               │
│                                  HuggingFace ecosystem?     │
│                                             │               │
│                                             ├── YES ──► TGI│
│                                             │               │
│                                             └── NO ──► vLLM│
│                                                             │
└─────────────────────────────────────────────────────────────┘

Hybrid Deployments

Sometimes multiple engines work together:

# Kubernetes: Route by use case
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-router
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      # High-throughput batch processing
      - path: /batch(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: vllm-service
            port:
              number: 8000

      # Structured output generation
      - path: /structured(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: tgi-service
            port:
              number: 8080

      # Chat with prefix caching
      - path: /chat(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: sglang-service
            port:
              number: 30000
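
On the client side, the rewrite-target annotation strips the routing prefix, so each path forwards to the engine's normal API. A sketch of how an application might pick a route per workload (the ingress hostname is a placeholder; service names and paths follow the manifest above):

# Pick a route per workload through the ingress (sketch)
import requests

INGRESS = "http://llm.example.internal"  # placeholder hostname

def chat(prompt: str) -> str:
    # /chat/v1/chat/completions is rewritten to sglang-service /v1/chat/completions
    resp = requests.post(
        f"{INGRESS}/chat/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-3.3-70B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]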

Most organizations start with vLLM and add specialized engines as needs emerge.

Next, we'll explore prefix caching and advanced optimization techniques.
