Compress Your Prompts: Smarter AI, Lower Costs

November 19, 2025

TL;DR

  • Shorter, focused prompts reduce token usage and lower API costs significantly.
  • Concise prompts often yield better accuracy by reducing the "lost in the middle" effect.
  • LLMLingua (Microsoft Research) achieves up to 20x compression with minimal quality loss.
  • GIST tokens enable 26x compression through learned embeddings (requires model fine-tuning).
  • PCToolkit provides a unified framework for comparing compression methods.
  • In production, expect 50–80% cost savings; research setups can achieve 90%+ in ideal cases.

What You'll Learn

  1. Why prompt compression matters — the economic and accuracy benefits.
  2. How to use LLMLingua — with working code examples.
  3. The differences between LLMLingua variants — LLMLingua, LongLLMLingua, and LLMLingua-2.
  4. When GIST tokens and PCToolkit are appropriate — and their limitations.
  5. How to test and monitor compression in production systems.

Prerequisites

You'll get the most from this article if you:

  • Have basic familiarity with LLM APIs (OpenAI, Anthropic, or similar).
  • Understand what tokens are and how they affect pricing.
  • Know how to write and structure prompts for generative AI.

Introduction: Why Prompt Compression Matters

Every token you send to an LLM costs money. Whether you're building a chatbot, RAG system, or autonomous agent, your bill scales with token count.

But cost isn't the only factor. Research shows that long prompts can actually reduce accuracy. The "Lost in the Middle" paper [1] demonstrates that LLMs struggle with information positioned in the middle of long contexts: performance follows a U-shaped curve, with best results when relevant information appears at the beginning or end.

Prompt compression addresses both problems: reducing costs while potentially improving output quality.


The Economics of Tokens

Most LLM APIs charge per token for both input and output. Current pricing as of November 2025:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Opus 4.5 | $5.00 | $25.00 |
| Claude Sonnet 4.5 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |

Pricing can change — always confirm on the official OpenAI and Anthropic pricing pages before budgeting.

Cost Savings Example

| Scenario | Input Tokens | Monthly Volume | Monthly Input Cost (Haiku 4.5) | With 50% Compression |
|---|---|---|---|---|
| Chatbot | 2,000/request | 1M requests | $2,000 | $1,000 |
| RAG System | 5,000/request | 500K requests | $2,500 | $1,250 |
| Code Analysis | 10,000/request | 100K requests | $1,000 | $500 |

At scale, compression directly impacts profitability — especially on input-heavy workloads.
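The arithmetic behind these rows is easy to script. Here's a minimal sketch, using the illustrative Haiku 4.5 input price from the table above (confirm current prices before budgeting):

# Estimate monthly input cost, with and without compression.
PRICE_PER_MTOK_INPUT = 1.00  # USD per 1M input tokens (illustrative Haiku 4.5 price)

def monthly_input_cost(tokens_per_request: int,
                       requests_per_month: int,
                       compression_rate: float = 1.0) -> float:
    """Monthly input cost in USD; compression_rate is the fraction
    of tokens kept (1.0 = no compression)."""
    total_tokens = tokens_per_request * requests_per_month * compression_rate
    return total_tokens / 1_000_000 * PRICE_PER_MTOK_INPUT

# Chatbot scenario from the table: 2,000 tokens/request, 1M requests/month
print(monthly_input_cost(2_000, 1_000_000))       # 2000.0
print(monthly_input_cost(2_000, 1_000_000, 0.5))  # 1000.0 (50% compression)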


The Accuracy Paradox: Why Shorter Can Be Better

It's intuitive to think more context equals better results. But for LLMs, verbosity introduces problems:

The "Lost in the Middle" Effect

Research by Liu et al. [1] found that LLM performance degrades significantly when relevant information appears in the middle of long contexts. Performance is highest when key information is at the beginning or end of the prompt.

Why This Happens

  1. Attention dilution: Transformer attention spreads across all tokens, reducing focus on critical information.
  2. Noise accumulation: Redundant or irrelevant content can confuse the model.
  3. Position bias: Models trained on certain patterns may weight positions differently.

Compression Benefits

The LongLLMLingua paper [2] reports improvements on specific benchmarks:

  • 21.4% accuracy improvement on NaturalQuestions (multi-document QA at position 10)
  • Significant cost reduction on long-context benchmarks

These gains are task-specific, particularly for RAG and long-context scenarios. Results vary by use case — always test on your specific workload.


LLMLingua: The Leading Compression Tool

LLMLingua [3] is an open-source compression framework from Microsoft Research, published at EMNLP 2023 and ACL 2024. It uses perplexity-based token filtering to remove redundant information while preserving semantic meaning.

Installation

pip install llmlingua

Basic Usage

from llmlingua import PromptCompressor

# Initialize the compressor
llm_lingua = PromptCompressor()

original_prompt = """
You are an expert data analyst. Please analyze the following dataset 
and provide insights about trends, anomalies, and correlations. 
Be concise but detailed in your analysis. Make sure to explain any 
patterns you observe and provide actionable recommendations based on 
the data. Consider both short-term and long-term implications.
The dataset contains quarterly sales figures from 2020 to 2024.
"""

# Compress the prompt
compressed_result = llm_lingua.compress_prompt(
    original_prompt,
    target_token=50,  # Target compressed length
)

print(f"Original length: {len(original_prompt.split())} words")
print(f"Compressed: {compressed_result['compressed_prompt']}")
print(f"Compression ratio: {compressed_result['ratio']:.2f}")

How LLMLingua Works

LLMLingua uses a small language model (like LLaMA-7B or GPT-2) to calculate perplexity for each token:

  1. High perplexity tokens = surprising/informative → keep these
  2. Low perplexity tokens = predictable/redundant → safe to remove

The algorithm preserves semantic meaning by keeping tokens that carry the most information.
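To see the principle in action, here is a minimal sketch of surprisal-based token scoring with GPT-2 via Hugging Face transformers. This illustrates the idea only; LLMLingua's actual implementation adds a budget controller and iterative, segment-level compression on top:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisals(text: str) -> list[tuple[str, float]]:
    """Score each token by its negative log-likelihood under GPT-2.
    Higher surprisal = more informative = worth keeping."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given the tokens before it
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    nll = -log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist())
    return list(zip(tokens, nll.tolist()))

# Low-surprisal tokens (articles, filler phrases) are the first
# candidates for removal; high-surprisal tokens are kept.
for tok, score in token_surprisals("Please kindly analyze the quarterly sales data."):
    print(f"{tok!r}: {score:.2f}")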

Specifying a Compression Model

from llmlingua import PromptCompressor

# Use a specific model for perplexity calculation
llm_lingua = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",
    device_map="cuda"  # Use GPU if available
)

# For faster compression with smaller model
llm_lingua_fast = PromptCompressor(
    model_name="gpt2",
    device_map="cpu"
)

Preserving Important Tokens

Force certain tokens to be kept regardless of perplexity:

compressed = llm_lingua.compress_prompt(
    original_prompt,
    target_token=100,
    # Always keep these tokens; note that force_tokens is primarily
    # documented for the LLMLingua-2 path in current releases
    force_tokens=['\n', '?', ':', 'API', 'error', 'function']
)

LLMLingua Variants: Which One to Use?

Microsoft Research has released three variants, each optimized for different use cases:

| Variant | Best For | Key Feature | Speed |
|---|---|---|---|
| LLMLingua | General compression | Coarse-to-fine perplexity filtering | Baseline |
| LongLLMLingua | RAG / long contexts | Question-aware compression + reordering | Similar |
| LLMLingua-2 | Production / speed | BERT encoder + GPT-4 distillation | 3–6x faster |

LLMLingua (Original)

Published at EMNLP 2023. Best for general-purpose compression.

  • Up to 20x compression with minimal performance loss on benchmarks like GSM8K
  • Uses iterative token pruning based on perplexity
  • Works with any downstream LLM (black-box compatible)

LongLLMLingua

Published at ACL 2024. Optimized for RAG and long-context scenarios.

  • Question-aware compression: Prioritizes tokens relevant to the query
  • Document reordering: Moves important content to beginning/end (addresses "lost in the middle")
  • Best results on multi-document QA tasks

A question-aware call looks like this (long_document stands in for your retrieved context):

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

# LongLLMLingua-style compression with question awareness
compressed = llm_lingua.compress_prompt(
    context=long_document,  # may also be a list of retrieved documents
    question="What were the Q3 revenue figures?",
    target_token=500,
    rank_method="longllmlingua",  # question-aware token ranking
    reorder_context="sort"  # Reorder context by relevance
)

LLMLingua-2

Published at ACL 2024 Findings. Optimized for speed and production use.

  • Uses BERT-level encoder instead of LLaMA (much smaller)
  • Trained via GPT-4 data distillation
  • 3–6x faster than original LLMLingua
  • 1.6–2.9x lower end-to-end latency
  • Achieves 2–5x compression (more conservative than the original)

The configuration below follows the model card referenced in the LLMLingua repository:

from llmlingua import PromptCompressor

# LLMLingua-2 configuration
llm_lingua_2 = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True
)

compressed = llm_lingua_2.compress_prompt(
    original_prompt,
    target_token=100
)

GIST Tokens: Extreme Compression via Learned Embeddings

GIST tokens [4] take a fundamentally different approach. Instead of removing tokens, gisting learns compressed embeddings that represent entire prompts.

Key Characteristics

  • Published at NeurIPS 2023 (Stanford/UC Berkeley)
  • Achieves up to 26x compression with 40% FLOPs reduction
  • Requires white-box model access (not compatible with API-only services)
  • Needs fine-tuning infrastructure

How GIST Works

  1. Train a model to encode long prompts into a small number of "gist" tokens (e.g., 1–10 tokens)
  2. These gist tokens serve as compressed context for subsequent queries
  3. The model learns to reconstruct semantic meaning from compressed representations

Conceptual Example

Without GIST:

[500-token system prompt] + [user query] → LLM → response

With GIST:

[10 gist tokens representing system prompt] + [user query] → LLM → response

Limitations

  • Not API-compatible: Requires access to model internals
  • Training required: Must fine-tune for your specific use case
  • Model-specific: Gist tokens trained for LLaMA won't work with GPT

Repository

Pre-trained models available for LLaMA-7B and FLAN-T5-XXL: https://github.com/jayelm/gisting


PCToolkit: Unified Compression Framework

PCToolkit [5] provides a standardized interface for comparing multiple compression methods side-by-side.

Installation

# Clone the repository
git clone https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.git
cd Toolkit-for-Prompt-Compression

# Install dependencies
pip install -r requirements.txt

You'll also need to download models — most are available from Hugging Face Hub, but SCRL models require manual download (see the /models folder in the repository for instructions).

Included Methods

PCToolkit integrates five compression approaches:

  1. Selective Context — Rule-based filtering
  2. LLMLingua — Perplexity-based compression
  3. LongLLMLingua — Question-aware compression
  4. SCRL — Reinforcement learning approach
  5. Keep it Simple — Minimal compression baseline

Usage Example

from pctoolkit.compressors import PromptCompressor

# 'SCCompressor' selects Selective Context; the other methods listed
# above have corresponding type names (see the repository README)
compressor = PromptCompressor(type='SCCompressor', device='cuda')

test_prompt = "Your long prompt here..."
ratio = 0.5

result = compressor.compressgo(test_prompt, ratio)
print(result)
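To benchmark several methods on the same prompt, you can loop over compressor types. The type names below are taken from the repository README; treat them as assumptions and verify them (and any extra constructor arguments, such as model paths for SCRL) against the current code:

from pctoolkit.compressors import PromptCompressor

# Type names as listed in the PCToolkit README (assumed; check the repo)
COMPRESSOR_TYPES = [
    'SCCompressor',            # Selective Context
    'LLMLinguaCompressor',     # LLMLingua
    'LongLLMLinguaCompressor',
    'SCRLCompressor',          # may require a local model path
    'KiSCompressor',           # Keep it Simple
]

test_prompt = "Your long prompt here..."

for comp_type in COMPRESSOR_TYPES:
    compressor = PromptCompressor(type=comp_type, device='cuda')
    result = compressor.compressgo(test_prompt, 0.5)
    print(f"{comp_type}: {result}")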

When to Use PCToolkit

  • Research: Benchmarking compression methods
  • Evaluation: Finding the best method for your use case
  • A/B testing: Comparing approaches in production experiments

Choosing the Right Tool

| Use Case | Recommended Tool | Why |
|---|---|---|
| General compression | LLMLingua | Well-tested, easy to use |
| RAG systems | LongLLMLingua | Question-aware, handles long docs |
| Production (speed critical) | LLMLingua-2 | 3–6x faster |
| Maximum compression | GIST tokens | 26x compression (if you can fine-tune) |
| Research/comparison | PCToolkit | Unified benchmarking |
| API-only access | LLMLingua/LLMLingua-2 | No model internals needed |

Realistic Compression Expectations

Based on published research, here's what to expect:

| Method | Typical Compression | Best Case | Quality Impact |
|---|---|---|---|
| LLMLingua | 4–10x | 20x | Minimal on most tasks |
| LLMLingua-2 | 2–5x | 5x | Minimal, faster |
| LongLLMLingua | 4x (~75% reduction) | Similar | Can improve RAG accuracy |
| GIST tokens | 10–26x | 26x | Requires fine-tuning |

In production, 50–80% cost savings (2–5x compression) are realistic with LLMLingua-2. The 20x compression (95% savings) represents best-case scenarios on specific benchmarks — always validate on your workload.


Production Integration

Building a Compression Pipeline

from llmlingua import PromptCompressor
from anthropic import Anthropic
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize compressor and LLM client
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True
)
client = Anthropic()

def query_with_compression(
    prompt: str,
    target_ratio: float = 0.5,
    model: str = "claude-haiku-4-5-20251001"
) -> dict:
    """Query LLM with compressed prompt, returning response and metrics."""
    
    # Compress the prompt
    compressed_result = compressor.compress_prompt(
        prompt,
        rate=target_ratio
    )
    
    compressed_prompt = compressed_result['compressed_prompt']
    # The 'ratio' field is a display string (e.g. "2.0x"); derive the
    # fraction of tokens kept from the reported token counts instead
    compression_ratio = (
        compressed_result['compressed_tokens'] / compressed_result['origin_tokens']
    )
    
    logger.info(f"Tokens kept after compression: {compression_ratio:.2%}")
    
    # Query the LLM
    message = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": compressed_prompt}]
    )
    
    return {
        "response": message.content[0].text,
        "original_length": len(prompt.split()),
        "compressed_length": len(compressed_prompt.split()),
        "compression_ratio": compression_ratio,
        "input_tokens": message.usage.input_tokens,
        "output_tokens": message.usage.output_tokens
    }

Compressing Conversation History

For chat applications, compress older messages while keeping recent ones intact:

def compress_conversation(
    messages: list[dict],
    keep_recent: int = 2,
    target_ratio: float = 0.5
) -> list[dict]:
    """Compress older conversation history while preserving recent messages."""
    
    if len(messages) <= keep_recent:
        return messages
    
    # Split into old (compress) and recent (keep)
    old_messages = messages[:-keep_recent]
    recent_messages = messages[-keep_recent:]
    
    # Combine old messages into text
    history_text = "\n".join([
        f"{m['role'].upper()}: {m['content']}" 
        for m in old_messages
    ])
    
    # Compress
    compressed = compressor.compress_prompt(
        history_text,
        rate=target_ratio
    )
    
    # Return as summary + recent messages
    return [
        {"role": "system", "content": f"Previous conversation summary:\n{compressed['compressed_prompt']}"},
        *recent_messages
    ]
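For example, a six-message history with keep_recent=2 collapses into one summary message plus the two most recent turns:

messages = [
    {"role": "user", "content": "Hi, I need help with my billing dashboard."},
    {"role": "assistant", "content": "Sure, what seems to be the problem?"},
    {"role": "user", "content": "The Q3 revenue chart is empty."},
    {"role": "assistant", "content": "That chart reads from the sales API."},
    {"role": "user", "content": "How do I re-authenticate the sales API?"},
    {"role": "assistant", "content": "Open Settings, then Integrations."},
]

compact = compress_conversation(messages, keep_recent=2, target_ratio=0.5)
# compact[0] is a system message summarizing the first four turns;
# compact[1:] are the last two messages, untouched.
print(len(compact))  # 3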

Testing and Validation

Semantic Similarity Testing

Verify compressed prompts maintain meaning:

import pytest
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Assumes `compressor` from the production pipeline section is in scope
model = SentenceTransformer('all-MiniLM-L6-v2')

def test_semantic_preservation():
    """Verify compression preserves semantic meaning."""
    original = """
    Analyze the quarterly sales data and identify trends.
    Focus on year-over-year growth and seasonal patterns.
    Provide actionable recommendations for Q4.
    """
    
    compressed_result = compressor.compress_prompt(original, rate=0.5)
    compressed = compressed_result['compressed_prompt']
    
    # Compute embeddings
    orig_embedding = model.encode([original])
    comp_embedding = model.encode([compressed])
    
    # Calculate similarity
    similarity = cosine_similarity(orig_embedding, comp_embedding)[0][0]
    
    assert similarity > 0.80, f"Semantic similarity too low: {similarity:.2f}"

def test_compression_ratio():
    """Verify target compression ratio is achieved."""
    original = "A " * 200  # 200 tokens
    
    compressed_result = compressor.compress_prompt(original, rate=0.5)
    
    actual_ratio = len(compressed_result['compressed_prompt'].split()) / 200
    
    # Allow 20% tolerance
    assert 0.4 <= actual_ratio <= 0.6

Integration Testing

def test_end_to_end_pipeline():
    """Test full compression + LLM query pipeline."""
    test_prompt = """
    You are a helpful assistant. Please summarize the following:
    The quick brown fox jumps over the lazy dog. This sentence
    contains every letter of the alphabet and is commonly used
    for typing practice and font demonstrations.
    """
    
    result = query_with_compression(test_prompt, target_ratio=0.6)
    
    assert "response" in result
    assert len(result["response"]) > 0
    assert result["compression_ratio"] < 0.8

Monitoring in Production

Track these metrics:

from dataclasses import dataclass
from datetime import datetime
import json

@dataclass
class CompressionMetrics:
    timestamp: str
    original_tokens: int
    compressed_tokens: int
    compression_ratio: float
    semantic_similarity: float
    llm_input_tokens: int
    llm_output_tokens: int
    model: str

def log_metrics(metrics: CompressionMetrics):
    """Log compression metrics for monitoring."""
    print(json.dumps({
        "timestamp": metrics.timestamp,
        "original_tokens": metrics.original_tokens,
        "compressed_tokens": metrics.compressed_tokens,
        "compression_ratio": f"{metrics.compression_ratio:.2%}",
        "semantic_similarity": f"{metrics.semantic_similarity:.3f}",
        "model": metrics.model
    }))
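Wiring this into the pipeline above might look like the following sketch. The semantic_similarity value is a placeholder for the embedding-based check from the testing section, and the word counts stand in for exact token counts:

from datetime import datetime, timezone

result = query_with_compression("...your long prompt...", target_ratio=0.5)

log_metrics(CompressionMetrics(
    timestamp=datetime.now(timezone.utc).isoformat(),
    original_tokens=result["original_length"],      # word count as a token proxy
    compressed_tokens=result["compressed_length"],  # word count as a token proxy
    compression_ratio=result["compression_ratio"],
    semantic_similarity=0.92,  # placeholder: plug in your embedding check
    llm_input_tokens=result["input_tokens"],
    llm_output_tokens=result["output_tokens"],
    model="claude-haiku-4-5-20251001",
))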

Key Metrics to Track

| Metric | Target | Alert Threshold |
|---|---|---|
| Compression ratio | 40–60% kept | >80% (too little compression) |
| Semantic similarity | >0.85 | <0.75 (meaning loss) |
| Latency overhead | <100ms | >500ms |
| Task accuracy | Baseline ±5% | >10% degradation |
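A small guard can turn these thresholds into alerts. The cutoffs below mirror the table; tune them for your workload:

def check_thresholds(m: CompressionMetrics) -> list[str]:
    """Return alert messages for metrics outside the thresholds above."""
    alerts = []
    if m.compression_ratio > 0.80:
        alerts.append(f"Too little compression: {m.compression_ratio:.0%} of tokens kept")
    if m.semantic_similarity < 0.75:
        alerts.append(f"Possible meaning loss: similarity {m.semantic_similarity:.2f}")
    return alerts

# Example: for msg in check_thresholds(metrics): logger.warning(msg)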

Security Considerations

Preserve Safety Instructions

Never compress system prompts or safety guardrails:

def safe_compress(system_prompt: str, user_content: str) -> str:
    """Compress user content while preserving system instructions."""
    
    # Only compress user content
    compressed_user = compressor.compress_prompt(
        user_content,
        rate=0.5
    )['compressed_prompt']
    
    # Combine with preserved system prompt
    return f"{system_prompt}\n\nUser query: {compressed_user}"

Validate Compressed Output

Check that compression doesn't expose or remove sensitive patterns:

def validate_compression(original: str, compressed: str) -> bool:
    """Validate compressed output for security concerns."""
    
    sensitive_patterns = ['api_key', 'password', 'secret', 'token']
    
    for pattern in sensitive_patterns:
        if pattern in compressed.lower() and pattern not in original.lower():
            return False
    
    return True

When NOT to Use Compression

| Scenario | Reason |
|---|---|
| Short prompts (<100 tokens) | Overhead exceeds benefit |
| Legal/medical/compliance text | Risk of losing critical details |
| Code with exact syntax requirements | May break syntax |
| Creative writing | May lose stylistic nuance |
| One-off queries | Setup overhead not justified |
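The first row is easy to enforce in code. A cheap guard, using a word count as a rough token proxy (the 100-token cutoff is the heuristic from the table):

def maybe_compress(prompt: str, min_words: int = 100, rate: float = 0.5) -> str:
    """Skip compression when the prompt is too short for it to pay off."""
    if len(prompt.split()) < min_words:
        return prompt
    return compressor.compress_prompt(prompt, rate=rate)['compressed_prompt']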

Troubleshooting Guide

| Issue | Cause | Solution |
|---|---|---|
| ModuleNotFoundError: llmlingua | Not installed | pip install llmlingua |
| CUDA out of memory | Model too large | Use smaller model or CPU |
| Compressed output incoherent | Over-compression | Increase target ratio to 0.5–0.7 |
| Key information missing | Important tokens removed | Add to force_tokens |
| Slow compression | Using large model | Switch to LLMLingua-2 |

Key Takeaways

  • Prompt compression saves 50–80% on input costs in production, with up to 95% possible in research scenarios.
  • LLMLingua is the most accessible tool — install via pip install llmlingua and use the PromptCompressor class.
  • LLMLingua-2 is best for production (3–6x faster, 2–5x compression).
  • LongLLMLingua is best for RAG systems (question-aware compression).
  • GIST tokens offer maximum compression but require fine-tuning and white-box model access.
  • Always test semantic preservation with similarity scores >0.80.
  • Never compress system prompts or safety instructions.
  • Results are task-specific — always benchmark on your workload.

FAQ

Q1: How much can I realistically save?
In production, 50–80% input cost savings (2–5x compression) are typical with LLMLingua-2. The 20x compression (95% savings) is achievable on specific benchmarks but not universal.

Q2: Will compression reduce output quality?
It depends on the task. For RAG and long-context scenarios, compression can improve quality by 10–20% by addressing "lost in the middle" effects. For other tasks, quality is usually maintained within 5% of baseline.

Q3: Which tool should I start with?
LLMLingua via pip install llmlingua. Use the PromptCompressor class. For production speed, switch to LLMLingua-2.

Q4: Can I use compression with OpenAI/Anthropic APIs?
Yes. LLMLingua and LLMLingua-2 are "black-box compatible" — they compress text before sending to any API. GIST tokens require white-box model access.

Q5: What's the best compression ratio?
Start with 50% (rate=0.5). Test semantic similarity — maintain >0.80. For aggressive compression, don't go below 30% without careful testing.

Q6: How does compression affect latency?
Compression adds 50–200ms overhead but reduces LLM inference time (fewer tokens to process). Net effect is usually positive for prompts >1000 tokens.


References

Footnotes

  1. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" — TACL 2024 https://arxiv.org/abs/2307.03172

  2. Jiang et al., "LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression" — ACL 2024 https://arxiv.org/abs/2310.06839

  3. Microsoft Research, "LLMLingua: Compressing Prompts for Accelerated Inference" — EMNLP 2023 https://github.com/microsoft/LLMLingua

  4. Mu et al., "Learning to Compress Prompts with Gist Tokens" — NeurIPS 2023 https://arxiv.org/abs/2304.08467

  5. PCToolkit Repository https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression