Hybrid Search & Reranking
Query Enhancement
Improving the query before retrieval can significantly improve results. The techniques below bridge the gap between how users phrase questions and how documents are written.
The Query-Document Gap
Users ask questions differently than documents are written:
User Query: "Why is my app slow?"
Document: "Performance optimization techniques include..."
Gap: Different vocabulary, question vs statement
Query Expansion
Generate multiple query variations to improve recall:
def expand_query(query: str, llm) -> list[str]:
    """Generate query variations for better coverage."""
    prompt = f"""Generate 3 alternative search queries for:
"{query}"
Include:
1. A rephrased version
2. A more technical version
3. A simpler version
Return only the queries, one per line."""
    response = llm.invoke(prompt)
    # Drop any blank lines the LLM emits between queries
    variations = [v.strip() for v in response.content.split('\n') if v.strip()]
    # Include the original query plus up to 3 variations
    return [query] + variations[:3]
# Example
query = "Why is my app slow?"
expanded = expand_query(query, llm)
# ["Why is my app slow?",
# "What causes application performance issues?",
# "Application latency troubleshooting",
# "App running slowly"]
Multi-Query Retrieval
Search with all query variations:
class MultiQueryRetriever:
    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    def search(self, query: str, k: int = 10) -> list[dict]:
        # Expand query
        queries = expand_query(query, self.llm)
        # Retrieve for each variation, deduplicating by document id
        all_results = []
        seen_ids = set()
        for q in queries:
            results = self.retriever.search(q, k=k)
            for result in results:
                if result["id"] not in seen_ids:
                    all_results.append(result)
                    seen_ids.add(result["id"])
        # Rerank combined results against the original query
        return self._rerank(query, all_results, k)

    def _rerank(self, query: str, results: list[dict], k: int) -> list[dict]:
        # Minimal placeholder: keep the retriever's own score order.
        # Swap in a cross-encoder reranker here for better quality.
        return sorted(results, key=lambda r: r.get("score", 0.0), reverse=True)[:k]
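A minimal usage sketch — `retriever` here is assumed to expose a `search(query, k)` method returning dicts with `"id"` (and optionally `"score"`) keys:

# Hypothetical usage; any retriever with a compatible .search() works
mq_retriever = MultiQueryRetriever(retriever, llm)
results = mq_retriever.search("Why is my app slow?", k=10)
for r in results[:3]:
    print(r["id"])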
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, then search for similar documents:
class HyDERetriever:
    def __init__(self, vectorstore, llm, embeddings):
        self.vectorstore = vectorstore
        self.llm = llm
        self.embeddings = embeddings

    def search(self, query: str, k: int = 5) -> list[dict]:
        # Generate hypothetical document
        prompt = f"""Write a detailed answer to this question as if you were
writing documentation:
Question: {query}
Answer:"""
        hypothetical_doc = self.llm.invoke(prompt).content
        # Embed the hypothetical document
        hyde_embedding = self.embeddings.embed_query(hypothetical_doc)
        # Search using the hypothetical document's embedding
        results = self.vectorstore.similarity_search_by_vector(
            hyde_embedding,
            k=k
        )
        return results
# Example
query = "How do I implement rate limiting?"
# Hypothetical doc generated:
# "Rate limiting can be implemented using a token bucket algorithm.
# First, define a bucket size and refill rate..."
# This embedding matches documentation better than the question would
Why HyDE works:
- Questions and documents occupy different regions of embedding space
- A hypothetical answer is phrased like actual documentation
- The answer's embedding therefore lands closer to document embeddings than the question's would, as the sketch below illustrates
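A rough way to observe this effect yourself (a sketch: `embeddings` is any embedding model with an `embed_query` method, and the cosine math is plain NumPy; the example strings are hypothetical):

import numpy as np

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "How do I implement rate limiting?"
hypothetical = "Rate limiting is typically implemented with a token bucket..."
doc = "Rate limiting can be implemented using a token bucket algorithm..."

q_emb = embeddings.embed_query(question)
h_emb = embeddings.embed_query(hypothetical)
d_emb = embeddings.embed_query(doc)

# Expect the hypothetical answer to score closer to the document
print(cosine(q_emb, d_emb))  # question vs document
print(cosine(h_emb, d_emb))  # hypothetical vs document (usually higher)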
Query Decomposition
Break complex queries into sub-queries:
def decompose_query(query: str, llm) -> list[str]:
    """Break a complex query into simpler sub-queries."""
    prompt = f"""Analyze this query and break it into simpler sub-queries
that can be answered independently:
Query: {query}
If the query is already simple, return it as-is.
Otherwise, return 2-4 sub-queries, one per line.
Sub-queries:"""
    response = llm.invoke(prompt)
    # Drop blank lines; fall back to the original query if nothing splits
    sub_queries = [q.strip() for q in response.content.split('\n') if q.strip()]
    return sub_queries if len(sub_queries) > 1 else [query]
# Example
query = "Compare OAuth and JWT for API authentication and show implementation"
sub_queries = decompose_query(query, llm)
# ["What is OAuth for API authentication?",
# "What is JWT for API authentication?",
# "How to implement OAuth?",
# "How to implement JWT?"]
Step-Back Prompting
Generate a more general query first:
def step_back_query(query: str, llm) -> str:
    """Generate a broader, more general query."""
    prompt = f"""Given this specific query, generate a broader question
that would provide useful background context:
Specific query: {query}
Broader question:"""
    return llm.invoke(prompt).content.strip()

class StepBackRetriever:
    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    def search(self, query: str, k: int = 5) -> list[dict]:
        # Get step-back query
        broad_query = step_back_query(query, self.llm)
        # Retrieve for both (guarantee at least one broad result for small k)
        specific_results = self.retriever.search(query, k=k)
        broad_results = self.retriever.search(broad_query, k=max(1, k // 2))
        # Combine (broad context first, then specific)
        return broad_results + specific_results
# Example
query = "Why does HNSW index have higher memory usage than IVF?"
step_back = "How do vector database indexing algorithms work?"
# Broad context helps answer the specific question
Query Transformation Pipeline
Combine multiple techniques:
class QueryTransformPipeline:
    def __init__(self, retriever, llm, embeddings):
        self.retriever = retriever
        self.llm = llm
        self.embeddings = embeddings

    def search(
        self,
        query: str,
        k: int = 5,
        use_expansion: bool = True,
        use_hyde: bool = False,
        use_decomposition: bool = False
    ) -> list[dict]:
        all_results = []
        seen = set()
        # Original query
        queries = [query]
        # Query expansion (expand_query returns the original first; skip the duplicate)
        if use_expansion:
            queries.extend(expand_query(query, self.llm)[1:])
        # Query decomposition
        if use_decomposition:
            queries.extend(decompose_query(query, self.llm))
        # Retrieve for all queries, deduplicating by document id
        for q in queries:
            if use_hyde:
                results = self._hyde_search(q, k=k)
            else:
                results = self.retriever.search(q, k=k)
            for r in results:
                if r["id"] not in seen:
                    all_results.append(r)
                    seen.add(r["id"])
        # Rerank against the original query
        return self._rerank(query, all_results, k)

    def _hyde_search(self, query: str, k: int) -> list[dict]:
        # Same idea as HyDERetriever above; assumes the retriever
        # exposes a search_by_vector() method.
        doc = self.llm.invoke(
            f"Write a short documentation-style answer to: {query}"
        ).content
        return self.retriever.search_by_vector(
            self.embeddings.embed_query(doc), k=k
        )

    def _rerank(self, query: str, results: list[dict], k: int) -> list[dict]:
        # Minimal placeholder, as in MultiQueryRetriever; swap in a
        # cross-encoder for better ordering.
        return sorted(results, key=lambda r: r.get("score", 0.0), reverse=True)[:k]
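A usage sketch showing how the flags map onto query types (`retriever`, `llm`, and `embeddings` are the same assumed objects used throughout):

pipeline = QueryTransformPipeline(retriever, llm, embeddings)

# Simple factual query: expansion alone is a good baseline
results = pipeline.search("Why is my app slow?", k=5)

# Complex multi-part query: add decomposition
results = pipeline.search(
    "Compare OAuth and JWT for API authentication and show implementation",
    k=5,
    use_decomposition=True,
)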
Choosing Enhancement Techniques
| Technique | Best For | Latency Impact |
|---|---|---|
| Query expansion | Vocabulary mismatch | +100-200ms |
| HyDE | Q&A retrieval | +200-500ms |
| Decomposition | Complex queries | +200-400ms |
| Step-back | Questions needing context | +100-200ms |
START
│
▼
Simple factual query?
│
├─ YES → Query expansion only
│
▼
Q&A over documentation?
│
├─ YES → HyDE + expansion
│
▼
Complex multi-part query?
│
├─ YES → Decomposition
│
▼
Query needs background context?
│
├─ YES → Step-back prompting
│
▼
Default → Query expansion (good baseline)
Latency Note: Query enhancement adds LLM calls. Cache common query transformations and consider async processing for production systems.
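For example, a minimal in-process cache around `expand_query` (a sketch using `functools.lru_cache`; a production system would more likely use a shared cache such as Redis):

from functools import lru_cache

def make_cached_expander(llm, maxsize: int = 1024):
    """Wrap expand_query in an LRU cache keyed on the query string."""
    @lru_cache(maxsize=maxsize)
    def cached_expand(query: str) -> tuple[str, ...]:
        # Return a tuple so the cached value is hashable and immutable
        return tuple(expand_query(query, llm))
    return cached_expand

expand = make_cached_expander(llm)
queries = list(expand("Why is my app slow?"))  # repeat calls are cache hits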
In the next module, we'll learn how to evaluate RAG systems systematically using RAGAS and other frameworks.