# Hybrid Search & Reranking

## Reranking Strategies
Reranking takes initial retrieval results and reorders them using more sophisticated models for improved precision.
### Why Rerank?
Initial retrieval is fast but imprecise. Reranking trades a little latency for precision:

```
Initial Retrieval (fast, recall-focused)
├── Retrieve top ~100 candidates
└── Using bi-encoder embeddings

Reranking (slower, precision-focused)
├── Score each candidate against the query
├── Using a cross-encoder or LLM
└── Return top 10
```
| Stage | Speed | Quality | Use |
|---|---|---|---|
| Retrieval | Fast (~10ms) | Good recall | Cast wide net |
| Reranking | Slower (~100ms) | High precision | Select best |
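For a concrete picture of stage 1, here is a minimal bi-encoder retrieval sketch using `sentence-transformers` (the model name and the `documents` list are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder: query and documents are embedded independently,
# so document embeddings can be precomputed once and cached
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(documents, convert_to_tensor=True)

def retrieve(query: str, k: int = 100) -> list[str]:
    """Stage 1: cast a wide, recall-focused net."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=k)[0]
    return [documents[hit["corpus_id"]] for hit in hits]
```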
### Cross-Encoder Reranking

Cross-encoders process each query-document pair together in a single forward pass, capturing token-level interactions that bi-encoders miss and yielding more accurate relevance scores:
```python
from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(
        self,
        query: str,
        documents: list[str],
        top_k: int = 5,
    ) -> list[tuple[str, float]]:
        """Rerank documents by relevance to query.

        Returns:
            List of (document, score) tuples sorted by descending relevance.
        """
        # Create query-document pairs and score them all in one batch
        pairs = [(query, doc) for doc in documents]
        scores = self.model.predict(pairs)

        # Sort by score, highest first
        doc_scores = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
        return doc_scores[:top_k]

# Usage
reranker = CrossEncoderReranker()
results = reranker.rerank(
    query="How to handle API authentication?",
    documents=retrieved_docs,
    top_k=5,
)
```
Popular cross-encoder models:
| Model | Size | Speed | Quality |
|---|---|---|---|
| ms-marco-MiniLM-L-6-v2 | 23M | Fast | Good |
| ms-marco-MiniLM-L-12-v2 | 33M | Medium | Better |
| bge-reranker-large | 560M | Slow | Best |
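Any of these can be passed to `CrossEncoderReranker` via `model_name`. For large candidate sets, batching the scoring call bounds memory use; a sketch assuming the class above (`CrossEncoder.predict` accepts a `batch_size` argument):

```python
# Trade speed for quality with the larger BGE reranker
reranker = CrossEncoderReranker("BAAI/bge-reranker-large")

# Score candidates in batches instead of all at once
pairs = [(query, doc) for doc in documents]
scores = reranker.model.predict(pairs, batch_size=32)
```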
### Cohere Rerank
Commercial reranking API with excellent quality:
```python
import cohere

co = cohere.Client(api_key="your-key")

def cohere_rerank(
    query: str,
    documents: list[str],
    top_k: int = 5,
) -> list[dict]:
    """Rerank using Cohere's rerank endpoint."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_k,
    )
    return [
        {
            "document": documents[r.index],
            "score": r.relevance_score,
            "index": r.index,
        }
        for r in response.results
    ]

# Usage
results = cohere_rerank(
    query="OAuth 2.0 implementation",
    documents=retrieved_docs,
    top_k=5,
)
```
### ColBERT Late Interaction
ColBERT provides fast reranking through late interaction:
```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

class ColBERTReranker:
    def __init__(self, index_path: str):
        config = ColBERTConfig(
            doc_maxlen=512,
            query_maxlen=64,
        )
        self.searcher = Searcher(index=index_path, config=config)

    def rerank(self, query: str, doc_ids: list[int], top_k: int = 5):
        """Rerank candidate passages using ColBERT late interaction."""
        # Restrict search to the candidate passage IDs (pids) and keep
        # the top_k by late-interaction score
        pids, _, scores = self.searcher.search(query, k=top_k, pids=doc_ids)
        return list(zip(pids, scores))
```
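Usage follows the same shape as the other rerankers (the index path and candidate IDs below are placeholders; the index itself must be built beforehand with the `colbert-ai` package):

```python
# Usage (assumes a prebuilt ColBERT index on disk)
reranker = ColBERTReranker(index_path="experiments/default/indexes/docs")
results = reranker.rerank(
    query="OAuth token refresh",
    doc_ids=candidate_ids,  # passage IDs from first-stage retrieval
    top_k=5,
)
```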
How ColBERT works:
- Token embeddings for every document are precomputed offline
- At query time, only the query's token embeddings need to be computed
- Late interaction: each query token is matched to its most similar document token (MaxSim), and those maxima are summed (see the sketch below)
- Per-pair cost is much lower than a full cross-encoder forward pass
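A minimal sketch of the MaxSim step in NumPy, assuming L2-normalized token embeddings (names and shapes are illustrative):

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    query_embs: (num_query_tokens, dim), L2-normalized
    doc_embs:   (num_doc_tokens, dim), L2-normalized
    """
    # Cosine similarity of every query token against every doc token
    sim = query_embs @ doc_embs.T  # shape: (q_tokens, d_tokens)
    # Each query token keeps its best-matching doc token; sum over query tokens
    return float(sim.max(axis=1).sum())
```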
### LLM-Based Reranking
Use LLMs for zero-shot reranking:
```python
from openai import OpenAI

client = OpenAI()

def llm_rerank(
    query: str,
    documents: list[str],
    top_k: int = 5,
) -> list[tuple[str, int]]:
    """Rerank documents with an LLM, returning (document, rank) pairs."""
    # Format documents with indices, truncating each to 500 chars
    doc_list = "\n".join(f"[{i}] {doc[:500]}" for i, doc in enumerate(documents))

    prompt = f"""Given the query and documents below, rank the documents by relevance.
Return only the document indices in order of relevance, most relevant first.

Query: {query}

Documents:
{doc_list}

Return format: comma-separated indices (e.g., "3, 1, 4, 2, 0")

Ranking:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    # Parse defensively: keep only valid, in-range indices
    raw = response.choices[0].message.content
    indices = [
        int(tok.strip()) for tok in raw.split(",")
        if tok.strip().isdigit() and int(tok.strip()) < len(documents)
    ]
    return [(documents[i], rank) for rank, i in enumerate(indices[:top_k])]
```
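Usage mirrors the earlier rerankers; the defensive parsing above matters because the model returns free text and may occasionally include prose or out-of-range indices:

```python
# Usage
results = llm_rerank(
    query="How to handle API authentication?",
    documents=retrieved_docs,
    top_k=5,
)
```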
### Reranking Pipeline
Complete retrieval + reranking pipeline:
```python
class RAGPipelineWithReranking:
    def __init__(self, retriever, reranker):
        self.retriever = retriever
        self.reranker = reranker

    def search(self, query: str, retrieve_k: int = 20, final_k: int = 5):
        """Two-stage retrieval with reranking.

        Args:
            query: Search query
            retrieve_k: Number of candidates to retrieve initially
            final_k: Number of results to return after reranking
        """
        # Stage 1: fast, recall-oriented retrieval
        candidates = self.retriever.search(query, k=retrieve_k)

        # Stage 2: precise reranking of the candidates
        documents = [c["content"] for c in candidates]
        return self.reranker.rerank(query, documents, top_k=final_k)

# Usage (assumes the HybridRetriever from the hybrid search section)
pipeline = RAGPipelineWithReranking(
    retriever=HybridRetriever(documents),
    reranker=CrossEncoderReranker(),
)
results = pipeline.search("How to implement OAuth?", retrieve_k=50, final_k=5)
```
### Choosing a Reranker
| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (small) | ~50ms | Good | Free |
| Cross-encoder (large) | ~200ms | Better | Free |
| Cohere Rerank | ~100ms | Excellent | $$ |
| ColBERT | ~30ms | Good | Free |
| LLM (GPT-4) | ~500ms | Excellent | $$$ |
```
START
  │
  ▼
Latency critical (<100ms)?
  │
  ├─ YES → ColBERT or small cross-encoder
  │
  ▼ NO
Quality is top priority?
  │
  ├─ YES → Cohere Rerank or large cross-encoder
  │
  ▼ NO
Budget constrained?
  │
  ├─ YES → open-source cross-encoder
  │
  ▼ NO
Default → ms-marco-MiniLM-L-6-v2 (best balance)
```
:::tip Performance Tip
Retrieve 3-5x more candidates than you need, then rerank. The sweet spot is usually retrieving 20-50 candidates for a top-5 result set.
:::

Next, let's explore query enhancement techniques to improve retrieval before it even begins.