Embedding Models & Vector Databases
Indexing Strategies
Vector indexes enable fast similarity search over millions of vectors. Understanding index types helps you balance speed, accuracy, and memory.
Why Indexing Matters
Without an index, finding similar vectors requires comparing against every vector:
# Brute force: O(n) - checks every vector
def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

def brute_force_search(query: list[float], vectors: list[list[float]], k: int):
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(range(len(distances)), key=lambda i: distances[i])[:k]

# With 1M vectors: ~1 second per query
# With index: ~10ms per query
Index Types
| Index | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|
| Flat | Slow | 100% | High | < 10K vectors |
| IVF | Fast | 95-99% | Medium | 10K-10M vectors |
| HNSW | Fastest | 95-99% | High | Any scale, low latency |
| PQ | Fast | 90-95% | Low | Memory constrained |
Flat Index (Brute Force)
Exact search with no approximation:
# Pinecone (assumes the v3+ SDK; serverless cloud/region are placeholders)
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="exact-search",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
    # Flat (exact) behaviour by default for small datasets
)
index = pc.Index("exact-search")

# Chroma
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
    # Uses flat search for small collections
)
Use when: < 10K vectors or when 100% recall is required.
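If you are self-hosting, an exact flat index can also be built directly. Below is a minimal sketch using the FAISS library; the random vectors are placeholders standing in for real embeddings.

import faiss
import numpy as np

dim = 1536
vectors = np.random.rand(5_000, dim).astype(np.float32)  # placeholder embeddings
faiss.normalize_L2(vectors)  # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(dim)  # exact, brute-force inner-product search
index.add(vectors)

query = vectors[:1].copy()
scores, ids = index.search(query, 5)  # top-5 exact matches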
IVF (Inverted File Index)
Clusters vectors, only searches relevant clusters:
# pgvector with IVF (assumes a psycopg2 connection to a database with the vector extension)
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()
cur.execute("""
    CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)  -- Number of clusters
""")
conn.commit()
# During search, only a subset of clusters is probed
# More lists = faster search, lower recall (each probe covers fewer vectors)
# Fewer lists = slower search, higher recall
How it works:
- Clusters vectors into `lists` groups during indexing
- At query time, finds the nearest cluster centroids
- Only searches the vectors in those clusters (see the toy sketch below)
Tuning IVF:
| Dataset Size | Lists | Probes |
|---|---|---|
| 100K | 100 | 10 |
| 1M | 1000 | 50 |
| 10M | 4000 | 100 |
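The probe counts in the table are applied at query time in pgvector rather than baked into the index. A sketch, reusing the psycopg2 cursor from the example above and assuming `query_embedding` is already formatted as a pgvector literal:

# Probes are a session/query-time setting, separate from the index's lists
cur.execute("SET ivfflat.probes = 10")
cur.execute(
    "SELECT id FROM documents ORDER BY embedding <=> %s LIMIT 5",
    (query_embedding,),  # e.g. "[0.1, 0.2, ...]"
)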
HNSW (Hierarchical Navigable Small World)
Graph-based index for ultra-fast search:
# Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff, SearchParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,             # Connections per node (higher = better recall, more memory)
        ef_construct=100  # Build quality (higher = better index, slower build)
    )
)

# Query-time tuning
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,  # a 1536-dim list of floats
    limit=5,
    search_params=SearchParams(hnsw_ef=128)  # Higher = better recall, slower search
)
HNSW Parameters:
| Parameter | Effect | Typical Values |
|---|---|---|
| `m` | Graph connectivity | 8-64 (16 default) |
| `ef_construct` | Index build quality | 64-512 (100 default) |
| `ef` (query) | Search quality | 50-500 |
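These knobs are not Qdrant-specific. As a standalone sketch, here is how `M`, `ef_construction`, and `ef` map onto the hnswlib library; the dimension and random data are placeholders.

import hnswlib
import numpy as np

dim = 1536
data = np.random.rand(10_000, dim).astype(np.float32)  # placeholder embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(data), M=16, ef_construction=100)  # build-time knobs
index.add_items(data, np.arange(len(data)))

index.set_ef(128)  # query-time knob: higher = better recall, slower search
labels, distances = index.knn_query(data[:1], k=5)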
Product Quantization (PQ)
Compresses vectors for memory efficiency:
# Milvus with PQ (assumes a Milvus server reachable on localhost)
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536)
])
collection = Collection("documents", schema)

# Create PQ index
index_params = {
    "index_type": "IVF_PQ",
    "metric_type": "IP",  # Inner product
    "params": {
        "nlist": 1024,    # IVF clusters
        "m": 64,          # Number of subquantizers
        "nbits": 8        # Bits per subquantizer
    }
}
collection.create_index("embedding", index_params)
Memory savings: 1536-dim float32 (6KB) → PQ with m=64 (64 bytes) = ~99% reduction
Trade-off: Lower recall accuracy (90-95%) in exchange for massive memory savings.
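At query time, IVF_PQ recall is controlled by `nprobe`, the number of the `nlist` clusters that actually get scanned. A sketch, assuming `query_embedding` is a 1536-dimensional list and the collection has been populated:

collection.load()
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},  # clusters to probe
    limit=5,
)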
Quantization Methods
| Method | Memory Reduction | Accuracy Impact |
|---|---|---|
| SQ (Scalar) | 4x | Minimal |
| PQ (Product) | 32-64x | Moderate |
| Binary | 32x | Significant |
# Qdrant scalar quantization (client and imports as in the HNSW example above)
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True
        )
    )
)
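For comparison, the Binary row in the table above maps onto Qdrant's binary quantization config, which keeps one bit per dimension. A sketch under the same client and imports as above; the collection name is illustrative:

from qdrant_client.models import BinaryQuantization, BinaryQuantizationConfig

client.create_collection(
    collection_name="documents-binary",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(always_ram=True)
    )
)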
Choosing an Index Strategy
START
│
▼
< 10K vectors?
│
├─ YES → Flat (exact search)
│
▼
Memory constrained?
│
├─ YES → IVF_PQ or HNSW + scalar quantization
│
▼
Need lowest latency?
│
├─ YES → HNSW (tune m and ef)
│
▼
> 10M vectors with balanced needs?
│
├─ YES → IVF_HNSW hybrid
│
▼
Default → HNSW (best latency/recall trade-off)
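The same decision tree can be captured in a small helper when scripting benchmarks; the thresholds and return labels simply mirror the chart above and are not prescriptive.

def choose_index(n_vectors: int, memory_constrained: bool = False,
                 latency_critical: bool = False) -> str:
    # Mirrors the decision chart; adjust thresholds for your own workload
    if n_vectors < 10_000:
        return "flat"                      # exact search, 100% recall
    if memory_constrained:
        return "ivf_pq or quantized hnsw"
    if latency_critical:
        return "hnsw"                      # tune m and ef
    if n_vectors > 10_000_000:
        return "ivf_hnsw hybrid"
    return "hnsw"                          # default: best latency/recall trade-off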
Index Build vs Query Time
| Index | Build Time | Query Time | Memory |
|---|---|---|---|
| Flat | O(n) | O(n) | 1x |
| IVF | O(n log n) | O(log n) | 1.1x |
| HNSW | O(n log n) | O(log n) | 1.5-2x |
| PQ | O(n) | O(log n) | 0.1x |
Production Tip: Start with HNSW defaults. Only tune parameters when you have specific latency or recall requirements. Premature optimization is rarely necessary.
Next, let's explore metadata filtering for precise retrieval control.