Embedding Models & Vector Databases
Indexing Strategies
Vector indexes enable fast similarity search over millions of vectors. Understanding index types helps you balance speed, accuracy, and memory.
Why Indexing Matters
Without an index, finding similar vectors requires comparing against every vector:
# Brute force: O(n) - checks every vector
def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

def brute_force_search(query: list[float], vectors: list[list[float]], k: int):
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(range(len(distances)), key=lambda i: distances[i])[:k]

# With 1M vectors: ~1 second per query
# With index: ~10ms per query
Index Types
| Index | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|
| Flat | Slow | 100% | High | < 10K vectors |
| IVF | Fast | 95-99% | Medium | 10K-10M vectors |
| HNSW | Fastest | 95-99% | High | Any scale, low latency |
| PQ | Fast | 90-95% | Low | Memory constrained |
Flat Index (Brute Force)
Exact search with no approximation:
# Pinecone (assumes the v3+ SDK; serverless cloud/region are placeholders)
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="exact-search",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
    # Flat (exact) behaviour by default for small datasets
)
index = pc.Index("exact-search")

# Chroma
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
    # Uses flat search for small collections
)
Use when: < 10K vectors or when 100% recall is required.
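If you are self-hosting, an exact flat index can also be built directly. Below is a minimal sketch using the FAISS library; the random vectors are placeholders standing in for real embeddings.

import faiss
import numpy as np

dim = 1536
vectors = np.random.rand(5_000, dim).astype(np.float32)  # placeholder embeddings
faiss.normalize_L2(vectors)  # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(dim)  # exact, brute-force inner-product search
index.add(vectors)

query = vectors[:1].copy()
scores, ids = index.search(query, 5)  # top-5 exact matches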
IVF (Inverted File Index)
Clusters vectors, only searches relevant clusters:
# pgvector with IVF (assumes a psycopg2 connection to a database with the vector extension)
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()
cur.execute("""
    CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)  -- Number of clusters
""")
conn.commit()
# During search, only a subset of clusters is probed
# More lists = faster search, lower recall (each probe covers fewer vectors)
# Fewer lists = slower search, higher recall
How it works:
- Clusters vectors into `lists` groups during indexing
- At query time, finds the nearest cluster centroids
- Only searches the vectors in those clusters (see the toy sketch below)
Tuning IVF:
| Dataset Size | Lists | Probes |
|---|---|---|
| 100K | 100 | 10 |
| 1M | 1000 | 50 |
| 10M | 4000 | 100 |
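The probe counts in the table are applied at query time in pgvector rather than baked into the index. A sketch, reusing the psycopg2 cursor from the example above and assuming `query_embedding` is already formatted as a pgvector literal:

# Probes are a session/query-time setting, separate from the index's lists
cur.execute("SET ivfflat.probes = 10")
cur.execute(
    "SELECT id FROM documents ORDER BY embedding <=> %s LIMIT 5",
    (query_embedding,),  # e.g. "[0.1, 0.2, ...]"
)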
HNSW (Hierarchical Navigable Small World)
Graph-based index for ultra-fast search:
# Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff, SearchParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,             # Connections per node (higher = better recall, more memory)
        ef_construct=100  # Build quality (higher = better index, slower build)
    )
)

# Query-time tuning
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,  # a 1536-dim list of floats
    limit=5,
    search_params=SearchParams(hnsw_ef=128)  # Higher = better recall, slower search
)
HNSW Parameters:
| Parameter | Effect | Typical Values |
|---|---|---|
| `m` | Graph connectivity | 8-64 (16 default) |
| `ef_construct` | Index build quality | 64-512 (100 default) |
| `ef` (query) | Search quality | 50-500 |
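These knobs are not Qdrant-specific. As a standalone sketch, here is how `M`, `ef_construction`, and `ef` map onto the hnswlib library; the dimension and random data are placeholders.

import hnswlib
import numpy as np

dim = 1536
data = np.random.rand(10_000, dim).astype(np.float32)  # placeholder embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(data), M=16, ef_construction=100)  # build-time knobs
index.add_items(data, np.arange(len(data)))

index.set_ef(128)  # query-time knob: higher = better recall, slower search
labels, distances = index.knn_query(data[:1], k=5)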
Product Quantization (PQ)
Compresses vectors for memory efficiency:
# Milvus with PQ (assumes a Milvus server reachable on localhost)
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536)
])
collection = Collection("documents", schema)

# Create PQ index
index_params = {
    "index_type": "IVF_PQ",
    "metric_type": "IP",  # Inner product
    "params": {
        "nlist": 1024,    # IVF clusters
        "m": 64,          # Number of subquantizers
        "nbits": 8        # Bits per subquantizer
    }
}
collection.create_index("embedding", index_params)
Memory savings: 1536-dim float32 (6KB) → PQ with m=64 (64 bytes) = ~99% reduction
Trade-off: Lower recall accuracy (90-95%) in exchange for massive memory savings.
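At query time, IVF_PQ recall is controlled by `nprobe`, the number of the `nlist` clusters that actually get scanned. A sketch, assuming `query_embedding` is a 1536-dimensional list and the collection has been populated:

collection.load()
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},  # clusters to probe
    limit=5,
)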
Quantization Methods
| Method | Memory Reduction | Accuracy Impact |
|---|---|---|
| SQ (Scalar) | 4x | Minimal |
| PQ (Product) | 32-64x | Moderate |
| Binary | 32x | Significant |
# Qdrant scalar quantization (client and imports as in the HNSW example above)
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True
        )
    )
)
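For comparison, the Binary row in the table above maps onto Qdrant's binary quantization config, which keeps one bit per dimension. A sketch under the same client and imports as above; the collection name is illustrative:

from qdrant_client.models import BinaryQuantization, BinaryQuantizationConfig

client.create_collection(
    collection_name="documents-binary",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(always_ram=True)
    )
)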
Choosing an Index Strategy
START
│
▼
< 10K vectors?
│
├─ YES → Flat (exact search)
│
▼
Memory constrained?
│
├─ YES → IVF_PQ or HNSW + scalar quantization
│
▼
Need lowest latency?
│
├─ YES → HNSW (tune m and ef)
│
▼
> 10M vectors with balanced needs?
│
├─ YES → IVF_HNSW hybrid
│
▼
Default → HNSW (best latency/recall trade-off)
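The same decision tree can be captured in a small helper when scripting benchmarks; the thresholds and return labels simply mirror the chart above and are not prescriptive.

def choose_index(n_vectors: int, memory_constrained: bool = False,
                 latency_critical: bool = False) -> str:
    # Mirrors the decision chart; adjust thresholds for your own workload
    if n_vectors < 10_000:
        return "flat"                      # exact search, 100% recall
    if memory_constrained:
        return "ivf_pq or quantized hnsw"
    if latency_critical:
        return "hnsw"                      # tune m and ef
    if n_vectors > 10_000_000:
        return "ivf_hnsw hybrid"
    return "hnsw"                          # default: best latency/recall trade-off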
Index Build vs Query Time
| Index | Build Time | Query Time | Memory |
|---|---|---|---|
| Flat | O(n) | O(n) | 1x |
| IVF | O(n log n) | O(log n) | 1.1x |
| HNSW | O(n log n) | O(log n) | 1.5-2x |
| PQ | O(n) | O(log n) | 0.1x |
Production Tip: Start with HNSW defaults. Only tune parameters when you have specific latency or recall requirements. Premature optimization is rarely necessary.
Next, let's explore metadata filtering for precise retrieval control.