Indexing Strategies

Vector indexes enable fast similarity search over millions of vectors. Understanding index types helps you balance speed, accuracy, and memory.

Why Indexing Matters

Without an index, finding similar vectors requires comparing against every vector:

# Brute force: O(n) - compares the query against every stored vector
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def brute_force_search(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(range(len(distances)), key=lambda i: distances[i])[:k]

# With 1M vectors: ~1 second per query
# With an index: ~10ms per query

Index Types

Index   Speed     Accuracy   Memory   Best For
-----   -------   --------   ------   ----------------------
Flat    Slow      100%       High     < 10K vectors
IVF     Fast      95-99%     Medium   10K-10M vectors
HNSW    Fastest   95-99%     High     Any scale, low latency
PQ      Fast      90-95%     Low      Memory constrained

Flat Index (Brute Force)

Exact search with no approximation:

# Pinecone
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.create_index(
    name="exact-search",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
    # Flat (exact) search by default for small datasets
)

# Chroma
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
    # Uses flat search for small collections
)

Use when: < 10K vectors or when 100% recall is required.

IVF (Inverted File Index)

Clusters vectors, only searches relevant clusters:

# pgvector with IVF
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # assumes Postgres with the pgvector extension
cur = conn.cursor()

cur.execute("""
    CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)  -- Number of clusters
""")

# At query time, only a subset of clusters is probed
cur.execute("SET ivfflat.probes = 10")
# More probes = higher recall, slower search
# More lists = faster search but lower recall at a fixed probe count

How it works (see the toy sketch below):

  1. Partitions vectors into lists clusters during indexing
  2. At query time, finds the nearest cluster centroids
  3. Searches only the vectors in those clusters
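
A toy sketch of the cluster-then-probe idea, using numpy and scikit-learn's KMeans (function names like build_ivf are illustrative, not a real library API):

import numpy as np
from sklearn.cluster import KMeans

def build_ivf(vectors: np.ndarray, n_lists: int):
    # Step 1: partition vectors into n_lists clusters
    kmeans = KMeans(n_clusters=n_lists, n_init="auto").fit(vectors)
    lists = {i: np.where(kmeans.labels_ == i)[0] for i in range(n_lists)}
    return kmeans.cluster_centers_, lists

def ivf_search(query, centroids, lists, vectors, k, n_probes):
    # Step 2: find the n_probes nearest cluster centroids
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probes]
    # Step 3: compare only against vectors in the probed clusters
    candidates = np.concatenate([lists[c] for c in nearest])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]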

Tuning IVF:

Dataset Size   Lists   Probes
------------   -----   ------
100K           100     10
1M             1000    50
10M            4000    100

HNSW (Hierarchical Navigable Small World)

Graph-based index for ultra-fast search:

# Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff, SearchParams

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,             # Connections per node (higher = better recall, more memory)
        ef_construct=100  # Build quality (higher = better index, slower build)
    )
)

# Query-time tuning
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    search_params=SearchParams(hnsw_ef=128)  # Higher = better recall, slower search
)

HNSW Parameters:

Parameter      Effect                Typical Values
---------      -------------------   --------------------
m              Graph connectivity    8-64 (16 default)
ef_construct   Index build quality   64-512 (100 default)
ef (query)     Search quality        50-500
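
To make ef concrete, here's a toy single-layer sketch of the greedy graph walk that HNSW-style search performs (real HNSW adds a layer hierarchy; the graph structure and function name here are illustrative):

import heapq
import numpy as np

def greedy_search(graph: dict[int, list[int]], vectors: np.ndarray,
                  query: np.ndarray, entry: int, ef: int) -> list[int]:
    # Single-layer NSW-style search: walk the graph greedily,
    # keeping the ef best candidates seen so far
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: closest unexplored first
    best = [(-dist(entry), entry)]        # max-heap (negated) of the ef best so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0] and len(best) >= ef:
            break  # nothing left that could improve the result set
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                nd = dist(neighbor)
                if len(best) < ef or nd < -best[0][0]:
                    heapq.heappush(candidates, (nd, neighbor))
                    heapq.heappush(best, (-nd, neighbor))
                    if len(best) > ef:
                        heapq.heappop(best)
    return [i for _, i in sorted((-negd, i) for negd, i in best)]

A larger ef widens the frontier the walk maintains, which is why raising it improves recall at the cost of more distance computations.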

Product Quantization (PQ)

Compresses vectors for memory efficiency:

# Milvus with PQ
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536)
])

collection = Collection("documents", schema)

# Create PQ index
index_params = {
    "index_type": "IVF_PQ",
    "metric_type": "IP",  # Inner product
    "params": {
        "nlist": 1024,    # IVF clusters
        "m": 64,          # Subquantizers (must divide the dimension: 1536 / 64 = 24)
        "nbits": 8        # Bits per subquantizer (256 centroids per subspace)
    }
}
collection.create_index("embedding", index_params)

Memory savings: 1536-dim float32 (6KB) → PQ with m=64 (64 bytes) = ~99% reduction

Trade-off: Lower recall accuracy (90-95%) in exchange for massive memory savings.
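
To see where those 64 bytes come from, here's a minimal sketch of PQ training and encoding (function names are illustrative, using scikit-learn's KMeans; a real index also needs a matching distance computation over the codes):

import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors: np.ndarray, m: int = 64, nbits: int = 8):
    # Split each vector into m subvectors; learn 2**nbits centroids per subspace
    d = vectors.shape[1] // m
    codebooks = []
    for i in range(m):
        sub = vectors[:, i * d:(i + 1) * d]
        codebooks.append(KMeans(n_clusters=2 ** nbits, n_init="auto").fit(sub).cluster_centers_)
    return codebooks

def encode_pq(vector: np.ndarray, codebooks) -> np.ndarray:
    # Each subvector is replaced by its nearest centroid's index: 1 byte when nbits=8,
    # so a 1536-dim float32 vector (6KB) becomes m=64 bytes of codes
    d = len(vector) // len(codebooks)
    codes = [np.argmin(np.linalg.norm(cb - vector[i * d:(i + 1) * d], axis=1))
             for i, cb in enumerate(codebooks)]
    return np.array(codes, dtype=np.uint8)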

Quantization Methods

Method         Memory Reduction   Accuracy Impact
------         ----------------   ---------------
SQ (Scalar)    4x                 Minimal
PQ (Product)   32-64x             Moderate
Binary         32x                Significant

# Qdrant scalar quantization
from qdrant_client.models import (
    VectorParams, Distance, ScalarQuantization, ScalarQuantizationConfig, ScalarType
)

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,   # store each dimension as int8 (4x smaller than float32)
            quantile=0.99,          # clip outliers beyond the 99th percentile
            always_ram=True         # keep quantized vectors in RAM for speed
        )
    )
)

Choosing an Index Strategy

START
├─ < 10K vectors?                      → Flat (exact search)
├─ Memory constrained?                 → IVF_PQ or HNSW + scalar quantization
├─ Need lowest latency?                → HNSW (tune m and ef)
├─ > 10M vectors with balanced needs?  → IVF_HNSW hybrid
└─ Default                             → HNSW (best latency/recall trade-off)
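
The same flow as a hypothetical helper function (the thresholds and return strings mirror the chart above and are rough heuristics, not hard rules):

def choose_index(n_vectors: int, memory_constrained: bool = False,
                 latency_critical: bool = False) -> str:
    # Mirrors the decision flow above, checked in order
    if n_vectors < 10_000:
        return "Flat (exact search)"
    if memory_constrained:
        return "IVF_PQ or HNSW + scalar quantization"
    if latency_critical:
        return "HNSW (tune m and ef)"
    if n_vectors > 10_000_000:
        return "IVF_HNSW hybrid"
    return "HNSW (best latency/recall trade-off)"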

Index Build vs Query Time

Index   Build Time   Query Time   Memory
-----   ----------   ----------   ------
Flat    O(n)         O(n)         1x
IVF     O(n log n)   O(log n)     1.1x
HNSW    O(n log n)   O(log n)     1.5-2x
PQ      O(n)         O(log n)     0.1x

Production Tip: Start with HNSW defaults. Only tune parameters when you have specific latency or recall requirements. Premature optimization is rarely necessary.

Next, let's explore metadata filtering for precise retrieval control.
