Lesson 7 of 24

Embedding Models & Vector Databases

Indexing Strategies

3 min read

Vector indexes enable fast similarity search over millions of vectors. Understanding index types helps you balance speed, accuracy, and memory.

Why Indexing Matters

Without an index, finding similar vectors requires comparing against every vector:

# Brute force: O(n) - checks every vector
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

def brute_force_search(query: list[float], vectors: list[list[float]], k: int):
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(range(len(distances)), key=lambda i: distances[i])[:k]

# With 1M vectors: ~1 second per query
# With index: ~10ms per query
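For contrast, the same exact scan vectorizes into a single matrix product with NumPy (an illustrative sketch, not any particular database's implementation):

```python
import numpy as np

def brute_force_search_np(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by cosine similarity, one matrix-vector product for all rows."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # cosine similarity against every vector at once
    return np.argsort(-sims)[:k]   # indices of the k most similar vectors

# Sanity check: a vector should be its own nearest neighbor
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 64))
idx = brute_force_search_np(data[42], data, k=3)
```

Even vectorized, the cost is still linear in the number of vectors, which is exactly what an index avoids.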

Index Types

| Index | Speed | Accuracy | Memory | Best For |
|-------|-------|----------|--------|----------|
| Flat | Slow | 100% | High | < 10K vectors |
| IVF | Fast | 95-99% | Medium | 10K-10M vectors |
| HNSW | Fastest | 95-99% | High | Any scale, low latency |
| PQ | Fast | 90-95% | Low | Memory constrained |

Flat Index (Brute Force)

Exact search with no approximation:

# Pinecone (serverless; cloud/region values are placeholders)
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="...")
pc.create_index(
    name="exact-search",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
    # Flat index by default for small datasets
)

# Chroma
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
    # Uses flat search for small collections
)

Use when: < 10K vectors or when 100% recall is required.

IVF (Inverted File Index)

Clusters the vectors and searches only the most relevant clusters:

# pgvector with IVF
cur.execute("""
    CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)  -- Number of clusters
""")

# At query time, only a subset of clusters is probed
# (in pgvector: SET ivfflat.probes = N)
# More lists = smaller clusters: faster search, lower recall at a fixed probe count
# Fewer lists = larger clusters: slower search, higher recall at a fixed probe count

How it works:

  1. Partitions the vectors into `lists` clusters (typically via k-means) during indexing
  2. At query time, finds nearest cluster centroids
  3. Only searches vectors in those clusters
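These three steps can be sketched in NumPy; here, random sample points stand in for trained k-means centroids, so this is illustrative only:

```python
import numpy as np

def ivf_search(query, vectors, assignments, centroids, k=5, nprobe=2):
    """Search only the vectors assigned to the nprobe nearest centroids."""
    # Step 2: find the nprobe cluster centroids closest to the query
    probed = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    # Step 3: scan only the vectors that live in those clusters
    candidates = np.flatnonzero(np.isin(assignments, probed))
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

# Step 1: assign each vector to its nearest centroid
# (random sample points stand in for trained k-means centroids)
rng = np.random.default_rng(1)
vectors = rng.normal(size=(2000, 32))
centroids = vectors[rng.choice(2000, size=20, replace=False)]
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)
hits = ivf_search(vectors[7], vectors, assignments, centroids, k=3)
```

With 20 clusters and `nprobe=2`, each query scans roughly a tenth of the data instead of all of it.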

Tuning IVF:

| Dataset Size | Lists | Probes |
|--------------|-------|--------|
| 100K | 100 | 10 |
| 1M | 1000 | 50 |
| 10M | 4000 | 100 |
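These are starting points, not laws; the pgvector docs suggest a related rule of thumb (lists ≈ rows / 1000 up to 1M rows, ≈ √rows beyond that, and probes ≈ √lists), which a tiny helper can encode:

```python
import math

def ivf_params(num_rows: int) -> tuple[int, int]:
    """Starting-point (lists, probes) per the pgvector docs' rule of thumb."""
    if num_rows <= 1_000_000:
        lists = max(1, num_rows // 1000)   # rows / 1000 up to 1M rows
    else:
        lists = round(math.sqrt(num_rows))  # sqrt(rows) beyond 1M rows
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes
```

Treat the output as an initial setting and then tune probes against your measured recall.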

HNSW (Hierarchical Navigable Small World)

Graph-based index for ultra-fast search:

# Qdrant
from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,             # Connections per node (higher = better recall, more memory)
        ef_construct=100  # Build quality (higher = better index, slower build)
    )
)

# Query-time tuning
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    search_params=SearchParams(hnsw_ef=128)  # Higher = better recall, slower search
)

HNSW Parameters:

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| m | Graph connectivity | 8-64 (16 default) |
| ef_construct | Index build quality | 64-512 (100 default) |
| ef (query) | Search quality | 50-500 |
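Under the hood, HNSW answers queries by greedy descent through a proximity graph. A single-layer sketch (real HNSW adds a layer hierarchy and an ef-sized candidate beam) conveys the idea:

```python
import numpy as np

def greedy_graph_search(query, vectors, neighbors, entry=0):
    """Greedy descent: hop to whichever neighbor is closest to the query."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        cand = neighbors[current]                        # this node's m links
        cand_dists = np.linalg.norm(vectors[cand] - query, axis=1)
        best = int(np.argmin(cand_dists))
        if cand_dists[best] >= current_dist:
            return current          # local minimum: no neighbor is closer
        current, current_dist = int(cand[best]), cand_dists[best]

# Toy k-NN graph as a stand-in for HNSW's layered construction (m=8 links)
rng = np.random.default_rng(2)
vectors = rng.normal(size=(500, 16))
dmat = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=2)
neighbors = np.argsort(dmat, axis=1)[:, 1:9]
query = vectors[123] + 0.001
found = greedy_graph_search(query, vectors, neighbors)
```

Each hop strictly reduces the distance to the query, which is why search cost grows roughly logarithmically rather than linearly; the ef beam exists to avoid getting trapped in local minima like this greedy version can.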

Product Quantization (PQ)

Compresses vectors for memory efficiency:

# Milvus with PQ
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536)
])

collection = Collection("documents", schema)

# Create PQ index
index_params = {
    "index_type": "IVF_PQ",
    "metric_type": "IP",  # Inner product
    "params": {
        "nlist": 1024,    # IVF clusters
        "m": 64,          # Number of subquantizers
        "nbits": 8        # Bits per subquantizer
    }
}
collection.create_index("embedding", index_params)

Memory savings: a 1536-dim float32 vector (6,144 bytes) compresses to 64 one-byte codes with m=64 (64 bytes), a ~99% reduction.

Trade-off: Lower recall accuracy (90-95%) in exchange for massive memory savings.
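A stripped-down encoder (with random codebooks standing in for trained ones) shows where the savings come from: each 24-dim slice of the vector collapses to a single byte indexing its nearest codeword:

```python
import numpy as np

def pq_encode(vec, codebooks):
    """Encode one vector as m uint8 codes, one per subquantizer."""
    m, ksub, dsub = codebooks.shape      # m subquantizers x 256 codewords x dsub dims
    subvecs = vec.reshape(m, dsub)
    codes = [
        int(np.argmin(np.linalg.norm(cb - sv, axis=1)))  # nearest codeword per slice
        for cb, sv in zip(codebooks, subvecs)
    ]
    return np.array(codes, dtype=np.uint8)

rng = np.random.default_rng(3)
dim, m = 1536, 64                                 # 64 subquantizers over 24-dim slices
codebooks = rng.normal(size=(m, 256, dim // m))   # normally learned with k-means
vec = rng.normal(size=dim).astype(np.float32)
codes = pq_encode(vec, codebooks)
# 1536 float32 values (6144 bytes) shrink to 64 one-byte codes
```

The recall loss comes from the same place as the savings: many different vectors map to the same 64-byte code, so distances computed from codes are approximations.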

Quantization Methods

| Method | Memory Reduction | Accuracy Impact |
|--------|------------------|-----------------|
| SQ (Scalar) | 4x | Minimal |
| PQ (Product) | 32-64x | Moderate |
| Binary | 32x | Significant |

# Qdrant scalar quantization
from qdrant_client.models import (
    Distance, ScalarQuantization, ScalarQuantizationConfig, ScalarType, VectorParams,
)

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,  # Store each dimension as one byte
            quantile=0.99,         # Clip outliers beyond the 99th percentile
            always_ram=True        # Keep quantized vectors in RAM
        )
    )
)
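The mechanics behind int8 scalar quantization can be sketched in a few lines: clip at a quantile, then map the remaining range onto 256 levels (a simplified illustration, not Qdrant's exact scheme):

```python
import numpy as np

def sq_int8(vectors: np.ndarray, quantile: float = 0.99):
    """Quantize float32 vectors to int8 plus a shared scale/offset."""
    lo = np.quantile(vectors, 1 - quantile)   # clip outliers below...
    hi = np.quantile(vectors, quantile)       # ...and above the quantile
    scale = (hi - lo) / 255.0
    codes = np.clip((vectors - lo) / scale, 0, 255).round() - 128
    return codes.astype(np.int8), scale, lo

def sq_decode(codes, scale, lo):
    return (codes.astype(np.float32) + 128) * scale + lo

rng = np.random.default_rng(4)
vecs = rng.normal(size=(100, 1536)).astype(np.float32)
codes, scale, lo = sq_int8(vecs)
# 4x smaller than float32; reconstruction is off by at most half a quantization step
err = np.abs(sq_decode(codes, scale, lo) - np.clip(vecs, lo, lo + scale * 255)).max()
```

The quantile clipping is what keeps the 256 levels well spent: without it, a single outlier would stretch the range and waste most codes on values that never occur.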

Choosing an Index Strategy

Work through these questions in order and take the first branch that applies:

  1. < 10K vectors? → Flat (exact search)
  2. Memory constrained? → IVF_PQ or HNSW + scalar quantization
  3. Need the lowest latency? → HNSW (tune m and ef)
  4. > 10M vectors with balanced needs? → IVF_HNSW hybrid
  5. Otherwise → HNSW (best latency/recall trade-off)
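The decision flow translates directly into a small helper (the return values are just the index names used in this lesson):

```python
def choose_index(num_vectors: int, memory_constrained: bool = False,
                 lowest_latency: bool = False) -> str:
    """Mirror the decision flow: size first, then memory, then latency."""
    if num_vectors < 10_000:
        return "Flat"                       # exact search is cheap at this scale
    if memory_constrained:
        return "IVF_PQ or HNSW + scalar quantization"
    if lowest_latency:
        return "HNSW"                       # tune m and ef for your latency budget
    if num_vectors > 10_000_000:
        return "IVF_HNSW hybrid"
    return "HNSW"                           # default: best latency/recall trade-off
```

Note that the ordering matters: a 5K-vector collection gets Flat even if memory is tight, because exact search over so few vectors costs almost nothing.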

Index Build vs Query Time

| Index | Build Time | Query Time | Memory |
|-------|------------|------------|--------|
| Flat | O(n) | O(n) | 1x |
| IVF | O(n log n) | O(log n) | 1.1x |
| HNSW | O(n log n) | O(log n) | 1.5-2x |
| PQ | O(n) | O(log n) | 0.1x |

Production Tip: Start with HNSW defaults. Only tune parameters when you have specific latency or recall requirements; premature optimization is rarely necessary.

Next, let's explore metadata filtering for precise retrieval control.
