Metadata Filtering

Metadata filtering combines structured queries with vector similarity, enabling precise retrieval that pure semantic search cannot achieve.

Why Metadata Matters

Semantic search alone has limitations:

# Query: "Return policy for electronics"
# Problem: Retrieves return policies from ALL departments

# With metadata filtering:
results = vectorstore.similarity_search(
    query="Return policy",
    k=5,
    filter={"department": "electronics"}
)
# Now retrieves only electronics-specific policies

Metadata Schema Design

Design metadata for your retrieval patterns:

# Document with rich metadata
document = {
    "content": "Electronics can be returned within 30 days...",
    "metadata": {
        # Categorical - for exact matching
        "department": "electronics",
        "document_type": "policy",
        "language": "en",

        # Numeric - for range queries
        "page_number": 5,
        "word_count": 250,
        "created_year": 2024,

        # Boolean - for flags
        "is_current": True,
        "requires_approval": False,

        # Text - free-form strings (partial matching depends on database support)
        "source_file": "electronics_manual.pdf",
        "author": "John Smith"
    }
}
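A schema like this is easier to keep consistent if each record is checked before it is upserted. The sketch below is one possible way to do that. `EXPECTED_SCHEMA` and `validate_metadata` are hypothetical names introduced for illustration and are not part of any vector database client.

```python
# Hypothetical schema for the example document above: field name -> expected type.
EXPECTED_SCHEMA = {
    "department": str,
    "document_type": str,
    "language": str,
    "page_number": int,
    "word_count": int,
    "created_year": int,
    "is_current": bool,
    "requires_approval": bool,
    "source_file": str,
    "author": str,
}


def validate_metadata(metadata: dict, schema: dict = EXPECTED_SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the metadata matches the schema."""
    problems = []
    for field, expected_type in schema.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
            continue
        value = metadata[field]
        # bool is a subclass of int in Python, so reject True/False for int fields.
        if expected_type is int and isinstance(value, bool):
            problems.append(f"{field}: expected int, got bool")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    # Flag fields that the schema does not know about (often a typo in a key name).
    for field in metadata:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems
```

Running this check at ingestion time catches issues such as a year stored as the string "2024", which would otherwise silently fail to match a numeric range filter.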

Filtering Patterns by Database

Pinecone

# Exact match
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"department": {"$eq": "electronics"}}
)

# Multiple conditions (AND)
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "$and": [
            {"department": {"$eq": "electronics"}},
            {"is_current": {"$eq": True}},
            {"created_year": {"$gte": 2023}}
        ]
    }
)

# OR conditions
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "$or": [
            {"department": {"$eq": "electronics"}},
            {"department": {"$eq": "computers"}}
        ]
    }
)

# IN operator (multiple values)
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"department": {"$in": ["electronics", "computers", "phones"]}}
)
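To make the operator semantics concrete, here is a small in-memory evaluator for the filter shapes used above. It is a sketch for reasoning about and unit-testing filters, not Pinecone's implementation. `matches_filter` is a hypothetical helper that covers only `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$and`, and `$or`, and it assumes a missing field never matches.

```python
import operator

# Comparison operators supported by this sketch (a subset of what databases offer).
_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches_filter(metadata: dict, flt: dict) -> bool:
    """Return True if `metadata` satisfies the MongoDB-style filter `flt`."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches_filter(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches_filter(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # assumption: a missing field never matches
            value = metadata[key]
            # A bare value such as {"department": "electronics"} is treated as $eq.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            for op, operand in condition.items():
                if op == "$in":
                    if value not in operand:
                        return False
                elif op in _COMPARATORS:
                    if not _COMPARATORS[op](value, operand):
                        return False
                else:
                    raise ValueError(f"unsupported operator: {op}")
    return True
```

A helper like this lets you assert that a filter selects the documents you expect before sending it to a live index.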

Qdrant

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Exact match
# Note: newer qdrant-client releases deprecate search() in favor of query_points()
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="department", match=MatchValue(value="electronics"))
        ]
    )
)

# Range query
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="created_year",
                range=Range(gte=2023, lte=2024)
            )
        ]
    )
)

# Complex conditions
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="is_current", match=MatchValue(value=True))
        ],
        should=[  # OR conditions
            FieldCondition(key="department", match=MatchValue(value="electronics")),
            FieldCondition(key="department", match=MatchValue(value="computers"))
        ]
    )
)
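The same conditions can also be written as a plain dictionary, which is handy for logging or for sending a filter over HTTP. The sketch below assumes Qdrant's JSON filter layout (`must` clauses with `key` plus `match` or `range`). `to_qdrant_filter` and its parameters are hypothetical names, so verify the payload against the Qdrant documentation for your version.

```python
def to_qdrant_filter(equals=None, any_of=None, ranges=None) -> dict:
    """Build a Qdrant-style filter payload as a plain dict.

    equals: {field: value}            -> exact match
    any_of: {field: [values]}         -> match any listed value (OR on one field)
    ranges: {field: {"gte": x, ...}}  -> numeric range
    """
    must = []
    for key, value in (equals or {}).items():
        must.append({"key": key, "match": {"value": value}})
    for key, values in (any_of or {}).items():
        must.append({"key": key, "match": {"any": list(values)}})
    for key, bounds in (ranges or {}).items():
        must.append({"key": key, "range": dict(bounds)})
    # Every clause in "must" has to hold, so the clauses are combined with AND.
    return {"must": must}


# Usage: current documents from either department, created in 2023 or 2024.
payload = to_qdrant_filter(
    equals={"is_current": True},
    any_of={"department": ["electronics", "computers"]},
    ranges={"created_year": {"gte": 2023, "lte": 2024}},
)
```

Expressing "department is one of several values" as a single match-any clause keeps the OR logic attached to one field instead of spreading it across a `should` list.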

Chroma

# Exact match
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={"department": "electronics"}
)

# Multiple conditions (AND)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={
        "$and": [
            {"department": {"$eq": "electronics"}},
            {"is_current": {"$eq": True}}
        ]
    }
)

# OR conditions
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={
        "$or": [
            {"department": {"$eq": "electronics"}},
            {"department": {"$eq": "computers"}}
        ]
    }
)
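When conditions arrive as a simple dictionary, a small helper can produce the right `where` shape for any number of fields. This is a sketch under the assumption that Chroma expects a single top-level key in `where` and at least two entries inside `$and`. Check this against your Chroma version. `to_chroma_where` is a hypothetical helper, not part of the Chroma client.

```python
def to_chroma_where(conditions: dict):
    """Combine {field: value} equality conditions into a Chroma-style `where` dict.

    Returns None when there are no conditions, a single {"field": {"$eq": value}}
    for one condition, and an explicit "$and" list for two or more.
    """
    clauses = [{field: {"$eq": value}} for field, value in conditions.items()]
    if not clauses:
        return None  # no filter at all
    if len(clauses) == 1:
        return clauses[0]  # a single condition needs no "$and" wrapper
    return {"$and": clauses}
```

For example, `to_chroma_where({"department": "electronics", "is_current": True})` yields the same structure as the "Multiple conditions (AND)" query above.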

Filter Pushdown vs Post-Filtering

Strategy	How It Works	Trade-off
Filter Pushdown	Database applies the filter during vector search	Returns up to k matching results in a single query
Post-Filtering	Application filters after retrieving top-k	May return fewer than k results unless you over-fetch

# Filter pushdown (preferred) - database filters during search
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"department": "electronics"}  # Applied during search
)
# Returns up to 10 results, all matching the filter (fewer only if fewer than 10 documents match)

# Post-filtering (fallback) - filter after retrieval
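results = index.query(vector=query_embedding, top_k=100, include_metadata=True)
filtered = [
    r for r in results.matches  # metadata is only returned with include_metadata=True
    if (r.metadata or {}).get("department") == "electronics"
][:10]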
# May return fewer than 10 if not enough matches in top-100
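The difference between the two strategies is easy to see in a brute-force simulation. The sketch below uses synthetic random scores in place of real vector similarity, and every name in it (`pushdown_search`, `post_filter_search`, `fetch_k`) is hypothetical. Real databases implement pushdown inside their index structures rather than by scanning.

```python
import random


def top_k(scored_docs, k):
    """Return the k highest-scoring documents."""
    return sorted(scored_docs, key=lambda d: d["score"], reverse=True)[:k]


def pushdown_search(docs, department, k):
    """Filter first, then rank: only matching documents compete for the k slots."""
    candidates = [d for d in docs if d["department"] == department]
    return top_k(candidates, k)


def post_filter_search(docs, department, k, fetch_k):
    """Rank first, then filter: matches outside the top fetch_k are lost."""
    retrieved = top_k(docs, fetch_k)
    return [d for d in retrieved if d["department"] == department][:k]


# Synthetic corpus: 1,000 documents, 20 of them (2%) in "electronics".
rng = random.Random(0)
docs = [
    {
        "id": i,
        "department": "electronics" if i % 50 == 0 else "other",
        "score": rng.random(),  # stand-in for a similarity score
    }
    for i in range(1000)
]

pushed = pushdown_search(docs, "electronics", k=10)
posted = post_filter_search(docs, "electronics", k=10, fetch_k=100)
print(f"pushdown returned {len(pushed)}, post-filtering returned {len(posted)}")
```

With a selective filter like this one, post-filtering can only keep the matches that happen to land in the top `fetch_k`, so raising `fetch_k` trades extra retrieval work for a better chance of filling all k slots.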

Namespace Strategies

Organize data for efficient retrieval:

# Pinecone namespaces
index.upsert(
    vectors=[{"id": "doc1", "values": embedding, "metadata": {...}}],
    namespace="electronics"  # Separate index partition
)

# Query specific namespace
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace="electronics"
)

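# Query the default namespace
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace=""  # Empty string = default namespace, NOT all namespaces
)
# A query targets one namespace; to search several, query each and merge the results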

When to use namespaces:

Multi-tenant applications (separate by tenant_id)
Distinct document types (policies vs FAQs)
Language separation
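When a request has to span several namespaces, one option is to query each namespace and merge the hits by score. The sketch below shows only the merge step. `search_namespaces` and `query_fn` are hypothetical names, and the approach assumes scores are comparable across namespaces (same index, same distance metric, higher is better).

```python
import heapq


def search_namespaces(query_fn, namespaces, top_k):
    """Query each namespace separately and merge the hits by score.

    query_fn(namespace, top_k) is a caller-supplied function that returns a list
    of (score, doc_id) tuples, where a higher score means a better match.
    """
    merged = []
    for ns in namespaces:
        for score, doc_id in query_fn(ns, top_k):
            merged.append((score, ns, doc_id))
    # Keep the overall best top_k hits across all namespaces, best first.
    return heapq.nlargest(top_k, merged, key=lambda hit: hit[0])


# Usage with a fake backend standing in for real per-namespace queries.
_fake_results = {
    "electronics": [(0.9, "e1"), (0.5, "e2")],
    "computers": [(0.8, "c1"), (0.7, "c2")],
}


def fake_query(namespace, top_k):
    return _fake_results[namespace][:top_k]


hits = search_namespaces(fake_query, ["electronics", "computers"], top_k=3)
```

Each namespace is asked for the full `top_k` because the best overall hits might all come from a single namespace.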

Dynamic Filtering

Build filters from user context:

def build_filter(user_query: str, user_context: dict) -> dict:
    """Build a metadata filter from user context (user_query is currently unused)."""
    filters = []

    # User's department access
    if user_context.get("departments"):
        filters.append({
            "department": {"$in": user_context["departments"]}
        })

    # Only current documents
    filters.append({"is_current": {"$eq": True}})

    # Language preference
    if user_context.get("language"):
        filters.append({
            "language": {"$eq": user_context["language"]}
        })

    return {"$and": filters}  # never empty: the is_current condition is always added

# Usage
user_context = {
    "departments": ["electronics", "computers"],
    "language": "en"
}

results = index.query(
    vector=query_embedding,
    top_k=5,
    filter=build_filter(query, user_context)
)
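One detail is worth noting when filters double as access control: in `build_filter`, a user whose context has no `departments` entry gets no department condition at all, which means every department is searchable. If that is not the intended behavior, a deny-by-default variant is one option. The sketch below is an assumption about the desired policy, and `build_secure_filter` is a hypothetical name.

```python
def build_secure_filter(user_context: dict) -> dict:
    """Like build_filter, but refuse to build a filter without department access."""
    departments = user_context.get("departments")
    if not departments:
        # Without this check, a user with no departments would get NO department
        # restriction and could retrieve documents from every department.
        raise PermissionError("user has no department access")

    filters = [
        {"department": {"$in": list(departments)}},
        {"is_current": {"$eq": True}},  # only current documents
    ]
    if user_context.get("language"):
        filters.append({"language": {"$eq": user_context["language"]}})
    return {"$and": filters}
```

Raising an error keeps the failure visible. Returning a filter that matches nothing would be an alternative if the application prefers an empty result set.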

Best Practices

Practice	Benefit
Index metadata fields	Faster filtering
Use categorical over text	Exact matching
Limit filter complexity	Query performance
Pre-compute common filters	Reduce runtime processing
Test with realistic data	Validate filter selectivity
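Filter selectivity, the fraction of documents a filter keeps, can be estimated from a sample of your metadata before any query is run. The following is a minimal sketch with a hypothetical `selectivity` helper that handles a single field.

```python
from collections import Counter


def selectivity(metadatas: list[dict], field: str) -> dict:
    """Return the fraction of documents holding each value of `field`.

    Documents that lack the field are counted under the key None.
    """
    total = len(metadatas)
    if total == 0:
        return {}
    counts = Counter(m.get(field) for m in metadatas)
    return {value: count / total for value, count in counts.items()}


# Usage on a tiny sample: one of four documents is in "electronics".
sample = [
    {"department": "electronics"},
    {"department": "computers"},
    {"department": "computers"},
    {"language": "en"},  # no department field
]
shares = selectivity(sample, "department")
```

Values with a very small share mark the filters where post-filtering is most likely to come back with fewer than k results.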

Design Principle: The best RAG systems use metadata to narrow the search space before semantic similarity. Think of filters as "where to look" and vectors as "what to find."

In the next module, we'll explore advanced chunking strategies that directly impact retrieval quality.