Embedding Models & Vector Databases
Metadata Filtering
Metadata filtering combines structured queries with vector similarity, enabling precise retrieval that pure semantic search cannot achieve.
Why Metadata Matters
Semantic search alone cannot enforce hard constraints such as department, date, or language:
# Query: "Return policy for electronics"
# Problem: Retrieves return policies from ALL departments
# With metadata filtering:
results = vectorstore.similarity_search(
    query="Return policy",
    k=5,
    filter={"department": "electronics"}
)
# Now retrieves only electronics-specific policies
Metadata Schema Design
Design metadata for your retrieval patterns:
# Document with rich metadata
document = {
    "content": "Electronics can be returned within 30 days...",
    "metadata": {
        # Categorical - for exact matching
        "department": "electronics",
        "document_type": "policy",
        "language": "en",
        # Numeric - for range queries
        "page_number": 5,
        "word_count": 250,
        "created_year": 2024,
        # Boolean - for flags
        "is_current": True,
        "requires_approval": False,
        # Text - free-form strings (partial matching support varies by database)
        "source_file": "electronics_manual.pdf",
        "author": "John Smith"
    }
}
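Many vector stores accept only flat metadata with scalar values, and sometimes lists of strings; the exact rules differ by database. The sketch below checks a metadata dict before upload. The allowed types and the helper name `validate_metadata` are assumptions for illustration, not any library's API.

```python
from typing import Any

# Assumed allowed scalar types; check your database's documentation.
SCALAR_TYPES = (str, int, float, bool)


def validate_metadata(metadata: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks OK."""
    problems = []
    for key, value in metadata.items():
        if isinstance(value, SCALAR_TYPES):
            continue
        # Lists of strings are commonly allowed for tag-like fields.
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        problems.append(f"{key}: unsupported type {type(value).__name__}")
    return problems


metadata = {
    "department": "electronics",
    "created_year": 2024,
    "is_current": True,
    "tags": ["returns", "warranty"],
    "author": {"name": "John Smith"},  # nested dict: flagged below
}
problems = validate_metadata(metadata)
print(problems)
```

Running a check like this at ingestion time surfaces schema drift early, before a filter silently matches nothing.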
Filtering Patterns by Database
Pinecone
# Exact match
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"department": {"$eq": "electronics"}}
)
# Multiple conditions (AND)
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "$and": [
            {"department": {"$eq": "electronics"}},
            {"is_current": {"$eq": True}},
            {"created_year": {"$gte": 2023}}
        ]
    }
)
# OR conditions
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "$or": [
            {"department": {"$eq": "electronics"}},
            {"department": {"$eq": "computers"}}
        ]
    }
)
# IN operator (multiple values)
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"department": {"$in": ["electronics", "computers", "phones"]}}
)
Qdrant
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
# Note: recent qdrant-client releases mark client.search as deprecated in favor
# of client.query_points; check which one your installed version expects.
# Exact match
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="department", match=MatchValue(value="electronics"))
        ]
    )
)
# Range query
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="created_year",
                range=Range(gte=2023, lte=2024)
            )
        ]
    )
)
# Complex conditions: every "must" clause AND at least one "should" clause
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="is_current", match=MatchValue(value=True))
        ],
        should=[  # OR conditions
            FieldCondition(key="department", match=MatchValue(value="electronics")),
            FieldCondition(key="department", match=MatchValue(value="computers"))
        ]
    )
)
Chroma
# Exact match
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={"department": "electronics"}
)
# Multiple conditions (AND)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={
        "$and": [
            {"department": {"$eq": "electronics"}},
            {"is_current": {"$eq": True}}
        ]
    }
)
# OR conditions
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={
        "$or": [
            {"department": {"$eq": "electronics"}},
            {"department": {"$eq": "computers"}}
        ]
    }
)
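When clauses are assembled in code, the list passed to `$and` can end up with zero or one entries. Some Chroma versions reject `$and`/`$or` lists with fewer than two expressions; treat that as an assumption to verify against your installed version. A small hypothetical helper, `combine_and`, sidesteps the issue by collapsing the trivial cases:

```python
from typing import Any, Optional

Clause = dict[str, Any]


def combine_and(clauses: list[Clause]) -> Optional[Clause]:
    """Combine clauses with $and, collapsing the zero- and one-clause cases."""
    if not clauses:
        return None          # no filter at all
    if len(clauses) == 1:
        return clauses[0]    # a single clause needs no $and wrapper
    return {"$and": clauses}


single = combine_and([{"department": {"$eq": "electronics"}}])
both = combine_and([
    {"department": {"$eq": "electronics"}},
    {"is_current": {"$eq": True}},
])
print(single)
print(both)
```

The result can be passed as `where=` (or `filter=` in databases that share the operator syntax), omitting the argument when the helper returns `None`.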
Filter Pushdown vs Post-Filtering
| Strategy | How It Works | Performance |
|---|---|---|
| Filter Pushdown | Applies the filter during the vector search | Fast; returns up to k matching results when enough exist |
| Post-Filtering | Filters after retrieving the top-k | May return fewer than k results |
# Filter pushdown (preferred) - database filters during search
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"department": "electronics"}  # Applied during search
)
# Returns up to 10 results, all matching the filter
# Post-filtering (fallback) - filter after retrieval
# include_metadata=True is needed so each match carries its metadata
results = index.query(vector=query_embedding, top_k=100, include_metadata=True)
filtered = [
    m for m in results.matches
    if m.metadata.get("department") == "electronics"
][:10]
# May return fewer than 10 if not enough matches in top-100
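To see why post-filtering can come up short, here is a toy brute-force sketch in plain Python. It is not how any particular database implements filtered search; `pre_filter_search`, `post_filter_search`, and the sample data are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query, docs, k):
    """Brute-force nearest neighbours by cosine similarity."""
    return sorted(docs, key=lambda d: cosine(query, d["vector"]), reverse=True)[:k]


def pre_filter_search(query, docs, k, department):
    # Narrow the candidate set first, then rank only the survivors.
    candidates = [d for d in docs if d["department"] == department]
    return top_k(query, candidates, k)


def post_filter_search(query, docs, k, department, fetch):
    # Rank everything, keep the top `fetch`, then drop non-matching hits.
    hits = top_k(query, docs, fetch)
    return [d for d in hits if d["department"] == department][:k]


docs = [
    {"id": "f1", "department": "furniture", "vector": [1.0, 0.0]},
    {"id": "f2", "department": "furniture", "vector": [0.9, 0.1]},
    {"id": "f3", "department": "furniture", "vector": [0.9, 0.2]},
    {"id": "e1", "department": "electronics", "vector": [0.8, 0.6]},
    {"id": "e2", "department": "electronics", "vector": [0.6, 0.8]},
    {"id": "e3", "department": "electronics", "vector": [0.0, 1.0]},
]
query = [1.0, 0.0]

pre = pre_filter_search(query, docs, k=2, department="electronics")
post = post_filter_search(query, docs, k=2, department="electronics", fetch=4)
print([d["id"] for d in pre])   # two electronics documents
print([d["id"] for d in post])  # only one survives the post-filter
```

The furniture documents crowd the unfiltered top 4, so the post-filter finds a single electronics hit even though three exist. Raising `fetch` helps, but the right value depends on how selective the filter is.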
Namespace Strategies
Organize data for efficient retrieval:
# Pinecone namespaces
index.upsert(
    vectors=[{"id": "doc1", "values": embedding, "metadata": {...}}],
    namespace="electronics"  # Separate index partition
)
# Query specific namespace
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace="electronics"
)
# Query the default namespace
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace=""  # Empty string = default namespace, not all namespaces
)
# A query targets one namespace; to search several, query each and merge the results
When to use namespaces:
- Multi-tenant applications (separate by tenant_id)
- Distinct document types (policies vs FAQs)
- Language separation
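A Pinecone query targets one namespace at a time, so searching across several namespaces means issuing one query per namespace and merging the results. The sketch below shows only the merge step. `search_namespaces` is a hypothetical helper, `query_one` stands in for whatever function runs a single-namespace query, and the scores are assumed to be comparable (same embedding model and metric, higher is better).

```python
from typing import Callable

Match = tuple[str, float]  # (document id, similarity score)


def search_namespaces(
    query_one: Callable[[str], list[Match]],
    namespaces: list[str],
    top_k: int,
) -> list[Match]:
    """Query each namespace separately and keep the best top_k overall."""
    merged: list[Match] = []
    for namespace in namespaces:
        merged.extend(query_one(namespace))
    # Higher score = more similar (an assumption about the metric in use).
    merged.sort(key=lambda match: match[1], reverse=True)
    return merged[:top_k]


# Stand-in for real per-namespace query results.
fake_index = {
    "electronics": [("e1", 0.91), ("e2", 0.72)],
    "computers": [("c1", 0.88), ("c2", 0.40)],
}
results = search_namespaces(
    lambda namespace: fake_index[namespace],
    ["electronics", "computers"],
    top_k=3,
)
print(results)
```

If cross-namespace queries are common, a metadata field with a filter may be a better fit than namespaces, since one filtered query replaces several separate ones.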
Dynamic Filtering
Build filters from user context:
def build_filter(user_query: str, user_context: dict):
    """Build filter based on user context."""
    filters = []
    # User's department access
    if user_context.get("departments"):
        filters.append({
            "department": {"$in": user_context["departments"]}
        })
    # Only current documents
    filters.append({"is_current": {"$eq": True}})
    # Language preference
    if user_context.get("language"):
        filters.append({
            "language": {"$eq": user_context["language"]}
        })
    return {"$and": filters} if filters else {}
# Usage
user_context = {
    "departments": ["electronics", "computers"],
    "language": "en"
}
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter=build_filter(query, user_context)
)
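One thing to watch when a filter doubles as access control: in `build_filter`, a user with an empty `departments` list gets no department clause at all, so the query spans every department. A fail-closed variant might look like the sketch below. `build_access_filter` is a hypothetical name, and returning `None` to mean "do not query" is a design assumption, not a database convention.

```python
from typing import Any, Optional


def build_access_filter(user_context: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a filter dict, or None when the user may not see any documents."""
    departments = user_context.get("departments") or []
    if not departments:
        # Fail closed: no department access means no search at all.
        return None

    clauses: list[dict[str, Any]] = [
        {"department": {"$in": list(departments)}},
        {"is_current": {"$eq": True}},
    ]
    language = user_context.get("language")
    if language:
        clauses.append({"language": {"$eq": language}})
    return {"$and": clauses}


allowed = build_access_filter({"departments": ["electronics"], "language": "en"})
denied = build_access_filter({"departments": []})
print(allowed)
print(denied)
```

The caller checks for `None` and returns an empty result set instead of running the query.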
Best Practices
| Practice | Benefit |
|---|---|
| Index metadata fields | Faster filtering |
| Use categorical over text | Exact matching |
| Limit filter complexity | Query performance |
| Pre-compute common filters | Reduce runtime processing |
| Test with realistic data | Validate filter selectivity |
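Filter selectivity, meaning the fraction of documents that pass a filter, can be estimated from a sample of metadata before running any vector queries. This is a minimal sketch; `filter_selectivity` and the sample records are illustrative only.

```python
from typing import Any, Callable


def filter_selectivity(
    metadatas: list[dict[str, Any]],
    predicate: Callable[[dict[str, Any]], bool],
) -> float:
    """Fraction of sampled metadata records that pass the predicate."""
    if not metadatas:
        return 0.0
    matched = sum(1 for metadata in metadatas if predicate(metadata))
    return matched / len(metadatas)


sample = [
    {"department": "electronics", "is_current": True},
    {"department": "electronics", "is_current": False},
    {"department": "furniture", "is_current": True},
    {"department": "computers", "is_current": True},
]
selectivity = filter_selectivity(
    sample,
    lambda m: m["department"] == "electronics" and m["is_current"],
)
print(selectivity)
```

A very low value suggests the filter keeps only a small slice of the collection, which is where post-filtering is most likely to return fewer than k results.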
Design Principle: The best RAG systems use metadata to narrow the search space before semantic similarity. Think of filters as "where to look" and vectors as "what to find."
In the next module, we'll explore advanced chunking strategies that directly impact retrieval quality.