Embedding Models & Vector Databases
Embedding Model Comparison
3 min read
The embedding model is the foundation of semantic search. Choosing the right model dramatically impacts retrieval quality.
What Embeddings Do
Embeddings convert text into dense vectors that capture semantic meaning:
```python
from openai import OpenAI

client = OpenAI()

# Same meaning, different words → similar vectors
text1 = "The cat sat on the mat"
text2 = "A feline rested on the rug"

emb1 = client.embeddings.create(input=text1, model="text-embedding-3-small").data[0].embedding
emb2 = client.embeddings.create(input=text2, model="text-embedding-3-small").data[0].embedding

# Cosine similarity between emb1 and emb2 will be high (~0.85+)
```
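To make "similar vectors" concrete, here is a minimal cosine-similarity check for the two embeddings above; it assumes the `emb1`/`emb2` lists from the previous snippet and uses NumPy, which is not part of the original example.

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(emb1, emb2))  # high, e.g. ~0.85+
```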
Model Categories
| Category | Examples | Best For |
|---|---|---|
| Commercial APIs | OpenAI, Cohere, Voyage | Production, ease of use |
| Open Source | BGE, E5, GTE | Privacy, cost control, customization |
| Domain-Specific | Legal-BERT, BioBERT | Specialized domains |
Commercial API Models
OpenAI Embeddings
```python
from openai import OpenAI

client = OpenAI()

def embed_openai(texts: list[str], model: str = "text-embedding-3-small"):
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Models available:
# text-embedding-3-small: 1536 dims, $0.02/1M tokens
# text-embedding-3-large: 3072 dims, $0.13/1M tokens
# text-embedding-ada-002: 1536 dims (legacy)
```
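The text-embedding-3 models also accept an optional `dimensions` parameter that truncates vectors to a smaller size, trading a little quality for cheaper storage and faster search. A minimal sketch (the 256 value is just an example):

```python
from openai import OpenAI

client = OpenAI()

# Request truncated vectors from a text-embedding-3 model
response = client.embeddings.create(
    input="The cat sat on the mat",
    model="text-embedding-3-large",
    dimensions=256,  # smaller vectors → lower storage and search cost
)
print(len(response.data[0].embedding))  # 256
```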
Cohere Embed
```python
import cohere

co = cohere.Client()

def embed_cohere(texts: list[str], input_type: str = "search_document"):
    response = co.embed(
        texts=texts,
        model="embed-english-v3.0",
        input_type=input_type,  # "search_document" or "search_query"
    )
    return response.embeddings

# Using separate modes for documents vs. queries improves retrieval
```
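To illustrate why the two input types matter, the sketch below reuses the `embed_cohere` helper above plus NumPy (both assumptions beyond the original snippet): documents are embedded with `search_document`, the question with `search_query`, and passages are ranked by cosine similarity.

```python
import numpy as np

docs = [
    "Refund policy: items can be returned within 30 days.",
    "Shipping usually takes 3-5 business days.",
]
doc_vecs = np.array(embed_cohere(docs, input_type="search_document"))
query_vec = np.array(embed_cohere(["How long do I have to return an item?"],
                                  input_type="search_query")[0])

# Normalize, then dot product = cosine similarity
doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = query_vec / np.linalg.norm(query_vec)
scores = doc_vecs @ query_vec
print(docs[int(scores.argmax())])  # → the refund/returns passage
```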
Voyage AI
```python
import voyageai

vo = voyageai.Client()

def embed_voyage(texts: list[str]):
    response = vo.embed(
        texts,
        model="voyage-large-2",
        input_type="document",  # or "query"
    )
    return response.embeddings

# Known for excellent code and legal domain performance
```
Open Source Models
BGE (BAAI General Embedding)
```python
from sentence_transformers import SentenceTransformer

# English BGE model; for multilingual use, see BAAI/bge-m3
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

def embed_bge(texts: list[str]):
    # Passages are encoded as-is; normalized vectors allow cosine similarity via dot product
    return model.encode(texts, normalize_embeddings=True)

def embed_bge_query(query: str):
    # BGE recommends prefixing queries with an instruction for better retrieval
    instruction = "Represent this sentence for searching relevant passages: "
    return model.encode(instruction + query, normalize_embeddings=True)
```
E5 (EmbEddings from bidirEctional Encoder rEpresentations)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-large-v2')

def embed_e5(texts: list[str], is_query: bool = False):
    # E5 requires "query: " / "passage: " prefixes
    if is_query:
        texts = [f"query: {t}" for t in texts]
    else:
        texts = [f"passage: {t}" for t in texts]
    return model.encode(texts, normalize_embeddings=True)
```
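A quick usage sketch of the helper above (the passages and query are invented examples): because the vectors are normalized, a plain dot product gives cosine similarity for ranking.

```python
passages = ["The Eiffel Tower is in Paris.", "Mount Fuji is Japan's tallest peak."]
passage_vecs = embed_e5(passages)  # "passage: " prefix applied
query_vec = embed_e5(["Where is the Eiffel Tower?"], is_query=True)[0]

# Normalized vectors → dot product equals cosine similarity
scores = passage_vecs @ query_vec
print(passages[int(scores.argmax())])  # → "The Eiffel Tower is in Paris."
```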
MTEB Benchmark Comparison
The Massive Text Embedding Benchmark (MTEB) provides standardized comparisons:
| Model | MTEB Score | Dimensions | Speed |
|---|---|---|---|
| voyage-large-2 | 68.28 | 1536 | Fast (API) |
| text-embedding-3-large | 64.59 | 3072 | Fast (API) |
| bge-large-en-v1.5 | 64.23 | 1024 | Medium |
| text-embedding-3-small | 62.26 | 1536 | Fast (API) |
| e5-large-v2 | 62.25 | 1024 | Medium |
Note: MTEB scores vary by task type. Check retrieval-specific benchmarks for RAG.
Choosing the Right Model
```
START
  │
  ▼
Data must stay on-premise?
  │
  ├─ YES → Open source (BGE, E5)
  │
  ▼
Specialized domain (legal, medical, code)?
  │
  ├─ YES → Domain-specific or Voyage
  │
  ▼
Multilingual requirements?
  │
  ├─ YES → BGE-M3 or Cohere multilingual
  │
  ▼
Budget constrained?
  │
  ├─ YES → text-embedding-3-small or open source
  │
  ▼
Default → text-embedding-3-small (best cost/quality ratio)
```
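The same flow can be codified in a few lines. This is only a sketch of the flowchart above, with hypothetical flag names, not a definitive selection policy.

```python
def choose_embedding_model(
    on_premise: bool = False,
    specialized_domain: bool = False,
    multilingual: bool = False,
    budget_constrained: bool = False,
) -> str:
    """Mirror the decision tree: the first matching constraint wins."""
    if on_premise:
        return "open source (BGE, E5)"
    if specialized_domain:
        return "domain-specific model or Voyage"
    if multilingual:
        return "BGE-M3 or Cohere multilingual"
    if budget_constrained:
        return "text-embedding-3-small or open source"
    return "text-embedding-3-small"  # default: best cost/quality ratio
```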
Implementation Tips
```python
class EmbeddingManager:
    """Manage embeddings with batching and caching."""

    def __init__(self, model_name: str, batch_size: int = 100):
        self.model_name = model_name
        self.batch_size = batch_size
        self.cache: dict[str, list[float]] = {}

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Only embed texts we haven't seen before
        uncached = [t for t in texts if t not in self.cache]
        if uncached:
            # Batch requests for efficiency
            for i in range(0, len(uncached), self.batch_size):
                batch = uncached[i:i + self.batch_size]
                embeddings = self._embed_batch(batch)
                for text, emb in zip(batch, embeddings):
                    self.cache[text] = emb
        return [self.cache[t] for t in texts]

    def _embed_batch(self, texts: list[str]) -> list[list[float]]:
        # Implementation depends on the model/provider; override in a subclass
        raise NotImplementedError
```
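For example, a provider-specific subclass could supply `_embed_batch`. The `OpenAIEmbeddingManager` below is an illustrative assumption, not part of the original design.

```python
from openai import OpenAI

class OpenAIEmbeddingManager(EmbeddingManager):
    """Hypothetical subclass backing _embed_batch with the OpenAI embeddings API."""

    def __init__(self, model_name: str = "text-embedding-3-small", batch_size: int = 100):
        super().__init__(model_name, batch_size)
        self.client = OpenAI()

    def _embed_batch(self, texts: list[str]) -> list[list[float]]:
        response = self.client.embeddings.create(input=texts, model=self.model_name)
        return [item.embedding for item in response.data]

manager = OpenAIEmbeddingManager()
vectors = manager.embed(["chunk one", "chunk two"])
vectors_again = manager.embed(["chunk one"])  # served from the cache, no API call
```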
Cost Tip: OpenAI's text-embedding-3-small offers the best cost-to-quality ratio for most use cases. Only upgrade to larger models if retrieval quality demonstrably improves on your data.
Next, let's explore vector database options for storing and searching embeddings.