Embedding Models & Vector Databases

Embedding Model Comparison

The embedding model is the foundation of semantic search. Choosing the right model dramatically impacts retrieval quality.

What Embeddings Do

Embeddings convert text into dense vectors that capture semantic meaning:

from openai import OpenAI

client = OpenAI()

# Same meaning, different words → similar vectors
text1 = "The cat sat on the mat"
text2 = "A feline rested on the rug"

emb1 = client.embeddings.create(input=text1, model="text-embedding-3-small")
emb2 = client.embeddings.create(input=text2, model="text-embedding-3-small")

# Cosine similarity between these vectors will be high relative to unrelated sentences
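
To check this directly, you can compute the cosine similarity yourself; a minimal sketch assuming numpy is installed:

import numpy as np

v1 = np.array(emb1.data[0].embedding)
v2 = np.array(emb2.data[0].embedding)

# Dot product divided by the product of the vector norms
similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(similarity)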

Model Categories

| Category | Examples | Best For |
| --- | --- | --- |
| Commercial APIs | OpenAI, Cohere, Voyage | Production, ease of use |
| Open Source | BGE, E5, GTE | Privacy, cost control, customization |
| Domain-Specific | Legal-BERT, BioBERT | Specialized domains |

Commercial API Models

OpenAI Embeddings

from openai import OpenAI

client = OpenAI()

def embed_openai(texts: list[str], model: str = "text-embedding-3-small"):
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Models available:
# text-embedding-3-small: 1536 dims, $0.02/1M tokens
# text-embedding-3-large: 3072 dims, $0.13/1M tokens
# text-embedding-ada-002: 1536 dims (legacy)
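
The text-embedding-3 models also accept a dimensions parameter that truncates vectors to a smaller size, trading a little quality for cheaper storage and faster search. A quick sketch:

# Request shortened vectors (e.g. 256 dims) from text-embedding-3-large
response = client.embeddings.create(
    input=["example chunk of text"],
    model="text-embedding-3-large",
    dimensions=256,
)
print(len(response.data[0].embedding))  # 256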

Cohere Embed

import cohere

co = cohere.Client()

def embed_cohere(texts: list[str], input_type: str = "search_document"):
    response = co.embed(
        texts=texts,
        model="embed-english-v3.0",
        input_type=input_type  # "search_document" or "search_query"
    )
    return response.embeddings

# Using separate input types for documents and queries improves retrieval quality
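
In practice, that means indexing documents with one input type and embedding queries with the other; for example (the texts below are illustrative):

doc_vectors = embed_cohere(
    ["Cosine similarity measures the angle between two vectors."],
    input_type="search_document",
)
query_vectors = embed_cohere(
    ["How do I compare two embeddings?"],
    input_type="search_query",
)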

Voyage AI

import voyageai

vo = voyageai.Client()

def embed_voyage(texts: list[str]):
    response = vo.embed(
        texts,
        model="voyage-large-2",
        input_type="document"  # or "query"
    )
    return response.embeddings

# Known for excellent code and legal domain performance

Open Source Models

BGE (BAAI General Embedding)

from sentence_transformers import SentenceTransformer

# BGE models offer strong English retrieval; the BGE-M3 variant adds multilingual support
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

def embed_bge(texts: list[str]):
    # Documents are embedded as-is; normalizing lets cosine similarity reduce to a dot product
    return model.encode(texts, normalize_embeddings=True)

def embed_bge_query(query: str):
    # Prefix for queries improves retrieval
    instruction = "Represent this sentence for searching relevant passages: "
    return model.encode(instruction + query, normalize_embeddings=True)
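
A small end-to-end illustration, assuming the sentence-transformers util helpers, scoring a prefixed query against unprefixed documents:

from sentence_transformers import util

docs = ["Vector databases store and index embeddings.", "Bananas are rich in potassium."]
doc_embs = embed_bge(docs)
query_emb = embed_bge_query("Where are embeddings stored?")

scores = util.cos_sim(query_emb, doc_embs)  # the first document should score highest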

E5 (Embeddings from Bidirectional Encoder Representations)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-large-v2')

def embed_e5(texts: list[str], is_query: bool = False):
    # E5 requires prefixes
    if is_query:
        texts = [f"query: {t}" for t in texts]
    else:
        texts = [f"passage: {t}" for t in texts]
    return model.encode(texts, normalize_embeddings=True)

MTEB Benchmark Comparison

The Massive Text Embedding Benchmark (MTEB) provides standardized comparisons:

| Model | MTEB Score | Dimensions | Speed |
| --- | --- | --- | --- |
| voyage-large-2 | 68.28 | 1536 | Fast (API) |
| text-embedding-3-large | 64.59 | 3072 | Fast (API) |
| bge-large-en-v1.5 | 64.23 | 1024 | Medium |
| text-embedding-3-small | 62.26 | 1536 | Fast (API) |
| e5-large-v2 | 62.25 | 1024 | Medium |

Note: MTEB scores vary by task type. Check retrieval-specific benchmarks for RAG.
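
If you want numbers on tasks closer to your own data, the mteb package can run individual benchmarks against any sentence-transformers model; a rough sketch (the task choice here is illustrative):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# SciFact is one of the MTEB retrieval tasks; pick tasks that resemble your corpus
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="mteb_results")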

Choosing the Right Model

Work through these questions in order and take the first branch that applies:

1. Data must stay on-premise? → Open source (BGE, E5)
2. Specialized domain (legal, medical, code)? → Domain-specific model or Voyage
3. Multilingual requirements? → BGE-M3 or Cohere multilingual
4. Budget constrained? → text-embedding-3-small or open source
5. None of the above → text-embedding-3-small (best cost/quality ratio)

Implementation Tips

class EmbeddingManager:
    """Manage embeddings with batching and caching."""

    def __init__(self, model_name: str, batch_size: int = 100):
        self.model_name = model_name
        self.batch_size = batch_size
        self.cache = {}

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Embed only texts not already cached (deduplicated, order preserved)
        uncached = list(dict.fromkeys(t for t in texts if t not in self.cache))

        if uncached:
            # Batch for efficiency
            for i in range(0, len(uncached), self.batch_size):
                batch = uncached[i:i + self.batch_size]
                embeddings = self._embed_batch(batch)
                for text, emb in zip(batch, embeddings):
                    self.cache[text] = emb

        return [self.cache[t] for t in texts]

    def _embed_batch(self, texts: list[str]) -> list[list[float]]:
        # Subclasses implement this for a specific model or API
        raise NotImplementedError
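
As one hypothetical concrete subclass, _embed_batch can be wired to the OpenAI client shown earlier:

from openai import OpenAI

client = OpenAI()

class OpenAIEmbeddingManager(EmbeddingManager):
    def _embed_batch(self, texts: list[str]) -> list[list[float]]:
        response = client.embeddings.create(input=texts, model=self.model_name)
        return [item.embedding for item in response.data]

manager = OpenAIEmbeddingManager("text-embedding-3-small")
vectors = manager.embed(["first chunk", "second chunk"])
vectors = manager.embed(["first chunk"])  # served from the cache, no API call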

Cost Tip: OpenAI's text-embedding-3-small offers the best cost-to-quality ratio for most use cases. Only upgrade to larger models if retrieval quality demonstrably improves on your data.

Next, let's explore vector database options for storing and searching embeddings.
