Search, Recommendation & Analytics Systems
Search, recommendation, and analytics are three pillars of most modern products. In system design interviews, you will often be asked to design systems that combine them. This lesson covers the internals you need to explain clearly at the whiteboard.
Full-Text Search Internals
Full-text search transforms unstructured text into a structure optimized for fast retrieval. The pipeline flows through five stages:
Document → Tokenizer → Inverted Index → Query Processing → Ranking (BM25)
               ↓               ↓
           "running"     term → [doc1:2, doc3:1]
           "shoes"       (posting lists with
           "trail"        term frequencies)
Stage 1: Tokenization
Tokenization breaks raw text into searchable terms. A production tokenizer performs multiple transformations:
| Step | Input | Output |
|---|---|---|
| Lowercasing | "Running Shoes" | "running shoes" |
| Splitting | "running shoes" | ["running", "shoes"] |
| Stop-word removal | ["the", "running", "shoes"] | ["running", "shoes"] |
| Stemming | ["running", "shoes"] | ["run", "shoe"] |
Stage 2: Inverted Index
An inverted index maps each term to the list of documents containing it. Each entry in the posting list stores the document ID and the term frequency (how many times the term appears in that document).
Inverted Index:
─────────────────────────────────────────
term → posting list
─────────────────────────────────────────
"run" → [(doc1, tf=3), (doc4, tf=1)]
"shoe" → [(doc1, tf=2), (doc2, tf=5)]
"trail" → [(doc3, tf=1), (doc4, tf=2)]
─────────────────────────────────────────
With a hash-based term dictionary, the inverted index gives O(1) average-case lookup per term. Multi-term queries intersect or union posting lists depending on the boolean operator (AND vs OR).
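A conjunctive (AND) query can be answered by a linear merge of the two posting lists. A minimal sketch, assuming each list is kept sorted by docId:

```typescript
interface Posting { docId: number; tf: number; }

// Intersect two posting lists sorted by docId (AND semantics).
// Linear two-pointer merge: O(m + n), the standard approach for conjunctive queries.
function intersect(a: Posting[], b: Posting[]): Posting[] {
  const result: Posting[] = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i].docId === b[j].docId) {
      // Keep the entry from list a; a real engine would combine scores here.
      result.push(a[i]);
      i++; j++;
    } else if (a[i].docId < b[j].docId) {
      i++;
    } else {
      j++;
    }
  }
  return result;
}
```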
Stage 3: TF-IDF Scoring
TF-IDF (Term Frequency - Inverse Document Frequency) weighs terms by how informative they are:
- TF(t, d) = count of term t in document d / total terms in d
- IDF(t) = log(N / df(t)), where N = total documents, df(t) = documents containing t
A term that appears in every document (like "the") gets a low IDF. A rare term (like "kubernetes") gets a high IDF.
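The two formulas translate directly to code (natural log assumed for the IDF):

```typescript
// TF-IDF from the formulas above.
function tf(termCount: number, docLength: number): number {
  return docLength === 0 ? 0 : termCount / docLength;
}

function idf(totalDocs: number, docFrequency: number): number {
  return Math.log(totalDocs / docFrequency);
}

function tfIdf(
  termCount: number,
  docLength: number,
  totalDocs: number,
  docFrequency: number
): number {
  return tf(termCount, docLength) * idf(totalDocs, docFrequency);
}

// A term in every document gets IDF = log(N/N) = 0, contributing nothing to the score.
```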
Stage 4: BM25 Scoring
BM25 (Best Matching 25), developed by Robertson and Zaragoza, improves on TF-IDF with two key refinements: term frequency saturation and document length normalization.
The BM25 score for a query Q containing terms q1, q2, ..., qn against document D is:
BM25(D, Q) = SUM over qi of:

                          f(qi, D) * (k1 + 1)
     IDF(qi) * ─────────────────────────────────────────────
                f(qi, D) + k1 * (1 - b + b * |D| / avgdl)

Where:
    f(qi, D) = frequency of term qi in document D
    |D|      = length of document D (in terms)
    avgdl    = average document length across the corpus
    k1       = 1.2  (term frequency saturation parameter)
    b        = 0.75 (document length normalization parameter)
Why k1 = 1.2? It controls how quickly term frequency saturates. With k1 = 1.2, a term appearing 3 times scores only marginally more than a term appearing 2 times. This prevents keyword stuffing from dominating relevance.
Why b = 0.75? It controls how much document length penalizes the score. At b = 0.75, longer documents are penalized (a term appearing once in a 1000-word document scores lower than in a 100-word document), but not as harshly as b = 1.0 would.
BM25 Practical Example
Suppose we have 3 documents and query Q = "trail running shoes":
| Document | Length (terms) | "trail" freq | "running" freq | "shoes" freq |
|---|---|---|---|---|
| D1: Trail shoe review | 50 | 3 | 0 | 2 |
| D2: Best running shoes guide | 200 | 0 | 5 | 4 |
| D3: Trail running shoe comparison | 80 | 2 | 3 | 3 |
With avgdl = 110, k1 = 1.2, b = 0.75, and N = 3:
- D3 scores highest because it contains all three query terms and has a moderate document length
- D2 has high term frequencies for "running" and "shoes" but zero for "trail", and its length (200) penalizes it
- D1 has "trail" but lacks "running"
interface PostingEntry {
docId: string;
termFrequency: number;
}
interface BM25Config {
k1: number; // 1.2 — term frequency saturation
b: number; // 0.75 — document length normalization
}
function computeBM25(
queryTerms: string[],
docId: string,
index: Map<string, PostingEntry[]>,
docLengths: Map<string, number>,
avgDocLength: number,
totalDocs: number,
config: BM25Config = { k1: 1.2, b: 0.75 }
): number {
let score = 0;
const docLength = docLengths.get(docId) ?? 0;
for (const term of queryTerms) {
const postings = index.get(term) ?? [];
const df = postings.length; // document frequency
if (df === 0) continue;
const entry = postings.find(p => p.docId === docId);
const tf = entry?.termFrequency ?? 0;
if (tf === 0) continue;
// IDF component: log((N - df + 0.5) / (df + 0.5) + 1)
const idf = Math.log(
(totalDocs - df + 0.5) / (df + 0.5) + 1
);
// BM25 numerator and denominator
const numerator = tf * (config.k1 + 1);
const denominator =
tf + config.k1 * (1 - config.b + config.b * docLength / avgDocLength);
score += idf * (numerator / denominator);
}
return score;
}
Typeahead and Autocomplete
Autocomplete must return results within 100ms to feel instantaneous. Two main approaches exist:
Trie-Based Autocomplete
A trie stores every prefix of every searchable term. When the user types "ru", the system traverses to the "r" -> "u" node and returns all completions ranked by popularity.
(root)
├─ r
│  └─ u
│     ├─ n ..................... [run: 450]
│     │  └─ n ─ i ─ n ─ g ...... [running: 320]
│     └─ s ─ h ................. [rush: 30]
└─ s ─ h ─ o ─ e ............... [shoe: 800]
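A minimal trie sketch of the structure above; the node layout and depth-first collection are one of several possible designs:

```typescript
interface TrieNode {
  children: Map<string, TrieNode>;
  score: number | null; // popularity if a term ends here, else null
}

class Trie {
  private root: TrieNode = { children: new Map(), score: null };

  insert(term: string, score: number): void {
    let node = this.root;
    for (const ch of term) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), score: null });
      }
      node = node.children.get(ch)!;
    }
    node.score = score;
  }

  // Walk to the prefix node, then collect all completions beneath it,
  // ranked by popularity descending.
  complete(prefix: string, limit = 10): string[] {
    let node = this.root;
    for (const ch of prefix) {
      const next = node.children.get(ch);
      if (!next) return [];
      node = next;
    }
    const results: { term: string; score: number }[] = [];
    const dfs = (n: TrieNode, path: string): void => {
      if (n.score !== null) results.push({ term: prefix + path, score: n.score });
      for (const [ch, child] of n.children) dfs(child, path + ch);
    };
    dfs(node, "");
    return results
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map(r => r.term);
  }
}

// With the terms from the diagram, complete("ru") -> ["run", "running", "rush"]
```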
Prefix Search with Popularity Ranking
For large-scale systems, a sorted index of terms with prefix filtering is more practical than an in-memory trie:
interface AutocompleteEntry {
  term: string;
  score: number; // popularity or recency weight
}
class AutocompleteService {
  private entries: AutocompleteEntry[];
  constructor(entries: AutocompleteEntry[]) {
    // Normalize terms and sort so the prefix range can be found by binary search
    this.entries = entries
      .map(e => ({ ...e, term: e.term.toLowerCase() }))
      .sort((a, b) => (a.term < b.term ? -1 : a.term > b.term ? 1 : 0));
  }
  // Binary search for the first entry whose term is >= prefix
  private lowerBound(prefix: string): number {
    let lo = 0;
    let hi = this.entries.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.entries[mid].term < prefix) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }
  suggest(prefix: string, limit: number = 10): AutocompleteEntry[] {
    const lowerPrefix = prefix.toLowerCase();
    // Entries matching the prefix form a contiguous range in the sorted array,
    // so scan forward from the lower bound until the prefix no longer matches
    const matches: AutocompleteEntry[] = [];
    for (let i = this.lowerBound(lowerPrefix); i < this.entries.length; i++) {
      if (!this.entries[i].term.startsWith(lowerPrefix)) break;
      matches.push(this.entries[i]);
    }
    // Rank by popularity score descending
    return matches
      .sort((a, b) => b.score - a.score)
      .slice(0, limit);
  }
}
In production, Elasticsearch provides a completion suggester backed by an FST (finite state transducer) that achieves sub-millisecond prefix lookups.
Recommendation Systems
Recommendation systems predict what users want before they search. Three approaches form the foundation of every recommendation engine.
Collaborative Filtering
Collaborative filtering works on the principle that users who agreed in the past will agree in the future. It requires no knowledge of item content.
User-Based CF: Find users similar to the target user, recommend items those similar users liked.
Item-Based CF: Find items similar to items the target user liked, recommend those similar items.
User-Item Rating Matrix:
────────────────────────────────────
Item A Item B Item C Item D
User 1 5 3 ? 1
User 2 4 ? 2 1
User 3 ? 3 5 4
User 4 5 4 ? ?
────────────────────────────────────
User-Based CF for User 1, Item C:
1. Find similar users to User 1 (by cosine similarity on ratings)
2. User 4 is most similar (both rated A=5, B high)
3. But User 4 hasn't rated C either
4. User 3 rated C=5, and is somewhat similar
5. Predict User 1 would rate C ≈ 4
Cosine similarity between two user rating vectors:
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
const denominator = Math.sqrt(normA) * Math.sqrt(normB);
return denominator === 0 ? 0 : dotProduct / denominator;
}
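Given similarities to neighboring users, a missing rating is commonly predicted as a similarity-weighted average of the neighbors' ratings for that item. A sketch using the standard textbook weighting (not tied to any particular library):

```typescript
// Predict a user's rating for an item from neighbors who did rate it.
// Each entry pairs a neighbor's rating with that neighbor's cosine
// similarity to the target user.
function predictRating(
  neighborRatings: { rating: number; similarity: number }[]
): number | null {
  let weightedSum = 0;
  let similaritySum = 0;
  for (const { rating, similarity } of neighborRatings) {
    weightedSum += rating * similarity;
    similaritySum += Math.abs(similarity);
  }
  // No rated neighbors means no prediction is possible (cold start).
  return similaritySum === 0 ? null : weightedSum / similaritySum;
}

// predictRating([{ rating: 5, similarity: 0.6 }, { rating: 3, similarity: 0.2 }])
// = (5*0.6 + 3*0.2) / (0.6 + 0.2) = 4.5
```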
Content-Based Filtering
Content-based filtering recommends items similar to what the user previously liked, based on item features (genre, brand, price range, tags).
Item Feature Vectors:
────────────────────────────────────
Action Comedy Sci-Fi Runtime
"The Matrix" 0.9 0.1 0.9 0.7
"Die Hard" 0.9 0.3 0.1 0.6
"Inception" 0.8 0.1 0.8 0.8
────────────────────────────────────
User liked: "The Matrix", "Inception"
User profile vector: [0.85, 0.1, 0.85, 0.75] (average)
Most similar unwatched item: "Die Hard" (cosine sim ≈ 0.83)
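The profile construction above can be sketched as follows; `cosine` repeats the computation from the collaborative-filtering section so the block is self-contained:

```typescript
// Cosine similarity between two equal-length feature vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// The user profile is the element-wise average of liked-item vectors.
function buildProfile(likedVectors: number[][]): number[] {
  const dims = likedVectors[0].length;
  const profile = new Array(dims).fill(0);
  for (const v of likedVectors) {
    for (let i = 0; i < dims; i++) profile[i] += v[i] / likedVectors.length;
  }
  return profile;
}

// From the table: liking "The Matrix" and "Inception" yields
// profile [0.85, 0.1, 0.85, 0.75]; cosine against "Die Hard" ≈ 0.83.
```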
Cold Start Problem
Both approaches struggle when data is sparse:
| Scenario | Problem | Solution |
|---|---|---|
| New user | No ratings or history | Content-based from onboarding preferences |
| New item | No one has rated it yet | Content-based from item metadata |
| New system | Neither users nor items have data | Popularity-based fallback |
Hybrid Recommenders
Production systems combine both approaches. Netflix, for example, uses collaborative filtering for personalized recommendations, content-based filtering for "Because you watched X", and popularity-based ranking as a fallback for new users.
interface Recommendation {
itemId: string;
score: number;
source: "collaborative" | "content" | "popularity";
}
function hybridRecommend(
userId: string,
cfResults: Recommendation[],
cbResults: Recommendation[],
popularItems: Recommendation[],
userHistoryCount: number
): Recommendation[] {
// Cold start: fallback to content-based + popularity
if (userHistoryCount < 5) {
const blended = [
...cbResults.map(r => ({ ...r, score: r.score * 0.6 })),
...popularItems.map(r => ({ ...r, score: r.score * 0.4 })),
];
return deduplicateAndSort(blended);
}
// Warm user: blend collaborative and content-based
const blended = [
...cfResults.map(r => ({ ...r, score: r.score * 0.7 })),
...cbResults.map(r => ({ ...r, score: r.score * 0.3 })),
];
return deduplicateAndSort(blended);
}
function deduplicateAndSort(items: Recommendation[]): Recommendation[] {
const seen = new Map<string, Recommendation>();
for (const item of items) {
const existing = seen.get(item.itemId);
if (!existing || item.score > existing.score) {
seen.set(item.itemId, item);
}
}
return [...seen.values()].sort((a, b) => b.score - a.score);
}
Matrix Factorization
Matrix factorization decomposes the sparse user-item rating matrix into two lower-rank matrices. This was famously used in the Netflix Prize competition (2006-2009), where the winning team used SVD-based approaches.
R ≈ U × V^T
R (users x items) U (users x k) V (items x k)
[5 3 ? 1] [u1] [v1]
[4 ? 2 1] ≈ [u2] × [v2]^T
[? 3 5 4] [u3] [v3]
[5 4 ? ?] [u4] [v4]
k = latent factor dimensions (e.g., 50)
Classic SVD (Singular Value Decomposition) finds optimal low-rank factors, but it is defined only for fully observed matrices; recommenders instead minimize squared error over the observed ratings only. ALS (Alternating Least Squares) is the common algorithm for large-scale systems because it can be parallelized: fix U and solve for V, then fix V and solve for U, and repeat until convergence.
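Once the factors are learned, serving a prediction is just a dot product of latent vectors. A sketch with made-up factor values (k = 2 here for readability; real systems use k on the order of 50-200):

```typescript
// After factorization, a predicted rating is the dot product of the
// user's and item's latent factor vectors.
function predictFromFactors(userFactors: number[], itemFactors: number[]): number {
  let score = 0;
  for (let k = 0; k < userFactors.length; k++) {
    score += userFactors[k] * itemFactors[k];
  }
  return score;
}

// Hypothetical 2-dimensional factors (not from a real ALS run):
// predictFromFactors([1.2, 0.8], [0.5, 3.1]) = 0.6 + 2.48 = 3.08
```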
Analytics Pipelines: OLTP vs OLAP
Analytics systems require fundamentally different architectures from transactional systems.
| Property | OLTP | OLAP |
|---|---|---|
| Purpose | Record transactions | Analyze trends |
| Query pattern | Point lookups, small writes | Full table scans, aggregations |
| Data model | Normalized (3NF) | Denormalized (star schema) |
| Latency target | < 10ms | Seconds to minutes |
| Data freshness | Real-time | Minutes to hours |
| Examples | PostgreSQL, MySQL | BigQuery, Redshift, ClickHouse |
Star Schema Design
A star schema organizes analytical data around a central fact table surrounded by dimension tables:
dim_date
┌─────────┐
│date_key │
│day │
│month │
│quarter │
└────┬────┘
│
dim_product fact_sales dim_customer
┌──────────┐ ┌──────────────┐ ┌────────────┐
│product_id│ │sale_id │ │customer_id │
│name │◄───│product_id │────►│name │
│category │ │customer_id │ │segment │
│brand │ │date_key │ │region │
└──────────┘ │quantity │ └────────────┘
│revenue │
│discount │
└──────────────┘
Fact tables store measurable events (sales, clicks, page views). Dimension tables store descriptive attributes (product details, customer demographics, dates). This layout enables efficient aggregation queries like "total revenue by product category per quarter."
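An aggregation like "total revenue by product category per quarter" is a join from the fact table to two dimensions followed by a group-by. An in-memory sketch (a warehouse would run this as SQL; the field names mirror the diagram):

```typescript
interface FactSale { productId: string; dateKey: string; revenue: number; }
interface DimProduct { productId: string; category: string; }
interface DimDate { dateKey: string; quarter: string; }

// Join each fact row to its dimensions, then sum revenue per
// (category, quarter) group-by key.
function revenueByCategoryQuarter(
  facts: FactSale[],
  products: Map<string, DimProduct>,
  dates: Map<string, DimDate>
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const f of facts) {
    const category = products.get(f.productId)?.category ?? "unknown";
    const quarter = dates.get(f.dateKey)?.quarter ?? "unknown";
    const key = `${category}|${quarter}`;
    totals.set(key, (totals.get(key) ?? 0) + f.revenue);
  }
  return totals;
}
```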
ETL vs ELT
| Aspect | ETL | ELT |
|---|---|---|
| Transform step | Before loading into warehouse | After loading into warehouse |
| Tooling | Airflow, Informatica, custom scripts | dbt, BigQuery SQL, Snowflake |
| Best for | Legacy systems, complex transformations | Cloud warehouses with cheap compute |
| Data freshness | Batch (hourly/daily) | Near-real-time possible |
Stream Processing for Real-Time Analytics
When analytics require sub-minute freshness, stream processing replaces batch ETL:
Event Source → Kafka → Stream Processor → Analytics Store
(clicks) (buffer) (Flink/Spark) (ClickHouse)
↓
Aggregate in
sliding windows
(1min, 5min, 1hr)
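The windowed aggregation step can be sketched as a tumbling-window counter. This is a drastic simplification of what a stream processor like Flink does (no watermarks, no late-event handling, no state eviction):

```typescript
// Tumbling-window event counter: a simplified per-key aggregation.
// Window boundaries are aligned to multiples of windowMs.
class WindowedCounter {
  private counts = new Map<number, number>(); // windowStart -> event count
  constructor(private windowMs: number) {}

  record(timestampMs: number): void {
    const windowStart = Math.floor(timestampMs / this.windowMs) * this.windowMs;
    this.counts.set(windowStart, (this.counts.get(windowStart) ?? 0) + 1);
  }

  countFor(timestampMs: number): number {
    const windowStart = Math.floor(timestampMs / this.windowMs) * this.windowMs;
    return this.counts.get(windowStart) ?? 0;
  }
}

// 1-minute windows: events at t=0 and t=30s land in the same window,
// an event at t=61s lands in the next one.
```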
Elasticsearch Architecture
Elasticsearch is one of the most widely deployed search engines for application search. Understanding its architecture is essential for search system design interviews.
Elasticsearch Cluster
┌─────────────────────────────────────────────┐
│ Node 1 Node 2 │
│ ┌──────────┐ ┌──────────┐ │
│ │ Shard P0 │ │ Shard P1 │ │
│ │ Shard R1 │ │ Shard R0 │ (replica)│
│ └──────────┘ └──────────┘ │
│ │
│ Node 3 │
│ ┌──────────┐ │
│ │ Shard P2 │ │
│ │ Shard R0 │ (replica) │
│ └──────────┘ │
└─────────────────────────────────────────────┘
Shards distribute an index across nodes. Each shard is a self-contained Lucene index, so a 5-shard index can spread its data and query load across up to five nodes and hold roughly 5x the data of a single-shard index.
Replicas provide redundancy and read throughput. A shard with 1 replica means 2 copies of the data exist; if the node holding the primary dies, the replica is promoted to primary.
Analyzers define the tokenization pipeline (character filters, tokenizer, token filters). Custom analyzers handle language-specific needs like CJK tokenization or synonym expansion.
Interview Application: E-Commerce Product Search and Recommendations
When asked "Design a product search and recommendation system for an e-commerce platform," structure your answer across these four layers:
┌────────────────────────────────────────────────────┐
│ API Gateway │
├────────────────────────────────────────────────────┤
│ Search Service │ Recommendation Service │
│ ┌──────────────┐ │ ┌────────────────────────┐ │
│ │ Query Parser │ │ │ Collaborative Filter │ │
│ │ Elasticsearch │ │ │ Content-Based Filter │ │
│ │ BM25 + Boost │ │ │ Hybrid Blender │ │
│ │ Autocomplete │ │ │ Popularity Fallback │ │
│ └──────────────┘ │ └────────────────────────┘ │
├────────────────────────────────────────────────────┤
│ Analytics Pipeline (Kafka → Flink) │
│ Click tracking, A/B test events, search logs │
├────────────────────────────────────────────────────┤
│ PostgreSQL │ Elasticsearch │ Redis │ ClickHouse│
│ (catalog) │ (search index)│ (cache)│ (analytics)│
└────────────────────────────────────────────────────┘
Key design decisions to discuss:
- Search ranking: BM25 base relevance + business boosting (sales rank, margin, availability) + personalization signals
- Autocomplete: Elasticsearch completion suggester with popularity weighting, updated from search log analytics
- Recommendations: Collaborative filtering for "Customers also bought", content-based for "Similar products", and popularity for new users
- Cold start: New products start with content-based recommendations from category/brand metadata; new users see trending items until they have 5+ interactions
- Analytics pipeline: Kafka ingests click and purchase events, Flink computes real-time metrics (conversion rate, search success rate), ClickHouse stores aggregated analytics for dashboards
Next: Module 04 Quiz — Test your understanding of search internals, recommendation algorithms, and analytics architecture.