Search, Recommendation & Analytics Systems
Search, recommendation, and analytics are three pillars of most modern products. In system design interviews, you will often be asked to design systems that combine them. This lesson covers the internals you need to explain clearly at the whiteboard.
Full-Text Search Internals
Full-text search transforms unstructured text into a structure optimized for fast retrieval. The pipeline flows through five stages:
Document → Tokenizer → Inverted Index → Query Processing → Ranking (BM25)
               ↓               ↓
           "running"     term → [doc1:2, doc3:1]
           "shoes"       (posting lists with
           "trail"        term frequencies)
Stage 1: Tokenization
Tokenization breaks raw text into searchable terms. A production tokenizer performs multiple transformations:
| Step | Input | Output |
|---|---|---|
| Lowercasing | "Running Shoes" | "running shoes" |
| Splitting | "running shoes" | ["running", "shoes"] |
| Stop-word removal | ["the", "running", "shoes"] | ["running", "shoes"] |
| Stemming | ["running", "shoes"] | ["run", "shoe"] |
Stage 2: Inverted Index
An inverted index maps each term to the list of documents containing it. Each entry in the posting list stores the document ID and the term frequency (how many times the term appears in that document).
Inverted Index:
─────────────────────────────────────────
term → posting list
─────────────────────────────────────────
"run" → [(doc1, tf=3), (doc4, tf=1)]
"shoe" → [(doc1, tf=2), (doc2, tf=5)]
"trail" → [(doc3, tf=1), (doc4, tf=2)]
─────────────────────────────────────────
With a hash-based term dictionary, the inverted index gives O(1) average-case lookup per term. Multi-term queries intersect or union posting lists depending on the boolean operator (AND vs OR).
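A conjunctive (AND) query can be answered by a linear merge of the two posting lists. A minimal sketch, assuming each list is kept sorted by docId:

```typescript
interface Posting { docId: number; tf: number; }

// Intersect two posting lists sorted by docId (AND semantics).
// Linear two-pointer merge: O(m + n), the standard approach for conjunctive queries.
function intersect(a: Posting[], b: Posting[]): Posting[] {
  const result: Posting[] = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i].docId === b[j].docId) {
      // Keep the entry from list a; a real engine would combine scores here.
      result.push(a[i]);
      i++; j++;
    } else if (a[i].docId < b[j].docId) {
      i++;
    } else {
      j++;
    }
  }
  return result;
}
```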
Stage 3: TF-IDF Scoring
TF-IDF (Term Frequency - Inverse Document Frequency) weighs terms by how informative they are:
- TF(t, d) = count of term t in document d / total terms in d
- IDF(t) = log(N / df(t)), where N = total documents, df(t) = documents containing t
A term that appears in every document (like "the") gets a low IDF. A rare term (like "kubernetes") gets a high IDF.
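The two formulas translate directly to code (natural log assumed for the IDF):

```typescript
// TF-IDF from the formulas above.
function tf(termCount: number, docLength: number): number {
  return docLength === 0 ? 0 : termCount / docLength;
}

function idf(totalDocs: number, docFrequency: number): number {
  return Math.log(totalDocs / docFrequency);
}

function tfIdf(
  termCount: number,
  docLength: number,
  totalDocs: number,
  docFrequency: number
): number {
  return tf(termCount, docLength) * idf(totalDocs, docFrequency);
}

// A term in every document gets IDF = log(N/N) = 0, contributing nothing to the score.
```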
Stage 4: BM25 Scoring
BM25 (Best Matching 25), developed by Robertson and Zaragoza, improves on TF-IDF with two key refinements: term frequency saturation and document length normalization.
The BM25 score for a query Q containing terms q1, q2, ..., qn against document D is:
BM25(D, Q) = SUM over qi of:

                          f(qi, D) * (k1 + 1)
     IDF(qi) * ─────────────────────────────────────────────
                f(qi, D) + k1 * (1 - b + b * |D| / avgdl)

Where:
    f(qi, D) = frequency of term qi in document D
    |D|      = length of document D (in terms)
    avgdl    = average document length across the corpus
    k1       = 1.2  (term frequency saturation parameter)
    b        = 0.75 (document length normalization parameter)
Why k1 = 1.2? It controls how quickly term frequency saturates. With k1 = 1.2, a term appearing 3 times scores only marginally more than a term appearing 2 times. This prevents keyword stuffing from dominating relevance.
Why b = 0.75? It controls how much document length penalizes the score. At b = 0.75, longer documents are penalized (a term appearing once in a 1000-word document scores lower than in a 100-word document), but not as harshly as b = 1.0 would.
BM25 Practical Example
Suppose we have 3 documents and query Q = "trail running shoes":
| Document | Length (terms) | "trail" freq | "running" freq | "shoes" freq |
|---|---|---|---|---|
| D1: Trail shoe review | 50 | 3 | 0 | 2 |
| D2: Best running shoes guide | 200 | 0 | 5 | 4 |
| D3: Trail running shoe comparison | 80 | 2 | 3 | 3 |
With avgdl = 110, k1 = 1.2, b = 0.75, and N = 3:
- D3 scores highest because it contains all three query terms and has a moderate document length
- D2 has high term frequencies for "running" and "shoes" but zero for "trail", and its length (200) penalizes it
- D1 has "trail" but lacks "running"
interface PostingEntry {
docId: string;
termFrequency: number;
}
interface BM25Config {
k1: number; // 1.2 — term frequency saturation
b: number; // 0.75 — document length normalization
}
function computeBM25(
queryTerms: string[],
docId: string,
index: Map<string, PostingEntry[]>,
docLengths: Map<string, number>,
avgDocLength: number,
totalDocs: number,
config: BM25Config = { k1: 1.2, b: 0.75 }
): number {
let score = 0;
const docLength = docLengths.get(docId) ?? 0;
for (const term of queryTerms) {
const postings = index.get(term) ?? [];
const df = postings.length; // document frequency
if (df === 0) continue;
const entry = postings.find(p => p.docId === docId);
const tf = entry?.termFrequency ?? 0;
if (tf === 0) continue;
// IDF component: log((N - df + 0.5) / (df + 0.5) + 1)
const idf = Math.log(
(totalDocs - df + 0.5) / (df + 0.5) + 1
);
// BM25 numerator and denominator
const numerator = tf * (config.k1 + 1);
const denominator =
tf + config.k1 * (1 - config.b + config.b * docLength / avgDocLength);
score += idf * (numerator / denominator);
}
return score;
}
Typeahead and Autocomplete
Autocomplete must return results within 100ms to feel instantaneous. Two main approaches exist:
Trie-Based Autocomplete
A trie stores every prefix of every searchable term. When the user types "ru", the system traverses to the "r" -> "u" node and returns all completions ranked by popularity.
(root)
├─ r
│  └─ u
│     ├─ n ..................... [run: 450]
│     │  └─ n ─ i ─ n ─ g ...... [running: 320]
│     └─ s ─ h ................. [rush: 30]
└─ s ─ h ─ o ─ e ............... [shoe: 800]
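A minimal trie sketch of the structure above; the node layout and depth-first collection are one of several possible designs:

```typescript
interface TrieNode {
  children: Map<string, TrieNode>;
  score: number | null; // popularity if a term ends here, else null
}

class Trie {
  private root: TrieNode = { children: new Map(), score: null };

  insert(term: string, score: number): void {
    let node = this.root;
    for (const ch of term) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), score: null });
      }
      node = node.children.get(ch)!;
    }
    node.score = score;
  }

  // Walk to the prefix node, then collect all completions beneath it,
  // ranked by popularity descending.
  complete(prefix: string, limit = 10): string[] {
    let node = this.root;
    for (const ch of prefix) {
      const next = node.children.get(ch);
      if (!next) return [];
      node = next;
    }
    const results: { term: string; score: number }[] = [];
    const dfs = (n: TrieNode, path: string): void => {
      if (n.score !== null) results.push({ term: prefix + path, score: n.score });
      for (const [ch, child] of n.children) dfs(child, path + ch);
    };
    dfs(node, "");
    return results
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map(r => r.term);
  }
}

// With the terms from the diagram, complete("ru") -> ["run", "running", "rush"]
```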
Prefix Search with Popularity Ranking
For large-scale systems, a sorted index of terms with prefix filtering is more practical than an in-memory trie:
interface AutocompleteEntry {
  term: string;
  score: number; // popularity or recency weight
}
class AutocompleteService {
  private entries: AutocompleteEntry[];
  constructor(entries: AutocompleteEntry[]) {
    // Normalize terms and sort so the prefix range can be found by binary search
    this.entries = entries
      .map(e => ({ ...e, term: e.term.toLowerCase() }))
      .sort((a, b) => (a.term < b.term ? -1 : a.term > b.term ? 1 : 0));
  }
  // Binary search for the first entry whose term is >= prefix
  private lowerBound(prefix: string): number {
    let lo = 0;
    let hi = this.entries.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.entries[mid].term < prefix) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }
  suggest(prefix: string, limit: number = 10): AutocompleteEntry[] {
    const lowerPrefix = prefix.toLowerCase();
    // Entries matching the prefix form a contiguous range in the sorted array,
    // so scan forward from the lower bound until the prefix no longer matches
    const matches: AutocompleteEntry[] = [];
    for (let i = this.lowerBound(lowerPrefix); i < this.entries.length; i++) {
      if (!this.entries[i].term.startsWith(lowerPrefix)) break;
      matches.push(this.entries[i]);
    }
    // Rank by popularity score descending
    return matches
      .sort((a, b) => b.score - a.score)
      .slice(0, limit);
  }
}
In production, Elasticsearch provides a completion suggester backed by an FST (finite state transducer) that achieves sub-millisecond prefix lookups.
Recommendation Systems
Recommendation systems predict what users want before they search. Three approaches form the foundation of every recommendation engine.
Collaborative Filtering
Collaborative filtering works on the principle that users who agreed in the past will agree in the future. It requires no knowledge of item content.
User-Based CF: Find users similar to the target user, recommend items those similar users liked.
Item-Based CF: Find items similar to items the target user liked, recommend those similar items.
User-Item Rating Matrix:
────────────────────────────────────
Item A Item B Item C Item D
User 1 5 3 ? 1
User 2 4 ? 2 1
User 3 ? 3 5 4
User 4 5 4 ? ?
────────────────────────────────────
User-Based CF for User 1, Item C:
1. Find similar users to User 1 (by cosine similarity on ratings)
2. User 4 is most similar (both rated A=5, B high)
3. But User 4 hasn't rated C either
4. User 3 rated C=5, and is somewhat similar
5. Predict User 1 would rate C ≈ 4
Cosine similarity between two user rating vectors:
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
const denominator = Math.sqrt(normA) * Math.sqrt(normB);
return denominator === 0 ? 0 : dotProduct / denominator;
}
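Given similarities to neighboring users, a missing rating is commonly predicted as a similarity-weighted average of the neighbors' ratings for that item. A sketch using the standard textbook weighting (not tied to any particular library):

```typescript
// Predict a user's rating for an item from neighbors who did rate it.
// Each entry pairs a neighbor's rating with that neighbor's cosine
// similarity to the target user.
function predictRating(
  neighborRatings: { rating: number; similarity: number }[]
): number | null {
  let weightedSum = 0;
  let similaritySum = 0;
  for (const { rating, similarity } of neighborRatings) {
    weightedSum += rating * similarity;
    similaritySum += Math.abs(similarity);
  }
  // No rated neighbors means no prediction is possible (cold start).
  return similaritySum === 0 ? null : weightedSum / similaritySum;
}

// predictRating([{ rating: 5, similarity: 0.6 }, { rating: 3, similarity: 0.2 }])
// = (5*0.6 + 3*0.2) / (0.6 + 0.2) = 4.5
```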
Content-Based Filtering
Content-based filtering recommends items similar to what the user previously liked, based on item features (genre, brand, price range, tags).
Item Feature Vectors:
────────────────────────────────────
Action Comedy Sci-Fi Runtime
"The Matrix" 0.9 0.1 0.9 0.7
"Die Hard" 0.9 0.3 0.1 0.6
"Inception" 0.8 0.1 0.8 0.8
────────────────────────────────────
User liked: "The Matrix", "Inception"
User profile vector: [0.85, 0.1, 0.85, 0.75] (average)
Most similar unwatched item: "Die Hard" (cosine sim ≈ 0.83)
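The profile construction above can be sketched as follows; `cosine` repeats the computation from the collaborative-filtering section so the block is self-contained:

```typescript
// Cosine similarity between two equal-length feature vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// The user profile is the element-wise average of liked-item vectors.
function buildProfile(likedVectors: number[][]): number[] {
  const dims = likedVectors[0].length;
  const profile = new Array(dims).fill(0);
  for (const v of likedVectors) {
    for (let i = 0; i < dims; i++) profile[i] += v[i] / likedVectors.length;
  }
  return profile;
}

// From the table: liking "The Matrix" and "Inception" yields
// profile [0.85, 0.1, 0.85, 0.75]; cosine against "Die Hard" ≈ 0.83.
```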
Cold Start Problem
Both approaches struggle when data is sparse:
| Scenario | Problem | Solution |
|---|---|---|
| New user | No ratings or history | Content-based from onboarding preferences |
| New item | No one has rated it yet | Content-based from item metadata |
| New system | Neither users nor items have data | Popularity-based fallback |
Hybrid Recommenders
Production systems combine both approaches. Netflix, for example, uses collaborative filtering for personalized recommendations, content-based filtering for "Because you watched X", and popularity-based ranking as a fallback for new users.
interface Recommendation {
itemId: string;
score: number;
source: "collaborative" | "content" | "popularity";
}
function hybridRecommend(
userId: string,
cfResults: Recommendation[],
cbResults: Recommendation[],
popularItems: Recommendation[],
userHistoryCount: number
): Recommendation[] {
// Cold start: fallback to content-based + popularity
if (userHistoryCount < 5) {
const blended = [
...cbResults.map(r => ({ ...r, score: r.score * 0.6 })),
...popularItems.map(r => ({ ...r, score: r.score * 0.4 })),
];
return deduplicateAndSort(blended);
}
// Warm user: blend collaborative and content-based
const blended = [
...cfResults.map(r => ({ ...r, score: r.score * 0.7 })),
...cbResults.map(r => ({ ...r, score: r.score * 0.3 })),
];
return deduplicateAndSort(blended);
}
function deduplicateAndSort(items: Recommendation[]): Recommendation[] {
const seen = new Map<string, Recommendation>();
for (const item of items) {
const existing = seen.get(item.itemId);
if (!existing || item.score > existing.score) {
seen.set(item.itemId, item);
}
}
return [...seen.values()].sort((a, b) => b.score - a.score);
}
Matrix Factorization
Matrix factorization decomposes the sparse user-item rating matrix into two lower-rank matrices. This was famously used in the Netflix Prize competition (2006-2009), where the winning team used SVD-based approaches.
R ≈ U × V^T
R (users x items) U (users x k) V (items x k)
[5 3 ? 1] [u1] [v1]
[4 ? 2 1] ≈ [u2] × [v2]^T
[? 3 5 4] [u3] [v3]
[5 4 ? ?] [u4] [v4]
k = latent factor dimensions (e.g., 50)
Classic SVD (Singular Value Decomposition) finds optimal low-rank factors, but it is defined only for fully observed matrices; recommenders instead minimize squared error over the observed ratings only. ALS (Alternating Least Squares) is the common algorithm for large-scale systems because it can be parallelized: fix U and solve for V, then fix V and solve for U, and repeat until convergence.
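Once the factors are learned, serving a prediction is just a dot product of latent vectors. A sketch with made-up factor values (k = 2 here for readability; real systems use k on the order of 50-200):

```typescript
// After factorization, a predicted rating is the dot product of the
// user's and item's latent factor vectors.
function predictFromFactors(userFactors: number[], itemFactors: number[]): number {
  let score = 0;
  for (let k = 0; k < userFactors.length; k++) {
    score += userFactors[k] * itemFactors[k];
  }
  return score;
}

// Hypothetical 2-dimensional factors (not from a real ALS run):
// predictFromFactors([1.2, 0.8], [0.5, 3.1]) = 0.6 + 2.48 = 3.08
```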
Analytics Pipelines: OLTP vs OLAP
Analytics systems require fundamentally different architectures from transactional systems.
| Property | OLTP | OLAP |
|---|---|---|
| Purpose | Record transactions | Analyze trends |
| Query pattern | Point lookups, small writes | Full table scans, aggregations |
| Data model | Normalized (3NF) | Denormalized (star schema) |
| Latency target | < 10ms | Seconds to minutes |
| Data freshness | Real-time | Minutes to hours |
| Examples | PostgreSQL, MySQL | BigQuery, Redshift, ClickHouse |
Star Schema Design
A star schema organizes analytical data around a central fact table surrounded by dimension tables:
dim_date
┌─────────┐
│date_key │
│day │
│month │
│quarter │
└────┬────┘
│
dim_product fact_sales dim_customer
┌──────────┐ ┌──────────────┐ ┌────────────┐
│product_id│ │sale_id │ │customer_id │
│name │◄───│product_id │────►│name │
│category │ │customer_id │ │segment │
│brand │ │date_key │ │region │
└──────────┘ │quantity │ └────────────┘
│revenue │
│discount │
└──────────────┘
Fact tables store measurable events (sales, clicks, page views). Dimension tables store descriptive attributes (product details, customer demographics, dates). This layout enables efficient aggregation queries like "total revenue by product category per quarter."
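An aggregation like "total revenue by product category per quarter" is a join from the fact table to two dimensions followed by a group-by. An in-memory sketch (a warehouse would run this as SQL; the field names mirror the diagram):

```typescript
interface FactSale { productId: string; dateKey: string; revenue: number; }
interface DimProduct { productId: string; category: string; }
interface DimDate { dateKey: string; quarter: string; }

// Join each fact row to its dimensions, then sum revenue per
// (category, quarter) group-by key.
function revenueByCategoryQuarter(
  facts: FactSale[],
  products: Map<string, DimProduct>,
  dates: Map<string, DimDate>
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const f of facts) {
    const category = products.get(f.productId)?.category ?? "unknown";
    const quarter = dates.get(f.dateKey)?.quarter ?? "unknown";
    const key = `${category}|${quarter}`;
    totals.set(key, (totals.get(key) ?? 0) + f.revenue);
  }
  return totals;
}
```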
ETL vs ELT
| Aspect | ETL | ELT |
|---|---|---|
| Transform step | Before loading into warehouse | After loading into warehouse |
| Tooling | Airflow, Informatica, custom scripts | dbt, BigQuery SQL, Snowflake |
| Best for | Legacy systems, complex transformations | Cloud warehouses with cheap compute |
| Data freshness | Batch (hourly/daily) | Near-real-time possible |
Stream Processing for Real-Time Analytics
When analytics require sub-minute freshness, stream processing replaces batch ETL:
Event Source → Kafka → Stream Processor → Analytics Store
(clicks) (buffer) (Flink/Spark) (ClickHouse)
↓
Aggregate in
sliding windows
(1min, 5min, 1hr)
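The windowed aggregation step can be sketched as a tumbling-window counter. This is a drastic simplification of what a stream processor like Flink does (no watermarks, no late-event handling, no state eviction):

```typescript
// Tumbling-window event counter: a simplified per-key aggregation.
// Window boundaries are aligned to multiples of windowMs.
class WindowedCounter {
  private counts = new Map<number, number>(); // windowStart -> event count
  constructor(private windowMs: number) {}

  record(timestampMs: number): void {
    const windowStart = Math.floor(timestampMs / this.windowMs) * this.windowMs;
    this.counts.set(windowStart, (this.counts.get(windowStart) ?? 0) + 1);
  }

  countFor(timestampMs: number): number {
    const windowStart = Math.floor(timestampMs / this.windowMs) * this.windowMs;
    return this.counts.get(windowStart) ?? 0;
  }
}

// 1-minute windows: events at t=0 and t=30s land in the same window,
// an event at t=61s lands in the next one.
```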
Elasticsearch Architecture
Elasticsearch is one of the most widely deployed search engines for application search. Understanding its architecture is essential for search system design interviews.
Elasticsearch Cluster
┌─────────────────────────────────────────────┐
│ Node 1 Node 2 │
│ ┌──────────┐ ┌──────────┐ │
│ │ Shard P0 │ │ Shard P1 │ │
│ │ Shard R1 │ │ Shard R0 │ (replica)│
│ └──────────┘ └──────────┘ │
│ │
│ Node 3 │
│ ┌──────────┐ │
│ │ Shard P2 │ │
│ │ Shard R0 │ (replica) │
│ └──────────┘ │
└─────────────────────────────────────────────┘
Shards distribute an index across nodes. Each shard is a self-contained Lucene index, so a 5-shard index can spread its data and query load across up to five nodes and hold roughly 5x the data of a single-shard index.
Replicas provide redundancy and read throughput. A shard with 1 replica means 2 copies of the data exist; if the node holding the primary dies, the replica is promoted to primary.
Analyzers define the tokenization pipeline (character filters, tokenizer, token filters). Custom analyzers handle language-specific needs like CJK tokenization or synonym expansion.
Interview Application: E-Commerce Product Search and Recommendations
When asked "Design a product search and recommendation system for an e-commerce platform," structure your answer across these four layers:
┌────────────────────────────────────────────────────┐
│ API Gateway │
├────────────────────────────────────────────────────┤
│ Search Service │ Recommendation Service │
│ ┌──────────────┐ │ ┌────────────────────────┐ │
│ │ Query Parser │ │ │ Collaborative Filter │ │
│ │ Elasticsearch │ │ │ Content-Based Filter │ │
│ │ BM25 + Boost │ │ │ Hybrid Blender │ │
│ │ Autocomplete │ │ │ Popularity Fallback │ │
│ └──────────────┘ │ └────────────────────────┘ │
├────────────────────────────────────────────────────┤
│ Analytics Pipeline (Kafka → Flink) │
│ Click tracking, A/B test events, search logs │
├────────────────────────────────────────────────────┤
│ PostgreSQL │ Elasticsearch │ Redis │ ClickHouse│
│ (catalog) │ (search index)│ (cache)│ (analytics)│
└────────────────────────────────────────────────────┘
Key design decisions to discuss:
- Search ranking: BM25 base relevance + business boosting (sales rank, margin, availability) + personalization signals
- Autocomplete: Elasticsearch completion suggester with popularity weighting, updated from search log analytics
- Recommendations: Collaborative filtering for "Customers also bought", content-based for "Similar products", and popularity for new users
- Cold start: New products start with content-based recommendations from category/brand metadata; new users see trending items until they have 5+ interactions
- Analytics pipeline: Kafka ingests click and purchase events, Flink computes real-time metrics (conversion rate, search success rate), ClickHouse stores aggregated analytics for dashboards
Next: Module 04 Quiz — Test your understanding of search internals, recommendation algorithms, and analytics architecture.