Feature Stores & Feature Engineering
Why Feature Stores?
Feature stores address one of the most common production ML problems: training-serving skew. By computing each feature from a single shared definition, they help ensure your model sees the same feature values in production as it did during training.
The Training-Serving Skew Problem
 Training Pipeline           Serving Pipeline
┌──────────────────┐       ┌──────────────────┐
│   SQL Query A    │       │  Python Code B   │
│   (PostgreSQL)   │       │ (Real-time API)  │
└────────┬─────────┘       └────────┬─────────┘
         │                          │
         ▼                          ▼
┌──────────────────┐       ┌──────────────────┐
│  Feature X = 10  │   ≠   │ Feature X = 10.1 │
└──────────────────┘       └──────────────────┘
         │                          │
         ▼                          ▼
       Model                      Model
     (accurate)                 (degraded)
The problem: the training and serving pipelines compute the same feature with different code, and the resulting small discrepancies (10 vs. 10.1 in the diagram above) degrade model performance in production.
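To make the skew concrete, here is a minimal sketch. Both functions and the sample data are hypothetical stand-ins for the training query and the serving code; they are meant to compute the same "average order value" feature, but only the training path filters out refunded orders.

```python
from statistics import mean

# Hypothetical order records for one customer: (amount, status)
orders = [(10.0, "completed"), (12.0, "completed"), (8.0, "refunded")]


def avg_order_value_training(rows):
    """Stand-in for the batch SQL query: excludes refunded orders."""
    return mean(amount for amount, status in rows if status != "refunded")


def avg_order_value_serving(rows):
    """Stand-in for the real-time code path: forgets to filter refunds."""
    return mean(amount for amount, _ in rows)


train_value = avg_order_value_training(orders)
serve_value = avg_order_value_serving(orders)

# The model was trained on one value but is served another.
skew_detected = train_value != serve_value
print(train_value, serve_value, skew_detected)
```

Neither function is obviously wrong on its own, which is why this kind of mismatch tends to go unnoticed until model quality drops.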
What is a Feature Store?
A feature store is a centralized system for:
- Storing feature definitions
- Computing features consistently
- Serving features for training and inference
- Tracking feature lineage and versions
                 ┌─────────────────────┐
                 │    Feature Store    │
                 │  ┌───────────────┐  │
Raw Data ──────▶ │  │   Transform   │  │ ──────▶ Training
                 │  │    & Store    │  │
                 │  ├───────────────┤  │ ──────▶ Serving
                 │  │    Online/    │  │
                 │  │    Offline    │  │
                 └──┴───────────────┴──┘
Online vs Offline Stores
| Aspect | Offline Store | Online Store |
|---|---|---|
| Use case | Training | Inference |
| Latency | Minutes-hours | Milliseconds |
| Storage | Data warehouse | Key-value store |
| Volume | Historical data | Latest values |
| Access | Batch queries | Point lookups |
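The split in the table can be sketched with a toy in-memory store. `ToyFeatureStore` and its methods are hypothetical and exist only for illustration; a real deployment would back the offline side with a warehouse and the online side with a key-value store.

```python
from collections import defaultdict


class ToyFeatureStore:
    """Illustrative in-memory store; not a real feature store API."""

    def __init__(self):
        # Offline store: full history per entity, scanned in batch for training.
        self.offline = defaultdict(list)
        # Online store: only the latest value per entity, for point lookups.
        self.online = {}

    def write(self, entity_id, timestamp, features):
        """Append to the history and refresh the latest value if newer."""
        self.offline[entity_id].append((timestamp, features))
        latest = self.online.get(entity_id)
        # ISO date strings compare chronologically, so plain >= works here.
        if latest is None or timestamp >= latest[0]:
            self.online[entity_id] = (timestamp, features)

    def get_history(self, entity_id):
        """Batch-style read: every recorded value, oldest first."""
        return sorted(self.offline[entity_id], key=lambda row: row[0])

    def get_latest(self, entity_id):
        """Point lookup: the most recent feature values only."""
        return self.online[entity_id][1]


store = ToyFeatureStore()
store.write(12345, "2025-01-10", {"total_purchases": 3})
store.write(12345, "2025-01-15", {"total_purchases": 5})

history = store.get_history(12345)  # what training would read
latest = store.get_latest(12345)    # what inference would read
```

Both reads come from the same writes, so the two stores cannot disagree about how a feature was computed, only about how much history they keep.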
Offline Store (Training)
# Query historical features for training
training_data = feature_store.get_historical_features(
    entity_df=entity_dataframe,
    features=[
        "customer_features:total_purchases",
        "customer_features:avg_order_value",
        "customer_features:days_since_last_order",
    ],
)
Online Store (Inference)
# Get latest features for real-time prediction
features = feature_store.get_online_features(
    features=[
        "customer_features:total_purchases",
        "customer_features:avg_order_value",
    ],
    entity_rows=[{"customer_id": 12345}],
)
Feature Store Benefits
1. Consistency
┌─────────────────────────────────────────────────────────────┐
│                  Single Feature Definition                  │
│                                                             │
│  def avg_order_value(orders):                               │
│      return orders.groupby('customer_id')['amount'].mean()  │
└─────────────────────────────────────────────────────────────┘
                               │
               ┌───────────────┴───────────────┐
               │                               │
               ▼                               ▼
           Training                         Serving
        (Same result)                    (Same result)
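As a rough illustration of the diagram, the sketch below reuses one definition on both paths. It is a plain-Python stand-in for the pandas one-liner in the box, and the order records are made up for the example.

```python
from collections import defaultdict


def avg_order_value(orders):
    """Single feature definition: mean order amount per customer_id."""
    totals = defaultdict(lambda: [0.0, 0])  # customer_id -> [sum, count]
    for order in orders:
        entry = totals[order["customer_id"]]
        entry[0] += order["amount"]
        entry[1] += 1
    return {cid: total / count for cid, (total, count) in totals.items()}


# Hypothetical order records.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]

# Training path: compute the feature for every customer in one batch.
training_features = avg_order_value(orders)

# Serving path: reuse the same function on a single customer's orders.
customer_1_orders = [o for o in orders if o["customer_id"] == 1]
serving_feature = avg_order_value(customer_1_orders)[1]
```

Because there is only one implementation, any change to the feature logic reaches training and serving at the same time.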
2. Reusability
Feature: customer_lifetime_value
├── Used by: Churn Prediction Model
├── Used by: Upsell Model
├── Used by: Risk Assessment Model
└── Used by: Marketing Segmentation
3. Discovery
Teams can browse and reuse existing features:
Feature Catalog
───────────────────────────────────────────────
Name                 │ Owner   │ Last Updated
───────────────────────────────────────────────
customer_ltv         │ Team A  │ 2025-01-15
product_avg_rating   │ Team B  │ 2025-01-10
user_session_count   │ Team C  │ 2025-01-12
order_frequency_30d  │ Team A  │ 2025-01-14
───────────────────────────────────────────────
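A catalog like this is essentially searchable metadata. The sketch below models it with a small dataclass; `FeatureEntry` and `search` are hypothetical names for illustration, and the entries mirror the table above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FeatureEntry:
    name: str
    owner: str
    last_updated: date


catalog = [
    FeatureEntry("customer_ltv", "Team A", date(2025, 1, 15)),
    FeatureEntry("product_avg_rating", "Team B", date(2025, 1, 10)),
    FeatureEntry("user_session_count", "Team C", date(2025, 1, 12)),
    FeatureEntry("order_frequency_30d", "Team A", date(2025, 1, 14)),
]


def search(entries, keyword="", owner=None):
    """Return entries whose name contains keyword and, optionally, match owner."""
    return [
        e for e in entries
        if keyword in e.name and (owner is None or e.owner == owner)
    ]


team_a_features = [e.name for e in search(catalog, owner="Team A")]
order_features = [e.name for e in search(catalog, keyword="order")]
```

Real feature stores typically attach more metadata (descriptions, tags, data sources), but the lookup pattern is similar.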
4. Time Travel
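# Get features as they were at each row's timestamp (point-in-time join).
# In Feast-style APIs the timestamp is read from an "event_timestamp" column
# in entity_df rather than passed as a separate argument.
point_in_time_features = feature_store.get_historical_features(
    entity_df=entity_dataframe,  # columns: customer_id, event_timestamp
    features=["customer_features:total_purchases"],
)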
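Under the hood, time travel comes down to an "as of" lookup: for each training row, take the latest feature value recorded at or before that row's timestamp, so that no future information leaks into training. A minimal sketch, assuming a sorted per-customer history (the data and the `value_as_of` helper are hypothetical):

```python
from bisect import bisect_right

# Hypothetical feature history for one customer: (timestamp, total_purchases),
# sorted by timestamp. ISO date strings sort chronologically.
history = [
    ("2025-01-01", 2),
    ("2025-01-10", 3),
    ("2025-01-15", 5),
]


def value_as_of(rows, event_timestamp):
    """Return the latest value recorded at or before event_timestamp."""
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, event_timestamp)
    if idx == 0:
        return None  # No feature value existed yet at that time.
    return rows[idx - 1][1]


as_of_jan_12 = value_as_of(history, "2025-01-12")  # uses the 2025-01-10 value
as_of_jan_15 = value_as_of(history, "2025-01-15")  # exact match is included
before_any = value_as_of(history, "2024-12-31")    # nothing recorded yet
```

A production feature store performs the same join across many entities and features at once, usually inside the data warehouse.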
Common Use Cases
| Use Case | Features Needed | Latency Target |
|---|---|---|
| Fraud detection | Transaction patterns, device info | < 50 ms |
| Recommendations | User preferences, item embeddings | < 100 ms |
| Credit scoring | Financial history, behavior patterns | < 1 s |
| Dynamic pricing | Demand signals, competitor prices | < 500 ms |
Feature Store Architecture
┌─────────────────────────────────────────────────────────────┐
│                        Data Sources                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │ Database│  │ Streams │  │  Files  │  │  APIs   │         │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘         │
└───────┼────────────┼────────────┼────────────┼──────────────┘
        │            │            │            │
        └────────────┴─────┬──────┴────────────┘
                           │
                           ▼
               ┌────────────────────────┐
               │  Feature Engineering   │
               │   (Transformations)    │
               └───────────┬────────────┘
                           │
          ┌────────────────┴────────────────┐
          │                                 │
          ▼                                 ▼
┌───────────────────┐             ┌───────────────────┐
│   Offline Store   │             │   Online Store    │
│    (Data Lake)    │             │  (Redis/DynamoDB) │
└─────────┬─────────┘             └─────────┬─────────┘
          │                                 │
          ▼                                 ▼
  Training Pipeline                 Inference Service
Popular Feature Stores
| Tool | Type | Best For |
|---|---|---|
| Feast | Open-source | General purpose, self-hosted |
| Tecton | Managed | Enterprise, real-time ML |
| Databricks Feature Store | Managed | Spark-based workflows |
| Amazon SageMaker Feature Store | Managed | AWS ecosystem |
| Vertex AI Feature Store | Managed | GCP ecosystem |
When Do You Need a Feature Store?
| Situation | Need Feature Store? |
|---|---|
| Single model, batch inference | Maybe |
| Multiple models sharing features | Yes |
| Real-time inference | Yes |
| Training-serving skew issues | Yes |
| Feature discovery/governance | Yes |
Key insight: Feature stores aren't just storage. They're the bridge between data engineering and ML, supporting consistency, reusability, and governance across your ML platform.
Next, we'll dive deep into Feast, the most popular open-source feature store.