Feature Stores & Feature Engineering

Why Feature Stores?

Feature stores address one of the most common production ML problems: training-serving skew. By computing each feature from a single definition, they help ensure that your model sees the same feature values in production as it did during training.

The Training-Serving Skew Problem

Training Pipeline                     Serving Pipeline
┌──────────────────┐                 ┌──────────────────┐
│  SQL Query A     │                 │  Python Code B   │
│  (PostgreSQL)    │                 │  (Real-time API) │
└────────┬─────────┘                 └────────┬─────────┘
         │                                    │
         ▼                                    ▼
┌──────────────────┐                 ┌──────────────────┐
│  Feature X = 10  │      ≠         │  Feature X = 10.1│
└──────────────────┘                 └──────────────────┘
         │                                    │
         ▼                                    ▼
      Model                              Model
    (accurate)                         (degraded)

The problem: the training and serving pipelines compute the same feature with different code, so small discrepancies creep in (here, 10 vs. 10.1) and degrade model performance.
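
A minimal sketch of how such skew can arise, assuming a hypothetical `avg_order_value` feature and made-up order data (neither comes from a real system). The training path mimics a SQL query that averages only completed orders, while the serving path was reimplemented separately and leaves out that filter:

```python
from statistics import mean

# Hypothetical raw orders for one customer (illustrative data only).
orders = [
    {"amount": 20.0, "status": "completed"},
    {"amount": 40.0, "status": "completed"},
    {"amount": 90.0, "status": "cancelled"},
]

def avg_order_value_training(rows):
    """Training path: mirrors a SQL query with WHERE status = 'completed'."""
    return mean(r["amount"] for r in rows if r["status"] == "completed")

def avg_order_value_serving(rows):
    """Serving path: rewritten by hand, and the status filter was forgotten."""
    return mean(r["amount"] for r in rows)

train_value = avg_order_value_training(orders)  # 30.0
serve_value = avg_order_value_serving(orders)   # 50.0
skew = abs(train_value - serve_value)
print(f"training={train_value} serving={serve_value} skew={skew}")
```

Both functions look reasonable in isolation, which is why this kind of bug tends to surface only as degraded predictions in production.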

What is a Feature Store?

A feature store is a centralized repository for:

  • Storing feature definitions
  • Computing features consistently
  • Serving features for training and inference
  • Tracking feature lineage and versions

                    ┌─────────────────────┐
                    │    Feature Store    │
                    │  ┌───────────────┐  │
   Raw Data ──────▶ │  │  Transform    │  │ ──────▶ Training
                    │  │  & Store      │  │
                    │  ├───────────────┤  │ ──────▶ Serving
                    │  │  Online/      │  │
                    │  │  Offline      │  │
                    └──┴───────────────┴──┘

Online vs Offline Stores

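───────────────────────────────────────────────────
Aspect     │ Offline Store    │ Online Store
───────────────────────────────────────────────────
Use case   │ Training         │ Inference
Latency    │ Minutes to hours │ Milliseconds
Storage    │ Data warehouse   │ Key-value store
Volume     │ Historical data  │ Latest values
Access     │ Batch queries    │ Point lookups
───────────────────────────────────────────────────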
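
The difference in access patterns can be sketched with two toy in-memory classes. `OfflineStore` and `OnlineStore` are hypothetical names for illustration, not the API of any real feature store, and the timestamps are plain integers:

```python
class OfflineStore:
    """Append-only history: every feature value is kept with its timestamp."""

    def __init__(self):
        self.rows = []  # (entity_id, timestamp, feature_name, value)

    def write(self, entity_id, timestamp, name, value):
        self.rows.append((entity_id, timestamp, name, value))

    def scan(self, name):
        """Batch query: return the full history of one feature."""
        return [r for r in self.rows if r[2] == name]


class OnlineStore:
    """Key-value view: only the latest value per (entity, feature)."""

    def __init__(self):
        self.latest = {}  # (entity_id, feature_name) -> (timestamp, value)

    def write(self, entity_id, timestamp, name, value):
        key = (entity_id, name)
        # Keep the newest value only; ignore late-arriving older rows.
        if key not in self.latest or timestamp >= self.latest[key][0]:
            self.latest[key] = (timestamp, value)

    def get(self, entity_id, name):
        """Point lookup by key."""
        return self.latest[(entity_id, name)][1]


offline, online = OfflineStore(), OnlineStore()
for ts, total in [(1, 3), (2, 4), (3, 7)]:
    offline.write(12345, ts, "total_purchases", total)
    online.write(12345, ts, "total_purchases", total)

history = offline.scan("total_purchases")       # all three rows, for training
current = online.get(12345, "total_purchases")  # latest value only, for inference
```

In practice the offline side is typically a warehouse table and the online side a key-value database, but the contrast between full history and latest value is the same.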

Offline Store (Training)

# Query historical features for training
training_data = feature_store.get_historical_features(
    entity_df=entity_dataframe,
    features=[
        "customer_features:total_purchases",
        "customer_features:avg_order_value",
        "customer_features:days_since_last_order"
    ]
)

Online Store (Inference)

# Get latest features for real-time prediction
features = feature_store.get_online_features(
    features=[
        "customer_features:total_purchases",
        "customer_features:avg_order_value"
    ],
    entity_rows=[{"customer_id": 12345}]
)

Feature Store Benefits

1. Consistency

┌─────────────────────────────────────────────────────────┐
│              Single Feature Definition                   │
│                                                         │
│  def avg_order_value(orders):                           │
│      return orders.groupby('customer_id')['amount'].mean()│
└─────────────────────────────────────────────────────────┘
         ┌───────────────┴───────────────┐
         │                               │
         ▼                               ▼
    Training                        Serving
    (Same result)                  (Same result)
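
As a rough sketch of the idea in the diagram, the snippet below routes both a batch path and a request-time path through one shared function. It uses plain Python rather than pandas, and the data and helper names (`build_training_rows`, `serve_feature`) are hypothetical:

```python
from statistics import mean

def avg_order_value(amounts):
    """The single feature definition shared by both paths."""
    return round(mean(amounts), 2) if amounts else 0.0

# Hypothetical order amounts keyed by customer_id (illustrative data).
orders_by_customer = {
    12345: [20.0, 40.0],
    67890: [15.5],
}

def build_training_rows(orders):
    """Batch path: compute the feature for every customer."""
    return {cid: avg_order_value(amts) for cid, amts in orders.items()}

def serve_feature(orders, customer_id):
    """Online path: compute the feature for one customer on request."""
    return avg_order_value(orders.get(customer_id, []))

training_rows = build_training_rows(orders_by_customer)
served = serve_feature(orders_by_customer, 12345)
assert served == training_rows[12345]  # same definition, same result
```

A real feature store usually precomputes values and serves them from the online store instead of recomputing on request; the point here is only that both paths depend on one definition.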

2. Reusability

Feature: customer_lifetime_value
    ├── Used by: Churn Prediction Model
    ├── Used by: Upsell Model
    ├── Used by: Risk Assessment Model
    └── Used by: Marketing Segmentation
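
One way to picture this kind of reuse tracking is a small registry that maps models to the features they consume. The sketch below is hypothetical: the model names come from the tree above, while `avg_order_value` and its consumers are added purely for illustration.

```python
# Hypothetical registry: which features each model consumes.
MODEL_FEATURES = {
    "Churn Prediction Model": ["customer_lifetime_value", "avg_order_value"],
    "Upsell Model": ["customer_lifetime_value"],
    "Risk Assessment Model": ["customer_lifetime_value"],
    "Marketing Segmentation": ["customer_lifetime_value", "avg_order_value"],
}

def consumers_of(feature_name, model_features):
    """Return the models that depend on a feature, sorted by name."""
    return sorted(
        model for model, feats in model_features.items() if feature_name in feats
    )

ltv_consumers = consumers_of("customer_lifetime_value", MODEL_FEATURES)
aov_consumers = consumers_of("avg_order_value", MODEL_FEATURES)
```

A lookup like this answers the question "which models are affected if this feature changes?" before the change is made.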

3. Discovery

Teams can browse and reuse existing features:

Feature Catalog
───────────────────────────────────────────────
Name                    │ Owner   │ Last Updated
───────────────────────────────────────────────
customer_ltv            │ Team A  │ 2025-01-15
product_avg_rating      │ Team B  │ 2025-01-10
user_session_count      │ Team C  │ 2025-01-12
order_frequency_30d     │ Team A  │ 2025-01-14
───────────────────────────────────────────────
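
A catalog like this is essentially searchable metadata. The sketch below loads the rows from the table above into a hypothetical `FeatureEntry` dataclass and filters them; the `search` helper is an illustration, not the interface of any particular feature store.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class FeatureEntry:
    name: str
    owner: str
    last_updated: date

CATALOG = [
    FeatureEntry("customer_ltv", "Team A", date(2025, 1, 15)),
    FeatureEntry("product_avg_rating", "Team B", date(2025, 1, 10)),
    FeatureEntry("user_session_count", "Team C", date(2025, 1, 12)),
    FeatureEntry("order_frequency_30d", "Team A", date(2025, 1, 14)),
]

def search(catalog, owner=None, name_contains=None, updated_since=None):
    """Return the entries that match every filter that was provided."""
    results = []
    for entry in catalog:
        if owner is not None and entry.owner != owner:
            continue
        if name_contains is not None and name_contains not in entry.name:
            continue
        if updated_since is not None and entry.last_updated < updated_since:
            continue
        results.append(entry)
    return results

team_a = [e.name for e in search(CATALOG, owner="Team A")]
recent = [e.name for e in search(CATALOG, updated_since=date(2025, 1, 13))]
```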

4. Time Travel

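# Get features as they were at each row's event time.
# In Feast-style APIs, entity_df carries an "event_timestamp" column, and each
# row is joined to the latest feature values at or before that timestamp.
point_in_time_features = feature_store.get_historical_features(
    entity_df=entity_dataframe,  # columns: customer_id, event_timestamp
    features=["customer_features:total_purchases"]
)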
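
The lookup behind such a point-in-time query can be sketched in plain Python: for each training example, take the latest feature value recorded at or before that example's event timestamp, so no future information leaks into training. The history, the integer timestamps, and the `value_as_of` helper are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history per customer: (timestamp, total_purchases),
# sorted by timestamp. Timestamps are plain integers (for example, day numbers).
history = {
    12345: [(1, 3), (5, 4), (9, 7)],
}

def value_as_of(history, entity_id, event_timestamp):
    """Return the latest value recorded at or before event_timestamp, else None."""
    rows = history.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, event_timestamp)
    return rows[idx - 1][1] if idx > 0 else None

# Each training example carries its own event timestamp.
entity_rows = [(12345, 0), (12345, 5), (12345, 8), (12345, 20)]
training_values = [value_as_of(history, cid, ts) for cid, ts in entity_rows]
```

The first example predates any recorded value, so it gets `None` rather than a value from the future.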

Common Use Cases

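──────────────────────────────────────────────────────────────────
Use Case         │ Features Needed                      │ Latency
──────────────────────────────────────────────────────────────────
Fraud detection  │ Transaction patterns, device info    │ < 50 ms
Recommendations  │ User preferences, item embeddings    │ < 100 ms
Credit scoring   │ Financial history, behavior patterns │ < 1 s
Dynamic pricing  │ Demand signals, competitor prices    │ < 500 ms
──────────────────────────────────────────────────────────────────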

Feature Store Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Data Sources                            │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐        │
│  │ Database│  │ Streams │  │  Files  │  │  APIs   │        │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘        │
└───────┼────────────┼────────────┼────────────┼──────────────┘
        │            │            │            │
        └────────────┴─────┬──────┴────────────┘
              ┌────────────────────────┐
              │   Feature Engineering  │
              │   (Transformations)    │
              └───────────┬────────────┘
        ┌─────────────────┴─────────────────┐
        │                                   │
        ▼                                   ▼
┌───────────────────┐            ┌───────────────────┐
│   Offline Store   │            │   Online Store    │
│   (Data Lake)     │            │   (Redis/DynamoDB)│
└─────────┬─────────┘            └─────────┬─────────┘
          │                                │
          ▼                                ▼
   Training Pipeline              Inference Service
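
Feature Store Tools

─────────────────────────────────────────────────────────────
Tool           │ Type        │ Best For
─────────────────────────────────────────────────────────────
Feast          │ Open-source │ General purpose, self-hosted
Tecton         │ Managed     │ Enterprise, real-time ML
Databricks     │ Managed     │ Spark-based workflows
AWS SageMaker  │ Managed     │ AWS ecosystem
Vertex AI      │ Managed     │ GCP ecosystem
─────────────────────────────────────────────────────────────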

When Do You Need a Feature Store?

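────────────────────────────────────────────────────────
Situation                         │ Need Feature Store?
────────────────────────────────────────────────────────
Single model, batch inference     │ Maybe
Multiple models sharing features  │ Yes
Real-time inference               │ Yes
Training-serving skew issues      │ Yes
Feature discovery/governance      │ Yes
────────────────────────────────────────────────────────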

Key insight: Feature stores aren't just storage. They're the bridge between data engineering and ML, ensuring consistency, reusability, and governance across your ML platform.

Next, we'll dive deep into Feast, one of the most popular open-source feature stores.
