ML Pipelines & Orchestration
Feature Stores and Data Versioning
5 min read
Feature stores are critical infrastructure for ML systems. Interviewers test both conceptual understanding and practical implementation.
Why Feature Stores Matter
Interview Question: "Why not just compute features at prediction time?"
Answer Framework:
# Problems without a feature store
problems = {
    "training_serving_skew": "Different code paths = different features",
    "computation_cost": "Recomputing complex aggregations per request",
    "consistency": "No single source of truth for feature definitions",
    "latency": "Complex features can't meet SLA requirements",
    "reuse": "Teams duplicate feature engineering work",
}

# Feature store benefits
benefits = {
    "single_source_of_truth": "Same features for training and serving",
    "real_time_serving": "Pre-computed features with low latency",
    "feature_reuse": "Catalog of approved, documented features",
    "point_in_time_correctness": "Historical features without data leakage",
    "governance": "Track feature lineage and usage",
}
Feature Store Architecture
# Core components to know for interviews
feature_store_architecture:
  offline_store:
    purpose: "Historical features for training"
    storage: "Data warehouse (BigQuery, Snowflake, Parquet)"
    latency: "Seconds to minutes"
  online_store:
    purpose: "Low-latency features for inference"
    storage: "Key-value store (Redis, DynamoDB)"
    latency: "< 10ms p99"
  feature_registry:
    purpose: "Metadata, schemas, documentation"
    contents: "Feature definitions, owners, SLAs"
  materialization:
    purpose: "Sync offline → online store"
    frequency: "Batch or streaming"
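As a concrete illustration of the materialization component, here is a simplified, hypothetical batch job (the path, key layout, and feature columns are assumptions for this sketch) that copies the latest feature values per user from an offline Parquet table into Redis:
import json

import pandas as pd
import redis

# Hypothetical layout: one row per (user_id, event_timestamp) with feature columns
OFFLINE_PATH = "s3://features/user_transaction_features.parquet"

def materialize_latest(offline_path: str = OFFLINE_PATH) -> None:
    df = pd.read_parquet(offline_path)

    # Keep only the most recent feature row per user
    latest = df.sort_values("event_timestamp").groupby("user_id").tail(1)

    # Write each user's features to the online key-value store
    r = redis.Redis(host="localhost", port=6379)
    for row in latest.itertuples(index=False):
        key = f"user_transaction_features:{row.user_id}"
        r.set(key, json.dumps({
            "transaction_count_7d": int(row.transaction_count_7d),
            "avg_transaction_amount_7d": float(row.avg_transaction_amount_7d),
        }))
Managed feature stores wrap essentially this loop, plus scheduling, backfills, and monitoring.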
Feast Interview Example
Interview Question: "Walk me through implementing a feature store for a fraud detection system."
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
from datetime import timedelta

# Entity: What are we computing features for?
user = Entity(
    name="user_id",
    join_keys=["user_id"],
    description="Unique user identifier",
)

# Offline source: Where historical data lives
transactions_source = FileSource(
    path="s3://features/user_transactions.parquet",
    timestamp_field="event_timestamp",
)

# Feature View: A logical group of features
user_transaction_features = FeatureView(
    name="user_transaction_features",
    entities=[user],
    ttl=timedelta(days=1),  # How long features are valid
    schema=[
        Field(name="transaction_count_7d", dtype=Int64),
        Field(name="avg_transaction_amount_7d", dtype=Float32),
        Field(name="max_transaction_amount_7d", dtype=Float32),
        Field(name="unique_merchants_7d", dtype=Int64),
    ],
    online=True,  # Enable online serving
    source=transactions_source,
    tags={"team": "fraud", "pii": "false"},
)
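These definitions only take effect once they are registered and the online store is populated. A short sketch using Feast's Python API (the `feast apply` and `feast materialize-incremental` CLI commands do the same thing):
from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml

# Register the entity and feature view in the feature registry
store.apply([user, user_transaction_features])

# Sync offline features into the online store up to "now"
store.materialize_incremental(end_date=datetime.utcnow())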
Fetching Features:
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Training: Get historical features (offline)
training_data = store.get_historical_features(
    entity_df=transactions_df,  # DataFrame with user_id, event_timestamp
    features=[
        "user_transaction_features:transaction_count_7d",
        "user_transaction_features:avg_transaction_amount_7d",
    ],
).to_df()

# Inference: Get online features (real-time)
online_features = store.get_online_features(
    features=[
        "user_transaction_features:transaction_count_7d",
        "user_transaction_features:avg_transaction_amount_7d",
    ],
    entity_rows=[{"user_id": 12345}],
).to_dict()
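The `entity_df` passed to `get_historical_features` is what makes point-in-time correctness possible: each row carries the entity key, the timestamp the prediction would have been made at, and usually the label, and Feast only joins feature values observed before that timestamp. A hypothetical example for the fraud use case:
import pandas as pd

# Hypothetical training frame: one row per labeled transaction
transactions_df = pd.DataFrame({
    "user_id": [12345, 67890],
    "event_timestamp": pd.to_datetime(
        ["2024-03-01 14:05:00", "2024-03-02 09:30:00"]
    ),
    "is_fraud": [0, 1],  # label column is passed through to the training set
})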
Data Versioning with DVC
Interview Question: "How do you version training datasets?"
# Initialize DVC in existing Git repo
dvc init
# Track large data files
dvc add data/training_v1.parquet
# This creates data/training_v1.parquet.dvc (pointer file)
# Git tracks .dvc file, DVC tracks actual data
git add data/training_v1.parquet.dvc data/.gitignore
git commit -m "Add training data v1"
# Push data to remote storage
dvc push # Uploads to S3/GCS configured in .dvc/config
Reproducing Training:
# Switch to previous data version
git checkout v1.0.0
dvc checkout # Pulls matching data version
# Run training
python train.py
# Data + code + model are now reproducible
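DVC also has a Python API, so a training script can pin the exact dataset revision it expects rather than relying on whatever happens to be checked out locally (a sketch; `rev` can be any Git tag, branch, or commit):
import io

import pandas as pd
import dvc.api

# Read the dataset bytes exactly as they existed at the v1.0.0 tag
raw_bytes = dvc.api.read("data/training_v1.parquet", rev="v1.0.0", mode="rb")
train_df = pd.read_parquet(io.BytesIO(raw_bytes))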
Interview Trade-offs Discussion
| Approach | Pros | Cons |
|---|---|---|
| Feast | Open source, Kubernetes-native | Requires infrastructure |
| Tecton | Managed, enterprise features | Cost, vendor lock-in |
| Databricks Feature Store | Integrated with Databricks | Databricks-only |
| Custom Redis | Simple, low latency | No feature management |
Expert Insight: In interviews, discuss point-in-time correctness. "We can't use features computed after the prediction timestamp - that's data leakage."
Next, we'll cover pipeline design patterns for interviews.