ML Pipelines & Orchestration

Feature Stores and Data Versioning

Feature stores are critical infrastructure for ML systems. Interviewers test both conceptual understanding and practical implementation.

Why Feature Stores Matter

Interview Question: "Why not just compute features at prediction time?"

Answer Framework:

# Problems without a feature store
problems = {
    "training_serving_skew": "Different code paths = different features",
    "computation_cost": "Recomputing complex aggregations per request",
    "consistency": "No single source of truth for feature definitions",
    "latency": "Complex features can't meet SLA requirements",
    "reuse": "Teams duplicate feature engineering work"
}

# Feature store benefits
benefits = {
    "single_source_of_truth": "Same features for training and serving",
    "real_time_serving": "Pre-computed features with low latency",
    "feature_reuse": "Catalog of approved, documented features",
    "point_in_time_correctness": "Historical features without data leakage",
    "governance": "Track feature lineage and usage"
}
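Training-serving skew is the most concrete of these problems, and it is worth being able to demonstrate it. A minimal sketch (toy data and hypothetical feature logic, not from any particular system) of how two independently written implementations of "the same" 7-day feature can silently disagree at the window boundary:

```python
from datetime import datetime, timedelta

transactions = [
    {"amount": 120.0, "ts": datetime(2024, 1, 1)},
    {"amount": 80.0,  "ts": datetime(2024, 1, 5)},
    {"amount": 200.0, "ts": datetime(2024, 1, 8)},
]

def avg_amount_7d_training(txns, as_of):
    # Training pipeline: window boundary is inclusive (<=)
    window = [t["amount"] for t in txns if as_of - t["ts"] <= timedelta(days=7)]
    return sum(window) / len(window) if window else 0.0

def avg_amount_7d_serving(txns, as_of):
    # Serving path, rewritten by another team: strict inequality (<)
    window = [t["amount"] for t in txns if as_of - t["ts"] < timedelta(days=7)]
    return sum(window) / len(window) if window else 0.0

as_of = datetime(2024, 1, 8)
print(avg_amount_7d_training(transactions, as_of))  # includes the Jan 1 transaction
print(avg_amount_7d_serving(transactions, as_of))   # silently excludes it
```

The model trains on one feature distribution and serves another, with no error raised anywhere. A feature store removes this failure mode by making both paths read from one definition.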

Feature Store Architecture

# Core components to know for interviews
feature_store_architecture:
  offline_store:
    purpose: "Historical features for training"
    storage: "Data warehouse (BigQuery, Snowflake, Parquet)"
    latency: "Seconds to minutes"

  online_store:
    purpose: "Low-latency features for inference"
    storage: "Key-value store (Redis, DynamoDB)"
    latency: "< 10ms p99"

  feature_registry:
    purpose: "Metadata, schemas, documentation"
    contents: "Feature definitions, owners, SLAs"

  materialization:
    purpose: "Sync offline → online store"
    frequency: "Batch or streaming"
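The materialization component above can be sketched in a few lines: a batch job reads the latest row per entity from the offline store and writes it into a key-value online store. Plain dicts stand in for the warehouse and Redis here; this is a conceptual sketch, not any store's internals:

```python
# Offline store: append-only history keyed by entity and timestamp
offline_store = [
    {"user_id": 1, "ts": 1, "transaction_count_7d": 3},
    {"user_id": 1, "ts": 2, "transaction_count_7d": 5},
    {"user_id": 2, "ts": 1, "transaction_count_7d": 9},
]

def materialize(offline_rows):
    """Keep only the newest row per entity, as a key-value map."""
    online = {}
    for row in sorted(offline_rows, key=lambda r: r["ts"]):
        online[row["user_id"]] = {
            k: v for k, v in row.items() if k not in ("user_id", "ts")
        }
    return online

online_store = materialize(offline_store)
print(online_store[1])  # {'transaction_count_7d': 5}  <- latest value wins
```

The online store holds only the current value per key, which is why it can meet a sub-10ms lookup SLA while the offline store keeps full history for training.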

Feast Interview Example

Interview Question: "Walk me through implementing a feature store for a fraud detection system."

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
from datetime import timedelta

# Entity: What are we computing features for?
user = Entity(
    name="user_id",
    join_keys=["user_id"],
    description="Unique user identifier"
)

# Offline source: Where historical data lives
transactions_source = FileSource(
    path="s3://features/user_transactions.parquet",
    timestamp_field="event_timestamp"
)

# Feature View: A logical group of features
user_transaction_features = FeatureView(
    name="user_transaction_features",
    entities=[user],
    ttl=timedelta(days=1),  # How long features are valid
    schema=[
        Field(name="transaction_count_7d", dtype=Int64),
        Field(name="avg_transaction_amount_7d", dtype=Float32),
        Field(name="max_transaction_amount_7d", dtype=Float32),
        Field(name="unique_merchants_7d", dtype=Int64),
    ],
    online=True,  # Enable online serving
    source=transactions_source,
    tags={"team": "fraud", "pii": "false"}
)

Fetching Features:

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Training: Get historical features (offline)
training_data = store.get_historical_features(
    entity_df=transactions_df,  # DataFrame with user_id, event_timestamp
    features=[
        "user_transaction_features:transaction_count_7d",
        "user_transaction_features:avg_transaction_amount_7d"
    ]
).to_df()

# Inference: Get online features (real-time)
online_features = store.get_online_features(
    features=[
        "user_transaction_features:transaction_count_7d",
        "user_transaction_features:avg_transaction_amount_7d"
    ],
    entity_rows=[{"user_id": 12345}]
).to_dict()
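Under the hood, get_historical_features performs a point-in-time join: for each training row it attaches the latest feature value whose timestamp is at or before that row's event_timestamp, never after. A simplified pure-Python version of that join (toy rows and field names, not Feast internals):

```python
def point_in_time_join(entity_rows, feature_rows):
    """For each (user_id, ts) entity row, attach the newest feature value
    with feature ts <= entity ts. Anything later would be data leakage."""
    out = []
    for ent in entity_rows:
        candidates = [
            f for f in feature_rows
            if f["user_id"] == ent["user_id"] and f["ts"] <= ent["ts"]
        ]
        best = max(candidates, key=lambda f: f["ts"], default=None)
        out.append({**ent, "txn_count_7d": best["txn_count_7d"] if best else None})
    return out

features = [
    {"user_id": 1, "ts": 10, "txn_count_7d": 2},
    {"user_id": 1, "ts": 20, "txn_count_7d": 7},  # must NOT leak into ts=15
]
rows = point_in_time_join([{"user_id": 1, "ts": 15}], features)
print(rows)  # [{'user_id': 1, 'ts': 15, 'txn_count_7d': 2}]
```

Being able to explain why the ts=20 value is excluded is exactly the point-in-time correctness discussion interviewers look for.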

Data Versioning with DVC

Interview Question: "How do you version training datasets?"

# Initialize DVC in existing Git repo
dvc init

# Track large data files
dvc add data/training_v1.parquet

# This creates data/training_v1.parquet.dvc (pointer file)
# Git tracks .dvc file, DVC tracks actual data

git add data/training_v1.parquet.dvc data/.gitignore
git commit -m "Add training data v1"

# Push data to remote storage
dvc push  # Uploads to S3/GCS configured in .dvc/config

Reproducing Training:

# Switch to previous data version
git checkout v1.0.0
dvc checkout  # Pulls matching data version

# Run training
python train.py

# Data + code + model are now reproducible
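This works because the .dvc pointer file records a content hash (MD5 by default) of the data: git checkout restores the pointer, and dvc checkout fetches whichever blob matches it. The idea can be sketched with hashlib (a conceptual illustration of content addressing, not DVC's actual file format):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """DVC identifies data by content, not by filename or mtime."""
    return hashlib.md5(data).hexdigest()

blob_store = {}  # stands in for the S3/GCS remote

def push(data: bytes) -> str:
    h = content_hash(data)
    blob_store[h] = data   # analogous to `dvc push`
    return h               # the hash the .dvc pointer records

def checkout(pointer_hash: str) -> bytes:
    return blob_store[pointer_hash]  # analogous to `dvc checkout`

v1 = push(b"label,amount\n0,120\n")
v2 = push(b"label,amount\n0,120\n1,80\n")
assert checkout(v1) != checkout(v2)  # each Git tag resolves to its own data
```

Because the blob store is content-addressed, identical data is deduplicated and any historical pointer can always be resolved back to the exact bytes it was committed with.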

Interview Trade-offs Discussion

Approach                 | Pros                           | Cons
Feast                    | Open source, Kubernetes-native | Requires infrastructure
Tecton                   | Managed, enterprise features   | Cost, vendor lock-in
Databricks Feature Store | Integrated with Databricks     | Databricks-only
Custom Redis             | Simple, low latency            | No feature management

Expert Insight: In interviews, discuss point-in-time correctness. "We can't use features computed after the prediction timestamp; that would be data leakage."

Next, we'll cover pipeline design patterns for interviews.
