Monitoring & Observability
Logging and Debugging ML Systems
4 min read
Production ML debugging requires specialized techniques. Interviewers test your ability to diagnose issues that span data, models, and infrastructure.
ML-Specific Logging Requirements
| Log Type | What to Capture | Why It Matters |
|---|---|---|
| Request logs | Input features, timestamps | Reproduce issues |
| Prediction logs | Output, confidence, latency | Debug wrong predictions |
| Model logs | Version, load time, errors | Track model lifecycle |
| Data logs | Schema, null rates, ranges | Detect data issues |
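The first three rows are request-scoped and covered by the prediction-logging example below; data logs are usually emitted once per batch. As a minimal sketch (assuming pandas batches and the structlog logger configured in the next section; log_data_quality and its field names are illustrative, not a fixed schema):
import pandas as pd
import structlog

logger = structlog.get_logger()

def log_data_quality(batch: pd.DataFrame, source: str):
    # Batch-level data log: schema, null rates, and numeric ranges
    numeric = batch.select_dtypes("number")
    logger.info(
        "data_quality_snapshot",
        source=source,
        row_count=len(batch),
        schema={col: str(dtype) for col, dtype in batch.dtypes.items()},
        null_rates=batch.isna().mean().round(4).to_dict(),
        numeric_ranges={
            col: [float(numeric[col].min()), float(numeric[col].max())]
            for col in numeric.columns
        },
    )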
Structured Logging for ML
import structlog
from datetime import datetime

# Configure structured logging: ISO timestamps, JSON output
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()

def log_prediction(request_id, model_name, features, prediction, latency_ms):
    logger.info(
        "prediction_completed",
        request_id=request_id,
        model_name=model_name,
        model_version=get_model_version(model_name),  # helper that looks up the deployed version
        feature_count=len(features),
        # Sample features for debugging (not all - privacy + size)
        feature_sample={k: features[k] for k in list(features.keys())[:5]},
        prediction=prediction,
        confidence=prediction.get("confidence"),
        latency_ms=latency_ms,
        timestamp=datetime.utcnow().isoformat()
    )
Output:
{
  "event": "prediction_completed",
  "request_id": "abc-123",
  "model_name": "fraud_detector",
  "model_version": "v2.3.1",
  "feature_count": 45,
  "feature_sample": {"amount": 150.0, "merchant_id": "M123"},
  "prediction": {"label": "fraud", "score": 0.87},
  "confidence": 0.87,
  "latency_ms": 23,
  "timestamp": "2026-01-04T10:30:00Z"
}
Interview Question: Debug Wrong Predictions
Question: "Users report the fraud model is flagging legitimate transactions. How do you debug?"
Structured Debugging Approach:
def debug_false_positives(flagged_transactions):
    """
    Step-by-step debugging for wrong predictions
    """
    debug_steps = {}

    # Step 1: Reproduce the prediction
    debug_steps["1_replay_prediction"] = """
    - Fetch the exact input features from prediction logs
    - Run the same model version with those features
    - Verify we get the same prediction (reproducibility)
    """

    debug_steps["2_analyze_features"] = """
    - Compare flagged transactions to training data
    - Identify outlier features (amount, time, location)
    - Check for missing or unexpected values
    """

    debug_steps["3_check_feature_pipeline"] = """
    - Verify feature store is returning correct values
    - Check for stale cached features
    - Look for feature transformation bugs
    """

    debug_steps["4_analyze_model_behavior"] = """
    - Get feature importance for these predictions
    - Use SHAP/LIME for local explanations
    - Identify which features drove the decision
    """

    debug_steps["5_check_for_drift"] = """
    - Compare production data distribution to training
    - Check if this represents a new user segment
    - Evaluate if model was trained on similar cases
    """

    return debug_steps
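Step 1 is the one worth making concrete in an interview. A minimal replay sketch, assuming the prediction log store keeps the full feature payload and that fetch_prediction_log and load_model are hypothetical helpers for your log store and model registry:
def replay_prediction(request_id: str):
    # Hypothetical helpers: look up the logged request and load the exact model version
    logged = fetch_prediction_log(request_id)
    model = load_model(logged["model_name"], logged["model_version"])

    # Re-run the exact logged features through the exact model version
    replayed = model.predict(logged["features"])

    # A mismatch points at serving (stale features, version skew) rather than the model
    return {
        "request_id": request_id,
        "original": logged["prediction"],
        "replayed": replayed,
        "match": replayed == logged["prediction"],
    }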
Tracing for ML Systems
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Setup tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

@tracer.start_as_current_span("predict")
def predict(request):
    # Attributes set via this handle land on the top-level "predict" span
    span = trace.get_current_span()

    # Trace feature retrieval (fetch_from_feature_store is defined elsewhere)
    with tracer.start_as_current_span("fetch_features"):
        features = fetch_from_feature_store(request.user_id)
        span.set_attribute("feature_count", len(features))

    # Trace preprocessing
    with tracer.start_as_current_span("preprocess"):
        start = time.perf_counter()
        processed = preprocess(features)
        span.set_attribute("preprocessing_time_ms", (time.perf_counter() - start) * 1000)

    # Trace inference
    with tracer.start_as_current_span("model_inference"):
        prediction = model.predict(processed)
        span.set_attribute("model_version", model.version)
        span.set_attribute("prediction", str(prediction))

    return prediction
Log Aggregation Strategy
# Recommended log pipeline
log_architecture:
  application_logs:
    format: "JSON structured"
    destination: "Loki or Elasticsearch"
    retention: "30 days"

  prediction_logs:
    format: "JSON with request_id correlation"
    destination: "Dedicated ML log store (BigQuery)"
    retention: "90 days (for model debugging)"

  metrics:
    format: "Prometheus exposition format"
    destination: "Prometheus + Thanos (long-term)"
    retention: "1 year"

  traces:
    format: "OpenTelemetry"
    destination: "Jaeger or Tempo"
    retention: "7 days"
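The metrics entry assumes the serving path actually exposes counters and histograms in that format. A small sketch with the official prometheus_client library (metric names, the port, and the model.name / model.version attributes are illustrative):
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names - keep them aligned with your dashboards and alerts
PREDICTIONS = Counter(
    "ml_predictions_total", "Predictions served", ["model_name", "model_version"]
)
LATENCY = Histogram(
    "ml_prediction_latency_seconds", "End-to-end prediction latency", ["model_name"]
)

def predict_with_metrics(model, request):
    with LATENCY.labels(model_name=model.name).time():
        prediction = model.predict(request.features)
    PREDICTIONS.labels(model_name=model.name, model_version=model.version).inc()
    return prediction

# Expose /metrics for Prometheus to scrape
start_http_server(9100)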
Common Debugging Scenarios
| Symptom | Likely Cause | Investigation |
|---|---|---|
| Sudden accuracy drop | Data pipeline change | Check upstream schema |
| Gradual accuracy drop | Concept drift | Compare distributions |
| High latency spikes | Memory pressure | Check GPU memory logs |
| Random errors | Intermittent dependency | Check trace spans |
| Wrong predictions | Feature bug | Compare logged vs expected features |
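For the "compare distributions" investigations, a per-feature two-sample test is usually enough to localize drift. A rough sketch using scipy's KS test on pandas DataFrames (the threshold is illustrative and should account for how many features you test):
from scipy.stats import ks_2samp

def drift_report(training_df, production_df, threshold=0.01):
    # Flag numeric features whose production distribution shifted vs. the training snapshot
    drifted = {}
    for col in training_df.select_dtypes("number").columns:
        stat, p_value = ks_2samp(training_df[col].dropna(), production_df[col].dropna())
        if p_value < threshold:
            drifted[col] = {"ks_stat": round(float(stat), 4), "p_value": float(p_value)}
    return drifted  # empty dict -> no per-feature drift detected at this threshold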
Pro Tip: "We keep prediction logs for 90 days so we can replay any prediction when users report issues. The request_id is our correlation key across all systems."
Next module covers CI/CD for ML Systems interview questions.