Root Cause Analysis

"A metric dropped 20% last week. How would you investigate?" This question tests your structured problem-solving ability under ambiguity.

The Investigation Framework

Follow this systematic approach:

1. Clarify and scope the problem
2. Validate the data
3. Segment the decline
4. Generate hypotheses
5. Test hypotheses with data
6. Identify root cause
7. Recommend actions

Step 1: Clarify and Scope

Before diving in, ask clarifying questions:

Question	Why It Matters
Which metric exactly?	"Revenue" could mean many things
What's the time period?	Week-over-week vs year-over-year
How much of a drop?	5% vs 50% requires different urgency
Any known events?	Product launch, holiday, outage

Example: "Before I investigate, I want to confirm: we're looking at daily active users, which dropped 20% this week compared to last week. Is this the first time we've seen this pattern, or is it seasonal?"

Step 2: Validate the Data

Never assume the data is correct:

Check	What to Look For
Logging changes	Did tracking code change?
Definition changes	Did the metric calculation change?
Data pipeline issues	ETL failures, duplicates, delays
Outliers	Single large customer or bot activity

Interview insight: "My first step would be to check if this is a real change or a data quality issue. I'd look at whether our logging changed or if there were ETL failures."

Step 3: Segment the Decline

Decompose the metric to find where the drop is concentrated:

Common segmentation dimensions:

Platform (iOS, Android, Web)
Geography (country, region)
User type (new vs returning)
Traffic source (organic, paid, referral)
Device (mobile, desktop)
Time (weekday vs weekend, hour of day)

MECE Principle: Mutually Exclusive, Collectively Exhaustive - segments shouldn't overlap and should cover everything.

Example decomposition:
Total DAU: -20%
├── Mobile: -25%
│   ├── iOS: -5%
│   └── Android: -40% ← Problem here!
└── Desktop: -10%

Step 4: Generate Hypotheses

Based on segmentation, brainstorm possible causes:

Segment Finding	Possible Causes
Android-specific drop	App update bug, Play Store issue, competitor launch
New users only	Marketing campaign ended, onboarding change
Specific geography	Local event, payment processor issue, regulatory change
Specific time	Server outage, traffic spike, API failure

Structure your hypotheses:

Internal technical (bugs, performance)
Internal product (feature changes, UI updates)
External market (competitor, economy)
External technical (platform changes, API updates)

Step 5: Test with Data

Prioritize hypotheses by:

Likelihood (based on pattern)
Data availability (can we check this?)
Actionability (can we fix it?)

Example testing:

Hypothesis: Android app update caused bug
Test: Check app version in data
Finding: 95% of drop is from v4.2 released Monday
Conclusion: Likely root cause found

Interview Example

Question: "Homepage conversion rate dropped 15% yesterday. Walk me through your investigation."

Strong answer:

"I'd approach this systematically:

Clarify: Is this 15% relative (5% → 4.25%) or absolute? What's the confidence interval?

Validate: Check if tracking code changed, look for data pipeline issues.

Segment: Break down by:

Device: Did one platform drop more?
Traffic source: Organic vs paid vs email
User type: New vs returning
Geography: Regional issues?
Time of day: Sudden drop or gradual?

Hypotheses based on pattern:

If sudden drop at specific time → likely technical (deploy, outage)
If gradual all day → likely product/external
If one traffic source → likely attribution or marketing issue

Test: I'd query each hypothesis:

SELECT
    DATE_TRUNC('hour', timestamp),
    platform,
    COUNT(*) as sessions,
    SUM(converted) / COUNT(*) as conversion_rate
FROM homepage_visits
WHERE date >= CURRENT_DATE - 2
GROUP BY 1, 2
ORDER BY 1, 2

Root cause + recommendation: Based on findings, identify the cause and recommend either a revert, fix, or further investigation."

The best answers show structured thinking and data-driven hypothesis testing, not jumping to conclusions. :::