AI Product Metrics & Evaluation

AI-Specific Success Metrics

Traditional product metrics (conversion, engagement, retention) still matter. But AI features need additional metrics to track model health.

The AI Metrics Hierarchy

┌─────────────────────────────────────┐
│      Business Metrics               │  ← What leadership cares about
│  (Revenue, Conversion, Retention)   │
├─────────────────────────────────────┤
│      Product Metrics                │  ← What PMs track
│  (Engagement, Task Completion)      │
├─────────────────────────────────────┤
│      Model Metrics                  │  ← What ML teams track
│  (Accuracy, Precision, Recall)      │
├─────────────────────────────────────┤
│      Operational Metrics            │  ← What keeps things running
│  (Latency, Throughput, Cost)        │
└─────────────────────────────────────┘

Core Model Metrics

Accuracy vs Precision vs Recall

These measure different aspects of correctness:

Metric     Formula                                          Use When
---------  -----------------------------------------------  --------------------------
Accuracy   (Correct predictions) / (Total predictions)      Classes are balanced
Precision  (True positives) / (All predicted positives)     False positives are costly
Recall     (True positives) / (All actual positives)        Missing cases is costly
F1 Score   2 × (Precision × Recall) / (Precision + Recall)  You need balance
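
If you log raw confusion counts from a labeled test set, all four metrics fall out of a few lines of code. A minimal Python sketch (the spam-filter counts below are hypothetical):

    def classification_metrics(tp, fp, fn, tn):
        """Compute the four core metrics from binary confusion counts."""
        total = tp + fp + fn + tn
        accuracy = (tp + tn) / total
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return {"accuracy": accuracy, "precision": precision,
                "recall": recall, "f1": f1}

    # Hypothetical test set of 1,000 emails
    print(classification_metrics(tp=80, fp=20, fn=40, tn=860))
    # accuracy 0.94, precision 0.80, recall ~0.67, f1 ~0.73

Note how the same model can look strong on precision (0.80) and weak on recall (~0.67); which number matters depends on the use case, as the examples below show.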

Real-World Examples

Spam Filter:

  • High precision = Few legitimate emails marked as spam (users trust the filter)
  • High recall = Few spam emails get through (inbox stays clean)
  • Priority: Precision (users hate losing real emails)

Fraud Detection:

  • High precision = Few false fraud alerts (less customer friction)
  • High recall = Catching most actual fraud (protecting revenue)
  • Priority: Recall (missing fraud is expensive)

Content Moderation:

  • High precision = Few wrongful takedowns (user trust)
  • High recall = Removing most violations (platform safety)
  • Priority: Depends on platform values

Operational Metrics

Latency

How long predictions take:

Metric       Definition                     Typical Targets
-----------  -----------------------------  ---------------------
p50 latency  50th percentile response time  <100 ms for real-time
p95 latency  95th percentile response time  <500 ms for real-time
p99 latency  99th percentile response time  <1,000 ms acceptable

Why it matters: Slow AI kills user experience. A recommendation that takes 2 seconds feels broken.
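
Percentiles are straightforward to compute from request logs. A sketch using Python's standard library (the latency values are made up):

    import statistics

    # Hypothetical response times (ms) from one day of request logs
    latencies_ms = [42, 55, 48, 61, 95, 110, 38, 450, 72, 66]

    # quantiles(n=100) returns the 99 percentile cut points;
    # index k-1 is the k-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100)
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")

A single slow outlier (450 ms) barely moves p50 but dominates p99, which is why dashboards track tail latency rather than averages.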

Throughput

How many predictions per second your system handles:

Scale       Typical Needs
----------  -----------------------------
Small app   10-100 requests/second
Medium app  100-1,000 requests/second
Large app   1,000-10,000+ requests/second

Cost Per Inference

What each prediction costs:

Cost per inference = (API cost OR infrastructure cost) / number of predictions

Example calculations:

Provider           Pricing                               Cost at 1M predictions/month
-----------------  ------------------------------------  ----------------------------
GPT-4o             ~$10/1M input + $30/1M output tokens  ~$2,000-4,000
Claude 3.5 Sonnet  ~$3/1M input + $15/1M output tokens   ~$900-1,800
Self-hosted Llama  ~$5,000/month infrastructure          ~$5,000 fixed
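
Here is the formula above as a back-of-envelope sketch for an API-priced model. The prices mirror the Claude row of the table; the per-request token counts are assumptions to replace with your own usage logs:

    def cost_per_inference(input_tokens, output_tokens,
                           price_in_per_m, price_out_per_m):
        """Dollar cost of one API call, given per-million-token prices."""
        return (input_tokens * price_in_per_m +
                output_tokens * price_out_per_m) / 1_000_000

    # Assume ~100 input and ~50 output tokens per request at $3/$15 per 1M
    per_call = cost_per_inference(100, 50, price_in_per_m=3.0,
                                  price_out_per_m=15.0)
    print(f"${per_call:.4f}/call -> ${per_call * 1_000_000:,.0f} per 1M calls")
    # roughly $0.001/call, so ~$1,050 per 1M calls

At that per-call cost, the fixed ~$5,000/month self-hosted option only breaks even somewhere past a few million predictions per month.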

Quality Over Time Metrics

AI models degrade. Track these:

Model Drift

When model performance changes over time:

Type           Cause                            Detection
-------------  -------------------------------  ---------------------------
Data drift     Input data distribution changes  Compare input distributions
Concept drift  What "correct" means changes     Track accuracy over time
Label drift    Labeling practices change        Audit labeling consistency
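
Data drift is the easiest of the three to automate. A sketch using a two-sample Kolmogorov-Smirnov test from SciPy on one numeric input feature; the synthetic arrays stand in for your training data and this week's production inputs:

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(seed=0)
    # Stand-ins for one feature's values at training time vs. in production
    training_values = rng.normal(loc=50, scale=10, size=5_000)
    live_values = rng.normal(loc=58, scale=10, size=5_000)  # mean has shifted

    stat, p_value = ks_2samp(training_values, live_values)
    if p_value < 0.01:
        print(f"Possible data drift (KS={stat:.3f}, p={p_value:.1e}); investigate")

In practice you would run a check like this per feature on a schedule, which is what the weekly "input distribution" row in the dashboard below refers to.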

Monitoring Dashboard Essentials

Metric              Frequency  Alert Threshold
------------------  ---------  -----------------
Accuracy/F1         Daily      >5% drop
Latency p99         Real-time  >2x baseline
Error rate          Real-time  >1%
Cost per day        Daily      >20% increase
Input distribution  Weekly     Significant shift
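
Those thresholds translate directly into code. A minimal sketch of a daily check, with hypothetical baseline and current readings:

    # Hypothetical baseline vs. today's readings from your metrics store
    baseline = {"f1": 0.88, "p99_ms": 420, "error_rate": 0.004, "daily_cost": 310.0}
    today    = {"f1": 0.82, "p99_ms": 900, "error_rate": 0.012, "daily_cost": 305.0}

    alerts = []
    if today["f1"] < baseline["f1"] * 0.95:                 # >5% relative drop
        alerts.append("F1 dropped more than 5%")
    if today["p99_ms"] > baseline["p99_ms"] * 2:            # >2x baseline
        alerts.append("p99 latency above 2x baseline")
    if today["error_rate"] > 0.01:                          # >1% errors
        alerts.append("error rate above 1%")
    if today["daily_cost"] > baseline["daily_cost"] * 1.2:  # >20% increase
        alerts.append("daily cost up more than 20%")

    for alert in alerts:
        print("ALERT:", alert)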

Connecting AI Metrics to Business Metrics

The metrics that matter most depend on your use case:

AI Feature          Key AI Metric        Business Metric Link
------------------  -------------------  ----------------------
Search ranking      Click-through rate   Revenue per search
Recommendations     Precision@K          Conversion rate
Content moderation  Recall               Platform trust/safety
Chatbot             Resolution rate      Support cost reduction
Fraud detection     False positive rate  Customer friction

Setting Metric Targets

Use this framework:

1. Find the Baseline

What's current performance without AI (or with the old system)?

2. Research Benchmarks

What do similar systems achieve? Industry standards:

Task                      Good  Excellent
------------------------  ----  ---------
Text classification       85%   95%
Sentiment analysis        80%   90%
Named entity recognition  85%   95%
Content moderation        90%   97%

3. Calculate Business Impact

If we improve [metric] from X% to Y%,
we expect [business outcome] to improve by Z%
because [reasoning].
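
A worked version of the template, with hypothetical numbers for a fraud-detection model:

    # "If we improve recall from 80% to 90%, we expect monthly fraud losses
    #  to drop by ~50%, because half the fraud we currently miss gets caught."
    recall_now, recall_target = 0.80, 0.90
    missed_fraud_loss = 500_000  # hypothetical $ lost per month to missed fraud

    # Share of currently missed fraud the improved model would catch
    newly_caught = (recall_target - recall_now) / (1 - recall_now)
    print(f"~{newly_caught:.0%} of missed fraud caught "
          f"-> ~${missed_fraud_loss * newly_caught:,.0f}/month recovered")
    # ~50% -> ~$250,000/month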

4. Set Realistic Targets

Timeline  Target
--------  ----------------------
MVP       Beat baseline by 10%+
V1        Reach industry average
Mature    Reach top quartile

Key Takeaway

AI metrics exist on multiple levels. Model metrics (accuracy) feed into product metrics (task completion), which drive business metrics (revenue). Track all layers, but always connect back to user and business value.


Next: How do you design user experiences for AI features that don't always get it right?
