AI Product Metrics & Evaluation
AI-Specific Success Metrics
Traditional product metrics (conversion, engagement, retention) still matter. But AI features need additional metrics to track model health.
The AI Metrics Hierarchy
```
┌─────────────────────────────────────┐
│ Business Metrics                    │ ← What leadership cares about
│ (Revenue, Conversion, Retention)    │
├─────────────────────────────────────┤
│ Product Metrics                     │ ← What PMs track
│ (Engagement, Task Completion)       │
├─────────────────────────────────────┤
│ Model Metrics                       │ ← What ML teams track
│ (Accuracy, Precision, Recall)       │
├─────────────────────────────────────┤
│ Operational Metrics                 │ ← What keeps things running
│ (Latency, Throughput, Cost)         │
└─────────────────────────────────────┘
```
Core Model Metrics
Accuracy vs Precision vs Recall
These measure different aspects of correctness:
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | (Correct predictions) / (Total predictions) | Classes are balanced |
| Precision | (True positives) / (All predicted positives) | False positives are costly |
| Recall | (True positives) / (All actual positives) | Missing cases is costly |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | You need balance |
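To make the formulas concrete, here is a minimal Python sketch that computes all four from confusion-matrix counts; the spam-filter numbers in the example are illustrative, not real benchmarks:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Example: a spam filter evaluated on 1,000 labeled emails
print(classification_metrics(tp=80, fp=5, fn=20, tn=895))
```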
Real-World Examples
Spam Filter:
- High precision = Few legitimate emails marked as spam (users trust the filter)
- High recall = Few spam emails get through (inbox stays clean)
- Priority: Precision (users hate losing real emails)
Fraud Detection:
- High precision = Few false fraud alerts (less customer friction)
- High recall = Catching most actual fraud (protecting revenue)
- Priority: Recall (missing fraud is expensive)
Content Moderation:
- High precision = Few wrongful takedowns (user trust)
- High recall = Removing most violations (platform safety)
- Priority: Depends on platform values
Operational Metrics
Latency
How long predictions take:
| Metric | Definition | Typical Targets |
|---|---|---|
| p50 latency | 50th percentile response time | <100ms for real-time |
| p95 latency | 95th percentile response time | <500ms for real-time |
| p99 latency | 99th percentile response time | <1000ms acceptable |
Why it matters: Slow AI kills user experience. A recommendation that takes 2 seconds feels broken.
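As a rough, stack-agnostic illustration, these percentiles can be computed straight from logged response times; the simulated latencies below stand in for your real request logs:

```python
import math
import random

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples (in ms)."""
    ranked = sorted(samples)
    rank = math.ceil(pct / 100 * len(ranked))  # nearest-rank method
    return ranked[max(rank - 1, 0)]

# Simulated response times for 10,000 requests (placeholder for real logs)
latencies_ms = [random.lognormvariate(4.0, 0.5) for _ in range(10_000)]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct):.0f} ms")
```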
Throughput
How many predictions per second your system handles:
| Scale | Typical Needs |
|---|---|
| Small app | 10-100 requests/second |
| Medium app | 100-1,000 requests/second |
| Large app | 1,000-10,000+ requests/second |
Cost Per Inference
What each prediction costs:
Cost per inference = (API cost OR infrastructure cost) / number of predictions
Example calculations:
| Provider | Pricing | Est. cost at 1M predictions/month |
|---|---|---|
| GPT-4o | ~$10/1M input + $30/1M output tokens | ~$2,000-4,000 |
| Claude 3.5 Sonnet | ~$3/1M input + $15/1M output tokens | ~$900-1,800 |
| Self-hosted Llama | ~$5,000/month infrastructure | ~$5,000 fixed |
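A back-of-the-envelope sketch for the API-priced case; the token counts and per-million-token prices are assumptions for illustration, not quotes:

```python
def cost_per_inference(input_tokens: int, output_tokens: int,
                       price_in_per_m: float, price_out_per_m: float) -> float:
    """Per-request API cost from average token counts and per-million-token prices."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# Hypothetical request: 200 input tokens, 80 output tokens at $3 / $15 per 1M tokens
per_request = cost_per_inference(200, 80, 3.0, 15.0)
print(f"Cost per inference: ${per_request:.4f}")                          # $0.0018
print(f"Cost at 1M predictions/month: ${per_request * 1_000_000:,.0f}")   # $1,800
```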
Quality Over Time Metrics
A model's performance degrades over time as the world it was trained on changes. Track these:
Model Drift
When model performance changes over time:
| Type | Cause | Detection |
|---|---|---|
| Data drift | Input data distribution changes | Compare input distributions |
| Concept drift | What "correct" means changes | Track accuracy over time |
| Label drift | How humans label changes | Audit labeling consistency |
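Data drift is the easiest of the three to automate. One common, model-agnostic check is the Population Stability Index (PSI) on key input features; the sketch below assumes you keep a sample of training-time inputs around as a baseline:

```python
import numpy as np

def population_stability_index(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a training-time feature distribution and live traffic.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    expected, _ = np.histogram(baseline, bins=edges)
    # Clip live values into the baseline range so outliers land in the outermost bins
    actual, _ = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)
    expected = np.clip(expected / expected.sum(), 1e-6, None)
    actual = np.clip(actual / actual.sum(), 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Example: order values shift upward between training data and this week's traffic
rng = np.random.default_rng(0)
train_sample = rng.normal(50, 10, 10_000)   # baseline inputs
live_sample = rng.normal(58, 12, 10_000)    # current week's inputs
print(f"PSI: {population_stability_index(train_sample, live_sample):.2f}")
```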
Monitoring Dashboard Essentials
| Metric | Frequency | Alert Threshold |
|---|---|---|
| Accuracy/F1 | Daily | >5% drop |
| Latency p99 | Real-time | >2x baseline |
| Error rate | Real-time | >1% |
| Cost per day | Daily | >20% increase |
| Input distribution | Weekly | Significant shift |
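A minimal sketch of how those thresholds might be checked in a scheduled job; the baseline and current values below are made up, and in practice both would come from your metrics store:

```python
# Hypothetical daily check against the alert thresholds in the table above.
baseline = {"f1": 0.88, "latency_p99_ms": 450, "error_rate": 0.004, "daily_cost_usd": 120}
today    = {"f1": 0.82, "latency_p99_ms": 1100, "error_rate": 0.012, "daily_cost_usd": 155}

alerts = []
if today["f1"] < baseline["f1"] * 0.95:                          # >5% relative drop
    alerts.append("F1 dropped more than 5%")
if today["latency_p99_ms"] > baseline["latency_p99_ms"] * 2:     # >2x baseline p99
    alerts.append("p99 latency above 2x baseline")
if today["error_rate"] > 0.01:                                   # error rate above 1%
    alerts.append("Error rate above 1%")
if today["daily_cost_usd"] > baseline["daily_cost_usd"] * 1.2:   # >20% cost increase
    alerts.append("Daily cost up more than 20%")

print(alerts or "All metrics within thresholds")
```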
Connecting AI Metrics to Business Metrics
The metrics that matter most depend on your use case:
| AI Feature | Key AI Metric | Business Metric Link |
|---|---|---|
| Search ranking | Click-through rate | Revenue per search |
| Recommendations | Precision@K | Conversion rate |
| Content moderation | Recall | Platform trust/safety |
| Chatbot | Resolution rate | Support cost reduction |
| Fraud detection | False positive rate | Customer friction |
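Precision@K, listed above for recommendations, is straightforward to compute per user and then average across sessions; a small sketch:

```python
def precision_at_k(recommended: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-K recommended items the user actually engaged with."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

# Example: 3 of the top 5 recommendations were clicked or purchased
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e", "z"}, k=5))  # 0.6
```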
Setting Metric Targets
Use this framework:
1. Find the Baseline
What's current performance without AI (or with the old system)?
2. Research Benchmarks
What do similar systems achieve? Rough industry benchmarks:
| Task | Good | Excellent |
|---|---|---|
| Text classification | 85% | 95% |
| Sentiment analysis | 80% | 90% |
| Named entity recognition | 85% | 95% |
| Content moderation | 90% | 97% |
3. Calculate Business Impact
If we improve [metric] from X% to Y%,
we expect [business outcome] to improve by Z%
because [reasoning].
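A hypothetical worked version of that statement for a fraud-detection feature; every number below is an assumption chosen for illustration:

```python
# "If we improve fraud recall from 80% to 90%, we expect monthly fraud losses
#  to drop by ~50%, because uncaught fraud falls from 20% to 10% of attempts."
monthly_fraud_exposure_usd = 1_000_000          # assumed value of attempted fraud
recall_before, recall_after = 0.80, 0.90
losses_before = monthly_fraud_exposure_usd * (1 - recall_before)   # $200,000
losses_after = monthly_fraud_exposure_usd * (1 - recall_after)     # $100,000
print(f"Expected monthly savings: ${losses_before - losses_after:,.0f}")
```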
4. Set Realistic Targets
| Timeline | Target |
|---|---|
| MVP | Beat baseline by 10%+ |
| V1 | Reach industry average |
| Mature | Reach top quartile |
Key Takeaway
AI metrics exist on multiple levels. Model metrics (accuracy) feed into product metrics (task completion) which drive business metrics (revenue). Track all layers, but always connect back to user and business value.
Next: How do you design user experiences for AI features that don't always get it right?