Introduction to LLMOps
Key Metrics for LLM Quality
How do you know if your LLM application is "good"? Quality isn't a single number—it's a balance of multiple dimensions.
The Four Pillars of LLM Quality
| Pillar | What It Measures | Example Metric |
|---|---|---|
| Latency | How fast responses are | P95 response time < 2s |
| Cost | How much it costs per request | Cost per 1K requests < $5 |
| Quality | How good the outputs are | Faithfulness score > 0.85 |
| Safety | How safe the outputs are | No harmful content |
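As a rough sketch, the example thresholds in the table can be turned into a single pass/fail check. The `PillarSnapshot` class and `failed_pillars` function are hypothetical names invented for this illustration, and the thresholds are the table's examples, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class PillarSnapshot:
    """Hypothetical container for one evaluation window's measurements."""
    p95_latency_s: float
    cost_per_1k_requests_usd: float
    faithfulness: float
    harmful_outputs: int


def failed_pillars(s: PillarSnapshot) -> list[str]:
    """Return the pillars that miss the example thresholds from the table."""
    checks = {
        "Latency": s.p95_latency_s < 2.0,            # P95 response time < 2s
        "Cost": s.cost_per_1k_requests_usd < 5.0,    # cost per 1K requests < $5
        "Quality": s.faithfulness > 0.85,            # faithfulness score > 0.85
        "Safety": s.harmful_outputs == 0,            # no harmful content
    }
    return [name for name, ok in checks.items() if not ok]


snapshot = PillarSnapshot(
    p95_latency_s=1.6,
    cost_per_1k_requests_usd=6.2,
    faithfulness=0.91,
    harmful_outputs=0,
)
print(failed_pillars(snapshot))  # ['Cost']
```

Reporting which pillars failed, instead of one blended score, keeps the trade-offs between the four dimensions visible.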
Latency Metrics
Users expect fast responses. Track these percentiles:
- P50 (median): Half of requests are faster than this
- P95: 95% of requests are faster than this
- P99: 99% of requests are faster than this
Example threshold: P95 < 3 seconds for chatbot responses
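One simple way to compute these percentiles from logged response times is the nearest-rank method, sketched below. The `percentile` helper and the sample latencies are made up for illustration. Monitoring tools often use interpolating definitions instead, so their numbers may differ slightly.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative response times in seconds (not real measurements).
latencies_s = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.4, 2.2, 4.8]

p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
print("P50:", p50)                         # P50: 0.8
print("P95:", p95)                         # P95: 4.8
print("P95 within 3s budget:", p95 < 3.0)  # False
```

In this sample the median looks healthy while one slow request pushes P95 over budget, which is why the tail percentiles are tracked alongside P50.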
Cost Metrics
Every LLM call has a cost based on tokens:
- Input tokens: What you send to the model
- Output tokens: What the model generates
- Cost per request: (input_tokens × input_price) + (output_tokens × output_price), with both prices expressed per token
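The formula above can be sketched in a few lines. The prices here are placeholders chosen for the arithmetic, not any provider's actual rates. Providers commonly quote prices per million tokens, so the sketch converts to a per-token price first.

```python
def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    input_price_per_token: float,
    output_price_per_token: float,
) -> float:
    """(input_tokens x input_price) + (output_tokens x output_price)."""
    return (
        input_tokens * input_price_per_token
        + output_tokens * output_price_per_token
    )


# Placeholder prices in USD per 1M tokens (assumed values for illustration only).
INPUT_PRICE_PER_TOKEN = 3.00 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 15.00 / 1_000_000

cost = cost_per_request(1200, 300, INPUT_PRICE_PER_TOKEN, OUTPUT_PRICE_PER_TOKEN)
print(f"Cost per request: ${cost:.4f}")             # $0.0081
print(f"Cost per 1K requests: ${cost * 1000:.2f}")  # $8.10
```

Output tokens are often priced higher than input tokens, so a short prompt that triggers a long answer can cost more than a long prompt with a short answer.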
Track:
- Daily/weekly spend: Are we on budget?
- Cost per user action: How much does each chat turn cost?
- Cost anomalies: Sudden spikes in token usage
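A minimal sketch of spotting sudden spikes in token usage: compare each day against the median of the preceding days. The function name, the 7-day window, and the 2x factor are all assumptions for illustration and would need tuning against real traffic.

```python
from statistics import median


def find_token_spikes(
    daily_tokens: list[int], window: int = 7, factor: float = 2.0
) -> list[int]:
    """Return indexes of days whose usage exceeds `factor` times the
    median of the previous `window` days."""
    spikes = []
    for i in range(window, len(daily_tokens)):
        baseline = median(daily_tokens[i - window:i])
        if baseline > 0 and daily_tokens[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Illustrative daily token totals (not real data); day 8 is the spike.
usage = [100_000, 110_000, 95_000, 105_000, 98_000,
         102_000, 107_000, 104_000, 310_000, 101_000]
print(find_token_spikes(usage))  # [8]
```

Using the median as the baseline means that one spike does not inflate the baseline for the days that follow it.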
Quality Metrics
Quality is multidimensional. Common metrics include:
| Metric | What It Measures |
|---|---|
| Faithfulness | Does the response stick to provided context? |
| Answer Relevancy | Does the response address the question? |
| Coherence | Is the response logically structured? |
| Correctness | Is the information accurate? |
| Completeness | Does it cover all aspects of the question? |
Safety Metrics
Production systems must detect and filter harmful content. Track:
- Toxicity detection: Hate speech, harassment, violence
- PII detection: Leakage of personally identifiable information
- Prompt injection detection: Attempts to manipulate the system
- Refusal rate: How often the model refuses requests that it should refuse
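To make PII detection concrete, here is a deliberately simple regex-based sketch. The two patterns (email addresses and US Social Security number formatting) and the `find_pii` helper are illustrative assumptions. Real PII detection needs far broader coverage than this and usually relies on dedicated tooling.

```python
import re

# Illustrative patterns only; they miss many PII types and formats.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return the matches for each pattern that appears in the text."""
    hits = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


response = "Contact jane.doe@example.com, SSN 123-45-6789."
print(find_pii(response))
# {'email': ['jane.doe@example.com'], 'us_ssn': ['123-45-6789']}
print(find_pii("The capital of France is Paris."))  # {}
```

A check like this can run on model outputs before they reach the user, with the share of flagged responses tracked as a safety metric.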
Choosing Your Metrics
Not every metric matters for every use case:
| Use Case | Priority Metrics |
|---|---|
| Customer support chatbot | Latency, Relevancy, Safety |
| Code generation assistant | Correctness, Cost |
| RAG document Q&A | Faithfulness, Completeness |
| Creative writing helper | Coherence, Relevancy |
Start simple: Pick 2-3 metrics that matter most. Add more as you mature.
In the next lesson, we'll explore evaluation-driven development: the practice of putting evaluation at the center of your workflow.