Key Metrics for LLM Quality

How do you know if your LLM application is "good"? Quality isn't a single number—it's a balance of multiple dimensions.

The Four Pillars of LLM Quality

The four pillars are latency, cost, response quality, and safety. Each is covered in turn below.

Users expect fast responses. Track these percentiles:

Example threshold: P95 < 3 seconds for chatbot responses
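As a rough illustration of the threshold above, the sketch below computes latency percentiles from a list of recorded response times and checks P95 against a 3-second limit. The nearest-rank method, the sample values, and the helper names `percentile` and `meets_latency_threshold` are assumptions made for this example, not part of any particular monitoring tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Rank is 1-based; pct is expected to be an integer from 1 to 100.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_threshold(samples, pct=95, limit_seconds=3.0):
    """True if the chosen percentile is strictly under the limit."""
    return percentile(samples, pct) < limit_seconds


# Hypothetical response times in seconds for ten chatbot requests.
latencies = [0.8, 1.1, 0.9, 1.4, 2.2, 1.0, 1.3, 0.7, 2.9, 4.5]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
within_budget = meets_latency_threshold(latencies)
```

With these sample values the median looks healthy, yet one slow request pushes P95 over the limit. This is why percentiles tend to be more informative than an average.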

Every LLM call has a cost based on tokens:

Input tokens: What you send to the model
Output tokens: What the model generates
Cost per request: (input_tokens × input_price) + (output_tokens × output_price)
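The cost formula above translates directly into code. This is a minimal sketch: the per-token prices are made-up placeholders (real prices vary by provider and model), and `cost_per_request` is a hypothetical helper name.

```python
def cost_per_request(input_tokens, output_tokens, input_price, output_price):
    """(input_tokens x input_price) + (output_tokens x output_price).

    Prices are per single token, in whatever currency you bill in.
    """
    return input_tokens * input_price + output_tokens * output_price


# Placeholder prices, expressed per token (assumed: 0.50 and 1.50 per million tokens).
INPUT_PRICE = 0.50 / 1_000_000
OUTPUT_PRICE = 1.50 / 1_000_000

# A request that sends 1,200 tokens and receives 300 tokens back.
cost = cost_per_request(1200, 300, INPUT_PRICE, OUTPUT_PRICE)
print(f"cost per request: {cost:.6f}")
```

Output tokens are often priced higher than input tokens, so a short prompt with a long answer can cost more than the reverse.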

Track:

Quality is multidimensional. Common metrics include:

Metric	What It Measures
Faithfulness	Does the response stick to provided context?
Answer Relevancy	Does the response address the question?
Coherence	Is the response logically structured?
Correctness	Is the information accurate?
Completeness	Does it cover all aspects of the question?
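One way to keep these dimensions separate rather than collapsing them into a single number is to record a score per metric and flag the ones that fall short. The sketch below assumes each metric has already been scored on a 0 to 1 scale by some evaluation process. The scale, the 0.7 minimum, and the names `QualityScores` and `failing_metrics` are illustrative assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class QualityScores:
    """One score per metric from the table, assumed to lie in [0, 1]."""
    faithfulness: float
    answer_relevancy: float
    coherence: float
    correctness: float
    completeness: float


def failing_metrics(scores, minimum=0.7):
    """Return the names of metrics whose score is below the minimum."""
    return [f.name for f in fields(scores) if getattr(scores, f.name) < minimum]


# Hypothetical scores for a single response.
scores = QualityScores(
    faithfulness=0.92,
    answer_relevancy=0.88,
    coherence=0.95,
    correctness=0.64,
    completeness=0.71,
)

weak = failing_metrics(scores)
print("below minimum:", weak)
```

Reporting the weak metrics by name keeps a well-structured but inaccurate answer from hiding behind a good average.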

Production systems must filter harmful content:

Not every metric matters for every use case:

Start simple: Pick 2-3 metrics that matter most. Add more as you mature.

In the next lesson, we'll explore evaluation-driven development, the practice of putting evaluation at the center of your workflow.