Introduction to LLMOps

Key Metrics for LLM Quality

How do you know if your LLM application is "good"? Quality isn't a single number—it's a balance of multiple dimensions.

The Four Pillars of LLM Quality

| Pillar | What It Measures | Example Metric |
| --- | --- | --- |
| Latency | How fast responses are | P95 response time < 2s |
| Cost | How much it costs per request | Cost per 1K requests < $5 |
| Quality | How good the outputs are | Faithfulness score > 0.85 |
| Safety | How safe the outputs are | No harmful content |

Latency Metrics

Users expect fast responses. Track these percentiles:

  • P50 (median): Half of requests are faster than this
  • P95: 95% of requests are faster than this
  • P99: 99% of requests are faster than this

Example threshold: P95 < 3 seconds for chatbot responses
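
As a rough illustration of the percentiles above, here is a minimal sketch that computes P50, P95, and P99 from a list of recorded response times and checks P95 against the example 3-second threshold. It assumes the nearest-rank percentile definition (other definitions interpolate and give slightly different values), and the function names and sample latencies are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are at or below it. pct is an integer in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_report(latencies_s, p95_threshold_s=3.0):
    """Summarize latencies (seconds) and flag whether P95 meets the threshold."""
    report = {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}
    report["p95_ok"] = report["p95"] < p95_threshold_s
    return report


# Hypothetical sample of response times, in seconds.
sample = [0.8, 1.1, 0.9, 1.4, 2.7, 1.0, 1.2, 0.7, 3.5, 1.3]
print(latency_report(sample))
```

With only ten samples, a single slow request dominates P95 and P99. In practice these percentiles are computed over many more requests per time window.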

Cost Metrics

Every LLM call has a cost based on tokens:

  • Input tokens: What you send to the model
  • Output tokens: What the model generates
  • Cost per request: (input_tokens × input_price) + (output_tokens × output_price)
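
The cost formula above can be sketched in a few lines. This example assumes prices are quoted in USD per 1 million tokens; the `Pricing` class and the prices themselves are hypothetical placeholders, not any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices, in USD per 1 million tokens."""
    input_per_1m: float
    output_per_1m: float


def cost_per_request(input_tokens, output_tokens, pricing):
    """(input_tokens x input_price) + (output_tokens x output_price), in USD."""
    total = input_tokens * pricing.input_per_1m + output_tokens * pricing.output_per_1m
    return total / 1_000_000


# Placeholder prices for illustration only.
pricing = Pricing(input_per_1m=3.0, output_per_1m=15.0)
cost = cost_per_request(input_tokens=1200, output_tokens=300, pricing=pricing)
print(f"${cost:.4f} per request, ${cost * 1000:.2f} per 1K requests")
```

Multiplying by 1,000 makes the result directly comparable to a "cost per 1K requests" target.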

Track:

  • Daily/weekly spend: Are we on budget?
  • Cost per user action: How much does each chat turn cost?
  • Cost anomalies: Sudden spikes in token usage
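
One simple way to surface cost anomalies is to compare each day's token usage with a rolling baseline. The sketch below flags a day when usage exceeds a multiple of the median of the preceding days. The window size, the ratio, and the usage numbers are all assumptions chosen for illustration, and production systems often use more robust anomaly detection.

```python
from statistics import median


def find_spikes(daily_tokens, window=7, ratio=2.0):
    """Return indices of days whose token usage exceeds `ratio` times
    the median usage of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_tokens)):
        baseline = median(daily_tokens[i - window:i])
        if baseline > 0 and daily_tokens[i] > ratio * baseline:
            spikes.append(i)
    return spikes


# Hypothetical daily token totals; day index 8 is an obvious spike.
usage = [100_000, 110_000, 95_000, 105_000, 98_000,
         102_000, 99_000, 101_000, 260_000, 104_000]
print(find_spikes(usage))
```

Using the median rather than the mean keeps one earlier spike from inflating the baseline for the days that follow.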

Quality Metrics

Quality is multidimensional. Common metrics include:

| Metric | What It Measures |
| --- | --- |
| Faithfulness | Does the response stick to provided context? |
| Answer Relevancy | Does the response address the question? |
| Coherence | Is the response logically structured? |
| Correctness | Is the information accurate? |
| Completeness | Does it cover all aspects of the question? |

Safety Metrics

Production systems must filter harmful content:

  • Toxicity detection: Hate speech, harassment, violence
  • PII detection: Personally identifiable information leakage
  • Prompt injection detection: Attempts to manipulate the system
  • Refusal rate: How often the model appropriately refuses unsafe requests
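
To make one of these checks concrete, here is a minimal sketch of PII detection using regular expressions. It covers only email addresses and one US-style phone number format, so it will miss many real cases. The patterns and function names are assumptions for illustration, and production systems usually rely on dedicated PII detection tooling.

```python
import re

# Deliberately simple, illustrative patterns; not exhaustive.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text):
    """Return {pattern_name: [matches]} for every pattern that matched."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits


def pii_leak_rate(responses):
    """Fraction of responses that contain at least one PII match."""
    if not responses:
        return 0.0
    flagged = sum(1 for response in responses if find_pii(response))
    return flagged / len(responses)


# Hypothetical model outputs.
responses = [
    "Your order has shipped.",
    "Contact jane.doe@example.com or 555-123-4567.",
    "Please restart the router.",
    "The refund was processed.",
]
print(find_pii(responses[1]))
print(pii_leak_rate(responses))
```

Tracking the leak rate over time turns a per-response filter into a metric that can be monitored alongside latency and cost.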

Choosing Your Metrics

Not every metric matters for every use case:

| Use Case | Priority Metrics |
| --- | --- |
| Customer support chatbot | Latency, Relevancy, Safety |
| Code generation assistant | Correctness, Cost |
| RAG document Q&A | Faithfulness, Completeness |
| Creative writing helper | Coherence, Relevancy |

Start simple: Pick 2-3 metrics that matter most. Add more as you mature.
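
Starting with a handful of metrics could look like the sketch below: a small threshold table and a check that reports which metrics pass. The thresholds reuse the example values from the Four Pillars table. The metric names, the `check_metrics` function, and the measured values are hypothetical.

```python
import operator

# Example thresholds taken from the Four Pillars table; names are illustrative.
THRESHOLDS = {
    "p95_latency_s": (operator.lt, 2.0),    # P95 response time < 2s
    "cost_per_1k_usd": (operator.lt, 5.0),  # Cost per 1K requests < $5
    "faithfulness": (operator.gt, 0.85),    # Faithfulness score > 0.85
}


def check_metrics(measured, thresholds=THRESHOLDS):
    """Return {metric: passed}. A metric with no measurement counts as failed."""
    results = {}
    for name, (compare, limit) in thresholds.items():
        value = measured.get(name)
        results[name] = value is not None and compare(value, limit)
    return results


# Hypothetical measurements from one evaluation run.
measured = {"p95_latency_s": 1.7, "cost_per_1k_usd": 6.2, "faithfulness": 0.91}
results = check_metrics(measured)
print(results)
print("all passed:", all(results.values()))
```

Adding a metric later means adding one entry to the threshold table, which keeps the check easy to grow as the system matures.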

In the next lesson, we'll explore evaluation-driven development: the practice of putting evaluation at the center of your workflow.
