Key Metrics for LLM Quality

How do you know if your LLM application is "good"? Quality isn't a single number—it's a balance of multiple dimensions.

The Four Pillars of LLM Quality

Pillar  | What It Measures              | Example Metric
Latency | How fast responses are        | P95 response time < 2s
Cost    | How much it costs per request | Cost per 1K requests < $5
Quality | How good the outputs are      | Faithfulness score > 0.85
Safety  | How safe the outputs are      | No harmful content

Latency Metrics

Users expect fast responses. Track these percentiles:

  • P50 (median): Half of requests are faster than this
  • P95: 95% of requests are faster than this
  • P99: 99% of requests are faster than this

Example threshold: P95 < 3 seconds for chatbot responses
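
To make this concrete, here is a minimal sketch (using NumPy, with made-up sample latencies) of computing these percentiles and checking a P95 budget:

```python
import numpy as np

def latency_percentiles(latencies_ms):
    """Summarize request latencies (milliseconds) at the standard percentiles."""
    return {
        "p50": float(np.percentile(latencies_ms, 50)),
        "p95": float(np.percentile(latencies_ms, 95)),
        "p99": float(np.percentile(latencies_ms, 99)),
    }

# Made-up samples; flag a violation of the P95 < 3s threshold
stats = latency_percentiles([820, 1100, 950, 2700, 3400, 1300, 990])
if stats["p95"] > 3000:
    print(f"P95 latency {stats['p95']:.0f} ms exceeds the 3s budget")
```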

Cost Metrics

Every LLM call has a cost based on tokens:

  • Input tokens: What you send to the model
  • Output tokens: What the model generates
  • Cost per request: (input_tokens × input price) + (output_tokens × output price), where prices are typically quoted per 1K or 1M tokens (see the sketch below)
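
A minimal sketch of that formula in Python; the per-token prices here are placeholders, since real rates vary by model and provider:

```python
# Hypothetical prices; real per-token rates vary by model and provider.
INPUT_PRICE_PER_1K = 0.0005   # dollars per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.0015  # dollars per 1K output tokens

def cost_per_request(input_tokens, output_tokens):
    """The formula above: tokens in each direction times their price."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# A typical chat turn: 1,200 prompt tokens in, 300 completion tokens out
print(f"${cost_per_request(1200, 300):.6f} per request")
```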

Track:

  • Daily/weekly spend: Are we on budget?
  • Cost per user action: How much does each chat turn cost?
  • Cost anomalies: Sudden spikes in token usage (a simple detector is sketched below)
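
One simple way to catch anomalies, sketched below with made-up numbers, is to flag a day whose spend sits far above the mean of recent history. Real monitoring systems are more sophisticated; this is just the idea:

```python
from statistics import mean, stdev

def is_cost_anomaly(daily_spend, threshold_sigmas=3.0):
    """Flag the latest day if it sits more than N standard deviations
    above the mean of the preceding days. A deliberately simple heuristic."""
    history, today = daily_spend[:-1], daily_spend[-1]
    if len(history) < 7:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    return today > mu + threshold_sigmas * sigma

spend = [41.2, 39.8, 44.1, 40.5, 42.0, 43.3, 40.9, 97.6]
print(is_cost_anomaly(spend))  # True: the last day is a clear spike
```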

Quality Metrics

Quality is multidimensional. Common metrics include:

Metric           | What It Measures
Faithfulness     | Does the response stick to provided context?
Answer Relevancy | Does the response address the question?
Coherence        | Is the response logically structured?
Correctness      | Is the information accurate?
Completeness     | Does it cover all aspects of the question?
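
Many of these metrics are scored with an LLM-as-judge: a separate model grades the output against a rubric. Below is a minimal sketch of a faithfulness check, where `call_llm` is a hypothetical stand-in for whatever provider client you use:

```python
JUDGE_PROMPT = """Given the context and the answer, rate from 0.0 to 1.0 how
faithful the answer is to the context (1.0 = fully supported). Reply with
only the number.

Context: {context}
Answer: {answer}"""

def faithfulness_score(context, answer, call_llm):
    """LLM-as-judge: ask a separate model to grade groundedness.
    `call_llm` is a hypothetical str -> str function wrapping your provider."""
    reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return float(reply.strip())

# Gate a response against the threshold from the pillars table:
# score = faithfulness_score(retrieved_docs, model_answer, call_llm)
# passed = score > 0.85
```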

Safety Metrics

Production systems must filter harmful content:

  • Toxicity detection: Hate speech, harassment, violence
  • PII detection: Leakage of personally identifiable information
  • Prompt injection detection: Attempts to manipulate the system
  • Refusal rate: How often the model appropriately refuses
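
As a toy illustration of PII detection, the sketch below flags obviously regex-able patterns. Production systems rely on dedicated classifiers; a regex pass like this only catches the easy cases:

```python
import re

# Toy patterns: real PII detection needs trained classifiers, not regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def pii_findings(text):
    """Return any substrings that look like PII, keyed by pattern name."""
    return {name: pat.findall(text) for name, pat in PII_PATTERNS.items()
            if pat.findall(text)}

print(pii_findings("Contact me at jane.doe@example.com or 555-123-4567"))
# {'email': ['jane.doe@example.com'], 'us_phone': ['555-123-4567']}
```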

Choosing Your Metrics

Not every metric matters for every use case:

Use Case                  | Priority Metrics
Customer support chatbot  | Latency, Relevancy, Safety
Code generation assistant | Correctness, Cost
RAG document Q&A          | Faithfulness, Completeness
Creative writing helper   | Coherence, Relevancy

Start simple: Pick 2-3 metrics that matter most. Add more as you mature.
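
In practice, picking 2-3 metrics means writing them down as explicit thresholds your releases must clear. A minimal sketch, assuming a RAG Q&A use case and made-up numbers:

```python
# Hypothetical starting point for a RAG Q&A app: two quality gates plus latency.
THRESHOLDS = {
    "faithfulness": 0.85,    # judged score, higher is better
    "completeness": 0.80,
    "p95_latency_ms": 3000,  # lower is better
}

def passes_gates(measured):
    """Release gate: each 'higher is better' metric must clear its floor,
    and latency must stay under its ceiling."""
    return (measured["faithfulness"] >= THRESHOLDS["faithfulness"]
            and measured["completeness"] >= THRESHOLDS["completeness"]
            and measured["p95_latency_ms"] <= THRESHOLDS["p95_latency_ms"])

print(passes_gates({"faithfulness": 0.91, "completeness": 0.84,
                    "p95_latency_ms": 2400}))  # True
```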

In the next lesson, we'll explore evaluation-driven development: the practice of putting evaluation at the center of your workflow.
