Introduction to LLMOps
Evaluation-Driven Development
3 min read
Traditional software development has unit tests. LLM development has evaluations. The principle is the same: define what "correct" means before you build.
The Problem with "Vibes-Based" Development
Many teams develop LLM applications like this:
- Write a prompt
- Try it a few times
- "It looks good to me!"
- Deploy to production
- Hope for the best
This approach fails because:
- LLMs are non-deterministic—the same prompt can give different results
- Edge cases are hard to spot with manual testing
- Quality degrades silently over time
- No way to compare versions objectively
Evaluation-First Approach
Evaluation-driven development flips the workflow:
- Define success criteria - What does a "good" response look like?
- Build test cases - Create examples of inputs and expected outputs
- Create evaluators - Write code or prompts that score responses
- Baseline your current state - Run evaluations before making changes
- Iterate with confidence - Make changes and measure the impact
# Example: Define what we're measuring before we build
evaluation_criteria = {
"faithfulness": "Response must only use information from context",
"relevancy": "Response must directly answer the question",
"format": "Response must be under 100 words"
}
What Unit Tests Are to Code, Evaluations Are to LLMs
| Software Development | LLM Development |
|---|---|
| Unit tests | Single-turn evaluations |
| Integration tests | Multi-turn evaluations |
| End-to-end tests | Full workflow evaluations |
| Test coverage | Dataset coverage |
| CI/CD gates | Evaluation thresholds |
Building Your First Evaluation Dataset
Start with 10-20 high-quality examples:
- Happy path cases: Normal, expected queries
- Edge cases: Unusual but valid queries
- Adversarial cases: Tricky inputs that might confuse the model
- Failure cases: Queries the model should refuse or handle carefully
Pro tip: Add real production failures to your dataset. Every bug becomes a test case.
Continuous Evaluation
Don't just evaluate before deployment:
- On every code change: Run evaluations in CI/CD
- On schedule: Daily or weekly regression checks
- On production traffic: Sample and evaluate real requests
- On model updates: Re-evaluate when upgrading models
The Payoff
Teams using evaluation-driven development report:
- Faster iteration: Confidence to make changes quickly
- Fewer production issues: Catching problems before deployment
- Better collaboration: Shared understanding of quality standards
- Measurable progress: Clear metrics showing improvement over time
In the next module, we'll dive deep into LLM evaluation fundamentals—the techniques and patterns that make this approach work. :::