CI/CD for LLM Quality
Integrate LLM evaluation into your CI/CD pipeline to catch quality regressions before they reach production.
Why CI/CD for LLM Quality?
| Traditional CI | LLM CI |
|---|---|
| Unit tests | Evaluation suites |
| Code coverage | Quality metrics |
| Linting | Prompt validation |
| Build checks | Model endpoint health |
GitHub Actions Example
Run evaluations on every PR:
# .github/workflows/llm-evaluation.yml
name: LLM Quality Check
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install langsmith mlflow weave openai
      - name: Run evaluation suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          python scripts/run_evaluations.py
      - name: Check quality thresholds
        run: |
          python scripts/check_thresholds.py
Evaluation Script
# scripts/run_evaluations.py
import json

from langsmith.evaluation import evaluate

from my_app import my_llm_function
from my_evaluators import accuracy_scorer, helpfulness_scorer


def run_ci_evaluation():
    """Run the evaluation suite for CI."""
    results = evaluate(
        my_llm_function,
        data="ci-test-dataset",
        evaluators=[accuracy_scorer, helpfulness_scorer],
        experiment_prefix="ci-check",
    )

    # Save results for threshold checking
    with open("eval_results.json", "w") as f:
        json.dump(results.metrics, f)

    return results


if __name__ == "__main__":
    run_ci_evaluation()
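The evaluators imported from `my_evaluators` are your own scoring functions. As a rough sketch (assuming a LangSmith-style evaluator that receives a run and its reference example, and that your app returns its text under an `output` key with the reference stored under `answer` — both are assumptions about your schema), an exact-match accuracy scorer could look like this; `helpfulness_scorer` would typically be an LLM-as-judge evaluator instead:

# my_evaluators.py (illustrative sketch, not a drop-in implementation)
def accuracy_scorer(run, example):
    """Score 1.0 when the model output matches the reference answer exactly."""
    prediction = str((run.outputs or {}).get("output", ""))
    reference = str((example.outputs or {}).get("answer", ""))
    score = 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
    return {"key": "accuracy", "score": score}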
Threshold Checking
# scripts/check_thresholds.py
import json
import sys

THRESHOLDS = {
    "accuracy": 0.85,
    "helpfulness": 0.80,
}


def check_thresholds():
    with open("eval_results.json") as f:
        results = json.load(f)

    failures = []
    for metric, threshold in THRESHOLDS.items():
        value = results.get(f"{metric}/mean", 0)
        if value < threshold:
            failures.append(f"{metric}: {value:.2f} < {threshold}")

    if failures:
        print("Quality check FAILED:")
        for failure in failures:
            print(f"  - {failure}")
        sys.exit(1)
    else:
        print("Quality check PASSED")
        print(f"  Accuracy: {results.get('accuracy/mean', 0):.2f}")
        print(f"  Helpfulness: {results.get('helpfulness/mean', 0):.2f}")


if __name__ == "__main__":
    check_thresholds()
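Since this script already holds the aggregated scores, it is also a natural place to publish them to the job summary that GitHub Actions renders for each run. A minimal sketch (the `write_step_summary` helper is hypothetical; `GITHUB_STEP_SUMMARY` is the summary file path GitHub Actions sets for each step):

import os

def write_step_summary(results, thresholds):
    """Append a small markdown report to the GitHub Actions job summary."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return  # running locally, not inside GitHub Actions
    rows = ["## LLM Quality Report", "", "| Metric | Score | Threshold |", "|---|---|---|"]
    for metric, threshold in thresholds.items():
        rows.append(f"| {metric} | {results.get(f'{metric}/mean', 0):.2f} | {threshold} |")
    with open(summary_path, "a") as f:
        f.write("\n".join(rows) + "\n")

Calling `write_step_summary(results, THRESHOLDS)` inside `check_thresholds()` gives reviewers the numbers without opening the logs.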
PR Comments
Add evaluation results to PR comments:
- name: Comment on PR
  uses: actions/github-script@v7
  with:
    script: |
      const results = require('./eval_results.json');
      const comment = `
      ## LLM Quality Report
      | Metric | Score | Threshold | Status |
      |--------|-------|-----------|--------|
      | Accuracy | ${results['accuracy/mean'].toFixed(2)} | 0.85 | ${results['accuracy/mean'] >= 0.85 ? '✅' : '❌'} |
      | Helpfulness | ${results['helpfulness/mean'].toFixed(2)} | 0.80 | ${results['helpfulness/mean'] >= 0.80 ? '✅' : '❌'} |
      `;
      await github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: comment
      });
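Note: with restricted default `GITHUB_TOKEN` permissions, the workflow may need an explicit `permissions:` block granting `pull-requests: write` (or `issues: write`) for `createComment` to succeed.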
Staged Rollouts
Use CI results to control deployments:
deploy:
  needs: evaluate
  if: success()
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to staging
      if: github.event_name == 'pull_request'
      run: ./deploy.sh staging
    - name: Deploy to production
      # Runs only when the workflow is triggered on main
      # (e.g. if you also add a push trigger to the workflow above).
      if: github.ref == 'refs/heads/main'
      run: ./deploy.sh production
Evaluation Datasets for CI
Maintain a CI-specific dataset:
| Dataset Type | Size | Purpose |
|---|---|---|
| CI quick | 20-50 examples | Fast feedback on PRs |
| CI full | 100-200 examples | Merge gate |
| Nightly | 500+ examples | Comprehensive check |
# Select dataset based on context
import os

if os.environ.get("CI_QUICK"):
    dataset = "ci-quick-dataset"
elif os.environ.get("CI_FULL"):
    dataset = "ci-full-dataset"
else:
    dataset = "nightly-dataset"
Best Practices
| Practice | Why |
|---|---|
| Keep CI fast | Use small datasets for PRs |
| Cache when possible | Don't re-evaluate unchanged code (see the sketch below) |
| Version datasets | Track what you tested against |
| Fail fast | Block merges on quality drops |
| Track trends | Store results for historical comparison |
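For the caching row, one lightweight option is to hash the files that can change model behavior and skip the evaluation suite when the hash is unchanged. A minimal sketch, reusing the `prompts/` and `src/llm/` paths from the workflow trigger and a hypothetical `.llm_eval_hash` marker file:

# Illustrative sketch: skip evaluation when nothing relevant changed.
import hashlib
from pathlib import Path

def content_hash(roots=("prompts", "src/llm")) -> str:
    """Hash every file that can affect evaluation results."""
    digest = hashlib.sha256()
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                digest.update(path.read_bytes())
    return digest.hexdigest()

def should_evaluate(cache_file=Path(".llm_eval_hash")) -> bool:
    """Return True when the hash differs from the last evaluated state."""
    current = content_hash()
    previous = cache_file.read_text().strip() if cache_file.exists() else ""
    cache_file.write_text(current)
    return current != previous

For this to pay off in CI, persist the marker file between runs, for example with `actions/cache`.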
Handling Flakiness
LLMs can produce variable outputs:
def run_stable_evaluation(model, dataset, evaluators, n_runs=3):
    """Run the evaluation multiple times and average the metrics."""
    all_results = []
    for _ in range(n_runs):
        results = evaluate(model, data=dataset, evaluators=evaluators)
        all_results.append(results.metrics)

    # Average each metric across runs
    avg_metrics = {}
    for key in all_results[0].keys():
        avg_metrics[key] = sum(r[key] for r in all_results) / n_runs
    return avg_metrics
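Averaging hides how noisy a metric is, so it can also help to report the spread across runs and avoid gating merges on metrics that vary widely. A small illustrative helper using only the standard library:

import statistics

def metric_spread(all_results, key):
    """Standard deviation of one metric across repeated evaluation runs."""
    values = [r[key] for r in all_results]
    return statistics.stdev(values) if len(values) > 1 else 0.0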
Tip: Start with a small, high-quality CI dataset. Expand it as you discover edge cases in production.
Finally, let's recap what you've learned and explore next steps in your LLMOps journey.