CI/CD for LLM Quality
Integrate LLM evaluation into your CI/CD pipeline to catch quality regressions before they reach production.
Why CI/CD for LLM Quality?
| Traditional CI | LLM CI |
|---|---|
| Unit tests | Evaluation suites |
| Code coverage | Quality metrics |
| Linting | Prompt validation |
| Build checks | Model endpoint health |
GitHub Actions Example
Run evaluations on every PR:
# .github/workflows/llm-evaluation.yml
name: LLM Quality Check
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install langsmith mlflow weave openai
      - name: Run evaluation suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          python scripts/run_evaluations.py
      - name: Check quality thresholds
        run: |
          python scripts/check_thresholds.py
Evaluation Script
# scripts/run_evaluations.py
import json

from langsmith.evaluation import evaluate

from my_app import my_llm_function
from my_evaluators import accuracy_scorer, helpfulness_scorer


def run_ci_evaluation():
    """Run the evaluation suite for CI."""
    results = evaluate(
        my_llm_function,
        data="ci-test-dataset",
        evaluators=[accuracy_scorer, helpfulness_scorer],
        experiment_prefix="ci-check",
    )

    # Save results for threshold checking
    with open("eval_results.json", "w") as f:
        json.dump(results.metrics, f)

    return results


if __name__ == "__main__":
    run_ci_evaluation()
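The evaluators imported from `my_evaluators` are your own scoring functions. As a rough sketch (assuming a LangSmith-style evaluator that receives a run and its reference example, and that your app returns its text under an `output` key with the reference stored under `answer` — both are assumptions about your schema), an exact-match accuracy scorer could look like this; `helpfulness_scorer` would typically be an LLM-as-judge evaluator instead:

# my_evaluators.py (illustrative sketch, not a drop-in implementation)
def accuracy_scorer(run, example):
    """Score 1.0 when the model output matches the reference answer exactly."""
    prediction = str((run.outputs or {}).get("output", ""))
    reference = str((example.outputs or {}).get("answer", ""))
    score = 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
    return {"key": "accuracy", "score": score}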
Threshold Checking
# scripts/check_thresholds.py
import json
import sys

THRESHOLDS = {
    "accuracy": 0.85,
    "helpfulness": 0.80,
}


def check_thresholds():
    with open("eval_results.json") as f:
        results = json.load(f)

    failures = []
    for metric, threshold in THRESHOLDS.items():
        value = results.get(f"{metric}/mean", 0)
        if value < threshold:
            failures.append(f"{metric}: {value:.2f} < {threshold}")

    if failures:
        print("Quality check FAILED:")
        for failure in failures:
            print(f"  - {failure}")
        sys.exit(1)
    else:
        print("Quality check PASSED")
        print(f"  Accuracy: {results.get('accuracy/mean', 0):.2f}")
        print(f"  Helpfulness: {results.get('helpfulness/mean', 0):.2f}")


if __name__ == "__main__":
    check_thresholds()
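Since this script already holds the aggregated scores, it is also a natural place to publish them to the job summary that GitHub Actions renders for each run. A minimal sketch (the `write_step_summary` helper is hypothetical; `GITHUB_STEP_SUMMARY` is the summary file path GitHub Actions sets for each step):

import os

def write_step_summary(results, thresholds):
    """Append a small markdown report to the GitHub Actions job summary."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return  # running locally, not inside GitHub Actions
    rows = ["## LLM Quality Report", "", "| Metric | Score | Threshold |", "|---|---|---|"]
    for metric, threshold in thresholds.items():
        rows.append(f"| {metric} | {results.get(f'{metric}/mean', 0):.2f} | {threshold} |")
    with open(summary_path, "a") as f:
        f.write("\n".join(rows) + "\n")

Calling `write_step_summary(results, THRESHOLDS)` inside `check_thresholds()` gives reviewers the numbers without opening the logs.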
PR Comments
Add evaluation results to PR comments:
- name: Comment on PR
  uses: actions/github-script@v7
  with:
    script: |
      const results = require('./eval_results.json');
      const comment = `
      ## LLM Quality Report
      | Metric | Score | Threshold | Status |
      |--------|-------|-----------|--------|
      | Accuracy | ${results['accuracy/mean'].toFixed(2)} | 0.85 | ${results['accuracy/mean'] >= 0.85 ? '✅' : '❌'} |
      | Helpfulness | ${results['helpfulness/mean'].toFixed(2)} | 0.80 | ${results['helpfulness/mean'] >= 0.80 ? '✅' : '❌'} |
      `;
      await github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: comment
      });
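Note: with restricted default `GITHUB_TOKEN` permissions, the workflow may need an explicit `permissions:` block granting `pull-requests: write` (or `issues: write`) for `createComment` to succeed.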
Staged Rollouts
Use CI results to control deployments:
deploy:
  needs: evaluate
  if: success()
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to staging
      if: github.event_name == 'pull_request'
      run: ./deploy.sh staging
    - name: Deploy to production
      # Runs only when the workflow is triggered on main
      # (e.g. if you also add a push trigger to the workflow above).
      if: github.ref == 'refs/heads/main'
      run: ./deploy.sh production
Evaluation Datasets for CI
Maintain a CI-specific dataset:
| Dataset Type | Size | Purpose |
|---|---|---|
| CI quick | 20-50 examples | Fast feedback on PRs |
| CI full | 100-200 examples | Merge gate |
| Nightly | 500+ examples | Comprehensive check |
# Select dataset based on context
import os

if os.environ.get("CI_QUICK"):
    dataset = "ci-quick-dataset"
elif os.environ.get("CI_FULL"):
    dataset = "ci-full-dataset"
else:
    dataset = "nightly-dataset"
Best Practices
| Practice | Why |
|---|---|
| Keep CI fast | Use small datasets for PRs |
| Cache when possible | Don't re-evaluate unchanged code (see the sketch below) |
| Version datasets | Track what you tested against |
| Fail fast | Block merges on quality drops |
| Track trends | Store results for historical comparison |
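For the caching row, one lightweight option is to hash the files that can change model behavior and skip the evaluation suite when the hash is unchanged. A minimal sketch, reusing the `prompts/` and `src/llm/` paths from the workflow trigger and a hypothetical `.llm_eval_hash` marker file:

# Illustrative sketch: skip evaluation when nothing relevant changed.
import hashlib
from pathlib import Path

def content_hash(roots=("prompts", "src/llm")) -> str:
    """Hash every file that can affect evaluation results."""
    digest = hashlib.sha256()
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                digest.update(path.read_bytes())
    return digest.hexdigest()

def should_evaluate(cache_file=Path(".llm_eval_hash")) -> bool:
    """Return True when the hash differs from the last evaluated state."""
    current = content_hash()
    previous = cache_file.read_text().strip() if cache_file.exists() else ""
    cache_file.write_text(current)
    return current != previous

For this to pay off in CI, persist the marker file between runs, for example with `actions/cache`.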
Handling Flakiness
LLMs can produce variable outputs:
def run_stable_evaluation(model, dataset, evaluators, n_runs=3):
    """Run the evaluation multiple times and average the metrics."""
    all_results = []
    for _ in range(n_runs):
        results = evaluate(model, data=dataset, evaluators=evaluators)
        all_results.append(results.metrics)

    # Average each metric across runs
    avg_metrics = {}
    for key in all_results[0].keys():
        avg_metrics[key] = sum(r[key] for r in all_results) / n_runs
    return avg_metrics
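Averaging hides how noisy a metric is, so it can also help to report the spread across runs and avoid gating merges on metrics that vary widely. A small illustrative helper using only the standard library:

import statistics

def metric_spread(all_results, key):
    """Standard deviation of one metric across repeated evaluation runs."""
    values = [r[key] for r in all_results]
    return statistics.stdev(values) if len(values) > 1 else 0.0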
Tip: Start with a small, high-quality CI dataset. Expand it as you discover edge cases in production.
Finally, let's recap what you've learned and explore next steps in your LLMOps journey.