CI/CD for LLM Quality

Integrate LLM evaluation into your CI/CD pipeline to catch quality regressions before they reach production.

Why CI/CD for LLM Quality?

| Traditional CI | LLM CI |
|----------------|--------|
| Unit tests | Evaluation suites |
| Code coverage | Quality metrics |
| Linting | Prompt validation |
| Build checks | Model endpoint health |

GitHub Actions Example

Run evaluations on every PR:

# .github/workflows/llm-evaluation.yml
name: LLM Quality Check

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install langsmith mlflow weave openai

      - name: Run evaluation suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          python scripts/run_evaluations.py

      - name: Check quality thresholds
        run: |
          python scripts/check_thresholds.py

Evaluation Script

# scripts/run_evaluations.py
import json
from langsmith.evaluation import evaluate
from my_app import my_llm_function
from my_evaluators import accuracy_scorer, helpfulness_scorer

def run_ci_evaluation():
    """Run evaluation suite for CI."""
    results = evaluate(
        my_llm_function,
        data="ci-test-dataset",
        evaluators=[accuracy_scorer, helpfulness_scorer],
        experiment_prefix="ci-check"
    )

    # Save results for threshold checking
    with open("eval_results.json", "w") as f:
        json.dump(results.metrics, f)

    return results

if __name__ == "__main__":
    run_ci_evaluation()

Threshold Checking

# scripts/check_thresholds.py
import json
import sys

THRESHOLDS = {
    "accuracy": 0.85,
    "helpfulness": 0.80,
}

def check_thresholds():
    with open("eval_results.json") as f:
        results = json.load(f)

    failures = []
    for metric, threshold in THRESHOLDS.items():
        value = results.get(f"{metric}/mean", 0)
        if value < threshold:
            failures.append(f"{metric}: {value:.2f} < {threshold}")

    if failures:
        print("Quality check FAILED:")
        for f in failures:
            print(f"  - {f}")
        sys.exit(1)
    else:
        print("Quality check PASSED")
        print(f"  Accuracy: {results.get('accuracy/mean', 0):.2f}")
        print(f"  Helpfulness: {results.get('helpfulness/mean', 0):.2f}")

if __name__ == "__main__":
    check_thresholds()

PR Comments

Add evaluation results to PR comments:

- name: Comment on PR
  uses: actions/github-script@v7
  with:
    script: |
      const results = require('./eval_results.json');
      const comment = `
      ## LLM Quality Report

      | Metric | Score | Threshold | Status |
      |--------|-------|-----------|--------|
      | Accuracy | ${results['accuracy/mean'].toFixed(2)} | 0.85 | ${results['accuracy/mean'] >= 0.85 ? '✅' : '❌'} |
      | Helpfulness | ${results['helpfulness/mean'].toFixed(2)} | 0.80 | ${results['helpfulness/mean'] >= 0.80 ? '✅' : '❌'} |
      `;

      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: comment
      });
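
If the repository's default GITHUB_TOKEN is read-only, the comment call above will fail with a 403. One way to handle this is to grant write access at the job level; the exact scopes you need may vary, but a block like the following (an assumption about your setup) is a common starting point:

# Job-level permissions (place under the job that posts the comment)
permissions:
  pull-requests: write
  issues: write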

Staged Rollouts

Use CI results to control deployments:

deploy:
  needs: evaluate
  if: success()
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to staging
      if: github.event_name == 'pull_request'
      run: ./deploy.sh staging

    - name: Deploy to production
      if: github.ref == 'refs/heads/main'
      run: ./deploy.sh production
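
Note that the workflow at the top of this page only triggers on pull_request, so the production step above would never fire from it. If you want the same quality gate to guard production deploys, one option (a sketch, assuming you deploy from main) is to also trigger on pushes to main:

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
  push:
    branches: [main]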

Evaluation Datasets for CI

Maintain a CI-specific dataset:

| Dataset Type | Size | Purpose |
|--------------|------|---------|
| CI quick | 20-50 examples | Fast feedback on PRs |
| CI full | 100-200 examples | Merge gate |
| Nightly | 500+ examples | Comprehensive check |

# Select dataset based on context
import os

if os.environ.get("CI_QUICK"):
    dataset = "ci-quick-dataset"
elif os.environ.get("CI_FULL"):
    dataset = "ci-full-dataset"
else:
    dataset = "nightly-dataset"

Best Practices

| Practice | Why |
|----------|-----|
| Keep CI fast | Use small datasets for PRs |
| Cache when possible | Don't re-evaluate unchanged code (see the sketch below) |
| Version datasets | Track what you tested against |
| Fail fast | Block merges on quality drops |
| Track trends | Store results for historical comparison |
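
A minimal sketch of the caching idea: hash the inputs that affect evaluation (prompt files plus the dataset name) and skip the run when a result for that fingerprint already exists. The cache location, file layout, and helper names here are assumptions, not part of LangSmith or any other framework:

# scripts/cached_eval.py -- hypothetical caching wrapper
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".eval_cache")  # assumed location; persist it between runs

def eval_fingerprint(prompt_dir: str, dataset_name: str) -> str:
    """Hash prompt files and the dataset name to detect unchanged inputs."""
    h = hashlib.sha256(dataset_name.encode())
    for path in sorted(Path(prompt_dir).rglob("*")):
        if path.is_file():
            h.update(path.read_bytes())
    return h.hexdigest()

def run_with_cache(run_eval, prompt_dir="prompts", dataset_name="ci-test-dataset"):
    """Return cached metrics if inputs are unchanged, otherwise evaluate and cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{eval_fingerprint(prompt_dir, dataset_name)}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text())

    metrics = run_eval()  # any callable returning a metrics dict
    cache_file.write_text(json.dumps(metrics))
    return metrics

In CI, persist the .eval_cache directory between runs (for example with actions/cache, keyed on the same fingerprint) so PRs that don't touch prompts or datasets skip the model calls entirely.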

Handling Flakiness

LLMs can produce variable outputs:

# Average metrics over several runs to smooth out non-deterministic outputs
from langsmith.evaluation import evaluate

def run_stable_evaluation(model, dataset, evaluators, n_runs=3):
    """Run the evaluation suite multiple times and average the metrics."""
    all_results = []

    for _ in range(n_runs):
        results = evaluate(model, data=dataset, evaluators=evaluators)
        all_results.append(results.metrics)

    # Average each metric across runs
    avg_metrics = {}
    for key in all_results[0]:
        avg_metrics[key] = sum(r[key] for r in all_results) / n_runs

    return avg_metrics
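
For example, wired up with the same objects used in run_evaluations.py (three runs is an arbitrary cost/stability trade-off):

metrics = run_stable_evaluation(
    my_llm_function,
    "ci-test-dataset",
    [accuracy_scorer, helpfulness_scorer],
    n_runs=3,
)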

Tip: Start with a small, high-quality CI dataset. Expand it as you discover edge cases in production.

Finally, let's recap what you've learned and explore next steps in your LLMOps journey.
