CI/CD for LLM Quality

Integrate LLM evaluation into your CI/CD pipeline to catch quality regressions before they reach production.

Why CI/CD for LLM Quality?

| Traditional CI | LLM CI |
|----------------|--------|
| Unit tests | Evaluation suites (sketch below) |
| Code coverage | Quality metrics |
| Linting | Prompt validation |
| Build checks | Model endpoint health |
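
To make the parallel concrete, here is a minimal sketch of the simplest possible evaluation suite: a pytest smoke test that treats a few known question/answer pairs the way unit tests treat fixed inputs. It assumes the same my_llm_function imported from my_app in the evaluation script below, and that it takes a question string and returns an answer string; the example cases are placeholders.

# test_llm_smoke.py -- illustrative sketch, not a specific framework's API
import pytest

from my_app import my_llm_function  # assumed to map a question string to an answer string

# A few hand-picked cases that must keep passing, analogous to unit tests
SMOKE_CASES = [
    ("What is the capital of France?", "paris"),
    ("What is 2 + 2?", "4"),
]

@pytest.mark.parametrize("question,expected_substring", SMOKE_CASES)
def test_answer_contains_expected_fact(question, expected_substring):
    answer = my_llm_function(question)
    assert expected_substring in answer.lower()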

GitHub Actions Example

Run evaluations on every pull request that touches prompts or LLM code:

# .github/workflows/llm-evaluation.yml
name: LLM Quality Check

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install langsmith mlflow weave openai

      - name: Run evaluation suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          python scripts/run_evaluations.py

      - name: Check quality thresholds
        run: |
          python scripts/check_thresholds.py

Evaluation Script

# scripts/run_evaluations.py
import json
from langsmith.evaluation import evaluate
from my_app import my_llm_function
from my_evaluators import accuracy_scorer, helpfulness_scorer

def run_ci_evaluation():
    """Run evaluation suite for CI."""
    results = evaluate(
        my_llm_function,
        data="ci-test-dataset",
        evaluators=[accuracy_scorer, helpfulness_scorer],
        experiment_prefix="ci-check"
    )

    # Save results for threshold checking
    with open("eval_results.json", "w") as f:
        json.dump(results.metrics, f)

    return results

if __name__ == "__main__":
    run_ci_evaluation()

Threshold Checking

# scripts/check_thresholds.py
import json
import sys

THRESHOLDS = {
    "accuracy": 0.85,
    "helpfulness": 0.80,
}

def check_thresholds():
    with open("eval_results.json") as f:
        results = json.load(f)

    failures = []
    for metric, threshold in THRESHOLDS.items():
        value = results.get(f"{metric}/mean", 0)
        if value < threshold:
            failures.append(f"{metric}: {value:.2f} < {threshold}")

    if failures:
        print("Quality check FAILED:")
        for f in failures:
            print(f"  - {f}")
        sys.exit(1)
    else:
        print("Quality check PASSED")
        print(f"  Accuracy: {results.get('accuracy/mean', 0):.2f}")
        print(f"  Helpfulness: {results.get('helpfulness/mean', 0):.2f}")

if __name__ == "__main__":
    check_thresholds()
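
Beyond a pass/fail gate, the same eval_results.json can feed a simple history log so regressions show up as trends rather than single-run surprises (see "Track trends" under Best Practices below). A minimal sketch; the metrics_history.jsonl path and the choice to key runs by commit SHA are illustrative, not prescribed.

# scripts/record_history.py
import json
import os
from datetime import datetime, timezone

# Stored as a CI artifact or committed to a dedicated metrics branch (illustrative)
HISTORY_FILE = "metrics_history.jsonl"

def record_run():
    with open("eval_results.json") as f:
        results = json.load(f)

    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": os.environ.get("GITHUB_SHA", "unknown"),
        "metrics": results,
    }

    # One JSON object per line keeps the history easy to append, diff, and plot
    with open(HISTORY_FILE, "a") as f:
        f.write(json.dumps(entry) + "\n")

if __name__ == "__main__":
    record_run()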

PR Comments

Add evaluation results to PR comments:

- name: Comment on PR
  uses: actions/github-script@v7
  with:
    script: |
      const results = JSON.parse(require('fs').readFileSync('eval_results.json', 'utf8'));
      const comment = `
      ## LLM Quality Report

      | Metric | Score | Threshold | Status |
      |--------|-------|-----------|--------|
      | Accuracy | ${results['accuracy/mean'].toFixed(2)} | 0.85 | ${results['accuracy/mean'] >= 0.85 ? '✅' : '❌'} |
      | Helpfulness | ${results['helpfulness/mean'].toFixed(2)} | 0.80 | ${results['helpfulness/mean'] >= 0.80 ? '✅' : '❌'} |
      `;

      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: comment
      });

Staged Rollouts

Use CI results to control deployments:

deploy:
  needs: evaluate
  if: success()
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to staging
      if: github.event_name == 'pull_request'
      run: ./deploy.sh staging

    - name: Deploy to production
      if: github.ref == 'refs/heads/main'
      run: ./deploy.sh production

Evaluation Datasets for CI

Maintain CI-specific datasets sized for each context:

| Dataset Type | Size | Purpose |
|--------------|------|---------|
| CI quick | 20-50 examples | Fast feedback on PRs |
| CI full | 100-200 examples | Merge gate |
| Nightly | 500+ examples | Comprehensive check |

# Select dataset based on context
import os

if os.environ.get("CI_QUICK"):
    dataset = "ci-quick-dataset"
elif os.environ.get("CI_FULL"):
    dataset = "ci-full-dataset"
else:
    dataset = "nightly-dataset"

Best Practices

| Practice | Why |
|----------|-----|
| Keep CI fast | Use small datasets for PRs |
| Cache when possible | Don't re-evaluate unchanged code (sketch below) |
| Version datasets | Track what you tested against |
| Fail fast | Block merges on quality drops |
| Track trends | Store results for historical comparison (see the history-log sketch above) |
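
"Cache when possible" can be as simple as fingerprinting the prompt files and skipping the evaluation run when nothing relevant changed. A minimal sketch, assuming prompts live under prompts/ and the previous fingerprint is restored from a CI cache at .eval_cache/prompts.sha256 (both paths are illustrative):

# scripts/should_run_eval.py
import hashlib
from pathlib import Path

# Restored from and saved to the CI cache between runs (illustrative path)
CACHE_FILE = Path(".eval_cache/prompts.sha256")

def prompts_fingerprint() -> str:
    """Hash every file under prompts/ so any change invalidates the cache."""
    digest = hashlib.sha256()
    for path in sorted(Path("prompts").rglob("*")):
        if path.is_file():
            digest.update(path.read_bytes())
    return digest.hexdigest()

def main():
    current = prompts_fingerprint()
    previous = CACHE_FILE.read_text().strip() if CACHE_FILE.exists() else ""

    if current == previous:
        print("unchanged")
        return

    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(current)
    print("changed")

if __name__ == "__main__":
    main()

The workflow can capture this script's output and make the evaluation step conditional on it, so unchanged prompts reuse the previous results instead of re-running the suite.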

Handling Flakiness

LLM outputs are non-deterministic, so a single evaluation run can land on either side of a threshold. Averaging over several runs makes the CI signal more stable:

from langsmith.evaluation import evaluate
from my_app import my_llm_function
from my_evaluators import accuracy_scorer, helpfulness_scorer

def run_stable_evaluation(dataset, n_runs=3):
    """Run the evaluation suite multiple times and average the metrics."""
    all_results = []

    for _ in range(n_runs):
        results = evaluate(
            my_llm_function,
            data=dataset,  # e.g. the CI dataset selected above
            evaluators=[accuracy_scorer, helpfulness_scorer],
        )
        all_results.append(results.metrics)

    # Average each metric across runs to smooth out run-to-run variance
    avg_metrics = {}
    for key in all_results[0].keys():
        avg_metrics[key] = sum(r[key] for r in all_results) / n_runs

    return avg_metrics
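
Building on the same idea, it can help to check the spread across runs as well as the mean, so CI can flag metrics that are too noisy to gate on reliably. A small sketch that takes the per-run metric dicts collected in run_stable_evaluation; the 0.05 cutoff is an arbitrary placeholder:

import statistics

def flag_noisy_metrics(all_results, max_stdev=0.05):
    """Return metrics whose run-to-run standard deviation exceeds the cutoff."""
    noisy = {}
    for key in all_results[0].keys():
        values = [r[key] for r in all_results]
        spread = statistics.stdev(values)
        if spread > max_stdev:
            noisy[key] = spread
    return noisy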

Tip: Start with a small, high-quality CI dataset. Expand it as you discover edge cases in production.

Finally, let's recap what you've learned and explore next steps in your LLMOps journey.
