Promptfoo Tutorial: Test LLM Prompts in CI (2026)

May 22, 2026

Promptfoo is an open-source CLI that lets you test LLM prompts the way unit tests check code: you declare what good output looks like, run those checks against your prompts and models, and catch regressions before they ship. This tutorial builds a working eval and a GitHub Actions quality gate from scratch.

TL;DR

This hands-on tutorial shows how to test LLM prompts with automated, repeatable checks instead of eyeballing outputs. You will build a promptfoo eval for a support-ticket classifier prompt, add deterministic and model-graded assertions, run it locally, deliberately break a prompt to watch the regression get caught, and wire a GitHub Actions gate that fails the build when prompt quality drops. You need promptfoo 0.121.12 [1], an OpenAI API key, and about 20 minutes. Every command and config block is copy-paste runnable and was verified against the promptfoo docs as of 22 May 2026.

What you'll learn

  • Set up a promptfoo project and write a promptfooconfig.yaml from scratch
  • Write deterministic assertions, including JSON-schema validation with is-json
  • Add a model-graded llm-rubric assertion for output quality you can't check with a regex
  • Run an eval and compare two prompt versions side by side
  • Catch a real prompt regression before it reaches production
  • Gate a GitHub Actions CI pipeline on prompt quality

Prerequisites

  • Node.js 20.20.0+ or 22.22.0+: promptfoo's engines field requires ^20.20.0 || >=22.22.0, so Node 21.x falls outside the range. [1] Check with node --version.
  • An OpenAI API key: this tutorial uses OpenAI models. Promptfoo supports 60+ providers, so you can swap in Anthropic, Google, or a local Ollama model later. [2]
  • A terminal and a GitHub repository for the CI step.
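
If you want to check the Node.js requirement from a script rather than by eye, the range quoted above can be sketched as follows. This is a hand-rolled check for that one range only; the function names are illustrative, and a real project would more likely lean on a semver library.

```javascript
// Simplified check for the engines range "^20.20.0 || >=22.22.0".
// Hand-rolled for this one range; pre-release suffixes are ignored.
function parseVersion(version) {
  // Accepts "v22.22.0" or "22.22.0".
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) throw new Error(`Unrecognized version: ${version}`);
  return match.slice(1).map(Number);
}

// Returns -1, 0, or 1, comparing [major, minor, patch] triples.
function compareVersions(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

function satisfiesPromptfooEngines(version) {
  const v = parseVersion(version);
  // ^20.20.0 means >=20.20.0 and <21.0.0.
  const inNode20Range = v[0] === 20 && compareVersions(v, [20, 20, 0]) >= 0;
  const atLeastNode22 = compareVersions(v, [22, 22, 0]) >= 0;
  return inNode20Range || atLeastNode22;
}

console.log(process.version, satisfiesPromptfooEngines(process.version));
```

Note that a 21.x release fails the check even though it is newer than 20.20.0, because the caret range stops at the next major version.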

You do not need to install promptfoo globally: every command uses npx promptfoo@latest, which fetches and runs the current published release on demand. [2]

A quick note on the project: promptfoo was founded in 2024, and OpenAI announced its acquisition of Promptfoo in March 2026.3 The tool remains open source and MIT licensed.

Step 1 — Set up the project

Create a project directory with a prompts/ folder and export your API key:

mkdir prompt-eval && cd prompt-eval
mkdir prompts
export OPENAI_API_KEY="sk-your-key-here"
node --version   # must be v20.20.0+ or v22.22.0+

Promptfoo reads OPENAI_API_KEY from the environment, so there is nothing else to authenticate.2 You can also run npx promptfoo@latest init for an interactive config walkthrough, but writing the config by hand teaches you the structure faster — which is what the next step does.

Step 2 — Write your first promptfooconfig.yaml

Promptfoo evaluations live in a single YAML file. Create prompts/classifier_v1.txt with the prompt you want to test — a support-ticket classifier that must return machine-readable JSON:

You are a support-ticket classifier. Read the customer message and respond with
ONLY a JSON object, with no prose and no markdown fences, in exactly this shape:
{"category": "<billing|technical|account|other>", "priority": "<low|normal|high|urgent>"}

Customer message:
{{ticket}}

The {{ticket}} placeholder is a promptfoo variable — double curly braces mark values that get substituted per test case.2 Now create promptfooconfig.yaml in the project root:

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Support-ticket classifier eval

prompts:
  - file://prompts/classifier_v1.txt

providers:
  - openai:chat:gpt-5.4-mini

tests:
  - vars:
      ticket: 'I was charged twice for my subscription this month.'
  - vars:
      ticket: 'The mobile app crashes every time I open the reports tab.'
  - vars:
      ticket: 'How do I change the email address on my account?'

Three keys do the work here. prompts points at the prompt file, providers names the model (openai:chat:gpt-5.4-mini is promptfoo's provider string for OpenAI's fast, lower-cost GPT-5.4 model [2]), and tests lists the inputs. Run it:

npx promptfoo@latest eval

Promptfoo calls the model once per test case and prints a results table — one row per test, one column per prompt. There are no assertions yet, so every row just shows the raw output for you to eyeball. That is the "looks good to me" stage. The next steps replace it with checks that run themselves.
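
To picture what happens before each model call, here is a minimal sketch of the variable substitution and the test-by-prompt grid described above. It is a simplified stand-in, not promptfoo's code: render and expandMatrix are hypothetical names, and promptfoo's real templating supports far more than plain variables.

```javascript
// Simplified stand-in for template rendering: replaces {{ name }} with vars[name].
// Unknown variables are left untouched so the gap stays visible.
function render(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (placeholder, name) =>
    Object.hasOwn(vars, name) ? String(vars[name]) : placeholder
  );
}

// One rendered prompt per (test, prompt) pair: rows are tests, columns are prompts.
function expandMatrix(prompts, tests) {
  return tests.map((test) => prompts.map((prompt) => render(prompt, test.vars)));
}

const prompts = ["Customer message:\n{{ticket}}"];
const tests = [
  { vars: { ticket: "I was charged twice for my subscription this month." } },
  { vars: { ticket: "How do I change the email address on my account?" } },
];

const matrix = expandMatrix(prompts, tests);
console.log(matrix[0][0]);
```

With one prompt and two tests the grid has two rows and one column; adding a second prompt file adds a column without touching the tests.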

Step 3 — Add deterministic assertions

Deterministic assertions are programmatic checks — no LLM involved — that pass or fail the same way every run.4 For a classifier, the contract is strict: the output must be valid JSON with the right fields and allowed values. Replace your promptfooconfig.yaml with this version:

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Support-ticket classifier eval

prompts:
  - file://prompts/classifier_v1.txt

providers:
  - openai:chat:gpt-5.4-mini

defaultTest:
  assert:
    - type: is-json
      value:
        type: object
        required: [category, priority]
        additionalProperties: false
        properties:
          category:
            type: string
            enum: [billing, technical, account, other]
          priority:
            type: string
            enum: [low, normal, high, urgent]
    - type: cost
      threshold: 0.01

tests:
  - vars:
      ticket: 'I was charged twice for my subscription this month.'
    assert:
      - type: javascript
        value: '(() => { try { return JSON.parse(output).category === "billing"; } catch { return false; } })()'
  - vars:
      ticket: 'The mobile app crashes every time I open the reports tab.'
    assert:
      - type: javascript
        value: '(() => { try { return JSON.parse(output).category === "technical"; } catch { return false; } })()'
  - vars:
      ticket: 'How do I change the email address on my account?'
    assert:
      - type: javascript
        value: '(() => { try { return JSON.parse(output).category === "account"; } catch { return false; } })()'

defaultTest applies its assertions to every test case.4 The is-json assertion does double duty: it confirms the output parses as JSON and validates it against the JSON Schema in value — so a wrong category enum or a missing priority field fails automatically.4 The cost assertion fails the test if a single inference costs more than one cent in USD, a cheap guardrail against an accidental model swap blowing your budget.4
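
To make the contract concrete, the sketch below hand-writes the same rules the schema above expresses: parse, require both fields, reject extras, and enforce the enums. It is an illustration for this one schema only, not how promptfoo validates; checkClassifierOutput is a hypothetical name.

```javascript
const CATEGORIES = ["billing", "technical", "account", "other"];
const PRIORITIES = ["low", "normal", "high", "urgent"];

// Returns a list of problems; an empty list means the output honours the contract.
function checkClassifierOutput(output) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return ["output is not valid JSON"];
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return ["output is not a JSON object"];
  }

  const problems = [];
  // required: [category, priority]
  for (const key of ["category", "priority"]) {
    if (!Object.hasOwn(parsed, key)) problems.push(`missing required field: ${key}`);
  }
  // additionalProperties: false
  for (const key of Object.keys(parsed)) {
    if (key !== "category" && key !== "priority") problems.push(`unexpected field: ${key}`);
  }
  // enum checks (a non-string value also fails, since it is not in the list)
  if (Object.hasOwn(parsed, "category") && !CATEGORIES.includes(parsed.category)) {
    problems.push("category is not an allowed value");
  }
  if (Object.hasOwn(parsed, "priority") && !PRIORITIES.includes(parsed.priority)) {
    problems.push("priority is not an allowed value");
  }
  return problems;
}

console.log(checkClassifierOutput('{"category": "sales", "priority": "high"}'));
```

Declaring the schema in YAML keeps these rules in one reviewable place, which is the main reason to prefer is-json over hand-written checks like this one.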

Each test then adds a javascript assertion that checks the specific category that ticket should produce. The check is wrapped in try/catch so a malformed, unparseable output fails cleanly instead of throwing — a habit worth keeping in any assertion that calls JSON.parse. Run npx promptfoo@latest eval again — now every row reports a real pass or fail instead of a wall of text.
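
The value of the try/catch is easier to see side by side. The sketch below contrasts an unguarded check with the guarded form used in the config; both function names are illustrative.

```javascript
// Unguarded: throws a SyntaxError as soon as the model returns anything but JSON.
function categoryUnguarded(output, expected) {
  return JSON.parse(output).category === expected;
}

// Guarded: unparseable output counts as a failed check instead of a crash.
function categoryMatches(output, expected) {
  try {
    return JSON.parse(output).category === expected;
  } catch {
    return false;
  }
}

const cleanOutput = '{"category": "billing", "priority": "high"}';
const proseOutput = 'This one is for billing. {"category": "billing", "priority": "high"}';

console.log(categoryMatches(cleanOutput, "billing")); // true
console.log(categoryMatches(proseOutput, "billing")); // false
```

The guarded version also returns false for output such as the literal null, where JSON.parse succeeds but the property access throws.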

Step 4 — Add a model-graded assertion (llm-rubric)

Some output quality cannot be checked with a regex. "Is the assigned priority sensible?" is a judgement call. That is what model-graded assertions are for: they send the output to an LLM judge and grade it against a rubric you write in plain English.4 Add one more assertion to the defaultTest.assert list:

    - type: llm-rubric
      value: >-
        The JSON assigns a priority proportionate to the message: billing
        errors and crashes should be high or urgent; general how-to questions
        should be low or normal.

The llm-rubric type is promptfoo's LLM-as-a-judge check. [5] Model-graded assertions need an LLM provider to do the grading. Promptfoo picks the grader from whichever provider key it finds in your environment: with OPENAI_API_KEY set it grades with an OpenAI model, so your existing key already covers it. You can point the grader elsewhere with an assertion-level provider field. [5] Use deterministic assertions wherever you can, because they are free, instant, and never flaky, and reserve llm-rubric for the genuinely subjective checks.
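
As an illustration of where the line between the two kinds of check falls, here is a deliberately coarse deterministic approximation of the rubric above. The rule table is an assumption made for this sketch (it maps whole categories to priorities, which the rubric does not strictly say), so treat it as a cheap pre-check rather than a replacement for the judge.

```javascript
// Hypothetical rule table: which priorities count as "proportionate" per category.
// This is coarser than the rubric, which talks about billing errors and crashes.
const EXPECTED_PRIORITIES = {
  billing: ["high", "urgent"],
  technical: ["high", "urgent"],
  account: ["low", "normal"],
};

function priorityLooksProportionate(output) {
  try {
    const { category, priority } = JSON.parse(output);
    // Categories without a rule (such as "other") are left to the LLM judge.
    if (!Object.hasOwn(EXPECTED_PRIORITIES, category)) return true;
    return EXPECTED_PRIORITIES[category].includes(priority);
  } catch {
    return false; // unparseable output cannot be proportionate
  }
}

console.log(priorityLooksProportionate('{"category": "billing", "priority": "low"}')); // false
```

A table like this catches the obvious mismatches for free; the nuanced cases, such as a technical question that is not a crash, are where the model-graded rubric earns its cost.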

Step 5 — Run the eval and read the results

Run the full eval:

npx promptfoo@latest eval

Promptfoo prints a summary with pass/fail counts and a table of every test. To explore results interactively — filter failures, diff outputs, read the grader's reasoning — open the web viewer:

npx promptfoo@latest view

This opens a local result viewer in your browser. Promptfoo caches successful API responses on disk (default ~/.promptfoo/cache, with a 14-day TTL), so re-running the same eval is fast and free; error responses are never cached.6 If you want machine-readable output for scripting, write a JSON file:

npx promptfoo@latest eval -o results.json

The JSON contains a results.stats object with successes, failures, and errors counts — the numbers your CI gate will read in Step 7.7
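
If you prefer to read those counts from a script, a minimal sketch might look like this. The sample object only mimics the three fields named above with made-up numbers, and treating the total as the sum of the three counts is an assumption of this sketch.

```javascript
// Summarizes the stats block of a parsed results file.
// Assumes successes, failures, and errors are separate, non-overlapping counts.
function summarizeStats(report) {
  const { successes, failures, errors } = report.results.stats;
  const total = successes + failures + errors;
  return {
    total,
    failures,
    errors,
    passRate: total === 0 ? 0 : successes / total,
  };
}

// Hypothetical sample; in practice this would be JSON.parse of results.json.
const sampleReport = { results: { stats: { successes: 3, failures: 3, errors: 0 } } };

console.log(summarizeStats(sampleReport));
```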

Step 6 — Catch a prompt regression

Here is the scenario the whole tutorial builds toward. A teammate decides the classifier prompt feels "robotic" and rewrites it to be friendlier. Create prompts/classifier_v2.txt:

You are a friendly support assistant. Read the customer message below, briefly
explain in a sentence which team should handle it, and then provide a JSON
object with the category and priority.

Customer message:
{{ticket}}

It reads better to a human — and it quietly breaks the contract, because the output is now a sentence of prose followed by JSON instead of pure JSON. Add it alongside the original so promptfoo evaluates both:

prompts:
  - file://prompts/classifier_v1.txt
  - file://prompts/classifier_v2.txt

Run npx promptfoo@latest eval once more. Promptfoo shows the two prompts side by side, one column each. The v1 column stays green. The v2 column fails is-json on every row, because the full output is no longer valid JSON, and the javascript checks fail too, because the prose-wrapped output will not parse. The regression is caught in seconds, rather than in production when a downstream parser starts throwing. [4]

This is the core value of promptfoo: a prompt change you cannot eyeball your way through gets an objective, repeatable verdict. If you only need to find JSON somewhere in a noisier output, the contains-json assertion is the looser alternative to is-json.4
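
The difference between the strict and the loose check can be sketched in a few lines. This is a naive illustration of the idea, not promptfoo's implementation: isJson requires the whole output to parse, while containsJsonObject (a hypothetical helper) accepts any parseable {...} span inside it.

```javascript
// Strict: the entire output must parse as JSON.
function isJson(output) {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
}

// Loose: some substring from a "{" to a later "}" must parse as JSON.
// Naive quadratic scan, fine for short model outputs.
function containsJsonObject(output) {
  for (let start = output.indexOf("{"); start !== -1; start = output.indexOf("{", start + 1)) {
    for (let end = output.lastIndexOf("}"); end > start; end = output.lastIndexOf("}", end - 1)) {
      if (isJson(output.slice(start, end + 1))) return true;
    }
  }
  return false;
}

const v2StyleOutput =
  'This looks like a billing issue for the finance team.\n{"category": "billing", "priority": "high"}';

console.log(isJson(v2StyleOutput)); // false
console.log(containsJsonObject(v2StyleOutput)); // true
```

For a classifier whose output feeds a parser directly, the strict check is the right contract; the loose one only makes sense when something downstream extracts the JSON first.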

Step 7 — Gate CI with GitHub Actions

The last step makes the eval mandatory: a CI check that fails the build when prompt quality drops. Save this as .github/workflows/prompt-eval.yml:

name: LLM Prompt Eval

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - uses: actions/setup-node@v6
        with:
          node-version: '22'

      - name: Cache promptfoo responses
        uses: actions/cache@v5
        with:
          path: ~/.cache/promptfoo
          key: promptfoo-${{ hashFiles('prompts/**', 'promptfooconfig.yaml') }}
          restore-keys: promptfoo-

      - name: Run prompt eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
        run: npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

      - name: Quality gate
        run: |
          FAILURES=$(jq '.results.stats.failures' results.json)
          if [ "$FAILURES" -gt 0 ]; then
            echo "Prompt eval failed: $FAILURES test(s) did not pass."
            exit 1
          fi
          echo "All prompt tests passed."

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v7
        with:
          name: promptfoo-results
          path: results.json

The workflow triggers only when a prompt file or the config changes. It runs the eval, writes results.json, and the "Quality gate" step reads results.stats.failures with jq: any failure exits non-zero and the check goes red, blocking the merge.7 The cache step reuses promptfoo's response cache across runs to keep the workflow fast and cheap. Add your key under the repo's Settings → Secrets and variables → Actions as OPENAI_API_KEY.8

There is also an official promptfoo/promptfoo-action@v1 action. It runs a before/after comparison of edited prompts and posts the results as a pull-request comment, which is excellent for review.8 It is a complement, not a replacement: the comment shows reviewers what changed, while the jq gate above is what actually fails the build. Run both for the full picture.
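
For teams that would rather keep the gate logic in a script than in shell, the decision the jq step makes can be sketched as a small function. The failOnErrors option and the fail-closed handling of a missing failures field are assumptions added for this sketch; the workflow above checks failures only.

```javascript
// Decides the exit code for a CI gate from the stats block of results.json.
// Returns 1 (block the merge) or 0 (allow it).
function gateExitCode(stats, { failOnErrors = false } = {}) {
  const failures = stats?.failures;
  // Fail closed: a missing or malformed count should not let a merge through.
  if (!Number.isInteger(failures)) return 1;
  if (failures > 0) return 1;
  // Optional, hypothetical extension: also block when tests errored out.
  if (failOnErrors && Number(stats.errors) > 0) return 1;
  return 0;
}

console.log(gateExitCode({ successes: 3, failures: 3, errors: 0 })); // 1
console.log(gateExitCode({ successes: 3, failures: 0, errors: 0 })); // 0
```

Failing closed matters because an eval that crashed before writing stats would otherwise look the same as one with nothing to report.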

Verification

Confirm the whole pipeline works locally before relying on CI:

npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
jq '.results.stats' results.json

You should see a results.stats object where failures is greater than zero: promptfoo runs all three tests against both prompts, and the broken v2 prompt fails is-json every time. That non-zero failures count is exactly what the CI gate keys on. Now check the gate logic itself:

FAILURES=$(jq '.results.stats.failures' results.json); echo "exit would be: $([ "$FAILURES" -gt 0 ] && echo 1 || echo 0)"

With the broken v2 prompt present, this prints exit would be: 1, which shows the CI step would block the merge. Delete classifier_v2.txt, drop it from the prompts list, re-run the eval, and the same command prints exit would be: 0.

Troubleshooting

  • No OpenAI API key / provider auth error. Promptfoo reads OPENAI_API_KEY from the environment. Export it in the same shell you run the eval in, as shown in Step 1, and in CI set it as a repository secret — never commit it.
  • npx fails or promptfoo refuses to start. Promptfoo requires Node.js 20.20.0+ or 22.22.0+. [1] Run node --version; if you are on an older release, upgrade before retrying.
  • Want fresh model output even though the inputs are unchanged. Promptfoo caches successful responses on disk for 14 days, keyed on the prompt, provider, provider config, and variables — so editing any of those triggers a fresh call automatically.6 To force fresh calls with otherwise-identical inputs (for example, to resample a non-deterministic model), run npx promptfoo@latest eval --no-cache, or wipe the cache with npx promptfoo@latest cache clear.
  • is-json fails on output that "looks like JSON." The whole output must be valid JSON for is-json to pass. A model that adds a sentence of prose or wraps the JSON in a Markdown code fence fails it — by design. Use contains-json if you only need valid JSON somewhere in the response.4
  • HTTP 429 rate-limit errors on a large suite. Lower concurrency with -j 1 (for example npx promptfoo@latest eval -j 1). Promptfoo already retries 429s with exponential backoff, but fewer parallel calls reduce the pressure. [6]
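
The caching behaviour in the list above is easier to reason about with a toy model. The sketch below is not promptfoo's cache format: cacheKey and isFresh are hypothetical, and the key is simply a canonical string built from the four inputs the docs say the cache depends on, with the 14-day TTL applied on lookup.

```javascript
// Serializes a value with object keys sorted, so equal inputs give equal strings.
function canonicalize(value) {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Toy cache key: any change to prompt, provider, config, or vars changes the key.
function cacheKey({ prompt, provider, config, vars }) {
  return canonicalize({ prompt, provider, config, vars });
}

const TTL_MS = 14 * 24 * 60 * 60 * 1000; // 14 days

// An entry is reusable only while it is younger than the TTL.
function isFresh(storedAtMs, nowMs) {
  return nowMs - storedAtMs < TTL_MS;
}

const base = {
  prompt: "Customer message:\n{{ticket}}",
  provider: "openai:chat:gpt-5.4-mini",
  config: { temperature: 0 },
  vars: { ticket: "I was charged twice for my subscription this month." },
};
const edited = { ...base, vars: { ticket: "The mobile app crashes on launch." } };

console.log(cacheKey(base) === cacheKey(edited)); // false
```

In this model, re-running an unchanged eval hits the same key and reuses the stored response, which is why resampling identical inputs needs --no-cache.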

Next steps

You can now test LLM prompts automatically — a suite that runs locally and blocks bad merges in CI. From here:

  • Add a second provider and let promptfoo grade both columns at once — a fast way to A/B a model migration. Routing several providers through one endpoint pairs well with an LLM gateway like LiteLLM Proxy.
  • Point a provider at a local model to test offline; see our guide to running local AI with Ollama.
  • For retrieval apps, promptfoo's context-faithfulness and answer-relevance assertions grade grounding and relevance — useful alongside a robust RAG system.

Treat prompts like code: version them, test them, and never let an "improvement" reach production untested.

Footnotes

  1. promptfoo npm package: version 0.121.12 (latest dist-tag) and engines.node: "^20.20.0 || >=22.22.0", verified via npm view promptfoo on 2026-05-22. https://www.npmjs.com/package/promptfoo

  2. "Getting started," Promptfoo docs (last updated 2026-05-21): config structure, provider strings (openai:chat:gpt-5.4-mini), {{variable}} syntax, npx promptfoo@latest commands, 60+ providers. https://www.promptfoo.dev/docs/getting-started/

  3. "OpenAI to acquire Promptfoo," OpenAI, announced 2026-03-09. Promptfoo remains open source and MIT licensed. https://openai.com/index/openai-to-acquire-promptfoo/

  4. "Assertions & metrics," Promptfoo docs (last updated 2026-05-19): defaultTest, is-json with JSON-schema validation, cost, javascript, contains-json, and the llm-rubric model-graded assertion. https://www.promptfoo.dev/docs/configuration/expected-outputs/

  5. "LLM Rubric," Promptfoo docs (last updated 2026-05-12): llm-rubric is the LLM-as-a-judge grader; the grading model is chosen from the available provider key (gpt-5 when an OpenAI key is set) and can be overridden with the assertion provider field. https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/

  6. "Caching," Promptfoo docs (last updated 2026-05-18): disk cache default ~/.promptfoo/cache, 14-day TTL, --no-cache, error responses not cached, 429 backoff. https://www.promptfoo.dev/docs/configuration/caching/

  7. "CI/CD Integration for LLM Evaluation and Security," Promptfoo docs (last updated 2026-05-01): JSON output schema (results.stats.failures) and the jq-based quality gate. https://www.promptfoo.dev/docs/integrations/ci-cd/

  8. "Testing Prompts with GitHub Actions," Promptfoo docs (last updated 2026-05-18): the official promptfoo/promptfoo-action@v1 action and its before/after PR comment. https://www.promptfoo.dev/docs/integrations/github-action/

