Promptfoo Tutorial: Test LLM Prompts in CI (2026)
May 22, 2026
Promptfoo is an open-source CLI that lets you test LLM prompts the way unit tests check code: you declare what good output looks like, run those checks against your prompts and models, and catch regressions before they ship. This tutorial builds a working eval and a GitHub Actions quality gate from scratch.
TL;DR
This hands-on tutorial shows how to test LLM prompts with automated, repeatable checks instead of eyeballing outputs. You will build a promptfoo eval for a support-ticket classifier prompt, add deterministic and model-graded assertions, run it locally, deliberately break a prompt to watch the regression get caught, and wire a GitHub Actions gate that fails the build when prompt quality drops. It uses promptfoo 0.121.12 [1] and an OpenAI API key, and takes about 20 minutes. Every command and config block is copy-paste runnable and verified against the promptfoo docs as of 22 May 2026.
What you'll learn
- Set up a promptfoo project and write a promptfooconfig.yaml from scratch
- Write deterministic assertions, including JSON-schema validation with is-json
- Add a model-graded llm-rubric assertion for output quality you can't check with a regex
- Run an eval and compare two prompt versions side by side
- Catch a real prompt regression before it reaches production
- Gate a GitHub Actions CI pipeline on prompt quality
Prerequisites
- Node.js 20.20.0+ or 22.22.0+: promptfoo's engines field requires ^20.20.0 || >=22.22.0 [1]. Check with node --version.
- An OpenAI API key: this tutorial uses OpenAI models. Promptfoo supports 60+ providers, so you can swap in Anthropic, Google, or a local Ollama model later [2].
- A terminal and a GitHub repository for the CI step.
You do not need to install promptfoo globally: every command uses npx promptfoo@latest, which runs the current published release without a global install [2].
A quick note on the project: promptfoo was founded in 2024, and OpenAI announced its acquisition of Promptfoo in March 2026 [3]. The tool remains open source and MIT licensed.
Step 1 — Set up the project
Create a project directory with a prompts/ folder and export your API key:
mkdir prompt-eval && cd prompt-eval
mkdir prompts
export OPENAI_API_KEY="sk-your-key-here"
node --version # must be v20.20.0+ or v22.22.0+
Promptfoo reads OPENAI_API_KEY from the environment, so there is nothing else to authenticate [2]. You can also run npx promptfoo@latest init for an interactive config walkthrough, but writing the config by hand teaches you the structure faster, which is what the next step does.
Step 2 — Write your first promptfooconfig.yaml
Promptfoo evaluations live in a single YAML file. Create prompts/classifier_v1.txt with the prompt you want to test — a support-ticket classifier that must return machine-readable JSON:
You are a support-ticket classifier. Read the customer message and respond with
ONLY a JSON object, with no prose and no markdown fences, in exactly this shape:
{"category": "<billing|technical|account|other>", "priority": "<low|normal|high|urgent>"}
Customer message:
{{ticket}}
The {{ticket}} placeholder is a promptfoo variable: double curly braces mark values that get substituted per test case [2]. Now create promptfooconfig.yaml in the project root:
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Support-ticket classifier eval
prompts:
  - file://prompts/classifier_v1.txt
providers:
  - openai:chat:gpt-5.4-mini
tests:
  - vars:
      ticket: 'I was charged twice for my subscription this month.'
  - vars:
      ticket: 'The mobile app crashes every time I open the reports tab.'
  - vars:
      ticket: 'How do I change the email address on my account?'
Three keys do the work here. prompts points at the prompt file, providers names the model (openai:chat:gpt-5.4-mini is promptfoo's provider string for OpenAI's fast, lower-cost GPT-5.4 model [2]), and tests lists the inputs. Run it:
npx promptfoo@latest eval
Promptfoo calls the model once per test case and prints a results table — one row per test, one column per prompt. There are no assertions yet, so every row just shows the raw output for you to eyeball. That is the "looks good to me" stage. The next steps replace it with checks that run themselves.
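To picture how one prompt file and three test cases turn into three model calls, here is a minimal sketch of the {{ticket}} substitution in plain JavaScript. It is an illustration under assumptions: renderPrompt is a hypothetical helper, and the regex is a simplified stand-in for promptfoo's real templating, which is based on Nunjucks and supports far more than plain variable lookup.

```javascript
// Hypothetical helper for illustration; promptfoo itself renders prompts with Nunjucks.
// Replaces each {{name}} with vars[name]; unknown names are left untouched.
function renderPrompt(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

const template = "Customer message:\n{{ticket}}";
const tests = [
  { vars: { ticket: "I was charged twice for my subscription this month." } },
  { vars: { ticket: "How do I change the email address on my account?" } },
];

// One rendered prompt, and therefore one model call, per test case.
const rendered = tests.map((test) => renderPrompt(template, test.vars));
console.log(rendered[0]);
```

Each entry in rendered corresponds to one row of the results table for this prompt.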
Step 3 — Add deterministic assertions
Deterministic assertions are programmatic checks — no LLM involved — that pass or fail the same way every run.4 For a classifier, the contract is strict: the output must be valid JSON with the right fields and allowed values. Replace your promptfooconfig.yaml with this version:
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Support-ticket classifier eval
prompts:
  - file://prompts/classifier_v1.txt
providers:
  - openai:chat:gpt-5.4-mini
defaultTest:
  assert:
    - type: is-json
      value:
        type: object
        required: [category, priority]
        additionalProperties: false
        properties:
          category:
            type: string
            enum: [billing, technical, account, other]
          priority:
            type: string
            enum: [low, normal, high, urgent]
    - type: cost
      threshold: 0.01
tests:
  - vars:
      ticket: 'I was charged twice for my subscription this month.'
    assert:
      - type: javascript
        value: '(() => { try { return JSON.parse(output).category === "billing"; } catch { return false; } })()'
  - vars:
      ticket: 'The mobile app crashes every time I open the reports tab.'
    assert:
      - type: javascript
        value: '(() => { try { return JSON.parse(output).category === "technical"; } catch { return false; } })()'
  - vars:
      ticket: 'How do I change the email address on my account?'
    assert:
      - type: javascript
        value: '(() => { try { return JSON.parse(output).category === "account"; } catch { return false; } })()'
defaultTest applies its assertions to every test case [4]. The is-json assertion does double duty: it confirms the output parses as JSON and validates it against the JSON Schema in value, so a wrong category enum or a missing priority field fails automatically [4]. The cost assertion fails the test if a single inference costs more than one cent in USD, a cheap guardrail against an accidental model swap blowing your budget [4].
Each test then adds a javascript assertion that checks the specific category that ticket should produce. The check is wrapped in try/catch so a malformed, unparseable output fails cleanly instead of throwing — a habit worth keeping in any assertion that calls JSON.parse. Run npx promptfoo@latest eval again — now every row reports a real pass or fail instead of a wall of text.
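To try the two layers of checks outside promptfoo, the sketch below restates them as standalone JavaScript functions you can run with Node. The names matchesContract and categoryIs are hypothetical, and the hand-written checks only approximate the JSON Schema from the config; inside an eval, promptfoo's is-json assertion does the real schema validation.

```javascript
// Hypothetical helpers for illustration; not part of promptfoo.
const CATEGORIES = ["billing", "technical", "account", "other"];
const PRIORITIES = ["low", "normal", "high", "urgent"];

// Approximates the schema: an object with exactly the keys category and
// priority, each holding one of the allowed enum values.
function matchesContract(output) {
  try {
    const parsed = JSON.parse(output);
    if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) return false;
    const keys = Object.keys(parsed).sort();
    if (keys.length !== 2 || keys[0] !== "category" || keys[1] !== "priority") return false;
    return CATEGORIES.includes(parsed.category) && PRIORITIES.includes(parsed.priority);
  } catch {
    return false; // unparseable output fails cleanly instead of throwing
  }
}

// Mirrors the per-test javascript assertion from the config.
function categoryIs(output, expected) {
  try {
    return JSON.parse(output).category === expected;
  } catch {
    return false;
  }
}

console.log(matchesContract('{"category":"billing","priority":"high"}')); // true
console.log(matchesContract('{"category":"sales","priority":"high"}'));   // false
console.log(categoryIs('Sure! {"category":"billing"}', "billing"));       // false
```

The last call returns false because the leading prose makes JSON.parse throw, which is the failure mode the try/catch wrapper exists to absorb.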
Step 4 — Add a model-graded assertion (llm-rubric)
Some output quality cannot be checked with a regex. "Is the assigned priority sensible?" is a judgement call. That is what model-graded assertions are for: they send the output to an LLM judge and grade it against a rubric you write in plain English [4]. Add one more assertion to the defaultTest.assert list:
    - type: llm-rubric
      value: >-
        The JSON assigns a priority proportionate to the message: billing
        errors and crashes should be high or urgent; general how-to questions
        should be low or normal.
The llm-rubric type is promptfoo's LLM-as-a-judge check [5]. Model-graded assertions need an LLM provider to do the grading. Promptfoo picks the grader from whichever provider key it finds in your environment (with OPENAI_API_KEY set it grades with an OpenAI model, so your existing key already covers it), and you can point the grader elsewhere with an assertion-level provider field [5]. Use deterministic assertions wherever you can, since they are free, instant, and never flaky, and reserve llm-rubric for the genuinely subjective checks.
Step 5 — Run the eval and read the results
Run the full eval:
npx promptfoo@latest eval
Promptfoo prints a summary with pass/fail counts and a table of every test. To explore results interactively — filter failures, diff outputs, read the grader's reasoning — open the web viewer:
npx promptfoo@latest view
This opens a local result viewer in your browser. Promptfoo caches successful API responses on disk (default ~/.promptfoo/cache, with a 14-day TTL), so re-running the same eval is fast and free; error responses are never cached [6]. If you want machine-readable output for scripting, write a JSON file:
npx promptfoo@latest eval -o results.json
The JSON contains a results.stats object with successes, failures, and errors counts: the numbers your CI gate will read in Step 7 [7].
Step 6 — Catch a prompt regression
Here is the scenario the whole tutorial builds toward. A teammate decides the classifier prompt feels "robotic" and rewrites it to be friendlier. Create prompts/classifier_v2.txt:
You are a friendly support assistant. Read the customer message below, briefly
explain in a sentence which team should handle it, and then provide a JSON
object with the category and priority.
Customer message:
{{ticket}}
It reads better to a human — and it quietly breaks the contract, because the output is now a sentence of prose followed by JSON instead of pure JSON. Add it alongside the original so promptfoo evaluates both:
prompts:
  - file://prompts/classifier_v1.txt
  - file://prompts/classifier_v2.txt
Run npx promptfoo@latest eval once more. Promptfoo shows the two prompts side by side, one column each. The v1 column stays green. The v2 column fails is-json on every row, because the full output is no longer valid JSON, and the javascript checks fail too, because the prose-wrapped output will not parse. That is a regression caught in seconds instead of in a production incident when a downstream parser starts throwing [4].
This is the core value of promptfoo: a prompt change you cannot eyeball your way through gets an objective, repeatable verdict. If you only need to find JSON somewhere in a noisier output, the contains-json assertion is the looser alternative to is-json [4].
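The gap between the strict and the loose check is easy to see in a few lines of JavaScript. The sketch below is an illustration only, not promptfoo's implementation: isStrictJson and containsJsonObject are hypothetical names, and the brace scan is a deliberately naive stand-in for however contains-json actually locates embedded JSON.

```javascript
// Hypothetical helpers for illustration; not promptfoo's implementation.

// Strict: the whole output must parse as JSON.
function isStrictJson(output) {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
}

// Loose: some substring from a "{" to a later "}" parses as a JSON object.
// Naive scan over candidate substrings; fine for short model outputs.
function containsJsonObject(output) {
  for (let start = output.indexOf("{"); start !== -1; start = output.indexOf("{", start + 1)) {
    for (let end = output.indexOf("}", start); end !== -1; end = output.indexOf("}", end + 1)) {
      if (isStrictJson(output.slice(start, end + 1))) return true;
    }
  }
  return false;
}

const v1Output = '{"category":"billing","priority":"high"}';
const v2Output = 'This one goes to the billing team. {"category":"billing","priority":"high"}';

console.log(isStrictJson(v1Output), containsJsonObject(v1Output)); // true true
console.log(isStrictJson(v2Output), containsJsonObject(v2Output)); // false true
```

For a classifier whose output feeds a parser directly, the strict behavior is the one you want, which is why the tutorial keeps is-json in defaultTest.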
Step 7 — Gate CI with GitHub Actions
The last step makes the eval mandatory: a CI check that fails the build when prompt quality drops. Save this as .github/workflows/prompt-eval.yml:
name: LLM Prompt Eval
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-node@v6
        with:
          node-version: '22'
      - name: Cache promptfoo responses
        uses: actions/cache@v5
        with:
          path: ~/.cache/promptfoo
          key: promptfoo-${{ hashFiles('prompts/**', 'promptfooconfig.yaml') }}
          restore-keys: promptfoo-
      - name: Run prompt eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
        run: npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
      - name: Quality gate
        run: |
          FAILURES=$(jq '.results.stats.failures' results.json)
          if [ "$FAILURES" -gt 0 ]; then
            echo "Prompt eval failed: $FAILURES test(s) did not pass."
            exit 1
          fi
          echo "All prompt tests passed."
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v7
        with:
          name: promptfoo-results
          path: results.json
The workflow triggers only when a prompt file or the config changes. It runs the eval, writes results.json, and the "Quality gate" step reads results.stats.failures with jq: any failure exits non-zero and the check goes red, blocking the merge [7]. The cache step reuses promptfoo's response cache across runs to keep the workflow fast and cheap. Add your key under the repo's Settings → Secrets and variables → Actions as OPENAI_API_KEY [8].
There is also an official promptfoo/promptfoo-action@v1 action. It runs a before/after comparison of edited prompts and posts the results as a pull-request comment, which is excellent for review [8]. It is a complement, not a replacement: the comment shows reviewers what changed, while the jq gate above is what actually fails the build. Run both for the full picture.
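If you would rather keep the gate in Node than depend on jq, the same decision can be sketched as a small function. This is one hedged way to write it, not a promptfoo API: gateExitCode is a hypothetical helper, the results.stats shape is the one described in this tutorial, and counting errors alongside failures (and failing closed when stats is missing) are assumptions of this sketch that make it stricter than the jq step.

```javascript
// Hypothetical Node version of the jq quality gate; not part of promptfoo.
// Returns the exit code the CI step should use: 0 to pass, 1 to block.
function gateExitCode(results) {
  const stats = results?.results?.stats;
  // Fail closed: a missing or malformed stats object blocks the merge.
  if (!stats || typeof stats.failures !== "number") return 1;
  // Assumption of this sketch: provider errors also block the merge.
  const errors = typeof stats.errors === "number" ? stats.errors : 0;
  return stats.failures + errors > 0 ? 1 : 0;
}

// Sample objects in the results.stats shape described above.
const passing = { results: { stats: { successes: 3, failures: 0, errors: 0 } } };
const failing = { results: { stats: { successes: 3, failures: 3, errors: 0 } } };

console.log(gateExitCode(passing)); // 0
console.log(gateExitCode(failing)); // 1

// In CI you would read the real file instead of the samples, for example:
//   const fs = require("node:fs");
//   process.exit(gateExitCode(JSON.parse(fs.readFileSync("results.json", "utf8"))));
```

Drop the errors term if you want behavior identical to the jq step, which looks at failures only.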
Verification
Confirm the whole pipeline works locally before relying on CI:
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
jq '.results.stats' results.json
You should see a results.stats object where failures is greater than zero: promptfoo runs all three tests against both prompts, and the broken v2 prompt fails is-json every time. That non-zero failures count is exactly what the CI gate keys on. Now check the gate logic itself:
FAILURES=$(jq '.results.stats.failures' results.json); echo "exit would be: $([ "$FAILURES" -gt 0 ] && echo 1 || echo 0)"
With the broken v2 prompt present, this prints exit would be: 1 — proof the CI step would block the merge. Delete classifier_v2.txt, drop it from the prompts list, re-run, and the same command prints 0.
Troubleshooting
- No OpenAI API key / provider auth error. Promptfoo reads OPENAI_API_KEY from the environment. Export it in the same shell you run the eval in, as shown in Step 1, and in CI set it as a repository secret. Never commit it.
- npx fails or promptfoo refuses to start. Promptfoo requires Node.js 20.20.0+ or 22.22.0+ [1]. Run node --version; if you are on an older release, upgrade before retrying.
- Want fresh model output even though the inputs are unchanged. Promptfoo caches successful responses on disk for 14 days, keyed on the prompt, provider, provider config, and variables, so editing any of those triggers a fresh call automatically [6]. To force fresh calls with otherwise-identical inputs (for example, to resample a non-deterministic model), run npx promptfoo@latest eval --no-cache, or wipe the cache with npx promptfoo@latest cache clear.
- is-json fails on output that "looks like JSON." The whole output must be valid JSON for is-json to pass. A model that adds a sentence of prose or wraps the JSON in a Markdown code fence fails it, by design. Use contains-json if you only need valid JSON somewhere in the response [4].
- HTTP 429 rate-limit errors on a large suite. Lower concurrency with -j 1 (for example npx promptfoo@latest eval -j 1). Promptfoo already retries 429s with exponential backoff, but fewer parallel calls reduces the pressure [6].
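The caching behavior in the list above is easier to reason about with a toy model of the cache key. The sketch below is purely illustrative: cacheKey and stableStringify are hypothetical, the temperature setting is an invented example of provider config, and promptfoo's real key format is not documented in this tutorial. It only demonstrates the stated property that the key depends on the prompt, provider, provider config, and variables.

```javascript
// Hypothetical cache-key sketch; promptfoo's actual key format may differ.

// Serializes a value with object keys sorted, so key order does not matter.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(stableStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + stableStringify(value[key]));
    return "{" + entries.join(",") + "}";
  }
  // Primitives; undefined is normalized to null so the result is always a string.
  return JSON.stringify(value === undefined ? null : value);
}

// The key covers exactly the four inputs named in the troubleshooting entry.
function cacheKey({ prompt, provider, config, vars }) {
  return stableStringify({ prompt, provider, config, vars });
}

const base = {
  prompt: "Customer message:\n{{ticket}}",
  provider: "openai:chat:gpt-5.4-mini",
  config: { temperature: 0 }, // invented example config
  vars: { ticket: "I was charged twice." },
};

const sameInputs = cacheKey(base) === cacheKey({ ...base, vars: { ticket: "I was charged twice." } });
const editedPrompt = cacheKey(base) === cacheKey({ ...base, prompt: "Friendlier prompt:\n{{ticket}}" });

console.log(sameInputs);   // true: identical inputs would hit the cache
console.log(editedPrompt); // false: an edited prompt would trigger a fresh call
```

Under this model, resampling a non-deterministic model with unchanged inputs always hits the cache, which is why the --no-cache flag exists.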
Next steps
You can now test LLM prompts automatically — a suite that runs locally and blocks bad merges in CI. From here:
- Add a second provider and let promptfoo grade both columns at once — a fast way to A/B a model migration. Routing several providers through one endpoint pairs well with an LLM gateway like LiteLLM Proxy.
- Point a provider at a local model to test offline; see our guide to running local AI with Ollama.
- For retrieval apps, promptfoo's context-faithfulness and answer-relevance assertions grade grounding and relevance, which is useful alongside a robust RAG system.
Treat prompts like code: version them, test them, and never let an "improvement" reach production untested.
Footnotes
[1] promptfoo npm package: version 0.121.12 (latest dist-tag) and engines.node: "^20.20.0 || >=22.22.0", verified via npm view promptfoo on 2026-05-22. https://www.npmjs.com/package/promptfoo
[2] "Getting started," Promptfoo docs (last updated 2026-05-21): config structure, provider strings (openai:chat:gpt-5.4-mini), {{variable}} syntax, npx promptfoo@latest commands, 60+ providers. https://www.promptfoo.dev/docs/getting-started/
[3] "OpenAI to acquire Promptfoo," OpenAI, announced 2026-03-09. Promptfoo remains open source and MIT licensed. https://openai.com/index/openai-to-acquire-promptfoo/
[4] "Assertions & metrics," Promptfoo docs (last updated 2026-05-19): defaultTest, is-json with JSON-schema validation, cost, javascript, contains-json, and the llm-rubric model-graded assertion. https://www.promptfoo.dev/docs/configuration/expected-outputs/
[5] "LLM Rubric," Promptfoo docs (last updated 2026-05-12): llm-rubric is the LLM-as-a-judge grader; the grading model is chosen from the available provider key (gpt-5 when an OpenAI key is set) and can be overridden with the assertion provider field. https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/
[6] "Caching," Promptfoo docs (last updated 2026-05-18): disk cache default ~/.promptfoo/cache, 14-day TTL, --no-cache, error responses not cached, 429 backoff. https://www.promptfoo.dev/docs/configuration/caching/
[7] "CI/CD Integration for LLM Evaluation and Security," Promptfoo docs (last updated 2026-05-01): JSON output schema (results.stats.failures) and the jq-based quality gate. https://www.promptfoo.dev/docs/integrations/ci-cd/
[8] "Testing Prompts with GitHub Actions," Promptfoo docs (last updated 2026-05-18): the official promptfoo/promptfoo-action@v1 action and its before/after PR comment. https://www.promptfoo.dev/docs/integrations/github-action/