Capstone — port one prompt across 8 models

The comparison rubric — scoring eight outputs


You have eight raw captures. You need to score them on the dimensions that actually matter to your application. The temptation is to read the outputs and form a gut impression. Resist it. Build a rubric, score each output against the rubric blind to the model name, then look at the aggregated scores.

Five rubric dimensions you actually grade on

Constraint compliance (most objective)
  Scoring: 1 point per constraint followed
  Subjectivity: Low
  Example: 4-rule prompt: 4/4, 3/4, 0/4
  Pros:
  • Easy to score blind
  • Catches dialect quirks directly

Faithfulness (0–3 scale)
  Scoring: 0 = fabricated, 3 = literal
  Subjectivity: Low-medium
  Example: Herbal-tea summary drift
  Pros:
  • Critical for compliance/legal
  • Catches subtle drift

Tone match (0–3 scale)
  Scoring: 0 = embarrassing, 3 = on-voice
  Subjectivity: Medium-high
  Example: Bayt Coffee voice on a refund reply
  Cons:
  • Needs anchored examples
  • One scorer for all rows

Format usability (0–3 scale)
  Scoring: 0 = unparseable, 3 = parses directly
  Subjectivity: Low
  Example: Wrapped JSON fence, prose preamble
  Pros:
  • Mechanical to test
  • Direct cost in pre-processing time

Cost per acceptable output (the big one)
  Formula: $ per call ÷ acceptance rate
  Subjectivity: Low (once acceptance is defined)
  Example: $0.001 ÷ 60% acceptance ≈ $0.0017 effective
  Pros:
  • Aligns model choice with budget
  • Reveals cheap-but-broken patterns

The five dimensions to score on

These five dimensions cover almost every production prompt. Pick the ones most relevant to your specific task and ignore the rest.

1. Constraint compliance. If your prompt has hard constraints (word count, format, forbidden words, line counts), did the output follow each constraint? Score 1 point per constraint followed, sum them. This is the most objective dimension.
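
Because each check is mechanical, this dimension can be scored by a script. Here is a minimal sketch assuming four hypothetical constraints (word cap, forbidden character, line count, greeting prefix); the names and limits are illustrative, not from any real prompt:

```python
# Hypothetical hard constraints for one prompt; each returns True if followed.
CONSTRAINTS = {
    "under_80_words": lambda text: len(text.split()) <= 80,
    "no_exclamation_marks": lambda text: "!" not in text,
    "exactly_3_lines": lambda text: len(text.strip().splitlines()) == 3,
    "starts_with_greeting": lambda text: text.lstrip().lower().startswith("hi"),
}

def constraint_score(text: str) -> int:
    """One point per constraint followed, summed: a 4-rule prompt scores 0-4."""
    return sum(1 for check in CONSTRAINTS.values() if check(text))
```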

2. Faithfulness. Did the output add facts, soften claims, strengthen claims, or hallucinate content? You saw this measured directly in Module 2 lesson 3 (the herbal tea summary). Score 0–3: 3 = perfect faithfulness, 2 = minor drift, 1 = noticeable drift, 0 = made-up content.

3. Tone match. Does the output sound like your brand or your application's voice? This is more subjective. Use a 0-3 score with explicit anchors: 3 = on-voice, 2 = neutral but acceptable, 1 = off-voice, 0 = embarrassing.

4. Format usability. Can your downstream code consume the output without pre-processing? Score: 3 = parses directly, 2 = needs a regex or simple strip, 1 = needs structured rewriting, 0 = unparseable.
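
This one is also mechanical to test. Below is a sketch of the 0–3 grade for a prompt that expects JSON; the fence-and-preamble stripping is an assumption that covers only the two failure modes named above:

```python
import json
import re

# Build the fence marker programmatically to avoid a literal fence here.
FENCE = "`" * 3
FENCED_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def format_usability(raw: str) -> int:
    """0-3 score: how much pre-processing a JSON output needs."""
    try:
        json.loads(raw)
        return 3  # parses directly
    except json.JSONDecodeError:
        pass
    # The two wrappers named above: a markdown fence, or a prose preamble
    # before the first brace. If a simple strip fixes it, score 2.
    match = FENCED_RE.search(raw)
    candidate = match.group(1) if match else raw[raw.find("{"):raw.rfind("}") + 1]
    try:
        json.loads(candidate)
        return 2  # needs a regex or simple strip
    except json.JSONDecodeError:
        return 1 if "{" in raw else 0  # structured rewrite vs. unparseable
```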

5. Cost-per-acceptable-output. This is the big one. Compute: dollar cost of one call divided by the probability the output is acceptable. A model that costs $0.001 but only produces an acceptable answer 60% of the time effectively costs $0.0017 per acceptable answer. A model that costs $0.005 but produces an acceptable answer 95% of the time effectively costs $0.0053. The cheaper-per-call model can be more expensive in practice.
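
The same arithmetic as a helper, using the two worked examples from this paragraph:

```python
def cost_per_acceptable(cost_per_call: float, acceptance_rate: float) -> float:
    """Effective cost: dollars per call divided by acceptance probability."""
    return cost_per_call / acceptance_rate

print(cost_per_acceptable(0.001, 0.60))  # ~0.0017: cheaper per call, pricier per good answer
print(cost_per_acceptable(0.005, 0.95))  # ~0.0053
```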

How to keep scoring honest

Three rules:

  1. Score blind. Strip the model labels before scoring. Use a column header like "Output A, B, C..." and only reveal which model produced which after the scores are locked. This eliminates the bias toward the model you expected to win.

  2. Use the same scorer for all rows. Different humans will score subjective dimensions differently. If the rubric needs subjective scoring, one person does all of it. If you can use an LLM as a scorer, even better — but pick a different model from the ones you are evaluating, otherwise the scorer is biased toward its own dialect.

  3. Score on at least 20 inputs, not 1. A single input is anecdotal. Twenty captures give you a real distribution. The eight-models-by-twenty-inputs grid is 160 cells, which is a few hours of work but produces a defensible report. A sketch combining this rule with rule 1 follows the list.
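
A minimal sketch of rules 1 and 3 together; the model names, the rubric stub, and the 20-input count are placeholders, not part of any real harness:

```python
import random
import statistics
import string

# Placeholder grid: captures[model] holds that model's 20 outputs, in input order.
captures = {f"model-{n}": [f"output {n}-{i}" for i in range(20)] for n in range(8)}

def rubric_total(output: str) -> int:
    # Stub: in practice this is the human-entered rubric sum for one output.
    return len(output) % 4

# Rule 1: blind the labels. Shuffle the models, then refer to them only as
# "Output A".."Output H"; the key stays sealed until scores are locked.
models = list(captures)
random.shuffle(models)
blind_key = {f"Output {letter}": model
             for letter, model in zip(string.ascii_uppercase, models)}

# Rule 3: fill the full 8-models-by-20-inputs grid (160 cells).
scores = {label: [rubric_total(out) for out in captures[model]]
          for label, model in blind_key.items()}

# Only after the grid is full: unseal the key and report the distribution.
for label in sorted(scores):
    row = scores[label]
    print(f"{label} ({blind_key[label]}): mean={statistics.mean(row):.2f}, "
          f"min={min(row)}, max={max(row)}")
```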

What the rubric is not

The rubric is not a leaderboard for the field. It is a leaderboard for your specific prompt. A model that scores low on your rubric might be the best in the world at a different task. Hagar's report is "for the prompts our application sends, here is what we should use" — not "here is what humanity should use".

This narrowness is the report's strength. No one can ship a recommendation from a general "Claude vs GPT vs Gemini" comparison. A specific "for our customer-reply prompt sent 50,000 times a day, here is the routing" is shippable on Friday.

Next: building the report — the structure that survives a leadership read.
