Capstone — port one prompt across 8 models

The comparison rubric — scoring eight outputs


You have eight raw captures. You need to score them on the dimensions that actually matter to your application. The temptation is to read the outputs and form a gut impression. Resist it. Build a rubric, score each output against the rubric blind to the model name, then look at the aggregated scores.

Five rubric dimensions you actually grade on

Constraint compliance (most objective)
  Scoring: 1 point per constraint followed
  Subjectivity: Low
  Example: 4-rule prompt: 4/4, 3/4, 0/4
  Pros:
  • Easy to score blind
  • Catches dialect quirks directly

Faithfulness (0–3 scale)
  Scoring: 0 = fabricated, 3 = literal
  Subjectivity: Low-medium
  Example: Herbal-tea summary drift
  Pros:
  • Critical for compliance/legal
  • Catches subtle drift

Tone match (0–3 scale)
  Scoring: 0 = embarrassing, 3 = on-voice
  Subjectivity: Medium-high
  Example: Bayt Coffee voice on a refund reply
  Cons:
  • Needs anchored examples
  • One scorer for all rows

Format usability (0–3 scale)
  Scoring: 0 = unparseable, 3 = parses directly
  Subjectivity: Low
  Example: Wrapped JSON fence, prose preamble
  Pros:
  • Mechanical to test
  • Direct cost in pre-processing time

Cost per acceptable output (the big one)
  Formula: $ per call ÷ acceptance rate
  Subjectivity: Low (once acceptance is defined)
  Example: $0.001 ÷ 60% acceptance ≈ $0.0017 effective
  Pros:
  • Aligns model choice with budget
  • Reveals cheap-but-broken patterns

The five dimensions to score on

These five dimensions cover almost every production prompt. Pick the ones most relevant to your specific task and ignore the rest.

1. Constraint compliance. If your prompt has hard constraints (word count, format, forbidden words, line counts), did the output follow each constraint? Score 1 point per constraint followed, sum them. This is the most objective dimension.
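
Because each check is mechanical, this dimension can be scored by a script. Here is a minimal sketch assuming four hypothetical constraints (word cap, forbidden character, line count, greeting prefix); the names and limits are illustrative, not from any real prompt:

```python
# Hypothetical hard constraints for one prompt; each returns True if followed.
CONSTRAINTS = {
    "under_80_words": lambda text: len(text.split()) <= 80,
    "no_exclamation_marks": lambda text: "!" not in text,
    "exactly_3_lines": lambda text: len(text.strip().splitlines()) == 3,
    "starts_with_greeting": lambda text: text.lstrip().lower().startswith("hi"),
}

def constraint_score(text: str) -> int:
    """One point per constraint followed, summed: a 4-rule prompt scores 0-4."""
    return sum(1 for check in CONSTRAINTS.values() if check(text))
```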

2. Faithfulness. Did the output add facts, soften claims, strengthen claims, or hallucinate content? You saw this measured directly in Module 2 lesson 3 (the herbal tea summary). Score 0–3: 3 = perfect faithfulness, 2 = minor drift, 1 = noticeable drift, 0 = made-up content.

3. Tone match. Does the output sound like your brand or your application's voice? This is more subjective. Use a 0-3 score with explicit anchors: 3 = on-voice, 2 = neutral but acceptable, 1 = off-voice, 0 = embarrassing.

4. Format usability. Can your downstream code consume the output without pre-processing? Score: 3 = parses directly, 2 = needs a regex or simple strip, 1 = needs structured rewriting, 0 = unparseable.
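
This one is also mechanical to test. Below is a sketch of the 0–3 grade for a prompt that expects JSON; the fence-and-preamble stripping is an assumption that covers only the two failure modes named above:

```python
import json
import re

# Build the fence marker programmatically to avoid a literal fence here.
FENCE = "`" * 3
FENCED_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def format_usability(raw: str) -> int:
    """0-3 score: how much pre-processing a JSON output needs."""
    try:
        json.loads(raw)
        return 3  # parses directly
    except json.JSONDecodeError:
        pass
    # The two wrappers named above: a markdown fence, or a prose preamble
    # before the first brace. If a simple strip fixes it, score 2.
    match = FENCED_RE.search(raw)
    candidate = match.group(1) if match else raw[raw.find("{"):raw.rfind("}") + 1]
    try:
        json.loads(candidate)
        return 2  # needs a regex or simple strip
    except json.JSONDecodeError:
        return 1 if "{" in raw else 0  # structured rewrite vs. unparseable
```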

5. Cost-per-acceptable-output. This is the big one. Compute: dollar cost of one call divided by the probability the output is acceptable. A model that costs $0.001 but only produces an acceptable answer 60% of the time effectively costs $0.0017 per acceptable answer. A model that costs $0.005 but produces an acceptable answer 95% of the time effectively costs $0.0053. The cheaper-per-call model can be more expensive in practice.
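
The same arithmetic as a helper, using the two worked examples from this paragraph:

```python
def cost_per_acceptable(cost_per_call: float, acceptance_rate: float) -> float:
    """Effective cost: dollars per call divided by acceptance probability."""
    return cost_per_call / acceptance_rate

print(cost_per_acceptable(0.001, 0.60))  # ~0.0017: cheaper per call, pricier per good answer
print(cost_per_acceptable(0.005, 0.95))  # ~0.0053
```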

How to keep scoring honest

Three rules:

  1. Score blind. Strip the model labels before scoring. Use a column header like "Output A, B, C..." and only reveal which model produced which after the scores are locked. This eliminates the bias toward the model you expected to win.

  2. Use the same scorer for all rows. Different humans will score subjective dimensions differently. If the rubric needs subjective scoring, one person does all of it. If you can use an LLM as a scorer, even better — but pick a different model from the ones you are evaluating, otherwise the scorer is biased toward its own dialect.

  3. Score on at least 20 inputs, not 1. A single input is anecdotal. Twenty captures give you a real distribution. The eight-models-by-twenty-inputs grid is 160 cells, which is a few hours of work but produces a defensible report. A sketch combining this rule with rule 1 follows the list.
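
A minimal sketch of rules 1 and 3 together; the model names, the rubric stub, and the 20-input count are placeholders, not part of any real harness:

```python
import random
import statistics
import string

# Placeholder grid: captures[model] holds that model's 20 outputs, in input order.
captures = {f"model-{n}": [f"output {n}-{i}" for i in range(20)] for n in range(8)}

def rubric_total(output: str) -> int:
    # Stub: in practice this is the human-entered rubric sum for one output.
    return len(output) % 4

# Rule 1: blind the labels. Shuffle the models, then refer to them only as
# "Output A".."Output H"; the key stays sealed until scores are locked.
models = list(captures)
random.shuffle(models)
blind_key = {f"Output {letter}": model
             for letter, model in zip(string.ascii_uppercase, models)}

# Rule 3: fill the full 8-models-by-20-inputs grid (160 cells).
scores = {label: [rubric_total(out) for out in captures[model]]
          for label, model in blind_key.items()}

# Only after the grid is full: unseal the key and report the distribution.
for label in sorted(scores):
    row = scores[label]
    print(f"{label} ({blind_key[label]}): mean={statistics.mean(row):.2f}, "
          f"min={min(row)}, max={max(row)}")
```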

What the rubric is not

The rubric is not a leaderboard for the field. It is a leaderboard for your specific prompt. A model that scores low on your rubric might be the best in the world at a different task. Hagar's report is "for the prompts our application sends, here is what we should use" — not "here is what humanity should use".

This narrowness is the report's strength. No one can ship a recommendation from a general "Claude vs GPT vs Gemini" comparison. A specific "for our customer-reply prompt sent 50,000 times a day, here is the routing" is shippable on Friday.

Next: building the report — the structure that survives a leadership read.
