Capstone — port one prompt across 8 models
Building the comparison report
You have the captures and the scores. The report is what makes them actionable. A report that does not produce a decision is wasted work. Here is the structure that consistently produces a decision.
The one-page report structure
Section 1 — The recommendation, in one sentence. Lead with the answer. "Switch the [task name] from [current model] to [recommended model]; estimated monthly savings: $[X]; expected quality change: [no change / +N points / -N points]." This sentence is what gets quoted in the leadership Slack. Everything else is supporting evidence.
Section 2 — The prompt, verbatim. Include the exact prompt the application sends. Five lines or fifty. This is the only thing that makes the report reproducible — anyone reading later can run the same comparison and verify the result.
Section 3 — The scoreboard. A table with one row per model and columns for each rubric dimension plus the cost-per-acceptable-output number. This is the most important visual in the report. Sort it by cost-per-acceptable-output ascending. The row at the top is your recommendation.
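If your scores live in a capture file, the scoreboard rows fall out of a short aggregation. Here is a minimal sketch, assuming each record carries a model name, a per-run cost, and a pass/fail verdict from the rubric; the field names (`model`, `cost_usd`, `acceptable`) are illustrative, not part of the course's schema:

```python
# Sketch: build scoreboard rows from scored capture records and sort
# by cost-per-acceptable-output. Field names are illustrative -- match
# them to your own capture schema.
from collections import defaultdict

def scoreboard(records):
    totals = defaultdict(lambda: {"cost": 0.0, "runs": 0, "ok": 0})
    for r in records:
        t = totals[r["model"]]
        t["cost"] += r["cost_usd"]
        t["runs"] += 1
        t["ok"] += 1 if r["acceptable"] else 0
    rows = []
    for model, t in totals.items():
        # Cost per *acceptable* output: total spend divided by passing
        # runs. A model that never passes gets infinite cost and sorts last.
        cpao = t["cost"] / t["ok"] if t["ok"] else float("inf")
        rows.append({"model": model,
                     "runs": t["runs"],
                     "acceptance": t["ok"] / t["runs"],
                     "cost_per_acceptable": cpao})
    return sorted(rows, key=lambda row: row["cost_per_acceptable"])
```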
Section 4 — Why this model and not the cheaper one above. If your recommendation is not the cheapest model on the scoreboard, explain why. Name the constraint the cheaper model failed, the inputs it failed on, and what fraction of the time. This is where a leadership reader will push back, so pre-empt the question.
Section 5 — Risk and rollout plan. Three subsections. (a) What we lose if we are wrong: name the worst-case quality regression in user-facing terms. (b) How we measure: what production metric will tell us within a week if the switch is bad. (c) Rollout plan: shadow-test 1% of traffic for a week, then ramp.
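The shadow phase is easier to defend if bucket assignment is deterministic, so the same request always lands in or out of the shadow. One way to do it, as a sketch; the request-ID keying and the 1% default are assumptions, not a prescribed mechanism:

```python
# Sketch: deterministic percentage-based shadow sampling. Hashing a
# stable request ID means a given request is always in or out of the
# shadow bucket; 1.0 percent is the lesson's starting ramp value.
import hashlib

def in_shadow_bucket(request_id: str, percent: float = 1.0) -> bool:
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100  # 1.0% -> buckets 0..99 of 10,000
```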
Section 6 — Appendix: 5-10 example outputs. Verbatim, side by side, the same prompt run on the recommended model and the previous model. Let the reader see the actual outputs. This converts skeptics faster than any aggregate score.
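The appendix can be generated straight from the capture file rather than pasted by hand. A hedged sketch, assuming each case records the input and one output per model; the `input` and `outputs` fields are placeholders for whatever your captures actually store:

```python
# Sketch: render the appendix's side-by-side outputs as a markdown
# table from capture records. Field names are illustrative.
def side_by_side(cases, old_model: str, new_model: str) -> str:
    lines = [f"| Input | {old_model} | {new_model} |",
             "| --- | --- | --- |"]
    for case in cases:
        cells = (case["input"],
                 case["outputs"][old_model],
                 case["outputs"][new_model])
        # Markdown tables cannot hold raw newlines; swap in <br> tags.
        lines.append("| " + " | ".join(
            c.replace("\n", "<br>") for c in cells) + " |")
    return "\n".join(lines)
```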
What to leave out
A surprising amount.
Leave out general benchmark scores from the model providers. They are about other tasks. Leave out cost-per-million-tokens comparisons in isolation — they are misleading without a quality denominator. Leave out broad commentary about which model is "better" overall. Your report is about one prompt.
Leave out comparisons of models you did not actually test. If you only ran six of the eight models from lesson 1, do not speculate on the other two. The whole point of the report is "we measured it". Speculation undermines the measurement.
How long should it be
Two pages, plus appendix. A leadership reader gives a comparison report 90 seconds before deciding whether to read the rest. Section 1 has to land in those 90 seconds. Sections 2-5 fit on the second page. The appendix can run as long as it needs.
If your report is longer than two pages of body text, you have buried the recommendation. Cut.
What ships next to the report
Two artifacts go alongside:
- The capture file — JSON of all the raw outputs, latencies, token counts, and costs. Anyone re-running the comparison can diff against this baseline. Same shape as the `latest.json` you read at the start of this course. (A diff sketch follows this list.)
- The routing change — the actual code change in your application that implements the new routing. A report without an associated PR is a report that does not get adopted. (A routing sketch follows this list.)
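What a re-run diff against the baseline might look like, as a sketch; the record keys (`model`, `latency_ms`, `tokens`, `cost_usd`) are assumed to match the `latest.json` shape from earlier lessons and should be adjusted to the real schema:

```python
# Sketch: diff a fresh capture run against the committed baseline.
# Assumes each file is a JSON list of per-model records; key names
# are assumptions -- adjust to your capture schema.
import json

def diff_captures(baseline_path: str, new_path: str) -> None:
    with open(baseline_path) as f:
        baseline = {r["model"]: r for r in json.load(f)}
    with open(new_path) as f:
        fresh = {r["model"]: r for r in json.load(f)}
    for model in sorted(baseline.keys() & fresh.keys()):
        old, new = baseline[model], fresh[model]
        for key in ("latency_ms", "tokens", "cost_usd"):
            if old.get(key) != new.get(key):
                print(f"{model}.{key}: {old.get(key)} -> {new.get(key)}")
```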
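The routing change itself is usually small. A hypothetical sketch, assuming the application keys model choice off a routing table; the task and model names are placeholders, not the course's actual models:

```python
# Sketch: the routing change as it might appear in an application
# with a task-to-model routing table. The edited entry is the
# one-line diff the report's PR would carry.
MODEL_ROUTES = {
    "summarize-ticket": "model-b",  # was "model-a" -- see comparison report
    "draft-reply": "model-a",
}

def model_for(task: str) -> str:
    return MODEL_ROUTES[task]
```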
The capstone for this course is to produce all three: the report, the captures, and the routing change. They form the complete artifact your team would actually merge.
Next: framing the recommendation — making it land.