Capstone — port one prompt across 8 models
Framing the recommendation
The capture is data. The scoreboard is analysis. The recommendation is a decision. Decisions land or do not land based on how they are framed. Here is the framing pattern that consistently survives leadership review.
The four sentences that matter
Every recommendation in your report should be expressible in exactly four sentences. If you cannot express it in four, you have not finished thinking yet.
- What we are changing. "Switch the customer-reply prompt from Claude Sonnet 4.5 to GPT-4o-mini."
- What we expect to gain. "Estimated monthly API spend reduction: $4,200, based on current volume of 1.2M requests per month."
- What we expect to lose. "Quality scores on our 20-input rubric drop from 2.85 to 2.74 average — within the noise band of human reviewer disagreement."
- How we know if we are wrong. "If complaint rate for AI-generated replies rises by more than 0.3 percentage points in the first 14 days, we revert."
Every sentence has a number. Every sentence is falsifiable. Every sentence can be agreed or disagreed with on its own.
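It helps to keep the arithmetic behind sentences two and four executable rather than buried in a spreadsheet cell. Below is a minimal Python sketch; the per-request costs are illustrative placeholders chosen so the delta reproduces the $4,200 estimate — only the 1.2M monthly volume, the $4,200 figure, and the 0.3-percentage-point guardrail come from the example sentences above.

```python
# Back-of-envelope support for sentences 2 and 4 of the recommendation.
# Per-request costs are illustrative placeholders; plug in the blended
# input+output cost per request you measured during the capstone capture.

MONTHLY_VOLUME = 1_200_000          # current request volume (from sentence 2)

COST_PER_REQUEST = {
    "claude-sonnet-4.5": 0.0050,    # placeholder blended cost per request
    "gpt-4o-mini": 0.0015,          # placeholder blended cost per request
}

def monthly_savings(current: str, proposed: str) -> float:
    """Sentence 2: expected monthly API spend reduction."""
    delta = COST_PER_REQUEST[current] - COST_PER_REQUEST[proposed]
    return delta * MONTHLY_VOLUME

def should_revert(baseline_complaint_rate: float,
                  observed_complaint_rate: float,
                  threshold_pp: float = 0.3) -> bool:
    """Sentence 4: revert if the complaint rate rises by more than
    `threshold_pp` percentage points during the shadow-rollout window."""
    rise_pp = (observed_complaint_rate - baseline_complaint_rate) * 100
    return rise_pp > threshold_pp

if __name__ == "__main__":
    savings = monthly_savings("claude-sonnet-4.5", "gpt-4o-mini")
    print(f"Estimated monthly savings: ${savings:,.0f}")
    # e.g. baseline 1.1% complaints vs 1.5% observed -> 0.4 pp rise -> revert
    print("Revert?", should_revert(0.011, 0.015))
```

Keeping the guardrail as a function means the 14-day check can run against real metrics instead of living only in the report.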
Why this format works
Leadership reviewers fail recommendations for predictable reasons. They cannot tell what is being decided ("are we switching all models or just one?"). They cannot tell the size of the upside ("is this $50 a month or $50,000?"). They cannot tell the risk ("if this fails, what happens to users?"). They cannot tell when to revisit ("how do we know this is working?").
The four-sentence format addresses each failure directly. It is not the only format that works, but it is the simplest format that works.
Three things that kill recommendations
1. "We tested several models and Claude was best." This is not a recommendation; it is an observation. There is no action attached. Add the action: "...so we are switching X to Claude" or "...so we are keeping X on Claude". The "so" matters.
2. Recommendations without a downside. If your report says "this change has no downside, we should ship it", leadership will not believe you. Every change has a downside. If you cannot name one, you have not looked. Common downsides: tighter vendor lock-in, less predictable behaviour, slightly lower quality on edge cases, latency increase.
3. Recommendations that depend on conditions outside your control. "We should switch as long as Anthropic does not raise prices" is not a recommendation; it is a hope. Take a position on what you would do if the assumption breaks.
When the answer is "do not change"
Sometimes the right answer to your CTO's question is "no". The capstone is not biased toward producing a switch — it is biased toward producing a decision. If your eight-model comparison shows Claude is genuinely the right model for the prompt and the cost is justified by the quality, ship that recommendation:
"Keep [task] on Claude Sonnet 4.5. Tested seven alternatives; closest competitor scored 2.4 vs Claude's 2.85, with downstream user-impact estimated at -7% conversion. The premium of $0.0027 per request is justified by the quality delta on this prompt."
That recommendation is just as valuable as a switch. It defends the current spend with evidence and it documents the reasoning so the question does not get re-litigated next quarter.
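If you want "justified by the quality delta" to be more than an assertion, a quick break-even comparison makes the trade explicit. In the sketch below, only the $0.0027 premium and the -7% conversion estimate come from the example recommendation; the volume, conversion count, and value per conversion are hypothetical placeholders you would replace with your own figures.

```python
# A rough break-even check behind "the premium is justified by the quality
# delta". Every figure except the $0.0027 premium and the -7% conversion
# estimate is a hypothetical placeholder.

PREMIUM_PER_REQUEST = 0.0027        # Claude premium over the closest competitor
MONTHLY_VOLUME = 400_000            # hypothetical monthly volume for this prompt

CONVERSION_IMPACT = 0.07            # estimated relative conversion drop if we downgrade
BASELINE_CONVERSIONS = 9_000        # hypothetical monthly conversions touched by this prompt
VALUE_PER_CONVERSION = 30.0         # hypothetical value of one conversion, in dollars

premium_cost = PREMIUM_PER_REQUEST * MONTHLY_VOLUME
revenue_at_risk = CONVERSION_IMPACT * BASELINE_CONVERSIONS * VALUE_PER_CONVERSION

print(f"Monthly cost of staying on Claude: ${premium_cost:,.0f}")
print(f"Estimated monthly revenue at risk if we switch: ${revenue_at_risk:,.0f}")
print("Keep Claude" if revenue_at_risk > premium_cost else "Reconsider the switch")
```

The point is not the specific numbers; it is that a "keep" recommendation deserves the same falsifiable arithmetic as a "switch".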
Hagar's last move
Hagar walks into next Monday's standup. She presents the report. The CTO reads section 1 in fifteen seconds. He asks two clarifying questions. He approves the routing change. The PR merges that afternoon. The shadow rollout starts the same day.
Three weeks later, the metrics confirm the prediction. The savings are real. The quality regression is in the noise. Hagar's next project is on the stack.
That is the loop this course was built to teach. Three frontier APIs side-by-side. Eight models in the capstone. One report. One decision. One PR.
The course ends here. The capstone keeps going — for as many prompts as your application sends.