Capstone — port one prompt across 8 models
This is Hagar's final task. Her CTO is still asking the same question — Claude or GPT, what about Gemini, what about open-weight, what should we actually do? She is going to answer it the only way that survives a board meeting: with a real comparison report based on real captures.
The capstone is small enough to finish in an afternoon, large enough to be a usable artifact for her team.
[Figure: Capstone evaluation pipeline]
The structure
Pick one prompt that represents a real task at your company. Not a toy puzzle. The actual prompt your application sends thousands of times a day. Could be a summarisation prompt, a classification prompt, a tone-rewrite prompt, a structured-extraction prompt, or a customer-reply prompt. Whichever one is the highest-cost or highest-volume item in your stack.
Then run that prompt against eight models. The recommended slate:
| Model | Vendor | Tier | Why include it |
|---|---|---|---|
| Claude Sonnet 4.5 | Anthropic | Frontier | Course baseline; instruction-discipline benchmark |
| GPT-4o-mini | OpenAI | Frontier (cheap) | Course baseline; cost-efficient default |
| GPT-4o | OpenAI | Frontier | What you escalate to from mini |
| Gemini 2.5 Flash | Google | Frontier (cheap) | Course baseline; latency benchmark |
| Gemini 2.5 Pro | Google | Frontier | What you escalate to from Flash |
| Grok 3 (or 4) | xAI | Frontier | Different training data, different dialect |
| Llama 4 70B | Meta (open) | Open-weight | Self-hostable, high quality |
| Qwen 3 32B | Alibaba (open) | Open-weight | Permissive licence, code-strong |
Slate verified 2026-04-27. Model availability changes — confirm each one is still served by its vendor before you start the run, and substitute the closest available model for any that has been deprecated.
Eight is the right number. It is enough to cover the dialect spread you saw in the course. It is small enough to fit in one comparison spreadsheet.
The mechanics
Send the exact same prompt to all eight. Capture the raw output. Save the latency. Save the input/output token counts. Save the dollar cost per request (computed from each vendor's published pricing — pricing changes frequently, always check the vendor's pricing page on the day of your run).
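The capture step can be sketched in a few lines. This is a minimal sketch, not a vendor integration: `call_model` is a hypothetical per-model adapter you write against each vendor's SDK, returning the raw text plus the token counts the vendor reports.

```python
import json
import time

# Hypothetical adapter: implement call_model() once per vendor SDK.
# It must return (raw_text, input_tokens, output_tokens).
def capture(model_name, call_model, prompt):
    start = time.perf_counter()
    raw, tokens_in, tokens_out = call_model(prompt)
    latency_s = time.perf_counter() - start
    return {
        "model": model_name,
        "output": raw,              # saved verbatim -- never edited
        "latency_s": round(latency_s, 3),
        "input_tokens": tokens_in,
        "output_tokens": tokens_out,
    }

# One JSON Lines row per model keeps the run easy to diff and reload.
def save_run(rows, path="capstone_run.jsonl"):
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

Running `capture` once per model in the slate gives you eight rows that drop straight into the comparison spreadsheet.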
For reference only — illustrative price ranges as of the course's last verified date (2026-04-27). Do not use these numbers in your capstone report; pull current prices from each vendor's pricing page.
| Model | Input price /1M tokens | Output price /1M tokens |
|---|---|---|
| Claude Sonnet 4.5 | ~$3 | ~$15 |
| GPT-4o-mini | ~$0.15 | ~$0.60 |
| GPT-4o | ~$2.50 | ~$10 |
| Gemini 2.5 Flash | ~$0.075 | ~$0.30 |
| Gemini 2.5 Pro | ~$1.25 | ~$5 |
| Grok 3 | (check vendor) | (check vendor) |
| Llama 4 70B (self-hosted on g5.12xlarge) | ~$0.003 (compute amortised) | ~$0.003 |
| Qwen 3 32B (self-hosted) | ~$0.002 (compute amortised) | ~$0.002 |
⚠ Prices change frequently. The values above are for illustration only and may be out of date. Always verify current pricing directly with the provider before making cost decisions: Anthropic · OpenAI · Google Gemini · Google Vertex AI · AWS Bedrock · Azure OpenAI · Mistral · Cohere · Together AI · DeepSeek · Groq · Fireworks AI · Perplexity · xAI · Cursor · GitHub Copilot · Windsurf.
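The per-request dollar cost is simple arithmetic over the per-million-token prices. A one-function sketch, using the illustrative GPT-4o-mini prices from the table above purely as example inputs (substitute the current prices from the vendor's pricing page on the day of your run):

```python
def request_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens / 1_000_000) * price_in_per_m \
         + (output_tokens / 1_000_000) * price_out_per_m

# Example: 2,000 tokens in, 500 tokens out at the illustrative
# GPT-4o-mini prices ($0.15 in / $0.60 out per 1M tokens):
# (2000/1e6)*0.15 + (500/1e6)*0.60 = 0.0003 + 0.0003 = $0.0006
cost = request_cost(2_000, 500, 0.15, 0.60)
```

Multiplying that per-request figure by your daily request volume is what turns the comparison into the cost-savings estimate the report needs.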
Do not edit the model outputs. The whole point is that each output has its own dialect, and the dialect is the answer. If a model produces 600 words when you wanted 100, that is data. If a model wraps JSON in a fence, that is data. If a model truncates, that is data.
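The length, fencing, and truncation observations above can be flagged mechanically rather than by eye. A minimal sketch; `finish_reason` stands in for whatever stop-reason field the vendor's response exposes, and the flags record dialect, not pass/fail:

```python
import re

def dialect_flags(output, want_words=100, finish_reason="stop"):
    """Mechanical checks on one raw output; flags are data, not failures."""
    words = len(output.split())
    return {
        "over_length": words > want_words,             # e.g. 600 words for 100
        "fenced_json": bool(re.search(r"```(json)?", output)),
        "truncated": finish_reason != "stop",          # vendor-reported stop reason
    }
```

Attach these flags to each captured row and the "which model follows the rules" column of the spreadsheet fills itself in.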
What this is for
The deliverable is a one-page report your CTO can read in five minutes. It has a recommendation, a cost-savings estimate, and a risk note. It does NOT have a model leaderboard or a generic "Claude is best" verdict. It has a routing decision for one specific prompt, with the evidence underneath.
If the answer turns out to be "GPT-4o-mini handles this fine, switch from Claude and save 95% of the API spend on this task" — ship that recommendation. If the answer is "Claude is the only one that follows the rules; the cost is justified" — ship that. If the answer is "Llama 4 70B self-hosted handles 90% of cases, escalate the other 10% to Claude" — ship that. The course taught you to gather the evidence. The capstone is gathering it.
The next three lessons cover the rubric, the report structure, and the recommendation framing.
Next: a comparison rubric — how to score eight outputs without bias.