Capstone — port one prompt across 8 models


This is Hagar's final task. Her CTO is still asking the same question — Claude or GPT? What about Gemini? What about open-weight models? What should we actually do? She is going to answer it the only way that survives a board meeting: with a real comparison report based on real captures.

The capstone is small enough to finish in an afternoon, large enough to be a usable artifact for her team.

*Capstone evaluation pipeline: one input prompt → 8 parallel runs → one capture per run → comparison output.*

The structure

Pick one prompt that represents a real task at your company. Not a toy puzzle. The actual prompt your application sends thousands of times a day. Could be a summarisation prompt, a classification prompt, a tone-rewrite prompt, a structured-extraction prompt, or a customer-reply prompt. Whichever one is the highest-cost or highest-volume item in your stack.

Then run that prompt against eight models. The recommended slate:

| Model | Vendor | Tier | Why include it |
|---|---|---|---|
| Claude Sonnet 4.5 | Anthropic | Frontier | Course baseline; instruction-discipline benchmark |
| GPT-4o-mini | OpenAI | Frontier (cheap) | Course baseline; cost-efficient default |
| GPT-4o | OpenAI | Frontier | What you escalate to from mini |
| Gemini 2.5 Flash | Google | Frontier (cheap) | Course baseline; latency benchmark |
| Gemini 2.5 Pro | Google | Frontier | What you escalate to from Flash |
| Grok 3 (or 4) | xAI | Frontier | Different training data, different dialect |
| Llama 4 70B | Meta (open) | Open-weight | Self-hostable, high quality |
| Qwen 3 32B | Alibaba (open) | Open-weight | Permissive licence, code-strong |

Slate verified 2026-04-27. Model availability changes — confirm each one is still served by its vendor before you start the run, and substitute the closest available model for any that has been deprecated.

Eight is the right number. It is enough to cover the dialect spread you saw in the course. It is small enough to fit in one comparison spreadsheet.

The mechanics

Send the exact same prompt to all eight. Capture the raw output. Save the latency. Save the input/output token counts. Save the dollar cost per request (computed from each vendor's published pricing — pricing changes frequently, always check the vendor's pricing page on the day of your run).
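The capture step is simple enough to sketch. Below is a minimal, illustrative Python version; `call_fn`, the model names, and the pricing numbers are all assumptions — swap in your real client wrapper and the prices you verified on run day.

```python
import time

# ASSUMED placeholder prices (USD per 1M input / output tokens).
# Replace with the numbers from each vendor's pricing page on run day.
PRICING = {
    "model-a": (3.00, 15.00),
    "model-b": (0.15, 0.60),
}

def cost_usd(model, input_tokens, output_tokens):
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

def capture(model, prompt, call_fn):
    """Run one model once and record everything the comparison needs.

    call_fn is your own client wrapper (hypothetical here); it must
    return (raw_text, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    text, in_tok, out_tok = call_fn(model, prompt)
    return {
        "model": model,
        "output": text,  # raw and unedited — the dialect is the data
        "latency_s": round(time.perf_counter() - start, 3),
        "input_tokens": in_tok,
        "output_tokens": out_tok,
        "cost_usd": cost_usd(model, in_tok, out_tok),
    }

if __name__ == "__main__":
    # Stub call for illustration: fixed text and token counts.
    fake = lambda model, prompt: ("stub output", 1_000, 200)
    row = capture("model-a", "summarise this ticket", fake)
    print(row["cost_usd"])  # (1000*3 + 200*15) / 1e6 = 0.006
```

Run it once per model and append each dict as a row in your spreadsheet; eight rows later the comparison table builds itself.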

For reference only — illustrative price ranges as of the course's last verified date (2026-04-27). Do not use these numbers in your capstone report; pull current prices from each vendor's pricing page.

| Model | Input price /1M tokens | Output price /1M tokens |
|---|---|---|
| Claude Sonnet 4.5 | ~$3 | ~$15 |
| GPT-4o-mini | ~$0.15 | ~$0.60 |
| GPT-4o | ~$2.50 | ~$10 |
| Gemini 2.5 Flash | ~$0.075 | ~$0.30 |
| Gemini 2.5 Pro | ~$1.25 | ~$5 |
| Grok 3 | (check vendor) | (check vendor) |
| Llama 4 70B (self-hosted on g5.12xlarge) | ~$0.003 (compute amortised) | ~$0.003 |
| Qwen 3 32B (self-hosted) | ~$0.002 (compute amortised) | ~$0.002 |

⚠ Prices change frequently. The values above are for illustration only and may be out of date. Always verify current pricing directly with the provider before making cost decisions: Anthropic · OpenAI · Google Gemini · Google Vertex AI · AWS Bedrock · Azure OpenAI · Mistral · Cohere · Together AI · DeepSeek · Groq · Fireworks AI · Perplexity · xAI · Cursor · GitHub Copilot · Windsurf.
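The per-request arithmetic behind the table is just tokens times price over one million. A quick sketch using the illustrative (not current) prices above and an assumed request size:

```python
def request_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    # cost = tokens × (USD per 1M tokens) / 1,000,000
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

# Assumed workload: 1,500 input + 300 output tokens per request
sonnet = request_cost(1_500, 300, 3.00, 15.00)   # $0.009
mini   = request_cost(1_500, 300, 0.15, 0.60)    # $0.000405
print(f"mini is ~{1 - mini / sonnet:.1%} cheaper per request")
```

Under those assumed numbers, a million requests a month is roughly $9,000 versus roughly $400 — the kind of gap the one-page report exists to surface.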

Do not edit the model outputs. The whole point is that each output has its own dialect, and the dialect is the answer. If a model produces 600 words when you wanted 100, that is data. If a model wraps JSON in a fence, that is data. If a model truncates, that is data.

What this is for

The deliverable is a one-page report your CTO can read in five minutes. It has a recommendation, a cost-savings estimate, and a risk note. It does NOT have a model leaderboard or a generic "Claude is best" verdict. It has a routing decision for one specific prompt, with the evidence underneath.

If the answer turns out to be "GPT-4o-mini handles this fine, switch from Claude and save 95% of the API spend on this task" — ship that recommendation. If the answer is "Claude is the only one that follows the rules; the cost is justified" — ship that. If the answer is "Llama 4 70B self-hosted handles 90% of cases, escalate the other 10% to Claude" — ship that. The course taught you to gather the evidence. The capstone is gathering it.

The next three lessons cover the rubric, the report structure, and the recommendation framing.

Next: a comparison rubric — how to score eight outputs without bias.
