Capstone — port one prompt across 8 models
This is Hagar's final task. Her CTO is still asking the same question — Claude or GPT, what about Gemini, what about open-weight, what should we actually do? She is going to answer it the only way that survives a board meeting: with a real comparison report based on real captures.
The capstone is small enough to finish in an afternoon, large enough to be a usable artifact for her team.
[Figure: Capstone evaluation pipeline]
The structure
Pick one prompt that represents a real task at your company. Not a toy puzzle. The actual prompt your application sends thousands of times a day. Could be a summarisation prompt, a classification prompt, a tone-rewrite prompt, a structured-extraction prompt, or a customer-reply prompt. Whichever one is the highest-cost or highest-volume item in your stack.
Then run that prompt against eight models. The recommended slate:
| Model | Vendor | Tier | Why include it |
|---|---|---|---|
| Claude Sonnet 4.5 | Anthropic | Frontier | Course baseline; instruction-discipline benchmark |
| GPT-4o-mini | OpenAI | Frontier (cheap) | Course baseline; cost-efficient default |
| GPT-4o | OpenAI | Frontier | What you escalate to from mini |
| Gemini 2.5 Flash | Google | Frontier (cheap) | Course baseline; latency benchmark |
| Gemini 2.5 Pro | Google | Frontier | What you escalate to from Flash |
| Grok 3 (or 4) | xAI | Frontier | Different training data, different dialect |
| Llama 4 70B | Meta (open) | Open-weight | Self-hostable, high quality |
| Qwen 3 32B | Alibaba (open) | Open-weight | Permissive licence, code-strong |
Slate verified 2026-04-27. Model availability changes — confirm each one is still served by its vendor before you start the run, and substitute the closest available model for any that has been deprecated.
Eight is the right number. It is enough to cover the dialect spread you saw in the course. It is small enough to fit in one comparison spreadsheet.
The mechanics
Send the exact same prompt to all eight. Capture the raw output. Save the latency. Save the input/output token counts. Save the dollar cost per request (computed from each vendor's published pricing — pricing changes frequently, always check the vendor's pricing page on the day of your run).
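The capture step can be sketched in a few lines. This is a minimal sketch, not a vendor integration: `call_model` is a hypothetical per-model adapter you write against each vendor's SDK, returning the raw text plus the token counts the vendor reports.

```python
import json
import time

# Hypothetical adapter: implement call_model() once per vendor SDK.
# It must return (raw_text, input_tokens, output_tokens).
def capture(model_name, call_model, prompt):
    start = time.perf_counter()
    raw, tokens_in, tokens_out = call_model(prompt)
    latency_s = time.perf_counter() - start
    return {
        "model": model_name,
        "output": raw,              # saved verbatim -- never edited
        "latency_s": round(latency_s, 3),
        "input_tokens": tokens_in,
        "output_tokens": tokens_out,
    }

# One JSON Lines row per model keeps the run easy to diff and reload.
def save_run(rows, path="capstone_run.jsonl"):
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

Running `capture` once per model in the slate gives you eight rows that drop straight into the comparison spreadsheet.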
For reference only — illustrative price ranges as of the course's last verified date (2026-04-27). Do not use these numbers in your capstone report; pull current prices from each vendor's pricing page.
| Model | Input price /1M tokens | Output price /1M tokens |
|---|---|---|
| Claude Sonnet 4.5 | ~$3 | ~$15 |
| GPT-4o-mini | ~$0.15 | ~$0.60 |
| GPT-4o | ~$2.50 | ~$10 |
| Gemini 2.5 Flash | ~$0.075 | ~$0.30 |
| Gemini 2.5 Pro | ~$1.25 | ~$5 |
| Grok 3 | (check vendor) | (check vendor) |
| Llama 4 70B (self-hosted on g5.12xlarge) | ~$0.003 (compute amortised) | ~$0.003 |
| Qwen 3 32B (self-hosted) | ~$0.002 (compute amortised) | ~$0.002 |
⚠ Prices change frequently. The values above are for illustration only and may be out of date. Always verify current pricing directly with the provider before making cost decisions: Anthropic · OpenAI · Google Gemini · Google Vertex AI · AWS Bedrock · Azure OpenAI · Mistral · Cohere · Together AI · DeepSeek · Groq · Fireworks AI · Perplexity · xAI · Cursor · GitHub Copilot · Windsurf.
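The per-request dollar cost is simple arithmetic over the per-million-token prices. A one-function sketch, using the illustrative GPT-4o-mini prices from the table above purely as example inputs (substitute the current prices from the vendor's pricing page on the day of your run):

```python
def request_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens / 1_000_000) * price_in_per_m \
         + (output_tokens / 1_000_000) * price_out_per_m

# Example: 2,000 tokens in, 500 tokens out at the illustrative
# GPT-4o-mini prices ($0.15 in / $0.60 out per 1M tokens):
# (2000/1e6)*0.15 + (500/1e6)*0.60 = 0.0003 + 0.0003 = $0.0006
cost = request_cost(2_000, 500, 0.15, 0.60)
```

Multiplying that per-request figure by your daily request volume is what turns the comparison into the cost-savings estimate the report needs.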
Do not edit the model outputs. The whole point is that each output has its own dialect, and the dialect is the answer. If a model produces 600 words when you wanted 100, that is data. If a model wraps JSON in a fence, that is data. If a model truncates, that is data.
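The length, fencing, and truncation observations above can be flagged mechanically rather than by eye. A minimal sketch; `finish_reason` stands in for whatever stop-reason field the vendor's response exposes, and the flags record dialect, not pass/fail:

```python
import re

def dialect_flags(output, want_words=100, finish_reason="stop"):
    """Mechanical checks on one raw output; flags are data, not failures."""
    words = len(output.split())
    return {
        "over_length": words > want_words,             # e.g. 600 words for 100
        "fenced_json": bool(re.search(r"```(json)?", output)),
        "truncated": finish_reason != "stop",          # vendor-reported stop reason
    }
```

Attach these flags to each captured row and the "which model follows the rules" column of the spreadsheet fills itself in.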
What this is for
The deliverable is a one-page report your CTO can read in five minutes. It has a recommendation, a cost-savings estimate, and a risk note. It does NOT have a model leaderboard or a generic "Claude is best" verdict. It has a routing decision for one specific prompt, with the evidence underneath.
If the answer turns out to be "GPT-4o-mini handles this fine, switch from Claude and save 95% of the API spend on this task" — ship that recommendation. If the answer is "Claude is the only one that follows the rules; the cost is justified" — ship that. If the answer is "Llama 4 70B self-hosted handles 90% of cases, escalate the other 10% to Claude" — ship that. The course taught you to gather the evidence. The capstone is gathering it.
The next three lessons cover the rubric, the report structure, and the recommendation framing.
Next: a comparison rubric — how to score eight outputs without bias.