Capstone — three production prompts
The 10-point rubric
Every prompt you ship in the capstone gets graded against the same ten criteria. One point per criterion, no half points. Anything below 8/10 means the prompt is not ready — redesign it before claiming it shipped.
The rubric
| # | Criterion | Points |
|---|---|---|
| 1 | Role / persona stated | 1 |
| 2 | Specific task with measurable outcome | 1 |
| 3 | Input format defined | 1 |
| 4 | Output format defined (length, structure, tone) | 1 |
| 5 | At least 2 hard constraints (must / must-not) | 1 |
| 6 | At least 1 worked example (input → expected output) | 1 |
| 7 | Refusal or fallback when out-of-scope | 1 |
| 8 | "I don't know" trigger when info is missing | 1 |
| 9 | Tested on at least one real input — output meets spec | 1 |
| 10 | Total length under 400 words (system prompts that don't fit in working memory rot) | 1 |
Passing is 8/10 (80%); anything below means redesign before shipping. The 80% pass bar is consistent across all capstones in the Prompt Engineering Path: Code uses 40/50, Cross-Model uses a five-row rubric; same bar, different scales.
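If you want to track scores mechanically, the rubric is small enough to encode directly. Below is a minimal sketch, assuming each criterion is graded as a plain pass/fail boolean; the criterion keys and the score_prompt helper are illustrative names, not part of the capstone materials.

```python
# Illustrative sketch: the ten criteria as binary checks, pass bar at 8/10.
CRITERIA = [
    "role_stated",
    "measurable_task",
    "input_format_defined",
    "output_format_defined",
    "two_hard_constraints",
    "one_worked_example",
    "refusal_when_out_of_scope",
    "i_dont_know_trigger",
    "tested_on_real_input",
    "under_400_words",
]
PASS_BAR = 8  # 80% of 10 points, no half points


def score_prompt(checks: dict[str, bool]) -> tuple[int, bool]:
    """Return (points, passed) for a prompt graded against all ten criteria."""
    missing = [name for name in CRITERIA if name not in checks]
    if missing:
        raise ValueError(f"ungraded criteria: {missing}")
    points = sum(checks[name] for name in CRITERIA)
    return points, points >= PASS_BAR
```

For example, a prompt that ticks everything except the worked example and the real-input test scores exactly 8 and scrapes through; miss one more criterion and it goes back for redesign.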
The 10-point rubric — grouped by what each criterion guards against
Structure (5 points)
- Anchors who and what (criteria 1-2: role and task)
- Defines the shape end-to-end (criteria 3-4: input and output format)
- Keeps it readable (criterion 10: under 400 words)
Behaviour (3 points)
- Stops tone drift (criterion 5: hard constraints)
- Stops off-brand wandering (criterion 7: refusal scope)
- Stops confident hallucination (criterion 8: the "I don't know" trigger)
Evidence (2 points)
- Locks the joint input-to-output shape (criterion 6: worked example)
- Catches real failures before ship and promotes the rubric from paper to reality (criterion 9: tested on a real input)
How to grade each item, honestly
The trap with self-grading is being generous with yourself. The criteria are written so that "almost" does not score. Here is how to grade strictly:
- Role stated — there is a single sentence in the prompt that names the role and the brand. "You are helpful" does not count.
- Specific task with measurable outcome — could a teammate write a one-line check that decides if the output passes? If not, the task is not measurable.
- Input format defined — the prompt names what the user message will look like (a paragraph, a JSON object, a code block, a transcript). If you would not know how to feed it, fail it.
- Output format defined — length AND structure AND tone are all specified. Two of three fails.
- 2 hard constraints — count them. They must be specific ("never use 'unfortunately'"), not vague ("be friendly").
- 1 worked example — a real input and a real expected output, not a description of "what good output looks like".
- Refusal scope — there is a line covering what the assistant will NOT do.
- "I don't know" trigger — there is a line covering what the assistant says when it lacks the info.
- Tested on real input — you actually fed it a real message and read the output, not "I think it would work".
- Under 400 words — open the prompt in a word counter. Be honest; a minimal counting sketch follows this list.
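Two of these checks lend themselves to automation. The sketch below is illustrative only; the file names and the banned-phrase list are assumptions, not part of the capstone materials. It counts words for criterion 10 and scans one real output for a must-not phrase, covering the mechanical half of criteria 5 and 9.

```python
def word_count(prompt: str) -> int:
    """Criterion 10: whitespace-delimited word count of the system prompt."""
    return len(prompt.split())


def banned_phrases_found(output: str, banned: list[str]) -> list[str]:
    """Criteria 5/9: return any must-not phrases that appear in a real output."""
    lowered = output.lower()
    return [phrase for phrase in banned if phrase.lower() in lowered]


# Hypothetical files: your system prompt and one reply produced from a real input.
system_prompt = open("support_prompt.txt", encoding="utf-8").read()
real_reply = open("sample_reply.txt", encoding="utf-8").read()

assert word_count(system_prompt) < 400, "criterion 10: prompt is too long"
assert not banned_phrases_found(real_reply, ["unfortunately"]), "criterion 5: banned phrase used"
```

Everything else on the list is a judgment call: automate the counting, then grade the rest by reading the prompt line by line.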
Why this rubric exists
The rubric is reverse-engineered from the failure modes of prompts that crashed in production. Each criterion maps to a class of bug:
- Missing role → tone drift across replies.
- Vague task → outputs that "look reasonable" but do not solve the user's problem.
- No format → markdown headings everywhere, or none at all.
- No examples → tone matches the rules but the shape is wrong.
- No refusal → the assistant cheerfully answers off-brand questions.
- No "I don't know" trigger → confident hallucination.
- Untested → all of the above shipped at once.
- Over 400 words → contradictions inside the prompt that you cannot see at a glance.
A 10/10 prompt has closed out every one of these failure classes; 8/10 is the lowest score you can ship with. That is the bar.
Next: three example scenarios you can study or use as templates.