Capstone — three production prompts
The 10-point rubric
Every prompt you ship in the capstone gets graded against the same ten criteria. One point per criterion, no half points. Anything below 8/10 means the prompt is not ready — redesign it before claiming it shipped.
The rubric
| # | Criterion | Points |
|---|---|---|
| 1 | Role / persona stated | 1 |
| 2 | Specific task with measurable outcome | 1 |
| 3 | Input format defined | 1 |
| 4 | Output format defined (length, structure, tone) | 1 |
| 5 | At least 2 hard constraints (must / must-not) | 1 |
| 6 | At least 1 worked example (input → expected output) | 1 |
| 7 | Refusal or fallback when out-of-scope | 1 |
| 8 | "I don't know" trigger when info is missing | 1 |
| 9 | Tested on at least one real input — output meets spec | 1 |
| 10 | Total length under 400 words (system prompts that don't fit in working memory rot) | 1 |
Passing is 8/10 (80%); anything below means redesign before shipping. The 80% pass bar is consistent across all capstones in the Prompt Engineering Path: Code uses 40/50, Cross-Model uses a five-row rubric; same bar, different scales.
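If you want to track scores mechanically, the rubric is small enough to encode directly. Below is a minimal sketch, assuming each criterion is graded as a plain pass/fail boolean; the criterion keys and the score_prompt helper are illustrative names, not part of the capstone materials.

```python
# Illustrative sketch: the ten criteria as binary checks, pass bar at 8/10.
CRITERIA = [
    "role_stated",
    "measurable_task",
    "input_format_defined",
    "output_format_defined",
    "two_hard_constraints",
    "one_worked_example",
    "refusal_when_out_of_scope",
    "i_dont_know_trigger",
    "tested_on_real_input",
    "under_400_words",
]
PASS_BAR = 8  # 80% of 10 points, no half points


def score_prompt(checks: dict[str, bool]) -> tuple[int, bool]:
    """Return (points, passed) for a prompt graded against all ten criteria."""
    missing = [name for name in CRITERIA if name not in checks]
    if missing:
        raise ValueError(f"ungraded criteria: {missing}")
    points = sum(checks[name] for name in CRITERIA)
    return points, points >= PASS_BAR
```

For example, a prompt that ticks everything except the worked example and the real-input test scores exactly 8 and scrapes through; miss one more criterion and it goes back for redesign.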
The 10-point rubric — grouped by what each criterion guards against
Structure (5 points)
- Anchors who and what (criteria 1-2: role and task)
- Defines the shape end-to-end (criteria 3-4: input and output format)
- Keeps it readable (criterion 10: under 400 words)
Behaviour (3 points)
- Stops tone drift (criterion 5: hard constraints)
- Stops off-brand wandering (criterion 7: refusal scope)
- Stops confident hallucination (criterion 8: the "I don't know" trigger)
Evidence (2 points)
- Locks the joint input-to-output shape (criterion 6: worked example)
- Catches real failures before ship and promotes the rubric from paper to reality (criterion 9: tested on a real input)
How to grade each item, honestly
The trap with self-grading is being generous with yourself. The criteria are written so that "almost" does not score. Here is how to grade strictly:
- Role stated — there is a single sentence in the prompt that names the role and the brand. "You are helpful" does not count.
- Specific task with measurable outcome — could a teammate write a one-line check that decides if the output passes? If not, the task is not measurable.
- Input format defined — the prompt names what the user message will look like (a paragraph, a JSON object, a code block, a transcript). If you would not know how to feed it, fail it.
- Output format defined — length AND structure AND tone are all specified. Two of three fails.
- 2 hard constraints — count them. They must be specific ("never use 'unfortunately'"), not vague ("be friendly").
- 1 worked example — a real input and a real expected output, not a description of "what good output looks like".
- Refusal scope — there is a line covering what the assistant will NOT do.
- "I don't know" trigger — there is a line covering what the assistant says when it lacks the info.
- Tested on real input — you actually fed it a real message and read the output, not "I think it would work".
- Under 400 words — open the prompt in a word counter. Be honest; a minimal counting sketch follows this list.
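Two of these checks lend themselves to automation. The sketch below is illustrative only; the file names and the banned-phrase list are assumptions, not part of the capstone materials. It counts words for criterion 10 and scans one real output for a must-not phrase, covering the mechanical half of criteria 5 and 9.

```python
def word_count(prompt: str) -> int:
    """Criterion 10: whitespace-delimited word count of the system prompt."""
    return len(prompt.split())


def banned_phrases_found(output: str, banned: list[str]) -> list[str]:
    """Criteria 5/9: return any must-not phrases that appear in a real output."""
    lowered = output.lower()
    return [phrase for phrase in banned if phrase.lower() in lowered]


# Hypothetical files: your system prompt and one reply produced from a real input.
system_prompt = open("support_prompt.txt", encoding="utf-8").read()
real_reply = open("sample_reply.txt", encoding="utf-8").read()

assert word_count(system_prompt) < 400, "criterion 10: prompt is too long"
assert not banned_phrases_found(real_reply, ["unfortunately"]), "criterion 5: banned phrase used"
```

Everything else on the list is a judgment call: automate the counting, then grade the rest by reading the prompt line by line.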
Why this rubric exists
The rubric is reverse-engineered from the failure modes of prompts that crashed in production. Each criterion maps to a class of bug:
- Missing role → tone drift across replies.
- Vague task → outputs that "look reasonable" but do not solve the user's problem.
- No format → markdown headings everywhere, or none at all.
- No examples → tone matches the rules but the shape is wrong.
- No refusal → the assistant cheerfully answers off-brand questions.
- No "I don't know" trigger → confident hallucination.
- Untested → all of the above shipped at once.
- Over 400 words → contradictions inside the prompt that you cannot see at a glance.
A 10/10 prompt has closed out every one of these failure classes; 8/10 is the lowest score you can ship with. That is the bar.
Next: three example scenarios you can study or use as templates.