AEMS

Designing Rubrics That Work Well With AI Marking

AEMS Team · 5 min read
rubric-design · ai-marking · workflow · higher-education

The most common reason AI marking underperforms is not the AI — it is the rubric. Vague, ambiguous, or incomplete marking schemes confuse human markers too. AI just makes the problem more visible.

After working with marking schemes across physics, mathematics, computer science, and engineering courses, we have noticed consistent patterns in what makes a rubric work well — and what causes it to fail — when used with AI-assisted marking.

What AI can and cannot assess

Before designing a rubric, it helps to understand what AI vision models are good at:

Strong performance:

  • Checking for specific formulas, equations, or values
  • Verifying that a required method or approach was used
  • Identifying presence or absence of key terms, steps, or components
  • Detecting common errors in structured derivations
  • Assessing whether a diagram includes required elements (labels, axes, units)

Weaker performance:

  • Evaluating the quality of an argument or the sophistication of reasoning
  • Judging creativity or originality
  • Assessing whether an unconventional approach is valid
  • Handling ambiguous or heavily stylised handwriting
  • Interpreting highly contextual or discipline-specific shorthand

This suggests a practical division of labour: structure your rubric so that AI handles the mechanical, verifiable checks, and the human reviewer focuses attention on the judgment calls.

Write checks, not questions

A rubric check should describe a specific, observable criterion — not a question the AI has to answer.

Weak (question form):

“Did the student apply Newton’s second law correctly?”

Strong (check form):

“The answer includes F = ma (or equivalent form) with correct identification of F, m, and a in the context of the problem. Net force must be specified, not just one component.”

The difference matters because the check form tells the AI what to look for. The question form leaves it to the AI to infer what “correctly” means, which varies by model and prompt.
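One way to make the check form concrete is to store each check as structured data rather than free text. A minimal sketch in Python — the field names here are illustrative, not AEMS's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class RubricCheck:
    """One specific, observable criterion for the AI to mark against."""
    description: str                # what to look for, stated as an observation
    marks: int                      # marks awarded when the criterion is met
    required_elements: list[str] = field(default_factory=list)


# The question form rewritten as a check: the AI is told *what* to look
# for, instead of being left to infer what "correctly" means.
newtons_second_law = RubricCheck(
    description=(
        "The answer includes F = ma (or equivalent form) with correct "
        "identification of F, m, and a in the context of the problem. "
        "Net force must be specified, not just one component."
    ),
    marks=2,
    required_elements=["F = ma stated", "net force identified"],
)
```

Writing checks as data also makes them easy to review, version, and reuse across cohorts.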

Specify common errors explicitly

If you know that students frequently make a particular mistake, add it as a named error in the rubric check. This does two things: it tells the AI what to watch for, and it generates more useful feedback for the student.

Example:

Check: Student correctly calculates the net force on the block.

Common errors:

  • Using mass instead of weight for the gravitational component.
  • Ignoring friction when the problem specifies a rough surface.

AI models can identify named errors more reliably than they can spontaneously recognise novel mistakes. Explicitly naming errors also makes the feedback to students more specific and actionable.
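Carrying the idea into the structured form above, named errors can live alongside the check, so a flagged error becomes the feedback text directly. A sketch (the `feedback_for` helper is hypothetical, shown only to illustrate the mapping from named error to student feedback):

```python
from dataclasses import dataclass, field


@dataclass
class RubricCheck:
    description: str
    marks: int
    common_errors: list[str] = field(default_factory=list)  # named mistakes to watch for


net_force_check = RubricCheck(
    description="Student correctly calculates the net force on the block.",
    marks=2,
    common_errors=[
        "Using mass instead of weight for the gravitational component.",
        "Ignoring friction when the problem specifies a rough surface.",
    ],
)


def feedback_for(check: RubricCheck, flagged: list[int]) -> str:
    """Turn the indices of errors the AI flagged into student-facing feedback."""
    if not flagged:
        return "Criterion met."
    # A named error doubles as feedback: the student sees the specific
    # mistake rather than a bare "incorrect".
    return " ".join(check.common_errors[i] for i in flagged)
```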

Use partial credit ranges, not binary marks

AI marking is more reliable when it can award partial credit on a defined scale rather than making a binary correct/incorrect decision.

Instead of:

“2 marks if correct, 0 otherwise”

Consider:

“2 marks: correct method and correct answer. 1 mark: correct method but arithmetic error. 0 marks: incorrect method.”

With defined tiers, the AI can confidently place a response in a category. Binary marking forces a harder decision, and the AI is more likely to be wrong at the boundary.
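The tiered scheme above reduces the AI's job to two observable findings — method correct? answer correct? — which then map deterministically onto marks. A sketch of that mapping:

```python
# Tiers from the example, ordered best to worst. The AI only has to
# report what it observed; the tier logic assigns the mark.
TIERS = [
    (2, "correct method and correct answer"),
    (1, "correct method but arithmetic error"),
    (0, "incorrect method"),
]


def marks_for(correct_method: bool, correct_answer: bool) -> int:
    """Map the AI's two boolean findings onto the defined tiers."""
    if correct_method and correct_answer:
        return 2
    if correct_method:
        return 1
    return 0
```

The boundary decision ("is this method essentially correct?") is still a judgment, but it is one judgment instead of a compound correct/incorrect call.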

Test with sample submissions before the full batch

Before running AI marking on all submissions, test your rubric on five to ten papers — ideally ones where you already know the correct marks. Look at:

  • Where the AI agrees with your marks
  • Where it disagrees and why
  • Whether the explanations reference the right parts of the rubric

Rubric iteration before the full marking run is much faster than correcting marks after. AEMS’s memory system records your corrections, but it is more efficient to refine the rubric upfront than to rely on post-hoc adjustment for systematic errors.
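The pilot comparison can be as simple as a few lines of Python. The marks below are illustrative placeholders, not real data — substitute your own hand-marked pilot papers:

```python
# Hand-marked ("known") marks vs. AI marks for the same pilot papers.
# Illustrative values only.
human = {"paper1": 8, "paper2": 5, "paper3": 10, "paper4": 7, "paper5": 6}
ai = {"paper1": 8, "paper2": 4, "paper3": 10, "paper4": 7, "paper5": 8}

# Papers where the AI disagrees with the hand mark, with both values.
disagreements = {p: (human[p], ai[p]) for p in human if human[p] != ai[p]}
agreement_rate = 1 - len(disagreements) / len(human)

print(f"Agreement: {agreement_rate:.0%}")
for paper, (h, a) in sorted(disagreements.items()):
    # These are the papers worth reading closely: check whether the AI's
    # explanation references the right rubric check.
    print(f"{paper}: human={h}, AI={a} (diff {a - h:+d})")
```

Systematic disagreements (the AI always misses the same check, or always over-awards the same tier) usually point to a rubric wording problem, not a model problem.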

Keep the number of checks per question manageable

There is no hard limit on the number of checks in a rubric, but very long check lists (more than eight checks per question) tend to produce noisy results. The AI may apply some checks inconsistently, or the checks may overlap in ways that create conflicting signals.

A practical target: no more than five checks for a question worth ten marks. Group related criteria into a single check rather than listing every sub-component separately. Reserve separate checks for genuinely independent criteria.
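If rubrics are stored as data, the five-check target is easy to lint for before a marking run. A rough sketch, assuming each question maps to a list of check descriptions:

```python
# Practical target from above: no more than five checks per question.
MAX_CHECKS = 5


def overloaded_questions(rubric: dict[str, list[str]]) -> list[str]:
    """Return question ids whose check lists are long enough to get noisy."""
    return [q for q, checks in rubric.items() if len(checks) > MAX_CHECKS]


rubric = {
    "Q1": ["F = ma stated", "net force identified", "friction included"],
    # Seven checks: a candidate for grouping related criteria together.
    "Q2": ["a", "b", "c", "d", "e", "f", "g"],
}
```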

The rubric is a teaching document too

A well-designed AI-compatible rubric is also better for students. When feedback is linked to explicit, named criteria, students understand what they got wrong and why. Feedback like “Newton’s second law: F = ma not stated — 0/2” is more useful than “incorrect method — 0/2.”

The discipline required to write a precise, check-based rubric pays off in three ways: better AI marking, better human marking consistency, and more useful student feedback. The first year of running a course with AEMS is often the first year the rubric is written clearly enough to be genuinely useful as a teaching document.