When the rubric is the wrong instrument: contextual grading for homeworks and lab reports

I am a professor of Solid Mechanics at KTH. Most of what I have written about AEMS so far has focused on exams: 200 to 300 submissions, a handful of defined problems per paper, each with intermediate steps the rubric can check. That is the case the rubric instrument is built for, and AEMS handles it well.

Homeworks and lab reports are a different shape of problem. The rubric, applied mechanically, gives results that are either too coarse to be useful or so granular that the rubric itself becomes longer than the student’s submission. I want to explain why, and to describe the contextual grading path that AEMS now offers as an alternative.

What a rubric is good at

A rubric encodes assessment as a list of independent checks. Each check has a clear question (is the free body diagram correct, did the student carry through the integration, is the final value within tolerance), a point value, and a verdict. The instrument works when three conditions hold.

(1) The problem decomposes naturally into discrete steps.
(2) Each step has a defensible correct answer, or a small set of acceptable answers.
(3) The student’s work for each step can be located on a specific portion of the page.
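
To make the "list of independent checks" concrete, here is a minimal sketch in Python. The names RubricCheck and rubric_total are hypothetical and chosen for illustration; they are not taken from AEMS itself. The point it shows is that each check is scored without reference to any other check.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RubricCheck:
    """One independent check (hypothetical structure, not the AEMS schema)."""
    question: str   # what the grader verifies
    points: float   # points awarded if the check passes
    passed: bool    # verdict for this submission


def rubric_total(checks: List[RubricCheck]) -> float:
    """Sum the points of all passed checks; no check looks at another."""
    return sum(c.points for c in checks if c.passed)


# Illustrative verdicts for one exam problem.
checks = [
    RubricCheck("Is the free body diagram correct?", 2.0, True),
    RubricCheck("Did the student carry through the integration?", 3.0, True),
    RubricCheck("Is the final value within tolerance?", 1.0, False),
]

print(rubric_total(checks))  # 5.0
```

Because the total is a plain sum, the order of the checks does not matter. That independence is exactly the property that long-form work lacks.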

Most engineering exam problems satisfy all three. Calculate a stress, find a moment, solve a differential equation. The work is local, the checks are independent, the marking is mechanical. AEMS exists in large part because this case is so well-defined that a vision-capable model, given the rubric, can do roughly 80 percent of the work as a first pass (I described those early experiments in an earlier post).

Where the rubric fails

A homework that asks the student to derive a constitutive model from first principles, or a lab report that asks the student to interpret experimental data and discuss the sources of error, does not satisfy any of the three conditions.

(1) The work does not decompose. The student builds an argument over several pages. The correctness of step seven depends on choices made in steps one and three. A rubric line “step seven receives 2 points if the boundary condition is correctly applied” is meaningless when the student has set up the wrong free body diagram on the first page but has otherwise reasoned consistently from it.

(2) There is no single defensible answer. A discussion of measurement uncertainty in a fatigue test can be correct in several different framings. The rubric author cannot enumerate them in advance without writing a textbook chapter. The usual fallback, a vague rubric line such as “thorough discussion of error sources (3 points)”, pushes the judgement back onto the grader and offers no real guidance.

(3) The work is not local. A figure on page four is referenced on page eight. A symbol introduced in the introduction reappears in the conclusion. The student’s reasoning is non-linear, and the rubric, which assumes a linear flow of small independent checks, cannot represent it.

In practice, a rubric for long-form work usually ends up in one of two places. Either it is coarse (introduction 2 points, methods 5 points, results 10 points, discussion 8 points) and tells the grader nothing about what to look for, or it is very fine (forty checks per report) and no human grader can apply it consistently across thirty submissions. I have seen both failure modes first-hand in courses I have taught.

The contextual grading path

The alternative that AEMS supports is what we call contextual grading. The instructor still provides a structured definition of the assessment, but the structure is different. Instead of a list of point-bearing checks, the input is closer to a marking guide: a description of the learning outcomes, a reference solution or example argument, and a small set of qualitative criteria that the grader would use when reading the submission.
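
As a rough sketch, a marking guide of this kind could be represented as follows. The class name MarkingGuide, its fields, the example content, and the max_mark value are all assumptions made for illustration; the actual AEMS input format may look quite different.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MarkingGuide:
    """Hypothetical marking guide; field names are illustrative only."""
    learning_outcomes: List[str]
    reference_solution: str                       # or an example argument
    criteria: List[str] = field(default_factory=list)
    max_mark: int = 10                            # assumed scale

    def validate(self) -> None:
        """Reject a guide the grader could not actually read against."""
        if not self.learning_outcomes:
            raise ValueError("at least one learning outcome is required")
        if not self.reference_solution.strip():
            raise ValueError("a reference solution or example argument is required")


guide = MarkingGuide(
    learning_outcomes=[
        "Interpret the measured data from the fatigue test",
        "Discuss the sources of measurement error",
    ],
    reference_solution="Example argument written by the instructor.",
    criteria=[
        "Error sources are identified and their effect is estimated",
        "Conclusions follow from the data shown in the figures",
    ],
)
guide.validate()
```

Note what is absent compared with a rubric: there are no per-line point values. The criteria guide the reading and do not add up to the mark.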

The AI model then reads the submission as a coherent document. It locates where the student addresses each learning outcome, evaluates the quality of the reasoning against the marking guide, and produces a mark with supporting commentary anchored to specific pages and paragraphs of the submission. The human grader reviews the proposed mark, the model’s reasoning, and the page anchors before anything reaches the student.
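
The output side can be sketched in the same spirit. PageAnchor, ProposedMark, and release_to_student are hypothetical names, as is the approved_by field; the sketch only illustrates the two properties described above, namely that commentary is anchored to a page and paragraph, and that nothing is released before a human has reviewed it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PageAnchor:
    """Where in the submission a comment applies (hypothetical structure)."""
    page: int
    paragraph: int
    comment: str


@dataclass
class ProposedMark:
    mark: float
    reasoning: str
    anchors: List[PageAnchor] = field(default_factory=list)
    approved_by: str = ""  # stays empty until a human grader signs off


def release_to_student(proposal: ProposedMark) -> dict:
    """Return the student-facing result, refusing unreviewed proposals."""
    if not proposal.approved_by:
        raise PermissionError("mark has not been reviewed by a human grader")
    return {
        "mark": proposal.mark,
        "comments": [(a.page, a.paragraph, a.comment) for a in proposal.anchors],
    }


proposal = ProposedMark(
    mark=7.5,
    reasoning="Error discussion is sound but omits one uncertainty source.",
    anchors=[PageAnchor(page=4, paragraph=2, comment="Figure is referenced but not interpreted.")],
)
proposal.approved_by = "grader-1"  # the human review step
result = release_to_student(proposal)
```

Making the review a hard precondition of release, rather than an optional flag, keeps the model's proposal from being mistaken for a final mark.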

This is closer to how a human examiner actually marks a lab report. The examiner does not work through forty rubric lines mechanically. The examiner reads the report, forms a judgement about whether the student has demonstrated the intended skill, and assigns a mark within a defensible range. Contextual grading mirrors that process.

What the contextual path gives up

I want to be honest about the tradeoff.

A well-designed rubric, applied by a careful grader (human or AI), is reproducible. Run it twice on the same submission and you get the same mark. Contextual grading is less deterministic. The same model on the same submission can produce marks that differ by one or two points across runs, because reading comprehension and qualitative judgement are not fully reducible to a fixed procedure.
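
One simple way to put a number on this is to grade the same submission several times and look at the spread of the marks. The sketch below does that. The marks in it are invented purely to illustrate the calculation and are not measurements from AEMS, and run_spread is a hypothetical helper.

```python
from typing import Sequence


def run_spread(marks: Sequence[float]) -> float:
    """Difference between the highest and lowest mark across repeated runs."""
    if not marks:
        raise ValueError("need at least one run")
    return max(marks) - min(marks)


# Invented marks for one submission graded four times on each path.
rubric_runs = [18, 18, 18, 18]       # reproducible: every run agrees
contextual_runs = [17, 18, 19, 18]   # differs by a point or two across runs

print(run_spread(rubric_runs))      # 0
print(run_spread(contextual_runs))  # 2
```

A spread of zero is what the rubric path promises. For the contextual path, the question is whether the observed spread is small enough for the kind of assessment in question.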

For exam grading, that level of variance is unacceptable. Students have legal rights to consistent marking, and contesting a grade requires a clear paper trail of why each point was awarded. The rubric instrument is the right tool there, and AEMS will continue to support and emphasise it.

For homeworks and lab reports, the situation is different. Most institutions do not require the same level of reproducibility for formative assessment, and the variance introduced by contextual grading is comparable to the variance between human graders on the same long-form work. The pedagogical value, in my view, comes from the quality of the feedback rather than from the exact numerical mark.

How to choose between the two

A simple decision rule covers most cases.

If the assessment can be broken into discrete, locally verifiable steps with a small number of acceptable answers per step, use the rubric path. This is the default for exam problems and short technical questions.

If the assessment asks the student to construct an extended argument, interpret data, or write a coherent report, use the contextual path. This is the default for lab reports, take-home homeworks beyond a few pages, and project deliverables.

When in doubt, write the rubric first and ask yourself whether you could realistically apply it to thirty submissions in a single afternoon without drifting. If the answer is no, the assessment is probably too long-form for a rubric, and contextual grading is the better instrument.
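
The decision rule can be written down as a small function. This is only a sketch of the rule as stated in this post; choose_path and its parameter names are hypothetical, and the yes/no answers still have to come from the instructor's own judgement.

```python
def choose_path(
    discrete_steps: bool,
    locally_verifiable: bool,
    few_acceptable_answers: bool,
    appliable_without_drift: bool = True,
) -> str:
    """Return "rubric" or "contextual" following the rule described above.

    appliable_without_drift answers the tie-breaker question: could the
    written rubric realistically be applied to thirty submissions in one
    afternoon without the grader drifting?
    """
    fits_rubric = discrete_steps and locally_verifiable and few_acceptable_answers
    if fits_rubric and appliable_without_drift:
        return "rubric"
    return "contextual"


# An exam problem: discrete steps, local work, few acceptable answers.
print(choose_path(True, True, True))     # rubric

# A lab report: extended argument, several valid framings, non-local work.
print(choose_path(False, False, False))  # contextual

# Looks decomposable, but the rubric is too long to apply consistently.
print(choose_path(True, True, True, appliable_without_drift=False))  # contextual
```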

What is still being calibrated

Contextual grading in AEMS is newer than the rubric path. Page anchoring (telling the grader where in the submission the model found the relevant work) is the part I have spent the most engineering time on, because feedback without precise location is feedback that does not survive contact with students. Qualitative consistency across runs is the part I am still iterating on. Both will improve.

The point of this post is not to claim that contextual grading is solved. The point is that rubrics are not the only instrument, that long-form work requires a different one, and that AEMS now supports both. Choose the instrument that fits the assessment.