Building a memory system for examiner corrections
One of the most common questions about AI-assisted marking is whether the system “learns.” The answer depends on what learning means in this context.
AEMS does not fine-tune or retrain the underlying language model. The base model (Claude, GPT-4o, Gemini, or a local model) remains unchanged. What AEMS does is record every correction an examiner makes and use those corrections to inform future grading prompts. This is not model training. It is structured memory that operates at the prompt level.
The distinction matters because model training requires large datasets and significant compute, and raises questions about whose data was used. Prompt-level memory requires only the examiner’s own corrections, runs at zero additional cost, and stays entirely within the examiner’s control.
The problem memory solves
Consider a rubric check that says: “The student correctly applies the equilibrium condition.” An AI model will interpret this check according to its general understanding of equilibrium. But a specific examiner in a specific course may have particular expectations: that the equilibrium equation should include reaction forces, that the sign convention should be stated explicitly, or that a free body diagram is expected even if the problem does not explicitly require one.
These expectations are implicit knowledge that the examiner carries but the rubric does not capture. When the examiner corrects an AI-proposed mark, the correction encodes this implicit knowledge: “I expected a free body diagram here, the student did not provide one, so this check should be marked as incorrect even though the final answer is numerically correct.”
Without a memory system, this correction helps only the current submission. The next submission with the same issue will receive the same incorrect proposal, and the examiner will make the same correction again. Over 300 submissions, this repetitive correction wastes the time that AI-assisted marking was supposed to save.
The four-tier structure
AEMS organises memory in four tiers, from most general to most specific:
Course memory. Applies to all exams in a given course. Records patterns that are consistent across the course, such as notation conventions, expected level of rigour, and recurring student misconceptions. Example: “Students in this course frequently confuse force and pressure. When checking dimensional analysis, verify that units are consistent.”
Exam memory. Applies to a specific exam within a course. Records patterns specific to particular problems or problem types. Example: “In the 2026 midterm, question 3 uses a non-standard coordinate system. Accept both the standard and rotated conventions as correct.”
Question memory. Applies to a specific question within an exam. Records fine-grained marking decisions for individual rubric checks. Example: “For check 3b (boundary conditions), accept both clamped and simply-supported conditions because the problem statement is ambiguous.”
User memory. Applies to a specific examiner’s preferences across all their courses. Records individual calibration tendencies. Example: “This examiner consistently awards partial credit for correct method with arithmetic errors. Weight partial credit proposals accordingly.”
When grading a submission, AEMS constructs the prompt by layering applicable memories from all four tiers. The most specific tier takes precedence when memories conflict.
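The layering described above can be sketched in a few lines. This is an illustrative model, not AEMS’s actual API: the `Memory` structure, the scope-matching rule, and the tier ordering (user preferences treated as most general, since they span all of an examiner’s courses) are all assumptions made for the sketch.

```python
from dataclasses import dataclass

# Assumed ordering from most general to most specific; in the prompt,
# later (more specific) guidance appears last and takes precedence.
TIER_ORDER = ["user", "course", "exam", "question"]

@dataclass
class Memory:
    tier: str      # "user" | "course" | "exam" | "question"
    scope: str     # e.g. an examiner id, course code, exam id, or question id
    guidance: str  # text injected into the grading prompt

def applicable(mem: Memory, user: str, course: str, exam: str, question: str) -> bool:
    """A memory applies when its scope matches the current grading context."""
    target = {"user": user, "course": course, "exam": exam, "question": question}
    return mem.scope == target[mem.tier]

def layer_guidance(memories, user, course, exam, question) -> str:
    """Collect applicable memories and order them general-to-specific,
    so the most specific tier wins when guidance conflicts."""
    hits = [m for m in memories if applicable(m, user, course, exam, question)]
    hits.sort(key=lambda m: TIER_ORDER.index(m.tier))
    return "\n".join(f"[{m.tier}] {m.guidance}" for m in hits)
```

The sort-then-concatenate step is the whole mechanism: conflict resolution is delegated to the language model’s tendency to weight later, more specific instructions over earlier general ones.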
How corrections become memories
The memory creation process is deliberate, not automatic. When an examiner overrides an AI-proposed mark, the system records the correction with context: the original proposal, the examiner’s decision, the rubric check involved, and the content of the submission at that point.
Periodically (or on demand), the accumulated corrections are analysed to identify patterns. If the same type of correction appears across multiple submissions, the system generates a candidate memory entry and presents it to the examiner for review. The examiner can accept, modify, or discard the suggestion.
This human-in-the-loop approach to memory creation is important. Automatic memory generation risks encoding mistakes (an examiner might make an incorrect correction on one submission) or overfitting to edge cases. By requiring examiner approval, the system ensures that only validated patterns enter the memory.
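The grouping step can be sketched as follows. The `Correction` record and the `min_count` threshold are illustrative assumptions; the point is that the function only *proposes* candidates, and nothing enters memory without examiner approval.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Correction:
    check_id: str  # rubric check the examiner overrode
    proposed: str  # AI-proposed mark
    decided: str   # examiner's final mark
    note: str      # free-text reason, if the examiner gave one

def candidate_memories(corrections, min_count=3):
    """Group corrections by (check, proposal, decision). Any pattern seen
    at least min_count times becomes a candidate memory entry, to be
    accepted, modified, or discarded by the examiner."""
    groups = defaultdict(list)
    for c in corrections:
        groups[(c.check_id, c.proposed, c.decided)].append(c)
    return [
        {"check": check, "from": proposed, "to": decided,
         "evidence": len(cs), "notes": [c.note for c in cs if c.note]}
        for (check, proposed, decided), cs in groups.items()
        if len(cs) >= min_count
    ]
```

A one-off correction never reaches the threshold, which is what guards against encoding an examiner’s occasional mistake as a rule.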
The memory format
Memories are stored as YAML files in the configuration directory. A typical memory entry looks like this:
- context: "Equilibrium check for beam problems"
  guidance: >
    When the problem involves a simply supported beam, accept
    both the three-equation approach (sum of forces in x, y,
    and sum of moments) and the direct moment equilibrium
    approach. Both are valid methods taught in the course.
  source: "Examiner correction on 2026-01-15, confirmed across 12 submissions"
  tier: exam
The YAML format was chosen for several reasons. It is human-readable, so examiners can inspect and edit memories directly. It is version-controllable, so memories can be tracked in Git alongside the rubric. It is portable, so memories can be shared between examiners (in Department and Institutional deployments) by copying files.
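Because the entries are plain YAML lists, loading them is trivial. A minimal loader sketch, assuming PyYAML is installed and each file under the configuration directory holds a list of entries like the one above (the flat directory layout is an assumption, not AEMS’s actual structure):

```python
from pathlib import Path
import yaml  # PyYAML; assumed available

def load_memories(config_dir: str):
    """Read every *.yaml file in the configuration directory and return
    a flat list of memory entries (dicts with context, guidance,
    source, and tier keys)."""
    entries = []
    for path in sorted(Path(config_dir).glob("*.yaml")):
        data = yaml.safe_load(path.read_text()) or []
        entries.extend(data)
    return entries
```

Using `yaml.safe_load` rather than `yaml.load` matters here: memory files may be shared between examiners, and safe loading refuses to construct arbitrary Python objects from untrusted input.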
Measured impact
Over two exam cycles where memory was active, we observed the following:
First exam cycle (no memory). The examiner corrected approximately 18 percent of AI-proposed marks across 240 submissions. Most corrections were systematic: the same issue appeared repeatedly across different submissions.
Second exam cycle (with memory from first cycle). The correction rate dropped to approximately 7 percent. The remaining corrections were predominantly edge cases and genuinely ambiguous submissions rather than systematic rubric interpretation errors.
The time saving was proportional. If each correction takes 30 seconds (review the AI proposal, read the student’s work, decide on the correct mark, enter the correction), then across 240 submissions with 5 checks each (1,200 checks in total), reducing the correction rate from 18 percent to 7 percent eliminates roughly 130 corrections and saves just over an hour of manual correction time.
These numbers are from a single course (solid mechanics, undergraduate level) and should not be generalised without caution. The improvement will vary depending on the rubric quality, the consistency of the student population, and how well the initial rubric captures the examiner’s expectations.
Memory and privacy
Because memories are stored locally (in Personal deployments) or on institutional infrastructure (in Department and Institutional deployments), they do not create additional privacy concerns. Memories reference rubric checks and marking patterns, not individual student submissions. A memory entry that says “accept both coordinate conventions” does not contain any student data.
However, memories do encode the examiner’s marking standards, which could be considered pedagogically sensitive. An examiner’s calibration tendencies, the adjustments they make, and the patterns they establish are professional information. The memory system treats this information with the same care as student data: it stays where the examiner puts it and is not shared without explicit action.
Future directions
The current memory system is effective but limited. It operates at the text level, injecting guidance into prompts. A more sophisticated approach would integrate memory into the structured extraction stage, informing the vision model about expected notation and layout before extraction begins. We are exploring this but have not yet found an approach that improves extraction quality without increasing prompt complexity to the point where it degrades other aspects of performance.
Another direction is collaborative memory for departmental use. When multiple examiners grade the same course (a common pattern for large courses with teaching assistants), their individual corrections could be aggregated into a shared course memory. This requires a consensus mechanism to handle conflicting corrections, which is a non-trivial problem both technically and pedagogically.