AEMS

Reading handwritten exams with vision models: what works and what does not

· Artem Kulachenko · 6 min read
vision-models · handwriting · pdf-processing · technical

The central technical challenge in AI-assisted exam marking is not grading. It is reading. Before any rubric can be applied, the system must extract the student’s work from a scanned PDF page with sufficient accuracy to support meaningful assessment. For typed submissions, this is straightforward. For handwritten exams, which remain the norm in many STEM disciplines, it is considerably harder.

This post describes what we learned about using vision-capable language models (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro and their successors) to read handwritten exam submissions, and the engineering that was required to make the extraction reliable enough for production use.

The naive approach and its failure modes

The simplest possible pipeline is: take a PDF page, convert it to an image, send it to a vision model with the instruction “read the student’s work,” and use the output as input for grading.
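As a sketch, the naive pipeline amounts to little more than building one request per page. The request shape below follows OpenAI's chat-completions image input; `build_vision_request` and the hard-coded instruction are illustrative, and the PDF-to-PNG rendering step (e.g. via a library such as pdf2image) is not shown:

```python
import base64

def build_vision_request(image_bytes: bytes, model: str = "gpt-4o") -> dict:
    """Build an OpenAI-style chat request asking a vision model to read
    a single exam page. The page image is sent inline as base64."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Read the student's work on this page."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

The model's free-text reply is then passed straight to grading, with no schema, no confidence flags, and no preprocessing.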

This works surprisingly well for clean, well-lit scans of legible handwriting. For roughly 70 percent of typical exam submissions, the naive approach produces acceptable extraction. The remaining 30 percent is where the problems live.

Low-contrast scans. Many university scanning systems produce PDFs with inconsistent brightness and contrast. Pencil writing on grey paper, or blue ink on off-white paper, can fall below the contrast threshold where vision models extract reliably. Characters are dropped, subscripts are misread, and mathematical operators are confused.

Overlapping content. Students frequently write corrections above or beside their original work, cross out lines, draw arrows to indicate insertions, or squeeze additional steps into margins. Vision models handle linear text well. Non-linear layout with corrections and annotations is significantly more challenging.

Mathematical notation. Standard handwritten text is one domain. Mathematical notation is another. A handwritten integral sign, a partial derivative symbol, or a tensor index can look very different from one student to the next. The models have improved substantially over the past year, but subscript/superscript confusion and operator misidentification remain common failure modes.

Multi-page answers. When a student’s answer to a single question spans two or more pages, the extraction must be combined coherently. This is not simply concatenation. The model must understand that the content continues from the previous page and maintain context across the page boundary.

What we do differently

AEMS does not rely on a single extraction call. The pipeline includes several stages designed to improve reliability.

Preprocessing. Before any vision API call, the PDF page is converted to a high-resolution image with standardised contrast and brightness. This is not sophisticated image processing. It is basic normalisation that eliminates the most common scanning artefacts. The difference in extraction quality between a raw scan and a normalised image is measurable and consistent.
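The core of that normalisation is a percentile contrast stretch. A real pipeline would apply this per channel with an imaging library such as Pillow; the sketch below operates on a flat list of greyscale pixel values to show the idea:

```python
def stretch_contrast(pixels: list[int], lo_pct: float = 0.01,
                     hi_pct: float = 0.99) -> list[int]:
    """Percentile-clipped min-max contrast stretch: map the 1st..99th
    percentile pixel range onto the full 0..255 scale, so faint pencil
    on grey paper ends up darker against a whiter background."""
    ordered = sorted(pixels)
    lo = ordered[int(lo_pct * (len(ordered) - 1))]
    hi = ordered[int(hi_pct * (len(ordered) - 1))]
    if hi == lo:
        return pixels[:]  # flat image, nothing to stretch
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((p - lo) * scale))) for p in pixels]
```

Clipping at the percentiles rather than the raw min and max keeps a few specks of scanner noise from dominating the stretch.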

Structured extraction. Rather than asking the model to “read the student’s work” as free text, we provide a JSON schema that specifies the expected structure: which question is being answered, what components are expected (equation, diagram, numerical result, textual explanation), and how to flag uncertain content. This structured prompt consistently outperforms open-ended extraction, because the model has explicit guidance on what to look for and how to organise its output.
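A trimmed-down version of such a schema might look like the following. The field names and component kinds are illustrative, not AEMS's actual schema:

```python
# JSON Schema passed alongside the extraction prompt: one object per
# question, each answer decomposed into typed components, each component
# carrying an explicit uncertainty flag.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "question_id": {"type": "string"},
        "components": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "kind": {"enum": ["equation", "diagram",
                                      "numerical_result", "explanation"]},
                    "content": {"type": "string"},
                    "uncertain": {"type": "boolean"},
                },
                "required": ["kind", "content", "uncertain"],
            },
        },
    },
    "required": ["question_id", "components"],
}
```

Making `uncertain` a required field matters: the model must commit to a confidence judgement for every component rather than flagging only when it happens to notice a problem.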

Confidence signalling. The extraction prompt instructs the model to flag any content it is uncertain about. Low-confidence segments are highlighted in the review interface, so the examiner can verify them before grading proceeds. This is preferable to silent misreading, which would propagate errors into the grading stage without any indication that something went wrong.
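Given extractions in a structure like the schema above, routing pages for review is straightforward. A minimal sketch, with function and field names assumed rather than taken from AEMS:

```python
def review_plan(pages: list[list[dict]]) -> dict:
    """Split page indices into those that can proceed straight to
    grading and those containing at least one component the model
    marked uncertain, which the examiner should verify first."""
    plan = {"auto": [], "review": []}
    for i, components in enumerate(pages):
        has_uncertain = any(c.get("uncertain") for c in components)
        (plan["review"] if has_uncertain else plan["auto"]).append(i)
    return plan
```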

Invisible text detection. A less obvious problem is invisible text embedded in PDFs. Some PDF generators insert hidden text layers (for accessibility or OCR purposes) that are not visible to the student but are readable by the vision model. In adversarial cases, this could be exploited to inject content that influences grading. AEMS includes a detection layer that identifies and filters white-on-white, black-on-black, and other invisible text patterns before extraction. This is a security measure, not a quality measure, but it is important for the integrity of the process.
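One way to sketch such a detection layer: inspect the per-character records a PDF text extractor produces and flag characters whose fill colour matches the page background, or whose text render mode is 3 (the PDF spec's "invisible" mode). The dict fields below mimic the per-char records of a library like pdfplumber but are assumptions, not AEMS's implementation:

```python
def invisible_chars(chars: list[dict], background: tuple = (1, 1, 1),
                    tol: float = 0.05) -> list[dict]:
    """Flag characters that are likely invisible: render mode 3, or a
    fill colour within `tol` of the page background (white-on-white).
    Assumed char fields: 'text', 'non_stroking_color', 'render_mode'."""
    flagged = []
    for c in chars:
        if c.get("render_mode") == 3:
            flagged.append(c)
            continue
        color = c.get("non_stroking_color") or (0, 0, 0)
        if all(abs(a - b) <= tol for a, b in zip(color, background)):
            flagged.append(c)
    return flagged
```

Flagged characters are stripped before the page is rasterised for extraction, so hidden instructions never reach the vision model.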

The role of the vision cache

Vision API calls are expensive in both time and cost. A single page extraction using GPT-4o or Claude 3.5 Sonnet costs between 0.01 and 0.05 USD depending on image resolution and response length. For 300 submissions with an average of 6 pages each, that is 1,800 extraction calls, and the vision extraction alone can cost 20 to 90 USD.

AEMS caches every vision extraction result in a local SQLite database, keyed by a SHA-256 hash of the image content, the prompt, and the model identifier. If the same page is processed again with the same model and prompt (for example, during rubric iteration or re-grading), the cached result is returned immediately at zero cost.

The cache also enables a workflow that would otherwise be impractical: iterating on the rubric. When an examiner adjusts a rubric check and re-runs the grading, the vision extraction is not repeated. Only the grading stage runs again, using the cached extraction. This makes rubric refinement fast and inexpensive.

Quantitative observations

Over several exam cycles involving approximately 2,000 individual page extractions, we observed the following patterns:

Extraction accuracy by content type:

  • Printed or typed text: 97 to 99 percent character accuracy
  • Clean handwriting (pen on white paper): 90 to 95 percent
  • Pencil handwriting or low-contrast scans: 80 to 90 percent
  • Mathematical notation (formulas, equations): 85 to 93 percent
  • Diagrams with labels: 75 to 85 percent for label text, structural elements generally correct

Model comparison (as of early 2026):

  • Claude 3.5 Sonnet and successors: strongest on mathematical notation and structured extraction
  • GPT-4o: strongest on varied handwriting styles, slightly weaker on complex formulas
  • Gemini 1.5 Pro and 2.0: good general performance, competitive cost, occasional formatting inconsistencies in structured output

These numbers are approximate and depend heavily on the specific exam, student population, and scanning quality. They are offered as practical guidance, not as benchmarks.

What remains difficult

Despite the improvements, several categories of handwritten content remain challenging for all current vision models:

  • Heavily abbreviated notation where the meaning depends on course-specific conventions
  • Overwritten corrections where the original and corrected content overlap spatially
  • Very small writing in margins or between lines
  • Non-Latin scripts mixed with mathematical notation
  • Circuit diagrams, free body diagrams, and other technical drawings where spatial relationships carry meaning

For these cases, the human review step is not a safety net. It is an essential part of the pipeline. The goal of the extraction stage is not perfection. It is to handle the straightforward 70 to 80 percent reliably, flag the uncertain portions explicitly, and let the examiner focus attention where it is most needed.