AEMS


Why a mechanics professor built an exam grading tool

Artem Kulachenko · 6 min read
origin-story · higher-education · solid-mechanics · workflow

I am a professor of Solid Mechanics at KTH Royal Institute of Technology in Stockholm. My research involves fracture mechanics, computational methods, and material modelling. None of that has anything to do with building software products. And yet, here we are.

The story begins where most practical tools begin: with a problem that would not go away.

The marking problem

Every examination period at KTH follows the same rhythm. Students sit the exam. Papers are collected, scanned, and delivered as PDFs. Then the marking begins.

For a typical undergraduate course in mechanics, this means 200 to 300 submissions, each containing five to eight problems with multiple sub-questions. A single exam might require checking 1,500 individual rubric items. The marking takes a week of focused work, sometimes more.

The cognitive load is substantial. Each problem demands re-reading the rubric, interpreting the student’s notation, verifying intermediate steps, and assigning a mark. By the third day, the examiner’s internal calibration drifts. The standard applied to paper number 250 is not the same standard applied to paper number 20. This is not a hypothesis. It is a repeatedly measured phenomenon in educational assessment research.

I have marked exams this way for years. My colleagues have done the same. The process is unchanged from what it was thirty years ago, except that the papers are now PDFs instead of physical sheets.

The experiment

In late 2024, when vision-capable language models became broadly available, I ran an informal experiment. I took a rubric from a recent mechanics exam, formatted it as a structured prompt, and fed ten student submissions through GPT-4V. I compared the AI-proposed marks against my own.

The results were uneven but interesting. For calculation-heavy questions with well-defined steps, the AI agreed with my marks roughly 80 to 85 percent of the time. For partial credit decisions and interpretation of ambiguous notation, the agreement dropped to around 60 percent. For open-ended discussion questions, the AI was unreliable.

But the 80 percent figure was significant. It suggested that for the mechanical, repetitive portion of marking, a well-prompted AI model could serve as a credible first pass. Not a replacement for human judgment, but a starting point that would reduce the examiner’s task from marking to reviewing.

From script to system

The first version was a Python script. It read a PDF, split it into pages, sent each page to a vision API, and produced a JSON file with proposed marks and explanations. I used it for one exam cycle with my own submissions, correcting the AI’s proposals manually and recording the adjustments.
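In outline, that first script looked something like the sketch below. This is an illustrative reconstruction, not the actual AEMS code: the function and parameter names (mark_submission, vision_model) are hypothetical, pages are assumed to be already rendered to images, and the vision model is injected as a callable so the pipeline logic is separate from any particular provider SDK.

```python
import json
from typing import Callable, List

def mark_submission(pages: List[bytes],
                    rubric: dict,
                    vision_model: Callable[[bytes, dict], dict]) -> str:
    """Send each scanned page to a vision model along with the rubric,
    and collect the proposed marks into one JSON report."""
    results = []
    for page_no, image in enumerate(pages, start=1):
        # In the real workflow this call wraps a provider API
        # (e.g. a GPT-4V request with the rubric in the prompt).
        proposal = vision_model(image, rubric)
        results.append({
            "page": page_no,
            "marks": proposal.get("marks", []),
            "explanation": proposal.get("explanation", ""),
        })
    return json.dumps({"rubric_id": rubric.get("id"), "pages": results},
                      indent=2)
```

Injecting the model as a callable is also what makes this kind of pipeline testable: a fake model can stand in for the API during development, which mattered later when the script grew a test suite.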

Two things became clear immediately. First, the tool saved real time. Reviewing and correcting a pre-marked submission took roughly one-third the time of marking from scratch. Second, the consistency improved. Because the AI applied the same rubric mechanically to every submission, the type of drift I described earlier was eliminated at the first-pass level. My corrections added nuance, but the baseline was uniform.

The script grew. I added rubric parsing, so the marking scheme could be defined in a structured YAML file rather than embedded in the prompt. I added PDF annotation, so the proposed marks appeared directly on the student’s submission as colour-coded overlays. I added a review interface, so accepting or rejecting each mark was a single click rather than a manual comparison.
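A structured rubric file might look like the following. The field names and course code here are invented for illustration; they are not AEMS's actual schema, but they show the idea: each problem decomposes into rubric items with points and a checkable criterion.

```yaml
# Hypothetical rubric file -- schema and course code are illustrative only.
exam: MECH-2025-01
problems:
  - id: 1
    max_points: 10
    items:
      - id: 1a
        points: 3
        criterion: "Correct free body diagram with all reaction forces"
      - id: 1b
        points: 4
        criterion: "Equilibrium equations with a consistent sign convention"
      - id: 1c
        points: 3
        criterion: "Numerical result within tolerance, with units stated"
```

Keeping the marking scheme in a file rather than in the prompt means the same rubric can be versioned, reviewed by a co-examiner, and reused across exam cycles.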

At some point, the script was no longer a script. It was a system with a database, a web interface, and configuration for multiple AI providers. The engineering had taken on a life of its own.

Why build rather than buy

The obvious question is why I did not simply adopt an existing tool. The answer has three parts.

First, the existing tools in this space were designed primarily for multiple-choice and short-answer assessment. They were not built for the kind of structured, multi-step technical problems that appear in engineering and physics exams. Checking whether a student correctly applied a free body diagram, identified boundary conditions, and carried through a derivation requires domain-aware rubric checks, not pattern matching.

Second, privacy. Student exam data at a Swedish university is subject to GDPR, the Swedish Education Act, and institutional data governance policies. Uploading student submissions to a third-party service hosted outside the EU was not acceptable for my department. I needed a tool that could run locally, using my own API keys or entirely local models.

Third, integration with Canvas. KTH, like many European universities, uses Canvas as its learning management system. The grading workflow needed to pull submissions from Canvas, apply marks, and push results back, all within the existing administrative infrastructure that examiners and students already use.

No existing tool addressed all three requirements simultaneously.

What I learned about building software as an academic

Building AEMS taught me several things that my training in continuum mechanics did not prepare me for.

Software is never finished. In research, a model converges to a solution. In software, every solution generates new requirements. The annotation system worked, so users wanted better annotation placement. The Canvas integration worked, so users wanted batch processing. Each capability created demand for the next.

Testing is not optional. Early versions had intermittent bugs that appeared only with certain PDF layouts or handwriting styles. Building a comprehensive test suite (the current count is over 6,000 tests) was not academic perfectionism. It was the only way to ship changes without breaking existing functionality.

Privacy is an architecture decision, not a feature. You cannot add privacy to a system that was designed to aggregate data centrally. AEMS was built from the beginning with the assumption that exam data should stay as close to the examiner as possible. That decision shaped every subsequent architectural choice.

The current state

AEMS now supports multiple AI providers (Anthropic Claude, OpenAI GPT-4o, Google Gemini, and local models via Ollama), full Canvas LMS integration with a five-step grading wizard, PDF annotation with coordinate-refined placement, a four-tier memory system that learns from examiner corrections, and deployment options ranging from a desktop application to a full on-premises institutional installation.

It is used for real exams with real students. It saves real time. And it started because I was tired of spending a week on marking when I could have been doing research.

The rest of this blog will document the technical decisions, design trade-offs, and lessons learned along the way. If you are an academic who has ever looked at a stack of exams and thought “there must be a better way,” this project is an answer to that thought.