AEMS


Six thousand tests and counting: testing an AI-powered application

· Artem Kulachenko · 7 min read
testing · engineering · quality · technical

When people learn that AEMS has over 6,600 automated tests, the first reaction is usually some variant of “that seems excessive for a grading tool.” It is a fair reaction. The test count is high relative to the size of the codebase. But it reflects a specific challenge: testing software whose core functionality depends on non-deterministic AI models that can produce different outputs for the same input.

This post describes the testing strategy, why each layer exists, and what we learned about building reliable tests for AI-powered applications.

The testing pyramid, adapted

The standard testing pyramid (many unit tests, fewer integration tests, fewest end-to-end tests) applies to AEMS, but with an additional layer that is specific to AI-dependent applications: model behaviour tests.

Unit tests (~4,000). These test individual functions and classes in isolation. Rubric parsing, PDF coordinate transformation, Canvas API response handling, database model operations, configuration management. None of these tests involve an AI model. They test the deterministic parts of the system and run in seconds.

Integration tests (~1,500). These test interactions between components. The grading pipeline with mocked AI responses, the Canvas workflow with a simulated API, the web interface with a test database. The AI model is replaced with a fixture that returns predetermined responses, so the tests verify that the system handles model output correctly without depending on an actual model.
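The pattern of replacing the model with a fixture can be sketched as follows. The function names and response schema here are illustrative, not AEMS's actual internals:

```python
import json

# Hypothetical stand-in for the AI client: it returns a predetermined
# response, so the test exercises the pipeline, not the model.
CANNED_RESPONSE = {
    "checks": [
        {"id": "arithmetic", "correct": False, "explanation": "Sign error in step 2."}
    ],
    "total_marks": 3,
}

def fake_model(prompt: str) -> str:
    """Return a canned grading response regardless of the prompt."""
    return json.dumps(CANNED_RESPONSE)

def parse_grading_response(raw: str) -> dict:
    """The piece of the pipeline under test: decode and sanity-check output."""
    parsed = json.loads(raw)
    assert "checks" in parsed and "total_marks" in parsed
    return parsed

result = parse_grading_response(fake_model("any prompt"))
assert result["total_marks"] == 3
```

Because the fixture is deterministic, the test isolates the system's handling of model output from the model itself.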

Model behaviour tests (~300). These test the system’s interaction with real AI models, but in a controlled way. They use cached model responses (stored in the vision cache) and verify that the grading logic produces consistent results for known inputs. These tests are slower and more expensive to run, so they execute separately from the main test suite.

End-to-end tests (~800). These test the full user workflow through the web interface using Playwright. They verify that an examiner can navigate the Canvas wizard, configure grading, review results, and publish grades. The AI model is mocked, but everything else (database, web server, file system) is real.

Why so many unit tests

The high unit test count reflects three factors.

First, the system has many configuration paths. AEMS supports four AI providers, three deployment tiers, multiple rubric formats, various PDF layouts, and several Canvas API versions. Each combination creates a distinct code path that must be tested. The combinatorial expansion is significant.

Second, PDF processing is subtle. Converting between coordinate systems (PDF uses a bottom-left origin, the web canvas a top-left origin), placing annotations at precise locations on a page, and handling different PDF generators’ idiosyncrasies all require extensive test coverage. A single-pixel offset in annotation placement is invisible in testing but visible to the examiner. These tests catch regressions that would otherwise appear as cosmetic defects.
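The coordinate flip at the heart of this is simple but easy to get wrong in one direction. A minimal sketch (function names are illustrative, not AEMS's):

```python
# PDF origin is bottom-left with y increasing upward; the web canvas origin
# is top-left with y increasing downward. Converting between the two is a
# reflection about the horizontal midline of the page.

def pdf_to_canvas_y(y_pdf: float, page_height: float) -> float:
    """Convert a PDF y-coordinate to a top-left-origin canvas y-coordinate."""
    return page_height - y_pdf

def canvas_to_pdf_y(y_canvas: float, page_height: float) -> float:
    """Inverse transform; the two must round-trip exactly."""
    return page_height - y_canvas

# An A4 page is 842 pt tall in PDF points; an annotation 100 pt from the
# bottom of the page sits 742 pt from the top in canvas coordinates.
assert pdf_to_canvas_y(100, 842) == 742
assert canvas_to_pdf_y(pdf_to_canvas_y(100, 842), 842) == 100
```

Unit tests that assert the round-trip property for every transformation pair are what catch the single-direction mistakes.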

Third, the security surface is broad. Path validation, CSRF protection, SQL injection prevention, XSS mitigation, and the invisible text detection system each have dedicated test suites. Security tests are not optional, and they do not tolerate gaps.

The challenge of testing non-deterministic output

The fundamental difficulty in testing AI-powered applications is that the same input can produce different outputs across model versions, prompt variations, or even identical API calls. A test that verifies the exact text of an AI response will break every time the model is updated.

AEMS addresses this with a layered approach:

Structural tests. Rather than verifying exact output text, these tests verify the structure of the output. Did the model return a valid JSON response? Does it contain the expected fields? Are the mark values within the valid range? Is the explanation non-empty? Structural tests are robust to model updates because they test the format, not the content.
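A structural check can be sketched as a validator that reports problems with the shape of a response without ever looking at the wording. The field names here are illustrative; the real AEMS schema may differ:

```python
import json

def check_structure(raw: str, max_marks: int) -> list[str]:
    """Return a list of structural problems; an empty list means usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    problems = []
    for field in ("marks", "explanation"):
        if field not in data:
            problems.append(f"missing field: {field}")
    if "marks" in data and not (0 <= data["marks"] <= max_marks):
        problems.append("marks out of range")
    if not data.get("explanation", "").strip():
        problems.append("empty explanation")
    return problems

# Valid structure passes regardless of what the explanation says.
assert check_structure('{"marks": 2, "explanation": "Correct method."}', 3) == []
# Out-of-range marks fail even though the JSON itself is well-formed.
assert "marks out of range" in check_structure('{"marks": 7, "explanation": "x"}', 3)
```

Because nothing here depends on the explanation text, the same test passes unchanged across model updates.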

Semantic tests. For critical grading decisions, tests verify semantic properties rather than exact values. If a submission contains a clear arithmetic error, the test verifies that the relevant check is marked as incorrect, without specifying the exact explanation text. These tests use cached model responses, so they are deterministic within a model version but must be updated when the model changes.
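The semantic layer asserts a property of a cached response rather than its wording. A minimal sketch, with an illustrative schema:

```python
# A cached model response for a submission with a known arithmetic error.
# The structure and check ids here are hypothetical.
cached_response = {
    "checks": [
        {"id": "arithmetic", "correct": False, "explanation": "2+2 evaluated as 5."},
        {"id": "method", "correct": True, "explanation": "Correct approach."},
    ]
}

def check_result(response: dict, check_id: str) -> bool:
    """Look up a single rubric check's verdict in the response."""
    return next(c["correct"] for c in response["checks"] if c["id"] == check_id)

# The property under test: the arithmetic check is flagged incorrect.
# The explanation text itself is deliberately not asserted on.
assert check_result(cached_response, "arithmetic") is False
```

When the model is upgraded, the cached response is regenerated and the same property assertion is re-run against it.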

Boundary tests. Edge cases (empty submissions, submissions in unexpected languages, submissions with only diagrams, submissions that span many pages) are tested with both mocked and real model responses. The mocked tests verify that the system handles unusual input gracefully. The real-response tests verify that the model produces usable output for edge cases.

Regression tests. When a bug is discovered and fixed, the specific input that triggered the bug is added as a test case. This prevents regressions and builds a library of known challenging inputs over time.

What the tests caught

Several categories of bugs were caught exclusively by automated tests:

Coordinate system errors. PDF annotation placement involves multiple coordinate transformations. A test that renders an annotation and verifies its pixel position on the page caught a regression where annotations were placed 15 pixels too high on pages with non-standard margins. This would have been invisible in development (where all test PDFs had standard margins) and visible only to examiners using specific PDF templates.

Unicode handling in rubric parsing. Swedish course materials frequently contain characters like å, ä, ö. A rubric check that included these characters was parsed correctly in isolation but caused a JSON encoding error when embedded in a prompt. The unit test for rubric-to-prompt conversion caught this before any examiner encountered it.
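The class of test that catches this kind of bug is a round-trip assertion: rubric text with non-ASCII characters must survive being embedded in a prompt payload and extracted again. The payload shape below is illustrative, not AEMS's actual prompt format:

```python
import json

# Rubric text with Swedish characters that must survive serialisation.
check_text = "Beräkna arean av en cirkel med radie 5 cm."

def embed_in_prompt(check: str) -> bytes:
    """Serialise the check into a UTF-8 JSON prompt payload."""
    payload = {"rubric_check": check}
    # ensure_ascii=False keeps å/ä/ö as literal characters; the encoded
    # bytes must then be valid UTF-8.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")

def extract_check(raw: bytes) -> str:
    """Recover the rubric check from the payload."""
    return json.loads(raw.decode("utf-8"))["rubric_check"]

# Round-trip: the check text must come back character-for-character identical.
assert extract_check(embed_in_prompt(check_text)) == check_text
```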

Canvas API pagination. Canvas paginates API responses for courses with many submissions. The initial implementation handled the first page correctly but dropped subsequent pages. An integration test with a mock Canvas API returning 150 submissions (three pages) caught the bug immediately.
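The fix is a loop that keeps following "next" pointers until none remain. Canvas signals further pages via Link headers; this simplified sketch models pages as an in-memory list, as a mock API might, with illustrative names:

```python
def fetch_all(pages: list[dict]) -> list[int]:
    """Accumulate submissions across all pages, not just the first."""
    results = []
    page_index = 0
    while page_index is not None:
        page = pages[page_index]
        results.extend(page["submissions"])
        page_index = page.get("next")  # None on the final page
    return results

# Three mock pages of 50 submissions each, as in the test described above.
mock_pages = [
    {"submissions": list(range(0, 50)), "next": 1},
    {"submissions": list(range(50, 100)), "next": 2},
    {"submissions": list(range(100, 150))},  # no "next": final page
]
assert len(fetch_all(mock_pages)) == 150
```

The original bug is exactly the degenerate version of this loop that reads one page and stops, which is why a single-page mock cannot catch it.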

Race conditions in batch grading. When grading multiple submissions concurrently, two grading tasks could write to the same database record if they completed at the same instant. A stress test that ran 50 concurrent grading tasks with a shared database caught this within the first test cycle.
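The stress-test pattern looks roughly like this: many workers perform a read-modify-write on a shared record, and a lock (standing in for a database row lock or transaction in the real system) keeps each update atomic. Names are illustrative:

```python
import threading

# Shared record that concurrent grading tasks write to.
record = {"graded_count": 0}
lock = threading.Lock()

def grade_task():
    for _ in range(100):
        # Without the lock, the read-modify-write below can interleave
        # between threads and lose updates.
        with lock:
            record["graded_count"] += 1

# 50 concurrent tasks, mirroring the stress test described above.
threads = [threading.Thread(target=grade_task) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock, every one of the 50 * 100 increments survives.
assert record["graded_count"] == 5000
```

Running the same test with the lock removed is a quick way to confirm the test can actually detect the race it is guarding against.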

Testing on Windows

AEMS runs primarily on Windows (because most examiners at KTH use Windows workstations). Testing on Windows introduces platform-specific challenges that do not appear on Linux or macOS.

File path handling. Windows uses backslashes in file paths, but many Python libraries expect forward slashes. Tests that pass on Linux can fail on Windows due to path separator issues. All file path operations in AEMS use pathlib.Path to abstract the separator, and the test suite runs on Windows as the primary platform.
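The `pathlib` approach in practice (paths here are illustrative):

```python
from pathlib import Path

# The / operator builds paths with the correct separator for the current
# platform: backslashes on Windows, forward slashes elsewhere.
submissions_dir = Path("data") / "submissions"
pdf_path = submissions_dir / "student_42.pdf"

# as_posix() gives a separator-independent form, useful when tests or logs
# must compare paths across platforms.
assert pdf_path.as_posix() == "data/submissions/student_42.pdf"
assert pdf_path.name == "student_42.pdf"
assert pdf_path.suffix == ".pdf"
```

Tests that compare paths as raw strings are the usual source of Windows-only failures; comparing `Path` objects or `as_posix()` forms sidesteps the problem.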

Process cleanup. Some test fixtures start background processes (schedulers, cache cleaners). On Windows, process termination is less reliable than on Linux, and orphaned processes can cause file locks that break subsequent tests. The test infrastructure includes explicit cleanup with retry logic for locked files.

Encoding defaults. Windows defaults to system locale encoding rather than UTF-8 for file operations. Tests that read or write files with non-ASCII content must specify the encoding explicitly. This is easy to forget and consistently causes failures in CI environments with different locale settings.
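The fix is mechanical: every `open()` call that touches text gets an explicit `encoding="utf-8"`. A round-trip sketch:

```python
import os
import tempfile

# Text with Swedish characters that the system locale encoding (e.g. cp1252
# on Windows) could mangle if the encoding were left implicit.
text = "Rättningsmall: å, ä, ö"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Explicit encoding on both write and read makes the behaviour identical
# on every platform and in every CI locale.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
os.remove(path)

assert round_tripped == text
```

On Python 3.10+, running with `-X warn_default_encoding` (or enabling `EncodingWarning`) flags every implicit-encoding `open()` call, which turns this class of bug into a visible warning.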

The cost of comprehensive testing

Running 6,600 tests takes approximately 8 minutes on a modern workstation using parallel execution. This is acceptable for a pre-commit check but too slow for rapid iteration during development. The test suite is organised into marks (unit, integration, e2e, model) so that developers can run relevant subsets during development and the full suite before committing.
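Assuming the marks are implemented as pytest markers (AEMS's actual configuration is not shown, so this is a hypothetical sketch), the registration might look like:

```ini
[pytest]
markers =
    unit: fast, isolated tests with no AI model
    integration: component interactions with mocked AI responses
    model: behaviour tests against cached real-model responses
    e2e: full Playwright workflows with mocked AI
; parallel execution, assuming pytest-xdist is installed
addopts = -n auto
```

A developer then runs `pytest -m unit` during iteration and plain `pytest` (or `pytest -m "not model"`) before committing.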

Maintaining 6,600 tests also has a cost in development velocity. Any change to a core interface can break dozens or hundreds of tests, each of which must be updated. This creates a tension between comprehensive testing and rapid iteration that every growing codebase must manage.

The trade-off is worth it. The test suite has caught more bugs than any other quality measure. It provides confidence that changes to one part of the system do not break another part. And it serves as living documentation of how the system is expected to behave, which is valuable both for the current team and for anyone who works on the code in the future.