Bias, Fairness & Transparency

Our approach: companion, not judge

AEMS does not autonomously decide grades. The AI drafts marks and feedback based on your rubric, and presents them as proposals. Every mark requires human review before it reaches a student. This is not a philosophical position; it is a hard architectural constraint enforced by the review workflow.

Rubric traceability

Every mark AEMS proposes is tied to a specific rubric criterion. The AI does not produce a single holistic score. Instead, it evaluates each rubric step independently and shows which criteria were met, partially met, or not met.

This means marks are explainable: you can see exactly why the AI proposed a particular score, and students can see which rubric points they earned or missed.
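As an illustration of per-criterion marking, the sketch below models a proposal as a list of criterion results whose sum is the proposed score. The class names, field names, and criterion IDs are hypothetical and are not taken from the AEMS codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    MET = "met"
    PARTIALLY_MET = "partially met"
    NOT_MET = "not met"


@dataclass(frozen=True)
class CriterionResult:
    criterion_id: str   # hypothetical rubric step identifier
    max_points: float
    awarded: float
    status: Status


def proposed_total(results):
    """The proposed score is the sum of per-criterion awards, never a holistic number."""
    return sum(r.awarded for r in results)


# Illustrative values only.
results = [
    CriterionResult("Q1.a", 2.0, 2.0, Status.MET),
    CriterionResult("Q1.b", 3.0, 1.5, Status.PARTIALLY_MET),
    CriterionResult("Q1.c", 1.0, 0.0, Status.NOT_MET),
]

for r in results:
    print(f"{r.criterion_id}: {r.awarded}/{r.max_points} ({r.status.value})")
print("Proposed total:", proposed_total(results))
```

Because the total is derived from the criterion list, every point in a proposed score can be traced back to a named rubric step.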

Override audit trail

When an instructor changes an AI-proposed mark, AEMS logs:

Who made the change (user ID, name, role)
When (timestamp)
What changed (before and after values)
Why (reason field, required for all modifications)

These logs are tamper-evident: each entry stores the SHA-256 hash of the previous entry, forming a hash chain. Altering any earlier record changes its hash and breaks every later link, so verification reveals whether records have been altered after the fact.
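A minimal sketch of a SHA-256 hash chain over override entries, to show why after-the-fact edits become detectable. The entry layout, the JSON serialisation, the all-zero genesis value, and the function names are assumptions made for illustration; the actual AEMS log format is not described here.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(body, prev_hash):
    """SHA-256 over a canonical JSON encoding of the entry body and the previous hash."""
    payload = json.dumps({"body": body, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain, body):
    """Append an override record; a non-empty reason is required."""
    if not body.get("reason"):
        raise ValueError("a reason is required for every modification")
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"body": body, "prev_hash": prev_hash,
                  "hash": entry_hash(body, prev_hash)})


def verify_chain(chain):
    """Recompute every hash in order; any altered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["body"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative values only.
log = []
append_entry(log, {"user_id": "u-17", "name": "Example Instructor", "role": "instructor",
                   "timestamp": "2024-01-01T09:00:00Z",
                   "before": 3, "after": 4, "reason": "Valid alternative method"})
append_entry(log, {"user_id": "u-17", "name": "Example Instructor", "role": "instructor",
                   "timestamp": "2024-01-01T09:05:00Z",
                   "before": 5, "after": 4, "reason": "Arithmetic slip missed"})

print(verify_chain(log))        # chain is intact at this point
log[0]["body"]["after"] = 10    # simulate an after-the-fact edit
print(verify_chain(log))        # the stored hash no longer matches the body
```

A chain like this makes tampering evident rather than impossible: someone able to rewrite the whole log could recompute every hash, which is why such designs usually also store the latest hash somewhere independent.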

Confidence scoring

AEMS assigns confidence levels to its proposed marks. Submissions where the AI is less confident are flagged for closer review. This means the hardest-to-grade papers, such as those with messy handwriting, unusual notation, or ambiguous answers, are prioritised for human attention rather than quietly assigned a potentially wrong score.
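One way to picture the prioritisation is a review queue ordered by confidence, with items below a cut-off flagged. The 0.8 threshold, the 0-to-1 confidence scale, and the field names below are assumptions for the sketch, not AEMS settings.

```python
REVIEW_THRESHOLD = 0.8  # assumed cut-off; not a documented AEMS value


def review_queue(proposals, threshold=REVIEW_THRESHOLD):
    """Return every proposal, least confident first, with a 'flagged' marker.

    Nothing is dropped: all marks still go to human review, and the flag
    only decides which ones are surfaced for closer attention.
    """
    ordered = sorted(proposals, key=lambda p: p["confidence"])
    return [{**p, "flagged": p["confidence"] < threshold} for p in ordered]


# Illustrative values only.
proposals = [
    {"submission": "S1", "proposed_mark": 7, "confidence": 0.95},
    {"submission": "S2", "proposed_mark": 4, "confidence": 0.55},
    {"submission": "S3", "proposed_mark": 6, "confidence": 0.78},
]

for item in review_queue(proposals):
    label = "REVIEW CLOSELY" if item["flagged"] else "standard review"
    print(item["submission"], item["confidence"], label)
```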

What the AI does not see

The AI receives only the exam content for grading. It does not see:

Student names or identifiers
Demographics or protected characteristics
Previous grades or academic history
Course standing or enrolment status

Grading is based solely on the rubric and the submitted work.
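A common way to enforce this kind of separation is an allow-list: only explicitly permitted fields are copied into the request sent to the model, so any new field is excluded by default. The field names and the function below are hypothetical and only sketch the idea.

```python
# Assumed field names; only these are ever forwarded to the model.
ALLOWED_FIELDS = {"question_id", "answer_content", "rubric"}


def build_grading_payload(submission):
    """Copy only allow-listed fields, dropping everything else."""
    return {k: v for k, v in submission.items() if k in ALLOWED_FIELDS}


# Illustrative record mixing exam content with data the model must not see.
submission = {
    "question_id": "Q3",
    "answer_content": "x = 4 because 2x + 1 = 9",
    "rubric": ["sets up equation", "solves for x"],
    "student_name": "Example Student",
    "student_id": "0000",
    "previous_grades": [72, 85],
    "enrolment_status": "full-time",
}

payload = build_grading_payload(submission)
print(sorted(payload))
```

An allow-list is the safer default compared with a deny-list, because a field added to the submission record later stays hidden unless someone deliberately permits it.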

Model versioning

Each grading session records which AI model and version produced the marks. This means:

Results are traceable: any grade can be traced to the exact model and version that produced it
Model updates do not silently change grading behaviour mid-session
Institutions can audit which model versions are approved for use
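The sketch below shows one way a session could pin its model: the model name and version are fixed when the session is created, checked against an approved list, and stored in an immutable record. The model name, version string, and all identifiers are made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical institution-approved (model name, version) pairs.
APPROVED_MODELS = {("example-grader", "1.2.0")}


@dataclass(frozen=True)  # frozen: the pinned model cannot be reassigned mid-session
class GradingSession:
    session_id: str
    model_name: str
    model_version: str


def start_session(session_id, model_name, model_version, approved=APPROVED_MODELS):
    """Create a session only if the requested model version is approved."""
    if (model_name, model_version) not in approved:
        raise ValueError(f"{model_name} {model_version} is not approved for use")
    return GradingSession(session_id, model_name, model_version)


session = start_session("sess-001", "example-grader", "1.2.0")
print(session)
```

Every mark produced in the session can then carry the session ID, which links it back to one fixed model version.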

Known limitations

We believe honesty about limitations is more important than marketing claims. Here is what we know:

Handwriting quality matters. Very messy handwriting, unusual notation styles, or poor scan quality can reduce OCR accuracy. AEMS flags low-confidence items, but it cannot guarantee perfect reading of every handwriting style.
Rubric coverage is bounded. If a student uses a valid approach not covered by the rubric, the AI may not recognise it. Human review catches these cases.
Language and notation conventions vary. Mathematical notation differs across countries and traditions. We test against common STEM notation but cannot claim coverage of every convention.
No formal bias audit yet. We have not yet completed a large-scale formal bias audit across demographic groups. We track AI-vs-human agreement metrics and override rates, which provide directional signals, but a formal study requires larger pilot data sets.

What we are building toward

Formal bias testing across handwriting styles, notation conventions, and scan qualities as pilot data accumulates
AI-vs-human agreement dashboards accessible to department heads
Override pattern analysis to detect systematic grading tendencies
Third-party fairness audits as the user base grows

Analytics already available

AEMS includes built-in analytics comparing AI-proposed marks against human-reviewed final marks:

Agreement rates (with configurable tolerance thresholds)
Overgrade and undergrade rates
Override frequency by question and rubric criterion
Score correlation analysis
Per-question difficulty and discrimination indices

These metrics are available to instructors and department heads through the analytics dashboard.
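To make the first two metrics concrete, here is a minimal sketch of how agreement, overgrade, and undergrade rates could be computed from paired marks. The definitions used (agreement means the AI mark is within the tolerance of the final mark; overgrade means the AI mark exceeds it by more than the tolerance) and the default tolerance of 0.5 are assumptions, not the documented AEMS formulas.

```python
def agreement_metrics(ai_marks, final_marks, tolerance=0.5):
    """Compare AI-proposed marks with human-reviewed final marks."""
    if not ai_marks or len(ai_marks) != len(final_marks):
        raise ValueError("need two non-empty lists of equal length")
    n = len(ai_marks)
    # Positive difference: the AI proposed more than the human finally awarded.
    diffs = [ai - final for ai, final in zip(ai_marks, final_marks)]
    return {
        "agreement_rate": sum(abs(d) <= tolerance for d in diffs) / n,
        "overgrade_rate": sum(d > tolerance for d in diffs) / n,
        "undergrade_rate": sum(d < -tolerance for d in diffs) / n,
    }


# Illustrative values only.
ai_marks = [4.0, 3.0, 5.0, 2.0]
final_marks = [4.0, 3.5, 3.0, 4.0]
print(agreement_metrics(ai_marks, final_marks))
```

With these definitions the three rates always sum to 1, which gives a quick sanity check on any dashboard figure.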

Contact

Questions about fairness or bias: privacy@aems.app