AI Grading Gets a Reality Check

The figure plots normalized scores, calculated as [latex](X-\text{Human})/|\text{Baseline}-\text{Human}|[/latex], for automated short-answer scoring systems on the ASAP-SAS dataset. A transformer ensemble (point A, Ormerod, 2022) and a fine-tuned GPT ensemble (point B, Ormerod & Kwako, 2024) progressively surpass earlier baselines, but a subsequent GPT-4 implementation (point C, Jiang & Bosch, 2024) comes in at a normalized score of -1.52, well below both the human benchmark and the baseline, breaking the apparent trajectory of steady improvement in automated assessment.

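To make the metric concrete, here is a minimal sketch of how such a normalized score could be computed. The function and the agreement values are illustrative assumptions, not figures from the cited studies; agreement on ASAP-SAS is commonly reported with metrics such as quadratic weighted kappa, and any such metric could stand in for X, Human, and Baseline.

```python
def normalized_score(model: float, human: float, baseline: float) -> float:
    """Normalized score as defined above: (X - Human) / |Baseline - Human|.

    A value of 0 matches the human benchmark and positive values exceed it.
    Assuming the baseline sits below the human benchmark, -1 corresponds to
    baseline-level performance, and anything below -1 is worse than the baseline.
    """
    return (model - human) / abs(baseline - human)


# Illustrative agreement values only (e.g., quadratic weighted kappa);
# these are NOT the numbers reported in the cited studies.
human = 0.75      # hypothetical human benchmark
baseline = 0.70   # hypothetical earlier automated baseline

for label, model in [("slightly below human", 0.74),
                     ("slightly above human", 0.76),
                     ("well below baseline", 0.67)]:
    print(f"{label}: {normalized_score(model, human, baseline):+.2f}")
```

On this scale, a score of -1.52 lands below not only the human benchmark but also the baseline used for normalization.
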
A comprehensive analysis reveals that current artificial intelligence systems struggle to accurately assess short-answer responses, exposing fundamental limitations in their understanding of meaning.