Abstract
Generative AI (genAI) tools are increasingly used by doctoral candidates to obtain formative feedback, yet their reliability in thesis assessment remains unclear. This single-case study compared five genAI models (ChatGPT-o3, ChatGPT-5, ChatGPT-5 Thinking, Claude Opus 4.1, and Gemini 2.5 Pro) across five controlled prompts, including persona-based roles, to evaluate a doctoral confirmation thesis against Grainger’s F.A.C.T. rubric. Feedback varied by platform: Gemini 2.5 Pro produced uniformly perfect scores (14/14), offering little diagnostic value; ChatGPT-5 Thinking and Claude Opus 4.1 were relatively stable; and ChatGPT-o3 and ChatGPT-5 were more volatile. Persona prompts increased variability, reducing reliability relative to the neutral baseline. Alignment with the human reviewer was limited: genAI feedback lacked the nuance, depth, and relational framing central to doctoral supervision. Friedman tests confirmed that Gemini’s ceiling effect drove the only significant omnibus result; with Gemini excluded, no prompt- or model-level differences were significant. Descriptive effect sizes suggested moderate contrasts among the non-Gemini models, but the small N constrained inference. We conclude that while genAI tools may supplement supervision by providing rapid feedback, they cannot replace human judgement. Responsible use requires feedback literacy training and safeguards against over-reliance.