AI not yet good enough to grade university essays, rewarding 'style over substance'

Phys · May 21, 2026 · 1051 words · By University

AI vs Human Judgment · Academic Integrity and Standards · Educational Technology Limitations

Researchers from the University of Cambridge and other institutions tested generative AI models on undergraduate psychology essays to evaluate their grading accuracy. The study found that AI often fails to match human grading for the highest- and lowest-performing students, and tends to reward linguistic style over academic substance.

Read the original article: https://phys.org/news/2026-05-ai-good-grade-university-essays.html

Analysis

Propaganda Score: 10% (confidence: 95%)

Low risk. This article shows minimal use of propaganda techniques.

Detected Techniques

Loaded Language (70% confidence)

Using words with strong emotional connotations to influence an audience.

Fact-Check Results

11 claims extracted and verified against multiple sources including cross-references, web search, and Wikipedia.

Single Source: 8

Insufficient Evidence: 2

Pending: 1

“A University of Cambridge-led team of psychologists and AI experts tested three "frontier" systems including the latest versions (as of April 2026) of Claude and ChatGPT on over 750 student essays from three UK universities submitted as part of a psychology degree.”

SINGLE SOURCE

The provided web search results only define what Claude and ChatGPT are; they do not mention a University of Cambridge study involving 750 student essays.

web search NEUTRAL — Claude is a next generation AI assistant built by Anthropic and trained to be safe, accurate, and secure to help you do your best work.
https://claude.com/login

web search NEUTRAL — Chat with Claude AI by Anthropic for free. Thoughtful reasoning and analysis. No registration required.
https://chatgpt.org/claude/chat

web search NEUTRAL — ChatGPT A conversational AI system that listens, learns, and challenges.
https://chatgpt.com/

“it did manage to match the broad grading bands—a first, 2:1, 2:2 and so on—given out by human examiners between 35–65% of the time.”

SINGLE SOURCE

The evidence describes general UK grading systems but does not provide any data regarding AI's accuracy in matching these bands in a specific study.

web search NEUTRAL — Masters degree grades student So you’ve finished your bachelors and you're thinking of studying a masters program. You may find during this process that the UK masters grading system is slightly diffe…
https://www.postgrad.com/advice/masters_programs/masters_deg…

web search NEUTRAL — Masters grades in the UK are usually classified as Distinction, Merit or Pass with specific percentage thresholds for each category. Assessment typically includes coursework, exams, and a dissertation…
https://www.findamasters.com/guides/masters-degree-grades

web search NEUTRAL — University of St Andrews Ranking UK 2021 / 2022 - Complete.
https://www.thecompleteuniversityguide.co.uk/universities/un…

“all the AI systems were "oversensitive to linguistic features": giving out higher marks based on essay length, vocabulary range, and sentence complexity, regardless of the academic quality of the essay.”

SINGLE SOURCE

The evidence provides general definitions of AI and AGI but contains no information about AI being oversensitive to linguistic features in grading essays.

web search NEUTRAL — Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and dec…
https://en.m.wikipedia.org/wiki/Artificial_intelligence

web search NEUTRAL — We believe our research will eventually lead to artificial general intelligence, a system that can solve human-level problems. Building safe and beneficial AGI is our mission.
https://openai.com/

web search NEUTRAL — Meet Gemini, Google’s AI assistant. Get help with writing, planning, brainstorming, and more. Experience the power of generative AI.
https://gemini.google.com/

“The report is titled "AI in University Assessment: Evaluating the Opportunities and Risks of Automated Marking."”

SINGLE SOURCE

The search results mention various AI reports (Stanford AI Index) and general research platforms, but none confirm the existence or title of the specific report mentioned.

web search NEUTRAL — The 2021 AI Index Report. This year we significantly expanded the amount of data available in the report, worked with a broader set of external organizations to calibrate our data, and deepened our co…
https://hai.stanford.edu/ai-index/2025-ai-index-report

web search NEUTRAL — In this issue: Ethiopia launches a massive AI university, a $23M coalition trains 400,000 teachers to reclaim their time, and new tools target the "marking backlog." We also examine critical evidence …
https://www.linkedin.com/pulse/workload-relief-wave-national…

web search NEUTRAL — Discover research. Access over 160 million publication pages and stay up to date with what's happening in your field. Connect with your scientific community. Share your research, collaborate with your…
https://www.researchgate.net/

“AI was also asked to provide student feedback, and it churned out reflections between three to eight times longer than those provided by the original assessors.”

SINGLE SOURCE

The evidence lists AI tools like Copilot and general Wikipedia definitions of AI, but does not mention a study comparing the length of AI feedback versus human feedback.

web search NEUTRAL — Microsoft Copilot is your companion to inform, entertain and inspire. Get advice, feedback and straightforward answers. Try Copilot now.
https://copilot.microsoft.com/

“when AI responses were kept to a word count comparable to those from humans, focus groups of staff and students found it difficult to distinguish between human and AI feedback.”

SINGLE SOURCE

The evidence discusses 'AI Humanizer' tools designed to make text sound human, but does not mention a specific focus group study regarding the indistinguishability of AI feedback when word counts are matched.

web search NEUTRAL — AI Humanizer helps you humanize AI text online for free. Turn ChatGPT, Claude, and Gemini content into natural, clear, human-like writing—no sign-up required.
https://notegpt.io/ai-humanizer

web search NEUTRAL — When to Humanize AI Text? You are a student? Finish your homework in minutes. Teachers will not find out if a LLM did the work for you.
https://ai-text-humanizer.com/

web search NEUTRAL — Humanize AI and make AI writing sound natural and human. Use our free AI Humanizer to remove robotic tone from ChatGPT and other AI-generated text.
https://www.grammarly.com/ai-humanizer

“The study used 761 undergraduate essays in psychology submitted and marked between 2022 and 2025 from a total of 125 students from the universities of Cambridge, Manchester Metropolitan and Nottingham.”

SINGLE SOURCE

The evidence provides general information about the University of Manchester and psychology degrees, but does not corroborate the specific dataset of 761 essays from the three named universities.

web search NEUTRAL — The University of Manchester. Student life. Your city guide to Manchester. Discover what it’s like to live and study in Manchester, from getting around to things to do and days out.
https://www.manchester.ac.uk/

web search NEUTRAL — The online MSc Psychology and MSc Psychology of Mental Health and Wellbeing programmes have been fully accredited by the British Psychological Society (BPS), which confers eligibility for the Graduate…
https://online.wlv.ac.uk/online-psychology-degrees-at-the-un…

web search NEUTRAL — The study shows a psychological impact of the Covid-19 emergency on college students. Stress significantly decreases learning and negatively affects psychological well-being of students. Resilience sk…
https://pubmed.ncbi.nlm.nih.gov/33602027/

“Researchers tested AI systems with the same essays at different times, and found AI gave the same or similar marks each time.”

SINGLE SOURCE

The evidence discusses why AI prompts can give different answers, but does not provide evidence for a study showing AI gave consistent marks for the same essays over time.

web search NEUTRAL — Sam Altman is the CEO of OpenAI, the company behind GPT-4, ChatGPT, DALL-E, Codex, and many other state-of-the-art AI technologies. Please support this podca...
https://www.youtube.com/watch?v=L_Guz73e6fw

web search NEUTRAL — “If I use the exact same prompt, why does the AI give different answers on different days, environments, or even time zones?” In reality, enterprise AI systems behave more like distributed runtime…
https://blog.gopenai.com/why-does-the-same-prompt-give-diffe…

web search NEUTRAL — Human-realistic AI systems could be used to impersonate people for fraudulent or deceptive purposes, especially when combined with voice cloning techniques. Humans are liable for their actions. As AI …
https://www.aisi.gov.uk/blog/should-ai-systems-behave-like-p…

“The AI managed to match the right UK degree classification band of the five available (first, 2:1, 2:2, third, fail) some 63% of the time for Cambridge essays, while for Nottingham it was 53% and for Manchester Metropolitan it was 35%.”

INSUFFICIENT EVIDENCE

No evidence was found in the search results to support these specific percentage figures for Cambridge, Nottingham, and Manchester Metropolitan.

“An essay marked 75—a solid first—by a human is, on average, scored several points lower by every AI system. While an essay marked 50—a low 2:2—is scored several points higher.”

INSUFFICIENT EVIDENCE

No evidence was found in the search results to support the claim regarding the scoring patterns of high-mark vs low-mark essays by AI.

“The range on the marking scale where AI and humans most frequently align across institutions lies in the upper-50s to low-60s, so around a low 2:1, near the center of the grade distribution.”

PENDING

Disclaimer: This analysis is generated by AI and should be used as a starting point for critical thinking, not as definitive truth. Claims are verified against publicly available sources. Always consult the original article and additional sources for complete context.

eFinder
