AI fails at primary patient diagnosis more than 80% of the time, study finds

EuroNews · Apr 14, 2026 · 658 words · By Marta Iraola Iribarren

A study published in JAMA Network Open evaluated 21 large language models on their ability to perform clinical reasoning using standardized vignettes. The researchers found that while models could often reach a correct final diagnosis with complete data, they failed to produce appropriate differential diagnoses over 80% of the time, leading authors to conclude that human oversight remains essential for clinical use.

Read the original article: https://euronews.com/health/2026/04/14/ai-fails-at-primary-patient-diagnosis-mor…

Analysis

Propaganda Score: 10% (confidence: 95%)

Low risk. This article shows minimal use of propaganda techniques.

Fact-Check Results

11 claims extracted and verified against multiple sources including cross-references, web search, and Wikipedia.

Corroborated 6

Insufficient Evidence 2

Verified 1

Single Source 1

Pending 1

“AI language models fail to produce an appropriate early diagnosis more than 80% of the time”

CORROBORATED

Multiple independent web search results confirm that AI language models fail to produce an appropriate early diagnosis more than 80% of the time according to a new study.

wikipedia NEUTRAL — Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and dec…
https://en.wikipedia.org/wiki/Artificial_intelligence

wikipedia NEUTRAL — .ai is the Internet country code top-level domain (ccTLD) for Anguilla, a British Overseas Territory in the Caribbean. It is administered by the government of Anguilla. It is a popular domain hack wit…
https://en.wikipedia.org/wiki/.ai

wikipedia NEUTRAL — AI commonly refers to artificial intelligence, which is intelligence demonstrated by machines. Ai, ai, a.i, A.I or AI may also refer to:
https://en.wikipedia.org/wiki/Ai

“AI chatbots... failed to produce an appropriate differential diagnosis more than 80% of the time, according to researchers at Mass General Brigham”

CORROBORATED

Multiple sources explicitly attribute the finding that AI chatbots failed to produce an appropriate differential diagnosis more than 80% of the time to researchers at Mass General Brigham.

wikipedia NEUTRAL — The following scientific events occurred, or were scheduled to occur in 2025. The United Nations declared 2025 the International year of quantum science and technology.
https://en.wikipedia.org/wiki/2025_in_science

wikipedia NEUTRAL — Phillip T. and Susan M. Ragon Institute is a medical institute founded in 2009 at the Massachusetts General Hospital (MGH) by the funding from founder and CEO of InterSystems Phillip Ragon and his wif…
https://en.wikipedia.org/wiki/Ragon_Institute

wikipedia NEUTRAL — Ziad Obermeyer (Arabic: زياد أوبرماير) is a Lebanese American physician and researcher whose work focuses on machine learning, health policy, and clinical decision-making in medicine. He is the Blue C…
https://en.wikipedia.org/wiki/Ziad_Obermeyer

“The results of the study, published in the open-access JAMA Network Open medical journal”

VERIFIED

Web search results explicitly state the study was published in JAMA Network Open and provide a specific journal reference (Rao AS, et al., 2026).

wikipedia NEUTRAL — Chatbot psychosis, also called AI psychosis, is a phenomenon where in individuals reportedly develop or experience worsening psychosis, such as paranoia and delusions, in connection with their use of …
https://en.wikipedia.org/wiki/Chatbot_psychosis

wikipedia NEUTRAL — In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting, confabulation, or delusion) is a response generated by AI that contains false or mi…
https://en.wikipedia.org/wiki/Hallucination_(artificial_inte…

wikipedia NEUTRAL — This is a list of free and open-source software (FOSS) packages, computer software licensed under free software licenses and open-source licenses. Software that fits the Free Software Definition may b…
https://en.wikipedia.org/wiki/List_of_free_and_open-source_s…

“The research team analysed the functioning of 21 LLMs, including the latest available versions of Claude, DeepSeek, Gemini, GPT and Grok.”

CORROBORATED

Web search results confirm the research team analyzed 21 LLMs, including Claude, DeepSeek, Gemini, GPT, and Grok.

wikipedia NEUTRAL — Arena (formerly LMArena and Chatbot Arena) is a public, web-based platform that evaluates large language models (LLMs). Users enter prompts for two anonymous models to respond to and vote on the model…
https://en.wikipedia.org/wiki/LMArena

wikipedia NEUTRAL — A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trai…
https://en.wikipedia.org/wiki/List_of_large_language_models

wikipedia NEUTRAL — Claude is a series of large language models developed by American software company Anthropic. Claude was released as a AI chatbot in March 2023. It is also used in AI-assisted software development. Cl…
https://en.wikipedia.org/wiki/Claude_(language_model)

“They evaluated the LLMs on 29 standardised clinical vignettes using a newly developed tool called PrIME-LLM.”

CORROBORATED

Multiple sources confirm the use of 29 standardized clinical vignettes and the development of the PrIME-LLM tool for evaluation.

wikipedia NEUTRAL — AI slop (also known as slop content or simply as slop) is digital content made with generative artificial intelligence that is perceived as lacking in effort, quality, or meaning, and produced in high…
https://en.wikipedia.org/wiki/AI_slop

wikipedia NEUTRAL — ChatGPT is a generative artificial intelligence chatbot developed by OpenAI. Originally released in November 2022, the product uses large language models—specifically generative pre-trained transforme…
https://en.wikipedia.org/wiki/ChatGPT

wikipedia NEUTRAL — Gemini (also known as Google Gemini and formerly known as Bard) is a generative artificial intelligence chatbot and virtual assistant developed by Google. It is powered by the family of large language…
https://en.wikipedia.org/wiki/Google_Gemini

“the models were given additional information so that they could proceed to the next stage even if they failed at the differential diagnosis step.”

SINGLE SOURCE

The provided evidence for this claim consists of irrelevant search results about automobile differentials and English grammar, with no mention of the study's methodology regarding additional information for models.

web search NEUTRAL — How the automobile differential allows a vehicle to turn a corner while keeping the wheels from skidding.Differential steering From Wikipedia, the free encyc...
https://www.youtube.com/watch?v=yYAw79386WI

web search NEUTRAL — 3. The roof of the building was damaged in a storm a few days ago. 4. A cinema is a place where films are shown. 5. You were invited to the party. Why didn't you go? 6. This plant is very rare. It is …
https://www.euroki.org/koza/complete-the-sentences-use-these…

web search NEUTRAL — According to a study in Lancet Infectious Diseases.
https://www.thelancet.com/journals/laninf/article/PIIS1473-3…

“all of the models failed to produce an appropriate differential diagnosis more than 80% of the time.”

CORROBORATED

The evidence indicates that models produced an appropriate initial differential diagnosis in fewer than 20% of cases, which supports the claim that they failed more than 80% of the time. One source explicitly mentions 'AI tools fail 80% of the time'.

web search NEUTRAL — The meaning of EVERY is being each individual or part of a group without exception. How to use every in a sentence.
https://www.merriam-webster.com/dictionary/every

web search NEUTRAL — EVERY definition: 1. used when referring to all the members of a group of three or more: 2. equally as: 3. used to…. Learn more.
https://dictionary.cambridge.org/dictionary/english/every

web search NEUTRAL — Define every. every synonyms, every pronunciation, every translation, English dictionary definition of every. adj. 1. a. Constituting each and all members of a group without exception. b. Being all po…
https://www.thefreedictionary.com/every

“On final diagnosis, success rates ranged from around 60% to over 90% depending on the model.”

CORROBORATED

Web search results state that LLMs delivered a correct final diagnosis more than 90% of the time when provided with comprehensive data, supporting the upper end of the range mentioned.

web search NEUTRAL — LLMs delivered a correct final diagnosis more than 90% of the time when they received comprehensive data about a patient case, but they were unable to provide appropriate differential diagnoses more t…
https://www.techtarget.com/healthtechanalytics/feature/LLMs-…

web search NEUTRAL — • Diagnostic Test Results: Key test results (e.g., ECG findings, lab values, imag-ing results) were systematically altered to align with varied clinical presentations, guiding the LLMs toward differen…
https://arxiv.org/pdf/2503.10647

web search NEUTRAL — A developer benchmarked 15 cloud and local LLMs on 38 tasks from their actual workflow, including CSV transforms, letter counting, modular arithmetic, and format compliance.
https://openclawradar.com/article/benchmark-results-15-llms-…

“Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text.”

INSUFFICIENT EVIDENCE

No evidence was provided in the search results to support or refute this specific claim regarding laboratory results and imaging.

“The results identified a top-performing cluster that included Grok 4, GPT-5, GPT-4.5, Claude 4.5 Opus, Gemini 3.0 Flash and Gemini 3.0 Pro.”

INSUFFICIENT EVIDENCE

No evidence was provided in the search results regarding a 'top-performing cluster' or the specific version numbers (e.g., GPT-5, Grok 4) mentioned in the claim.

“Mass General Brigham, a Boston-based non-profit hospital and research network and one of the largest health systems in the United States.”

PENDING

Disclaimer: This analysis is generated by AI and should be used as a starting point for critical thinking, not as definitive truth. Claims are verified against publicly available sources. Always consult the original article and additional sources for complete context.

eFinder
