AI fails at primary patient diagnosis more than 80% of the time, study finds
A study published in JAMA Network Open evaluated 21 large language models on their ability to perform clinical reasoning using standardized vignettes. The researchers found that while models could often reach a correct final diagnosis with complete data, they failed to produce appropriate differential diagnoses over 80% of the time, leading authors to conclude that human oversight remains essential for clinical use.
Read the original article: https://euronews.com/health/2026/04/14/ai-fails-at-primary-patient-diagnosis-mor…
Analysis
Propaganda Score: 10% (confidence: 95%)
Low risk. This article shows minimal use of propaganda techniques.
Fact-Check Results
11 claims extracted and verified against multiple sources including cross-references, web search, and Wikipedia.
Corroborated: 6
Insufficient Evidence: 2
Verified: 1
Single Source: 1
Pending: 1
“AI language models fail to produce an appropriate early diagnosis more than 80% of the time”
CORROBORATED
Multiple independent web search results confirm that, according to a new study, AI language models fail to produce an appropriate early diagnosis more than 80% of the time.
wikipedia
NEUTRAL
— Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and dec…
https://en.wikipedia.org/wiki/Artificial_intelligence
wikipedia
NEUTRAL
— .ai is the Internet country code top-level domain (ccTLD) for Anguilla, a British Overseas Territory in the Caribbean. It is administered by the government of Anguilla.
It is a popular domain hack wit…
https://en.wikipedia.org/wiki/.ai
wikipedia
NEUTRAL
— AI commonly refers to artificial intelligence, which is intelligence demonstrated by machines.
Ai, ai, a.i, A.I or AI may also refer to:
https://en.wikipedia.org/wiki/Ai
“AI chatbots... failed to produce an appropriate differential diagnosis more than 80% of the time, according to researchers at Mass General Brigham”
CORROBORATED
Multiple sources explicitly attribute the finding that AI chatbots failed to produce an appropriate differential diagnosis more than 80% of the time to researchers at Mass General Brigham.
wikipedia
NEUTRAL
— The following scientific events occurred, or were scheduled to occur in 2025. The United Nations declared 2025 the International year of quantum science and technology.
https://en.wikipedia.org/wiki/2025_in_science
wikipedia
NEUTRAL
— Phillip T. and Susan M. Ragon Institute is a medical institute founded in 2009 at the Massachusetts General Hospital (MGH) by the funding from founder and CEO of InterSystems Phillip Ragon and his wif…
https://en.wikipedia.org/wiki/Ragon_Institute
wikipedia
NEUTRAL
— Ziad Obermeyer (Arabic: زياد أوبرماير) is a Lebanese American physician and researcher whose work focuses on machine learning, health policy, and clinical decision-making in medicine. He is the Blue C…
https://en.wikipedia.org/wiki/Ziad_Obermeyer
“The results of the study, published in the open-access JAMA Network Open medical journal”
VERIFIED
Web search results explicitly state the study was published in JAMA Network Open and provide a specific journal reference (Rao AS, et al., 2026).
wikipedia
NEUTRAL
— Chatbot psychosis, also called AI psychosis, is a phenomenon where in individuals reportedly develop or experience worsening psychosis, such as paranoia and delusions, in connection with their use of …
https://en.wikipedia.org/wiki/Chatbot_psychosis
wikipedia
NEUTRAL
— In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting, confabulation, or delusion) is a response generated by AI that contains false or mi…
https://en.wikipedia.org/wiki/Hallucination_(artificial_inte…
wikipedia
NEUTRAL
— This is a list of free and open-source software (FOSS) packages, computer software licensed under free software licenses and open-source licenses. Software that fits the Free Software Definition may b…
https://en.wikipedia.org/wiki/List_of_free_and_open-source_s…
“The research team analysed the functioning of 21 LLMs, including the latest available versions of Claude, DeepSeek, Gemini, GPT and Grok.”
CORROBORATED
Web search results confirm the research team analyzed 21 LLMs, including Claude, DeepSeek, Gemini, GPT, and Grok.
wikipedia
NEUTRAL
— Arena (formerly LMArena and Chatbot Arena) is a public, web-based platform that evaluates large language models (LLMs). Users enter prompts for two anonymous models to respond to and vote on the model…
https://en.wikipedia.org/wiki/LMArena
wikipedia
NEUTRAL
— A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trai…
https://en.wikipedia.org/wiki/List_of_large_language_models
wikipedia
NEUTRAL
— Claude is a series of large language models developed by American software company Anthropic. Claude was released as a AI chatbot in March 2023. It is also used in AI-assisted software development.
Cl…
https://en.wikipedia.org/wiki/Claude_(language_model)
“They evaluated the LLMs on 29 standardised clinical vignettes using a newly developed tool called PrIME-LLM.”
CORROBORATED
Multiple sources confirm the use of 29 standardized clinical vignettes and the development of the PrIME-LLM tool for evaluation.
wikipedia
NEUTRAL
— AI slop (also known as slop content or simply as slop) is digital content made with generative artificial intelligence that is perceived as lacking in effort, quality, or meaning, and produced in high…
https://en.wikipedia.org/wiki/AI_slop
wikipedia
NEUTRAL
— ChatGPT is a generative artificial intelligence chatbot developed by OpenAI. Originally released in November 2022, the product uses large language models—specifically generative pre-trained transforme…
https://en.wikipedia.org/wiki/ChatGPT
wikipedia
NEUTRAL
— Gemini (also known as Google Gemini and formerly known as Bard) is a generative artificial intelligence chatbot and virtual assistant developed by Google. It is powered by the family of large language…
https://en.wikipedia.org/wiki/Google_Gemini
“the models were given additional information so that they could proceed to the next stage even if they failed at the differential diagnosis step.”
SINGLE SOURCE
The provided evidence for this claim consists of irrelevant search results about automobile differentials and English grammar, with no mention of the study's methodology regarding additional information for models.
web search
NEUTRAL
— How the automobile differential allows a vehicle to turn a corner while keeping the wheels from skidding. Differential steering From Wikipedia, the free encyc...
https://www.youtube.com/watch?v=yYAw79386WI
web search
NEUTRAL
— 3. The roof of the building was damaged in a storm a few days ago. 4. A cinema is a place where films are shown. 5. You were invited to the party. Why didn't you go? 6. This plant is very rare. It is …
https://www.euroki.org/koza/complete-the-sentences-use-these…
web search
NEUTRAL
— According to a study in Lancet Infectious Diseases.
https://www.thelancet.com/journals/laninf/article/PIIS1473-3…
“all of the models failed to produce an appropriate differential diagnosis more than 80% of the time.”
CORROBORATED
The evidence indicates that models produced an appropriate initial differential diagnosis in fewer than 20% of cases, which supports the claim that they failed more than 80% of the time. One source explicitly mentions 'AI tools fail 80% of the time'.
web search
NEUTRAL
— The meaning of EVERY is being each individual or part of a group without exception. How to use every in a sentence.
https://www.merriam-webster.com/dictionary/every
web search
NEUTRAL
— EVERY definition: 1. used when referring to all the members of a group of three or more: 2. equally as: 3. used to…. Learn more.
https://dictionary.cambridge.org/dictionary/english/every
web search
NEUTRAL
— Define every. every synonyms, every pronunciation, every translation, English dictionary definition of every. adj. 1. a. Constituting each and all members of a group without exception. b. Being all po…
https://www.thefreedictionary.com/every
“On final diagnosis, success rates ranged from around 60% to over 90% depending on the model.”
CORROBORATED
Web search results state that LLMs delivered a correct final diagnosis more than 90% of the time when provided with comprehensive data, supporting the upper end of the range mentioned.
web search
NEUTRAL
— LLMs delivered a correct final diagnosis more than 90% of the time when they received comprehensive data about a patient case, but they were unable to provide appropriate differential diagnoses more t…
https://www.techtarget.com/healthtechanalytics/feature/LLMs-…
web search
NEUTRAL
— • Diagnostic Test Results: Key test results (e.g., ECG findings, lab values, imaging results) were systematically altered to align with varied clinical presentations, guiding the LLMs toward differen…
https://arxiv.org/pdf/2503.10647
web search
NEUTRAL
— A developer benchmarked 15 cloud and local LLMs on 38 tasks from their actual workflow, including CSV transforms, letter counting, modular arithmetic, and format compliance.
https://openclawradar.com/article/benchmark-results-15-llms-…
“Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text.”
INSUFFICIENT EVIDENCE
No evidence was provided in the search results to support or refute this specific claim regarding laboratory results and imaging.
“The results identified a top-performing cluster that included Grok 4, GPT-5, GPT-4.5, Claude 4.5 Opus, Gemini 3.0 Flash and Gemini 3.0 Pro.”
INSUFFICIENT EVIDENCE
No evidence was provided in the search results regarding a 'top-performing cluster' or the specific version numbers (e.g., GPT-5, Grok 4) mentioned in the claim.
“Mass General Brigham, a Boston-based non-profit hospital and research network and one of the largest health systems in the United States.”
PENDING
Disclaimer: This analysis is generated by AI and should be used as a starting point for critical thinking, not as definitive truth. Claims are verified against publicly available sources. Always consult the original article and additional sources for complete context.