AI fails to reliably detect pediatric pneumonia on X-ray

New research underscores the risks of using publicly available large language models in clinical settings.

Media Contact: Barbara Clements, bac60@uw.edu, 253-740-5043


Pneumonia remains one of the leading causes of illness and death among children worldwide. A study published Sept. 17 in Cureus finds that today’s widely available artificial intelligence models are not up to the task of diagnosing the condition from chest X-rays.

Researchers tested four popular large language models — ChatGPT, Claude 3.7 Sonnet, Gemini 2.5 Pro, and Grok 3 — on a set of pediatric chest radiographs previously confirmed by human experts to indicate bacterial pneumonia, viral pneumonia or no abnormality.

Despite growing interest in applying AI to medical imaging, the models performed no better than chance, essentially a 50-50 toss-up.

“These out-of-the-box AI tools really don’t seem to do well at all,” said corresponding author Dr. Thomas Heston, clinical assistant professor in family medicine at the University of Washington School of Medicine in Seattle. “Overall, even if you take an average, it’s no better than a coin flip.”

Across all models and image sets, diagnostic accuracy averaged just 31%. Performance was somewhat better in identifying viral pneumonia (54%) but far worse in ruling out disease, with only 18% accuracy for normal X-rays. Internal consistency, a model’s ability to give the same answer twice when shown the same image, ranged from 46% to 71%, further undermining reliability, the authors noted. No model matched human experts’ interpretations more than half of the time.

“General-purpose AI tools available to the public are not ready to independently diagnose pediatric pneumonia from imaging,” the authors noted. “Their low accuracy and inconsistent responses highlight serious risks if deployed in unsupervised clinical or consumer-facing settings.” 

The study serves as a warning that using these systems is “unadvisable for medical imaging,” Heston said.

It’s natural for parents to want a second look at their child’s X-rays. Some try using AI to “see if there is anything the doctor missed,” Heston said, adding that AI is more likely to create confusion than provide helpful information.

The study was cut short after it became clear the models were performing far below expert level. The authors stressed that AI still has potential in medicine, but progress will require purpose-built tools trained on large, diverse medical datasets and kept under clinical oversight.

Last year, Heston published two other studies on the use of AI for medical guidance. A study published in PLOS One found that the ChatGPT-4 large language model performed poorly against two standard tools that doctors use to predict the risk of a cardiac event.

Another study examined the use of chatbots by people seeking mental health counseling and found that users should be wary of advice received through online chatbots.

 

For details about UW Medicine, please visit https://uwmedicine.org/about.


Tags: AI (artificial intelligence), radiology, pneumonia, pediatrics
