ChatGPT inaccurately diagnoses over 80% of pediatric medical cases
• ChatGPT misdiagnosed medical conditions 83% of the time in a pediatric study.
• Researchers say, though, that a more finely trained GPT could probably do significantly better.
• ChatGPT currently misses links between symptoms and medical conditions.
A recent study published in JAMA Pediatrics has found that OpenAI’s ChatGPT may need to go back to medical school, as it failed to diagnose 83% of hypothetical child medical cases. The study, conducted at Cohen Children’s Medical Center in New York, analyzed the language model’s answers to various inquiries regarding the diagnosis of pediatric illnesses, only to discover an alarming error rate.
Researchers studied 100 medical cases known as pediatric case challenges. These were originally posed to physicians as diagnostic challenges with limited or unconventional information. The challenges sampled were published in JAMA Pediatrics and the New England Journal of Medicine (NEJM) over a ten-year span (2013 to 2023).
Researchers pasted text from the medical cases into a prompt. From there, two physician researchers reviewed ChatGPT’s responses, scoring each as “correct,” “incorrect,” or “did not fully capture the diagnosis.”
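To illustrate how an evaluation loop of this kind might be reproduced programmatically, here is a minimal sketch assuming the OpenAI Python client. The model name, prompt wording, and case text are placeholders rather than the study’s actual materials; the researchers themselves pasted case text into ChatGPT and graded its responses by hand.

```python
# A minimal, hypothetical sketch of an evaluation loop like the one described above.
# Assumes the OpenAI Python client; model name, prompt wording, and case text
# are placeholders, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are assisting with a pediatric diagnostic case challenge.\n"
    "Based on the case below, state the single most likely diagnosis.\n\n"
    "Case:\n{case_text}"
)

def get_chatbot_diagnosis(case_text: str) -> str:
    """Send one case to the chat model and return its free-text diagnosis."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(case_text=case_text)}],
    )
    return response.choices[0].message.content

# In the study, two physician researchers then scored each response by hand as
# "correct", "incorrect", or "did not fully capture the diagnosis".
```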
Out of the 100 cases studied, ChatGPT gave a completely incorrect diagnosis 72 times. While a further 11 responses were considered “clinically related” to the correct diagnosis, the answers were deemed too broad to count as accurate. In total, therefore, 83% of diagnoses were incorrect to some significant degree.
Of the 83 incorrect diagnoses, though, 57% were at least in the same organ system, which shows promise, albeit promise that is nowhere near reliable enough to be used in live cases. For instance, ChatGPT could identify a general symptom shared by various medical conditions without pinpointing the precise ailment.
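For readers following the numbers, the headline figures combine as follows. This is a quick sketch using the counts reported above; the same-organ-system count is an approximation derived from the reported 57%.

```python
# Reproducing the headline arithmetic from the counts reported in the study.
total_cases = 100
fully_incorrect = 72      # completely wrong diagnoses
too_broad = 11            # "clinically related" but too broad to count as correct

incorrect_total = fully_incorrect + too_broad      # 83
error_rate = incorrect_total / total_cases         # 0.83, i.e. the reported 83%

same_organ_share = 0.57                            # reported share of the 83 misses
same_organ_cases = round(incorrect_total * same_organ_share)  # roughly 47 cases

print(f"Overall error rate: {error_rate:.0%}")
print(f"Misses in the right organ system: about {same_organ_cases} of {incorrect_total}")
```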
Why did ChatGPT fail so badly on medical diagnosis?
Although ChatGPT continues to advance, it still has its flaws. Researchers believe the generative AI model is unable to ascertain the connections between specific conditions and preexisting or external factors, which are typically used in clinical diagnosis. That, they say, is why ChatGPT fails to accurately diagnose certain medical conditions.
One example from the study was ChatGPT’s failure to link neuropsychiatric conditions, such as autism, to vitamin deficiency (in this case scurvy, caused by a lack of vitamin C) and other conditions related to restrictive diets. ChatGPT instead diagnosed a rare autoimmune condition.
In another instance, ChatGPT diagnosed a branchial cleft cyst, a lump in the neck or below the collarbone, when the correct diagnosis was in fact Branchio-oto-renal syndrome, a genetic condition that causes abnormal tissue development in the neck.
Close, but no artificial cigar.
It’s plain to see that ChatGPT failed miserably in this test, but researchers still have hope for the AI model. They believe improvements could be seen if ChatGPT were trained selectively and specifically on trustworthy and accurate medical literature – a kind of MedicGPT. Currently, it is trained on information garnered online, which can often pump misinformation into its musings. After all, would you let “the internet as a whole” diagnose your illness? Researchers also believe that increased real-time access to accurate medical data could help improve AI chatbots.
The authors of the study concluded that “this presents an opportunity for researchers to investigate if specific medical data training and tuning can improve the diagnostic accuracy of LLM-based chatbots.” In the meantime, going down the old-fashioned route of talking to a medical professional seems the best option.
Previous efficacy studies showed promising results
This isn’t the first time AI-based chatbots have been studied for their efficacy in diagnosing certain medical conditions. Generative AI chatbots generally rely on large language models (LLMs), trained on substantial amounts of text data, to understand and generate human-like language. The technology is rapidly advancing, as was evident in a 2023 study which concluded that generative AI could pass the three-part United States Medical Licensing Exam. Such findings raise hopes that generative AI chatbots could be used as digital assistants to physicians and as aids to clinical decision support systems.
Significant criticism of AI’s training limitations and its potential to amplify medical biases remains, but the American Medical Association, along with many other medical organizations, does not perceive AI’s progress as a threat in terms of replacing medical staff. There is instead optimism surrounding well-trained AI, with many believing it has significant potential for communicative and administrative tasks within the medical industry. For instance, it could be used to explain diagnoses in simpler terms, helping patients understand their cases with greater ease.
Nevertheless, the application of AI in clinical settings, particularly in diagnostics, continues to be a contentious and challenging area of research.
This latest study may show generative AI’s shortcomings, but it is also the first report of its kind to focus solely on pediatric medical cases. Further research is required across various medical fields before AI can be considered a trustworthy and accurate tool that doctors can rely on. Currently, AI has limitations, and even the most advanced publicly accessible models fall short of matching the breadth of human expertise.
AI has the potential to cut administrative burdens
AI may have its issues, but with its continued advancement, medical professionals have already been testing its efficacy for tasks such as drafting news releases. For instance, Dr John Halamka, MD, MS, president of the Mayo Clinic Platform, used ChatGPT to create a news release that was “perfect, eloquent and compelling.” The bad news: Dr Halamka said the information was “totally wrong,” so he had to correct every factual claim.
Nevertheless, Dr Halamka was able to finish the task in just five minutes, considerably less than the hour or so he noted the job typically takes.
AI may not be at a level to diagnose and provide treatment plans for medical cases, but it has the potential to be used for administrative purposes. Those purposes include generating text that humans can edit to ensure the facts are correct, and producing CPT codes from operative reports (early reports have found generative AI can complete this with some accuracy). The result? A reduction in clerical burdens. Considering the wave of resignations in medicine over the last few years, a technological reduction of that burden could prove to be a turning point for the industry.
AI is already being used by hundreds of companies in the fields of healthcare, pharmaceuticals, and technology, which apply AI systems to a wide range of research problems. For instance, AiCure in New York City utilizes “video, audio, and behavioral data to better understand the connection between patients, disease and treatment.” In Amsterdam, Netherlands, clinician-oriented Aidence uses AI systems for radiologists to help improve “diagnostics for the treatment of lung cancer.” Bot MD in Singapore builds AI chatbots to “answer clinical questions, transcribe dictated case notes and automatically organize images and files.”
In recent years, AI has been applied to predict emerging Covid-19 hotspots and to analyze flight traveler data to help combat the coronavirus. Companies such as Apple, Google, and BlueDot are combining AI, data analysis, and machine learning to build platforms that support disease control by identifying outbreaks and notifying people who may have been exposed to a virus.
The potential of AI in global healthcare is limitless, but we are still a long way off replacing qualified medical professionals with AI bots. For now, AI is a promising tool, and one that will likely benefit modern medicine in years to come, rather than pose a hindrance.