E4782. Testing the Ability of ChatGPT to Generate Differential Diagnoses from Transcribed Radiological Findings in Chest and Cardiac Imaging
  1. Shawn Sun; University of California Irvine Medical Center
  2. Kenneth Huynh; University of California Irvine Medical Center
  3. Gillean Cortes; University of California Irvine Medical Center
  4. Robert Hill; University of California Irvine Medical Center
  5. Julia Tran; University of California Irvine Medical Center
  6. Vahid Yaghmai; University of California Irvine Medical Center
  7. Mark Tran; University of California Irvine Medical Center
The burgeoning interest in ChatGPT as a potentially impactful tool in medicine highlights the necessity of systematically evaluating its capabilities and limitations. Given their ability to ingest large amounts of text and synthesize a seemingly reasonable conclusion, large language models (LLMs) could potentially serve as diagnostic tools. This study aims to quantify ChatGPT's proficiency in formulating differential diagnoses from radiological findings in chest and cardiac radiology and to characterize the extent of its limitations. The contents of this study are one step in the comprehensive evaluation of this artificial intelligence (AI) technology to guide its effective and safe application in medicine.

Materials and Methods:
Fifty-two adult and pediatric chest and cardiac imaging cases were selected from a radiology textbook and converted into standardized prompts describing each case, followed by a query for the most likely diagnosis, the top three differential diagnoses, and explanations with references from the medical literature. Responses generated by the ChatGPT3.5 and ChatGPT4 algorithms were analyzed for accuracy by comparison with the original literature and for reliability through manual verification of the information provided. Top 1 accuracy was defined as the percentage of responses that matched the diagnosis given in the original literature. An additional differential diagnosis score was defined as the proportion of differentials that matched the original literature's answers for each case. Results of the two algorithms were compared using a one-tailed two-proportion z-test. Test-retest reliability was measured over 10 repeats of 10 questions using the average pairwise percent agreement and Krippendorff's alpha, with a code assigned to each unique answer.
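The two statistical procedures above can be sketched as follows. This is a minimal illustration, not the study's actual analysis code: the function names are ours, the counts in the usage example are hypothetical, and the pooled-standard-error form of the z-test is an assumption about the exact variant used.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """One-tailed two-proportion z-test of H1: p2 > p1, using the
    pooled standard error. Returns (z, upper-tail p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return z, 1 - NormalDist().cdf(z)

def avg_pairwise_agreement(runs):
    """Average pairwise percent agreement: runs[r][q] is the answer code
    produced on repeat r for question q. Agreement is computed over all
    pairs of repeats per question, then averaged across questions."""
    n_questions = len(runs[0])
    per_question = []
    for q in range(n_questions):
        answers = [run[q] for run in runs]
        pairs = list(combinations(answers, 2))
        per_question.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_question) / len(per_question)

# Hypothetical example: 30/52 vs 36/52 correct responses.
z, p = two_proportion_z_test(30, 52, 36, 52)
```

Krippendorff's alpha additionally corrects the observed agreement for the agreement expected by chance given the distribution of answer codes, which is why it can be low even when raw percent agreement looks moderate.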

Results:
The top 1 accuracy of ChatGPT3.5 versus ChatGPT4 was 57.7% compared with 69.2% (p = 0.11). The average differential diagnosis score of ChatGPT3.5 versus ChatGPT4 was 48.1% compared with 55.8% (p = 0.21). Hallucinated references were present in 34.2% of ChatGPT3.5 responses versus 9.6% of ChatGPT4 responses (p = 0.001). A total of 10 versus four false statements were generated by ChatGPT3.5 and ChatGPT4, respectively. The pairwise percent agreement and Krippendorff's alpha scores were 73.8% and 0.72 for the ChatGPT3.5 top 1 responses, 39.3% and 0.37 for the ChatGPT3.5 top 3 responses, 90.9% and 0.90 for the ChatGPT4 top 1 responses, and 41.3% and 0.39 for the ChatGPT4 top 3 responses.

Conclusion:
ChatGPT's ability to generate cohesive differential diagnoses from prompts containing descriptive radiological findings is groundbreaking, though its accuracy and reliability still leave much to be desired. The presence of hallucinated statements and the low test-retest scores for the full differential suggest that further development is necessary. Still, ChatGPT and other LLMs have the potential to greatly impact clinical and educational medicine. Knowledge of the accuracy, reliability, and failure modes of these algorithms is critical to understanding the limitations of these new tools.