E4740. Test-Retest Repeatability and Accuracy of ChatGPT Diagnosis of Transcribed Radiological Findings in Abdominal Radiology
  1. Shawn Sun; University of California Irvine Medical Center
  2. Kenneth Huynh; University of California Irvine Medical Center
  3. Gillean Cortes; University of California Irvine Medical Center
  4. Robert Hill; University of California Irvine Medical Center
  5. Julia Tran; University of California Irvine Medical Center
  6. Vahid Yaghmai; University of California Irvine Medical Center
  7. Mark Tran; University of California Irvine Medical Center
The burgeoning interest in ChatGPT as a potentially impactful tool in medicine highlights the necessity of systematically evaluating its capabilities and limitations. Given their ability to ingest large amounts of text and synthesize a seemingly reasonable conclusion, large language models (LLMs) could potentially be used as diagnostic tools. This paper aims to quantify ChatGPT’s proficiency in formulating differential diagnoses from radiological findings in abdominal radiology and to characterize the extent of its limitations. This study is one step in the comprehensive evaluation of this AI technology to guide its effective and safe application in the medical field.

Materials and Methods:
Seventy gastrointestinal and genitourinary imaging cases were selected from a radiology textbook and converted into standardized prompts describing each case, with a query for the most likely diagnosis, the top three differential diagnoses, and explanations and references from the medical literature. Responses generated by the ChatGPT3.5 and ChatGPT4 algorithms were analyzed for accuracy by comparison with the original literature and for reliability through manual verification of the information provided. Top 1 accuracy was defined as the percentage of responses that matched the diagnosis given by the original literature. An additional differential diagnosis score was defined as the proportion of differentials that matched the original literature’s answers for each case. Results of the two algorithms were compared using a one-tailed two-proportion z-test. Test-retest repeatability was measured over 10 repeats of 10 cases using the average pairwise percent agreement and Krippendorff’s alpha, with a code assigned to each unique answer.
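The two repeatability statistics can be sketched as follows. This is an illustrative implementation, not the study's code; it assumes complete data (every case coded in every repeat) and uses the nominal-data form of Krippendorff's alpha:

```python
from itertools import combinations
from collections import Counter

def pairwise_percent_agreement(responses):
    """Average, over cases, of the fraction of repeat pairs that
    produced the same answer code. `responses` maps each case to
    its list of coded answers across repeats."""
    per_case = []
    for answers in responses.values():
        pairs = list(combinations(answers, 2))
        per_case.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_case) / len(per_case)

def krippendorff_alpha_nominal(responses):
    """Krippendorff's alpha for nominal codes with no missing values,
    computed from the coincidence matrix: alpha = 1 - D_o / D_e."""
    coincidences = Counter()   # observed code-code coincidences
    totals = Counter()         # marginal count of each code
    for answers in responses.values():
        m = len(answers)
        counts = Counter(answers)
        for c, n_c in counts.items():
            totals[c] += n_c
            for k, n_k in counts.items():
                pairs = n_c * (n_c - 1) if c == k else n_c * n_k
                coincidences[(c, k)] += pairs / (m - 1)
    n = sum(totals.values())
    d_observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    d_expected = sum(totals[c] * totals[k]
                     for c in totals for k in totals if c != k) / (n * (n - 1))
    if d_expected == 0:        # every repeat gave the same single code
        return 1.0
    return 1 - d_observed / d_expected
```

Alpha corrects percent agreement for chance: identical answer distributions produced at random score near 0, not near the raw agreement rate, which is why the two statistics are reported together.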

Results:
The top 1 accuracy of ChatGPT3.5 versus ChatGPT4 was 35.7% compared to 51.4% (p = 0.031). The average differential diagnosis score of ChatGPT3.5 versus ChatGPT4 was 42.4% compared to 44.7% (p = 0.39). Hallucinated references were present in 38.2% versus 18.8% of ChatGPT3.5’s and ChatGPT4’s responses, respectively (p = 0.001). A total of 23 versus 4 false statements were generated by ChatGPT3.5 and ChatGPT4, respectively. The pairwise percent agreement and Krippendorff’s alpha scores were 68.9% and 0.67 for ChatGPT3.5 top 1 responses, 37.6% and 0.35 for ChatGPT3.5 top 3 responses, 80.2% and 0.79 for ChatGPT4 top 1 responses, and 31.1% and 0.29 for ChatGPT4 top 3 responses.
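The top 1 comparison can be reproduced with a one-tailed two-proportion z-test. The raw counts are not reported, so the values 25/70 and 36/70 below are assumptions back-derived from the stated 35.7% and 51.4% over 70 cases:

```python
from math import erf, sqrt

def one_tailed_two_proportion_z(x1, n1, x2, n2):
    """One-tailed two-proportion z-test with pooled standard error;
    the alternative hypothesis is that the second proportion is larger."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Upper-tail p-value from the standard normal CDF
    return z, 1 - 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical counts consistent with 35.7% and 51.4% of 70 cases
z, p = one_tailed_two_proportion_z(25, 70, 36, 70)
```

Under these assumed counts the test yields p of roughly 0.03, in line with the reported p = 0.031.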

Conclusion:
ChatGPT’s ability to generate cohesive differential diagnoses from prompts containing descriptive radiological findings is groundbreaking, though its accuracy and reliability still leave much to be desired. The presence of hallucinated statements and the low test-retest statistical scores suggest that further development is necessary. Still, ChatGPT and large language models (LLMs) have the potential to greatly impact clinical and educational medicine. Knowledge of the accuracy, reliability, and failure modes of these algorithms is critical to understanding the limitations of these new tools.