E4587. Chat-tastic Diagnoses: Assessing Generative AI in Producing Accurate Differential Diagnoses From Musculoskeletal Imaging Findings
  1. Robert Hill; University of California Irvine
  2. Shawn Sun; University of California Irvine
  3. Kenneth Huynh; University of California Irvine
  4. Gillean Cortes; University of California Irvine
  5. Roozbeh Houshyar; University of California Irvine
  6. Vahid Yaghmai; University of California Irvine
  7. Mark Tran; University of California Irvine
Purpose:
To assess the accuracy and repeatability of ChatGPT in generating a differential diagnosis from transcribed imaging findings of specific musculoskeletal radiology cases.

Materials and Methods:
A sample of 50 musculoskeletal cases was selected from a radiology textbook, and the textbook answers were used as the gold standard. The history and imaging findings of each case were converted into standardized prompts querying the most likely diagnosis, the top three differential diagnoses, and the corresponding explanations and references from the medical literature. These prompts were fed into the ChatGPT3.5 and ChatGPT4 algorithms. Generated responses were analyzed for accuracy by comparison with the original literature and for reliability through manual verification of the generated explanations and citations. Top 1 accuracy and top 3 accuracy were defined as the percentage of generated responses that matched, respectively, the actual diagnosis and the differential provided by the original literature for each case. Results of the two algorithms were compared using a one-tailed two-proportion z-test. Test-retest reliability was measured over 10 repeats of 10 cases using the average pairwise percent agreement and Krippendorff's alpha, with a code assigned to each unique answer.
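The two test-retest metrics described above can be sketched as follows. This is an illustrative implementation, not the authors' code: each repeated run is represented as a list of codes (one per case), average pairwise percent agreement is computed over all pairs of runs, and Krippendorff's alpha uses the standard nominal-level coincidence-matrix formulation for complete data.

```python
from collections import Counter
from itertools import combinations

def pairwise_percent_agreement(runs):
    """Average fraction of cases on which each pair of runs agrees.

    runs: list of repeated runs, each a list of codes (one per case).
    """
    n_cases = len(runs[0])
    agreements = [
        sum(a[i] == b[i] for i in range(n_cases)) / n_cases
        for a, b in combinations(runs, 2)
    ]
    return sum(agreements) / len(agreements)

def krippendorff_alpha_nominal(runs):
    """Nominal-level Krippendorff's alpha for complete data
    (every run codes every case)."""
    m = len(runs)            # number of repeats per case
    n_cases = len(runs[0])
    n = m * n_cases          # total pairable values
    # Observed disagreement: mismatched within-case pairs, weighted by
    # 1/(m - 1) per the coincidence-matrix construction.
    d_o = 0.0
    for i in range(n_cases):
        counts = Counter(run[i] for run in runs)
        mismatched_pairs = m * m - sum(c * c for c in counts.values())
        d_o += mismatched_pairs / (m - 1)
    d_o /= n
    # Expected disagreement from the overall distribution of codes.
    totals = Counter(code for run in runs for code in run)
    d_e = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))
    return 1.0 - d_o / d_e
```

Perfect repeatability yields an agreement of 1.0 and an alpha of 1.0; alpha near 0 indicates agreement no better than chance given the overall code distribution.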

Results:
The top 1 and top 3 accuracies for ChatGPT3.5 versus ChatGPT4 were 48.0% versus 70.0% (p = 0.013) and 8.0% versus 12.0% (p = 0.251), respectively. ChatGPT3.5 and ChatGPT4 hallucinated 44.3% and 21.3% (p = 0.00001) of the references provided and generated 9 and 1 false statements in total, respectively. For the ChatGPT3.5 top 1 responses, the pairwise percent agreement and Krippendorff's alpha were 77.8% and 0.761; for the top 3 responses, 31.8% and 0.30.
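The reported accuracy comparisons can be re-derived with a pooled one-tailed two-proportion z-test, assuming 50 cases per model (so 48.0% and 70.0% correspond to 24/50 and 35/50 correct). A minimal sketch, not taken from the paper:

```python
import math

def one_tailed_two_prop_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test of H1: p2 > p1.

    Returns the upper-tail p-value from the standard normal distribution.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Survival function of the standard normal via the complementary
    # error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

p_top1 = one_tailed_two_prop_z(24, 50, 35, 50)  # 48% vs 70% -> ~0.013
p_top3 = one_tailed_two_prop_z(4, 50, 6, 50)    # 8% vs 12%  -> ~0.25
```

Under these assumed counts the test reproduces the reported top 1 p-value of 0.013, and gives roughly 0.25 for the top 3 comparison, in line with the reported 0.251.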

Conclusion:
The ability of generative AI such as ChatGPT to produce differential diagnoses from radiological findings could substantially benefit medical education and clinical practice. Although the accuracy and reliability of the generated responses are currently lacking, the statistically significant improvement from one generation to the next shows promising potential for the future use of these tools. However, the presence of AI hallucinations and the low test-retest scores highlight areas that must be addressed before implementation in clinical and educational settings. Knowledge of the accuracy and failure modes of these algorithms will provide a better understanding of their limitations.