E4910. Limitations of GPT-3.5 and GPT-4 in Applying Fleischner Society Guidelines for Incidental Lung Nodules
  1. Joel Gamble; Department of Radiology, University of British Columbia
  2. Duncan Ferguson; Department of Radiology, University of British Columbia
  3. Joanna Yuen; Department of Radiology, University of British Columbia
  4. Adnan Sheikh; Department of Radiology, University of British Columbia
Purpose:
This study aimed to compare the accuracy of GPT-3.5 and GPT-4 in applying Fleischner Society recommendations for incidental lung nodules and to assess whether fine-tuning improves GPT-3.5’s performance.

Materials and Methods:
We generated 10 lung nodule descriptions for each of the 12 nodule categories in the 2017 Fleischner Society guidelines, incorporating each into a single fictitious report of an otherwise normal CT pulmonary angiogram (n = 120). GPT-3.5 and GPT-4 were prompted to act as a radiologist and make specific recommendations for the nodules according to the 2017 Fleischner Society guidelines. We then incorporated the full guidelines into the prompts and resubmitted them to both models. Finally, we resubmitted the prompts to a fine-tuned GPT-3.5 model trained on 600 nodules with their correct recommendations. We compared each model’s recommendations to the Fleischner Society recommendations using binary accuracy analysis and performed statistical analysis in R, using binomial tests for confidence intervals and McNemar’s test for paired comparisons.
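The statistical analysis described above can be sketched as follows. This is a minimal illustration in Python (the study itself used R); the per-nodule correctness vectors are hypothetical placeholders, not the study data.

```python
from scipy.stats import binomtest

# Hypothetical per-nodule correctness vectors (1 = recommendation matched
# the Fleischner guideline); illustrative placeholders, NOT the study data.
gpt35 = [1] * 7 + [0] * 113   # 7/120 correct (illustrative)
gpt4 = [1] * 18 + [0] * 102   # 18/120 correct (illustrative)

# Exact binomial test gives a 95% confidence interval for accuracy.
ci = binomtest(sum(gpt35), len(gpt35)).proportion_ci(0.95)
print(f"GPT-3.5 accuracy: {sum(gpt35) / len(gpt35):.3f} "
      f"(95% CI: {ci.low:.2f}-{ci.high:.2f})")

# McNemar's test compares paired accuracies on the same 120 nodules:
# only discordant pairs (one model right, the other wrong) matter.
b = sum(1 for x, y in zip(gpt35, gpt4) if x == 0 and y == 1)
c = sum(1 for x, y in zip(gpt35, gpt4) if x == 1 and y == 0)
p = binomtest(min(b, c), b + c, 0.5).pvalue  # exact McNemar's test
print(f"McNemar p = {p:.4f}")
```

Because both models were evaluated on the same 120 nodules, McNemar's paired test is the appropriate comparison rather than an unpaired test of two proportions.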

Results:
GPT-3.5’s accuracy in applying the Fleischner Society guidelines was 0.058 (95% CI: 0.02–0.12). GPT-4’s accuracy was modestly but significantly higher at 0.15 (95% CI: 0.09–0.23; p = 0.02). In recommending PET-CT and/or biopsy, both GPT-3.5 and GPT-4 had an F-score of 0.00. With the Fleischner Society guidelines explicitly included in the prompt, accuracy improved to 0.42 for GPT-3.5 (95% CI: 0.33–0.51; p < 0.001) and 0.66 for GPT-4 (95% CI: 0.57–0.74; p < 0.001), and GPT-4 remained significantly more accurate than GPT-3.5 (p < 0.001). The fine-tuned GPT-3.5 model’s accuracy was 0.46 (95% CI: 0.37–0.55), not significantly different from that of GPT-3.5 with the guidelines included in the prompt (p = 0.53).

Conclusion:
GPT-3.5 and GPT-4 performed poorly in applying widely known, consensus-based guidelines to the findings in radiology reports; neither model ever correctly recommended biopsy. Errors in both factual knowledge and algorithmic reasoning contributed to the poor performance. Although GPT-4 was more accurate than GPT-3.5, its error rate remains unacceptable for clinical practice. These results underscore the limitations of large language models for knowledge- and reasoning-based tasks.