E5337. GPT as a Tool for Grading Blunt Cerebrovascular Injury: A Feasibility Study
Authors
Kyle Costenbader;
University of Maryland Medical Center
Elham Beheshtian;
University of Maryland Medical Center
Elana Smith;
University of Maryland Medical Center
Thomas Ptak;
University of Maryland Medical Center
Clint Sliker;
University of Maryland Medical Center
Jean Jeudy;
University of Maryland Medical Center
Objective:
Numerous classification systems in radiology help standardize reporting and provide prognostic information on injury severity. Descriptions provided by radiologists could be quickly categorized according to these scales with large language models such as the generative pretrained transformer (GPT), thereby improving productivity, reducing cognitive burden, and enhancing communication with clinicians, who rely on the injury grade to determine treatment. The purpose of this study is to evaluate the capability of GPT to grade blunt cerebrovascular injury (BCVI) descriptions according to the Biffl classification system.
Materials and Methods:
Fifty synthetic BCVI descriptions were generated by providing GPT v3.5 with a few-shot prompt, in order to assess GPT's baseline ability to process the Biffl classification scheme. These reports were regraded by GPT and independently graded by two board-certified trauma radiologists. Additionally, radiology reports describing BCVI were retrospectively collected from our level I trauma center by searching the reporting database for reports between June 2018 and July 2023 in which BCVI was mentioned in the impression. Vascular findings were isolated, and protected health information was removed. Reports were divided into single-vessel and multivessel descriptions and input into GPT with a few-shot prompt containing grading instructions, including the reference Biffl scale. Reports were independently graded by the same two trauma radiologists, and discrepancies were adjudicated by a third. Concordance between GPT and the trauma radiologists was assessed as percentage agreement. Cohen's kappa was calculated to evaluate the level of agreement between GPT and the consensus radiologist grading. The recall and precision of GPT relative to the radiologists were summarized by the F1 score.
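The agreement statistics used above can be sketched in plain Python. This is a minimal illustration of the metrics themselves (percentage agreement, unweighted Cohen's kappa, and support-weighted F1); the grade lists below are invented for demonstration and are not the study data.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of cases where the two raters assign the same grade."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    pe = sum(ca[l] * cb[l] for l in labels) / (n * n)   # chance agreement
    return (po - pe) / (1 - pe)

def weighted_f1(truth, pred):
    """F1 computed per grade, then averaged weighted by each grade's support."""
    labels = set(truth) | set(pred)
    n = len(truth)
    total = 0.0
    for l in labels:
        tp = sum(t == l and p == l for t, p in zip(truth, pred))
        fp = sum(t != l and p == l for t, p in zip(truth, pred))
        fn = sum(t == l and p != l for t, p in zip(truth, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += f1 * sum(t == l for t in truth) / n
    return total

# Hypothetical Biffl grades (I-V as 1-5): radiologist consensus vs. GPT output.
consensus = [1, 1, 2, 2, 3, 4, 1, 2, 3, 5]
gpt_grade = [1, 1, 2, 2, 3, 4, 1, 2, 4, 5]
print(round(percent_agreement(consensus, gpt_grade), 2))  # 0.9
print(round(cohens_kappa(consensus, gpt_grade), 2))       # 0.87
print(round(weighted_f1(consensus, gpt_grade), 2))        # 0.9
```

In practice a library implementation (e.g., scikit-learn's `cohen_kappa_score` and `f1_score`) would typically be used; the hand-rolled versions here only make the definitions explicit.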
Results:
Grading concordance between GPT and the trauma radiologists was 96% for the synthetic dataset (Cohen's kappa 0.97). We retrieved 148 reports, including 135 single-vessel and 13 multivessel injuries. Grading concordance for single-vessel and multivessel injury reports measured 93% and 85%, respectively (Cohen's kappa 0.97 and 0.94; weighted-average F1 scores 0.98 and 0.93).
Conclusion:
GPT generated a high-quality synthetic dataset of BCVI descriptions and demonstrated 96% concordance with radiologist grading of BCVI. Performance on real datasets ranged from 93% for single-vessel injuries to 85% for multivessel injuries. GPT demonstrates promising capabilities to assist radiologists with classification schemes, which could improve productivity by saving time and reducing cognitive burden.