E4535. Automatic Generation of Lay Summaries of Chest Radiograph Reports Using Large Language Models
  1. Yash Saboo; University of Texas Health San Antonio
  2. Michael De La Rosa; University of Texas Health San Antonio
  3. Aaron Fanous; University of Texas Health San Antonio
  4. Nathaniel Gill; University of Texas Health San Antonio
  5. Christian Lee; University of Texas Health San Antonio
  6. Eri Osta; University of Texas Health San Antonio
  7. Kal Clark; University of Texas Health San Antonio
Most patients have difficulty understanding radiology reports; one study reported that only 4% of radiology reports could be understood by the average American. Current attempts to simplify radiological jargon either do not use artificial intelligence, simplify only the impression section of the report, or target CT and MRI reports rather than radiographs. We used preference testing to evaluate the ability of two large language models (LLMs), Llama-2 and GPT 4.0, to translate artificially generated chest x-ray (CXR) radiology reports into lay language summaries.

Materials and Methods:
We used GPT 4.0 to generate 70 unique CXR radiology reports spanning seven classes of common findings: benign pulmonary nodule, suspicious pulmonary nodule, pleural effusion, pneumonia, pulmonary edema, cardiomegaly, and normal. All 70 reports were validated by a board-certified radiologist. We then generated lay summaries of each report using the default versions of Llama-2 (13B parameters) and GPT 4.0. In a single-blind survey, eight native English speakers of varying medical expertise (no medical training, medical students, and a board-certified radiologist) were shown both the Llama-2 and GPT 4.0 lay summaries for each report and asked to choose their preferred summary or to indicate no preference. No-preference responses were excluded from statistical analysis. We used logistic regression to generate odds ratios (ORs) estimating the likelihood that each model's summary was preferred.

Results:
Of 560 unique responses, GPT 4.0 lay summaries were preferred in 50%, Llama-2 lay summaries in 35%, and 15% indicated no preference. Subjects preferred the GPT 4.0 lay summaries (OR 1.42; 95% CI 1.2-1.7; <em>p</em> < 0.001) over the Llama-2 lay summaries (OR 0.70; 95% CI 0.58-0.84; <em>p</em> < 0.001).
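With no covariates, a logistic regression on a binary preference outcome reduces to the simple odds of preference among the included responses. A minimal sketch, assuming the preference counts are recovered from the reported percentages of 560 responses (these exact counts are an assumption, not taken from the study data), approximately reproduces the reported OR and Wald confidence interval for GPT 4.0:

```python
import math

# Assumed counts derived from the reported percentages of 560 responses:
# 50% preferred GPT 4.0, 35% preferred Llama-2, 15% no preference (excluded).
gpt_preferred = round(0.50 * 560)    # 280
llama_preferred = round(0.35 * 560)  # 196

def odds_ratio_with_wald_ci(a, b, z=1.96):
    """Odds of preferring option A over option B among included responses,
    with an approximate 95% Wald confidence interval on the log odds."""
    log_or = math.log(a / b)
    se = math.sqrt(1 / a + 1 / b)  # standard error of the log odds
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

or_gpt, ci_lo, ci_hi = odds_ratio_with_wald_ci(gpt_preferred, llama_preferred)
print(f"GPT 4.0 OR {or_gpt:.2f} (95% CI {ci_lo:.2f}-{ci_hi:.2f})")
```

Under these assumed counts the computed OR (about 1.43, CI roughly 1.19-1.71) is consistent with the reported 1.42 (1.2-1.7); the Llama-2 OR is simply the reciprocal.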

Conclusion:
Our study demonstrates the efficacy of both GPT 4.0 and Llama-2 in generating lay summaries of CXR reports, with GPT 4.0 modestly outperforming Llama-2. Further validation with larger datasets and prospective clinical trials is needed to fully establish the clinical utility and reliability of this approach. By producing instantaneous, accessible lay summaries, GPT 4.0 has the potential to enhance patient understanding.