E1267. Optimizing Convolutional Neural Network Output Styles for Efficient Chest Radiograph Dictation
  1. Carl Sabottke; University of Arizona Department of Medical Imaging
  2. Jason Lee; University of Arizona Department of Medical Imaging
  3. Raza Mushtaq; University of Arizona Department of Medical Imaging
  4. Bradley Spieler; Louisiana State University Health Sciences Center New Orleans Department of Radiology
Optimizing the integration of radiology artificial intelligence (AI) into radiologist workflows is an ongoing process. Inherent in this process is optimization not just of AI model performance in terms of accuracy or area under the receiver operating characteristic curve (AUROC), but also of how model output design affects the incorporation of a convolutional neural network (CNN) prediction into a radiology report. Radiology reports follow generally accepted stylistic best practices, but mapping CNN predictions to report statements that can be automatically inserted into a dictation software template is not necessarily a trivial endeavor. For example, the public CheXpert dataset contains a generic "Support Devices" label for chest radiographs, but this label is difficult to map to clinically useful report phrases without adding further label granularity. Consequently, the exercise of mapping neural network outputs to specific report finding and impression statements can help highlight clinical shortcomings of AI models during the development process.
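The mapping from a model probability to an insertable report phrase can be sketched as a simple lookup with a confidence band that flags uncertain predictions for radiologist review. The label names, thresholds, and phrases below are illustrative assumptions, not the mapping actually used in this work:

```python
# Hypothetical sketch: map a CNN label probability to a dictation phrase.
# All label names, phrases, and thresholds here are illustrative assumptions.

REPORT_PHRASES = {
    "sternotomy_wires": {
        "positive": "Median sternotomy wires are intact.",
        "negative": "",  # absent devices are typically omitted from the report
    },
    "endotracheal_tube": {
        "positive": "Endotracheal tube is present.",
        "negative": "",
    },
}

def phrase_for(label: str, probability: float,
               threshold: float = 0.5, review_band: float = 0.15) -> str:
    """Return a report phrase, flagging low-confidence predictions for review."""
    if abs(probability - threshold) < review_band:
        return f"[REVIEW: uncertain {label} prediction ({probability:.2f})]"
    key = "positive" if probability >= threshold else "negative"
    return REPORT_PHRASES[label][key]
```

Even this toy mapping surfaces the design questions the abstract raises: a bare presence/absence label cannot generate phrases that mention laterality or tip position, which motivates the finer-grained labels discussed below.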

Materials and Methods:
Using the MIMIC-CXR chest radiograph dataset, which contains radiologist report text for 227,835 studies and 65,379 patients, we generated labels for 14 categories of monitoring and support devices or post-procedural changes that can be readily identified on chest radiographs. These labels were used to train multi-label CNNs using the DenseNet architecture. Label categories include: left ventricular assist devices, intra-aortic balloon pumps, cardiac rhythm devices, prosthetic cardiac valves, neurostimulators, chest ports, dialysis catheters, chest tubes, endotracheal tubes, tracheostomy tubes, endogastric tubes, vertebroplasty changes, spinal fixator hardware, and sternotomy wires. The dataset was partitioned into 98,804 training images, 24,701 validation images, and 150,029 images reserved as a discretionary internal test dataset. Model performance on subsets of the internal test dataset was reviewed for the appropriateness of various report phrase labels as a function of model label prediction and prediction confidence.
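Label generation from report text is commonly done with rule-based matching. The sketch below illustrates the idea for three of the 14 categories; the regular expressions are illustrative assumptions, not the labeling rules actually used for this dataset:

```python
import re

# Hypothetical sketch of rule-based label extraction from report text.
# The patterns are illustrative assumptions and cover only 3 of the
# 14 device/post-procedural categories described in the abstract.

DEVICE_PATTERNS = {
    "sternotomy_wires": re.compile(r"\bsternotomy\b", re.IGNORECASE),
    "endotracheal_tube": re.compile(r"\b(endotracheal|ET)\s+tube\b", re.IGNORECASE),
    "chest_port": re.compile(r"\b(chest\s+port|port-?a-?cath)\b", re.IGNORECASE),
}

def label_report(report_text: str) -> dict:
    """Return a binary label per device category for one report."""
    return {name: int(bool(pattern.search(report_text)))
            for name, pattern in DEVICE_PATTERNS.items()}
```

In practice such rules would also need negation handling (e.g. "no endotracheal tube") and synonym coverage, which is part of why report-derived labels require careful review.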

Results:
Across all 14 categories, average validation dataset accuracy was 97%, with a minimum accuracy of 89.6% for endogastric tubes and a maximum accuracy of 99.8% for neurostimulator devices. Average AUROC was 0.958, with a minimum AUROC of 0.882 for vertebroplasty changes and a maximum AUROC of 0.993 for cardiac rhythm devices. However, once report phrase mapping was considered, redesign of the CNN with increased label granularity was required for at least 4 of the considered support devices in order to accommodate laterality and positioning (e.g., CNN regression predicting endotracheal tube tip position relative to the carina rather than merely predicting endotracheal tube presence).
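One way to add such granularity is to pair a presence classifier with a regression output on a shared feature vector. The sketch below assumes a DenseNet-121-style 1024-dimensional feature input; the head design, feature dimension, and loss strategy are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: augmenting a binary "endotracheal tube present"
# output with a regression output for tube tip-to-carina distance.
# The 1024-dim feature input (as in DenseNet-121) is an assumption.

class ETTubeHead(nn.Module):
    def __init__(self, in_features: int = 1024):
        super().__init__()
        self.presence = nn.Linear(in_features, 1)  # logit: tube present?
        self.distance = nn.Linear(in_features, 1)  # tip position vs. carina (cm)

    def forward(self, features: torch.Tensor):
        return self.presence(features), self.distance(features)

# Training might combine BCEWithLogitsLoss on the presence logit with an
# L1 loss on distance, masking the distance loss when no tube is present.
```

A distance estimate, unlike a bare presence label, maps directly onto phrases such as "endotracheal tube tip 4 cm above the carina" that radiologists actually dictate.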

Conclusion:
Mapping neural network outputs to specific radiologist report language can highlight important considerations in model design. It can also provide insight into how best to develop neural network models to increase radiologist efficiency.