ARRS 2022 Abstracts


ERS3355. Is AI Fair? Evaluation of Bias in a State-of-the-Art Bone Age AI Model
Authors * Denotes Presenting Author
  1. Elham Beheshtian *; University of Maryland Medical Center
  2. Samantha Santomartino; University of Maryland Medical Center
  3. Kristin Putman; University of Maryland Medical Center
  4. Vishwa Parekh; Johns Hopkins Medicine
  5. Paul Yi; University of Maryland Medical Center

Objective:
Artificial intelligence (AI) and deep learning using convolutional neural networks (CNNs) have shown great potential to perform specific radiologic tasks such as predicting pediatric bone age. However, AI models in radiology have demonstrated bias against certain demographic groups that may not have been represented in their original training data. The purpose of this study was to evaluate a state-of-the-art bone age AI model for bias when tested on a diverse external test dataset.

Materials and Methods:
We evaluated a publicly available CNN-based bone age algorithm that won the 2017 RSNA Pediatric Bone Age Challenge (concordance of 0.991 with radiologist ground truth). The algorithm was originally trained on 12,612 pediatric hand radiographs from two US hospitals; demographic representation in those datasets was not reported. We applied the algorithm to an external test dataset of 1,201 pediatric hand radiographs from a different US hospital, with ground truth defined as the average of two radiologists’ bone age estimates. This dataset was curated to be diverse with intentional demographic representation: 607 male (50.5%) and 594 female (49.5%), with 303 Black (25.2%), 286 White (23.8%), 317 Hispanic (26.4%), and 295 Asian (24.6%) patients. The absolute difference between radiologist ground truth and the AI’s prediction was calculated for each radiograph and summarized as the mean absolute difference (MAD). Bias was evaluated by comparing MAD between sexes, races, and Tanner stages (sexual maturity) using t-tests and ANOVA, as appropriate.
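
The subgroup comparison described above can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names and the toy values are hypothetical, and it computes only the per-group MAD and the one-way ANOVA F statistic using the Python standard library.

```python
from collections import defaultdict
from statistics import mean


def absolute_errors(truth_months, predicted_months):
    """Per-radiograph absolute difference between ground truth and AI prediction."""
    return [abs(t - p) for t, p in zip(truth_months, predicted_months)]


def group_errors(errors, labels):
    """Collect the per-radiograph errors under each subgroup label."""
    buckets = defaultdict(list)
    for err, label in zip(errors, labels):
        buckets[label].append(err)
    return dict(buckets)


def one_way_anova_f(samples):
    """F statistic of a one-way ANOVA: between-group over within-group variance."""
    all_values = [v for s in samples for v in s]
    grand_mean = mean(all_values)
    k, n = len(samples), len(all_values)
    group_means = [mean(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2
                     for s, m in zip(samples, group_means))
    ss_within = sum((v - m) ** 2
                    for s, m in zip(samples, group_means) for v in s)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up toy values in months; NOT data from the study.
truth = [24, 30, 60, 66, 120, 126, 180, 186]
pred = [30, 38, 64, 68, 124, 128, 190, 198]
tanner_stage = [1, 1, 2, 2, 4, 4, 5, 5]

errors = absolute_errors(truth, pred)
by_stage = group_errors(errors, tanner_stage)
mad_by_stage = {stage: mean(errs) for stage, errs in by_stage.items()}
f_stat = one_way_anova_f(list(by_stage.values()))

print(mad_by_stage)
print(round(f_stat, 2))
```

Turning the F statistic into a p-value requires the F distribution; in practice a statistics package would likely be used instead, for example `scipy.stats.f_oneway` for ANOVA and `scipy.stats.ttest_ind` for the two-group comparison between sexes.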

Results:
Overall MAD was small (average, 7.5 months), indicating good generalizability of the bone age algorithm. No bias was detected when comparing performance between males and females or between races (no statistically significant difference in MAD). However, there were statistically significant differences between Tanner stages (p < 0.0001), with the smallest MAD for stage 4 (4.3 months) and the largest for stages 5 (9.1 months) and 1 (7.4 months).

Conclusion:
A state-of-the-art bone age AI algorithm demonstrated good external validity on a diverse dataset and did not demonstrate bias based on sex or race. However, there was bias against the extremes of the Tanner stage groups, which raises concerns about fairness for these groups if this algorithm is deployed clinically. We recommend caution when applying this algorithm to children at either end of the Tanner sexual maturity spectrum, as well as further study to better evaluate the presence and extent of bias in AI algorithms in radiology.