E5400. Towards Equitable AI: An Enhanced DBT ML Algorithm for Breast Cancer Diagnosis and Triage Across Diverse Racial Demographics
  1. Dogan Polat; UT Southwestern
  2. Alheli Garza; UT Southwestern
  3. Chase Waggener; Medcognetics
  4. Tim Cogan; Medcognetics
  5. Paula Gupta; Medcognetics
  6. Lakshman Tamil; Medcognetics
  7. Basak Dogan; UT Southwestern
Objective:
To validate the performance of a commercially available mammography machine learning (ML) algorithm on digital breast tomosynthesis (DBT) examinations from a geographically diverse, multi-institutional dataset.

Materials and Methods:
A retrospective blinded study was performed using a commercially available mammography ML algorithm. Bilateral cases consisting of both craniocaudal (CC) and mediolateral oblique (MLO) 3D (DBT) views were submitted to the algorithm, which generated a global suspicion-of-malignancy score for each case. Predefined operating thresholds corresponding to sensitivities of 0.95, 0.98, and 1.0 were applied, with the aim of exceeding radiologist performance. Test data were collected from multiple radiology clinics across the United States and Germany, and the test cohort was constructed to be representative of the population with respect to lesion type, lesion size, breast density, age, and race. Each case was labeled benign if it had a negative diagnosis (BI-RADS 1 or 2 assessment) throughout 2 years of follow-up, or malignant if it had a positive biopsy result. Algorithm performance was reported as the area under the receiver operating characteristic (ROC) curve (AUC), and ROC curves were compared pairwise using the pROC package in R.
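The abstract states only that ROC curves were compared pairwise with pROC in R. pROC's `roc.test` uses DeLong's method by default, so the following is a minimal Python sketch of that idea, not the authors' code. It computes the empirical AUC (the Mann-Whitney estimate) together with its DeLong variance, then runs a two-sided z-test. Two assumptions are ours: the curves are treated as independent (unpaired), although a racial subgroup is in fact a subset of the full cohort, and a normal approximation is used, which may differ in detail from pROC's own unpaired test. The function names are hypothetical.

```python
import math
from statistics import variance


def _psi(x: float, y: float) -> float:
    """Kernel: 1 if the malignant score outranks the benign score, 0.5 on a tie."""
    if x > y:
        return 1.0
    if x == y:
        return 0.5
    return 0.0


def delong_auc(pos: list[float], neg: list[float]) -> tuple[float, float]:
    """Return (AUC, DeLong variance) from scores of malignant (pos) and benign (neg) cases.

    Needs at least two cases in each class so the sample variances are defined.
    """
    m, n = len(pos), len(neg)
    # Structural components: each case's mean kernel value against the other class.
    v10 = [sum(_psi(x, y) for y in neg) / n for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / m for y in neg]
    auc = sum(v10) / m  # Mann-Whitney estimate of P(malignant score > benign score)
    var = variance(v10) / m + variance(v01) / n
    return auc, var


def unpaired_auc_test(pos1, neg1, pos2, neg2) -> tuple[float, float]:
    """Two-sided z-test of AUC1 == AUC2, ASSUMING the two ROC curves are independent.

    Raises ZeroDivisionError if both variances are zero (e.g. two perfectly separated curves).
    """
    auc1, var1 = delong_auc(pos1, neg1)
    auc2, var2 = delong_auc(pos2, neg2)
    z = (auc1 - auc2) / math.sqrt(var1 + var2)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p
```

For example, malignant scores `[0.9, 0.8, 0.7, 0.6]` against benign scores `[0.5, 0.65, 0.3, 0.2]` give an AUC of 0.9375, since 15 of the 16 malignant/benign pairs are correctly ordered. The quadratic pairwise loop keeps the sketch readable; a rank-based implementation would scale better to larger cohorts.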

Results:
The test dataset comprised 808 cases; most patients were 50–60 years old (n = 223, 27.6%) or 60–70 years old (n = 214, 26.5%). Lesion type was unavailable or negative for 424 (52.5%) patients; 345 (42.7%) presented as soft tissue abnormalities and 39 (4.8%) as calcifications. Race was stratified into eight categories, including White (n = 451, 55.8%), Hispanic (n = 191, 23.6%), other (n = 62, 7.7%), Black (n = 42, 5.2%), Asian (n = 21, 2.6%), unknown (n = 11, 1.4%), and American Indian (n = 10, 1.2%). Overall, the algorithm achieved an AUC of 0.96 (CI: 0.95–0.97), with a sensitivity of 0.88, specificity of 0.89, NPV of 0.88, and PPV of 0.89. At 90%, 95%, and 100% sensitivity, 76%, 57%, and 22% of DBT examinations, respectively, would be triaged as negative. Performance was comparable across racial groups, with AUC, sensitivity, specificity, NPV, and PPV, respectively, of: White (0.97, 0.90, 0.90, 0.90, 0.90), Black (0.96, 0.94, 0.96, 0.96, 0.94), Asian (0.90, 0.94, 0.75, 0.75, 0.94), Hispanic (0.93, 0.81, 0.86, 0.80, 0.86), other (0.97, 0.90, 0.88, 0.95, 0.79), and unknown (0.93, 0.60, 0.82, 0.69, 0.75). Comparing each racial group's AUC with the global AUC showed no significant difference (p > 0.99 for each).
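The abstract reports the share of DBT examinations that would be triaged at fixed sensitivities, but it does not describe how each operating threshold was chosen. The following is a minimal Python sketch under stated assumptions: a case is flagged as suspicious when its score is at or above the threshold; cases below the threshold count as triaged negative; and the threshold is the highest malignant-case score that still meets the target sensitivity. The sketch also shows how sensitivity, specificity, PPV, and NPV follow from the resulting confusion matrix. `operating_point` and `OperatingPoint` are hypothetical names and are not part of the commercial algorithm.

```python
import math
from dataclasses import dataclass


@dataclass
class OperatingPoint:
    threshold: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    triaged_fraction: float  # share of all cases called negative (score below threshold)


def _ratio(num: int, den: int) -> float:
    # NaN rather than ZeroDivisionError when a class or prediction group is empty.
    return num / den if den else float("nan")


def operating_point(scores: list[float], labels: list[int],
                    target_sensitivity: float) -> OperatingPoint:
    """Highest threshold t (score >= t is flagged) whose sensitivity meets the target.

    labels: 1 = malignant (positive biopsy), 0 = benign (ASSUMED encoding).
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("at least one malignant case is required")
    # Number of malignant cases that must be flagged; the small epsilon guards
    # against float products such as 0.07 * 100 == 7.000000000000001.
    k = math.ceil(target_sensitivity * len(positives) - 1e-9)
    k = min(max(k, 1), len(positives))
    threshold = positives[k - 1]  # k-th highest malignant score

    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if y == 1:
            if flagged:
                tp += 1
            else:
                fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return OperatingPoint(
        threshold=threshold,
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
        triaged_fraction=(tn + fn) / len(scores),
    )
```

On a toy set with scores `[0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]` and labels `[1, 1, 1, 1, 0, 0, 0, 0]`, a target sensitivity of 1.0 places the threshold at 0.4. That threshold catches every malignant case and triages 3 of the 8 cases as negative. Lowering the target to 0.75 raises the threshold to 0.7 and triages 5 of the 8 cases. This mirrors the trade-off reported above: the triaged share falls as the required sensitivity rises.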

Conclusion:
Our algorithm's sensitivity and specificity exceeded the standard-of-care benchmarks reported by the Breast Cancer Surveillance Consortium (BCSC). When operated at predefined sensitivity thresholds, the algorithm triaged a substantial proportion of studies as negative across racial groups, which could reduce cost, time, and radiologist effort.