4921. Confidence Calibration: A Novel Method for Assessing Radiologist Performance
  1. Behrang Amini *; MD Anderson Cancer Center
Objective:
Radiologists routinely make probabilistic forecasts but rarely analyze the validity of those forecasts. To assess radiologist performance, we developed a probabilistic lexicon for reporting local recurrence (LR) of soft tissue sarcomas on MRI following definitive resection.

Materials and Methods:
A lexicon for the chance of LR was developed by nine radiologists, with three categories: low (0-15%), intermediate (15-85%), and high (> 85%) probability. The lexicon was implemented as a PowerScribe macro. After 16 months, radiology reports were reviewed to determine whether each study was performed to assess for LR and whether the lexicon was used. When the lexicon was not used, the chance of LR was inferred from the report. Two musculoskeletal fellows determined LR status by reviewing follow-up reports, images, and biopsy results. Questionable cases were discussed with the attending radiologist. Negative cases were defined as no recurrence on follow-up or on biopsy; positive cases were defined as enlarging lesions or a positive biopsy. Calibration analysis was used to assess the correlation of predicted probabilities of LR with observed occurrences of LR. Thirteen calibration and discrimination metrics were calculated for the whole group and for each radiologist. Bonferroni correction was applied for multiple comparisons.
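As an illustration of the calibration step, the following is a minimal sketch rather than the study's actual analysis. It assumes each lexicon category is represented by the midpoint of its probability range, and it takes calibration-in-the-large to be the mean predicted probability minus the observed LR rate, so that positive values indicate over-estimation. The midpoints, function names, and example data are all hypothetical.

```python
# Hypothetical midpoint for each lexicon category (an assumption; the
# abstract gives only the ranges 0-15%, 15-85%, and > 85%).
LEXICON_MIDPOINT = {"low": 0.075, "intermediate": 0.50, "high": 0.925}


def calibration_in_the_large(categories, outcomes):
    """Mean predicted probability minus observed event rate.

    Assumed definition: a positive value means recurrence was over-estimated.
    `outcomes` holds 1 for confirmed recurrence and 0 otherwise.
    """
    if not categories or len(categories) != len(outcomes):
        raise ValueError("categories and outcomes must be non-empty and equal in length")
    predicted = [LEXICON_MIDPOINT[c] for c in categories]
    mean_predicted = sum(predicted) / len(predicted)
    observed_rate = sum(outcomes) / len(outcomes)
    return mean_predicted - observed_rate


def calibration_table(categories, outcomes):
    """Per-category (assumed probability, observed rate, case count)."""
    table = {}
    for name, midpoint in LEXICON_MIDPOINT.items():
        hits = [o for c, o in zip(categories, outcomes) if c == name]
        if hits:  # skip categories that were never used
            table[name] = (midpoint, sum(hits) / len(hits), len(hits))
    return table


# Made-up example data, not taken from the study.
example_categories = ["low"] * 8 + ["intermediate"] + ["high"]
example_outcomes = [0] * 8 + [0] + [1]

example_citl = calibration_in_the_large(example_categories, example_outcomes)
example_table = calibration_table(example_categories, example_outcomes)
```

Comparing each category's assumed probability with its observed rate in `example_table` gives the tabular form of a calibration curve. A single summary number such as `example_citl` describes only the overall direction of miscalibration.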

Results:
Lexicon use improved over time but remained suboptimal (66%). The LR rate was 7%. Specificity for detection of LR was excellent (0.98, 95% CI: 0.97, 0.99), but sensitivity was modest (0.67, 95% CI: 0.55, 0.77). Overall, the group of radiologists over-estimated (over-called) LR (calibration-in-the-large: 0.088, 95% CI: 0.076, 0.099) and was globally under-confident (confidence score: -0.066, 95% CI: -0.077, -0.055). Two radiologists performed better than their peers, while two performed worse. There was no significant difference in any metric with respect to sarcoma type (myxoid vs. non-myxoid), reader volume (at/above median vs. below), or presence of advanced imaging.
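The sensitivity and specificity estimates above can be illustrated with a short sketch. The abstract does not state how predictions were dichotomized or which interval method was used, so the Wilson score interval here is an assumption, and the counts are invented for illustration only.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts, not the study's data.
tp, fn, tn, fp = 8, 2, 90, 10

sens, spec = sensitivity_specificity(tp, fn, tn, fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)

print(f"sensitivity {sens:.2f}, 95% CI ({sens_ci[0]:.2f}, {sens_ci[1]:.2f})")
print(f"specificity {spec:.2f}, 95% CI ({spec_ci[0]:.2f}, {spec_ci[1]:.2f})")
```

With a low event rate, the positive cases are few, so the sensitivity interval is typically much wider than the specificity interval.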

Conclusion:
Confidence calibration can be used to assess the performance of radiologists, either as a group or individually. The techniques can be generalized to other practice settings to assess and improve quality.