Natural Language-based Machine Learning Models for the Annotation of Clinical Radiology Reports.
Radiology. 2018 May;287(2):570-580. doi: 10.1148/radiol.2018171093. Epub 2018 Jan 30. Zech J1, Pain M1, Titano J1, Badgeley M1, Schefflein J1, Su A1, Costa A1, Bederson J1, Lehar J1, Oermann EK1.
Purpose: To compare different methods for generating features from radiology reports and to develop a method to automatically identify findings in these reports.
Materials and Methods: In this study, 96 303 head computed tomography (CT) reports were obtained. The linguistic complexity of these reports was compared with that of alternative corpora. Head CT reports were preprocessed, and machine-analyzable features were constructed by using bag-of-words (BOW), word embedding, and Latent Dirichlet allocation-based approaches. Ultimately, 1004 head CT reports were manually labeled for findings of interest by physicians, and a subset of these were deemed critical findings. Lasso logistic regression was used to train models for physician-assigned labels on 602 of 1004 head CT reports (60%) using the constructed features, and the performance of these models was validated on a held-out 402 of 1004 reports (40%). Models were scored by area under the receiver operating characteristic curve (AUC), and aggregate AUC statistics were reported for (a) all labels, (b) critical labels, and (c) the presence of any critical finding in a report. Sensitivity, specificity, accuracy, and F1 score were reported for the best performing model's (a) predictions of all labels and (b) identification of reports containing critical findings.
Results: The best-performing model (BOW with unigrams, bigrams, and trigrams plus average word embeddings vector) had a held-out AUC of 0.966 for identifying the presence of any critical head CT finding and an average 0.957 AUC across all head CT findings. Sensitivity and specificity for identifying the presence of any critical finding were 92.59% (175 of 189) and 89.67% (191 of 213), respectively. Average sensitivity and specificity across all findings were 90.25% (1898 of 2103) and 91.72% (18 351 of 20 007), respectively. Simpler BOW methods achieved results competitive with those of more sophisticated approaches, with an average AUC for presence of any critical finding of 0.951 for unigram BOW versus 0.966 for the best-performing model. The Yule I of the head CT corpus was 34, markedly lower than that of the Reuters corpus (at 103) or I2B2 discharge summaries (at 271), indicating lower linguistic complexity.
Conclusion: Automated methods can be used to identify findings in radiology reports. The success of this approach benefits from the standardized language of these reports. With this method, a large labeled corpus can be generated for applications such as deep learning. © RSNA, 2018 Online supplemental material is available for this article.