Samuel J. Aronson, A.L.M., M.A., Kalotina Machini, Ph.D., Jiyeon Shin, A.L.M., B.S., Pranav Sriraman, B.S., Sean Hamill, B.Sc., Emma R. Henricks, M.S., C.G.C., Charlotte J. Mailly, B.A., Angie J. Nottage, B.S., Sami S. Amr, Ph.D., Michael Oates, and Matthew S. Lebo, Ph.D.
Large language models (LLMs) hold promise for improving literature review of variants in clinical genetic testing. We analyzed the performance, nondeterminism, and drift of Generative Pretrained Transformer 4 (GPT-4) series models to assess their present suitability for use in complex clinical laboratory processes. We optimized a chained, two-prompt GPT-4 sequence for automated classification of functional genetic evidence in literature using a training set of 45 article–variant pairs. The initial prompt asked GPT-4 to supply all functional evidence present in a given article for the variant of interest or indicate the absence of functional evidence. For articles in which GPT-4 found functional evidence, a second prompt asked GPT-4 to classify the evidence into pathogenic, benign, or intermediate/inconclusive categories. A final test set of 72 manually classified article–variant pairs was used to test model performance over time.

During a 2.5-month period from December 2023 to February 2024, we observed substantial variability in the results both within runs made in rapid succession (nondeterminism) and across days (drift). The inconsistencies lessened after January 18, 2024. Variability of results existed within and across models in the GPT-4 series, affecting various performance statistics, such as sensitivity, positive predictive value (PPV), and negative predictive value (NPV), to different degrees. For the 20 runs starting on January 22, 2024, our initial prompt's identification of articles containing functional evidence had 92.2% sensitivity, 95.6% PPV, and 86.3% NPV. Our second prompt's pathogenic evidence detection had 90.0% sensitivity, 74.0% PPV, and 95.3% NPV, and its benign evidence detection had 88.0% sensitivity, 76.6% PPV, and 96.9% NPV.

These data support the conclusion that nondeterminism and drift should be assessed and monitored for the specific metrics used to decide whether to introduce LLM functionality into clinical workflows.
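The performance statistics reported above follow the standard confusion-matrix definitions. A minimal sketch of those definitions (the counts in the test are illustrative, not taken from the study):

```python
def sensitivity(tp: int, fn: int) -> float:
    # Sensitivity (recall): fraction of truly positive article-variant
    # pairs that the prompt correctly flagged as positive.
    return tp / (tp + fn)

def ppv(tp: int, fp: int) -> float:
    # Positive predictive value (precision): fraction of pairs flagged
    # positive that were truly positive per the manual classification.
    return tp / (tp + fp)

def npv(tn: int, fn: int) -> float:
    # Negative predictive value: fraction of pairs called negative
    # that were truly negative per the manual classification.
    return tn / (tn + fn)
```

Note that PPV and NPV, unlike sensitivity, depend on the prevalence of positives in the test set, which is one reason the three metrics can drift to different degrees.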
Failing to account for these challenges could lead to unidentified or incorrectly classified information that might be critical for patient care. The performance of our prompts appears adequate to assist in prioritization of articles but not in automated decision-making. Multiple avenues for further enhancing this performance could be explored.
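One simple way to quantify the nondeterminism described above (not the authors' specific protocol) is to repeat the full classification run several times and measure, for each article–variant pair, how often the repeated runs agree with the modal label:

```python
from collections import Counter

def consistency_rate(runs: list[list[str]]) -> float:
    """Average agreement with the modal label across repeated runs.

    `runs` is a list of repeated runs; each run is a list of labels
    (e.g., "pathogenic", "benign", "inconclusive"), one label per
    article-variant pair, in a fixed order. Returns 1.0 when every
    run assigns identical labels (fully deterministic output).
    """
    n_items = len(runs[0])
    agree = 0.0
    for i in range(n_items):
        labels = [run[i] for run in runs]
        # Count how many runs match the most common label for this item.
        modal_count = Counter(labels).most_common(1)[0][1]
        agree += modal_count / len(labels)
    return agree / n_items
```

Tracking such a statistic over time (alongside sensitivity, PPV, and NPV against a fixed manually classified test set) would surface both within-day nondeterminism and across-day drift before either affects a clinical workflow.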