Gioele Barabucci, Ph.D., Victor Shia, Ph.D., Eugene Chu, M.D., Benjamin Harack, M.Sc., Kyle Laskowski, B.S., and Nathan Fu, B.S.
Although large language models (LLMs), such as OpenAI GPT-4 or Google PaLM 2, are proposed as viable diagnostic support tools or even spoken of as replacements for “curbside consults,” past studies show that they may lack sufficient diagnostic accuracy for real-life applications. In an effort to improve their accuracy and reduce the risk of misdiagnoses, we applied methods from the field of collective intelligence to produce synthetic differential diagnoses that aggregate answers from individual commercial LLMs (OpenAI GPT-4, Google PaLM 2, Cohere Command, and Meta Llama 2). Using 200 clinical vignettes of real-life cases from the Human Diagnosis Project platform, we assessed and compared the accuracy of differential diagnoses from individual LLMs with those from aggregated LLM responses. We aggregated the LLM responses into synthetic differential diagnoses using a simple frequency-based, 1/r-weighted method, in which more weight is given to diagnoses appearing near the top of the LLM responses and appearing in the responses of multiple LLMs. We evaluated all possible combinations of LLMs by calculating various TOP-n accuracy metrics: that is, how frequently the correct diagnosis matches any of the first n diagnoses. We found that aggregating responses from multiple LLMs leads to more accurate differential diagnoses (average TOP-5 accuracy for three LLMs: 75.3%±1.6 percentage points) compared with the differential diagnoses produced by single LLMs (average TOP-5 accuracy for single LLMs: 59.0%±6.1 percentage points). We also found that aggregating smaller and less capable models (TOP-5 accuracy for three smaller LLMs, not including GPT-4: 70.0%) can rival the accuracy of the top-performing model (TOP-5 accuracy for GPT-4: 72.0%). 
The use of collective intelligence methods to synthesize differential diagnoses, combining the responses of different LLMs, achieves three of the necessary steps toward advancing LLMs as a diagnostic support tool: demonstrating sufficiently high diagnostic accuracy, reducing the risk of misdiagnoses, and eliminating the dependence on a single commercial vendor. (Funded by the European Union’s Horizon Europe research and innovation program.)
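The frequency-based, 1/r-weighted aggregation and the TOP-n metric described above can be illustrated with a minimal sketch. It is not the study's implementation. It assumes each LLM response has already been parsed into a ranked list of diagnosis strings, that two diagnoses match when their lowercased, whitespace-trimmed names are identical (a real pipeline would likely need clinical synonym matching), and that ties are broken alphabetically. The function names and the example vignette data are hypothetical.

```python
from collections import defaultdict


def normalize(diagnosis):
    # Assumption: exact match on lowercased, trimmed names stands in for
    # whatever diagnosis-matching procedure a real evaluation would use.
    return diagnosis.strip().lower()


def aggregate_differentials(responses):
    """Merge ranked diagnosis lists into one synthetic differential.

    A diagnosis at rank r (1-based) in one response contributes 1/r to its
    score; contributions are summed across responses, so diagnoses that
    rank high and appear in several responses rise to the top.
    """
    scores = defaultdict(float)
    for ranked in responses:
        seen = set()
        for rank, diagnosis in enumerate(ranked, start=1):
            key = normalize(diagnosis)
            if key in seen:  # count each diagnosis once per response
                continue
            seen.add(key)
            scores[key] += 1.0 / rank
    # Highest score first; alphabetical tie-break is an assumption.
    return sorted(scores, key=lambda d: (-scores[d], d))


def top_n_hit(differential, correct, n):
    """True if the correct diagnosis is among the first n entries."""
    return normalize(correct) in differential[:n]


def top_n_accuracy(cases, n):
    """Fraction of (responses, correct_diagnosis) cases with a TOP-n hit."""
    hits = sum(
        top_n_hit(aggregate_differentials(responses), correct, n)
        for responses, correct in cases
    )
    return hits / len(cases)


# Hypothetical ranked outputs from three LLMs for one vignette.
responses = [
    ["Pulmonary embolism", "Pneumonia", "Pneumothorax"],
    ["Pneumonia", "Pulmonary embolism", "Heart failure"],
    ["Pneumonia", "Bronchitis", "Pulmonary embolism"],
]
merged = aggregate_differentials(responses)
```

In this example, "pneumonia" scores 1/2 + 1 + 1 = 2.5 and "pulmonary embolism" scores 1 + 1/2 + 1/3 ≈ 1.83, so the two diagnoses named by all three lists lead the synthetic differential, ahead of "bronchitis" (0.5), which appears in only one list.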