Peter Szolovits, Ph.D.
Generative artificial intelligence models exhibit amazing abilities but make serious errors. We have a very limited understanding of why they work well at all, or of the circumstances under which they give incorrect responses. This suggests the need for additional research and great caution in deploying such models for critical applications.

Since the release of ChatGPT in late 2022, based on OpenAI's GPT-3.5 large language model, those of us who have explored its capabilities have been amazed by its facility with language and its ability to generate coherent, and even insightful, synopses; answer questions about everything from general knowledge to domain-specific topics; offer advice on how to accomplish tasks, including medical diagnosis, therapy, and prognosis; deduce consequences of assumptions; and even write effective computer programs. Nevertheless, I would urge great caution in adopting such methods in health care, mainly because of our lack of understanding of how they accomplish the seemingly miraculous things they are able to do.