Methods: We developed a tool for evaluating empathy, the Chatbot Compassion Quotient (CCQ). Drawing on the psychology literature, we created a set of nine prompts that assess compassion in various capacities, including delivering difficult news and alleviating frustration. We compared responses generated by ChatGPT and Claude with responses from healthcare professionals. In this corollary to the Turing test, the central question "can machines think?" became "can machines demonstrate compassion?" Thirty participants rated three responses to each of nine scenarios on a 5-point Likert scale from 1 (not at all compassionate) to 5 (very compassionate). The three responses came from ChatGPT, Claude, and a human, and were labeled A, B, and C in random order. After rating the responses on the compassion scale, participants were asked to identify which of two options was AI-generated.
Results: Participants rated responses from ChatGPT (aggregate score: 4.1 out of 5) and Claude (aggregate score: 4.1 out of 5) as more empathetic than human responses (aggregate score: 2.6 out of 5). Response length may have influenced evaluations: longer responses were typically rated as more compassionate. The scores for ChatGPT and Claude were comparable. Responses that appeared most obviously AI-generated scored well relative to human responses. High-scoring responses were action-oriented and offered multiple forms of social support.
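The abstract does not state how the aggregate scores were computed. As a minimal sketch, assuming each aggregate is the simple mean of all Likert ratings a source received across participants and scenarios, the calculation could look like the following. The `ratings` rows and the `aggregate_scores` helper are hypothetical and for illustration only; the values are toy numbers, not study data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy ratings, NOT data from the study.
# Each row: (participant_id, scenario_id, source, rating on the 1-5 scale)
ratings = [
    (1, 1, "ChatGPT", 4),
    (1, 1, "Claude", 5),
    (1, 1, "Human", 3),
    (2, 1, "ChatGPT", 5),
    (2, 1, "Claude", 4),
    (2, 1, "Human", 2),
]


def aggregate_scores(rows):
    """Return the mean rating per source, rounded to one decimal place.

    Assumption: the aggregate is an unweighted mean over every rating
    a source received, pooled across participants and scenarios.
    """
    by_source = defaultdict(list)
    for _participant, _scenario, source, rating in rows:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 Likert range")
        by_source[source].append(rating)
    return {source: round(mean(values), 1) for source, values in by_source.items()}


scores = aggregate_scores(ratings)
for source, score in scores.items():
    print(f"{source}: {score} out of 5")
```

Under this assumption, a full dataset would contain 30 participants x 9 scenarios x 3 sources = 810 rows. Other aggregation choices, such as averaging per scenario first, would be equally plausible readings of the abstract.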
Conclusion: The study highlights the promise of human-machine synergy in healthcare. AI may help alleviate fatigue and burnout in the medical field by contributing thorough responses that offer insight into patient-centered care. Further research can build on these preliminary findings to evaluate and improve expressions of empathy in AI.