Dr. Scott Gottlieb is a physician and served as the 23rd Commissioner of the U.S. Food and Drug Administration. He is a CNBC contributor and serves on the boards of Pfizer and several health and technology startups. He is also a partner at the venture capital firm New Enterprise Associates. Shani Benezra is a senior research fellow at the American Enterprise Institute and a former associate producer of CBS News' "Face the Nation."
Many consumers and healthcare providers are turning to chatbots, powered by large language models, to answer medical questions and inform treatment choices. We decided to see whether there were significant differences between the leading platforms when it came to their clinical proficiency.
To obtain a medical license in the United States, aspiring doctors must pass all three stages of the United States Medical Licensing Examination, with the third and final stage widely viewed as the most challenging. It requires candidates to answer roughly 60% of the questions correctly, and historically the average passing score has hovered around 75%.
When we subjected the leading large language models to the same Step 3 questions, they performed remarkably well, achieving results that significantly outpaced many doctors.
But there were some clear differences between the models.
The USMLE Step 3 exam is typically administered after the first year of training and measures whether medical graduates are able to apply their understanding of clinical science to unsupervised medical practice. It assesses a new physician's ability to manage patient care across a wide range of medical specialties and includes multiple-choice questions and computer-based case simulations.
We isolated 50 questions from the 2023 USMLE Step 3 sample test to assess the clinical proficiency of five leading large language models, presenting the same set of questions to each of these platforms – ChatGPT, Claude, Google Gemini, Grok and Llama.
Other studies have measured the performance of these models on medical tasks, but to our knowledge, this is the first time these five leading platforms have been compared in a head-to-head evaluation. The results may give consumers and providers some insight into which platforms to turn to.
Here's how they scored:
ChatGPT-4o (OpenAI) — 49/50 questions correct (98%)
Claude 3.5 (Anthropic) — 45/50 (90%)
Gemini Advanced (Google) — 43/50 (86%)
Grok (xAI) — 42/50 (84%)
HuggingChat (Llama) — 33/50 (66%)
In our experiment, OpenAI's ChatGPT-4o emerged as the top performer with a score of 98%. It offered detailed medical analyses in language reminiscent of a medical professional, providing not only answers with extended reasoning but also context for its decision-making, explaining why the alternative answers were less appropriate.
Claude, from Anthropic, came in second with 90%. It gave more human-like responses, with simpler language and a bulleted structure that patients may find easier to follow. Gemini, which scored 86%, provided answers that weren't as thorough as those of ChatGPT or Claude, making its reasoning harder to discern, but its answers were succinct and to the point.
Grok, the chatbot from Elon Musk's xAI, scored a respectable 84% but offered no descriptive reasoning during our analysis, making it difficult to understand how it arrived at its answers. HuggingChat, an open-source chatbot built on Meta's Llama, scored the lowest at 66%, but it showed sound reasoning on the questions it answered correctly, offering concise responses and links to sources.
One question that most of the models got wrong involved a hypothetical 75-year-old woman with a heart condition. It asked what the most appropriate next step would be as part of her evaluation. Claude was the only model to answer correctly.
Another noteworthy question focused on a 20-year-old male patient with symptoms of a sexually transmitted infection. It asked which of five options was the appropriate next step in his workup. ChatGPT correctly determined that the patient should be scheduled for an HIV serology test in three months, but the model went further, recommending a follow-up visit in one week to make sure the patient's symptoms had resolved and that the antibiotics covered the strain of infection he had contracted. To us, the answer highlighted the model's capacity for broader reasoning, expanding beyond the fixed choices offered by the exam.
These models were not built for medical reasoning; they are products of the consumer technology sector, designed for tasks like language translation and content generation. Despite their non-medical origins, they have shown a surprising aptitude for clinical reasoning.
Newer platforms are being built specifically for medical problems. Google recently introduced Med-Gemini, a refined version of its earlier Gemini models that is fine-tuned for medical applications and equipped with web-search capabilities to improve clinical reasoning.
As these models evolve, they will sharpen their ability to analyze complex medical data, diagnose conditions and recommend treatments. They may offer a level of accuracy and consistency that human providers, constrained by fatigue and error, can sometimes struggle to match, and pave the way for a future in which the gateways to treatment are managed by machines, not doctors.