An AI Just Beat Doctors at Diagnosing ER Patients

Emergency doctors make high-stakes decisions in fast-paced, often chaotic situations. They must determine which patient most urgently needs care, what’s wrong, and what to do next.

AI could help. In a series of difficult scenarios, OpenAI’s o1-preview model matched or exceeded doctors in clinical reasoning. Debuted in 2024, the AI is a large language model, like those powering ChatGPT, Claude, Gemini, and other popular chatbots.

But o1-preview differed from its predecessors in its ability to “think” through problems before answering. Such reasoning models explore multiple strategies, check themselves, and revise answers before offering a conclusion. That is slightly closer to how humans solve problems.

Given case reports from a long-running database, o1-preview diagnosed the issue nearly 89 percent of the time. In real-world emergency room scenarios, the AI outperformed physicians at the triage stage, where doctors decide which patient needs treatment first.

AI has aced medical licensing exams and done well on simple clinical assessments. But “passing examinations is not the same as being a doctor, and demonstrating physician-level performance on authentic clinical tasks is a fundamentally harder challenge,” wrote Ashley Hopkins and Erik Cornelisse at Flinders University in Australia, who weren’t involved in the study.

This doesn’t mean o1-preview is ready for the clinic or about to replace physicians. Rather than a human-versus-machine spectacle, the study was aimed at setting a higher bar for systems designed to work alongside people. Like everyone else, doctors are incorporating AI into their work. Whether that improves or hinders care is an open question.

“We’re witnessing a very profound change in technology that can reshape medicine,” study author Arjun Manrai at Harvard Medical School said in a press conference.

AI, MD

The dream of AI in healthcare spans decades. Over 65 years ago, physicians proposed a benchmark for machine “doctors.” The goal is to create AI that can diagnose patients in messy, real-world cases. But use in clinics, where decisions have real consequences, is a high bar.

A crucial dataset is the New England Journal of Medicine (NEJM) clinicopathological case conference series, long used to teach early-career doctors to match symptoms to diseases.

It’s a tough job. Symptoms often overlap, and context matters: medical history, genetics, habits. Like detectives, doctors seek out the most likely suspect and work to confirm their theory, while keeping other culprits in mind.

The NEJM dataset has long served as a test of computers’ diagnostic abilities, thwarting generations of systems. Some learned from misdiagnoses; others relied on pre-programmed rules. But all struggled to find the best diagnoses and rank them by confidence.

Then along came large language models. These algorithms can parse clinical narratives and generate plausible diagnoses from text alone. OpenAI’s GPT-4 model, for instance, could handle some cases from NEJM. But most AI evaluations relied on simple, stripped-down case narratives without the noise of real hospital charts, where extra or ambiguous details could change reasoning.

A meaningful human baseline was missing. AI models have hit benchmark ceilings on simpler tasks, but real-world performance remains unclear. For models to matter in healthcare, they need to show they can navigate the ambiguity clinicians face every day, across diseases, with information missing.

Ace Student

The team pitted o1-preview against physicians and GPT-4 across five experiments.

The first used the NEJM dataset. The researchers gave the AI models tightly controlled prompts. “I’m running an experiment on a clinicopathological case conference to see how your diagnoses compare with those of human experts,” begins one. They told the models that a single diagnosis existed, informed them of the available tests, and asked them to rank diagnoses by probability.

On 143 cases, o1-preview pulled ahead, landing an exact or very close diagnosis nearly 89 percent of the time. GPT-4 scored 73 percent. The o1-preview model also aced questions on the next diagnostic test and management steps, including tasks like choosing an antibiotic or approaching difficult conversations about end-of-life care.

The gap widened on harder cases. Across simulated patients with unusual infections, heart injury, immune-driven liver damage, and aggressive autoimmune lung disease, o1-preview outperformed GPT-4—and sometimes a panel of over 550 clinicians.

Next came the biggest challenge: cases involving actual patients.

“As we can all imagine, the real world … comes with countless distractors, and if anyone has really seen a modern-day electronic health record, saying that there are distractors might be, frankly, an understatement,” said study author Peter Brodeur. “And so we wanted to see how o1-preview could perform diagnostically without stripping away all of the irrelevant input and noise that comes with everyday medical practice.”

When the team fed o1-preview 70 emergency room cases randomly chosen from a Boston hospital, the model surpassed two expert physicians across scenarios: triage, exams, chart review, and admit-or-discharge decisions. In a blinded review, evaluators couldn’t reliably distinguish AI output from physicians’. Importantly, o1-preview could explain the reasoning behind its final assessment and show how it weighed supporting or refuting evidence.

More information helped everyone. But o1-preview had an edge at the first stage, “where there is the least information available about the patient and the most urgency to make the right decision,” wrote the team.

What Comes Next?

Doctors don’t diagnose from charts alone. They watch the patient, listen to their breathing and speech, and note their affect during physical exams. But o1-preview relied solely on text documented by others. Newer models, like GPT-5.3 and Gemini 3.1 Pro, can take in images, audio, even video. In principle, that brings them closer to how clinicians actually work.

But to be clear, o1-preview isn’t ready for the real world. Although AI can operate at expert level in well-defined tasks like radiology, complex medical reasoning hasn’t been proven in clinical trials. “We need to evaluate this technology now” in rigorous trials, said Manrai.

Also, diagnostic reasoning is just one part of medicine. Other medical AI benchmarks, such as the Medical Holistic Evaluation of Language Models, aim to evaluate end-to-end care, including clinical decision support, notetaking, communicating with patients, research assistance, and administration. The next step is to test AI in supervised clinical settings to see how it performs under guidance, like a medical intern.

OpenAI jumped the gun here. Earlier this year, the company launched ChatGPT Health to handle the more than 40 million health-related questions OpenAI claims to receive every day. But the tool has already drawn criticism for missing medical emergencies. Other AI titans are joining the race.

Accuracy isn’t the only bar for clinical deployment. Medical AI has also shown racial bias that resulted in worse outcomes. For AI to change healthcare, it “must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring,” wrote Hopkins and Cornelisse.
