Hugging Face releases a benchmark for testing generative AI on health tasks

Generative AI models are increasingly being brought to healthcare settings, in some cases perhaps prematurely. Early adopters believe that they’ll unlock increased efficiency while revealing insights that would otherwise be missed. Critics, meanwhile, point out that these models have flaws and biases that could contribute to worse health outcomes.

But is there a quantitative way to know how helpful, or harmful, a model might be when tasked with things like summarizing patient records or answering health-related questions?

Hugging Face, the AI startup, proposes a solution in a newly released benchmark test called Open Medical-LLM. Created in partnership with researchers at the nonprofit Open Life Science AI and the University of Edinburgh’s Natural Language Processing Group, Open Medical-LLM aims to standardize evaluating the performance of generative AI models on a range of medical-related tasks.

New: Open Medical LLM Leaderboard! 🩺
In basic chatbots, errors are annoyances.
In medical LLMs, errors can have life-threatening consequences 🩸
It’s therefore vital to benchmark/follow advances in medical LLMs before thinking about deployment.
Blog: https://t.co/pddLtkmhsz
— Clémentine Fourrier 🍊 (@clefourrier) April 18, 2024

Open Medical-LLM isn’t a from-scratch benchmark, per se, but rather a stitching-together of existing test sets (MedQA, PubMedQA, MedMCQA and so on) designed to probe models for general medical knowledge and related fields, such as anatomy, pharmacology, genetics and clinical practice. The benchmark contains multiple-choice and open-ended questions that require medical reasoning and understanding, drawing from material including U.S. and Indian medical licensing exams and college biology test question banks.

“[Open Medical-LLM] enables researchers and practitioners to identify the strengths and weaknesses of different approaches, drive further advancements in the field and ultimately contribute to better patient care and outcome,” Hugging Face wrote in a blog post.


Hugging Face is positioning the benchmark as a “robust assessment” of healthcare-bound generative AI models. But some health workers on social media cautioned against putting too much stock into Open Medical-LLM, lest it lead to ill-informed deployments.

On X, Liam McCoy, a resident physician in neurology at the University of Alberta, pointed out that the gap between the “contrived environment” of medical question-answering and actual clinical practice can be quite large.

It’s great progress to see these comparisons head-to-head, but important for us to also remember how big the gap is between the contrived environment of medical question-answering and actual clinical practice! Not to mention the idiosyncratic risks these metrics cannot capture.
— Liam McCoy, MD MSc (@LiamGMcCoy) April 18, 2024

Hugging Face research scientist Clémentine Fourrier, who co-authored the blog post, agreed.

“These leaderboards should only be used as a first approximation of which [generative AI model] to explore for a given use case, but then a deeper phase of testing is always needed to examine the model’s limits and relevance in real conditions,” Fourrier replied on X. “Medical [models] should absolutely not be used on their own by patients, but instead should be trained to become support tools for MDs.”

It brings to mind Google’s experience when it tried to bring an AI screening tool for diabetic retinopathy to healthcare systems in Thailand.

Google created a deep learning system that scanned images of the eye, looking for evidence of retinopathy, a leading cause of vision loss. But despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating both patients and nurses with inconsistent results and a general lack of harmony with on-the-ground practices.

It’s telling that of the 139 AI-related medical devices the U.S. Food and Drug Administration has approved to date, none use generative AI. It’s exceptionally difficult to test how a generative AI tool’s performance in the lab will translate to hospitals and outpatient clinics, and, perhaps more importantly, how the outcomes might trend over time.

That’s not to suggest Open Medical-LLM isn’t useful or informative. The results leaderboard, if nothing else, serves as a reminder of just how poorly models answer basic health questions. But neither Open Medical-LLM nor any other benchmark, for that matter, is a substitute for carefully thought-out real-world testing.
