Hugging Face releases a benchmark for testing generative AI on health tasks


Generative AI models are increasingly being brought to healthcare settings, in some cases perhaps prematurely. Early adopters believe that they’ll unlock increased efficiency while revealing insights that would otherwise be missed. Critics, meanwhile, point out that these models have flaws and biases that could contribute to worse health outcomes.

But is there a quantitative way to know how helpful, or harmful, a model might be when tasked with things like summarizing patient records or answering health-related questions?

Hugging Face, the AI startup, proposes an answer in a newly released benchmark test called Open Medical-LLM. Created in partnership with researchers at the nonprofit Open Life Science AI and the University of Edinburgh’s Natural Language Processing Group, Open Medical-LLM aims to standardize the evaluation of generative AI models on a range of medical-related tasks.

Open Medical-LLM isn’t a from-scratch benchmark, per se, but rather a stitching-together of existing test sets (MedQA, PubMedQA, MedMCQA and so on) designed to probe models for general medical knowledge and related fields, such as anatomy, pharmacology, genetics and clinical practice. The benchmark contains multiple-choice and open-ended questions that require medical reasoning and understanding, drawing from material including U.S. and Indian medical licensing exams and college biology test question banks.
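To make the idea of a stitched-together benchmark concrete, here is a minimal Python sketch of how multiple-choice results from several test sets could be scored and combined. It is an illustration only: the article does not describe Open Medical-LLM’s actual scoring code, and the item layout, the `score_by_test_set` function, the placeholder questions and the "always pick A" baseline are all assumptions made for this example.

```python
from collections import defaultdict


def score_by_test_set(items, predict):
    """Return (accuracy per test set, unweighted mean of those accuracies).

    `items` is a list of dicts with hypothetical keys: "test_set",
    "question", "options" (letter -> text) and "answer" (correct letter).
    `predict` is any callable mapping (question, options) to a letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["test_set"]] += 1
        if predict(item["question"], item["options"]) == item["answer"]:
            correct[item["test_set"]] += 1
    per_set = {name: correct[name] / total[name] for name in total}
    # Macro-average: every test set counts equally, whatever its size.
    macro = sum(per_set.values()) / len(per_set) if per_set else 0.0
    return per_set, macro


# Placeholder items, not real benchmark questions.
toy_items = [
    {"test_set": "MedQA", "question": "placeholder question 1",
     "options": {"A": "option 1", "B": "option 2"}, "answer": "A"},
    {"test_set": "MedQA", "question": "placeholder question 2",
     "options": {"A": "option 1", "B": "option 2"}, "answer": "B"},
    {"test_set": "PubMedQA", "question": "placeholder question 3",
     "options": {"A": "yes", "B": "no", "C": "maybe"}, "answer": "A"},
]


def always_a(question, options):
    """Trivial stand-in for a model: always answers "A"."""
    return "A"


per_set_accuracy, macro_accuracy = score_by_test_set(toy_items, always_a)
print(per_set_accuracy, macro_accuracy)
```

Reporting a score per test set alongside the average keeps a model’s weak areas visible rather than hiding them in a single number. Even so, a score of this kind measures exam-style question answering, not behavior in a clinic.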

“[Open Medical-LLM] enables researchers and practitioners to identify the strengths and weaknesses of different approaches, drive further advancements in the field and ultimately contribute to better patient care and outcome,” Hugging Face wrote in a blog post.

[Image: generative AI in healthcare. Image Credits: Hugging Face]

Hugging Face is positioning the benchmark as a “robust assessment” of healthcare-bound generative AI models. But some health workers on social media cautioned against putting too much stock in Open Medical-LLM, lest it lead to ill-informed deployments.

On X, Liam McCoy, a resident physician in neurology at the University of Alberta, pointed out that the gap between the “contrived environment” of medical question-answering and actual clinical practice can be quite large.

Hugging Face research scientist Clémentine Fourrier, who co-authored the blog post, agreed.

“These leaderboards should only be used as a first approximation of which [generative AI model] to explore for a given use case, but then a deeper phase of testing is always needed to examine the model’s limits and relevance in real conditions,” Fourrier replied on X. “Medical [models] should absolutely not be used on their own by patients, but instead should be trained to become support tools for MDs.”

It brings to mind Google’s experience when it tried to bring an AI screening tool for diabetic retinopathy to healthcare systems in Thailand.

Google created a deep learning system that scanned images of the eye, looking for evidence of retinopathy, a leading cause of vision loss. But despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating both patients and nurses with inconsistent results and a general lack of harmony with on-the-ground practices.

It’s telling that of the 139 AI-related medical devices the U.S. Food and Drug Administration has approved to date, none use generative AI. It’s exceptionally difficult to test how a generative AI tool’s performance in the lab will translate to hospitals and outpatient clinics, and, perhaps more importantly, how the outcomes might trend over time.

That’s not to suggest Open Medical-LLM isn’t useful or informative. The results leaderboard, if nothing else, serves as a reminder of just how poorly models answer basic health questions. But neither Open Medical-LLM nor any other benchmark, for that matter, is a substitute for carefully thought-out real-world testing.

