With the cloak of anonymity and the company of strangers, the appeal of the digital world is growing as a place to seek out mental health support. This phenomenon is buoyed by the fact that over 150 million people in the United States live in federally designated mental health professional shortage areas.
“I really need your help, as I am too scared to talk to a therapist and I can’t reach one anyway.”
“Am I overreacting, getting hurt about husband making fun of me to his friends?”
“Could some strangers please weigh in on my life and decide my future for me?”
The above quotes are real posts taken from users on Reddit, a social media news website and forum where users can share content or ask for advice in smaller, interest-based forums referred to as “subreddits.”
Using a dataset of 12,513 posts with 70,429 responses from 26 mental health-related subreddits, researchers from MIT, New York University (NYU), and the University of California, Los Angeles (UCLA) devised a framework to help evaluate the equity and overall quality of mental health support chatbots based on large language models (LLMs) like GPT-4. Their work was recently presented at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP).
To accomplish this, researchers asked two licensed clinical psychologists to evaluate 50 randomly sampled Reddit posts seeking mental health support, pairing each post with either a Redditor’s real response or a GPT-4 generated response. Without knowing which responses were real and which were AI-generated, the psychologists were asked to assess the level of empathy in each response.
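A blinded pairing of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual procedure: the function name `build_blinded_items`, the fixed random seed, and the 50/50 coin flip per post are all hypothetical choices. The idea is that raters see only the post and one response, while the record of which source produced each response is kept in a separate answer key.

```python
import random

def build_blinded_items(posts, human_responses, model_responses, seed=0):
    """Pair each post with one response and hide where it came from.

    Hypothetical helper: returns (items, key). `items` is what a rater
    would see; `key` maps item_id -> "human" or "model" and is withheld
    until rating is finished.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    items, key = [], {}
    for i, post in enumerate(posts):
        source = rng.choice(["human", "model"])
        response = human_responses[i] if source == "human" else model_responses[i]
        # The rater-facing record deliberately carries no source field.
        items.append({"item_id": i, "post": post, "response": response})
        key[i] = source
    return items, key
```

Keeping the answer key apart from the rater-facing items is the design choice that matters here: empathy ratings can only be joined back to their source after they have been collected.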
Mental health support chatbots have long been explored as a way of improving access to mental health support, but powerful LLMs like OpenAI’s ChatGPT are transforming human-AI interaction, with AI-generated responses becoming harder to differentiate from the responses of real humans.
Despite this remarkable progress, the unintended consequences of AI-provided mental health support have drawn attention to its potentially deadly risks; in March of last year, a Belgian man died by suicide as a consequence of an exchange with ELIZA, a chatbot developed to emulate a psychotherapist and powered by an LLM called GPT-J. One month later, the National Eating Disorders Association suspended its chatbot Tessa after the chatbot began dispensing dieting tips to patients with eating disorders.
Saadia Gabriel, a former MIT postdoc who is now a UCLA assistant professor and first author of the paper, admitted that she was initially very skeptical of how effective mental health support chatbots could actually be. Gabriel conducted this research during her time as a postdoc at MIT in the Healthy Machine Learning Group, led by Marzyeh Ghassemi, an MIT associate professor in the Department of Electrical Engineering and Computer Science and the MIT Institute for Medical Engineering and Science who is affiliated with the MIT Abdul Latif Jameel Clinic for Machine Learning in Health and the Computer Science and Artificial Intelligence Laboratory.
What Gabriel and the team of researchers found was that GPT-4 responses were not only more empathetic overall, but also 48 percent better at encouraging positive behavioral changes than human responses.
However, in a bias evaluation, the researchers found that GPT-4’s response empathy levels were reduced for Black (2 to 15 percent lower) and Asian posters (5 to 17 percent lower) compared to white posters or posters whose race was unknown.
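A "percent lower" comparison of this sort can be expressed as a relative difference in mean empathy score against a reference group. The sketch below assumes that definition; the article does not say exactly how the figures were computed, and the function name `relative_empathy_gap`, the group labels, and the scores in the usage example are invented purely for illustration.

```python
from statistics import mean

def relative_empathy_gap(scores_by_group, reference_group):
    """Percent difference in mean empathy score versus a reference group.

    Hypothetical helper: a negative value means the group's responses
    were rated less empathetic, on average, than the reference group's.
    """
    ref = mean(scores_by_group[reference_group])
    return {
        group: 100.0 * (mean(scores) - ref) / ref
        for group, scores in scores_by_group.items()
        if group != reference_group
    }

# Toy scores, not data from the study.
toy_scores = {"reference": [2.0, 2.0], "group_a": [1.8, 1.8]}
gaps = relative_empathy_gap(toy_scores, "reference")
```

With these toy numbers, `gaps["group_a"]` comes out at roughly -10, meaning group_a's responses were rated about 10 percent less empathetic than the reference group's.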
To evaluate bias in GPT-4 responses and human responses, researchers included different kinds of posts with explicit demographic (e.g., gender, race) leaks and implicit demographic leaks.
An explicit demographic leak would look like: “I’m a 32yo Black woman.”
Whereas an implicit demographic leak would look like: “Being a 32yo girl wearing my natural hair,” in which keywords are used to indicate certain demographics to GPT-4.
With the exception of Black female posters, GPT-4’s responses were found to be less affected by explicit and implicit demographic leaking compared to human responders, who tended to be more empathetic when responding to posts with implicit demographic suggestions.
“The structure of the input you give [the LLM] and some information about the context, like whether you want [the LLM] to act in the style of a clinician, the style of a social media post, or whether you want it to use demographic attributes of the patient, has a major impact on the response you get back,” Gabriel says.
The paper suggests that explicitly instructing LLMs to use demographic attributes can effectively alleviate bias, as this was the only method where researchers did not observe a significant difference in empathy across the different demographic groups.
Gabriel hopes this work will help ensure more comprehensive and thoughtful evaluation of LLMs being deployed in clinical settings across demographic subgroups.
“LLMs are already being used to provide patient-facing support and have been deployed in medical settings, in many cases to automate inefficient human systems,” Ghassemi says. “Here, we demonstrated that while state-of-the-art LLMs are generally less affected by demographic leaking than humans in peer-to-peer mental health support, they do not provide equitable mental health responses across inferred patient subgroups … we have a lot of opportunity to improve models so they provide improved support when used.”