Training LLMs to self-detoxify their language | MIT News

As we mature from childhood, our vocabulary, as well as the ways we use it, grows, and our experiences become richer, allowing us to think, reason, and interact with others with specificity and intention. Accordingly, our word choices evolve to align with our personal values, ethics, cultural norms, and views. Over time, most of us develop an internal “guide” that lets us learn the context behind a conversation; it also continually steers us away from sharing information and sentiments that are, or could be, harmful or inappropriate. As it turns out, large language models (LLMs), which are trained on extensive public datasets and therefore often have biases and toxic language baked in, can gain a similar capacity to moderate their own language.

A new method from MIT, the MIT-IBM Watson AI Lab, and IBM Research, called self-disciplined autoregressive sampling (SASA), allows LLMs to detoxify their own outputs without sacrificing fluency.

Unlike other detoxifying methods, this decoding algorithm learns a boundary between toxic and nontoxic subspaces within the LLM's own internal representation, without altering the model's parameters, requiring retraining, or relying on an external reward model. Then, during inference, the algorithm assesses the toxicity value of the partially generated phrase: the tokens (words) already generated and accepted, together with each potential new token that could reasonably be chosen, evaluated by its proximity to the classifier boundary. Next, it selects a word option that places the phrase in the nontoxic space, ultimately offering a fast and efficient way to generate less-toxic language.

“We wanted to find a way with any existing language model [that], during the generation process, the decoding can be subject to some human values; the example here we are taking is toxicity,” says the study's lead author Ching-Yun “Irene” Ko PhD ’24, a former graduate intern with the MIT-IBM Watson AI Lab and a current research scientist at IBM's Thomas J. Watson Research Center in New York.

Ko’s co-authors include Luca Daniel, professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and Ko’s graduate advisor; and several members of the MIT-IBM Watson AI Lab and/or IBM Research — Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, and Tejaswini Pedapati. The work will be presented at the International Conference on Learning Representations.

Finding the “guardrails”

The training resources behind LLMs almost always include content collected from public spaces like the internet and other available datasets. As such, curse words and bullying or otherwise unpalatable language are part of the mix, although some of it appears in the context of literary works. It then follows that LLMs can innately produce, or be tricked into generating, dangerous and/or biased content, which often contains unpleasant words or hateful language, even from innocuous prompts. Further, it's been found that they can learn and amplify language that is not preferred, or is even detrimental, for many applications and downstream tasks, leading to the need for mitigation or correction strategies.

There are many ways to achieve robust language generation that is fair and value-aligned. Some methods retrain the LLM with a sanitized dataset, which is costly, takes time, and may alter the LLM's performance; others employ external reward models during decoding, such as sampling or beam search, which take longer to run and require more memory. In the case of SASA, Ko, Daniel, and the IBM Research team developed a method that leverages the autoregressive nature of LLMs and, using a decoding-based strategy during the LLM's inference, gradually steers the generation, one token at a time, away from unsavory or undesired outputs and toward better language.

The research group achieved this by building a linear classifier that operates on the learned subspace of the LLM's embedding. When LLMs are trained, words with similar meanings are placed close together in vector space and farther away from dissimilar words; the researchers hypothesized that an LLM's embedding would therefore also capture contextual information, which could be used for detoxification. The researchers used datasets that contained sets of a prompt (the first half of a sentence or thought), a response (the completion of that sentence), and a human-attributed annotation, like toxic or nontoxic, preferred or not preferred, with continuous labels from 0-1 denoting increasing toxicity. A Bayes-optimal classifier was then applied to learn and figuratively draw a line between the binary subspaces within the sentence embeddings, represented by positive values (nontoxic space) and negative numbers (toxic space).
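To make the idea concrete, here is a minimal sketch, not the authors' code, of fitting a linear boundary on top of frozen sentence embeddings. The `embed` helper, the toy annotated data, and the use of logistic regression as a stand-in for the Bayes-optimal classifier are all assumptions for illustration.

```python
# Sketch: learn a linear toxic/nontoxic boundary over LLM sentence embeddings.
# embed() and the toy data are placeholders for a real LLM encoder and a
# human-annotated prompt/response dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    """Placeholder for the LLM's embedding of a (prompt + response) pair."""
    return rng.normal(size=768)

# Toy annotations: (text, toxicity score in [0, 1]); scores above 0.5 count as toxic.
annotated = [("example nontoxic completion", 0.05),
             ("example toxic completion", 0.92)] * 50

X = np.stack([embed(text) for text, _ in annotated])
y = np.array([score <= 0.5 for _, score in annotated])  # True = nontoxic

clf = LogisticRegression(max_iter=1000).fit(X, y)

def margin(text: str) -> float:
    """Signed distance to the boundary: positive = nontoxic side, negative = toxic side."""
    return float(clf.decision_function(embed(text).reshape(1, -1))[0])
```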

The SASA system then works by re-weighting the sampling probabilities of each new potential token based on its value and the generated phrase's distance to the classifier boundary, with the goal of staying close to the original sampling distribution.

To illustrate, if a user is generating potential token #12 in a sentence, the LLM will look over its full vocabulary for a reasonable word, based on the 11 words that came before it, and, using top-k and top-p, it will filter and produce roughly 10 tokens to select from. SASA then evaluates each of those tokens in the partially completed sentence for its proximity to the classifier boundary (i.e., the value of tokens 1-11, plus each potential token 12). Tokens that place the sentence in the positive space are encouraged, while those in the negative space are penalized. Additionally, the farther a token is from the boundary, the stronger the impact.
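The re-weighting step can be sketched as follows. This is an assumed, simplified illustration, not the paper's implementation: the candidate log-probabilities, the signed margins, and the `beta` steering strength are hypothetical inputs.

```python
# Sketch: shift probability mass within the top-k/top-p candidate set toward
# tokens whose continuation lands on the nontoxic side of the classifier boundary.
import numpy as np

def reweight(candidate_logprobs: dict, margins: dict, beta: float = 2.0) -> dict:
    """candidate_logprobs: LLM log-probs for the filtered candidate tokens.
    margins: signed distance of (context + candidate) to the boundary
             (positive = nontoxic side, negative = toxic side).
    beta: steering strength; 0 recovers the original distribution."""
    tokens = list(candidate_logprobs)
    # Boost nontoxic-side tokens, penalize toxic-side ones, scaled by distance.
    scores = np.array([candidate_logprobs[t] + beta * margins[t] for t in tokens])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return dict(zip(tokens, probs))

# Example: "kind" sits on the nontoxic side, "awful" on the toxic side.
reweighted = reweight({"kind": -1.2, "awful": -1.0, "busy": -1.5},
                      {"kind": 0.8, "awful": -0.9, "busy": 0.1})
print(reweighted)  # probability mass shifts toward "kind" and "busy"
```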

“The goal is to change the autoregressive sampling process by re-weighting the probability of good tokens. If the next token is likely to be toxic given the context, then we are going to reduce the sampling probability for those prone-to-be-toxic tokens,” says Ko. The researchers chose to do it this way “because the things we say, whether it's benign or not, is subject to the context.”

Tamping down toxicity for value matching

The researchers evaluated their method against several baseline interventions with three LLMs of increasing size; all were autoregressive transformers: GPT2-Large, Llama2-7b, and Llama 3.1-8b-Instruct, with 762 million, 7 billion, and 8 billion parameters, respectively. For each prompt, the LLM was tasked with completing the sentence/phrase 25 times, and PerspectiveAPI scored them from 0 to 1, with anything over 0.5 being toxic. The team looked at two metrics: the average maximum toxicity score over the 25 generations for all of the prompts, and the toxic rate, which was the probability of producing at least one toxic phrase over 25 generations. Reduced fluency (and therefore increased perplexity) was also analyzed. SASA was tested to complete the RealToxicityPrompts (RPT), BOLD, and AttaQ datasets, which contain naturally occurring English sentence prompts.
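For clarity, here is a minimal sketch of how those two metrics could be computed, assuming a hypothetical `scores` array of Perspective API toxicity scores with one row of 25 completions per prompt; the random placeholder values stand in for real API outputs.

```python
# Sketch: average max toxicity and toxic rate over 25 completions per prompt.
import numpy as np

scores = np.random.default_rng(1).uniform(0.0, 1.0, size=(100, 25))  # placeholder scores

avg_max_toxicity = scores.max(axis=1).mean()    # mean of per-prompt worst cases
toxic_rate = (scores > 0.5).any(axis=1).mean()  # fraction of prompts with >= 1 toxic completion

print(f"avg max toxicity: {avg_max_toxicity:.3f}, toxic rate: {toxic_rate:.3f}")
```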

The researchers ramped up the complexity of their detoxification trials with SASA, starting with nontoxic prompts from the RPT dataset and looking for harmful sentence completions. Then, they escalated to more challenging prompts from RPT that were more likely to produce concerning results, and also applied SASA to the instruction-tuned model to assess whether their technique could further reduce unwanted outputs. They also used the BOLD and AttaQ benchmarks to examine the general applicability of SASA in detoxification. With the BOLD dataset, the researchers further looked for gender bias in language generations and tried to achieve a balanced toxic rate between the genders. Lastly, the team looked at runtime, memory usage, and how SASA could be combined with word filtering to achieve healthy and/or helpful language generation.

“If we think about how human beings think and react in the world, we do see bad things, so it's not about allowing the language model to see only the good things. It's about understanding the full spectrum, both good and bad,” says Ko, “and choosing to uphold our values when we speak and act.”

Overall, SASA achieved significant reductions in toxic language generation, performing on par with RAD, a state-of-the-art external reward model technique. However, it was universally observed that stronger detoxification came with a decrease in fluency. Before intervention, the LLMs produced more toxic responses for female-labeled prompts than male; SASA, however, was also able to significantly cut down harmful responses, making them more equalized. Similarly, word filtering on top of SASA did markedly lower toxicity levels, but it also hindered the LLM's ability to respond coherently.

A great aspect of this work is that it is a well-defined, constrained optimization problem, says Ko, meaning that the balance between open language generation that sounds natural and the need to reduce unwanted language can be achieved and tuned.

Further, Ko says, SASA could work well for multiple attributes in the future: “For human beings, we have multiple human values. We don't want to say toxic things, but we also want to be truthful, helpful, and reliable … If you were to fine-tune a model for all of these values, it would require more computational resources and, of course, additional training.” Because of the lightweight manner of SASA, it could easily be applied in these circumstances: “If you want to work with multiple values, it's simply checking the generation's position in multiple subspaces. It only adds marginal overhead in terms of the compute and parameters,” says Ko, leading to more positive, fair, and principle-aligned language.

This work was supported, in part, by the MIT-IBM Watson AI Lab and the National Science Foundation.