Why RAG won’t solve generative AI’s hallucination problem

Hallucinations — basically, the lies generative AI models tell — are a big problem for businesses looking to integrate the technology into their operations.

Because models have no real intelligence and are simply predicting words, images, speech, music and other data according to a private schema, they sometimes get it wrong. Very wrong. In a recent piece in The Wall Street Journal, a source recounts an instance where Microsoft’s generative AI invented meeting attendees and implied that conference calls were about subjects that weren’t actually discussed on the call.

As I wrote a while ago, hallucinations may be an unsolvable problem with today’s transformer-based model architectures. But a number of generative AI vendors suggest that they can be done away with, more or less, through a technical approach called retrieval augmented generation, or RAG.

Here’s how one vendor, Squirro, pitches it:

At the core of the offering is the concept of Retrieval Augmented LLMs or Retrieval Augmented Generation (RAG) embedded in the solution … [Our generative AI] is unique in its promise of zero hallucinations. Every piece of information it generates is traceable to a source, ensuring credibility.

Here’s a similar pitch from SiftHub:

Using RAG technology and fine-tuned large language models with industry-specific knowledge training, SiftHub allows companies to generate personalized responses with zero hallucinations. This guarantees increased transparency and decreased risk and inspires absolute trust to use AI for all their needs.

RAG was pioneered by data scientist Patrick Lewis, researcher at Meta and University College London and lead author of the 2020 paper that coined the term. Applied to a model, RAG retrieves documents possibly relevant to a question — for example, a Wikipedia page about the Super Bowl — using what’s essentially a keyword search, and then asks the model to generate answers given this additional context.
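To make the mechanics concrete, here is a minimal sketch of that flow in Python: rank a tiny corpus by keyword overlap with the question, then build a prompt that asks the model to answer from the retrieved text. The corpus, the scoring and the prompt template are hypothetical simplifications, not any vendor’s actual pipeline.

```python
# A minimal sketch of the RAG flow described above (all names and data are
# hypothetical): rank a small corpus by keyword overlap with the question,
# then build a prompt that asks the model to answer from the retrieved text.
import re
from collections import Counter

CORPUS = {
    "superbowl.txt": "The Kansas City Chiefs won Super Bowl LVIII in February 2024.",
    "worldcup.txt": "Argentina won the FIFA World Cup in December 2022.",
}

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by simple keyword overlap with the query and return the top k."""
    q = Counter(tokenize(query))
    scored = sorted(
        CORPUS.items(),
        key=lambda item: sum((q & Counter(tokenize(item[1]))).values()),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def build_prompt(query: str) -> str:
    """Prepend the retrieved passages so the model answers from them, not from memory alone."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# The assembled prompt would then be sent to a generative model via whatever
# API the system uses; that call is omitted here.
print(build_prompt("Who won the Super Bowl last year?"))
```

A production system would swap in a real search index and an actual model call, but the shape of the technique is the same: retrieve first, then generate with the retrieved text in the context window.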

“When you’re interacting with a generative AI model like ChatGPT or Llama and you ask a question, the default is for the model to answer from its ‘parametric memory’ — i.e., from the knowledge that’s stored in its parameters as a result of training on massive data from the web,” David Wadden, a research scientist at AI2, the AI-focused research division of the nonprofit Allen Institute, explained. “But, just like you’re likely to give more accurate answers if you have a reference [like a book or a file] in front of you, the same is true in some cases for models.”

RAG is undeniably useful — it allows you to attribute things a model generates to retrieved documents to verify their factuality (and, as an added benefit, avoid potentially copyright-infringing regurgitation). RAG also lets enterprises that don’t want their documents used to train a model — say, companies in highly regulated industries like healthcare and law — allow models to draw on those documents in a more secure and temporary way.

But RAG certainly can’t stop a model from hallucinating. And it has limitations that many vendors gloss over.

Wadden says that RAG is most effective in “knowledge-intensive” scenarios where a user wants to use a model to address an “information need” — for example, to find out who won the Super Bowl last year. In these scenarios, the document that answers the question is likely to contain many of the same keywords as the query (e.g., “Super Bowl,” “last year”), making it relatively easy to find via keyword search.

Things get trickier with “reasoning-intensive” tasks such as coding and math, where it’s harder to specify in a keyword-based search query the concepts needed to answer a request — much less identify which documents might be relevant.
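A toy comparison illustrates the gap: the document that would actually help with a math request often shares almost no surface keywords with it, so a keyword-overlap score ranks it below a superficially similar but unhelpful page. Both snippets below are hypothetical.

```python
# A toy illustration of the mismatch (both documents are hypothetical snippets):
# the genuinely useful reference shares almost no keywords with the request,
# so a keyword-overlap score barely registers it.
import re
from collections import Counter

def overlap(a: str, b: str) -> int:
    tok = lambda t: Counter(re.findall(r"[a-z0-9]+", t.lower()))
    return sum((tok(a) & tok(b)).values())

query = "Show that the sum of the first n odd numbers is n squared."
superficial_doc = "Odd numbers and even numbers: definitions and examples."
useful_doc = "Proof by induction: establish a base case, then prove an inductive step."

print(overlap(query, superficial_doc))  # 2: matches surface words like "odd" and "numbers"
print(overlap(query, useful_doc))       # 0: no shared keywords, despite describing the right technique
```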

Even with basic questions, models can get “distracted” by irrelevant content in documents, particularly in long documents where the answer isn’t obvious. Or they can — for reasons as yet unknown — simply ignore the contents of retrieved documents, opting instead to rely on their parametric memory.

RAG is also expensive in terms of the hardware needed to apply it at scale.

That’s because retrieved documents, whether from the web, an internal database or somewhere else, have to be stored in memory — at least temporarily — so that the model can refer back to them. Another expense is compute for the increased context a model has to process before generating its response. For a technology already notorious for the amount of compute and electricity it requires even for basic operations, this amounts to a serious consideration.

That’s not to suggest RAG can’t be improved. Wadden noted many ongoing efforts to train models to make better use of RAG-retrieved documents.

Some of these efforts involve models that can “decide” when to make use of the documents, or models that can choose not to perform retrieval in the first place if they deem it unnecessary. Others focus on ways to more efficiently index massive datasets of documents, and on improving search through better representations of documents — representations that go beyond keywords.

“We’re pretty good at retrieving documents based on keywords, but not so good at retrieving documents based on more abstract concepts, like a proof technique needed to solve a math problem,” Wadden said. “Research is needed to build document representations and search techniques that can identify relevant documents for more abstract generation tasks. I think this is mostly an open question at this point.”
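One way to picture that research direction is retrieval over learned vector representations instead of keywords, where “close in embedding space” stands in for “about the same concept.” The sketch below uses hand-made four-dimensional vectors as stand-ins for what a trained encoder would produce; it illustrates the idea rather than any particular system.

```python
# A sketch of retrieval over learned representations rather than keywords.
# The 4-dimensional vectors are hypothetical stand-ins for embeddings that a
# trained encoder would produce; real systems use hundreds of dimensions.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Nearby vectors are meant to encode "about the same concept", even when the
# underlying texts share no keywords.
doc_vectors = {
    "proof_by_induction.txt": [0.9, 0.1, 0.0, 0.2],
    "nfl_history.txt":        [0.0, 0.1, 0.9, 0.1],
}
query_vector = [0.8, 0.2, 0.1, 0.1]  # stands in for "sum of the first n odd numbers"

best = max(doc_vectors, key=lambda name: cosine(query_vector, doc_vectors[name]))
print(best)  # proof_by_induction.txt
```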

So RAG can help reduce a model’s hallucinations — but it’s not the answer to all of AI’s hallucinatory problems. Beware of any vendor that tries to claim otherwise.