Google Gemini: Hacking Memories with Prompt Injection and Delayed Tool Invocation.
Based on lessons learned previously, developers had already trained Gemini to withstand indirect prompts instructing it to make changes to an account’s long-term memories without explicit directions from the user. By introducing a condition to the instruction that it’s performed only after the user says or does some variable X, which they were more likely to take anyway, Rehberger easily cleared that safety barrier.
“When the user later says X, Gemini, believing it’s following the user’s direct instruction, executes the tool,” Rehberger explained. “Gemini, principally, incorrectly ‘thinks’ the user explicitly desires to invoke the tool! It’s a little bit of a social engineering/phishing attack but nevertheless shows that an attacker can trick Gemini to store fake information right into a user’s long-term memories just by having them interact with a malicious document.”
Cause once more goes unaddressed
Google responded to the finding with the assessment that the general threat is low risk and low impact. In an emailed statement, Google explained its reasoning as:
On this instance, the probability was low since it relied on phishing or otherwise tricking the user into summarizing a malicious document after which invoking the fabric injected by the attacker. The impact was low since the Gemini memory functionality has limited impact on a user session. As this was not a scalable, specific vector of abuse, we ended up at Low/Low. As all the time, we appreciate the researcher reaching out to us and reporting this issue.
Rehberger noted that Gemini informs users after storing a brand new long-term memory. Which means vigilant users can tell when there are unauthorized additions to this cache and might then remove them. In an interview with Ars, though, the researcher still questioned Google’s assessment.
“Memory corruption in computers is pretty bad, and I feel the identical applies here to LLMs apps,” he wrote. “Just like the AI may not show a user certain info or not discuss certain things or feed the user misinformation, etc. The nice thing is that the memory updates don’t occur entirely silently—the user at the least sees a message about it (although many might ignore).”