A new paper from Google DeepMind warns that a single malicious email can trick an AI agent into stealing and handing over private data.
The researchers call this an “AI Agent Trap”. In their study, crafted emails and web pages forced agents to give up passwords, local files, and other sensitive information. These attacks worked more than 80% of the time across five different types of agents. One real-world example in the paper describes a single email that bypassed safety filters and leaked an entire private conversation to an attacker-controlled server.
The paper also found that hidden web text, invisible to human readers, took control of agents in up to 86% of test scenarios. Simple HTML code placed off-screen or in metadata tags changed what agents summarised in 15% to 29% of cases.
Attackers can also hide commands inside images. A single picture with tiny, unnoticeable changes can trick a vision language model into following harmful instructions. In multi-agent systems, one such image can spread the attack to one million agents very quickly.
The researchers warn that many of these traps have no standard safety tests yet. They describe an “accountability gap” – it is not clear who is legally responsible if a tricked agent commits a crime or leaks data.