DeepMind’s AI Agent Traps: Hackers Could Hijack Your AI Agents-Here’s How

Deepmind’s ‘AI Agent Traps’ Paper Maps How Hackers Could Weaponize AI Agents Against Users

Researchers at Google DeepMind have created a comprehensive guide detailing how harmful content on the web can trick, take control of, and misuse AI systems, potentially turning them against the people who use them.

Key Takeaways:

  • Google Deepmind researchers identified 6 AI agent trap categories, with content injection success rates reaching 86%.
  • Behavioural Control Traps targeting Microsoft M365 Copilot achieved 10/10 data exfiltration in documented tests.
  • Deepmind calls for adversarial training, runtime content scanners, and new web standards to secure agents by 2026.

Deepmind Paper: AI Agents Can Be Hijacked Through Poisoned Memory, Invisible HTML Commands

The paper, titled “AI Agent Traps,” was authored by Matija Franklin, Nenad Tomasev, Julian Jacobs, Joel Z. Leibo, and Simon Osindero, all affiliated with Google Deepmind, and posted to SSRN in late March 2026. It arrives as companies race to deploy AI agents capable of browsing the web, reading emails, executing transactions, and spawning sub-agents without direct human supervision.

The researchers point out that these abilities can actually be a weakness. According to their paper, this ‘trap’ works by using the agent’s own skills against it, by changing the surroundings instead of the agent itself.

The paper’s framework identifies a total of six attack categories organized around what part of an agent’s operation they target. Content Injection Traps exploit the gap between what a human sees on a webpage and what an AI agent parses in the underlying HTML, CSS, and metadata.

Instructions hidden in HTML comments, accessibility tags, or styled-invisible text never appear to human reviewers but register as legitimate commands to agents. The WASP benchmark found that simple, human-written prompt injections embedded in web content partially hijack agents in up to 86% of scenarios tested.

Semantic Manipulation Traps work differently. Rather than injecting commands, they saturate text with framing, authority signals, or emotionally charged language to skew how an agent reasons. Large language models (LLMs) exhibit the same anchoring and framing biases that affect human cognition, meaning rephrasing identical facts can produce dramatically different agent outputs.

Cognitive State Traps take manipulation a step further by corrupting the information sources agents use to remember things. Studies show that adding just a few carefully crafted documents to an agent’s knowledge base can consistently steer its answers to specific questions. Remarkably, these attacks can succeed over 80% of the time with less than 0.1% of the data being altered.

Behavioral Control Traps directly manipulate what an AI agent *does*, bypassing more complex defenses. This includes techniques like hidden instructions that force the AI to ignore its safety guidelines, commands that steal user data and send it to attackers, and methods that trick the AI into creating compromised versions of itself.

This report details a security vulnerability in Microsoft’s M365 Copilot. A specially designed email was able to trick the system into ignoring its usual security checks, allowing sensitive information to be sent to a malicious server. The attack leverages what are called ‘Systemic Traps,’ which are designed to disrupt entire networks of connected systems at once, rather than just targeting individual computers.

These attacks come in various forms, such as overwhelming systems with synchronized requests, causing failures that spread like the 2010 stock market ‘Flash Crash’, and hiding malicious code within seemingly harmless sources that combine to form a complete attack when brought together.

“Seeding the environment with inputs designed to trigger macro-level failures via correlated agent behaviour,” the Google Deepmind paper explains, becomes increasingly dangerous as AI model ecosystems grow more homogeneous. The finance and crypto sectors face direct exposure given how deeply algorithmic agents are embedded in trading infrastructure.

Human-in-the-Loop Traps round out the taxonomy by targeting the human supervisors watching over agents rather than the agents themselves. A compromised agent can generate outputs engineered to induce approval fatigue, present technically dense summaries that a non-expert would authorize without scrutiny, or insert phishing links that look like legitimate recommendations. The researchers describe this category as underexplored but expected to grow as hybrid human-AI systems scale.

Researchers Say Securing AI Agents Requires More Than Technical Fixes

This research doesn’t look at these six security traps as separate, independent elements. Instead, they can be combined and built upon each other, triggered by multiple sources, or set to activate based on future events. In testing, every agent involved in the red-teaming exercises was successfully breached at least once, and sometimes even performed unauthorized or damaging actions.

OpenAI CEO Sam Altman and others have previously flagged the risks of giving agents unchecked access to sensitive systems, but this paper provides the first structured map of exactly how those risks materialize in practice. Deepmind’s researchers call for a coordinated response spanning three areas.

To improve AI safety, experts suggest using techniques like ‘adversarial training’ during development and constantly checking content both before and after it’s processed by the AI. They also propose new web standards that would let websites identify content meant for AI use, along with systems to assess the trustworthiness of websites. If an AI starts behaving strangely, these systems could even pause it during a task.

As a crypto investor, one thing that’s really worrying me is the legal side of things with these AI agents. If one of them gets hacked and does something illegal, like a financial crime, it’s totally unclear *who* is responsible. Is it the person running the agent, the company that made the AI model, or the owner of the platform it’s running on? There’s a real accountability gap, and it feels like a big problem that needs to be figured out.

“The web was built for human eyes; it is now being rebuilt for machine readers.”

As agent adoption accelerates, the question shifts from what information exists online to what AI systems will be made to believe about it. Whether policymakers, developers, and security researchers can coordinate fast enough to answer that question before real-world exploits arrive at scale remains the open variable.

Read More

2026-04-06 06:57