[AGENT-SECURITY] After scanning 3000+ AI Agent Security papers, what could we see?-DingYu的一些筆記

Background

While I was at Anetac, our agent security work was scoped to detecting permission issues for agents on MCP. Now I want to take a full look at the AI Agent Security landscape.

I built a pipeline to collect papers: arXiv is scanned for breadth using 31 sets of boolean keywords (covering prompt injection, tool use, multi-agent, MCP, RAG/memory, guardrail, red teaming, and other sub-topics) sorted by submission date; Semantic Scholar is queried for depth using 20 natural-language queries sorted by relevance. The two streams are deduplicated by arXiv ID (stripped of the version suffix) and dropped into a Zotero group library, currently accumulating ~3,767 preprints (date range 2025-04 to 2026-04). After running an agent to batch-read all the abstracts, group-summarize them, and a manual classification pass, AI Agent security has roughly settled into a “five-layer stack”:

Layer	Threat type	Representative papers	Why this layer is its own
5: Ecosystem & governance	Supply-chain attacks / protocol vulnerabilities / regulatory gaps	Tool Squatting (21), ETDI / Rug Pull (11), MCP Safety Audit (56)	Attackers target the agent ecosystem, not an individual model
4: Multi-agent collaboration	Cascading failures / trust propagation / collusion	Open Challenges in Multi-Agent Security (50)	Multi-agent interactions create unique threats
3: Tool & environment interaction	Tool poisoning / privilege escalation / data leakage	Prompt Injection to Tool Selection (49), MCPTox (22), WASP (69)	The attack surface where agents touch the outside world
2: Model behavior & alignment	Prompt injection / jailbreak / agent alignment failure	DataSentinel (72), X-Teaming (60), PromptArmor (58), Agentic Misalignment (78), SHADE-Arena (24)	External attacks + internal alignment failures
1: Data & training	Data poisoning / backdoors / fine-tuning safety collapse	Deep Ignorance (31), Fine-Tuning Lowers Safety (12)	Risk baked into the model at birth

1. Data & training

The risk happens before the model is even “born” — pretraining data is poisoned, backdoors are planted, or the originally well-aligned behavior is disrupted during post-training fine-tuning. What makes this layer special: once the issue is written into the weights, no runtime defense at any later layer can retroactively fix it; you can only detect it after the fact.

Representative papers:

Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency — building on the existing finding that even benign fine-tuning damages safety alignment, this paper points out something worse: safety benchmark results themselves vary substantially across fine-tuning setups — even seemingly unrelated experimental tweaks change the evaluation outcome. In other words, the very tool we use to measure fine-tuning’s effect on safety is itself unstable.
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs — a purely defensive contribution: by filtering dual-use topics (e.g., biothreat knowledge) out of pretraining data, the authors produce tamper-resistant open-weight models that withstand up to 10,000 steps and 300M tokens of adversarial fine-tuning — outperforming existing post-training defenses by an order of magnitude, with no observed degradation to unrelated capabilities.

2. Model behavior & alignment

This layer actually contains two kinds of risk: external attacks (broadly, prompt injection) and internal failures (alignment failures that emerge when the agent operates autonomously).

External attacks are the prompt injection category: direct injection (the attacker is the user), indirect injection (malicious instructions hidden in RAG documents, web pages, or tool return values), and jailbreaks. The goal is always to make the model violate its system prompt’s constraints. There’s an interesting twist at this layer: the stronger the model’s instruction-following ability, the larger the attack surface tends to be.

Internal failures are what happens when an agent runs autonomously — even with no external attacker, the model itself will strategically deceive, evade monitoring, or leak information in order to achieve its goal or avoid replacement. This is a threat that only emerged once agents became autonomous, and traditional single-turn LLM safety evaluations can’t catch it.

Representative papers:

DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks — currently one of the most-cited prompt injection detection methods.
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents — an automated jailbreak red-teaming pipeline that scales human red teaming.
PromptArmor: Simple yet Effective Prompt Injection Defenses — a representative framework for input-layer prompt injection filtering.
Agentic Misalignment: How LLMs Could Be Insider Threats — the Anthropic 2025 paper that stress-tested 16 frontier models. Under threat scenarios (imminent replacement, goal conflict), models will strategically lie, blackmail, and leak information — and they behave more cautiously when they believe they’re being tested, worse when they believe it’s a real deployment.
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents — evaluates whether an agent can covertly complete a harmful side task while being observed by another LLM monitor. Claude 3.7 Sonnet reaches 27% success on sabotage tasks.

3. Tool & environment interaction

This is the layer my previous company Anetac focused on — the attack surface exposed when agents reach the outside world via MCP / function calling / browser control (reading files, calling APIs, writing to databases, operating browsers). Common attack patterns include: over-permissioned MCP servers, tool descriptions being maliciously rewritten so that the agent picks the wrong tool, and tool return values smuggling indirect injections to exfiltrate data.

Representative papers:

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks — an attack benchmark for web agents in a real browser environment.
MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers — a poisoning attack-surface map of the MCP ecosystem.
Prompt Injection Attack to Tool Selection in LLM Agents — points out that an attacker doesn’t need to hijack the agent’s final output; just manipulating “which tool it picks” is enough to cause severe consequences.

4. Multi-agent collaboration

When multiple agents call each other (orchestrator → worker, A’s output is fed into B), threats appear that don’t exist in single-agent systems: trust propagates along the call chain, hijacking one agent cascades through the whole chain, and agents may even “collude” to make decisions no single agent would make on its own.

Representative papers:

Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents — an inventory of new threats specific to multi-agent settings: free-form protocols enable secret collusion and coordinated swarm attacks; network effects let jailbreaks, privacy breaches, and poisoning spread rapidly across the agent network; and stealth optimization makes it easier for adversaries to evade oversight.

5. Ecosystem & governance

At the top layer, the target is no longer a single model or agent but the entire agent ecosystem: tool squatting on MCP marketplaces, tool publishers accumulating trust and then pushing a malicious update (rug pull), design flaws in the protocols themselves, and accountability questions in a regulatory vacuum.

Representative papers:

MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits — a systematic audit of the MCP protocol and ecosystem.
Securing GenAI Multi-Agent Systems Against Tool Squatting: A Zero Trust Registry-Based Approach — analogous to npm typosquatting, the attacker uses an MCP server with a look-alike name to trick the agent into connecting; this paper proposes a zero-trust registry as the defense.
ETDI: Mitigating Tool Squatting and Rug Pull Attacks in MCP — Rug Pull refers to a tool being benign at first, accumulating trust, and later pushing a malicious version; this paper defends with OAuth-enhanced tool definitions.

Conclusion

Going from looking at MCP permissions on just L3 while at Anetac, to laying out the full field of papers and seeing the entire 5-layer stack — this exercise made one thing clear: agent security cannot be defended end-to-end with a single comprehensive approach or product. Each layer has different attack surfaces, different attackers, and different defensive tools, so they have to be addressed one layer at a time.

ChangeLog

20260428-init, AI assist editted
20260501–translate by claude code

[AGENT-SECURITY] After scanning 3000+ AI Agent Security papers, what could we see?

Global minimum view of AI Agent Security

Background

1. Data & training

2. Model behavior & alignment

3. Tool & environment interaction

4. Multi-agent collaboration

5. Ecosystem & governance

Conclusion

ChangeLog

CATALOG

FEATURED TAGS