[AGENT-SECURITY] After scanning 3000+ AI Agent Security papers, what could we see?

Global minimum view of AI Agent Security

Posted by Jamie on Tuesday, April 28, 2026

Background

While I was at Anetac, our agent security work was scoped to detecting permission issues for agents on MCP. Now I want to take a full look at the AI Agent Security landscape.

I built a pipeline to collect papers: arXiv is scanned for breadth using 31 sets of boolean keywords (covering prompt injection, tool use, multi-agent, MCP, RAG/memory, guardrail, red teaming, and other sub-topics) sorted by submission date; Semantic Scholar is queried for depth using 20 natural-language queries sorted by relevance. The two streams are deduplicated by arXiv ID (stripped of the version suffix) and dropped into a Zotero group library, currently accumulating ~3,767 preprints (date range 2025-04 to 2026-04). After running an agent to batch-read all the abstracts, group-summarize them, and a manual classification pass, AI Agent security has roughly settled into a “five-layer stack”:

Layer Threat type Representative papers Why this layer is its own
5: Ecosystem & governance Supply-chain attacks / protocol vulnerabilities / regulatory gaps Tool Squatting (21), ETDI / Rug Pull (11), MCP Safety Audit (56) Attackers target the agent ecosystem, not an individual model
4: Multi-agent collaboration Cascading failures / trust propagation / collusion Open Challenges in Multi-Agent Security (50) Multi-agent interactions create unique threats
3: Tool & environment interaction Tool poisoning / privilege escalation / data leakage Prompt Injection to Tool Selection (49), MCPTox (22), WASP (69) The attack surface where agents touch the outside world
2: Model behavior & alignment Prompt injection / jailbreak / agent alignment failure DataSentinel (72), X-Teaming (60), PromptArmor (58), Agentic Misalignment (78), SHADE-Arena (24) External attacks + internal alignment failures
1: Data & training Data poisoning / backdoors / fine-tuning safety collapse Deep Ignorance (31), Fine-Tuning Lowers Safety (12) Risk baked into the model at birth

1. Data & training

The risk happens before the model is even “born” — pretraining data is poisoned, backdoors are planted, or the originally well-aligned behavior is disrupted during post-training fine-tuning. What makes this layer special: once the issue is written into the weights, no runtime defense at any later layer can retroactively fix it; you can only detect it after the fact.

Representative papers:

  • Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency — building on the existing finding that even benign fine-tuning damages safety alignment, this paper points out something worse: safety benchmark results themselves vary substantially across fine-tuning setups — even seemingly unrelated experimental tweaks change the evaluation outcome. In other words, the very tool we use to measure fine-tuning’s effect on safety is itself unstable.
  • Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs — a purely defensive contribution: by filtering dual-use topics (e.g., biothreat knowledge) out of pretraining data, the authors produce tamper-resistant open-weight models that withstand up to 10,000 steps and 300M tokens of adversarial fine-tuning — outperforming existing post-training defenses by an order of magnitude, with no observed degradation to unrelated capabilities.

2. Model behavior & alignment

This layer actually contains two kinds of risk: external attacks (broadly, prompt injection) and internal failures (alignment failures that emerge when the agent operates autonomously).

External attacks are the prompt injection category: direct injection (the attacker is the user), indirect injection (malicious instructions hidden in RAG documents, web pages, or tool return values), and jailbreaks. The goal is always to make the model violate its system prompt’s constraints. There’s an interesting twist at this layer: the stronger the model’s instruction-following ability, the larger the attack surface tends to be.

Internal failures are what happens when an agent runs autonomously — even with no external attacker, the model itself will strategically deceive, evade monitoring, or leak information in order to achieve its goal or avoid replacement. This is a threat that only emerged once agents became autonomous, and traditional single-turn LLM safety evaluations can’t catch it.

Representative papers:

3. Tool & environment interaction

This is the layer my previous company Anetac focused on — the attack surface exposed when agents reach the outside world via MCP / function calling / browser control (reading files, calling APIs, writing to databases, operating browsers). Common attack patterns include: over-permissioned MCP servers, tool descriptions being maliciously rewritten so that the agent picks the wrong tool, and tool return values smuggling indirect injections to exfiltrate data.

Representative papers:

4. Multi-agent collaboration

When multiple agents call each other (orchestrator → worker, A’s output is fed into B), threats appear that don’t exist in single-agent systems: trust propagates along the call chain, hijacking one agent cascades through the whole chain, and agents may even “collude” to make decisions no single agent would make on its own.

Representative papers:

5. Ecosystem & governance

At the top layer, the target is no longer a single model or agent but the entire agent ecosystem: tool squatting on MCP marketplaces, tool publishers accumulating trust and then pushing a malicious update (rug pull), design flaws in the protocols themselves, and accountability questions in a regulatory vacuum.

Representative papers:

Conclusion

Going from looking at MCP permissions on just L3 while at Anetac, to laying out the full field of papers and seeing the entire 5-layer stack — this exercise made one thing clear: agent security cannot be defended end-to-end with a single comprehensive approach or product. Each layer has different attack surfaces, different attackers, and different defensive tools, so they have to be addressed one layer at a time.

ChangeLog

  • 20260428-init, AI assist editted
  • 20260501–translate by claude code