TrustFall: The Perimeter Problem in Agentic Tools

Mon, 18 May 2026 10:00:00 -0700

On May 7, 2026, Adversa AI published TrustFall — a one-click remote code execution in Claude Code, with variants across Gemini CLI, Cursor, and GitHub Copilot: clone a repository, open it, click “Yes, I trust this folder”, and an attacker-controlled process runs with your full OS privileges.

Anthropic declined the finding, describing it as outside their threat model and the behavior as functioning as designed. This post digs into that response and argues that the core issue is architectural: a perimeter security model that can’t carry the weight placed on it, and that makes the vulnerability structurally hard to surface through threat modeling. It then looks at what a zero trust alternative would look like for agentic tools.

1. The vulnerability

Project-level config files — .mcp.json and .claude/settings.json — committed to a repository can activate attacker-controlled MCP servers. When a developer clones the repo and clicks through the trust dialog, an unsandboxed Node.js process spawns with full user OS privileges — no further prompts. Three developer actions: clone, open, click. The attack also has a zero-click CI/CD variant where Claude Code runs headless in GitHub Actions against an untrusted pull request branch, bypassing the trust dialog entirely.

For the full technical details see TrustFall.

In threat modeling terms, the mechanism is a trust boundary problem. Project-scope config — committed to a repository — can activate MCP servers: external processes that run with full user OS privileges. The gate between untrusted repo content and those privileges is a single user prompt:

flowchart TB
    subgraph Untrusted["Untrusted · attacker-controlled"]
        PCfg["Project config\n(committed to the repo)"]
    end

    CLI["Claude Code CLI"]

    subgraph Gate["Trust gate"]
        Prompt["'Do you trust this folder?'\nsingle Yes/No"]
    end

    subgraph Privileged["Privileged · full user OS access"]
        MCP["MCP Server process\nNode.js · no sandbox"]
        OS["~/ · ~/.ssh · ~/.aws\nread / write / exec — unrestricted"]
    end

    PCfg --> CLI
    CLI --> Prompt
    Prompt -- "on 'Yes'" --> MCP

That single Yes/No covers four distinct capability grants:

Claude reading and editing project files (clearly implied)
Claude following project-level behavioral settings (reasonable)
Activating MCP servers defined in project config (not stated)
Those servers running as unsandboxed processes with full user privileges (definitely not stated)

Two of these carry significant security consequences, and neither appears in the prompt language. The gap between what the user consents to and what the system delivers is a textbook Elevation of Privilege (the E in STRIDE): the user grants more than they know.
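One way to make that gap concrete is to write the four grants down as data and ask which ones are both consequential and unstated. The sketch below is illustrative only: the `Grant` type, the grant names, and the flag values are assumptions that encode this post's reading of the prompt, not anything taken from the tools themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One capability conferred by answering 'Yes' to the trust prompt."""
    name: str
    implied_by_prompt: bool  # can a user reasonably infer it from the wording?
    high_risk: bool          # does it carry significant security consequences?


# The four grants from the list above. Names and flags are this sketch's
# assumptions, encoding the parenthetical remarks in that list.
TRUST_PROMPT_GRANTS = [
    Grant("read_and_edit_project_files", implied_by_prompt=True, high_risk=False),
    Grant("follow_project_behavior_settings", implied_by_prompt=True, high_risk=False),
    Grant("activate_project_mcp_servers", implied_by_prompt=False, high_risk=True),
    Grant("run_servers_unsandboxed_as_user", implied_by_prompt=False, high_risk=True),
]


def consent_gap(grants):
    """Return the names of grants that are high-risk but not implied by the prompt."""
    return [g.name for g in grants if g.high_risk and not g.implied_by_prompt]


if __name__ == "__main__":
    for name in consent_gap(TRUST_PROMPT_GRANTS):
        print("undisclosed high-risk grant:", name)
```

Anything `consent_gap` returns is a capability the user hands over without being told, which is the Elevation of Privilege described above.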

The pattern holds across the tools TrustFall examined. Gemini CLI and Cursor do mention MCP servers in their consent language while Claude Code and Copilot don’t, but all four default to Yes or Trust.

2. “Outside our threat model”

Anthropic’s phrase is worth examining literally. Per TrustFall’s analysis, the missing enforcement isn’t outside Anthropic’s defined boundary — it’s inside it. The trust dialog is the perimeter; what happens after it is, by their own framing, the trusted zone. “Outside our threat model” means, in practice: inside our perimeter, but below the granularity we protect at.

That granularity isn’t uniformly coarse — the capability is demonstrably known. bypassPermissions gets a dedicated warning because it is dangerous; enableAllProjectMcpServers, enabledMcpjsonServers, and permissions.allow activate equally dangerous behavior without equivalent disclosure. TrustFall also notes that earlier versions of Claude Code included an explicit MCP consent prompt that was later removed. These are the tell-tale signs of a threat model that’s coarser than the reality it represents — some dangerous capabilities are visible enough to gate explicitly, others slip through the same boundary unexamined.
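Until disclosure improves, a developer can look for these settings before opening a cloned repository. The following is a minimal sketch, not a vetted scanner: the setting names come from the text above, while the `audit_repo` helper, its output wording, and the assumption that `.mcp.json` declares servers under a top-level `mcpServers` object are illustrative choices that should be checked against the tool's current documentation.

```python
import json
from pathlib import Path

# Setting names taken from the text above; treating them as "worth a look"
# is this sketch's judgment, not an official classification.
FLAGGED_SETTINGS_KEYS = ("enableAllProjectMcpServers", "enabledMcpjsonServers")


def load_json(path):
    """Return parsed JSON, or None if the file is missing or malformed."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


def audit_repo(repo_root):
    """List project-scope config entries worth reading before trusting a folder."""
    root = Path(repo_root)
    findings = []

    mcp = load_json(root / ".mcp.json")
    if isinstance(mcp, dict):
        # Assumption: servers are declared under a top-level "mcpServers" object.
        servers = mcp.get("mcpServers")
        if isinstance(servers, dict):
            for name, spec in servers.items():
                command = spec.get("command") if isinstance(spec, dict) else None
                findings.append(
                    f".mcp.json defines server {name!r} (command: {command!r})"
                )

    settings = load_json(root / ".claude" / "settings.json")
    if isinstance(settings, dict):
        for key in FLAGGED_SETTINGS_KEYS:
            if settings.get(key):
                findings.append(f"settings.json sets {key}")
        permissions = settings.get("permissions")
        if isinstance(permissions, dict) and permissions.get("allow"):
            findings.append("settings.json sets permissions.allow")

    return findings


if __name__ == "__main__":
    for finding in audit_repo("."):
        print(finding)
```

An empty result does not mean the repository is safe. It only means none of the settings named in this post were found in the two files checked.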

That pattern of selective gating is further undermined by the CVE record. Anthropic’s response to TrustFall was that the behavior functions as designed: clicking “trust this folder” means accepting the project’s configuration, MCP servers included. Yet over a roughly six-month span before TrustFall, Anthropic patched three related vulnerabilities: it delayed MCP activation until after the trust dialog (CVE-2025-59536, Oct 2025), blocked ANTHROPIC_BASE_URL from project scope (CVE-2026-21852, Jan 2026), and blocked bypassPermissions from project scope (CVE-2026-33068, Mar 2026), the same setting that already carried a UI warning. Each fix adds a specific gate or blocklist entry, which is the signature of a perimeter being hardened incrementally, one dangerous capability at a time, without a unifying policy. If the trust dialog truly constituted full consent by design, there would have been nothing to patch.
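The difference between incremental blocklisting and a unifying policy can be shown in a few lines. This is a conceptual sketch, not a description of how Claude Code is implemented: the two blocklist entries are the ones named in the CVE fixes above, while the allowlist entries (`editorTabSize`, `preferredLanguage`) are invented placeholders.

```python
# Entries blocked from project scope by the fixes described above.
PROJECT_SCOPE_BLOCKLIST = {"ANTHROPIC_BASE_URL", "bypassPermissions"}

# Hypothetical allowlist: these two names are invented for illustration only.
PROJECT_SCOPE_ALLOWLIST = {"editorTabSize", "preferredLanguage"}


def accepted_by_blocklist(key):
    """Incremental hardening: anything not yet known to be dangerous passes."""
    return key not in PROJECT_SCOPE_BLOCKLIST


def accepted_by_default_deny(key):
    """Unifying policy: only explicitly reviewed entries pass from project scope."""
    return key in PROJECT_SCOPE_ALLOWLIST


if __name__ == "__main__":
    for key in ("bypassPermissions", "enableAllProjectMcpServers", "editorTabSize"):
        print(
            f"{key}: blocklist={accepted_by_blocklist(key)} "
            f"default_deny={accepted_by_default_deny(key)}"
        )
```

Under the blocklist, a dangerous setting such as enableAllProjectMcpServers is accepted until someone files a report and a new entry is added. Under default deny, it is rejected until someone decides it is safe to allow.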

3. From perimeter to zero trust

The “functions as designed” response places the burden on the developer: audit what you clone. That position rests on a perimeter security architecture — verify once at the gate, trust everything inside:

flowchart LR
    A1["Repo"] -->|"✓ trust gate"| A2["CLI"] --> A3["MCP"] --> A4["OS"]

The perimeter pattern is one that modern security has largely moved past. The alternative is zero trust: identity propagated through the capability chain and evaluated at each grant. Git provides the primitives: clone origin, remote URL, commit author. Conceptually, it would look like this:

flowchart LR
    B1["Repo\norigin · author"] -->|"✓ id check"| B2["CLI"] -->|"✓ capability?"| B3["MCP"] -->|"✓ scope?"| B4["OS"]
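As a rough sketch of the second flow, each hop can be checked against the repository's identity rather than inherited from a single gate. Everything here is hypothetical: the `Rule` type, the capability names, and the example URLs are assumptions for illustration, and only the remote URL is used as the identity signal (in practice it might come from `git remote get-url origin`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical policy rule: which capabilities a remote prefix may receive."""
    remote_prefix: str
    capabilities: frozenset


def allowed(rules, remote_url, capability):
    """Evaluate one capability request; the first matching rule decides."""
    for rule in rules:
        if remote_url.startswith(rule.remote_prefix):
            return capability in rule.capabilities
    return False  # default deny for origins no rule covers


def evaluate_chain(rules, remote_url, chain):
    """Check every hop on its own instead of trusting everything past one gate."""
    return {capability: allowed(rules, remote_url, capability) for capability in chain}


# Illustrative policy: more specific prefixes are listed first.
RULES = [
    Rule(
        "https://github.com/my-org/",
        frozenset({"read_files", "edit_files", "spawn_mcp_server"}),
    ),
    Rule("https://github.com/", frozenset({"read_files"})),
]

CHAIN = ["read_files", "spawn_mcp_server", "access_home_directory"]

if __name__ == "__main__":
    for url in (
        "https://github.com/my-org/app.git",
        "https://github.com/stranger/repo.git",
    ).__iter__():
        print(url, evaluate_chain(RULES, url, CHAIN))
```

The design choice that matters is the per-hop evaluation: a repository from an unfamiliar origin can still be read, but spawning an MCP server or reaching into the home directory is a separate decision with its own answer.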

Read through the zero trust lens, Anthropic’s position has three problems.

Shared responsibility requires the system to carry its half. Under a perimeter model, all verification burden falls on the user at the gate — the system has the identity signals, but leaves them unused. Zero trust distributes the burden: each capability is evaluated at the point it’s granted.

The consent gate can’t convey per-capability trust. A perimeter gate concentrates all trust decisions into one moment. Anthropic’s gate covers MCP server activation, process spawning, and full OS access — none of it signaled. The coarser the gate, the harder it is to make consent meaningful.

The perimeter model puts expert-level burden on non-expert users. A perimeter gate requires the user to reason about all downstream consequences of a single click — a reasonable ask for a security engineer, not for a vibe-coder. Zero trust shifts that burden to the architecture: each capability grant is evaluated by the system, not the user.

The three problems compound: the system ignores available identity signals, the gate doesn’t compensate by informing the user what it actually grants, and the users left holding that gap aren’t equipped to close it.

Final thoughts

This post looked at two connected things: what the trust gate actually grants, and why the security architecture makes that easy to miss. At its core, TrustFall is a consent gap: a single prompt covering MCP server activation and unsandboxed OS access without stating either. The perimeter model is the structural reason that gap is hard to surface at design time: everything inside the perimeter is trusted by definition, leaving no natural STRIDE targets. When threat modeling a perimeter-architected system, finding the EoP requires the modeler to look past the gate and ask what it actually grants. That’s a skill, not something the methodology surfaces automatically. The CVE record shows this playing out: each patch adds a specific gate or blocklist entry without restructuring the boundary, and “functioning as designed” remains the public response to TrustFall.

A zero trust security architecture changes the shape of the problem. Explicit trust boundaries at each capability grant are natural STRIDE targets — the EoP question surfaces at design time regardless of modeler experience, not because of better analysts, but because the architecture itself gives threat modeling more boundaries to work with.

TrustFall affected Claude Code, Cursor, Gemini CLI, and Copilot — evidently all due to the same perimeter model. The broader security industry made this architectural transition before: perimeter security dominated until systems grew complex enough that a single gate couldn’t hold, and zero trust emerged as the answer. Agentic tools are on a similar trajectory — gaining capability and OS access fast, with the same pressure building at the trust boundary.

Is zero trust the natural next step in the architectural evolution of agentic tools?

References