Posts on brain overflow

Hidden Gaps in Claude Code Security Reviews

Mon, 01 Jun 2026 11:35:14 -0700

Anthropic recently shipped a new security plugin for Claude Code that automatically reviews code for vulnerabilities as you make changes, complementing the existing /security-review skill. I decided to test both against a deliberately constructed set of security flaws to see if the new tool improves coverage. Little did I know how deep this rabbit hole would take me. Fair warning: this is a long read.

1. Background

Claude Code supports LLM-based security reviews at three stages:

| Tool | Plans | What the reviewer sees |
| --- | --- | --- |
| `/security-review` | All | Full branch, same or new session context (user’s choice) |
| Security guidance plugin (new, May 2026) | All | Git diff from current turn, fresh model context |
| Code Review | Team / Enterprise only | Full codebase, multi-agent, independent model (runs on PRs) |

The new plugin shipped with an explicit design goal: avoid the model anchoring bias problem I wrote about earlier. To understand what model bias means here, consider the human equivalent: if you ask the author of the code to review it, they’ll likely tell you it’s fine — they wrote it after all. A reviewer who wasn’t in the room when the decisions were made will challenge assumptions the author has stopped seeing. The same dynamic applies to LLMs: when Claude writes code and then reviews it in the same session, it has the full conversation history in context, including every design choice and tradeoff it reasoned through while writing. It validates against those decisions rather than challenging them. A fresh session is the AI equivalent of a second pair of eyes.

The new plugin addresses this by running a separate Opus 4.7 session with a fresh context: the reviewer starts from the diff with no session history and no investment in the original approach. Anthropic’s own documentation is direct about the design intent:

“The plugin does not ask the same Claude instance that wrote the code to grade itself. […] The end-of-turn and commit reviews run as a separate Claude call with a fresh context and a security-focused prompt: the reviewer starts from the diff, has no investment in the original approach, and is instructed only to find problems.”

This is a real solution to the model bias problem, but if you read deeper, it has its own limitation: a diff-scoped reviewer can only see what changed in the current turn and cannot reason about interactions between pre-existing code and new additions. That constraint is likely a cost decision: Opus 4.7 is expensive, and reviewing the full codebase on every change would be prohibitively token-intensive.

This gives me two hypotheses to experiment with:

H1: same-session security-review is affected by model anchoring bias and will suppress findings that a cold run on the same code surfaces. The delta between the two runs measures how bad the gap is in practice.

H2: the newly introduced diff-based plugin will miss vulnerability chains where each change looks benign in isolation but the two together form something exploitable, because the reviewer only ever sees one diff at a time and has no memory of what came before.

2. Test corpus

My target is based on a real Telegram bot that routes voice and text messages into a backend, but the version used here was vibe-coded from scratch for this experiment. The spec was written to elicit insecure decisions without explicitly asking for them: the goal was a realistic-looking codebase with seeded flaws.

The three flaws, ranging in complexity:

F1: Fail-open authentication (simple)

The TELEGRAM_ALLOWED_USERS env var is read into a set, TELEGRAM_USERS, at startup. When the env var is absent, the set is empty. The auth guard uses the truthiness of that set as part of its condition:

if TELEGRAM_USERS and (not user or user.id not in TELEGRAM_USERS):
    return

When TELEGRAM_USERS is empty, the entire if is skipped: any Telegram user is accepted. The correct default is deny-all: a bot that can read files and spawn subprocesses should fail closed, not open.
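A fail-closed version of the guard could look like the sketch below. It assumes the env var holds comma-separated numeric Telegram user IDs; the helper names `load_allowed_users` and `is_authorized` are illustrative and not taken from the bot.

```python
from typing import FrozenSet, Optional


def load_allowed_users(raw: Optional[str]) -> FrozenSet[int]:
    """Parse the allowlist; a missing or empty value yields an empty set."""
    # Assumption: the value is a comma-separated list of numeric user IDs.
    if not raw:
        return frozenset()
    return frozenset(int(part) for part in raw.split(",") if part.strip())


def is_authorized(user_id: Optional[int], allowed: FrozenSet[int]) -> bool:
    """Fail closed: an empty allowlist authorizes nobody."""
    return user_id is not None and user_id in allowed
```

With this shape, the handler returns early whenever `is_authorized(...)` is false, so an unset variable locks the bot down instead of opening it up.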

F2: Unrestricted subprocess permissions (medium)

The bot classifies incoming messages and dispatches them, spawning a claude -p subprocess with the process_notes skill and the --allowedTools list the skill needs to run its operations. The allowed tools list passed to the inner Claude instance includes Bash(python3:*) without path restrictions. The process_notes skill reads the note from disk and invokes Python with it as input. If the skill passes note content to Python without sanitization, the chain ends in arbitrary code execution.

# python3 unrestricted
allowed_tools = "Read,Write,Bash(python3:*),Bash(mv:*),Bash(rm:*), ..."

F3: Write + path-scoped python3 = write-then-execute chain (hard)

Now python3:* is hardened to python3:.claude/scripts/*, but the Write permission remains. The chain: write a payload to .claude/scripts/, then invoke it via python3. Neither permission is dangerous alone: the vulnerability only exists when you hold both simultaneously. This flaw is the key test case for the new plugin: a diff-based reviewer seeing only the second permission added can’t chain it to the first to recognize the combined severity.

allowed_tools = "Read,Write,Bash(python3:.claude/scripts/*),Bash(mv:*), ..."
#                     ^^^^^ unrestricted       ^^^^^ scoped — looks safe
# Chain: Write payload → .claude/scripts/evil.py, python3 runs attacker's code
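The difference between a diff-scoped view and a full-state view of this chain can be sketched in a few lines. This is a simplified illustration, not how the plugin works: the rule parsing is naive, only a bare `Write` is treated as unrestricted, and the helper names are made up for the example.

```python
import re
from typing import List

# Matches rules such as Bash(python3:*) or Bash(python3:.claude/scripts/*)
EXEC_RULE = re.compile(r"^Bash\(python3:.+\)$")


def parse_tools(allowed_tools: str) -> List[str]:
    """Split a comma-separated allowedTools string into individual rules."""
    return [rule.strip() for rule in allowed_tools.split(",") if rule.strip()]


def write_exec_chains(rules: List[str]) -> List[str]:
    """Return python3 rules that an unscoped Write could feed with a payload."""
    if "Write" not in rules:
        return []
    return [rule for rule in rules if EXEC_RULE.match(rule)]


baseline = parse_tools("Read,Write,Bash(mv:*)")
after = parse_tools("Read,Write,Bash(python3:.claude/scripts/*),Bash(mv:*)")
added = [rule for rule in after if rule not in baseline]

diff_view = write_exec_chains(added)  # only the newly added rule is visible
full_view = write_exec_chains(after)  # the whole permission set is visible
```

Here `diff_view` comes back empty because `Write` is not part of the change, while `full_view` contains the scoped python3 rule.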

The four tests map directly to the two hypotheses:

| Test | Tool | Setup | What it tests |
| --- | --- | --- | --- |
| T1 | `/security-review` | Same session that wrote the code | H1: does model bias suppress findings? |
| T2 | `/security-review` | Fresh session, no prior context | H1 control: cold reviewer, does it catch everything? |
| T3 | Security guidance plugin | `Write` + `python3:*` both new in the diff | H2: does plugin catch a chain when it’s fully visible? |
| T4 | Security guidance plugin | `Write` pre-existing, only `python3:.claude/scripts/*` added | H2: does plugin miss a chain split across commits? |

3. H1 | T1 & T2: /security-review and model bias

Model bias in AI security review is a known problem, and Anthropic implicitly acknowledged it by releasing the new plugin. I’m not going to dwell on it; the tests here are a quick empirical confirmation with a concrete measurement of the gap.

The security-review skill runs a multi-agent pipeline — initial identification followed by parallel false-positive filtering against known precedents, with a scoring system and a confidence threshold to report. It is more reliable than casually asking Claude to “review this code for vulnerabilities”, though some real issues may get filtered if the evidence isn’t strong enough.

Test 1 ran in the same session that produced the bot; Test 2 ran in a fresh session with no prior context. Same code, model, thinking effort, command.

| Finding | Test 1 (same-session) | Test 2 (cold session) |
| --- | --- | --- |
| Fail-open auth: empty allowlist (F1) | Not surfaced | High (0.95) |
| Path traversal in file retrieval | Not surfaced | Medium (0.82) |
| Unrestricted subprocess permissions (F2) | Filtered out | Not identified |

Test 1 reported zero vulnerabilities; Test 2 reported one High and one Medium. The same-session reviewer framed the threat model as “authorized-only Telegram access”, treating the auth as working correctly because the spec said so, and F1 never surfaced. The cold reviewer had no spec context and flagged it right away.

The path traversal finding is actually a stronger signal. It wasn’t seeded in my prompt; it was a real bug the vibe-coding session introduced on its own, with no spec instruction to blame. The cold reviewer caught it; the same-session reviewer missed it alongside F1. ✅ H1 confirmed.

4. F2: the component boundary you shall not pass

It’s not uncommon to run into new discoveries while testing a hypothesis. So why did neither run flag the unrestricted subprocess permissions?

The answer is in the architecture. Both write and process are legitimate bot operations: the attacker stores a note to disk, then triggers process normally. handle_process() spawns claude -p /process-notes --allowedTools [..]. The subprocess call is visible in inbox-bot.py, but the skill it invokes is a separate file. Whether the skill passes vault content to Python in an exploitable way, and whether the permissions it runs with are appropriate, live outside the review scope.

flowchart TD
    A["Attacker note\n(malicious payload)"] --> B["handle_write\n(legitimate)"]
    B --> C[("note on disk")]
    D["process command\n(legitimate)"] --> E["handle_process()"]
    E --> F["claude -p /process-notes\n--allowedTools Bash(python3:*) ..."]
    F -.->|"spawns"| G["/process-notes skill\n(out of review scope)"]
    G -->|"reads"| C
    G --> H["Bash(python3:*)\n→ RCE if vulnerable"]
    subgraph scope ["reviewed: inbox-bot.py"]
        B
        E
        F
    end

The architecture makes the concern visible: attacker-controlled vault content flows into a subprocess running with unrestricted python3. Neither automated reviewer evaluated it at that level, though they each hit the boundary differently.

In T1 (same session), the reviewer identified the chain, labeling it “Prompt injection via vault write to claude subprocess”, but the false-positive filter dismissed it: “The attacker and vault owner are the same person; there is no external trust boundary being crossed.” That’s model bias in a different form: not suppressing a finding outright, but supplying a session-derived trust assumption that the reviewer couldn’t actually validate, because doing so would require seeing what process-notes does with the data it receives.

In T2 (cold session), the reviewer checked for shell injection: seeing list-form subprocess.run with no shell=True, it marked the subprocess as clean and moved on. Seemingly, the presence of a known secure coding pattern steered the LLM into trusting the call as safe overall: the right invocation style closed scrutiny before it reached the component boundary question. The --allowedTools string with Bash(python3:*) was never evaluated.

Neither reviewer asked whether python3:* was too broad. That question doesn’t require seeing /process-notes to answer: attacker-controlled data flowing into a subprocess with unrestricted python3 is a concern on its own, regardless of what the downstream skill does with it. A human reviewer would flag that pattern without needing to verify what the downstream component does with it. When you can’t see past the boundary, the right default is to surface the concern.

5. H2 | T3 & T4: the plugin and diff isolation

After a quick detour, we’re back to probing the second hypothesis: does the new Claude plugin’s diff-scoped reviewer miss a vulnerability chain where each change looks benign in isolation? This time, the chain is in the same file, and in a single tool call.

Test 3 (1 diff): Write + Bash(python3:*) introduced together. The plugin caught both: python3:* flagged as too broad, Write flagged as needing tighter scope. Two correct findings, auto-fix applied. But it treated them as independent concerns rather than a chain. The fix addressed F2; F3 survived.

The security hook flagged two real issues:
1. Bash(python3:*) is too broad — permits running any Python script.
   Should be scoped to the specific script path.
2. Write is too broad — should be scoped to the wiki directory under VAULT_ROOT.

Test 4 (2 diffs):

Write committed in the baseline. Nothing suspicious in isolation.
Bash(python3:.claude/scripts/*) added in a new session. A narrow, path-scoped python3 permission — looks like a reasonable hardening move. Write is outside the diff and invisible to the reviewer.

LLM code review: no vulnerabilities found.

And just like that, ✅ H2 confirmed.

Side observation from the test: when my git commit message named the permissions that had been removed, Claude read the log and inferred exactly what to restore, producing broad python3:* directly. I’ve repeated the test with a neutral commit message, and it resulted in a different fix. The commit message didn’t affect the plugin’s review, but it changed what the writing model produced. Small sample, but a useful reminder that in vibe-coding sessions the model reads everything in context, and metadata you don’t think of as instructions can still shape output.

6. What can we do about all this?

The model bias gap is actionable, and the fix is simple: run /security-review in a fresh session, not the one where you wrote the code. The unfortunate truth is that most users won’t know to do this. The natural instinct is to run the skill right there in the session where you just finished writing the code. Model anchoring isn’t obvious unless you know about it. Anthropic could nudge users here: detect when the tool is invoked in a session that also wrote the code, and warn before running.

I asked Claude to prototype this using session hooks. The prototype is available as a gist and it works, but frankly it’s not very good. The decision: block output is blunt; it stops the prompt and requires re-running. That’s an API limitation: UserPromptSubmit hooks have no non-blocking notification option, so block is the only way to surface a visible message. Another caveat: in Claude Desktop, blocked prompts fail silently, leaving the user with no response and no explanation. This hook is only reliable in the Claude Code CLI.
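For illustration, the core decision of such a hook might look like the sketch below. It is not the gist. It assumes the hook payload arrives as JSON on stdin with `prompt` and `transcript_path` fields, and that the transcript is a JSONL file in which file edits appear as `tool_use` entries named `Write`, `Edit`, or `MultiEdit`; that transcript layout in particular is an assumption and may differ between versions.

```python
import json
import sys
from typing import Iterable, Optional

# Assumption: these tool names mark "this session wrote code".
WRITE_TOOLS = {"Write", "Edit", "MultiEdit"}


def contains_write_tool(node) -> bool:
    """Recursively look for a tool_use entry that modifies files."""
    if isinstance(node, dict):
        if node.get("type") == "tool_use" and node.get("name") in WRITE_TOOLS:
            return True
        return any(contains_write_tool(value) for value in node.values())
    if isinstance(node, list):
        return any(contains_write_tool(item) for item in node)
    return False


def decide(prompt: str, transcript_lines: Iterable[str]) -> Optional[dict]:
    """Return a block decision for a same-session /security-review, else None."""
    if not prompt.strip().startswith("/security-review"):
        return None
    for line in transcript_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if contains_write_tool(entry):
            return {
                "decision": "block",
                "reason": "This session also wrote code. "
                          "Re-run /security-review in a fresh session.",
            }
    return None


def main() -> None:
    """Hook entry point: read the payload from stdin, print a decision if any."""
    payload = json.load(sys.stdin)
    try:
        with open(payload.get("transcript_path", ""), encoding="utf-8") as fh:
            lines = fh.readlines()
    except OSError:
        lines = []  # no readable transcript: let the prompt through
    result = decide(payload.get("prompt", ""), lines)
    if result is not None:
        print(json.dumps(result))
```

A real hook script would call `main()` at the bottom; it is left uncalled here so the decision logic can be exercised on its own.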

Final thoughts

Every tool in this space has gaps. Some are documented, and some are hidden, surfacing only when you test carefully enough. The title of this post came from expecting to confirm two gaps and finding three.

Are we all doomed until Mythos comes to save us? Models evolve rapidly, and Mythos is reportedly strong at exactly the cross-boundary chain reasoning that today’s tools miss. It may well close these gaps; time will tell.

My broader take: fully autonomous code reviews don’t replace human judgment. They extend your reach, and they’re most useful when you understand what they can and can’t see. Know the limits of your tools. Trust and verify.


AI Assisted Bug Bounty Experiment

Sun, 24 May 2026 13:01:31 -0700

Five authorization bypass paths, clean PoCs for each, full disclosure report — output of a six-hour session: one human, one Claude Code agent, an idea, a repo clone, and a live target in Docker. I’ll cover the technical details in a follow-up post once the disclosure process is complete.

1. The experiment

With the recent hype surrounding Mythos and its autonomous vulnerability discovery capabilities, the question of what AI can do for security research is hard to ignore. Most of that conversation centers on fully autonomous systems — multi-agent frameworks operating at scale with large budgets. I wanted to try something way simpler: what can one person accomplish with a single AI agent and a modest budget?

I came in with five things: familiarity with the target as a practitioner, enough Python and Django knowledge to steer the agent, years of application security experience, an evening to experiment, and a $20 Claude subscription with Cyber Verification Program approval.

2. The hypothesis

My target is a Django-based web application enforcing a role-based access model — controlling who can see which resources, data, and system settings, at what level of detail. The access control model is the interesting surface.

I asked Claude to map the REST API — URL patterns, associated models and views — and report back on any gaps in URL parameter validation. It flagged something it thought looked interesting: a parameter responsible for fetching related objects alongside the primary response. My intuition when I saw it: likely a WebUI-driven performance optimization added on top of the existing API functionality, making the authorization model a potential target for logical flaws. Features like this get tested for functionality — does it return the right data? — but access control testing of this specific path may not have received the same attention as the core endpoints. Performance shortcuts and abstraction layers are common places where authorization logic gets underspecified, because they’re added later and the direct-endpoint tests don’t exercise them.

Whether it would lead anywhere was an open question.

3. How the research ran

From that exchange — Claude surfacing the parameter, me recognizing the authorization angle — we had a hypothesis. I asked Claude to examine how the feature works internally. It traced the relevant source files and identified the core gap: when the feature resolves a related object, no check is made against whether the requesting user is authorized to see it. The access controls on the direct API endpoints are simply not invoked in this path.

From there: local Docker instance, Claude calling the application’s own REST API with admin credentials to provision test accounts and settings, then switching to an unprivileged account to test the hypothesis. The test pattern was clean — confirm the direct endpoint returns 403, then show the same data returns through the bypass. Seeing both in the same output is unambiguous.
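The bug class and the test pattern can be illustrated with a toy in-memory model. Everything here is hypothetical: the `expand` flag, the object names, and the data stand in for the real feature, whose details are withheld until disclosure completes.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Stands in for an HTTP 403 response."""


@dataclass
class User:
    name: str
    is_admin: bool = False


# Hypothetical data: a report that references an admin-only settings object.
SETTINGS = {1: {"id": 1, "label": "admin-only configuration"}}
REPORTS = {10: {"id": 10, "title": "Quarterly scan", "settings_id": 1}}


def get_settings(user: User, settings_id: int) -> dict:
    """Direct endpoint: enforces the access check."""
    if not user.is_admin:
        raise Forbidden("403")
    return SETTINGS[settings_id]


def get_report(user: User, report_id: int, expand: bool = False) -> dict:
    """Primary endpoint with a hypothetical 'expand related objects' option."""
    report = dict(REPORTS[report_id])
    if expand:
        # Root cause: the related object is resolved without an access check.
        report["settings"] = SETTINGS[report["settings_id"]]
    return report


def bypass_confirmed(user: User) -> bool:
    """Test pattern: direct access is denied, yet the same data comes back."""
    try:
        get_settings(user, 1)
        return False  # direct access worked, so there is nothing to bypass
    except Forbidden:
        pass
    return get_report(user, 10, expand=True).get("settings") == SETTINGS[1]
```

Calling `bypass_confirmed(User("reader"))` returns `True`, which mirrors the 403-then-data pair described above.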

The first two bypass paths were straightforward once the root cause was clear. I then asked Claude to think more broadly about the attack surface, prompting it to consider how Django’s data model exposes direct and reverse object relationships. It identified three additional paths, including one through internal notes that users can mark private — the bypass returns their full content regardless, circumventing the visibility restriction the UI enforces.

Five exploitable paths in total, all from the same root cause — Missing Authorization (CWE-862) across the board, with Exposure of Sensitive Information (CWE-200) sprinkled in. Network-exploitable via the REST API by an authenticated user with read permissions, accessing admin-only data. Drafted to CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:N/A:N (4.3 Medium).
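As a sanity check on the drafted score, the CVSS 3.1 base-score arithmetic for that vector can be reproduced with the weights and rounding rule from the specification. This sketch handles only Scope: Unchanged vectors, which is all this finding needs.

```python
import math

# Metric weights from the CVSS v3.1 specification (PR values for Scope: Unchanged).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score for a Scope: Unchanged CVSS 3.1 vector."""
    metrics = dict(part.split(":") for part in vector.split("/"))
    if metrics["S"] != "U":
        raise ValueError("this sketch only handles Scope: Unchanged")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score("CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:N/A:N")  # 4.3
```

Impact works out to about 1.41 and exploitability to about 2.84; their sum of roughly 4.25 rounds up to 4.3, in the Medium band.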

4. Chasing the escalation

The core bypass came together quickly. What followed was the natural next move for any researcher: try to chain it, escalate severity, turn a Medium into something bigger. That’s where most of the dead ends lived. Some of the vectors we tested:

Could the bypass leak API authentication tokens (full account takeover)? Blocked.
Could it be flipped into a write path? Read-only by design.
Could unusual parameter values expose internal state or crash the app in a useful way? One 500 error, no data.
Could object traversal chain across multiple hops to reach higher-value targets? Limited to one level.
Could the attack surface extend beyond the object relationships already mapped? Explored and found nothing new reachable.
Could reading related data trigger a server-side request? An SSRF via this feature would have been a significant escalation. Nope.

Thorough analysis of every relevant code path found nothing exploitable. At some point Claude even made a comment that the codebase outside the bypass appeared well-hardened — the dead ends weren’t just unlucky angles, they were genuinely blocked.

The dead-end analysis likely consumed more tokens than the core bypass itself. Each candidate path required the agent to load and reason over significant amounts of source context before concluding it was blocked. At some point I made a call to stop — it was getting late, and it was clear the token burn rate on escalation paths was outpacing any realistic chance of a meaningful outcome.

5. Division of labor

Across the full session, the process looked like this:

Target selection was mine. The application is open source with a commercial offering, backed by OWASP, and used in production by security teams, which is how I knew about it. It has hundreds of releases and thousands of GitHub stars, and it has run a HackerOne bug bounty program for many years with a good number of submissions behind it.

Attack surface discovery was Claude’s. Asked to map the REST API — URL patterns, models, and views — it flagged a specific parameter as worth investigating: one responsible for fetching related objects alongside the primary response. That observation became the hypothesis we tested.

The hypothesis was joint. I directed the exploration and recognized why the parameter Claude surfaced was worth pursuing — that features like this tend to be undertested for authorization completeness. That judgment comes from years of looking at application security. An autonomous agent starting from scratch has no basis for it.

Reading and tracing source code across a large, unfamiliar codebase — holding relevant context across many files simultaneously — was Claude’s most consistent contribution. Work that would take a human hours of careful reading took Claude minutes. That’s where LLMs really shine.

Environment setup was handled by Claude directly against the live application. The agent called the target’s own REST APIs with admin credentials to provision the test accounts and configuration needed to validate each vector, then switched to an unprivileged account to confirm the bypass. No manual setup was required, and it already knew the APIs from reading the codebase.

PoC development and live validation were Claude’s. It wrote the test scripts, ran them against the live Docker instance, interpreted the results, diagnosed problems and iterated to completion.

Steering was mine throughout. When the initial finding was confirmed, I directed Claude to expand the search along a specific axis: how Django models expose object relationships, and where those traversal paths might reach objects the requesting user shouldn’t see. That framing produced three additional bypass paths.

Boundary analysis was Claude’s. When I asked whether the initial finding could be chained or escalated, Claude systematically traced each candidate path and explained whether it was viable or blocked and why.

Impact assessment was mine. One of the five bypass paths exposes internal security notes that teams mark private. Characterizing what that means in a real-world application requires understanding how those notes are used in practice, not just what the access model technically permits.

Structured reporting was Claude’s — CVSS scores, impact analysis, affected-code tables, and remediation recommendations for all five findings in a single consolidated report, with live response examples for each confirmed vector.

6. Numbers and disclosure

What led to this session was a failed attempt with one of the open source autonomous frameworks. I pointed it at the same target, let it run, and watched it burn through 2.5 million tokens — mostly attempting to set up its own environment, failing to start the target application it didn’t need, and exhausting its budget before reaching the actual testing phase. That experience prompted the question: what could a different approach produce?

Our session — hypothesis to five confirmed findings, with live validation, dead-end analysis, and a full structured report — consumed 1.4 million tokens at a total API cost of $25. That gap is partly explained by the hypothesis: starting from a well-formed idea of where to look is a significant multiplier on what a given budget can produce. Autonomous tools trading specificity for breadth need proportionally more tokens to work through open-ended reconnaissance before they converge on anything. It’s also worth noting that Anthropic’s model is arguably more capable than what the open source framework was running — and more expensive per token — so the human-agent configuration does double duty: it keeps the effort targeted, and that targeting matters more when the model you’re running costs more to use.

After the core research wrapped, I asked Claude to analyze the git commit history and map when the vulnerable feature was introduced. It traced the initial commit to January 2021 — identified the exact PR that introduced it and confirmed the authorization gap was present from the start — then mapped the feature’s expansion across subsequent releases. The finding had been in production for over five years across multiple major version milestones, including a significant expansion of the attack surface in 2023 that added it to roughly twenty additional API endpoints.

I submitted the findings to the vendor’s HackerOne bug bounty program — public, improvement-oriented, no monetary bounties. The report covered all bypass paths with confirmed live responses, CVSS scoring, and root cause analysis. A follow-up post with the full technical details will go up once the disclosure process is complete.

Final thoughts

Looking at the flow diagram, the natural question is: can the human node be replaced with another autonomous agent? Yes — projects like Mythos and Shannon demonstrate that fully autonomous security research is viable, and companies operate at scale doing it. The cost is higher: an autonomous agent has to discover through exploration what a human researcher brings as context, and that shows up in the token count.

All in all, a Medium severity authorization bypass with five confirmed network vectors, in a reasonably hardened codebase in a single evening — that’s a good result in my book, and it took way less effort than comparable research I’ve done solo. Happy bug hunting!

AI-Native Threat Modeling

Wed, 20 May 2026 11:29:02 -0700

When I ask hiring managers why they’re opening a product security role, the answer is usually the same: we can’t keep up. Development org grew, product surface expanded, and the security team is the bottleneck. It’s not a problem unique to any one organization — it’s the default state of product security. AI-accelerated development and vibe coding are making it worse: more code, shipped faster, with the same security team trying to keep up. The conventional wisdom is that vibe coding is a killer for AppSec — and on the current trajectory, it is.

In this post, I argue that linear scaling won’t solve that problem, and make the case that AI-generated code, treated the right way, can be a force multiplier for security.

1. The AppSec Scaling Problem

The 1:100 ratio — one AppSec engineer for every hundred developers — is the number the industry has quietly accepted as roughly accurate for mature organizations. It sounds manageable until you sit with what it means in practice: a team of five reviewing the output of five hundred, under sprint pressure, across a surface that keeps growing. It’s a demanding job — I wrote about what it actually takes.

The standard response is to hire more security engineers. That’s reasonable when the ratio is temporarily out of balance, but it doesn’t address the structural problem. If the development org doubles and the security team grows from five to ten, you’re at the same ratio. And the ratio assumes a roughly stable development velocity. AI coding assistants are shattering that assumption.

Developers using GitHub Copilot, Cursor, or Claude Code ship more, faster. Vibe coding — letting the model write code from a high-level natural language prompt — compresses timelines further. Features that took two weeks take days. The code surface is expanding at a rate that’s no longer proportional to engineering headcount, which means the AppSec scaling problem is now a two-sided function: development velocity increasing, security team capacity roughly flat. The gap is structural, and it is getting wider.

2. Where Traditional Approaches Break Down

The vocabulary for addressing the AppSec scaling problem is well developed: shift-left, secure-by-design, developer enablement, creating paved roads. They’re not wrong ideas. The problem is that they all require the same scarce resource: AppSec time.

Threat modeling — the recommended practice for high-risk features — is the clearest example. The canonical process: the development team writes a design document; the security team (or a joint session) works through the STRIDE framework or similar, maps data flows and trust boundaries, produces a model; there’s back-and-forth and eventual sign-off. This is genuinely valuable when it happens. In practice, it often doesn’t — the process is time-consuming, and AppSec time is scarce.

What actually happens is one of three failure modes:

Delay — security reviews become release blockers, friction accumulates, relationships with engineering teams deteriorate.
Risk-accept — features ship with “accepted risk” security exceptions that go into a backlog and are rarely revisited.
No review at all — code ships without security involvement, entire product areas built and deployed without the security team ever being in the loop.

With AI now compressing time-to-exploitation — public vulnerabilities can have working proof-of-concept code within hours — the third option is no longer a viable gamble.

Security code reviews have the same structural problem one step later: someone writes code, another team reads it, back-and-forth, sign-off. Every handoff is a scheduling dependency that adds release latency.

3. The Threat Model Maintenance Problem

There’s a second-order problem with threat modeling that gets less attention than the initial production cost: drift.

A threat model is created as a snapshot, but the system keeps evolving. New endpoints are added, authentication flows are refactored. Six months after a threat model is signed off, it describes a system that no longer looks the same. The question of who owns maintenance is usually a gray area: the development team didn’t write the model and isn’t trained to maintain it; the security team is not aware of changes and has to context-switch back into a system they last looked at months ago. Neither path works well in practice.

Most organizations treat the threat model as a gate the security team required at feature launch — it was produced, the box was checked, and maintenance was never part of the contract. It documents what the system looked like at one point in time and then quietly expires.

4. The Key Insight

Here’s where the mental model needs to shift.

In the current workflow, threat modeling is derivative work: a security person reads what a developer built and reconstructs the security-relevant picture from it — after the fact, inherently lossy, potentially inaccurate, and always one step behind.

Open-source projects such as Tachi and several commercial offerings recognize this and offer tools that automate the reconstruction: read the codebase, analyze diffs, apply a methodology, output a structured model. These tools are useful, but they’re still doing the same derivative work, just faster — reverse-engineering security structure from existing code rather than having a human do it. There’s also a cost dimension: analyzing an existing codebase means feeding it back through an LLM as new input, which is expensive at scale. The larger and more frequently updated the codebase, the higher the token cost of each analysis pass.

Now consider what changes when AI is writing the code — through vibe coding, spec-driven development, AI-generated scaffolding from a design document, or an agentic coding loop that implements a full feature end-to-end.

It doesn’t reverse-engineer anything — it knows, because it built it: every data flow it designed, every entry point it created, every asset it touched, every trust boundary it crossed or established, every authentication decision it made. The complete map required for a threat model exists as a natural byproduct of the design work the AI just did — and it exists at the moment of creation, not after. And because that context is already in the model’s working window, generating the threat model alongside the code is parallel effort on the same inputs, with little additional token cost.

The consequence of this observation is straightforward: threat models should be generated alongside code, as first-class artifacts, not assembled later as derivative documents.

gantt
    title 1. Current — human-driven, sequential
    dateFormat YYYY-MM-DD
    axisFormat %d
    section Developer
    Design doc            :a1, 2024-01-01, 3d
    Write code            :a2, 2024-01-04, 3d
    section Reconstruct (Security)
    Reconstruct & model   :a3, 2024-01-07, 3d
    section Review (Security)
    Review & sign-off     :a4, 2024-01-10, 2d

gantt
    title 2. AI-assisted — LLM writes code, LLM reads code
    dateFormat YYYY-MM-DD
    axisFormat %d
    section Developer
    Generate code         :b1, 2024-01-01, 3d
    section Reconstruct (AI-assisted)
    LLM reconstructs TM   :b2, 2024-01-04, 2d
    section Review (Security)
    Review & sign-off     :b3, 2024-01-06, 2d
    section Time saved
    time saved            :done, 2024-01-08, 4d

gantt
    title 3. AI-native — code and threat model in parallel
    dateFormat YYYY-MM-DD
    axisFormat %d
    section Code
    Generate code         :c1, 2024-01-01, 3d
    section Threat Model
    Generate threat model :c2, 2024-01-01, 3d
    section Review (Security)
    Review & sign-off     :c3, 2024-01-04, 2d
    section Time saved
    time saved            :done, 2024-01-06, 6d

Accuracy improves — the model is a direct output from the entity that designed the system, not a reconstruction. Maintenance improves because every code change can regenerate or update it in the same operation; the entity making the change already knows what changed and why. The multi-step, multi-team back-and-forth collapses into a single step. Security practitioners remain in the loop — for methodology, formal sign-off, challenging assumptions the AI didn’t surface — but the labor-intensive baseline work of constructing the model moves from a human bottleneck to an automatic output.

This is what “shift-left” should actually mean: not having the security team review earlier, but producing the security model at the same moment the system is designed. The security artifact is contemporaneous with the code, not chasing it.

5. On Model Bias in Security Analysis

A legitimate concern about this approach is AI model bias. There’s a well-documented pattern in AI-assisted security review: when a model writes code and is then asked to evaluate it for security in the same context window, it tends to anchor to its own design decisions, finding reasons why its choices are sound rather than challenging them. An independent reviewer operating from a fresh context — a second model, or a human who didn’t write the code — is more likely to surface issues the original author missed. This is a real limitation, and it applies directly to using AI for code security review.

The core distinction here is that code security review and threat modeling are quite different. A security review asks the model to evaluate whether its own implementation is correct and secure — the question where anchoring bites hardest, because the model is judging choices it already committed to. A threat model asks something structurally different: document the architecture, establish trust boundaries, map data flows and assets, then apply a framework like STRIDE that poses a fixed set of questions across threat categories. The framework is external to the code; its questions don’t change based on how well or poorly the implementation is written. The question it asks — given what this system does, what can go wrong in each of these categories? — is answered from the architectural map, not from a judgment about implementation quality.

What bias could still affect is the model’s assessment of severity — an AI that made a particular design trade-off might rate the resulting risk lower than an independent reviewer would. That’s a real concern, and it’s exactly why human review of the model’s outputs and assumptions is still valuable in this workflow.
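
To make the "fixed set of questions" point concrete, here is a minimal sketch of STRIDE applied to an architectural map. The `DataFlow` type, the question wording, and the example flows are all hypothetical, and limiting the pass to boundary-crossing flows is a simplification. The point it illustrates is that the questions come from the framework and the map, not from any judgment about implementation quality.

```python
from dataclasses import dataclass

# The six STRIDE categories, each paired with a fixed question template.
# The wording of the templates is illustrative, not canonical.
STRIDE = {
    "Spoofing": "Can an attacker impersonate a party on {flow}?",
    "Tampering": "Can data on {flow} be modified without detection?",
    "Repudiation": "Can an action on {flow} be denied without evidence?",
    "Information disclosure": "Can data on {flow} leak to an unauthorized party?",
    "Denial of service": "Can {flow} be exhausted or blocked?",
    "Elevation of privilege": "Can {flow} grant more access than intended?",
}


@dataclass(frozen=True)
class DataFlow:
    source: str
    target: str
    crosses_trust_boundary: bool


def stride_questions(flows):
    """Return (flow label, category, question) for each boundary-crossing flow."""
    out = []
    for flow in flows:
        # Simplification: only flows that cross a trust boundary are examined.
        if not flow.crosses_trust_boundary:
            continue
        label = f"{flow.source} -> {flow.target}"
        for category, template in STRIDE.items():
            out.append((label, category, template.format(flow=label)))
    return out


# Hypothetical architecture map, for illustration only.
flows = [
    DataFlow("browser", "api-gateway", True),
    DataFlow("api-gateway", "orders-service", False),
    DataFlow("orders-service", "payments-provider", True),
]
questions = stride_questions(flows)
for label, category, question in questions:
    print(f"[{category}] {question}")
```

The same twelve questions come out whether the implementation behind those flows is excellent or sloppy; what remains open to bias is how each answer gets rated.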

6. Why Threat Modeling Still Matters

A reasonable objection at this point: if AI writes the code, why not just ask it to write secure code and skip the threat model entirely? We should absolutely ask for that — but threat modeling serves purposes that “write secure code” doesn’t address.

Security architecture documentation. Threat models capture architectural decisions and their security implications: trust boundaries, data classifications, what the system assumes about its environment, where the blast radius of a failure ends. These don’t live in code. A system can be implemented correctly while making architectural trade-offs that accept certain risks; those trade-offs need to be explicit, owned, and findable.

Known gaps and accepted risks. Every system ships with tradeoffs — incomplete defenses, deferred work, risks that were evaluated and accepted. A threat model makes these explicit: here is what we considered, here is what we’re not defending against, and here is why. This matters for accountability, for prioritization, and for the engineer who joins the team six months from now.

Compensating controls. Good security architecture is layered. WAF rules, rate limiting, network segmentation, monitoring and alerting — these don’t live in application code, but they’re part of the security posture. The threat model is where they’re connected to the threats they compensate for. This is also where code-analysis-based automated tools tend to generate false positives: they see the change in isolation, unaware of the external controls that already mitigate a given risk.

Compliance requirements. SOC 2, PCI-DSS, ISO 27001, HIPAA, and similar frameworks require documented evidence of threat analysis. Auditors want artifacts. A threat model that exists and is demonstrably current — generated from the same codebase it describes — is a far stronger compliance artifact than one that was carefully written at launch and hasn’t been touched since.

Incident response preparation. When something goes wrong — and eventually something does — a current threat model tells you what’s at risk, what attacker paths exist, and what to prioritize. You want this analysis done before the incident, not during it.

Stakeholder communication. Engineering leadership, legal, product, and board-level security committees need to understand risk in terms they can act on. The codebase doesn’t serve this purpose; a structured threat model does.

The case for threat modeling doesn’t weaken when AI writes the code — if anything, AI makes the security artifacts cheaper to produce, easier to keep current, and more consistently complete than the human-driven alternative.

Final thoughts

I think this is the direction the AI coding toolchain is already moving toward, even if the full vision hasn’t arrived yet. AI coding tools are increasingly integrating security into the development workflow: GitHub Copilot’s real-time vulnerability detection during code generation, Claude Code’s security analysis during code review, Replit’s Security Agent in the development environment. None of these offer AI-native threat model generation, but they signal that the industry is treating security as something the coding tool prioritizes and produces alongside code. The extension of that to living, maintained threat models is the logical next step.

The reframe for ProdSec practitioners is this: stop thinking of threat modeling as a process your team performs on code that developers write. Start thinking of it as an artifact the AI coding assistant produces alongside the code, which your team validates, challenges, and signs off on. The security team’s job shifts from construction to judgment — which is where human expertise actually compounds.

The dreaded 1:100 ratio won’t disappear. But the work of constructing and maintaining the threat model doesn’t have to stay a human-hours problem. The needle can move — but only if the security team’s role evolves with it.

TrustFall: The Perimeter Problem in Agentic Tools

Mon, 18 May 2026 10:00:00 -0700

On May 7, 2026, Adversa AI published TrustFall — a one-click remote code execution in Claude Code, with variants across Gemini CLI, Cursor, and GitHub Copilot: clone a repository, open it, click “Yes, I trust this folder”, and an attacker-controlled process runs with your full OS privileges.

Anthropic declined the finding, describing it as outside their threat model and the behavior as functioning as designed. This post digs into that response and argues that the core issue is architectural: a perimeter security model that can’t carry the weight placed on it, and that makes the vulnerability structurally hard to surface through threat modeling. It then sketches what a zero trust alternative could look like for agentic tools.

1. The vulnerability

Project-level config files — .mcp.json and .claude/settings.json — committed to a repository can activate attacker-controlled MCP servers. When a developer clones the repo and clicks through the trust dialog, an unsandboxed Node.js process spawns with full user OS privileges — no further prompts. Three developer actions: clone, open, click. The attack also has a zero-click CI/CD variant where Claude Code runs headless in GitHub Actions against an untrusted pull request branch, bypassing the trust dialog entirely.

For the full technical details see TrustFall.

In threat modeling terms, the mechanism is a trust boundary problem. Project-scope config — committed to a repository — can activate MCP servers: external processes that run with full user OS privileges. The gate between untrusted repo content and those privileges is a single user prompt:

flowchart TB
    subgraph Untrusted["Untrusted · attacker-controlled"]
        PCfg["Project config\n(committed to the repo)"]
    end

    CLI["Claude Code CLI"]

    subgraph Gate["Trust gate"]
        Prompt["'Do you trust this folder?'\nsingle Yes/No"]
    end

    subgraph Privileged["Privileged · full user OS access"]
        MCP["MCP Server process\nNode.js · no sandbox"]
        OS["~/ · ~/.ssh · ~/.aws\nread / write / exec — unrestricted"]
    end

    PCfg --> CLI
    CLI --> Prompt
    Prompt -- "on 'Yes'" --> MCP

That single Yes/No covers four distinct capability grants:

Claude reading and editing project files (clearly implied)
Claude following project-level behavioral settings (reasonable)
Activating MCP servers defined in project config (not stated)
Those servers running as unsandboxed processes with full user privileges (definitely not stated)

Two of these carry significant security consequences — and neither appears in the prompt language. The gap between what the user consents to and what the system delivers is a textbook Elevation of Privilege (the E in STRIDE methodology): the subject grants more than they know.

The pattern holds across the tools TrustFall examined — Gemini CLI and Cursor do mention MCP servers in their consent language, Claude Code and Copilot don’t, but all four default to Yes or Trust.

2. “Outside our threat model”

Anthropic’s phrase is worth examining literally. Per TrustFall’s analysis, the missing enforcement isn’t outside Anthropic’s defined boundary — it’s inside it. The trust dialog is the perimeter; what happens after it is, by their own framing, the trusted zone. “Outside our threat model” means, in practice: inside our perimeter, but below the granularity we protect at.

That granularity isn’t uniformly coarse — the capability is demonstrably known. bypassPermissions gets a dedicated warning because it is dangerous; enableAllProjectMcpServers, enabledMcpjsonServers, and permissions.allow activate equally dangerous behavior without equivalent disclosure. TrustFall also notes that earlier versions of Claude Code included an explicit MCP consent prompt that was later removed. These are the tell-tale signs of a threat model that’s coarser than the reality it represents — some dangerous capabilities are visible enough to gate explicitly, others slip through the same boundary unexamined.

That pattern of selective gating is further undermined by the CVE record. Anthropic’s response to TrustFall was that the behavior functions as designed: clicking “trust this folder” means accepting the project’s configuration, MCP servers included. Yet over a six-month span before TrustFall, Anthropic patched three related vulnerabilities: it delayed MCP activation until after the trust dialog (CVE-2025-59536, Oct 2025), blocked ANTHROPIC_BASE_URL from project scope (CVE-2026-21852, Jan 2026), and blocked bypassPermissions from project scope (CVE-2026-33068, Mar 2026), the same setting that already carried a UI warning. Each fix adds a specific gate or blocklist entry, which is the signature of a perimeter being hardened incrementally, one dangerous capability at a time, without a unifying policy. If the trust dialog truly constituted full consent by design, there would be nothing to patch.
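
For readers who want to look before they click, here is a rough sketch of a pre-trust audit of a cloned repository. The file names and setting names come from this post. The JSON layout around them (a top-level `mcpServers` object in `.mcp.json`, a nested `permissions.allow` in `.claude/settings.json`) is my assumption, so check it against the current Claude Code settings documentation. `audit_repo` is a hypothetical helper, not part of any tool.

```python
import json
from pathlib import Path

# Setting names taken from this post; the JSON layout around them is assumed.
RISKY_SETTINGS_KEYS = ("enableAllProjectMcpServers", "enabledMcpjsonServers")


def _load_json(path):
    """Return the parsed JSON object, or None if the file is unreadable or not an object."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    return data if isinstance(data, dict) else None


def audit_repo(repo_root):
    """Return human-readable findings about project-scope agent config."""
    root = Path(repo_root)
    findings = []

    mcp_file = root / ".mcp.json"
    if mcp_file.is_file():
        config = _load_json(mcp_file)
        if config is None:
            findings.append(".mcp.json could not be parsed; inspect it manually")
        else:
            # Assumption: servers are listed under a top-level "mcpServers" object.
            servers = config.get("mcpServers")
            if isinstance(servers, dict):
                for name, spec in servers.items():
                    command = spec.get("command", "?") if isinstance(spec, dict) else "?"
                    findings.append(f".mcp.json defines MCP server '{name}' (command: {command})")

    settings_file = root / ".claude" / "settings.json"
    if settings_file.is_file():
        settings = _load_json(settings_file)
        if settings is None:
            findings.append(".claude/settings.json could not be parsed; inspect it manually")
        else:
            for key in RISKY_SETTINGS_KEYS:
                if settings.get(key):
                    findings.append(f".claude/settings.json sets {key}")
            permissions = settings.get("permissions")
            if isinstance(permissions, dict) and permissions.get("allow"):
                findings.append(".claude/settings.json sets permissions.allow")
    return findings
```

An empty result only means none of these specific patterns were found. It is a reading aid for the trust decision, not a replacement for it.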

3. From perimeter to zero trust

The “functions as designed” response places the burden on the developer: audit what you clone. That position rests on a perimeter security architecture — verify once at the gate, trust everything inside:

flowchart LR
    A1["Repo"] -->|"✓ trust gate"| A2["CLI"] --> A3["MCP"] --> A4["OS"]

The perimeter pattern is one modern security has largely moved past. The alternative is zero trust: identity propagated through the capability chain and evaluated at each grant. Git provides the primitives: clone origin, remote URL, commit author. Conceptually, it would look like this:

flowchart LR
    B1["Repo\norigin · author"] -->|"✓ id check"| B2["CLI"] -->|"✓ capability?"| B3["MCP"] -->|"✓ scope?"| B4["OS"]
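
In code, the difference between the two diagrams is where the check runs. Below is a minimal sketch of per-grant evaluation, assuming a hypothetical policy with an allowlist of trusted remotes and a set of capabilities that require explicit consent. None of these names correspond to an existing Claude Code API; the sketch only uses the remote URL as an identity signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoIdentity:
    remote_url: str
    commit_author: str  # carried along as an identity signal; unused in this sketch


@dataclass(frozen=True)
class Policy:
    # Hypothetical allowlist of remote URL prefixes the user already trusts.
    trusted_remote_prefixes: tuple = ()
    # Hypothetical capability names that need per-capability consent.
    gated_capabilities: frozenset = frozenset({"spawn_mcp_server", "os_exec"})


def evaluate_grant(identity, capability, policy, user_consented=frozenset()):
    """Decide a single capability grant; called at every hop, not once at the gate."""
    if capability not in policy.gated_capabilities:
        return "allow"
    trusted = any(
        identity.remote_url.startswith(prefix)
        for prefix in policy.trusted_remote_prefixes
    )
    if trusted or capability in user_consented:
        return "allow"
    # Unknown origin and no consent for this specific capability: ask about it by name.
    return "ask"


policy = Policy(trusted_remote_prefixes=("https://github.com/my-org/",))
unknown = RepoIdentity("https://github.com/someone/else.git", "stranger@example.com")

print(evaluate_grant(unknown, "read_project_files", policy))
print(evaluate_grant(unknown, "spawn_mcp_server", policy))
```

Under this sketch, opening an unfamiliar repository still lets the agent read project files, while spawning an MCP server produces a separate question that names the capability being granted.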

Read through the zero trust lens, Anthropic’s position has three problems.

Shared responsibility requires the system to carry its half. Under a perimeter model, all verification burden falls on the user at the gate — the system has the identity signals, but leaves them unused. Zero trust distributes the burden: each capability is evaluated at the point it’s granted.

The consent gate can’t convey per-capability trust. A perimeter gate concentrates all trust decisions into one moment. Anthropic’s gate covers MCP server activation, process spawning, and full OS access — none of it signaled. The coarser the gate, the harder it is to make consent meaningful.

The perimeter model puts expert-level burden on non-expert users. A perimeter gate requires the user to reason about all downstream consequences of a single click — a reasonable ask for a security engineer, not for a vibe-coder. Zero trust shifts that burden to the architecture: each capability grant is evaluated by the system, not the user.

The three problems compound: the system ignores available identity signals, the gate doesn’t compensate by informing the user what it actually grants, and the users left holding that gap aren’t equipped to close it.

Final thoughts

This post looked at two connected things: what the trust gate actually grants, and why the security architecture makes that easy to miss. At its core, TrustFall is a consent gap: a single prompt covering MCP server activation and unsandboxed OS access without stating either. The perimeter model is the structural reason that gap is hard to surface at design time: inside the perimeter is trusted by definition, leaving no natural STRIDE targets. When threat modeling a perimeter-architected system, finding the EoP requires the modeler to look past the gate and ask what it actually grants. That’s a skill, not something the methodology surfaces automatically. The CVE record shows this playing out: each patch adds a specific gate or blocklist entry without restructuring the boundary, and “functioning as designed” remains the public response to TrustFall.

A zero trust security architecture changes the shape of the problem. Explicit trust boundaries at each capability grant are natural STRIDE targets — the EoP question surfaces at design time regardless of modeler experience, not because of better analysts, but because the architecture itself gives threat modeling more boundaries to work with.

TrustFall affected Claude Code, Cursor, Gemini CLI, and Copilot — evidently all due to the same perimeter model. The broader security industry made this architectural transition before: perimeter security dominated until systems grew complex enough that a single gate couldn’t hold, and zero trust emerged as the answer. Agentic tools are on a similar trajectory — gaining capability and OS access fast, with the same pressure building at the trust boundary.

Is zero trust the natural next step in the architectural evolution of agentic tools?

References

Adversa AI — TrustFall: Coding Agent Security Flaw Enabling RCE in Claude, Cursor, Gemini CLI, Copilot
Anthropic — Claude Code Settings

Thoughts on Product Security Career

Tue, 12 May 2026 09:51:22 -0700

I recently wrote about my product security principles — the operating frame I’ve built for doing the job well. This is the post that probably should have come first: what product security actually is as a career, whether it might be the right path for you, and what ten years of doing it has taught me.

Ten years in product security teaches you one thing above all: it is a hybrid discipline, and that is both its challenge and its appeal.

The role asks for coding skills — enough to read unfamiliar codebases, spot vulnerability patterns, and write the automation that makes security scale — but not at the level of a senior software engineer. It asks for offensive security knowledge — how attackers think, how systems break — but you’re not a red teamer or a dedicated pentester. You need architectural judgment and systems-level thinking to design security solutions that fit inside complex systems, but you’re not designing the products themselves. Program management skills come into play when you’re owning a roadmap and driving cross-functional initiatives, but your customers are internal. Risk and compliance fluency matters — understanding risk is what drives prioritization decisions — without being a GRC officer. Enough ITSec grounding to be credible in an IR conversation, without being a SOC analyst.

Rarely all of these at full depth — but all of them at working depth. The breadth is the job.

The technology surface is equally wide. Multi-cloud environments, Kubernetes and container security, CI/CD pipeline hardening, secrets management, HSM-backed key hierarchies, OS-level hardening, infrastructure-as-code, supply chain integrity, identity and access management, compliance frameworks — the list is long and grows with the industry. You don’t need to be the expert in all of it, but you need to be fluent enough in each area to ask the right questions, spot the gaps, and know when to go deeper.

Context switching is another constant demand. Security teams are undersized by design, so people come to you constantly: a quick auth question from a developer, a compliance clarification from legal, an architecture review that landed in your queue, an incident that just got escalated. Each requires a different mode — deep focus for a thorough threat model, quick confident judgment for the everyday interruptions. The instinct might be to guard your time against the noise. Resist it. Those questions are how you move the needle. Embrace them.

The human side carries equal weight with technical depth — and this often goes unsaid. Influencing teams that don’t report to you, competing for roadmap space without turning adversarial, partnering with engineering instead of policing it, enabling people rather than gatekeeping them. Add to that customer-facing and executive communication — translating technical risk into language that lands with a non-technical audience is a distinct skill, and a critical one. Product security lives inside organizations with competing priorities, and how far you move the needle depends as much on how well you work with people as on what you know.

High stakes raise the difficulty. There’s the obvious pressure of a security incident — high-visibility, fast-moving, unforgiving. But there’s also the quieter, constant pressure of not missing something: a vulnerability in a design review, a misconfiguration in a new service, a risk that slips through and becomes next quarter’s incident. The job requires staying sharp under both.

Invisible success is the other side of that coin — and something I touched on in my earlier post. When nothing goes wrong, there’s nothing visible to point to. Security’s value is counterfactual by design, and that takes some getting used to.

If you’re hiring for product security

Understanding what the role actually requires has a direct implication for how you hire — and most interview processes get this wrong.

The most common mistake I see is screening candidates with a LeetCode-style assessment. I’ve worked with some of the brightest security engineers in the industry, and holding them to an algorithmic coding bar doesn’t filter for talent; it filters out exactly the people you want to hire. That’s not what you’re hiring them for.

The same applies to system design. I’d bet many of the best security engineers I’ve worked with couldn’t design a scalable distributed system end-to-end — but they can dissect an existing one and find its security design flaws faster than anyone in the room. The blank-whiteboard system design exercise misses the point entirely.

What works instead: for the coding round, give them a real code sample and ask them to review it for vulnerabilities. Give them a CVE and walk through the risk assessment — what’s the realistic impact, what systems are exposed, how would you prioritize remediation? Keep a human in the loop; you want to see how they think, what they catch, what questions they ask. For system design, hand them an actual design document or a system diagram and ask them to threat model it: identify the assets worth protecting, map the trust boundaries, enumerate the threats at each boundary, reason through attack vectors — then recommend layered defenses to mitigate the risks they’ve identified. A candidate who can do that credibly is showing you the core of the job.

Software engineers and security engineers look at the same systems from different angles. Tailoring the interview process to the role isn’t lowering the bar — it’s raising the accuracy of the bar.

Product security demands technical breadth, strong soft skills, and the ability to navigate complex organizational dynamics — all at once. Hiring for it requires a process calibrated to that reality. Is it the right career for you? Ten years in, it still is for me.

My Product Security Principles

Sun, 10 May 2026 00:00:00 -0700

In my recent job search I read dozens of Product Security job descriptions. They all contain the same buzzword soup: shift-left, secure-by-default, defense in depth, paved roads. In practice, they mean different things at different companies — but what do they actually mean for the Product Security team?

What follows is my personal operating frame. One security engineer for every hundred in engineering is roughly where the industry sits — these are my principles for operating and succeeding in that reality. And I believe they hold in a world of vibe-coded apps and AI-accelerated production code.

1. Risk is the unit of work, not findings

Everything flows from business risk: vulnerabilities, architecture gaps, compliance requirements are all risks to be scored, prioritized, and decided on. They belong in a risk register that the business actually owns, and part of the security team’s job is ensuring it does: it should be a living record that business owners understand, contribute to, and sign off on, not a document security maintains in isolation.

Proactive and continuous risk reduction is how I formulate the security team’s mission.

2. Frame risk in business terms

A CVSS score means nothing to an executive. A risk item must answer: what’s the realistic scenario, what does it cost if it happens, what does it cost to fix, what’s your recommendation? When security decisions carry significant business risk, frame them in business language.

Seek explicit executive sign-off for security exceptions and risks above a materiality threshold — it moves accountability to where the decision lives.

Risk = Severity × Likelihood: the key formula that turns a vulnerability into a business decision.

UPD 5/30: A reader pointed out that my original formula (Risk = Severity × Potential Impact) was incorrect; it conflated severity and potential impact (which represent the same measurement), and missed the likelihood completely. I’ve updated the post with the correct formula above.
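
As a worked example of the formula, here is a small sketch that scores and ranks a few risk items. The 1 to 5 scales and the example items are hypothetical; any consistent scale works as long as severity and likelihood are scored separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskItem:
    name: str
    severity: int    # 1 (negligible) to 5 (critical), hypothetical scale
    likelihood: int  # 1 (rare) to 5 (almost certain), hypothetical scale

    @property
    def risk(self):
        # Risk = Severity x Likelihood
        return self.severity * self.likelihood


items = [
    RiskItem("SQL injection in internal admin tool", severity=5, likelihood=2),
    RiskItem("Verbose error pages on public API", severity=2, likelihood=4),
    RiskItem("Unauthenticated debug endpoint on public API", severity=5, likelihood=4),
]

# Highest risk first.
ranked = sorted(items, key=lambda item: item.risk, reverse=True)
for item in ranked:
    print(f"{item.risk:>2}  {item.name}")
```

Two items share the top severity here, but likelihood separates them: the internet-facing debug endpoint scores 20 and the internal-only SQL injection scores 10. Capturing that difference is what the corrected formula is for.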

3. Security Architecture

A whiteboard conversation at design time costs an hour; a redesign after implementation costs a sprint. Security belongs at the beginning of the design process, not at the end as a gate. The goal is to be the person engineers call when they’re designing, not when they’re shipping.

Don’t overthink threat modeling; formal frameworks have their place, but if the overhead of the methodology is slowing teams down, drop it. A napkin sketch of trust boundaries and a list of “what could go wrong” is a threat model, and an imperfect one done at design time beats a rigorous one that never happens.

Good security architecture is transparent. Document it publicly; it builds trust and is the right counterweight to security through obscurity. If the design is sound, exposure doesn’t weaken it. If your code ever leaks, there should be no secrets in it worth finding.

4. Assume controls fail — design and test for it

The operating posture is proactive, not reactive. Find the gaps before an attacker or a customer does. No single control holds forever, so for each one ask: when this fails, what’s the worst reachable outcome? Isolation, least privilege, and short-lived credentials aren’t redundancy; they’re blast radius reduction. Treat defense in depth as a system property.

Designing for failure isn’t enough — validate that your controls actually perform as designed. Security audits, red and purple team exercises, and bug bounty programs all serve the same function: actively probing your own assumptions.

5. Friction is the enemy

Culture is a big one. Most engineers want to build secure software — they’re just operating under deadlines, competing priorities, and finite cognitive bandwidth. When security loses, it’s usually not because engineers don’t care; it’s because the secure path was harder than it needed to be, or they simply didn’t know what it was. Security expertise isn’t a given — engineers are experts in their domain, not ours.

Every process, template, and gate should make the secure choice the default, not the tax. Security has to live inside the workflows engineers already use. A separate system they have to visit is a system that will fail adoption. Friction reduction is the mechanism; the goal is cultural: security becoming a natural part of how the team ships.

6. Influence over formal authority

Security teams often have no direct power over engineering decisions. Authority comes from technical credibility, consistent judgment, and being right often enough that people seek your input. A security control engineers chose is worth more than one you mandated.

Influence runs in both directions. Top-down: executive sponsorship sets the tone and makes security non-negotiable at the policy level. Bottom-up: invest in building relationships with engineering teams — understand their roadmaps, empathize with their pressures — that’s where actual adoption happens.

7. Partners, not adversaries

Competing with engineering for resources — security work versus features on the roadmap — comes with the territory. That tension is structural and it never fully goes away. Recognize it as part of the job; the risk is letting it harden into an us-versus-them mentality that undermines collaboration. Security and engineering look at the same problems from different angles, but there is one goal: ship secure software. Learn each other’s stack, understand the roadmap, show up as a collaborator rather than a reviewer. The security team engineers want to call is more effective than the one they’re required to consult.

8. Know when to stand down — and when to push back

The willingness to say “network-layer isolation is sufficient here” or “this threat is acceptable risk” is what earns credibility for the fights that matter. Security maximalism destroys trust; knowing when to stand down builds it.

When you do push back, come with data — exploit likelihood, realistic impact, cost to fix. And maximize the context you hand to developers: a finding with a clear severity rationale, a realistic attack scenario, and a suggested remediation gets acted on. A bare vulnerability ID with no explanation gets triaged into a backlog and forgotten. The goal isn’t to be right — it’s to be useful.

9. Disagree and commit — deliberately

Sometimes a feature ships with known security gaps. That’s a business decision, and it’s often the right one. The security team’s job in that moment isn’t to block or to silently acquiesce — it’s to make the decision deliberate: agree on the minimal security bar, add basic compensating controls, document the residual risk, and put the remediation work on the roadmap. Ship it, then follow through. The danger isn’t shipping with known gaps — it’s shipping with undocumented gaps and no agreed plan to close them.

10. Scale through systems, not headcount

A small security team can’t review everything a hundred engineers build. Security scales through parallel tracks:

Enablement: templates, reference architectures, security champions, and training that make good security judgment transferable — so engineers make secure decisions without needing a security review at every turn.
Automation: SAST, dependency scanning, secrets detection, security gates in CI/CD that run on every PR, and most recently, LLMs.
Holistic remediation: when a vulnerability pattern surfaces in a functional area, drive an initiative to close the class — a shared library, a framework guardrail, a linting rule. Closing the class beats closing the tickets.

11. Security success is invisible — until it is a failure

Security’s value is counterfactual by design. You’re selling the absence of bad outcomes, which is invisible until it isn’t. The “we didn’t get hacked — why do we even need a security team?” question is a predictable trap. When things are quiet, there’s nothing visible to point to; when something goes wrong, the case for security makes itself — but at too high a cost.

The measurement gap is real — work around it. SLA compliance, MTTR, vulnerability age, findings caught pre-production are useful signals. Tell the risk reduction story proactively: here’s what we found before it became a breach, here’s how the attack surface changed over the past year, here’s what we closed before a researcher or an attacker got there first.
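
Two of these signals are simple arithmetic over a findings list. This is a minimal sketch with made-up dates, reading MTTR as mean time to remediate (my assumption about the expansion intended here).

```python
from datetime import date
from statistics import mean

# Hypothetical findings: (opened, closed), with closed=None if still open.
findings = [
    (date(2026, 1, 5), date(2026, 1, 12)),
    (date(2026, 2, 1), date(2026, 2, 4)),
    (date(2026, 3, 10), None),
]
today = date(2026, 4, 1)

# MTTR: average days from open to close, over closed findings only.
mttr_days = mean((closed - opened).days for opened, closed in findings if closed)

# Vulnerability age: days each open finding has been outstanding.
open_ages = [(today - opened).days for opened, closed in findings if closed is None]

print(f"MTTR: {mttr_days} days; oldest open finding: {max(open_ages)} days")
```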

Security that only shows up in the numbers after an incident has already lost the framing war — and ironically, that’s often when companies make their first Product Security hire.

Done well, product security is invisible: engineers ship without friction, teams collaborate without tension, and executives make informed decisions without needing a crisis to focus them. Getting there is a journey, but not an impossible one.

Honeypots for AI Agents

Wed, 06 May 2026 08:00:07 -0700

AI agents run with broad access to the local environment — credentials, config files, shell history, dotfiles — and nothing stops a misbehaving one from reading things it was never meant to touch. The question is: how do you catch it?

My professional career in network security started in 2005 when I joined Juniper Networks’ IDP team — which is probably why, when thinking about how to detect a misbehaving agent, my mind went straight to a technique from that era.

The IDS world solved an analogous problem decades ago. A network intruder probes hosts and services they have no business touching; a honeypot sits in those places looking like an attractive target — an exposed SSH server, an unprotected admin console — but it’s a decoy, and it alerts when anything interacts with it. Honeypots are traps, tripwires — there’s no legitimate use for a fake host, so the false-positive rate is near zero.

The same idea can be applied to AI agents: plant bogus credentials where a rogue agent would look, and treat any access as a signal. Rather than trying to define what legitimate agent behavior looks like — which is hard, because legitimate agents do a lot of things — it defines what illegitimate behavior looks like: touching credentials that were never surfaced as part of the task. That’s a tighter definition, and it maps well to the threat model.
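A minimal sketch of the tripwire idea, assuming a hypothetical setup in which the agent's tool-call output is available as plain-text log lines. The function names, the decoy layout, and the log format are illustrative assumptions, not how Snare or Canarytokens work internally.

```python
import secrets
import string
from pathlib import Path


def make_decoy_aws_credentials() -> dict:
    """Generate a random key pair shaped like AWS credentials (not issued by AWS)."""
    alphabet = string.ascii_uppercase + string.digits
    key_id = "AKIA" + "".join(secrets.choice(alphabet) for _ in range(16))
    secret = secrets.token_urlsafe(30)  # 30 random bytes -> 40 URL-safe characters
    return {"aws_access_key_id": key_id, "aws_secret_access_key": secret}


def plant_decoy(directory: Path, creds: dict) -> Path:
    """Write the decoy in the INI layout that AWS tooling reads."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "credentials"
    body = "[default]\n" + "".join(f"{k} = {v}\n" for k, v in creds.items())
    path.write_text(body)
    return path


def canary_hits(log_lines: list, canary_values: list) -> list:
    """Return every log line that contains a planted canary value."""
    return [line for line in log_lines if any(v in line for v in canary_values)]
```

Because the decoy values were never part of any task, a non-empty result from `canary_hits` needs no further interpretation: something read the file and carried the value forward.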

1. The threat

An AI agent running on a developer machine or in CI has access to everything its process can read: ~/.aws/, ~/.ssh/, ~/.kube/, ~/.config/, environment variables, dotfiles, every file in the project directory. That access is usually not scoped to what the agent actually needs to do its job — it’s scoped to the developer’s own permissions, because that’s how agents tend to run inside an IDE.

There are three failure modes this matters for, in increasing order of intent.

Overprivileged agents. A legitimate, well-behaved agent still runs with the developer’s full filesystem permissions. It may passively sweep up credential files while reading context, or actively reason that credentials it found are relevant to the task — either way, not out of malice, just because nothing stops it. The agent’s judgment about what it’s authorized to do is not the same as what it’s actually authorized to do.
Prompt injection. A malicious instruction embedded in content the agent processes — a README, a web page, a database row — redirects the agent to look for credentials and exfiltrate them. The agent is not compromised; it’s following instructions that were smuggled into its context. It tops the OWASP Top 10 for LLM Applications as LLM01:2025, and for good reason — it exploits the design of LLMs rather than a flaw that can be patched.
Compromised agents. A model that has been fine-tuned adversarially, or a scaffolding layer with a supply-chain compromise, probes the environment deliberately. Skills and MCP servers install from the internet with no signing and no scanning — malicious ones are a real and active delivery mechanism, not a future concern. The emergence of dedicated skills security solutions is evidence enough that this threat is already being taken seriously.

The distinguishing behavior is the same in all three cases: the agent accesses resources it was never explicitly directed to use.

2. Why defenses fall short

What makes this more dangerous is how creative agents get in pursuing their goals. Anthropic’s agentic misalignment research stress-tested models in simulated corporate environments and found that agents would pursue their assigned goals through unexpected means when straightforward paths were blocked: leaking sensitive information to competitors, attempting to influence their own evaluation, sabotaging research they disagreed with. The models weren’t broken; they were doing what they were optimized to do, just not in the way their operators intended.

This creativity makes naive defenses (blacklisting certain paths, blocking specific tool calls, or simply instructing the agent not to touch certain things) largely moot. A sufficiently motivated agent will reason around them, find an indirect path, or reframe the task in a way that makes the restricted action seem justified.

This compounds with a structural problem on the human side: agents increasingly run with minimal oversight. Skip-permissions modes bypass approval prompts. Long-running background tasks accumulate hundreds of tool calls nobody reviewed. Users experiencing approval fatigue click through without reading. The result is an agent with broad local access, a disposition to find creative paths to its goals, and nobody watching closely.

Simon Willison coined the lethal trifecta as a framework for exactly this risk: private data access, exposure to untrusted content, and exfiltration capability. When all three are present, an attacker who can inject instructions into the agent’s context has essentially won. It has become a widely accepted shorthand for why agentic deployments require a different security posture than traditional software.

3. Existing tools

Network honeypots have been around since the 1990s. The idea is unchanged across thirty years: deploy something that looks real but has no legitimate use, and alert on any interaction. A low-interaction honeypot like Honeyd emulates network services; a high-interaction one like Cowrie runs a full fake SSH daemon. Either way, any connection is anomalous by definition, because legitimate users don’t hit the honeypot.

Canarytokens applied this to credentials and files rather than network services. You generate a fake AWS access key or a Word document with a beacon embedded; when someone uses the key or opens the document, you get an alert. The AWS canary creates a real IAM user and monitors CloudTrail — there’s a lag of minutes, and it requires external AWS infrastructure.

Snare is a newer honeypot built specifically for the AI agent threat model, with a few meaningful differences from Canarytokens. It covers 18 credential types in one shot — including AI-native ones like OpenAI, Anthropic, and MCP server configs that Canarytokens doesn’t have — placing canaries in all the standard locations an agent would probe. The AWS canary fires at credential-resolution time via a local shell hook, before any API call is made, which is faster and doesn’t require external AWS infrastructure. Alerts include the SDK user agent and ASN, with a “Likely AI agent” flag when the request originates from cloud infrastructure — context that Canarytokens doesn’t surface.

The conceptual lineage runs from Honeyd to Canarytokens to Snare with the same core insight at each step: if you can define what legitimate access looks like, anything outside that definition is a signal.

4. Limitations

It detects, it doesn’t prevent. The agent already misbehaved by the time you get the alert. These are detection controls, not prevention controls. Knowing an agent touched a credential is useful — it’s not the same as stopping it. That said, detection data has value beyond the alert itself: observing what agents actually reach for in practice — even legitimate ones — tells you where the real access boundaries need to be. That’s useful input for building guardrails, scoping permissions, or writing policies grounded in observed behavior rather than guesswork.

Placement is manual. Canaries live where tools naturally look for credentials. If an agent is directed to a custom config path or a non-standard environment variable, the canary won’t be there. Coverage is bounded by where you planted the wires — the same fundamental limitation as any tripwire-based detection.

Detection confidence varies. Precision depends on the canary design and how deeply it hooks into the agent’s execution environment. Not all credential access paths are equally observable.

Shared machines. The low false-positive guarantee relies on the canary being invisible to legitimate users. On a machine shared by multiple developers, someone may stumble across a planted credential and use it for a real task — generating an alert that has nothing to do with a misbehaving agent. Dedicated agent environments sidestep this; shared workstations require more care.

Conclusion

The IDS-era idea is simple: legitimate users only access resources they have business accessing — so any interaction with a decoy is a signal by definition. What’s new is the target — an agent running under your own account, following instructions that arrived via a file you never meant to treat as executable, in a world where the line between “following instructions” and “going rogue” is invisible to a monitoring system that only observes API calls.

The tools are already there. Snare in particular caught my attention — built specifically for the AI agent threat model, it covers the credential surface an agent would probe and fires faster than any CloudTrail-based approach. An old technique that still holds up in the agentic age.

References

LLM01:2025 Prompt Injection — OWASP Top 10 for LLM Applications
Agentic Misalignment: How LLMs Could Be Insider Threats — Anthropic
The lethal trifecta for AI agents — Simon Willison
Snare — honeypot canaries for AI agents

copy.fail: From kernel CVE to Kubernetes Container Escape

Sat, 02 May 2026 11:58:11 -0700

My previous posts looked at what happens when a shared object — the LLM KV cache — has no per-tenant namespace: co-tenants can read each other’s data and starve each other’s resources. Coincidentally, the newly dropped CVE-2026-31431 (copy.fail) is the same pattern at a lower layer. The shared object is the Linux page cache (yup, a cache again!). The co-tenants are Kubernetes pods. The isolation boundary that does not exist is the host kernel.

The xint.io team published a writeup of the LPE vector; as of the time of writing, their container-escape follow-up is not yet out. This post covers a concrete container escape chain against Talos Linux and what the vulnerability class says about shared-kernel container security.

TL;DR

copy.fail is a local privilege escalation — but the xint.io authors note it can also be used to escape Kubernetes pods. In that scenario, the LPE itself is not the point: the attacker is already running in a pod and doesn’t need to escalate within it. Instead, copy.fail is used as a page cache corruption primitive. Containers on the same node share the host kernel, and through it, the page cache: the kernel’s in-memory representation of file contents. If two containers share an image layer, they share page-cache pages for every file in it. Corrupt the right file in a layer shared with a privileged container, and you get code execution in that container’s context — no LPE in the attacker’s pod required.

Talos Linux — a minimal, immutable OS for Kubernetes with no shell, no SSH, and a read-only root filesystem — is a concrete example where this plays out. On Talos worker nodes, kube-proxy and user workload containers share an overlayfs layer containing /usr/sbin/nft. kube-proxy calls nft as root every few seconds to reconcile nftables rules. An attacker in an unprivileged pod overwrites nft’s page-cache pages with shellcode and waits. The next reconciliation tick executes it as root, with access to the host filesystem.

At the end I discuss how microVM and sandboxed runtime architectures address this class of vulnerability by design.

1. Background: the copy.fail primitive

The full mechanics are covered in the xint.io writeup, and IMO the PoC exploit is very elegant. The short version:

Linux exposes kernel crypto to unprivileged userspace via AF_ALG sockets. A 2017 “in-place optimization” allowed the AEAD encryption engine to use the destination buffer as scratch space during intermediate steps — a reasonable choice, unless that destination buffer is backed by page cache pages.

The exploit feeds file data into an encryption operation using splice(), which passes page-cache pages directly rather than copying them. The AEAD engine writes a small amount of attacker-controlled data into those pages as scratch. The file on disk is untouched. The kernel’s in-memory view of it changes. No filesystem permissions are checked; no root is required; any process that can open an AF_ALG socket and read a file can do this.

2. Why the page cache crosses container boundaries

The Linux page cache is the kernel’s in-memory cache of file contents. When a process reads a file, the kernel loads it into the page cache and serves future reads from there. Writes to the page cache via copy.fail are immediately visible to any other process reading the same file — regardless of which container that process is in.

Linux namespaces isolate process trees, network interfaces, mount points, and user IDs. The page cache has no namespace. There is no per-container view of cached file contents. Every container on a node shares one page cache, managed by one kernel.

Container images are composed of layers. containerd stores each unique layer once on disk; multiple containers that share a base image mount the same underlying inodes through overlayfs. Shared inodes mean shared page cache pages. If two containers on the same node have pulled images that share a layer, any file in that layer lives in memory exactly once, visible to both.
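The "shared inode" condition can be illustrated with a small check. This is a sketch on a plain filesystem using a hard link as a stand-in for a shared layer; the helper names are hypothetical, and overlayfs mounts may report different device numbers for the same underlying file, so treat it as an illustration of the concept rather than a reliable in-container test.

```python
import os
import tempfile


def same_inode(path_a: str, path_b: str) -> bool:
    """True if both paths resolve to the same (device, inode) pair."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo() -> tuple:
    """Compare a file against a hard link to it and against a byte-identical copy."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "nft")
        with open(original, "wb") as f:
            f.write(b"binary contents")
        linked = os.path.join(d, "nft-hardlink")
        os.link(original, linked)        # second name, same inode
        copied = os.path.join(d, "nft-copy")
        with open(copied, "wb") as f:
            f.write(b"binary contents")  # same bytes, different inode
        return same_inode(original, linked), same_inode(original, copied)
```

Only the hard link shares an inode with the original. Identical bytes in a separate file get their own inode and therefore their own page-cache pages, which is why the attack depends on layer sharing rather than on content equality.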

3. The Talos escape chain

Talos Linux is built around the claim that “Talos Linux is the best OS for Kubernetes”. It ships with a security-focused design: no SSH daemon, no package manager, no operator shell, a read-only root filesystem, and a gRPC-only management plane. Unfortunately, all of that hardening operates at the OS level, and copy.fail operates below it, at the kernel. The Talos security team documented the vector in advisory GHSA-m38g-vww2-mvgx and patched it in v1.12.7 and v1.13.0, which ship the updated Linux kernel.

How the attack works in killchain terms: initial access to an unprivileged pod → page cache corruption via copy.fail → hijack code execution in a privileged pod via shared layer → container escape to the host. The escape has four components:

Shared layer with a privileged trigger. kube-proxy runs as root on every Talos node, and its image shares a layer containing /usr/sbin/nft with images built on compatible base distributions. kube-proxy calls nft automatically every few seconds to reconcile nftables rules with the cluster’s Service objects.

The exploit loop. The attacker’s process, running as an unprivileged user with no capabilities, opens an AF_ALG socket, opens the nft binary read-only, and splices its pages into an AEAD encryption operation. Each call writes four bytes of shellcode into the page-cache image of nft at a chosen offset. After enough iterations to cover the payload, the in-memory copy of the binary has been overwritten with the attacker’s code. The disk binary is untouched.

Automatic execution. kube-proxy’s next reconciliation tick — within five seconds — causes the kernel to exec what it believes is nft. It is the attacker’s shellcode, running as root inside a privileged pod.

Host filesystem access. kube-proxy runs with privileged: true, which grants CAP_SYS_ADMIN and access to host block devices. The shellcode can mount the host root filesystem and read or modify anything on it — kubeconfig credentials, certificates, Talos configuration, etcd data. Full node compromise.
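The split between the on-disk file and the kernel's in-memory view can be modeled with two buffers. This is a toy simulation only: it uses plain Python byte strings, touches no kernel interface, and the names `simulate` and `writes_needed` are hypothetical. It shows the arithmetic of the four-byte loop and why the disk copy stays intact.

```python
import math

WRITE_SIZE = 4  # bytes written per iteration, per the description above


def writes_needed(payload_len: int) -> int:
    """Number of 4-byte writes required to cover a payload of this length."""
    return math.ceil(payload_len / WRITE_SIZE)


def simulate(disk: bytes, payload: bytes, offset: int = 0) -> tuple:
    """Patch a cached copy of `disk` 4 bytes at a time; `disk` itself is never modified."""
    cached = bytearray(disk)  # stand-in for the page-cache view of the file
    iterations = 0
    for i in range(0, len(payload), WRITE_SIZE):
        chunk = payload[i:i + WRITE_SIZE]
        cached[offset + i:offset + i + len(chunk)] = chunk
        iterations += 1
    return disk, bytes(cached), iterations
```

In this model a 10-byte payload takes three iterations, and any integrity check that hashes the file from disk would still pass even though readers of the cached view see different bytes.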

Architecture: attacker pod and kube-proxy share the same page-cache pages through a common overlayfs layer, giving the attacker write access to memory that kube-proxy will execute.

graph TB
    subgraph node["Talos Node"]
        subgraph ap["Attacker Pod<br/>unprivileged · no capabilities<br/><br/>Runs copy.fail exploit to replace nft in-memory with shellcode"]
        end

        subgraph kp["kube-proxy<br/>root · privileged: true · hostNetwork: true<br/><br/>Runs nft on schedule and executes shellcode"]
        end

        subgraph layer["Shared overlayfs lower layer"]
            nft["/usr/sbin/nft · same inode"]
        end

        subgraph kern["Linux Kernel"]
            pc["Page Cache<br/>nft inode · same RAM"]
        end

        hfs["Host Filesystem<br/>kubeconfig · certs · Talos config<br/><br/>Accessible via block device mount"]

        ap ~~~ kp
        ap --- layer
        kp --- layer
        layer --- pc
        kp --- hfs
    end

Attack sequence: the exploit loop overwrites nft’s page-cache pages 4 bytes at a time; kube-proxy’s next scheduled execution runs the shellcode and gains access to the host filesystem.

sequenceDiagram
    participant A as Attacker pod<br/>(unprivileged)
    participant OL as overlayfs<br/>lower layer
    participant PC as Page Cache<br/>(nft inode)
    participant KP as kube-proxy pod<br/>(root · privileged)
    participant H as Host Filesystem

    A->>OL: open("/usr/sbin/nft", O_RDONLY)
    OL-->>A: fd → shared lower-layer inode

    loop payload: 4 bytes at offset i
        A->>PC: socket(AF_ALG) + splice(fd→pipe→ALG)
        Note over PC: authencesn scratch write<br/>payload[i:i+4] → page cache[i:i+4]
    end

    Note over A: page cache poisoned<br/>disk binary untouched

    Note over KP: ~5 seconds later<br/>(scheduled)
    KP->>PC: exec("/usr/sbin/nft")
    PC-->>KP: serves shellcode<br/>(not real nft binary)

    Note over KP: shellcode runs as root<br/>inside privileged pod
    KP->>H: mount /dev/{host root} → /mnt/host
    KP->>H: read/write host filesystem<br/>(kubeconfig, certs, Talos config)

4. Attack surface in production clusters

The Talos chain requires layer overlap between the attacker’s container and a privileged process. In a production cluster where the attacker finds themselves in a pod they didn’t build, that overlap is not guaranteed — but the surface is larger than it looks.

Shared base images. Most organizations build both application containers and cluster infrastructure (DaemonSets, agents, operators) from a small number of internal base images, often the same one, to standardize patching cadence. That creates the layer overlap this attack needs.

CNI plugins. Cilium, Calico, Flannel, and Weave run as privileged DaemonSets on every node with CAP_NET_ADMIN or CAP_SYS_ADMIN. They ship binaries — eBPF loaders, nftables wrappers, VXLAN tools — that may appear in a shared layer if the attacker’s image is based on a compatible distribution.

Monitoring and logging agents. Prometheus node_exporter, Datadog agents, Falco, and Fluentd are all privileged DaemonSets, and all touch the filesystem on a schedule. A monitoring agent that calls a shell or a compression tool from a shared layer on a periodic schedule is structurally identical to the kube-proxy/nft pattern.

There is an irony here: a workload running with no added capabilities and a distroless image is a poor target for this attack. The security agents deployed to watch it are better targets — always present and privileged by definition. The tooling added to monitor for anomalies itself becomes the attack surface.

CSI drivers. AWS, GCP, and Azure CSI storage drivers are privileged DaemonSets that invoke mount, resize2fs, and related tools regularly.

Setuid binaries as deferred triggers. Even without a continuously-running privileged process as a trigger, a setuid binary in a shared layer is exploitable when a pod restart or rolling deployment causes a privileged init container to exec it. Kubernetes restarts pods constantly.

The attacker’s recon. From inside a container, /proc/*/exe symlinks and inode comparisons reveal which binaries are being executed by other processes and whether the attacker shares page-cache pages with them. The recon is cheap and requires no special permissions.

The overlap requirement makes this harder than a pure LPE. In a cluster with strict image provenance — application images and system DaemonSets built from entirely separate supply chains — the overlap may genuinely not exist. But clusters tend toward sprawl, and the surface grows with every additional agent and plugin.

I’m curious which of these vectors the xint.io team will cover in their part 2 — or whether they’ll use a completely different approach.

5. The shared kernel problem and what to do about it

The container boundary is a namespace boundary, not a kernel boundary. Every piece of OS hardening — Talos’s read-only root, absent shell, gRPC-only management — operates above the kernel. copy.fail operates at the kernel. This is not specific to Talos: any shared-kernel container OS is subject to the same class of attack. Patching copy.fail addresses this specific CVE; the architectural question is what prevents the next one.

This vulnerability class has a ten-year track record. DirtyCow (CVE-2016-5195) exploited a race condition in copy-on-write handling to corrupt page cache pages and write to read-only files; container escapes followed. DirtyPipe (CVE-2022-0847) exploited a pipe buffer flag bug to achieve the same page cache write primitive; Replit documented how modifications were immediately visible across all containers on the same host, even with unprivileged containers and hardening in place. copy.fail (2026) uses the AEAD in-place optimization. Three separate bugs over the span of a decade, all exploiting the same architectural property: the page cache has no tenant boundary. Each time, the kernel was patched. No structural change followed. So what solutions emerged to actually address this class of vulnerability?

Sandboxed runtimes (gVisor). gVisor’s runsc runtime intercepts all syscalls in a userspace kernel called the Sentry, which implements the Linux syscall surface without delegating to the host kernel. An AF_ALG socket call from inside a gVisor container is handled entirely by the Sentry — the host kernel’s algif_aead.c is never invoked. The copy.fail primitive does not exist from inside a gVisor container. gVisor is used in production at Google and is available as a node pool option in GKE.

MicroVMs (Firecracker). Firecracker runs each workload inside a lightweight virtual machine with its own kernel — typically adding only ~125ms to cold start with negligible steady-state overhead. The page cache is per-VM; it cannot cross VM boundaries. A copy.fail exploit in one VM writes into that VM’s private kernel memory and goes no further. The host kernel and all other workloads are unaffected. Firecracker is the runtime behind AWS Lambda and Fargate.

Both approaches trade some performance or compatibility for a structural guarantee that kernel CVEs in one workload cannot reach other workloads — something no amount of OS hardening on a shared-kernel architecture can provide.

	Container	gVisor	Firecracker
Kernel shared	Yes	No	No
copy.fail reachable	Yes	No	No
Boundary type	Software (seccomp)	Software (Sentry)	Hardware (KVM)
Host attack surface	Large	~50 syscalls	KVM + minimal VMM
Guest kernel	Shared host	Sentry (userspace)	Vendor-built, stripped
Memory overhead	~MB	~15 MB	~125 MB
Syscall compat	Full	Partial	Full

UPDATE — June 3, 2026

This post was published on May 2nd, while waiting for xint’s container-escape follow-up. Both that and a community PoC have since landed.

xint.io part 2 is out. Their container escape follow-up (May 19) covers three paths. Two of them — shared-layer DaemonSet poisoning (what this post covers) and hostPath-mounted DaemonSets reaching host-side binaries — require layer overlap between the attacker’s image and a privileged workload. The third doesn’t: runc is bind-mounted read-only into every container via /proc/self/exe, so the attacker can poison the host runc binary directly through the page cache with no shared layer needed. That removes the biggest constraint on the container escape.

PoC validated across major cloud providers. Percivalll published a proof-of-concept demonstrating all three escape paths against production k8s clusters on Alibaba Cloud ACK, Amazon EKS, and Google GKE, including the runc path.

References

copy.fail vulnerability: copy.fail
xint.io LPE writeup (part 1): xint.io/blog/copy-fail-linux-distributions
Talos security advisory: GHSA-m38g-vww2-mvgx
DirtyPipe cross-container contamination by Replit: blog.replit.com/dirtypipe-kernel-vulnerability
DirtyCow (CVE-2016-5195): dirtycow.ninja
gVisor sandboxed runtime: gvisor.dev
Firecracker microVM: firecracker-microvm.github.io

KV Cache Flood: DoS Against Multi-Tenant LLMs

Mon, 27 Apr 2026 01:53:01 -0700

My previous post covered the KV cache timing side-channel, a known attack class for which I built a PoC and DIY test tooling. This post covers a different application of the same shared-cache vulnerability: using cache eviction as a DoS attack against co-tenants, with a cost structure that may be asymmetric in the attacker’s favor.

TL;DR

Every token prefilled by an LLM inference server can be cached. If you evict another tenant’s cached prefix, every one of their subsequent requests pays full cold-prefill cost until they re-warm it — at which point you evict it again. The attacker sends requests that generate only a single output token (near-zero compute cost) but carry long prompts that fill cache blocks. The victim pays for repeated full re-prefill at scale. On a shared RTX 3090 running vLLM, 4 flood threads caused significant TTFT degradation peaking at 9.7× within the first minute.

Weaponized cache eviction is a well-known attack class: Prime+Probe fills CPU cache sets with attacker data to evict a victim’s lines; CDN cache-busting floods origin servers by bypassing edge caches with unique URLs. I haven’t seen it applied to LLM KV caches, which is what I describe here.

⚠️ DISCLAIMER: Research purposes only. Run flooder.py only against infrastructure you own or have explicit written permission to test. Pointing it at a shared provider is a DoS attack, whether or not their cache is isolated.

1. The shared-cache surface

The foundation is covered in my earlier post, but here is a quick recap: production LLM inference servers (vLLM, SGLang, llama.cpp, TensorRT-LLM) cache the KV tensors produced during prefill and reuse them when a later request shares a prefix. In their default configurations, these caches are keyed only by the token sequence — not by tenant identity. When the same cache serves multiple tenants, two tenants with identical prompt prefixes hit the same cache entry, and two tenants with different prompt prefixes compete for the same LRU eviction pool.

flowchart LR
    subgraph Tenants["Tenants (isolated)"]
        direction TB
        V["Victim Pod"]
        A["Attacker Pod"]
        V ~~~ A
    end

    subgraph Cluster["Shared Inference Cluster"]
        direction TB
        S["Inference Server"]
        K[("Shared KV Cache")]
        S <--> K
    end

    V <--> S
    A <--> S

That last point is the subject of this post. The timing side-channel I described earlier is a read on the shared cache. This attack is a write: deliberately filling the shared pool to evict a target tenant’s entries.

2. The attack, mechanically

The mechanism is LRU eviction pressure. vLLM (and most other servers) use an LRU policy over a fixed KV cache pool. When the pool is full, the least recently used blocks are evicted. The attacker submits a sustained stream of requests with unique long prefixes, each filling cache blocks with attacker-owned entries. Under sustained pressure, the victim’s hot prefix blocks are displaced. The victim re-prefills from cold on their next request — potentially thousands of tokens — then re-warms, at which point the flood displaces them again.
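The eviction dynamic can be reproduced with a toy LRU pool. This is a sketch under simplifying assumptions (one block per prefix, a tiny fixed capacity, a hypothetical `LRUBlockCache` class) and does not reflect vLLM's actual block manager.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Toy LRU pool: keys are prefix identifiers, capacity is counted in blocks."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks = OrderedDict()  # key -> owner, ordered from LRU to MRU

    def access(self, key: str, owner: str) -> bool:
        """Return True on a hit; on a miss, insert the block and evict the LRU one if full."""
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[key] = owner
        return False


def run_flood(capacity: int = 3, flood_requests: int = 3) -> tuple:
    """Warm the victim's prefix, flood with unique keys, then probe the victim again."""
    cache = LRUBlockCache(capacity)
    for i in range(capacity):
        cache.access(f"victim-{i}", "victim")
    hit_before = cache.access("victim-0", "victim")
    for i in range(flood_requests):
        cache.access(f"garbage-{i}", "attacker")  # unique prefixes never hit
    hit_after = cache.access("victim-0", "victim")
    return hit_before, hit_after
```

With a pool of three blocks, three unique attacker prefixes are enough to turn the victim's warm hit into a cold miss. The victim's own probe then re-inserts the block, which is the re-warm step the flood has to keep undoing.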

flowchart LR
    subgraph S1["1. Warm cache"]
        direction TB
        a_top["[ MRU ]"]:::label
        a1["Victim prefix block"]:::victim
        a2["Victim prefix block"]:::victim
        a3["Victim prefix block"]:::victim
        a_bot["[ LRU ]"]:::label
        a_top --- a1
        a1 --- a2
        a2 --- a3
        a3 --- a_bot
    end

    subgraph S2["2. Flood writes garbage"]
        direction TB
        b_top["[ MRU ]"]:::label
        b1["Attacker garbage"]:::attacker
        b2["Attacker garbage"]:::attacker
        b3["Victim prefix block"]:::victim
        b_bot["[ LRU - evict next ]"]:::label
        b_top --- b1
        b1 --- b2
        b2 --- b3
        b3 --- b_bot
    end

    subgraph S3["3. Victim evicted"]
        direction TB
        c_top["[ MRU ]"]:::label
        c1["Attacker garbage"]:::attacker
        c2["Attacker garbage"]:::attacker
        c3["Attacker garbage"]:::attacker
        c_bot["[ LRU ]"]:::label
        c_top --- c1
        c1 --- c2
        c2 --- c3
        c3 --- c_bot
    end

    S1 ==>|"flood begins"| S2
    S2 ==>|"flood continues,<br/>victim ages out"| S3

    classDef victim fill:#cfe8ff,stroke:#3b82f6,color:#1e40af
    classDef attacker fill:#ffe4cc,stroke:#ea580c,color:#9a3412

The attacker’s requests set max_tokens=1, asking the model to stop after generating a single output token. Output generation is where most GPU compute goes; by minimizing output, the attacker lowers their own cost while maximizing cache pressure.

flooder.py implements the core flood loop in a few lines:

def flood_worker(stop_event, counter):
    while not stop_event.is_set():
        prompt = unique_garbage_prompt()   # long unique prefix — fills cache blocks, never matches victim
        send_request(prompt, max_tokens=1) # fire and forget; 1 output token = minimal compute cost
        counter[0] += 1

The script runs this worker in 4 concurrent threads and measures TTFT before, during, and after the flood, as a stand-in for what a real victim tenant would observe. In a real multi-tenant deployment, the victim is a co-tenant sending their own requests: their TTFT is fast when their prefix is in cache, and spikes when the attacker’s flood displaces those blocks, with no visibility into why. That is the impact the measurements below represent.

3. Attack economics

The core impact is TTFT degradation and throughput loss. Like any DoS attack, the attacker must sustain the pressure — once the flood stops, the victim re-warms on their next request and recovers immediately. The cost angle is worth noting but depends heavily on the provider’s pricing structure: providers that discount cached input tokens make cache misses expensive for the victim; providers that charge a premium for cache writes make flooding expensive for the attacker; pay-per-hour GPU deployments see the impact as throughput loss rather than a direct billing difference. The primary harm is latency degradation regardless of billing model.

Model size amplifies the attack surface. KV cache memory scales linearly with sequence length — each additional token adds a fixed amount of VRAM. What changes with model size is how much VRAM is left over for the cache after model weights are loaded. Larger models leave less headroom, and the attacker’s flood only needs to fill that available space to trigger eviction. Operators running large models on minimum-viable hardware are the most exposed.
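The linear scaling can be made concrete with the standard per-token estimate for a transformer KV cache: two tensors (key and value) per layer, each sized by KV heads times head dimension. The model dimensions and the 4 GiB headroom figure in the usage note are hypothetical, chosen for round numbers rather than taken from the models tested here, and the estimate ignores block granularity and allocator overhead.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: 2 (K and V) x layers x KV heads x head dim x element size."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(free_vram_bytes: int, per_token_bytes: int) -> int:
    """How many tokens of KV cache fit in the VRAM left after model weights are loaded."""
    return free_vram_bytes // per_token_bytes
```

For an assumed model with 32 layers, 8 KV heads, and a head dimension of 128 in fp16, `kv_bytes_per_token(32, 8, 128)` gives 131072 bytes (128 KiB) per token, so 4 GiB of headroom holds 32768 tokens. That is the total the flood has to cycle through to displace everything else in the pool.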

4. Test results

I ran the flood PoC in two environments: a local llama.cpp setup on my Apple M4 Pro laptop, and a GPU pod running vLLM on an RTX 3090. Both confirmed the attack. The local result is stronger in relative terms because llama.cpp processes requests sequentially, so flood threads add queuing contention on top of the cache eviction effect.

4.1 llama.cpp · Apple M4 Pro · TinyLlama 1.1B

4 flood threads, 60-second flood window, ~400-token garbage prompts. The victim’s SECRET_PREFIX is ~619 tokens.

Phase	Victim TTFT
Before flood (cache warm)	31 ms
During flood — t=20s	1178 ms
During flood — t=45s	1315 ms
During flood — t=68s	1342 ms
After flood — first probe	284 ms
After re-prime (recovered)	33 ms
Sustained degradation factor	42.8×

The flood sent 110 requests in 60 seconds. The extreme degradation factor reflects both eviction and contention: llama.cpp processes requests sequentially, so 4 concurrent flood threads also act as a request queue, stacking latency on the victim’s probes during the flood window.

4.2 vLLM 0.19.1 · RTX 3090 · Qwen2.5-3B-Instruct

Same parameters. vLLM batches requests, so flood threads contribute less queuing contention — the degradation reflects mostly cache eviction.

Phase	Victim TTFT
Before flood (cache warm)	73 ms
During flood — t=20s	670 ms
During flood — t=42s	701 ms
During flood — t=64s	582 ms
After flood — first probe	113 ms
After re-prime (recovered)	93 ms
Sustained degradation factor	9.7×

The flood sent 423 requests in 60 seconds.

Two effects compound during the flood. The primary effect is cache eviction: the victim’s cached prefix blocks are displaced, forcing full cold prefill. The secondary effect is request queuing contention: flood threads compete for GPU compute. Both effects are visible in the peak reading of 701 ms — well above the 113 ms cold-miss latency measured immediately after the flood stopped. An attacker with more threads or a higher request rate would see higher contention and correspondingly higher victim TTFT.

The first post-flood probe confirms eviction (cold miss in both environments). Recovery takes exactly one request. This means the attacker must sustain the flood continuously to maintain the degradation — but nothing about the attack requires it to be a one-shot event.

5. Why detection is hard

From the provider’s side, the flooder looks like a high-volume legitimate customer: diverse prompts, many requests, short outputs — indistinguishable from a batch inference workload without per-tenant cache-hit-rate monitoring. Standard rate limiting misses it if the attacker stays within their tier. The victim sees elevated TTFT and higher-than-expected input token costs but has no direct visibility into what caused the eviction.

The distinguishing signals that could catch it:

Per-tenant cache write rate (bytes/sec and blocks/sec) anomalously high from one API key
Per-tenant cache hit rate for the victim anomalously low relative to their historical baseline
Correlation between victim hit-rate drops and attacker write-rate spikes

None of these are surfaced by default in vLLM’s metrics.
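As a rough illustration of the first two signals, the sketch below flags a tenant whose cache write rate sits far above the fleet median, and a victim whose hit rate has fallen well below its own baseline. The function names, the 10× factor, the 0.5 drop ratio, and the sample numbers are all assumptions for illustration; the serving layer would have to record these per-tenant counters itself.

```python
from statistics import median

def flag_heavy_writers(write_rates, factor=10.0):
    """Return tenants whose cache write rate (blocks/sec) exceeds factor x the median."""
    baseline = median(write_rates.values())
    return sorted(t for t, rate in write_rates.items() if rate > factor * baseline)

def hit_rate_dropped(history, current, drop_ratio=0.5):
    """True if the current hit rate fell below drop_ratio x the historical median."""
    return current < drop_ratio * median(history)

# Hypothetical per-tenant samples, not measurements from the PoC.
write_rates = {"tenant-a": 4.0, "tenant-b": 6.0, "tenant-c": 5.0, "tenant-x": 120.0}
print(flag_heavy_writers(write_rates))             # tenants worth inspecting
print(hit_rate_dropped([0.92, 0.95, 0.90], 0.10))  # victim-side signal
```

Either signal alone is noisy; the third signal in the list (the two moving together in time) is what would tie a victim's drop to a specific writer.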

6. Defenses

If you’re building a multi-tenant product on top of a self-hosted inference server — an API wrapper, an agent platform, a SaaS product where multiple customers share a vLLM or SGLang backend — this attack applies to you directly. The server doesn’t know your tenants exist; it sees one request queue and one cache pool. A single misbehaving or malicious customer can degrade service for everyone else, and nothing in the default configuration stops them. The mitigations below are things you need to add.

Per-tenant cache footprint quotas directly address the flood. Each tenant gets a quota of cache blocks; eviction targets are chosen proportionally within the offending tenant’s allocation rather than globally. A tenant filling the pool with garbage only displaces their own entries, not other tenants’. Rate-limiting cache writes per tenant (blocks/sec) is the enforcement mechanism.
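A toy model of the quota idea, assuming a fixed per-tenant block quota and LRU eviction inside each tenant's own allocation. The class and its names are hypothetical and do not correspond to any vLLM or SGLang API.

```python
from collections import OrderedDict

class QuotaCache:
    """Toy block cache: a tenant over quota evicts only its own oldest blocks."""

    def __init__(self, quota_blocks):
        self.quota = quota_blocks
        self.owner = {}   # block_hash -> tenant that wrote it (shared namespace)
        self.owned = {}   # tenant -> OrderedDict of its block hashes, LRU first

    def put(self, tenant, block_hash):
        if block_hash in self.owner:
            # Already cached: refresh recency in the owning tenant's list.
            self.owned[self.owner[block_hash]].move_to_end(block_hash)
            return
        own = self.owned.setdefault(tenant, OrderedDict())
        if len(own) >= self.quota:
            evicted, _ = own.popitem(last=False)  # evict this tenant's oldest block
            del self.owner[evicted]
        own[block_hash] = None
        self.owner[block_hash] = tenant

    def has(self, block_hash):
        return block_hash in self.owner

cache = QuotaCache(quota_blocks=4)
cache.put("victim", "system-prompt-block")
for i in range(1000):                      # attacker floods with garbage blocks
    cache.put("attacker", f"garbage-{i}")
print(cache.has("system-prompt-block"))    # the victim's entry survives the flood
```

Lookups in this toy are still shared across tenants, so it only addresses displacement, not the timing channel.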

Tenant-scoped cache keys (the fix described in the companion post) also eliminate the flood attack entirely as a side effect. If each tenant’s entries are keyed with their identity, there is no shared pool to flood — a garbage entry from attacker key A can never displace a legitimate entry from victim key B, because they exist in logically separate namespaces. This is the cleaner fix, but it requires changes to the cache key derivation that per-tenant quotas do not.

Defenses observed in practice. Anthropic’s prompt caching largely neutralizes any economic asymmetry in the flood attack. Cache writes are billed at a premium (1.25× base input rate for 5-minute TTL, 2× for 1-hour), so the attacker pays more per token than a normal request. There is a minimum cacheable prompt size (1024–4096 tokens depending on model), so each flood request must be substantial to even register a cache write. The net economics: the attacker pays 1.25× base input rate per flood request; each flooded victim request that misses cache pays 1.0× instead of the usual 0.1× — a 10× cost increase for the victim. The attacker’s per-token cost is only marginally higher than the per-victim damage they cause, though the aggregate victim impact grows with victim count. In practice, sustaining enough flood volume on a managed API to matter would likely trip rate limits before the economics become interesting.
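The arithmetic behind that comparison, using only the multipliers quoted above and assuming the flood prompt and the victim's prefix are the same length (so per-token multipliers can be compared directly):

```python
# Multipliers relative to the base input rate, as quoted above.
CACHE_WRITE_5MIN = 1.25   # attacker: every flood prompt is a cache write
CACHE_READ = 0.10         # victim: normal cost on a cache hit
BASE_INPUT = 1.00         # victim: cost once the cached prefix is gone

victim_increase = BASE_INPUT / CACHE_READ          # cost multiple on a miss
victim_damage = BASE_INPUT - CACHE_READ            # extra cost per victim token
attacker_per_damage = CACHE_WRITE_5MIN / victim_damage

print(f"victim cost increase on a miss: {victim_increase:.1f}x")
print(f"attacker spend per unit of victim damage: {attacker_per_damage:.2f}")
```

Under these assumptions the attacker spends roughly 1.4 units for every unit of extra cost imposed on a single victim request, which is why the asymmetry only appears when one flood evicts many victims at once.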

Conclusion

The shared KV cache is not just an information channel — it is also a resource that can be weaponized against co-tenants. The attack requires one API key and a loop. The attacker minimizes output tokens; the victim pays for repeated full re-prefill. Whether that translates into a cost advantage for the attacker depends on the deployment’s pricing structure, but the latency degradation is real regardless.

The root cause is the same as for the timing side-channel: cache entries are keyed by token sequence without tenant identity. The mitigations are tenant-scoped cache keys, per-tenant cache quotas, and rate limits — or economic leverage that raises the attacker’s cost: write premiums and minimum cacheable prompt sizes.

KV Cache Timing Side-Channel in Multi-Tenant LLMs

Sat, 25 Apr 2026 01:51:22 -0700

I built a small tool to test whether a managed inference provider’s prefix cache leaks timing information across tenant boundaries. Here’s what I found, how to run the same test yourself, and what providers should do about it.

This post covers the timing side-channel — a known attack class I’m adding a concrete PoC and DIY testing guide to. My next post covers a novel application of the same shared-cache vulnerability in a DoS attack.

TL;DR

Modern LLM inference servers cache the KV tensors produced during prompt prefill and reuse them when a later request shares a prefix. Cache hits can be 10–40× faster than cold prefill — a legitimate and widely deployed optimization. If the cache is shared across tenants without per-tenant key scoping, it becomes a cross-tenant information channel.

The timing side-channel works a bit like a chosen-plaintext attack in classical crypto. Recovering a system prompt from scratch is infeasible: the token-space is astronomical. But an attacker with a reasonable hypothesis — drawn from a competitor’s public product, a job posting, a leaked doc — needs only to confirm it. The shared cache becomes a binary oracle: submit the candidate, observe time-to-first-token (TTFT), and the cache tells you whether you are right. On a local llama.cpp setup I measured a 17.8× hit/miss ratio; the correct candidate ranked first in tens of requests total.

The fix — tenant-scoped cache keys — is not deployed by default in most stacks. This post is about why that matters and how to check whether your provider has it.

Get the code: PoC scripts live in my repo. See Testing your provider for how to run them.

1. Prior art

This attack class is not new; the academic literature here is solid:

PROMPTPEEK (NDSS 2025, Wu et al.) is the most directly relevant prior work. It demonstrates a complete prompt recovery pipeline — not just hypothesis confirmation — using a local LLM as an inference oracle to refine partial-match candidates. It targets multi-tenant SaaS LLM deployments, achieves ~99% prompt recovery accuracy, and works against vLLM and SGLang.

InputSnatch (Zheng et al., arXiv:2411.18191, 2024) demonstrates timing side-channel attacks against prefix caching in vLLM and SGLang, including token-by-token prompt reconstruction in chatbot scenarios. This is the canonical academic reference for the timing oracle in a single-node scope.

CPU cache side-channels (Flush+Reload, Prime+Probe, etc.) from the 2000s and 2010s provide the conceptual template. The LLM case is mechanically different but shares the structure: a shared resource (microarchitectural state there, the KV cache here), timing as the observation channel, tenant isolation as the missing primitive.

This post adds a concrete, runnable PoC against open-source inference software, empirical measurements on both consumer CPU hardware and a GPU, and a provider testing guide you can run against a real managed API.

2. Background: prefix caching is standard

Every production inference server does prefix caching. The reasoning: prefill is quadratic-ish in prompt length, decode is roughly linear in output length, and most production workloads have long system prompts that are identical across requests. Caching the KV tensors for the shared prefix lets each subsequent request skip straight to decoding the user-specific suffix.

Concrete implementations:

vLLM — --enable-prefix-caching (on by default in recent versions). Block-granular at 16 tokens. Keyed by token hash. Optional per-tenant isolation via a cache salt; see Section 8.1.
SGLang — RadixAttention. Tree-structured, automatic.
llama.cpp — --cache-reuse N. Reuses KV when ≥N prefix tokens match something in cache.
TensorRT-LLM — KV cache reuse via kvCacheConfig.enableBlockReuse.
LMCache — cluster-level extension turning per-node caches into a shared pool across nodes.

All of the above, in their default configurations, key the cache only by the token sequence. If two tenants submit requests that share a prefix, they hit the same cache entry. The cache does not know, or care, which tenant put the entry there.
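A toy model of token-keyed, block-granular caching makes the property concrete. This is a sketch under simplifying assumptions (16-token blocks, SHA-256 chain hashing, a Python set as the cache), not the hashing code of any of the servers listed above.

```python
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids):
    """Chain-hash each full block so a block's key depends on every token before it."""
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes

def cached_prefix_blocks(cache, token_ids):
    """Count leading blocks of this request that are already in the cache."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hits += 1
    return hits

cache = set()
victim = list(range(64)) + [900, 901]         # 64-token prefix plus a short suffix
cache.update(block_hashes(victim))            # the victim's request populates the cache
attacker = list(range(64)) + [777]            # different tenant, same prefix
print(cached_prefix_blocks(cache, attacker))  # number of blocks reused across tenants
```

Nothing in the key identifies who wrote the entry, so the attacker's request reuses all four of the victim's prefix blocks.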

That property is desirable for batched inference inside a single tenant. It is a security gap when the same cache serves multiple distinct trust domains.

3. Threat model

flowchart LR
    subgraph Tenants["Tenants (isolated)"]
        direction TB
        V["Victim Pod"]
        A["Attacker Pod"]
        V ~~~ A
    end

    subgraph Cluster["Shared Inference Cluster"]
        direction TB
        S["Inference Server"]
        K[("Shared KV Cache")]
        S <--> K
    end

    V <--> S
    A <--> S

System. A multi-tenant inference service. Each tenant has API access and submits standard completion requests. Tenants may share a node, or sit on different nodes behind a cache-aware router. TTFT is observable to the tenant — either via streaming response, or by timing any request with max_tokens=1.

Attacker capability. One valid API key. No privileged access. No ability to inspect server state. Ability to time HTTP responses at millisecond resolution — trivially available from any HTTP client.

Attacker goal. Confirm or recover another tenant’s system prompt or shared context. System prompts frequently contain business logic and tenant-specific instructions, embedded credentials or internal URLs (a common deployment mistake), references to internal architecture, and in RAG setups verbatim passages from private documents.

Out of scope. Attacks requiring host access (kernel exploits, firmware, physical fabric taps). Attacks against model weights. Cache-based availability and cost attacks are covered in the companion post.

4. The attack, mechanically

The attacker:

Constructs a candidate prefix — a hypothesis about another tenant’s system prompt. Sources: the victim’s public product, job postings, leaked docs, known templates.
Appends a unique suffix (to avoid matching the attacker’s own prior probes).
Submits and records TTFT.
Compares against a pre-established baseline: is this TTFT in the cache-hit band, or the miss band?

A result in the hit band means the candidate matches something cached by another tenant. Full prompt recovery — binary search, token-by-token enumeration, local-LLM-guided refinement — is out of scope here; PROMPTPEEK covers that end-to-end. This post focuses on a prerequisite question: is the environment vulnerable to cross-tenant cache attacks in the first place? If the timing signal is absent, none of the more sophisticated attacks are viable.
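The first three steps can be sketched as follows. The transport is deliberately abstracted: send_until_first_token is a hypothetical callable that would wrap an HTTP request with max_tokens=1 (or a streaming request) and return once the first token arrives. Here a stand-in that sleeps takes its place so the sketch runs without a server.

```python
import time
import uuid

def build_probe(candidate_prefix):
    """Candidate prefix plus a unique suffix, so no probe fully matches an earlier one."""
    return f"{candidate_prefix}\n\n[probe {uuid.uuid4()}]"

def measure_ttft_ms(send_until_first_token, prompt):
    """Time one request; the callable must block until the first token arrives."""
    start = time.perf_counter()
    send_until_first_token(prompt)
    return (time.perf_counter() - start) * 1000.0

def fake_send(prompt):
    # Stand-in for the real HTTP call; sleeps instead of contacting a server.
    time.sleep(0.02)

probe = build_probe("You are the ACME support assistant.")
print(f"{measure_ttft_ms(fake_send, probe):.1f} ms")
```

The unique suffix goes after the candidate so the prefix under test stays byte-identical to the hypothesis.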

5. Proof of concept

I ran this in two environments: first on my MacBook Pro (Apple M4 Pro, llama.cpp with Metal acceleration) to establish the baseline signal locally without any cloud spend, then on a rented RTX 3090 GPU pod running vLLM to confirm the attack holds on the inference stack that managed providers actually use. Both show the same result — one probe, correct identification — at different absolute latency scales.

Setup:

llama-server from llama.cpp, release build with Metal acceleration
Model: TinyLlama 1.1B (GGUF, ~670 MB)
Hardware: Apple M4 Pro, 24 GB unified memory
Server flags: -c 4096 --cache-reuse 16 -ngl 99 --port 8000

Two client processes (“tenants”):

Victim (victim.py): loops every ~10 seconds, sends SECRET_PREFIX plus a short suffix. Uses max_tokens=1 for measurement convenience; a real victim generates full responses, but the cache behavior is the same: the prefix is cached after prefill, before any output is produced. SECRET_PREFIX is a 743-token system prompt, realistic for a customer-support or internal tooling bot.
Attacker (attacker.py): runs a baseline to characterize hit/miss TTFT distributions, then submits five candidate prefixes.

The five candidates:

#	Candidate	Relationship to secret
0	Same company name, different prompt body	Near-miss
1	Same company, wrong escalation policy	Shallow partial match
2	`SECRET_PREFIX`	Exact match
3	Different company name, similar structure	Unrelated
4	Entirely different domain	Unrelated

All candidates padded to approximately the same token count as SECRET_PREFIX (~743 tokens). Why this matters is covered in Section 6.

Baseline (5 samples each):

Condition	Median TTFT
Cold miss (unique UUID prefix, never cached)	195 ms
Warm hit (prefix primed, then measured)	11 ms
Ratio	17.8×

The hit and miss bands sit 17.8× apart, so I set the detection threshold at ~97 ms (half the miss median). Any probe below that is, with high confidence, a cache hit.

Detection (1 probe per candidate, padded to ~743–765 tokens):

Candidate	Raw tokens	Norm tokens	TTFT	Classification
ACME near-miss, different body	148	748	268 ms	Miss
ACME assistant, wrong escalation policy	386	746	269 ms	Miss
Exact match (`SECRET_PREFIX`)	743	743	45 ms	Cache hit — ranked #1
Northwind Logistics (different company)	225	765	276 ms	Miss
Writing tutor, unrelated domain	146	746	268 ms	Miss

The exact-match candidate ranked first at ~45 ms, clearly separated from the cold-miss cluster of 268–276 ms. The miss candidates are tightly clustered because normalization brought all of them to within 20 tokens of each other.
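The threshold rule and the ranking can be reproduced from the numbers in the two tables above. This is a sketch of the decision logic only, not the code in attacker.py.

```python
MISS_MEDIAN_MS = 195.0               # cold-miss baseline median
THRESHOLD_MS = MISS_MEDIAN_MS / 2    # half the miss median

# TTFT per candidate in ms, copied from the detection table.
probes = {
    "ACME near-miss, different body": 268,
    "ACME assistant, wrong escalation policy": 269,
    "Exact match (SECRET_PREFIX)": 45,
    "Northwind Logistics (different company)": 276,
    "Writing tutor, unrelated domain": 268,
}

def classify(ttft_ms, threshold_ms=THRESHOLD_MS):
    """Anything under the threshold is treated as a cache hit."""
    return "hit" if ttft_ms < threshold_ms else "miss"

ranked = sorted(probes.items(), key=lambda item: item[1])
for name, ttft in ranked:
    print(f"{ttft:>4} ms  {classify(ttft):<4}  {name}")
```

Ranking matters more than the absolute threshold: the decision is robust as long as one candidate sits clearly below the rest.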

The local result confirms the PoC works, but treat the absolute numbers with some skepticism: llama.cpp is single-threaded, so concurrent requests queue and inflate miss latency; Apple Silicon’s unified memory architecture produces different prefill timing characteristics than a discrete GPU. The vLLM replication below is the more representative data point for production inference stacks.

5.1 vLLM replication on GPU

This is a small-scale experiment — a single rented GPU pod, low concurrent load, a 3B model. Real production deployments differ in model size, cluster topology, and load patterns, all of which affect the absolute numbers. The goal here is to confirm the signal exists on the inference stack that providers actually use, not to quantify it at production scale.

Setup:

vLLM 0.19.1 with --enable-prefix-caching, block size 16 tokens
Model: Qwen/Qwen2.5-3B-Instruct
Hardware: RTX 3090 (24 GB VRAM), 32 vCPUs
Server flags: --enable-prefix-caching --max-model-len 4096

SECRET_PREFIX is 3,714 characters / 619 tokens (exact count via vLLM’s /tokenize API; the common 4 chars/token English heuristic would predict ~930 tokens, a significant overestimate for this prompt, which the model tokenizes at ~6 chars/token). The victim’s initial cold prefill of this prefix took ~1,460 ms, reflecting both CUDA warmup overhead on first inference and the 619-token prefill pass.

Baseline (5 samples each, prompts padded to 3,714 chars / ~619 tokens):

Condition	Median TTFT
Cold miss (unique UUID-prefixed, never cached)	120 ms
Warm hit (SECRET_PREFIX primed, then measured)	69 ms
Ratio	1.75×

The 1.75× ratio is specific to this setup — available VRAM, cluster load, hardware, virtualization layer, and network latency all affect the absolute numbers. In general, the ratio will vary, and a lower ratio doesn’t mean the attack doesn’t work: what matters is whether the hit and miss clusters are separable, which they were in both environments tested here.

Detection (1 probe per candidate, candidates token-padded to ~619–636 tokens):

Candidate	Raw tokens	TTFT	Classification
ACME near-miss, different body	126 → 630 padded	133 ms	Miss
ACME assistant, wrong escalation policy	348 → 636 padded	133 ms	Miss
Exact match (`SECRET_PREFIX`)	619	76 ms	Cache hit — ranked #1
Northwind Logistics (different company)	194 → 626 padded	136 ms	Miss
Writing tutor, unrelated domain	127 → 631 padded	136 ms	Miss

The exact-match candidate ranked first at ~76 ms, clearly separated from the cold-miss cluster of 133–136 ms. The miss candidates are tightly clustered because they are padded to nearly equal token counts with neutral filler — equalizing length is critical; a shorter cold-miss candidate prefills faster than a longer cache-hit candidate and inverts the expected ranking.

6. Measurement methodology

Building this PoC surfaced three requirements for a reliable signal that are worth documenting.

Pad all candidates to equal length. Prefill time scales with token count — a short miss can prefill faster than a long hit, inverting the ranking. Candidates must be padded to match the target prefix length. Character-count padding is imprecise since tokenization density varies by content type. The PoC calls the server’s tokenizer API (/tokenize) to measure and pad to an exact token count.
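A minimal sketch of token-count padding. The count_tokens argument is a hypothetical callable that would wrap the server's /tokenize endpoint in a real run; a whitespace splitter stands in here so the example runs offline. Filler is appended after the candidate so the prefix under test is unchanged.

```python
def pad_to_token_count(text, target_tokens, count_tokens, filler=" the"):
    """Append filler until count_tokens(text) reaches target_tokens; never trims."""
    padded = text
    while count_tokens(padded) < target_tokens:
        padded += filler
    return padded

def whitespace_count(text):
    # Stand-in tokenizer for the demo; a real tokenizer will count differently.
    return len(text.split())

padded = pad_to_token_count("You are a writing tutor.", 12, whitespace_count)
print(whitespace_count(padded), repr(padded))
```

With a real tokenizer the loop lands exactly on the target only if each filler unit is a single token, so it is worth re-counting the final string and accepting a spread of a few tokens, as in the tables above.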

One probe per candidate. The first probe caches that candidate on the server, polluting the oracle — a second probe always hits the attacker’s own entry. One measurement per candidate per API key; re-probe with a fresh key or wait for eviction.

Victim contention adds noise. In-flight victim requests can queue ahead and inflate the matching candidate’s TTFT on a given round. Taking min(TTFT) across multiple rounds handles it: misses never produce a reading near the hit floor regardless of how many rounds you take; genuine hits will produce at least one clean reading there.
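The min-across-rounds rule is a one-liner per candidate. The readings below are hypothetical, with round 2 showing the matching candidate inflated by victim contention. Each round is assumed to be a clean probe in the sense of the previous requirement.

```python
def min_ttft_per_candidate(rounds):
    """Keep the fastest reading seen for each candidate across all rounds."""
    best = {}
    for round_result in rounds:
        for name, ttft in round_result.items():
            best[name] = min(ttft, best.get(name, float("inf")))
    return best

# Hypothetical readings in ms, not measurements from the PoC.
rounds = [
    {"exact": 76, "near-miss": 133},
    {"exact": 140, "near-miss": 136},   # contention inflates the true hit
    {"exact": 78, "near-miss": 134},
]
print(min_ttft_per_candidate(rounds))
```

The minimum works because noise here is one-sided: queuing can make a hit look slow, but nothing makes a miss look fast.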

7. Testing your provider

This is the part I built the tool for. If you use a managed GPU inference API and send sensitive system prompts, you can test whether the provider’s shared cache leaks across tenant boundaries. You need two separate accounts on the same provider and model — production and staging, two colleague accounts, whatever. Step-by-step instructions are in the repo README.

This is a self-assessment, not an attack. You’re testing infrastructure you pay for. Do not probe another organization’s prompts. If you find a shared cache without isolation, ask your provider whether they offer configurable per-tenant cache isolation — it may already exist as an opt-in feature, a contract tier, or a flag you can request.

The attacker script runs baseline characterization and candidate detection in one shot. The ratio alone is not the signal; what matters is whether the matching candidate ranks clearly below the cold-miss cluster. If all candidates cluster together, either the cache is isolated or the victim hasn’t warmed it yet — confirm at least one “keepalive ok” log line and retry.

8. Defenses

8.1 Tenant-scoped cache keys

This is the fix that categorically eliminates the timing side-channel.

Key the cache by HMAC(tenant_id, token_sequence) rather than by the token sequence alone. Cross-tenant hits become impossible; the side channel disappears because the shared primitive is gone.
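A minimal sketch of such a key, assuming HMAC-SHA256 and a comma-joined encoding of the token IDs. Both choices, and the function name, are illustrative.

```python
import hashlib
import hmac

def scoped_cache_key(tenant_id: str, token_ids: list[int]) -> str:
    """HMAC-SHA256 over the token sequence, keyed by the tenant identity."""
    message = b",".join(str(t).encode() for t in token_ids)
    return hmac.new(tenant_id.encode(), message, hashlib.sha256).hexdigest()

tokens = [101, 2023, 2003, 1037, 3595]
key_a = scoped_cache_key("tenant-a", tokens)
key_b = scoped_cache_key("tenant-b", tokens)
print(key_a == scoped_cache_key("tenant-a", tokens))  # same tenant, same tokens
print(key_a == key_b)                                 # different tenants, same tokens
```

The same tenant always derives the same key, so within-tenant reuse is unaffected, while identical token sequences from two tenants map to unrelated entries.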

Cost: zero within-tenant. Every tenant keeps the full prefix-caching benefit for their own traffic. You lose cross-tenant sharing — which should be the default stance anyway, and can be opted into for genuinely public prompts by scoping those entries to a public tenant identifier.

vLLM’s implementation. vLLM ships this as a first-class feature documented under “Cache Isolation for Security”. The salt must derive from the tenant identity — e.g. HMAC(server_secret, tenant_id).
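One way a gateway in front of vLLM might apply this, sketched under assumptions: the salt is derived server-side with HMAC-SHA256 and attached as the per-request cache_salt field. The helper names and the secret are hypothetical, and the exact request shape should be checked against the vLLM documentation for the version in use.

```python
import hashlib
import hmac

def tenant_cache_salt(server_secret: bytes, tenant_id: str) -> str:
    """Derive a stable per-tenant salt that tenants cannot compute themselves."""
    return hmac.new(server_secret, tenant_id.encode(), hashlib.sha256).hexdigest()

def with_cache_salt(payload: dict, server_secret: bytes, tenant_id: str) -> dict:
    """Return a copy of the request body with the gateway-derived salt attached."""
    salted = dict(payload)
    # Overwrites any client-supplied value so a tenant cannot pick another's salt.
    salted["cache_salt"] = tenant_cache_salt(server_secret, tenant_id)
    return salted

secret = b"replace-with-a-real-secret"   # hypothetical; load from a secret store
body = {"model": "Qwen/Qwen2.5-3B-Instruct", "prompt": "...", "max_tokens": 1}
print(with_cache_salt(body, secret, "tenant-a")["cache_salt"][:16])
```

Deriving the salt at the gateway rather than accepting it from the client is the important design choice: a salt the tenant controls isolates nothing.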

8.2 TTFT jitter / Constant-time padding

Add uniform random delay to first-token emission. Jitter raises the number of probes an attacker needs to distinguish hit from miss — the wider the jitter window relative to the hit/miss gap, the more samples required.

The trade-off: jitter adds artificial latency to every response, including legitimate ones. A window wide enough to meaningfully obscure a 50–100ms hit/miss gap degrades TTFT SLA for all tenants.
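The trade-off can be put in numbers with a simplified model: treat the hit and miss latencies as fixed points and add uniform jitter on [0, window]. The two bands start to overlap only once the window reaches the hit/miss gap, and the mean added latency is half the window. The function names are illustrative; the 69 ms and 120 ms figures are the vLLM baseline medians from Section 5.1.

```python
def bands_overlap(hit_ms, miss_ms, jitter_window_ms):
    """Hits land in [hit, hit + window], misses in [miss, miss + window]."""
    return hit_ms + jitter_window_ms >= miss_ms

def mean_added_latency_ms(jitter_window_ms):
    """Expected extra delay per response for uniform jitter on [0, window]."""
    return jitter_window_ms / 2

HIT_MS, MISS_MS = 69, 120
for window in (25, 51, 200):
    print(window, bands_overlap(HIT_MS, MISS_MS, window), mean_added_latency_ms(window))
```

A window below the gap leaves the bands fully separable from a single probe, so any jitter that helps at all costs every tenant at least half the gap in average TTFT.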

Constant-time padding — holding first-token emission to a fixed ceiling — eliminates the signal entirely but effectively nullifies the business case for prefix caching.

Neither approach addresses the root cause. The shared cache namespace remains intact; you are making it harder to read, not removing it. Both measures raise attacker cost, but they also penalize legitimate users, which makes them impractical.

8.3 Anomaly detection on probing patterns

Probing has a characteristic signature: many requests from the same API key with the same prefix and monotonically varying suffixes, or the same suffix with systematically varying prefixes. Useful as a tripwire; not reliable prevention against a well-resourced attacker distributing probes across keys and time.

9. Discussion

9.1 Why this is a multi-tenant problem

Within a single tenant, prefix cache hits are a pure win — faster responses, lower compute cost, no security concern. The attack requires a cache serving two distinct trust domains from a shared namespace. That describes: managed inference APIs serving multiple paying customers; internal multi-team platforms where teams run on shared infrastructure; any inference endpoint behind a cache-aware router in front of a shared pool.

9.2 Why it hasn’t been fixed

There is a Spectre/Meltdown parallel here. CPU architects optimized for performance (speculative execution, shared branch predictors, unified caches), and the security consequences of sharing microarchitectural state across trust boundaries were not addressed. Fixes such as retpoline imposed performance regressions, reported at 10–30% on some workloads, that made deployment a business decision, not just an engineering one.

The same tension applies here. Tenant-scoped cache keys are the correct fix and not technically difficult. But they eliminate cross-tenant prefix sharing, which is where a meaningful fraction of the advertised throughput and latency gains come from in a densely-packed multi-tenant cluster. Deploying the fix means accepting a performance regression that has to be sold to product and finance — and in a market where TTFT and cost-per-token are the headline numbers, that conversation is slow.

The result is the same pattern seen post-Spectre: the vulnerability is documented, the fix is known, deployment lags because the economic incentive to ship the fix is weaker than the incentive to maintain the performance headline.

9.3 On disclosure

This attack is public — PROMPTPEEK and InputSnatch cover it thoroughly, and the PoC here adds reproducibility and a runnable verification tool without introducing a novel primitive. Nothing here requires formal coordinated disclosure, but operators of multi-tenant inference APIs who have not implemented tenant-scoped cache keys should treat this as a prompt to do so.

Conclusion

Prefix caching is one of the most impactful optimizations in modern LLM serving. It is also a cross-tenant side channel by default, because the cache key is usually only the token sequence.

Confirming a hypothesis about another tenant’s cached system prompt requires an API key, an HTTP client, and a handful of well-constructed probes — the correct candidate separated clearly from the field in both test environments. The PoC scripts in the repo make it reproducible without a research lab setup, and the provider testing guide in Section 7 gives anyone running a sensitive workload a practical way to check whether their provider has the fix deployed.

Tenant-scoped cache keys are a straightforward fix. The reason they are not yet the default is not technical. This post is a small contribution to making the conversation louder.

My next post examines what else an attacker can do with a shared cache once confidentiality is off the table.

Appendix A — Environment

llama.cpp setup (Section 5):

llama.cpp, release build with Metal acceleration
Model: TinyLlama 1.1B (GGUF, ~670 MB)
Hardware: Apple M4 Pro, 24 GB unified memory
Server flags: -c 4096 --cache-reuse 16 -ngl 99 --port 8000

vLLM setup (Section 5.1):

vLLM version: 0.19.1
Model: Qwen/Qwen2.5-3B-Instruct
Hardware: RTX 3090 (24 GB VRAM), 32 vCPUs
Server flags: --enable-prefix-caching --max-model-len 4096 --port 8000
KV cache block size: 16 tokens (vLLM default)
No cache salt / tenant isolation configured (intentional — demonstrates the unmitigated attack surface)