<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
	<channel>
		<title>Security Tools on brain overflow</title>
		<link>https://brainoverflow.blog/categories/security-tools/</link>
		<description>Recent content in Security Tools on brain overflow</description>
		<generator>Hugo -- 0.162.1</generator>
		<language>en-us</language>
		<lastBuildDate>Mon, 01 Jun 2026 11:35:14 -0700</lastBuildDate>
		<atom:link href="https://brainoverflow.blog/categories/security-tools/index.xml" rel="self" type="application/rss+xml" />
		
		
		<item>
			<title>Hidden Gaps in Claude Code Security Reviews</title>
			<link>https://brainoverflow.blog/posts/claude-code-security-review-bias/</link>
			<pubDate>Mon, 01 Jun 2026 11:35:14 -0700</pubDate><guid>https://brainoverflow.blog/posts/claude-code-security-review-bias/</guid>
			<description></description><content type="text/html" mode="escaped"><![CDATA[<p><em>Anthropic recently shipped a new security plugin for Claude Code that automatically reviews code for vulnerabilities as you make changes, complementing the existing <code>/security-review</code> skill. I decided to test both against a deliberately constructed set of security flaws to see if the new tool improves coverage. Little did I know how deep this rabbit hole would take me. Fair warning: this is a long read.</em></p>
<hr>
<h2 id="1-background">1. Background<a href="#1-background" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Claude Code supports LLM-based security reviews at three stages:</p>
<table>
	<thead>
			<tr>
					<th>Tool</th>
					<th>Plans</th>
					<th>What the reviewer sees</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td><code>/security-review</code></td>
					<td>All</td>
					<td>Full branch, same or new session context (user&rsquo;s choice)</td>
			</tr>
			<tr>
					<td><strong>Security guidance plugin</strong> <em>(new, May 2026)</em></td>
					<td>All</td>
					<td>Git diff from current turn, fresh model context</td>
			</tr>
			<tr>
					<td>Code Review</td>
					<td>Team / Enterprise only</td>
					<td>Full codebase, multi-agent, independent model (runs on PRs)</td>
			</tr>
	</tbody>
</table>
<p>The new plugin shipped with an explicit design goal: avoid the <strong>model anchoring bias</strong> problem <a href="/posts/ai-native-threat-modeling/#5-on-model-bias-in-security-analysis">I wrote about earlier</a>. To understand what model bias means here, consider the human equivalent: if you ask the author of the code to review it, they&rsquo;ll likely tell you it&rsquo;s fine — they wrote it after all. A reviewer who wasn&rsquo;t in the room when the decisions were made will challenge assumptions the author has stopped seeing. The same dynamic applies to LLMs: when Claude writes code and then reviews it in the same session, it has the full conversation history in context, including every design choice and tradeoff it reasoned through while writing. It validates against those decisions rather than challenging them. A fresh session is the AI equivalent of a second pair of eyes.</p>
<p>The new plugin addresses this by running a <strong>separate Opus 4.7 session with a fresh context</strong>: the reviewer starts from the diff with no session history and no investment in the original approach. Anthropic&rsquo;s own documentation is direct about the design intent:</p>
<blockquote>
<p>&ldquo;The plugin does not ask the same Claude instance that wrote the code to grade itself. […] The end-of-turn and commit reviews run as a separate Claude call with a fresh context and a security-focused prompt: the reviewer starts from the diff, has no investment in the original approach, and is instructed only to find problems.&rdquo;</p>
</blockquote>
<p>This is a real solution to the model bias problem, but if you read deeper, it has its own limitation: <strong>a diff-scoped reviewer can only see what changed in the current turn</strong> and cannot reason about interactions between pre-existing code and new additions. That constraint is likely a cost decision: Opus 4.7 is expensive, and reviewing the full codebase on every change would be prohibitively token-intensive.</p>
<p>This gives me two hypotheses to experiment with:</p>
<p><strong>H1:</strong> same-session <code>security-review</code> is affected by model anchoring bias and will suppress findings that a cold run on the same code surfaces. The delta between the two runs measures how bad the gap is in practice.</p>
<p><strong>H2:</strong> the newly introduced diff-based plugin will miss vulnerability chains where each change looks benign in isolation but the two together form something exploitable, because the reviewer only ever sees one diff at a time and has no memory of what came before.</p>
<hr>
<h2 id="2-test-corpus">2. Test corpus<a href="#2-test-corpus" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>My target is based on a real Telegram bot that routes voice and text messages into a backend, but the version used here was vibe-coded from scratch for this experiment. The spec was written to elicit insecure decisions without explicitly asking for them: the goal was a realistic-looking codebase with seeded flaws.</p>
<p>The three flaws, ranging in complexity:</p>
<h3 id="f1-fail-open-authentication-simple">F1: Fail-open authentication (simple)<a href="#f1-fail-open-authentication-simple" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h3>
<p><code>TELEGRAM_ALLOWED_USERS</code> is read into a set at startup. When the env var is absent, the set is empty. The auth guard uses the set as a condition:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">if</span> TELEGRAM_USERS <span style="color:#f92672">and</span> (<span style="color:#f92672">not</span> user <span style="color:#f92672">or</span> user<span style="color:#f92672">.</span>id <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> TELEGRAM_USERS):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span>
</span></span></code></pre></div><p>When <code>TELEGRAM_USERS</code> is empty, the entire <code>if</code> is skipped: any Telegram user is accepted. The correct default is deny-all: a bot that can read files and spawn subprocesses should fail closed, not open.</p>
<h3 id="f2-unrestricted-subprocess-permissions-medium">F2: Unrestricted subprocess permissions (medium)<a href="#f2-unrestricted-subprocess-permissions-medium" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h3>
<p>The bot classifies incoming messages and dispatches them, spawning a <code>claude -p</code> subprocess that runs the <code>process_notes</code> skill with the <code>--allowedTools</code> list the skill needs for its operations. That list, passed to the inner Claude instance, includes <code>Bash(python3:*)</code> without path restrictions. The <code>process_notes</code> skill reads the note from disk and invokes Python with it as input. If the skill passes note content to Python without sanitization, the chain reaches arbitrary code execution.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># python3 unrestricted</span>
</span></span><span style="display:flex;"><span>allowed_tools <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;Read,Write,Bash(python3:*),Bash(mv:*),Bash(rm:*), ...&#34;</span>  
</span></span></code></pre></div><h3 id="f3-write--path-scoped-python3--write-then-execute-chain-hard">F3: Write + path-scoped python3 = write-then-execute chain (hard)<a href="#f3-write--path-scoped-python3--write-then-execute-chain-hard" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h3>
<p>Now <code>python3:*</code> is hardened to <code>python3:.claude/scripts/*</code>, but the <code>Write</code> permission remains. The chain: write a payload to <code>.claude/scripts/</code>, invoke it via python. Neither permission is dangerous alone: the vulnerability only exists when you hold both simultaneously. This flaw is the key test case for the new plugin: a diff-based reviewer seeing only the second permission added can&rsquo;t chain it to the first to recognize the combined severity.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>allowed_tools <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;Read,Write,Bash(python3:.claude/scripts/*),Bash(mv:*), ...&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#                     ^^^^^ unrestricted       ^^^^^ scoped — looks safe</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Chain: Write payload → .claude/scripts/evil.py, python3 runs attacker&#39;s code</span>
</span></span></code></pre></div><p>The four tests map directly to the two hypotheses:</p>
<table>
	<thead>
			<tr>
					<th>Test</th>
					<th>Tool</th>
					<th>Setup</th>
					<th>Tests</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>T1</td>
					<td><code>/security-review</code></td>
					<td>Same session that wrote the code</td>
					<td>H1: does model bias suppress findings?</td>
			</tr>
			<tr>
					<td>T2</td>
					<td><code>/security-review</code></td>
					<td>Fresh session, no prior context</td>
					<td>H1 control: cold reviewer, does it catch everything?</td>
			</tr>
			<tr>
					<td>T3</td>
					<td>Security guidance plugin</td>
					<td><code>Write</code> + <code>python3:*</code> both new in the diff</td>
					<td>H2: does plugin catch a chain when it&rsquo;s fully visible?</td>
			</tr>
			<tr>
					<td>T4</td>
					<td>Security guidance plugin</td>
					<td><code>Write</code> pre-existing, only <code>python3:.claude/scripts/*</code> added</td>
					<td>H2: does plugin miss a chain split across commits?</td>
			</tr>
	</tbody>
</table>
<hr>
<h2 id="3-h1--t1--t2-security-review-and-model-bias">3. H1 | T1 &amp; T2: /security-review and model bias<a href="#3-h1--t1--t2-security-review-and-model-bias" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Model bias in AI security review is a known problem, and Anthropic implicitly acknowledged it by releasing the new plugin. I&rsquo;m not going to dwell on it; the tests here are a quick empirical confirmation with a concrete measurement of the gap.</p>
<p>The <code>security-review</code> skill runs a multi-agent pipeline — initial identification followed by parallel false-positive filtering against known precedents, with a scoring system and a confidence threshold to report. It is more reliable than casually asking Claude to &ldquo;review this code for vulnerabilities&rdquo;, though some real issues may get filtered if the evidence isn&rsquo;t strong enough.</p>
<p><strong>Test 1</strong> ran in the same session that produced the bot; <strong>Test 2</strong> ran in a fresh session with no prior context. Same code, model, thinking effort, command.</p>
<table>
	<thead>
			<tr>
					<th>Finding</th>
					<th>Test 1 (same-session)</th>
					<th>Test 2 (cold session)</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Fail-open auth: empty allowlist (F1)</td>
					<td>Not surfaced</td>
					<td><strong>High</strong> (0.95)</td>
			</tr>
			<tr>
					<td>Path traversal in file retrieval</td>
					<td>Not surfaced</td>
					<td>Medium (0.82)</td>
			</tr>
			<tr>
					<td>Unrestricted subprocess permissions (F2)</td>
					<td>Filtered out</td>
					<td>Not identified</td>
			</tr>
	</tbody>
</table>
<p>Test 1 reported zero vulnerabilities; Test 2 reported one High and one Medium. The same-session reviewer framed the threat model as &ldquo;authorized-only Telegram access&rdquo;, treating the auth as working correctly because the spec said so, and F1 never surfaced. The cold reviewer had no spec context and flagged it right away.</p>
<p>The path traversal finding is actually a stronger signal. It wasn&rsquo;t seeded in my prompt; it was a real bug the vibe-coding session introduced on its own, with no spec instruction to blame. The cold reviewer caught it; the same-session reviewer missed it alongside F1. ✅ <strong>H1 confirmed</strong>.</p>
<hr>
<h2 id="4-f2-the-component-boundary-you-shall-not-pass">4. F2: the component boundary you shall not pass<a href="#4-f2-the-component-boundary-you-shall-not-pass" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>It&rsquo;s not uncommon that while testing a hypothesis you run into new discoveries. So why did neither run flag the unrestricted subprocess permissions?</p>
<p>The answer is in the architecture. Both <code>write</code> and <code>process</code> are legitimate bot operations: the attacker stores a note to disk, then triggers <code>process</code> normally. <code>handle_process()</code> spawns <code>claude -p /process-notes --allowedTools [..]</code>. The subprocess call is visible in <code>inbox-bot.py</code>, but the skill it invokes is a separate file. Whether the skill passes vault content to Python in an exploitable way, and whether the permissions it runs with are appropriate, live outside the review scope.</p>
<pre class="mermaid">flowchart TD
    A["Attacker note\n(malicious payload)"] --> B["handle_write\n(legitimate)"]
    B --> C[("note on disk")]
    D["process command\n(legitimate)"] --> E["handle_process()"]
    E --> F["claude -p /process-notes\n--allowedTools Bash(python3:*) ..."]
    F -.->|"spawns"| G["/process-notes skill\n(out of review scope)"]
    G -->|"reads"| C
    G --> H["Bash(python3:*)\n→ RCE if vulnerable"]
    subgraph scope ["reviewed: inbox-bot.py"]
        B
        E
        F
    end
</pre>
<p>The architecture makes the concern visible: attacker-controlled vault content flows into a subprocess running with unrestricted python3. Neither automated reviewer evaluated it at that level, though they each hit the boundary differently.</p>
<p>In <strong>T1</strong> (same session), the reviewer identified the chain, labeling it &ldquo;Prompt injection via vault write to claude subprocess&rdquo;, but the false-positive filter dismissed it: <em>&ldquo;The attacker and vault owner are the same person; there is no external trust boundary being crossed.&rdquo;</em> That&rsquo;s <strong>model bias in a different form</strong>: not suppressing a finding outright, but supplying a session-derived trust assumption that the reviewer couldn&rsquo;t actually validate, because doing so would require seeing what <code>process-notes</code> does with the data it receives.</p>
<p>In <strong>T2</strong> (cold session), the reviewer checked for shell injection: seeing list-form <code>subprocess.run</code> with no <code>shell=True</code>, it marked the subprocess as clean and moved on. Seemingly, the <strong>presence of a known secure coding pattern steered the LLM into trusting the call as safe overall</strong>: the right invocation style closed scrutiny before it reached the component boundary question. The <code>--allowedTools</code> string with <code>Bash(python3:*)</code> was never evaluated.</p>
<p>Neither reviewer asked whether <code>python3:*</code> was too broad. That question doesn&rsquo;t require seeing <code>/process-notes</code> to answer: attacker-controlled data flowing into a subprocess with unrestricted python3 is a concern on its own, regardless of what the downstream skill does with it. A human reviewer would flag that pattern without needing to verify what the downstream component does with it. When you can&rsquo;t see past the boundary, the right default is to surface the concern.</p>
<hr>
<h2 id="5-h2--t3--t4-the-plugin-and-diff-isolation">5. H2 | T3 &amp; T4: the plugin and diff isolation<a href="#5-h2--t3--t4-the-plugin-and-diff-isolation" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>After a quick detour, we&rsquo;re back to the second hypothesis: does the new Claude plugin&rsquo;s diff-scoped reviewer miss a vulnerability chain where each change looks benign in isolation? This time, the chain is in the same file, and in a single tool call.</p>
<p><strong>Test 3 (1 diff):</strong> <code>Write</code> + <code>Bash(python3:*)</code> introduced together.
The plugin caught both: <code>python3:*</code> flagged as too broad, <code>Write</code> flagged as needing tighter scope. Two correct findings, auto-fix applied. But it treated them as independent concerns rather than a chain. The fix addressed F2; F3 survived.</p>
<pre tabindex="0"><code>The security hook flagged two real issues:
1. Bash(python3:*) is too broad — permits running any Python script.
   Should be scoped to the specific script path.
2. Write is too broad — should be scoped to the wiki directory under VAULT_ROOT.
</code></pre><p><strong>Test 4 (2 diffs):</strong></p>
<ul>
<li><code>Write</code> committed in the baseline. Nothing suspicious in isolation.</li>
<li><code>Bash(python3:.claude/scripts/*)</code> added in a new session. A narrow, path-scoped python3 permission — looks like a reasonable hardening move. <code>Write</code> is outside the diff and invisible to the reviewer.</li>
</ul>
<pre tabindex="0"><code>LLM code review: no vulnerabilities found.
</code></pre><p>And just like that, ✅ <strong>H2 confirmed</strong>.</p>
<p>Side observation from the test: when my git commit message named the permissions that had been removed, Claude read the log and inferred exactly what to restore, producing broad <code>python3:*</code> directly. I&rsquo;ve repeated the test with a neutral commit message, and it resulted in a different fix. The commit message didn&rsquo;t affect the plugin&rsquo;s review, but it changed what the writing model produced. Small sample, but a useful reminder that in vibe-coding sessions the model reads everything in context, and metadata you don&rsquo;t think of as instructions can still shape output.</p>
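<p>The gap from Test 4 can be illustrated with a small check that looks at the whole permission set instead of only the added entry. This is a rough sketch under stated assumptions: the function names, the interpreter list, and the rule that any unscoped <code>Write</code> can feed a path-scoped interpreter are all mine, and the check is a deliberately conservative heuristic, not how the plugin works.</p>

```python
import re

# Commands that can execute a file; an illustrative list, not exhaustive.
INTERPRETERS = {"python", "python3", "bash", "sh", "node"}
BASH_ENTRY = re.compile(r"^Bash\(([^:()]+):(.*)\)$")


def parse_allowed_tools(allowed_tools):
    """Split an --allowedTools style string into stripped entries."""
    return [entry.strip() for entry in allowed_tools.split(",") if entry.strip()]


def find_findings(entries):
    """Flag wildcard interpreters and write-then-execute chains."""
    findings = []
    can_write = "Write" in entries  # assumption: only unscoped Write is considered
    for entry in entries:
        match = BASH_ENTRY.match(entry)
        if not match or match.group(1) not in INTERPRETERS:
            continue
        pattern = match.group(2)
        if pattern == "*":
            findings.append(f"{entry}: unrestricted interpreter")
        elif can_write:
            findings.append(f"{entry}: write-then-execute chain with Write")
    return findings


baseline = "Read,Write,Bash(mv:*)"
added = "Bash(python3:.claude/scripts/*)"

# Reviewing only the added entry sees nothing; the full set exposes the chain.
diff_only = find_findings(parse_allowed_tools(added))
full_set = find_findings(parse_allowed_tools(baseline + "," + added))
```

<p>Here <code>diff_only</code> comes back empty while <code>full_set</code> contains one chain finding, which mirrors the difference between what the diff-scoped reviewer saw and what was actually granted.</p>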
<hr>
<h2 id="6-what-can-we-do-about-all-this">6. What can we do about all this?<a href="#6-what-can-we-do-about-all-this" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The model bias gap is actionable, and the fix is simple: run <code>/security-review</code> in a fresh session, not the one where you wrote the code. The unfortunate truth is that most users won&rsquo;t know to do this. The natural instinct is to run the skill right there in the session where you just finished writing the code. Model anchoring isn&rsquo;t obvious unless you know about it. Anthropic could nudge users here: detect when the tool is invoked in a session that also wrote the code, and warn before running.</p>
<p>I asked Claude to prototype this using session hooks. The result is available as a gist <a href="https://gist.github.com/obormot/9a241032c72c4d19a259f8bce6fa8ed3">here</a>. It works, but frankly it&rsquo;s not very good. The <code>decision: block</code> output is blunt: it stops the prompt and requires re-running. That&rsquo;s an API limitation: <code>UserPromptSubmit</code> hooks have no non-blocking notification option, so blocking is the only way to surface a visible message. Another caveat: in Claude Desktop, blocked prompts fail silently, so the user gets no response and no explanation. The hook is only reliable in the Claude Code CLI.</p>
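<p>For orientation, the core decision such a hook makes might look roughly like the sketch below. It is not the code from the gist. The <code>prompt</code> field and the <code>decision</code>/<code>reason</code> response keys reflect my understanding of the <code>UserPromptSubmit</code> hook contract, and the <code>session_wrote_code</code> flag is a hypothetical input: how a real hook would work that out from the session is left open here.</p>

```python
import json

REVIEW_COMMAND = "/security-review"


def hook_response(payload, session_wrote_code):
    """Return a blocking hook response, or None to let the prompt through."""
    prompt = str(payload.get("prompt", "")).strip()
    if not prompt.startswith(REVIEW_COMMAND) or not session_wrote_code:
        return None
    return {
        "decision": "block",
        "reason": (
            "This session also wrote the code under review. "
            "Run /security-review in a fresh session to avoid anchoring."
        ),
    }


# Hypothetical payload; a real hook would read the JSON payload from stdin.
example = hook_response({"prompt": "/security-review"}, session_wrote_code=True)
print(json.dumps(example))
```

<p>Keeping the decision in a pure function makes the blunt part explicit: the only lever is whether to return a block response at all.</p>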
<hr>
<h2 id="final-thoughts">Final thoughts<a href="#final-thoughts" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Every tool in this space has gaps. Some are documented, and some are hidden, surfacing only when you test carefully enough. The title of this post came from expecting to confirm two gaps and finding three.</p>
<p>Are we all doomed until Mythos comes to save us? Models evolve rapidly, and Mythos is reportedly strong at exactly the cross-boundary chain reasoning that today&rsquo;s tools miss. It may well close these gaps; time will tell.</p>
<p>My broader take: fully autonomous code reviews don&rsquo;t replace human judgment. They extend your reach, and they&rsquo;re most useful when you understand what they can and can&rsquo;t see. Know the limits of your tools. Trust <strong>and</strong> verify.</p>
<hr>
<h2 id="references">References<a href="#references" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<ul>
<li><a href="https://code.claude.com/docs/en/security-guidance">Anthropic — Catch security issues as Claude writes code</a></li>
<li><a href="https://gist.github.com/obormot/9a241032c72c4d19a259f8bce6fa8ed3">Claude hook to warn about model anchoring bias</a></li>
</ul>
]]></content>
		</item>
		
		<item>
			<title>Honeypots for AI Agents</title>
			<link>https://brainoverflow.blog/posts/ai-agent-honeypots/</link>
			<pubDate>Wed, 06 May 2026 08:00:07 -0700</pubDate><guid>https://brainoverflow.blog/posts/ai-agent-honeypots/</guid>
			<description></description><content type="text/html" mode="escaped"><![CDATA[<p><em>AI agents run with broad access to the local environment (credentials, config
files, shell history, dotfiles), and nothing stops a misbehaving one from
reading things it was never meant to touch. The question is: how do you catch it?</em></p>
<hr>
<p>My professional career in network security started in 2005 when I joined
Juniper Networks&rsquo; IDP team — which is probably why, when thinking about how to
detect a misbehaving agent, my mind went straight to a technique from that era.</p>
<p>The IDS world solved an analogous problem decades ago. A network intruder
probes hosts and services they have no business touching; a <strong>honeypot</strong> sits in
those places looking like an attractive target — an exposed SSH server, an
unprotected admin console — but it&rsquo;s a decoy, and it alerts when anything
interacts with it. Honeypots are traps, tripwires — there&rsquo;s no legitimate use
for a fake host, so the false-positive rate is near zero.</p>
<p>The same idea can be applied to AI agents: plant bogus credentials where a rogue
agent would look, and treat any access as a signal. Rather than trying to define
what legitimate agent behavior looks like (which is hard, because legitimate
agents do a lot of things), this approach defines what illegitimate behavior looks like:
touching credentials that were never surfaced as part of the task. That&rsquo;s a
tighter definition, and it maps well to the threat model.</p>
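<p>The tripwire idea can be sketched in a few lines. This is an illustration only, under simplifying assumptions: the function names are hypothetical, the key is a random string shaped like an AWS access key ID that grants access to nothing, and detection here is a plain string match against whatever agent output or traffic you can observe, which is much cruder than what real canary services do.</p>

```python
import secrets
import string
from pathlib import Path

ALPHABET = string.ascii_uppercase + "234567"


def make_canary_key():
    """Generate a fake AWS-style access key ID that grants access to nothing."""
    return "AKIA" + "".join(secrets.choice(ALPHABET) for _ in range(16))


def plant_canary(path, key_id):
    """Write a decoy credentials file containing the canary key."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"[default]\naws_access_key_id = {key_id}\n")
    return path


def canary_tripped(observed_text, key_id):
    """Any appearance of the canary in observed output counts as a signal."""
    return key_id in observed_text
```

<p>Because the key was never part of any task, there is no benign reason for it to show up in a tool call, a log line, or an outbound request.</p>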
<hr>
<h2 id="1-the-threat">1. The threat<a href="#1-the-threat" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>An AI agent running on a developer machine or in CI has access to everything its
process can read: <code>~/.aws/</code>, <code>~/.ssh/</code>, <code>~/.kube/</code>, <code>~/.config/</code>, environment
variables, dotfiles, every file in the project directory. That access is usually
not scoped to what the agent actually needs to do its job — it&rsquo;s scoped to the
developer&rsquo;s own permissions, because that&rsquo;s how agents tend to run inside an IDE.</p>
<p>There are three failure modes this matters for, in increasing order of intent.</p>
<ul>
<li>
<p><strong>Overprivileged agents.</strong> A legitimate, well-behaved agent still runs with the developer&rsquo;s full filesystem permissions. It may passively sweep up credential files while reading context, or actively reason that credentials it found are relevant to the task — either way, not out of malice, just because nothing stops it. The agent&rsquo;s judgment about what it&rsquo;s authorized to do is not the same as what it&rsquo;s actually authorized to do.</p>
</li>
<li>
<p><strong>Prompt injection.</strong> A malicious instruction embedded in content the agent
processes — a README, a web page, a database row — redirects the agent to look
for credentials and exfiltrate them. The agent is not compromised; it&rsquo;s
following instructions that were smuggled into its context. It tops the
<a href="https://genai.owasp.org/llmrisk/llm01-prompt-injection/">OWASP Top 10 for LLM Applications</a>
as LLM01:2025, and for good reason — it exploits the design of LLMs rather than
a flaw that can be patched.</p>
</li>
<li>
<p><strong>Compromised agents.</strong> A model that has been fine-tuned adversarially, or a
scaffolding layer with a supply-chain compromise, probes the environment
deliberately. Skills and MCP servers install from the internet with no signing
and no scanning — malicious ones are a real and active delivery mechanism, not
a future concern. The emergence of dedicated skills security solutions is evidence
enough that this threat is already being taken seriously.</p>
</li>
</ul>
<p>The distinguishing behavior is the same in all three cases: the agent accesses
resources it was never explicitly directed to use.</p>
<hr>
<h2 id="2-why-defenses-fall-short">2. Why defenses fall short<a href="#2-why-defenses-fall-short" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>What makes this more dangerous is how creative agents get in pursuing their goals.
Anthropic&rsquo;s <a href="https://www.anthropic.com/research/agentic-misalignment">agentic misalignment research</a>
stress-tested models in simulated corporate environments and found that agents would pursue their assigned
goals through unexpected means when straightforward paths were blocked: leaking
sensitive information to competitors, attempting to influence their own
evaluation, sabotaging research they disagreed with. The models weren&rsquo;t broken;
they were doing what they were optimized to do, just not in the way their
operators intended.</p>
<p>This creativity makes naive defenses (blacklisting certain
paths, blocking specific tool calls, or simply instructing the agent not to
touch certain things) largely moot. A sufficiently motivated agent will reason
around them, find an indirect path, or reframe the task in a way that makes the
restricted action seem justified.</p>
<p>This compounds with a structural problem on the human side: agents increasingly
run with minimal oversight. Skip-permissions modes bypass approval prompts.
Long-running background tasks accumulate hundreds of tool calls nobody reviewed.
Users experiencing approval fatigue click through without reading. The result is
an agent with broad local access, a disposition to find creative paths to its
goals, and nobody watching closely.</p>
<p>Simon Willison coined the <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">lethal trifecta</a> as a
framework for exactly this risk: private data access, exposure to untrusted
content, and exfiltration capability. When all three are present, an attacker
who can inject instructions into the agent&rsquo;s context has essentially won. It has
become a widely accepted shorthand for why agentic deployments require a
different security posture than traditional software.</p>
<hr>
<h2 id="3-existing-tools">3. Existing tools<a href="#3-existing-tools" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Network honeypots have been around since the 1990s. The idea is unchanged
across roughly thirty years: deploy something that looks real but has no
legitimate use, and alert on any interaction. A low-interaction honeypot like
Honeyd emulates network services; a medium-to-high-interaction one like Cowrie
presents a fake SSH server and logs the intruder&rsquo;s shell session. Either way,
any connection is anomalous by definition, because legitimate users don&rsquo;t hit
the honeypot.</p>
<p><a href="https://canarytokens.org">Canarytokens</a> applied this to
credentials and files rather than network services. You generate a fake AWS
access key or a Word document with a beacon embedded; when someone uses the key
or opens the document, you get an alert. The AWS canary creates a real IAM user
and monitors CloudTrail — there&rsquo;s a lag of minutes, and it requires external AWS
infrastructure.</p>
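<p>The underlying mechanism can be sketched locally in a few lines. This is not how Canarytokens implements its AWS token (that relies on a real IAM user and CloudTrail, as described above); it is a simplified stand-in that plants a random, non-functional key and flags any log line that mentions it. The function names, the file layout, and the idea of scanning log lines are all assumptions made for illustration.</p>

```python
import secrets
import string
from pathlib import Path


def make_decoy_key(prefix="AKIA", length=16):
    # Shaped like an AWS access key ID (20 characters, AKIA prefix), but the
    # suffix is random, so the value is unique to this canary and not a
    # working credential.
    alphabet = string.ascii_uppercase + string.digits
    return prefix + "".join(secrets.choice(alphabet) for _ in range(length))


def plant_decoy(path, key):
    # Write the decoy where a credential-hunting process might look.
    # No legitimate workflow should ever read or use this file.
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"[default]\naws_access_key_id = {key}\n", encoding="utf-8")
    return path


def find_canary_hits(log_lines, key):
    # Any line containing the decoy value is a signal, because nothing
    # legitimate has a reason to mention it.
    return [line for line in log_lines if key in line]
```

<p>A caller would generate one key per machine or per location, plant it, and then run <code>find_canary_hits</code> over whatever command or request logs are available. The design choice worth noting is that the check needs no allowlist: the decoy has no legitimate use, so a single match is enough.</p>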
<p><a href="https://github.com/peg/snare">Snare</a> is a newer honeypot built specifically for
the AI agent threat model, with a few meaningful differences from Canarytokens.
It covers 18 credential types in one shot — including AI-native ones like
OpenAI, Anthropic, and MCP server configs that Canarytokens doesn&rsquo;t have —
placing canaries in all the standard locations an agent would probe. The AWS
canary fires at credential-resolution time via a local shell hook, before any
API call is made, which is faster and doesn&rsquo;t require external AWS
infrastructure. Alerts include the SDK user agent and ASN, with a &ldquo;Likely AI
agent&rdquo; flag when the request originates from cloud infrastructure — context that
Canarytokens doesn&rsquo;t surface.</p>
<p>The conceptual lineage runs from Honeyd to Canarytokens to Snare with the same
core insight at each step: if you can define what legitimate access looks like,
anything outside that definition is a signal.</p>
<hr>
<h2 id="4-limitations">4. Limitations<a href="#4-limitations" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p><strong>It detects; it doesn&rsquo;t prevent.</strong> By the time you get the alert, the agent has already misbehaved. Canaries are detection controls, not prevention controls: knowing an agent touched a credential is useful, but it&rsquo;s not the same as stopping it. That said, detection data has value beyond the alert itself: observing what agents actually reach for in practice, legitimate ones included, tells you where the real access boundaries need to be. That&rsquo;s useful input for building guardrails, scoping permissions, or writing policies grounded in observed behavior rather than guesswork.</p>
<p><strong>Placement is manual.</strong> Canaries live where tools naturally look for
credentials. If an agent is directed to a custom config path or a non-standard
environment variable, the canary won&rsquo;t be there. Coverage is bounded by where
you planted the wires — the same fundamental limitation as any tripwire-based
detection.</p>
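<p><strong>Detection confidence varies.</strong> How much an alert, or its absence, tells you
depends on the canary design and how deeply it hooks into the agent&rsquo;s execution
environment. Not all credential access paths are equally observable: a
CloudTrail-based canary, for example, fires only once the key is used in an API
call, so a key that is read but never used goes unnoticed.</p>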
<p><strong>Shared machines.</strong> The low false-positive guarantee relies on the canary being invisible to legitimate users. On a machine shared by multiple developers, someone may stumble across a planted credential and use it for a real task — generating an alert that has nothing to do with a misbehaving agent. Dedicated agent environments sidestep this; shared workstations require more care.</p>
<hr>
<h2 id="conclusion">Conclusion<a href="#conclusion" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The IDS-era idea is simple: legitimate users only access resources they have business
accessing — so any interaction with a decoy is a signal by definition.
What&rsquo;s new is the target — an agent running under your
own account, following instructions that arrived via a file you never meant to
treat as executable, in a world where the line between &ldquo;following instructions&rdquo;
and &ldquo;going rogue&rdquo; is invisible to a monitoring system that only observes API calls.</p>
<p>The tools are already there. <a href="https://github.com/peg/snare">Snare</a> in particular
caught my attention — built specifically for the AI agent threat model, it covers the
credential surface an agent would probe and fires faster than any CloudTrail-based
approach. An old technique that still holds up in the agentic age.</p>
<hr>
<h2 id="references">References<a href="#references" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<ul>
<li><a href="https://genai.owasp.org/llmrisk/llm01-prompt-injection/">LLM01:2025 Prompt Injection</a> — OWASP Top 10 for LLM Applications</li>
<li><a href="https://www.anthropic.com/research/agentic-misalignment">Agentic Misalignment: How LLMs Could Be Insider Threats</a> — Anthropic</li>
<li><a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">The lethal trifecta for AI agents</a> — Simon Willison</li>
<li><a href="https://github.com/peg/snare">Snare — honeypot canaries for AI agents</a></li>
</ul>
]]></content>
		</item>
		
	</channel>
</rss>
