Can't, not shouldn't: why AI security is architecture, not a rulebook
You don't want your AI agent reading the .env? So you tell it not to — "don't access secrets, never print tokens" — in CLAUDE.md, in the system prompt, in the tool description. It feels like a control. It isn't. It's a request. And a request is not a security boundary.
That's the reflex almost everyone starts with — and the most expensive mistake in working with AI agents. It conflates two very different things: telling a system it should behave, and building a system where the misbehaviour isn't possible in the first place. Shouldn't — versus can't.
TL;DR — A prompt rule tells the agent what it shouldn't do. Architecture makes sure it can't. Only the second holds when it matters. AI security is a question of structure, not instruction.
Why "shouldn't" doesn't hold
An instruction in a prompt is advice, and advice gets ignored or overridden: by this agent, by the next one, by a tool wrapped around it, or by an attacker who slips an instruction into the very data the agent already reads (indirect prompt injection). You cannot foresee every input your rule has to survive.
And this isn't my private opinion — it's the state of the research. Even the people who build these guardrails can't make them hold:
- A team from Lancaster University / Mindgard tested six prominent protection systems, including Microsoft's Azure Prompt Shield and Meta's Prompt Guard, and evaded them with simple character injection, reaching up to 100% success in some cases (“Bypassing LLM Guardrails”, arXiv).
- When OpenAI shipped its own Guardrails framework in October 2025, it was bypassed within days — by HiddenLayer, who prompt-injected the very model meant to detect prompt injection (HiddenLayer: “Same Model, Different Hat”).
The second case exposes the structural problem behind "an AI guarding an AI": the guard inherits the weakness of the thing it guards. Put a language model in judgment over a language model, and the same hole sits in the judge as in the defendant. Stacking on more models doesn't harden the boundary — it just makes it more expensive.
"Shouldn't" is not "can't"
Here is the line that actually matters. Every security measure sits on one of two sides:
- "Shouldn't" — depends on someone following a rule: the agent, the user, a guardrail model. It breaks the moment one of them doesn't — out of malice, manipulation, or plain carelessness at the next tool in the chain.
- "Can't" — the misbehaviour is architecturally impossible. There's no rule to follow and nothing to bypass. The attacker can be as creative as they like; the path simply doesn't exist.
From that follows a single question you can use to test any "AI security" on offer:
Does this depend on someone following a rule — or does it make the failure impossible?
If the answer is "a rule," treat it as a best-effort nudge, not a boundary. That's the first test for anything that calls itself a security boundary — detection, monitoring and rotation go on top, not in its place.
| Measure | Type | What an attacker needs to beat it |
|---|---|---|
| "Don't print secrets" in the system prompt | shouldn't | just a crafted instruction |
| Guardrail model screens input/output | shouldn't (probabilistic) | a crafted instruction |
| Secret never in the agent's context (env injection) | can't | an actual exploit (e.g. read process env) |
| Network egress on an allowlist | can't | an actual exploit (e.g. DNS exfiltration) |
| Sandbox + least privilege | can't | a sandbox escape |
| Generation separated from execution | can't | a gap in the policy |
The difference isn't "secure" versus "insecure" — structural boundaries aren't infallible. The point is different: you bypass a behavioural rule with a sentence; you bypass a structural boundary only with a real exploit. You force the attacker from typing an instruction to finding a vulnerability — from minutes to weeks, from anyone to a few. That's the win.
What "can't" looks like
The good news: the structural side isn't exotic. These are the same principles you already apply across any serious infrastructure — just carried through to the agent.
The secret doesn't exist in the agent's context. You give it the ability to use a credential, not to read it: the value lives in the environment of the child process the agent spawns, not in the conversation buffer the model reads. No prompt can make the model print a value that was never in its context. (A fully compromised agent with a free shell can still read it from the process environment or re-fetch it. That's a different threat model, the one you narrow with the next points.) I walked through the read-vs-use trick step by step in “How AI agents leak your credentials”.
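To make the read-versus-use split concrete, here is a minimal Python sketch. It assumes the agent runtime, not the model, holds the credential and spawns the tool. The names `run_with_secret`, `API_TOKEN` and the `[REDACTED]` marker are hypothetical, and the output scrubbing is an extra precaution I added for illustration, not something the pattern requires.

```python
import os
import subprocess
import sys

REDACTED = "[REDACTED]"  # hypothetical marker, chosen for this sketch


def run_with_secret(argv, secret_name, secret_value):
    """Run argv with the secret present only in the child's environment.

    Returns (exit_code, stdout). The secret itself is never returned, and an
    accidental echo in the child's output is replaced before the text could
    be handed back to the model.
    """
    child_env = dict(os.environ)          # copy, so the parent env is untouched
    child_env[secret_name] = secret_value
    result = subprocess.run(
        argv, env=child_env, capture_output=True, text=True, check=False
    )
    stdout = result.stdout
    if secret_value:                      # avoid str.replace("") edge case
        stdout = stdout.replace(secret_value, REDACTED)
    return result.returncode, stdout


if __name__ == "__main__":
    # The child *uses* the token (here it only reports its length);
    # the caller sees the exit code and scrubbed output, never the value.
    child = [sys.executable, "-c",
             "import os; print(len(os.environ['API_TOKEN']))"]
    code, out = run_with_secret(child, "API_TOKEN", "s3cr3t-demo-value")
    print(code, out.strip())
```

Only the return value of `run_with_secret` would ever be appended to the conversation. As the paragraph above notes, an agent with a free shell could still read its own process environment, so this narrows one path rather than closing all of them.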
Egress on an allowlist. The agent can only reach approved destinations. If the prompt says "send this to evil.example.com," it goes nowhere — that destination doesn't exist for it. It is not an air gap: data can still be smuggled out through DNS lookups or through an already-permitted destination — which is why DNS-layer control belongs here and permitted destinations stay semi-trusted. But the obvious channel — "send it somewhere" — is closed.
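The allowlist decision itself is small. The sketch below shows the kind of check an egress proxy might apply, with hypothetical hosts in `ALLOWED_HOSTS`. One assumption matters a lot: this logic has to run outside the agent's process (in a proxy or firewall), otherwise it is just another rule the agent is asked to follow. It also does nothing about DNS tunnelling or misuse of a permitted destination.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; a real one would come from deployment config.
ALLOWED_HOSTS = frozenset({"api.github.com", "pypi.org"})


def is_egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only for https URLs whose exact host is on the allowlist."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lowercased; excludes port and userinfo
    except ValueError:
        return False           # malformed URL: deny by default
    if parts.scheme != "https":
        return False
    return host is not None and host in allowed_hosts


if __name__ == "__main__":
    for candidate in (
        "https://api.github.com/repos",
        "https://evil.example.com/upload",
        "https://api.github.com@evil.example.com/",  # userinfo trick
    ):
        print(candidate, is_egress_allowed(candidate))
```

Matching on the parsed hostname rather than on a substring of the URL is the design choice that matters here: lookalike URLs that merely contain an approved name do not pass.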
A sandbox plus least privilege. Generated code runs isolated, with short-lived credentials and exactly the permissions the task needs — no more. That bounds the blast radius; it doesn't make execution "safe," because a coding agent has to run exactly the untrusted code that is the risk. So the robust form is disposable environments with no durable credentials: whatever goes wrong stays in the container instead of spreading.
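As a rough illustration of "disposable, with no durable credentials", the sketch below runs generated Python in a throwaway directory with an empty environment. It is not an isolation boundary on its own: a real setup would put a container or micro-VM around it. The function name `run_disposable`, the variable `DEMO_SECRET` and the timeout default are hypothetical, and the empty environment assumes a POSIX-like system.

```python
import os
import subprocess
import sys
import tempfile


def run_disposable(code, timeout_s=5.0):
    """Run untrusted Python in a temp dir with no inherited environment.

    Returns (exit_code, stdout). Raises subprocess.TimeoutExpired if the
    child exceeds timeout_s. The working directory is deleted afterwards.
    """
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            cwd=workdir,          # files written here vanish with the dir
            env={},               # nothing inherited, so no parent credentials
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    return result.returncode, result.stdout


if __name__ == "__main__":
    os.environ["DEMO_SECRET"] = "durable-credential"  # lives in the parent only
    rc, out = run_disposable(
        "import os; print('DEMO_SECRET' in os.environ)"
    )
    print(rc, out.strip())
```

The child still executes whatever code it is given, which is exactly the residual risk described above. What this sketch changes is what that code can find and what survives the run.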
Separate generation from execution. An independent layer decides whether an action is allowed, and it has to be a deterministic policy of capabilities and allowlists, not a second model that would inherit the same weakness. A hijacked plan then runs into a wall whenever it asks for something the policy doesn't already permit.
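A deterministic gate of this kind can be very plain. The following sketch assumes the model emits a plan as a list of tool and target pairs. The `Action` type, the `POLICY` table and the `gate` function are all hypothetical names for illustration, not an existing framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    tool: str    # capability the plan asks for, e.g. "read_file"
    target: str  # what the capability would act on


# Hypothetical policy: each tool maps to the exact targets it may touch.
POLICY = {
    "read_file": frozenset({"README.md", "src/app.py"}),
    "http_get": frozenset({"api.github.com"}),
}


def is_permitted(action, policy=POLICY):
    """Pure lookup: unknown tools and unlisted targets are denied."""
    allowed_targets = policy.get(action.tool)
    return allowed_targets is not None and action.target in allowed_targets


def gate(plan, policy=POLICY):
    """Split a model-generated plan into (permitted, denied) actions."""
    permitted = [a for a in plan if is_permitted(a, policy)]
    denied = [a for a in plan if not is_permitted(a, policy)]
    return permitted, denied


if __name__ == "__main__":
    plan = [
        Action("read_file", "README.md"),
        Action("read_file", ".env"),             # injected step
        Action("http_post", "evil.example.com"),  # tool not in policy at all
    ]
    permitted, denied = gate(plan)
    print("run:", permitted)
    print("blocked:", denied)
```

There is no text for an attacker to argue with: the gate never reads the prompt, only the structured action. Its limit is the one named above, a hijacked plan that stays inside what the policy already allows will pass.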
None of these trust the agent to behave. That is precisely the point.
It's not just me saying this
The industry has landed in the same place. The OWASP Top 10 for Agentic Applications (December 2025), under ASI05 – Unexpected Code Execution, doesn't recommend "tell the agent to be careful." It recommends separating generation from execution and running code in ephemeral micro-VMs or sandboxes (OWASP Gen AI Security Project). The NIST Generative AI Profile (NIST AI 600-1) treats prompt injection as a first-class risk to design against, not a footnote (NIST).
And the attack surface is real, not hypothetical: the "IDEsaster" research disclosed 30-plus vulnerabilities (24 assigned CVEs) across GitHub Copilot, Cursor, Windsurf, Claude Code and more in late 2025 — the AI IDEs tested were vulnerable across the board (reported by The Hacker News). The tools you already trust are part of the threat model. A prompt rule does nothing about that; a sandbox does.
The one question
Running AI agents in production isn't about teaching them to behave. It's about building the environment so the obvious mistake runs into nothing. Behaviour can be requested; architecture can be enforced — and only the enforced kind holds when a real attacker, a confused model, or just next quarter's deadline pressure leans on it.
So the next time someone sells you "we told the AI not to" as security, ask the one question: does this depend on the agent following a rule, or does it make the failure impossible? The first is theatre. The second is security.
Drawing those boundaries cleanly — so automation stays fast without quietly building up risk — is part of what I do. Let's talk.