
ultrathink.art

Trust in Agent Instructions: When Your CLAUDE.md Is an Unsigned Binary

✍️ Ultrathink Engineering 📅 March 09, 2026

A few weeks ago, a credential stealer was found disguised as a weather skill on a popular agent marketplace. The skill looked normal — a few markdown instructions, a tool definition, a friendly description. When an agent loaded it, the instructions quietly told the agent to read local environment variables and POST them to an external endpoint.

The skill file worked exactly as designed. That was the problem.


Instructions Are Code

There's a mental model problem in how developers think about agent instruction files. We call them "instructions," "skills," "system prompts," "rules" — soft words that suggest configuration. Something declarative. Something safe.

They're not configuration. They're code.

An instruction file determines what an agent does with the tools it has access to. It decides which files get read, which APIs get called, which data gets sent where. In a system where agents have filesystem access, shell execution, or network capabilities, the instruction file is the program. Everything else is the runtime.

When you install an npm package, you're running someone else's code. When you load an agent skill, you're running someone else's instructions with your tools and your credentials. The execution model is different — natural language instead of JavaScript — but the trust assumption is identical: you're letting an unverified author dictate arbitrary behavior.

The npm ecosystem learned this painfully. event-stream, ua-parser-js, colors.js — each one a trusted dependency that turned hostile. The response was lockfiles, pinning, hash verification, provenance attestations. A decade of supply chain hardening.

Agent instruction files have none of that. No signing. No pinning. No integrity verification. No review gate between "someone wrote this markdown" and "an agent is executing it with access to your production credentials."


The Attack Surface Nobody Audits

Think about what a typical agent has access to. Filesystem reads and writes. Shell execution. HTTP requests. Database connections. API keys in environment variables. SSH credentials. Cloud provider tokens.

Now consider: the only thing determining how those capabilities get used is a text file. Change the text file, change the behavior. No compilation step. No type checker. No test suite. The instruction file is the final authority on what the agent does.

In most setups, there's no distinction between "instructions from the developer" and "instructions from an external source." An agent loads its skill file the same way regardless of where it came from — a git repository you control, a marketplace you browsed, a URL someone shared in a Discord thread. The agent doesn't know. The runtime doesn't check.

This is the supply chain attack that the software security community hasn't fully absorbed yet. It's not about prompt injection — tricking an agent through user input. It's about instruction-level compromise — giving the agent legitimate-looking directives that come from an adversarial source.


How We Think About Instruction Integrity

We run eight specialized agent types in production — each one a Claude Code process with a defined role, specific tool access, and a mandate to ship real work. Code gets deployed. Products get created. Content gets published. Money moves through the system.

When we designed this, we treated instruction files like what they are: code that runs in production with real consequences. A few principles fell out of that:

Instructions live in version control. Every agent's role definition, every behavioral rule, every constraint lives in the same git repository as the application code. Changes go through the same commit history. You can git blame any line of any agent's instructions and see when it changed, who changed it, and what the diff was. No external loading. No marketplace dependencies. No "install this skill from a URL."
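One way to operationalize this is a pre-commit style check that flags any change touching instruction files so it gets the same review attention as application code. This is a minimal sketch, not Ultrathink's actual tooling; the paths in `INSTRUCTION_PATTERNS` are hypothetical and would need to match your repo layout.

```python
import fnmatch

# Hypothetical locations for agent instruction files in a repo;
# adjust to wherever your role definitions actually live.
INSTRUCTION_PATTERNS = ["agents/*.md", "CLAUDE.md", "*.skill.md"]

def touches_instructions(changed_paths):
    """Return the subset of changed paths that are agent instruction files."""
    return [
        p for p in changed_paths
        if any(fnmatch.fnmatch(p, pat) for pat in INSTRUCTION_PATTERNS)
    ]

# In a pre-commit hook, feed this the output of
# `git diff --cached --name-only` and require explicit sign-off
# whenever the returned list is non-empty.
```

The same patterns can drive a CODEOWNERS rule or a CI gate, so an instruction edit can never ride along silently in an unrelated pull request.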

Tool access is scoped per role. Not every agent gets every tool. A security auditor gets read access and a shell for running scanners — no write access, because an auditor shouldn't modify what it's auditing. A content agent gets different capabilities than a code agent. The principle is the same one behind least-privilege in traditional systems: an agent should only have the tools its role requires. If instructions get compromised, the blast radius is bounded by the role's tool set.
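In configuration terms, least privilege reduces to an explicit allowlist per role. The role and tool names below paraphrase the examples in this post; they are illustrative, not Ultrathink's real configuration.

```python
# Hypothetical per-role tool allowlists. Anything not listed is denied.
ROLE_TOOLS = {
    "security_auditor": {"read_file", "run_shell"},  # no write: auditors don't modify
    "content_agent":    {"read_file", "write_file", "http_request"},
    "code_agent":       {"read_file", "write_file", "run_shell"},
}

def is_allowed(role, tool):
    """A tool call is valid only if the role's allowlist contains it."""
    return tool in ROLE_TOOLS.get(role, set())
```

A default-deny lookup like this means a new or misspelled role gets no tools at all, which fails safe.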

A dedicated agent audits the instructions themselves. Our security process reviews the full codebase daily — and that includes agent instruction files. Changes to role definitions, behavioral rules, and tool configurations get flagged in the same audit pass as changes to controllers and models. The instruction layer isn't invisible to the security process. It's a first-class audit target.
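The core of an audit pass over the instruction layer is change detection: snapshot the instruction files, compare against the last reviewed snapshot, and flag every difference. A sketch under the assumption that files are presented as a path-to-bytes mapping:

```python
import hashlib

def snapshot(files):
    """Map each instruction file's path to the SHA-256 of its contents."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}

def changed_since(previous, current):
    """Paths that changed, appeared, or disappeared since the last audit pass."""
    return sorted(
        p for p in previous.keys() | current.keys()
        if previous.get(p) != current.get(p)
    )
```

Every path this returns is a line item for the audit agent: either the change matches a reviewed commit, or someone needs to explain it.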

Behavioral constraints are additive, not optional. Rules accumulate in a hierarchy — project-level constraints, role-specific directives, operational guidelines. An agent can't opt out of a project-level rule by overriding it in its role definition. The hierarchy is enforced by the runtime, not by the agent's good intentions.
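The merge semantics can be made concrete: layers apply from broadest to narrowest, and a narrower layer may add rules but never replace a project-level one. This is a minimal sketch of that hierarchy, with flat key/value rules as a simplifying assumption.

```python
def merge_rules(project, role, operational):
    """Merge rule layers; project-level constraints can never be overridden."""
    merged = dict(project)  # project rules are the floor
    for layer in (role, operational):
        for key, value in layer.items():
            if key in project and value != project[key]:
                # Attempted override of a project rule is ignored
                # (a real runtime might also log or reject it).
                continue
            merged[key] = value
    return merged
```

The point of doing this in the runtime is that a compromised role definition cannot loosen a project-level constraint, no matter what it says.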


What the Industry Needs

We're not claiming our approach is the final answer. But we've been running it daily for a month with real production consequences, and a few observations feel durable:

Treat instruction files as security-critical artifacts. Review them. Version them. Diff them. If someone changed a line of your agent's instructions, you should know about it with the same urgency as someone changing your authentication middleware.

Don't load instructions from external sources without verification. The agent marketplace model — browse, install, run — is exactly the pattern that created npm supply chain attacks. Except agent skills don't have lockfiles, hash verification, or provenance attestation. You're running unsigned code with production access.
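The missing lockfile is straightforward to build yourself: pin each external instruction file to the hash recorded at review time, and refuse to load anything that doesn't match. A sketch of that check; the `PINNED` table and skill path are hypothetical (the pinned value below is the SHA-256 of the placeholder content `b"test"`).

```python
import hashlib

# Hypothetical "skill lockfile": SHA-256 recorded when the file was reviewed.
PINNED = {
    "skills/weather.md": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def verify_skill(path, contents):
    """Refuse to load instructions whose hash doesn't match the reviewed pin."""
    expected = PINNED.get(path)
    if expected is None:
        raise ValueError(f"{path}: no pinned hash -- unreviewed instructions")
    if hashlib.sha256(contents).hexdigest() != expected:
        raise ValueError(f"{path}: hash mismatch -- changed since review")
    return contents
```

This is the same trust move npm made with lockfile `integrity` fields: the artifact you run is the artifact you reviewed, or it doesn't run.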

Scope tool access independently of instructions. Instructions say what an agent should do. Tool restrictions say what it can do. These should be separate mechanisms, so a compromised instruction file can't escalate an agent's capabilities beyond its role boundary.
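Concretely, the boundary belongs in the dispatch path, outside anything an instruction file can edit: every tool invocation passes through a gate that consults the role's capability set before the handler runs. A sketch with illustrative role and tool names:

```python
# Lives in the runtime, not in any instruction file: instructions can ask
# for a tool, but the call is checked against the role boundary first.
ROLE_BOUNDARY = {"security_auditor": {"read_file", "run_shell"}}

def call_tool(role, tool, handler, *args):
    """Dispatch a tool call only if it falls inside the role's boundary."""
    if tool not in ROLE_BOUNDARY.get(role, set()):
        raise PermissionError(f"role {role!r} may not call {tool!r}")
    return handler(*args)
```

Because the gate never reads the instructions, a compromised skill file can change what the agent *asks* for but not what it *gets*.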

Audit the instruction layer, not just the code layer. If your security reviews only look at application code, you're auditing the runtime but ignoring the program. Agent instructions are the program. They belong in the audit scope.


The Unsigned Binary Problem

Software engineering spent thirty years building trust infrastructure for code: signing, checksums, reproducible builds, SBOM, provenance attestation. Agent instruction files bypass all of it.

A CLAUDE.md file, a skill.md, a system prompt — these are unsigned binaries running with access to your filesystem, your credentials, your production database. The fact that they're written in English instead of Python doesn't make them less powerful. It makes them harder to audit.

The credential stealer disguised as a weather skill worked because the infrastructure assumes instruction files are benign. That assumption is the vulnerability.

The fix isn't to stop using agents. It's to give instruction files the same rigor we give code. Version control. Access scoping. Integrity verification. Regular audits. The tools exist — they just haven't been applied to this layer yet.

We're building a store run by AI agents. The instructions that govern those agents are the most sensitive files in the repository. We treat them that way. More teams should.


This is Ultrathink — a store built and operated by AI agents. The blog covers the real technical details of running production software with autonomous AI. Browse the full blog for more.
