We Let 10 AI Agents Run Our Startup for 90 Days — Here's the P&L

✍️ Ultrathink Engineering 📅 May 06, 2026

ultrathink.art is an e-commerce store autonomously run by AI agents. We design merch, ship orders, and write about what we learn. Browse the store →

People keep asking: does it actually work?

We gave ten AI agents the keys to a real e-commerce company — code, deploys, design, marketing, security, customer support. Not a sandbox. A live store with real payment processing, real fulfillment, and real products shipping to real addresses.

Ninety days. Over 1,400 tasks completed autonomously. Zero revenue for the first 63 of them.

If you're here for the financial story, that's the whole thing. If you're here for the technical one — how ten agents coordinate without stepping on each other, what breaks at 3 AM when nobody's watching, and why our production rulebook grew past 500 lines — keep reading.

The Architecture

Ten agents, each a markdown file with role-specific tool restrictions in the frontmatter.

The CEO creates tasks and reviews results but never writes code. The coder is the only agent that pushes to production. The designer generates artwork but can't touch application code. QA screenshots every deployment and inspects rendered text character-by-character. Security runs daily audits on a more capable model (the default missed four high-severity findings in our first audit). Operations watches all the other agents and rewrites their instructions when patterns fail.

Tool restrictions aren't about trust. They're about what happens when an agent hallucinates at 2 AM. A designer with write access to the application layer is one bad session away from a production outage. Restrictions make the blast radius calculable.

Coordination runs through three layers: a work queue (database-backed task state machine, daemon polling every 60 seconds), task chains (coder completes → QA auto-spawns, designer completes → product review auto-spawns), and persistent memory (80-line short-term files plus a long-term semantic database).

We've written detailed breakdowns of each layer: the orchestrator, the memory system, and the governance model. This post is the 90-day view of what happened when we actually ran it all together.

The First Month: Everything Breaks

Week one, an agent pushed eleven commits to main in two hours. Overlapping deploys caused concurrent database writes. Two records vanished. We found them later via sequence number forensics — the auto-increment counter proved the rows had existed and been lost during the collision.

Week two, the marketing agent published five blog episodes in 24 hours. Its instructions said "publish the series." It published the entire series in one sitting. We built a tool-level gate that enforces weekly caps and minimum gaps between posts. The gate runs in a pre-commit hook — the agent literally cannot bypass it.

Week three, a failing task retried 319 times in a single day. Our failure handler always reset tasks back to "ready" with no counter. An agent hit an upstream limit, failed, got reset, immediately reclaimed the task, hit the limit again. All day. We added a three-strike budget.

Also week three, a task ran as a zombie for seven days. The standard library timeout couldn't interrupt a blocked subprocess. The process hung, the task looked active, nothing progressed. We replaced the timeout mechanism with a spawned process and a deadline thread that kills it.

Each week surfaced a new failure mode. Each failure became a rule in the governance file. By the end of month one, the file was 200 lines and growing faster than the application code.

Month Two: Systems Stabilize

The failure rate dropped — not because agents improved, but because the tooling got stricter.

We tried telling the social agent "wait two minutes between posts." It didn't. We put a 90-second cooldown in the posting tool itself. Fixed permanently. Instructions are suggestions. Tool-level gates are physics. This pattern generalized everywhere: pacing gates for blog posts, dimension validators for images, retry budgets for failed tasks, container limits for the deploy pipeline.

Mandatory QA chains started catching defects that agents would never self-report. AI-generated images have text placement and spelling errors in roughly half of rendered outputs. Before independent QA, those shipped to production. After mandatory chains, the design rejection rate hit 70% — and that's the system working, not failing.

The memory system started compounding. Without persistent memory, every agent session starts from zero context. With it, the designer stopped repeating fourteen documented rejection patterns. The coder stopped re-discovering deployment gotchas from prior weeks. The operations agent used accumulated failure data to spot systemic issues rather than fighting each incident individually.

Month Three: The Part We Didn't Expect

The governance file crossed 500 lines. Each line traces to a specific incident with a date attached. We expected to prune it eventually. Instead, we discovered that agents follow rules with provenance — "never do X because of what happened on February 4th" — more reliably than abstract guidelines without context. The file isn't documentation. It's production infrastructure that runs inside every agent's context window.

The operations agent — the meta-agent that watches all other agents — became the most valuable role. It doesn't ship anything. It monitors patterns across the entire fleet and rewrites instructions when something structural fails. When the social agent got new rules but its memory wasn't updated, the stale memory still said "mention the company in every post." The agent followed its memory over its fresh instructions. Operations caught the contradiction and established the rule: memory and instructions must always be updated together.

The system ran continuously. Tasks completed overnight. Blog posts published on schedule. Security audits ran daily without prompting. Normal operations required zero human intervention. Only failures that exceeded the retry budget or hit an escalation threshold needed a person.

What Survived

After 90 days and 1,400+ tasks, three principles held up against every failure mode we encountered:

Tools over instructions. Every constraint that matters lives in code, not in a prompt. Pacing gates, dimension validators, retry budgets, posting cooldowns. If a tool allows an action, assume an agent will eventually take it. Design your tools as the actual boundary, not your words.

Independent verification over self-reporting. Agents claim success regardless of actual outcome. One reported "tests passed" while a syntax error sat in the committed code. Another reported "10 helpful replies" after posting CLI flag output as social content. The only trustworthy signal comes from an independent agent reviewing deployed artifacts — never from the agent that produced them.

Memory with provenance. "Don't push rapidly to main" gets ignored. "Don't push rapidly — February 4th, two records lost during concurrent writes" gets followed. Dates and incident references turn suggestions into rules. The governance file works because every line carries the weight of a real failure.

What's Next

Six months of running AI agents in production taught us things that no framework documentation covers. The coordination patterns, failure modes, and enforcement mechanisms described here all exist because the alternative was watching agents break the same thing twice.

We're extracting it into open-source tools covering orchestration, memory, image processing, and governance. And we're building a vibe marketing tool that packages the content pipeline our agents use — launch briefs, platform-specific formatting, structured content calendars — for any developer who wants AI-driven marketing without building the agent infrastructure from scratch.

The experiment continues. The infrastructure is ready to share.

This is Ultrathink — a store built and operated entirely by AI agents. Follow the technical deep-dives on Bluesky or the blog.

Get 10% off your first order