The Queue That Runs Itself

✍️ Ultrathink Engineering 📅 February 06, 2026

This is Episode 5 of "How We Automated an AI Business." Last time: teaching taste — the feedback loop between human judgment and machine production. This time: what happens when no one is watching.


The work queue from Episode 2 has a problem. It coordinates agents beautifully — as long as someone keeps feeding it tasks. The CEO agent creates work each morning. But the CEO is itself an agent that needs to be spawned. Who spawns the CEO?

We needed the queue to sustain itself.

The Queue Monitor

bin/queue-monitor is a 440-line Ruby script that runs every hour via macOS launchd. It does three things:

1. Checks queue depth. It hits the production API, counts tasks in ready status, and compares against a threshold of 5.

2. Detects stale tasks. Agents die — network timeouts, OOM kills, API rate limits. When they die, their task sits claimed forever. The monitor cross-references each claimed task against two signals: a .running file on the local filesystem and the task's last heartbeat timestamp. No .running file and no recent log? The agent was never spawned — reset after 5 minutes. Has a log but no process? Agent died — reset after 60 minutes.

threshold = (!has_local_state && !has_recent_log) ? ORPHAN_TASK_MINUTES : STALE_TASK_MINUTES

Two thresholds, one heuristic. Tasks that were never picked up get recycled fast. Tasks where the agent might still be finishing get more grace.

3. Auto-spawns the CEO. If the queue is below threshold AND the CEO hasn't run in 4+ hours, the monitor creates a P0 task: "Strategic review — queue running low, generate work." The orchestrator picks it up on its next 60-second poll, spawns the CEO agent, and the CEO fills the queue with new tasks for every role.

The CEO creates work. The orchestrator dispatches it. The monitor ensures the CEO itself gets dispatched. A closed loop.
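
Condensed into a sketch, the hourly pass looks roughly like this. The API client, its method names, the .running path, and the 15-minute heartbeat window are illustrative stand-ins, not the actual 440-line script:

require "time"

QUEUE_THRESHOLD     = 5
ORPHAN_TASK_MINUTES = 5
STALE_TASK_MINUTES  = 60
CEO_IDLE_HOURS      = 4

# One hourly pass. `api` stands in for a thin client around the
# production work-queue endpoints; every method on it is hypothetical.
def monitor_pass(api)
  tasks = api.list_tasks

  # 1. Recycle claimed tasks whose agent never started or silently died.
  tasks.select { |t| t["status"] == "claimed" }.each do |t|
    has_local_state = File.exist?("tmp/agents/#{t['id']}.running") # illustrative path
    has_recent_log  = t["last_heartbeat_at"] &&
                      Time.parse(t["last_heartbeat_at"]) > Time.now - 15 * 60

    # Never spawned: short grace. Spawned but gone quiet: long grace.
    threshold   = (!has_local_state && !has_recent_log) ? ORPHAN_TASK_MINUTES : STALE_TASK_MINUTES
    claimed_for = (Time.now - Time.parse(t["claimed_at"])) / 60
    api.reset_to_ready(t["id"]) if claimed_for > threshold
  end

  # 2. Queue starving and CEO idle for 4+ hours: create the P0 strategic-review task.
  ready_count = tasks.count { |t| t["status"] == "ready" }
  ceo_idle    = (Time.now - api.last_ceo_run_at) > CEO_IDLE_HOURS * 3600
  if ready_count < QUEUE_THRESHOLD && ceo_idle
    api.create_task(role: "ceo", priority: "P0",
                    subject: "Strategic review — queue running low, generate work")
  end
end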

Task Chains

Not every task is independent. When a coder ships a feature, someone needs to verify it. We could rely on the CEO to remember to create QA tasks — but the CEO is an LLM. It forgets.

So we built next_tasks directly into the WorkQueueTask model. Each task can declare children that auto-spawn on completion:

next_tasks:
  - role: qa
    subject: "QA Review for {{parent_task_id}}"
    trigger: on_complete

When complete! fires, spawn_next_tasks iterates the array, interpolates the parent task ID into the subject, and creates new ready tasks. The orchestrator picks them up on its next cycle. No human dispatching. No routing logic. The chain is defined at creation and executes itself.
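
A minimal version of that hook, assuming a Rails model whose next_tasks column deserializes to the array shown above; the column handling and attribute names here are a sketch, not the production model:

class WorkQueueTask < ApplicationRecord
  def complete!
    update!(status: "completed", completed_at: Time.current)
    spawn_next_tasks
  end

  private

  # Enqueue every declared child as a fresh ready task for the orchestrator to find.
  def spawn_next_tasks
    Array(next_tasks).each do |child|
      next unless child["trigger"] == "on_complete"

      WorkQueueTask.create!(
        role:    child["role"],
        status:  "ready",
        subject: child["subject"].gsub("{{parent_task_id}}", id.to_s)
      )
    end
  end
end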

The pattern handles the most common workflow — coder finishes, QA reviews — but it's general enough for any dependency. Designer finishes artwork, product agent uploads it. Product creates a listing, QA screenshots the page.

The Retry Budget

Agents fail. The question is what to do about it.

Our first implementation reset failed tasks to ready unconditionally. In February, a Claude rate limit error caused task WQ-719 to retry 319 times — each attempt burning API quota, hitting the same rate limit, and resetting. An infinite loop of failure.

The fix: a retry budget. fail! increments failure_count. First two failures reset to ready. Third failure marks the task failed permanently.

MAX_RETRIES = 3

def fail!(reason: nil)
  current_retries = (failure_count || 0) + 1
  if current_retries >= MAX_RETRIES
    # Third strike: park the task permanently instead of burning more quota.
    update_columns(status: "failed", ...)
  else
    # Still inside the budget: back to ready for another attempt.
    update_columns(status: "ready", ...)
  end
end

We also added a rate limit cooldown. When agent-worker detects usage limit language in Claude's output, it writes a timestamp file. The orchestrator checks this file before spawning — if it's less than an hour old, all spawns are paused. One agent hitting a rate limit no longer triggers 150 retries across the queue.
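
The cooldown is nothing more than a timestamp file and an age check; a sketch, with the path and helper names invented for illustration:

COOLDOWN_FILE = File.expand_path("~/.agents/rate_limit_hit") # illustrative path
COOLDOWN_SECS = 3600

# agent-worker side: note the moment usage-limit language shows up in output.
def record_rate_limit!
  File.write(COOLDOWN_FILE, Time.now.to_i.to_s)
end

# orchestrator side: checked before every spawn; true means hold all spawns.
def rate_limited?
  return false unless File.exist?(COOLDOWN_FILE)
  Time.now.to_i - File.read(COOLDOWN_FILE).to_i < COOLDOWN_SECS
end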

The Daemon Network

The self-sustaining loop isn't one script. It's six launchd daemons, each with a single job:

Daemon               Interval    Job
agent-orchestrator   60s         Poll queue, spawn agents
queue-health         1h          Check depth, reset stuck tasks, alert
ceo-client           Always-on   Process web chat messages
ceo-strategy         Daily 9am   Full strategic review + task generation
social-engagement    3h          Create Reddit/Bluesky tasks if missing
reddit-sync          Periodic    Sync subreddit metadata

Each daemon is a .plist file in ~/Library/LaunchAgents/. Each one does exactly one thing. They share no state except the database and the filesystem.

The orchestrator is the heartbeat — every 60 seconds it checks for ready tasks and spawns up to 3 concurrent agents. The health monitor is the immune system — hourly checks for stuck tasks, queue starvation, and crashed daemons. If ceo-client is down (exit code != 0), the monitor unloads and reloads the plist automatically.
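
The restart itself is a launchctl round trip. Roughly the following, with the label and plist path as placeholders:

# `launchctl list` prints tab-separated rows: PID, last exit status, label.
def daemon_status(label)
  row = `launchctl list`.lines.find { |l| l.split("\t")[2]&.strip == label }
  return nil unless row
  pid, status, _label = row.split("\t")
  { running: pid != "-", exit_status: status.to_i }
end

def restart_if_crashed(label, plist_path)
  status = daemon_status(label)
  return unless status && !status[:running] && status[:exit_status] != 0

  # Dead with a non-zero exit: bounce the job by reloading its plist.
  system("launchctl", "unload", plist_path)
  system("launchctl", "load", plist_path)
end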

The social engagement check is the simplest — every 3 hours, it queries the queue for existing marketing tasks. No Bluesky task? Create one. No Reddit task? Create one. It even rotates Reddit subreddit groups by time-of-day so engagement is distributed across communities.
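
Reduced to its essentials, that is two existence checks and a clock. In the sketch below the queue client, role name, and group names are all placeholders:

# Placeholder group names; the real subreddit rotation lives in the daemon.
SUBREDDIT_GROUPS = { morning: "group_a", afternoon: "group_b", evening: "group_c" }.freeze

def ensure_social_tasks(tasks, client)
  marketing = tasks.select { |t| t["role"] == "marketing" && t["status"] != "completed" }

  # One standing Bluesky task at a time.
  unless marketing.any? { |t| t["subject"].include?("Bluesky") }
    client.create_task(role: "marketing", status: "ready", subject: "Bluesky engagement pass")
  end

  # One Reddit task, pointed at a different subreddit group by time of day.
  unless marketing.any? { |t| t["subject"].include?("Reddit") }
    slot = case Time.now.hour
           when 5..11  then :morning
           when 12..17 then :afternoon
           else             :evening
           end
    client.create_task(role: "marketing", status: "ready",
                       subject: "Reddit engagement pass (#{SUBREDDIT_GROUPS[slot]})")
  end
end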

The Closed Loop

Here's the full cycle:

  1. queue-monitor detects < 5 ready tasks
  2. Creates P0 CEO task: "generate work"
  3. agent-orchestrator claims it, spawns CEO agent
  4. CEO creates 8-10 tasks across roles
  5. Orchestrator spawns coder, designer, marketing agents
  6. Coder finishes → next_tasks spawns QA review
  7. QA finishes → task complete
  8. If agent dies → monitor resets task after timeout
  9. If rate limited → cooldown pauses all spawns for 1 hour
  10. Queue drops below 5 → back to step 1

No cron job starts the loop. No human triggers it. The daemons poll, react, and recover. The only external dependencies are a MacBook that stays open and a Claude API key with budget.

We learned this the hard way in early February: 3,751 crashes from a PATH misconfiguration in a single plist. The mise Ruby path wasn't first — macOS resolved to system Ruby 2.6, which lacked every gem. The CEO chat processor crash-looped for 12 hours before anyone noticed. Now every plist starts with the same PATH line, and the health monitor checks daemon status every hour.

The system isn't intelligent. It's a set of timers and thresholds wired to a database table. But it runs continuously, recovers from failures, and generates its own work. Which, for a store run by AI agents, is the whole point.


Next time: how AI agents try to build community on Reddit and Bluesky — automod walls, karma thresholds, and why being helpful in someone else's thread beats any promotional post. Episode 6 coming soon.
