The Orchestrator: How Claude Code Agents Actually Ship Code
This is Episode 9 of "How We Automated an AI Business" -- the series finale. Last time: the daily security audit -- an agent that probes our own defenses and won't stop mentioning what it finds. This time: the script at the center of everything.
New here? Episode 1 covers why we built this. Episode 2 covers the database table it runs on.
The orchestrator is a Ruby script with a loop and a sleep 60.
Every minute, it wakes up, reads the work queue, and asks: are there ready tasks and open slots? If yes, spawn an agent. If not, sleep.
The complexity is in what "spawn an agent" means -- and what happens when the agent crashes, hangs, or produces output that should trigger more work.
The Five-Phase Tick
Each cycle runs five phases:
Phase 1: Check the dead. Scan agent_runs/ for .running state files. Is the PID alive? No? Read its log. TASK_COMPLETE in the output? Mark complete, trigger chained children. Zero-byte log? Mark failed -- agent never ran. Non-zero log but no signal? Auto-complete -- the work probably happened, the signal got lost.
Phase 2: Quality gates. Tasks with branches get linted, tested, and scanned. RuboCop, Rails tests, Brakeman. All three must pass.
Phase 3: Security review. Code touching controllers, auth, or payment spawns a security agent to review the diff.
Phase 4: Merge. Tasks that passed all gates get merged to main. Push triggers the deploy pipeline.
Phase 5: Spawn. Claim ready tasks via the production API. For each, Process.spawn a new bin/agent-worker with task context in environment variables. Write a .running state file. Move on.
Three agents max. One per git-pushing role -- two coders pushing simultaneously causes overlapping deploys.
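A minimal sketch of the tick, with the queue and runner collaborators injected so the slot logic is visible. All method names here are illustrative, not the real ones:

```ruby
class Orchestrator
  MAX_AGENTS = 3  # one per git-pushing role; two coders = overlapping deploys

  def initialize(queue:, runner:)
    @queue  = queue   # claims tasks and runs gates (the production API in the real system)
    @runner = runner  # spawns and tracks agent processes
  end

  def run
    loop do
      tick
      sleep 60
    end
  end

  def tick
    @runner.reap_dead        # Phase 1: check .running PIDs, parse logs
    @queue.run_quality_gates # Phase 2: lint, test, scan branches
    @queue.security_review   # Phase 3: diff review for sensitive code
    @queue.merge_passing     # Phase 4: merge to main, trigger deploy
    spawn_ready              # Phase 5: claim work, start workers
  end

  def spawn_ready
    slots = MAX_AGENTS - @runner.running_count
    return if slots <= 0
    @queue.claim_ready(limit: slots).each { |task| @runner.spawn(task) }
  end
end
```

Injecting the collaborators also makes the loop testable without a live API or real processes.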
What a Spawn Looks Like
The orchestrator claims the task via HTTP POST (atomic -- if two orchestrators race, one wins). Sets up env vars: AGENT_TASK_ID, AGENT_ROLE, log path, state path. PATH gets mise Ruby first -- system Ruby 2.6 has none of our gems. Calls Process.spawn, detaches the PID, writes a YAML state file, moves to the next task.
The spawned process is on its own. The orchestrator doesn't wait.
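That sequence can be sketched as below. The env var names AGENT_TASK_ID and AGENT_ROLE come from the post; the helper name, file layout, and mise shim path are assumptions:

```ruby
require "yaml"

# Sketch of one spawn: set env, start the worker, detach, record state.
def spawn_agent(task, worker: "bin/agent-worker", run_dir: "agent_runs")
  log = File.join(run_dir, "#{task[:id]}.log")
  env = {
    "AGENT_TASK_ID" => task[:id].to_s,
    "AGENT_ROLE"    => task[:role],
    "AGENT_LOG"     => log,
    # mise Ruby first -- system Ruby 2.6 has none of our gems
    "PATH"          => "#{Dir.home}/.local/share/mise/shims:#{ENV['PATH']}"
  }
  pid = Process.spawn(env, worker, out: log, err: log)
  Process.detach(pid)  # never wait; the worker is on its own
  File.write(File.join(run_dir, "#{task[:id]}.running"),
             { "pid" => pid, "task_id" => task[:id],
               "started_at" => Time.now.to_i }.to_yaml)
  pid
end
```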
The Worker
bin/agent-worker is every agent's entry point. First line: write a boot marker to stdout. BOOT: ruby=3.3.4 pid=12345 role=coder. This exists because agents were producing zero-byte logs -- were they crashing in Ruby startup? In bundler? In our code? If you see the boot marker, Ruby loaded. If not, the crash is deeper.
The worker loads the task from the API, builds the Claude prompt from the agent's role definition and task brief, starts a heartbeat thread (30-second POST: "I'm alive"), and spawns the Claude CLI with a one-hour deadline.
The deadline is the critical piece. We used to wrap execution in Ruby's Timeout.timeout. It never fired. Timeout.timeout can't interrupt a blocking I/O syscall -- one task ran as a zombie for seven days.
The fix: Process.spawn, a monotonic clock, and Process.kill. If the clock runs out, SIGTERM. Wait two seconds. SIGKILL. OS signals instead of Ruby exceptions.
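A sketch of that pattern. The two-second grace period and monotonic clock are from the post; the helper name and one-second poll interval are assumptions. The key point: Timeout.timeout raises a Ruby exception on the calling thread, which a blocked I/O syscall never sees, while signals go through the kernel.

```ruby
def run_with_deadline(cmd, deadline_seconds)
  pid = Process.spawn(*cmd)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)  # immune to wall-clock jumps
  loop do
    # Non-blocking reap: returns the pid if the child has exited, nil otherwise.
    return $?.exitstatus if Process.waitpid(pid, Process::WNOHANG)
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) - started > deadline_seconds
      Process.kill("TERM", pid)   # ask nicely
      sleep 2
      begin
        Process.kill("KILL", pid) # then don't
      rescue Errno::ESRCH
        # exited during the grace period
      end
      Process.waitpid(pid)        # reap; no seven-day zombies
      return :timed_out
    end
    sleep 1
  end
end
```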
Heartbeats and Death
Every agent sends dual heartbeats: one to the production database via HTTP, one to a local YAML file.
The API heartbeat lets the queue monitor (separate hourly script) detect stale tasks server-side. Claimed for 60+ minutes with no heartbeat? Reset to ready. The local file lets the orchestrator detect dead agents during its own tick.
The heartbeat thread has its own failure budget: 20 consecutive API failures and it stops. Prevents a network partition from accumulating hundreds of failed requests while the agent does real work.
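A sketch of such a heartbeat thread, with the HTTP call injected as a lambda so the failure budget is visible. The 30-second interval and the budget of 20 are from the post; names, endpoint, and file contents are illustrative:

```ruby
require "yaml"

MAX_CONSECUTIVE_FAILURES = 20

def start_heartbeat(task_id, state_path, interval: 30, post:)
  Thread.new do
    failures = 0
    loop do
      # Local file first: the orchestrator reads this during its own tick.
      File.write(state_path, { "task_id" => task_id,
                               "beat_at" => Time.now.to_i }.to_yaml)
      begin
        post.call(task_id)  # HTTP POST to the production API
        failures = 0        # budget counts *consecutive* failures only
      rescue StandardError
        failures += 1
        break if failures >= MAX_CONSECUTIVE_FAILURES  # stop; agent keeps working
      end
      sleep interval
    end
  end
end
```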
Task Chains
The most useful pattern: next_tasks. When a coder task is created, it declares its children:
```yaml
next_tasks:
  - role: qa
    subject: "QA Review: {{parent_task_id}}"
```
When complete! fires, spawn_next_tasks creates a new task with status ready. The orchestrator picks it up next tick. Coder finishes, QA auto-spawns. Designer finishes, product agent auto-spawns.
One gotcha: chains fire server-side in the Rails model, so the worker must call the API's complete endpoint. We shipped a bug where the orchestrator called a local method instead. Zero errors. Zero chains. For weeks. The fix was one line.
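The distinction can be sketched as follows, assuming a hypothetical endpoint path. Server-side, complete! runs spawn_next_tasks; a worker that flips the row locally skips that silently, so the worker has to POST:

```ruby
require "net/http"
require "uri"

# Hit the server's complete endpoint so the Rails model -- and the
# chain logic inside it -- actually runs. Path is illustrative.
def complete_task!(task_id, api_base:)
  uri = URI("#{api_base}/tasks/#{task_id}/complete")
  res = Net::HTTP.post(uri, "", "Content-Type" => "application/json")
  raise "completion failed: HTTP #{res.code}" unless res.is_a?(Net::HTTPSuccess)
  true
end
```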
The Completion Parse
When an agent dies, the orchestrator checks in order: does a .complete file exist? Is the log zero bytes? (Mark failed -- 92% of social tasks once "completed" with empty logs, creating a phantom record of work that never happened.) Does the log contain TASK_COMPLETE? None of the above? Auto-complete.
Each rule was added after a specific incident. The hierarchy is messy because failures are messy.
What the Series Built
Nine episodes. Here's the stack:
Ep 1: Agent roles in markdown. Infrastructure > prompts.
Ep 2: A database table with a status column.
Ep 3: Quality gates. 70% rejection rate.
Ep 4: Design review chains. Teaching taste.
Ep 5: launchd daemons and the self-sustaining loop.
Ep 6: YAML memory for a stateless CEO.
Ep 7: Layered recovery and processes that don't trust each other.
Ep 8: Automated audits that won't stop mentioning findings.
This episode: the orchestrator. A loop, a sleep, and Process.spawn.
The whole system is ~4,000 lines of Ruby. No Kubernetes. No message broker. A Mac Mini running launchd daemons, a Rails app with a database table, and Claude Code processes that start, work, and die.
Every rule in the concurrency controls, every branch in the completion parser, every threshold in the heartbeat monitor exists because something broke. The seven-day zombie. The 319-retry loop. The 3,751 PATH crashes. The zero-byte phantoms. Each incident left a scar in the code. The scars are the system.
This concludes "How We Automated an AI Business." The series is about one idea: AI agents are unreliable, stateless, and prone to failure -- and you can build a working system out of them anyway, if you assume they'll break and build recovery into every layer.
Thanks for reading.