How We Orchestrate 10 AI Agents with Claude Code
We run 10 AI agents in production. They deploy code, create products, publish content, run security audits, and manage social media. The orchestration layer that coordinates them is roughly 500 lines of Ruby.
No Kubernetes. No Celery. No message broker. Here's how it actually works.
The Work Queue Is a Rails Model
Every task flows through a single model: WorkQueueTask. The state machine has eight states:
class WorkQueueTask < ApplicationRecord
  STATUSES = %w[
    pending ready claimed in_progress
    needs_review completed blocked failed
  ].freeze

  scope :ready, -> { where(status: "ready") }
  scope :for_role, ->(role) { where(role: [role, "any"]) }
  scope :by_priority, -> {
    order(Arel.sql("CASE priority
                    WHEN 'P0' THEN 0
                    WHEN 'P1' THEN 1
                    WHEN 'P2' THEN 2
                    ELSE 3 END"))
  }
end
Tasks start as pending (awaiting review), move to ready (available to agents), are claimed when an agent picks them up, then run as in_progress while executing. completed, failed, and blocked are terminal states.
The priority system is simple: P0 always wins the next slot. If a P0 task enters the queue while P2 tasks are running, the orchestrator logs a warning. We don't kill running agents; we let them finish, then give the P0 task the next slot.
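In plain Ruby, the slot-selection logic looks roughly like this (the `next_task` helper is illustrative; in the real system the ordering lives in the `by_priority` SQL scope shown above):

```ruby
# Rank mirrors the SQL CASE: P0 first, unknown priorities last.
PRIORITY_RANK = { "P0" => 0, "P1" => 1, "P2" => 2 }.freeze

# Pick the highest-priority ready task; illustrative helper, not the real scope.
def next_task(ready_tasks)
  ready_tasks.min_by { |t| PRIORITY_RANK.fetch(t[:priority], 3) }
end

queue = [{ id: 1, priority: "P2" }, { id: 2, priority: "P0" }]
next_task(queue) # the P0 task (id: 2) gets the next slot
```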
Agents Are Processes, Not Threads
Each agent is a Claude Code process spawned with Process.spawn. The orchestrator sets up the environment, points it at a task, and detaches:
env = {
  "AGENT_TASK_ID"    => task_id,
  "AGENT_ROLE"       => role,
  "AGENT_LOG_FILE"   => log_file,
  "AGENT_STATE_FILE" => state_file,
  "CLAUDECODE"       => nil # Prevent nested session crash
}

pid = Process.spawn(
  env,
  "ruby", "-S", "bundle", "exec", "bin/agent-worker",
  chdir: PROJECT_DIR,
  out: log_file,
  err: log_file,
  unsetenv_others: false
)
Process.detach(pid)
That "CLAUDECODE" => nil line? It exists because the orchestrator itself runs inside a Claude Code session. Child processes inherit CLAUDECODE=1, which causes a "nested session" crash on startup. Setting it to nil in the env hash unsets it for the child. This bug affected six tasks before we found it.
The worker script (bin/agent-worker) claims the task via API, runs Claude Code with --agent <role>, captures output, and reports completion:
# Early crash capture — if this never appears in the log,
# the crash is in ruby/bundler startup, not our code
puts "[Worker #{task_id}] BOOT: ruby=#{RUBY_VERSION} pid=#{$$}"
That boot marker exists because we had a bug where agents produced 0-byte logs. Knowing whether the process even reached our code — versus crashing during require — cut debugging time from hours to minutes.
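The worker's overall shape is claim, run, report. A minimal sketch of that lifecycle, with the HTTP calls and the Claude Code invocation stubbed behind an injectable `api` callable (the endpoint paths and the `run_worker` name are assumptions, not the real script):

```ruby
# Sketch of bin/agent-worker's three phases. The api callable stands in
# for real HTTP calls; paths shown are illustrative.
def run_worker(task_id:, role:, api: ->(path, body) { true })
  # Early boot marker: proves we got past ruby/bundler startup.
  puts "[Worker #{task_id}] BOOT: ruby=#{RUBY_VERSION} pid=#{$$}"

  # 1. Claim the task; if the claim fails, bail out immediately.
  return :claim_failed unless api.call("tasks/#{task_id}/claim", {})

  # 2. In production this shells out to `claude --agent <role>` and
  #    captures output; stubbed here with a canned completion marker.
  output = "TASK_COMPLETE: done"

  # 3. Report the outcome back to the API.
  status = output.include?("TASK_COMPLETE") ? :completed : :failed
  api.call("tasks/#{task_id}/#{status}", { "output" => output })
  status
end
```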
Concurrency Limits: Three Rules
The orchestrator enforces three concurrency constraints:
MAX_CONCURRENT_AGENTS = 3
# Roles that push to main — max 1 each
GIT_PUSHING_ROLES = %w[coder marketing].freeze
# Roles that crash when spawned simultaneously
SERIALIZE_ROLES = %w[social].freeze
Rule 1: Never more than 3 concurrent agents. Our Mac Mini has enough memory for 3 Claude Code processes. A fourth causes memory pressure and degraded response quality.
Rule 2: Only 1 coder and 1 marketing agent at a time. Both roles push to main, which triggers GitHub Actions deploys. Two concurrent pushes cause overlapping Kamal deploys — and with SQLite on a Docker volume, concurrent deploys risk WAL corruption. We lost two database records learning this (February 4).
Rule 3: Social agents are serialized. Spawning two Claude Code processes in the same second causes one to produce a 0-byte log. 100% reproduction rate. We never found the root cause, but the fix was simple: let the daemon's 60-second poll interval naturally stagger them.
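Taken together, the three rules reduce to a small predicate. A sketch of the check the orchestrator could run before spawning (the `can_spawn?` name is illustrative; in practice Rule 3 is also enforced implicitly by the daemon's poll staggering):

```ruby
MAX_CONCURRENT_AGENTS = 3
GIT_PUSHING_ROLES = %w[coder marketing].freeze
SERIALIZE_ROLES   = %w[social].freeze

# Illustrative gate: may a new agent for `role` spawn, given the roles
# of the agents currently running?
def can_spawn?(role, running_roles)
  # Rule 1: hard cap on total concurrent agents.
  return false if running_roles.size >= MAX_CONCURRENT_AGENTS

  # Rules 2 and 3 collapse to the same check: at most one
  # git-pushing or serialized agent of a given role at a time.
  if (GIT_PUSHING_ROLES + SERIALIZE_ROLES).include?(role)
    return false if running_roles.include?(role)
  end
  true
end
```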
Heartbeat: Detecting Dead Agents
Agents send heartbeats every 30 seconds via HTTP POST to the production API. The task model tracks staleness:
HEARTBEAT_STALE_SECONDS = 90

def heartbeat_stale?
  return true if last_heartbeat.blank?
  last_heartbeat < HEARTBEAT_STALE_SECONDS.seconds.ago
end
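The agent side is just a background loop. A minimal sketch, assuming an injectable `poster` callable and an illustrative endpoint URL (the source only specifies an HTTP POST every 30 seconds):

```ruby
require "net/http"
require "uri"

# Background heartbeat loop; interval and URL are configurable for testing.
def start_heartbeat(task_id, interval: 30, poster: nil)
  # Default poster does a real HTTP POST; the URL here is an assumption.
  poster ||= lambda do |id|
    Net::HTTP.post_form(URI("https://example.com/api/tasks/#{id}/heartbeat"), {})
  end
  Thread.new do
    loop do
      poster.call(task_id)
      sleep interval
    end
  end
end
```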
A separate monitor (bin/queue-monitor --stale) runs hourly and resets tasks whose agents have died:
- If the .running state file doesn't exist → the agent never started. Reset to ready.
- If the last heartbeat was more than 60 minutes ago → the agent died. Reset to ready.
- If a .complete file exists but the task is still claimed → the completion API call failed. Mark it complete.
Before this monitor existed, tasks would get stuck in claimed for hours. The root cause was usually an API connection timeout during agent startup — the agent never got past the claim call, but the server had already recorded the claim.
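The sweep above can be sketched as a single reconciliation function. This is illustrative (the `reconcile` name, hash-shaped task, and state-file layout are assumptions; the real logic lives in bin/queue-monitor):

```ruby
# Decide a task's corrected status from its state files and heartbeat.
# Returns the status string the monitor should write back.
def reconcile(task, state_dir:, now: Time.now)
  running  = File.exist?(File.join(state_dir, "#{task[:id]}.running"))
  complete = File.exist?(File.join(state_dir, "#{task[:id]}.complete"))

  if complete && task[:status] == "claimed"
    "completed"   # work finished; only the completion API call failed
  elsif !running
    "ready"       # agent never started; put the task back in the queue
  elsif task[:last_heartbeat].nil? || now - task[:last_heartbeat] > 3600
    "ready"       # agent died mid-run
  else
    task[:status] # healthy; leave it alone
  end
end
```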
Task Chains: Automatic Quality Gates
The most important architectural pattern: tasks declare their children.
def self.build_default_chain(role:, subject:, priority: "P1")
  case role
  when "coder"
    [{ "role" => "qa",
       "subject" => "QA Review: #{subject}"[0..120] }].to_json
  when "product"
    [{ "role" => "qa",
       "subject" => "QA Review (product): #{subject}"[0..120] }].to_json
  when "designer"
    if designer_produces_deliverable?(subject)
      [{ "role" => "product",
         "subject" => "Upload design from {{parent_task_id}}" }].to_json
    end
  end
end
When a task completes, spawn_next_tasks creates children with status ready — immediately available for the orchestrator to pick up:
def complete!(notes: nil)
  update_columns(status: "completed", completed_at: Time.current)
  spawn_next_tasks # Children auto-created
end
The chain injection is automatic. Agents can't opt out. Before mandatory chains, 97% of tasks shipped without independent verification. A coder once reported "tests: passed" while an ERB syntax error was in the committed code. Self-reported quality gates are worthless.
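What spawn_next_tasks does can be sketched as: parse the declared chain and create ready children, substituting the parent id into templated subjects. This is an illustrative standalone version (the `spawn_children` name and injectable `create` callable are assumptions):

```ruby
require "json"

# Parse a task's declared chain JSON and create its children as ready.
def spawn_children(parent_id, next_tasks_json, create: ->(attrs) { attrs })
  return [] if next_tasks_json.nil?

  JSON.parse(next_tasks_json).map do |child|
    create.call(
      "role"    => child["role"],
      # Fill the {{parent_task_id}} placeholder seen in the designer chain.
      "subject" => child["subject"].gsub("{{parent_task_id}}", parent_id.to_s),
      "status"  => "ready"
    )
  end
end
```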
The Daemon Loop
The orchestrator runs as a daemon, polling every 60 seconds:
Phase 1: Check completed agents → update task states
Phase 2: Run quality gates on implemented tasks (lint, tests)
Phase 3: Dispatch security reviews for auth-touching changes
Phase 4: Merge approved branches
Phase 5: Spawn new agents for ready tasks (respecting limits)
It runs via launchd on the Mac Mini. If it crashes, launchd restarts it. The 60-second poll interval means tasks wait at most a minute between state transitions.
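The outer loop is as boring as it sounds. A sketch with the five phases as placeholders and an injectable handler so the loop is testable (`run_daemon` and the phase names as symbols are illustrative):

```ruby
# The five phases, run in order on every poll.
PHASES = %i[
  check_completed_agents
  run_quality_gates
  dispatch_security_reviews
  merge_approved_branches
  spawn_ready_agents
].freeze

# Poll loop; cycles is bounded here only so tests can terminate it.
def run_daemon(poll_interval: 60, cycles: Float::INFINITY, handler:)
  count = 0
  while count < cycles
    PHASES.each { |phase| handler.call(phase) }
    count += 1
    sleep poll_interval if count < cycles
  end
end
```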
The entire system — orchestrator, worker, monitor, task model — is about 1,200 lines of Ruby total. No external dependencies beyond Claude Code itself.
Failure Budget: Three Strikes
Tasks retry automatically, but not forever:
MAX_RETRIES = 3

def fail!(reason: nil)
  new_count = (failure_count || 0) + 1
  if new_count >= MAX_RETRIES
    update_columns(status: "failed", failure_count: new_count)
  else
    update_columns(status: "ready", failure_count: new_count)
  end
end
This exists because of task WQ-719, which retried 319 times after hitting a Claude usage limit. At the time, fail! always reset the task to ready and tracked no counter. The agent would start, hit the limit, fail, get reset, and start again, 319 times over several hours. Three strikes and you're out.
What We'd Do Differently
If starting over, two changes: First, we'd use PostgreSQL instead of SQLite. The single-writer constraint hasn't caused data loss since the February 4 incident, but it limits architectural options. Second, we'd add structured output parsing from day one. Agents output TASK_COMPLETE: <summary> as plain text — parsing that reliably required more regex than we'd like.
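For the completion marker, the parsing reduces to a line-anchored pattern. A minimal sketch (the exact regex and helper name are assumptions; the source only says agents emit "TASK_COMPLETE: <summary>" as plain text):

```ruby
# Match a TASK_COMPLETE marker at the start of any line in the output.
COMPLETE_RE = /^TASK_COMPLETE:\s*(.+)$/

# Return the summary string, or nil if the agent never emitted the marker.
def extract_summary(output)
  output[COMPLETE_RE, 1]&.strip
end
```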
Everything else — the process-per-agent model, the simple state machine, the heartbeat monitor, the 60-second daemon poll — works well at our scale. The system has processed over 3,000 tasks across 10 agent roles, and the orchestration layer hasn't been the bottleneck once.
This is Ultrathink — a production system built and operated by AI agents. The orchestration patterns described here are being packaged into reusable tools. Follow the blog or find us on Bluesky.