How launchd Runs Our Fleet of 10 AI Agents Around the Clock
We run 10 AI agents in production: coder, designer, product, QA, security, marketing, social, operations, CEO, customer success. They deploy code, create products, publish content, and run security audits autonomously.
The scheduling layer that coordinates all of this is macOS launchd.
Not Kubernetes. Not AWS Lambda. Not a message broker. Plist files that tell the OS "run this script at these times." The same mechanism that launches your Spotlight indexer runs our entire AI agent fleet.
Three Scheduling Patterns
We use launchd in three distinct modes, depending on how the work needs to flow.
Pattern 1: Persistent daemon. The orchestrator runs continuously, polling the work queue every 60 seconds:
<key>KeepAlive</key>
<true/>
<key>RunAtLoad</key>
<true/>
If it crashes, launchd restarts it. The daemon's job is simple: find tasks marked ready in the database, spawn Claude Code processes, monitor their heartbeats.
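A minimal sketch of that loop. Names and hooks here are illustrative, not the real Ultrathink internals; the fetch/spawn callables are injected only to keep the sketch self-contained:

```ruby
POLL_INTERVAL = 60  # seconds between queue scans
MAX_AGENTS    = 3   # concurrency cap the orchestrator enforces

# Select tasks eligible to run this cycle: status "ready", capped by free slots.
def claimable_tasks(tasks, running_count)
  free_slots = MAX_AGENTS - running_count
  return [] if free_slots <= 0
  tasks.select { |t| t[:status] == "ready" }.first(free_slots)
end

# The daemon body. launchd's KeepAlive restarts this process if it dies,
# so a plain loop is enough.
def run_orchestrator(fetch_ready:, running_count:, spawn:)
  loop do
    claimable_tasks(fetch_ready.call, running_count.call).each { |t| spawn.call(t) }
    sleep POLL_INTERVAL
  end
end
```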
Pattern 2: Calendar intervals. Strategic work runs on a fixed schedule:
<key>StartCalendarInterval</key>
<array>
  <dict>
    <key>Hour</key><integer>6</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <dict>
    <key>Hour</key><integer>12</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <dict>
    <key>Hour</key><integer>18</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
</array>
CEO reviews run three times a day, at 6 AM, noon, and 6 PM. Each invocation creates a task in the work queue rather than spawning an agent directly. The orchestrator daemon picks it up on its next poll.
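The calendar-fired script can therefore be tiny: build a payload, POST it, exit. A sketch of the enqueue step, where the endpoint and payload shape are assumptions rather than the actual Ultrathink API:

```ruby
require "net/http"
require "json"
require "uri"

# Shape of a work queue entry; the daemon only acts on status "ready".
def build_task_payload(role:, title:)
  { task: { role: role, title: title, status: "ready" } }
end

# Record the task via the API and return; no agent is spawned here.
def enqueue_task(api_base, payload)
  uri = URI.join(api_base, "/api/tasks")
  Net::HTTP.post(uri, payload.to_json, "Content-Type" => "application/json")
end
```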
Pattern 3: Dense intervals for high-frequency work. Social engagement runs every 30 minutes, 8 AM to 10 PM:
<!-- 30 entries: 8:00, 8:30, 9:00 ... 22:00, 22:30 -->
That's 30 sessions per day. Each session creates a task, the orchestrator claims it, and a social agent runs for 5-15 minutes. Reddit gets every other session (~15/day). Bluesky alternates. The schedule script handles rotation — launchd just fires at the right times.
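One way to express that rotation, as an illustrative sketch rather than the actual schedule script: derive a slot index from the clock and alternate on parity.

```ruby
# Map a fire time to a platform: even 30-minute slots go to Reddit,
# odd slots to Bluesky, giving the every-other-session split.
def platform_for(time)
  slot = (time.hour * 60 + time.min) / 30
  slot.even? ? "reddit" : "bluesky"
end
```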
The indirection matters: launchd creates tasks, not agents. The orchestrator enforces concurrency. If three agents are already running when a social task arrives, it waits. The task sits in ready until a slot opens.
The PATH Problem Nobody Warns You About
Our first launchd disaster wasn't architectural — it was $PATH.
<key>EnvironmentVariables</key>
<dict>
  <key>PATH</key>
  <string>/Users/deploy/.local/share/mise/installs/ruby/3.3.4/bin:/opt/homebrew/bin:/usr/bin:/bin</string>
  <key>LANG</key>
  <string>en_US.UTF-8</string>
</dict>
launchd doesn't inherit your shell's PATH. Without the mise Ruby path first, a #!/usr/bin/env ruby shebang resolves to the macOS system Ruby 2.6, which has none of our gems installed. This caused 3,751 crash-loop iterations before we noticed. The fix was one line of XML, but the diagnosis took hours because launchd's error logging is minimal.
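A cheap guard against this class of bug is to have every launchd-run script log its interpreter and PATH on startup. A sketch, with the log path being an arbitrary choice:

```ruby
require "rbconfig"

# Append the resolved interpreter and the PATH this job actually received,
# so a wrong-Ruby launch is visible in one glance at the log.
File.open("/tmp/launchd-env.log", "a") do |f|
  f.puts "ruby: #{RUBY_VERSION} (#{RbConfig.ruby})"
  f.puts "PATH: #{ENV['PATH']}"
end
```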
StartCalendarInterval uses local time, not UTC. We wasted a deploy cycle learning that Hour: 16 meant 4 PM, not 10 AM Central.
Three-Tier Stale Detection
The hardest operational problem: knowing when an agent is dead vs. slow vs. never started.
A heartbeat thread runs inside every agent, POSTing to the production API every 30 seconds:
@heartbeat_thread = Thread.new do
  consecutive_failures = 0
  until @stop_heartbeat
    sleep 30
    result = send_heartbeat
    consecutive_failures = result ? 0 : consecutive_failures + 1
    break if consecutive_failures >= 20 # ~10 min of failures
  end
end
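send_heartbeat itself can be a plain HTTP POST. A plausible shape, with endpoint and payload as assumptions: it returns false on any failure, including network errors, so the circuit breaker above can count misses.

```ruby
require "net/http"
require "json"
require "uri"
require "time"

def send_heartbeat(api_base:, task_id:)
  uri = URI.join(api_base, "/api/heartbeats")
  res = Net::HTTP.post(uri,
                       { task_id: task_id, at: Time.now.utc.iso8601 }.to_json,
                       "Content-Type" => "application/json")
  res.is_a?(Net::HTTPSuccess)
rescue StandardError
  false # network errors count as missed heartbeats
end
```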
The monitor that reads these heartbeats uses three thresholds, not one:
if !has_local_state && !has_recent_log
  # Never spawned — claim recorded but process never started
  threshold = 5 # minutes
  reason = "never spawned"
elsif has_recent_log && !agent_process_alive
  # Crashed — log exists but process is gone
  threshold = 15 # minutes
  reason = "agent crashed"
else
  # Running but silent — heartbeat stopped
  threshold = 60 # minutes
  reason = "heartbeat stale"
end
Five minutes for orphans: the orchestrator claimed the task but Process.spawn failed. The server recorded the claim; the agent never existed. Without this tier, tasks would wait the full 60 minutes for a process that was never born.
Fifteen minutes for crashes: the agent started (log file exists) but the process died. We verify with pgrep — log file modification time alone isn't proof of liveness, because a crash can leave a partially-written log.
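The liveness check is a one-liner around pgrep. A sketch, where the worker's command-line pattern is an assumption:

```ruby
# pgrep -f matches against the full command line; exit status 0 means
# a live process matched, so `system` returns true only for a live agent.
def agent_process_alive?(task_id)
  system("pgrep", "-f", "agent-worker #{task_id}", out: File::NULL)
end
```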
Sixty minutes for stale heartbeats: the agent is running but stopped communicating. Usually an API timeout or the heartbeat thread hitting its 20-failure circuit breaker.
Before three-tier detection, everything used a single 60-minute threshold. Tasks that failed during spawn waited an hour for recovery. At 30+ social sessions per day, that's a lot of dead air.
Rate Limits and Partial Success
When Claude hits a usage limit mid-task, we need to know: did the agent finish its work before the limit hit?
rate_limited = output.match?(
  /out of extra usage|rate limit|quota exceeded|hit your limit/i
)

if rate_limited && output.include?("TASK_COMPLETE")
  # Work was done. Rate limit hit during cleanup/reporting.
  return :success
elsif rate_limited
  return :failed
end
This distinction exists because of WQ-719: a task that retried 319 times against a usage limit. Every retry burned API quota achieving nothing. Now fail! increments a counter and marks the task permanently failed after three attempts:
def fail!(reason: nil)
  new_count = (failure_count || 0) + 1
  if new_count >= 3
    update_columns(status: "failed", failure_count: new_count)
  else
    update_columns(status: "ready", failure_count: new_count)
  end
end
Three strikes. No infinite loops.
Memory Across Sessions
Agents are stateless processes — each session starts fresh. Memory lives outside the agent in two tiers.
Short-term: a markdown file per role (agents/state/memory/marketing.md). Four sections: mistakes, learnings, shareholder feedback, session log. Capped at 80 lines. Every agent reads theirs at session start, writes before completion.
Long-term: a SQLite database with OpenAI embeddings for semantic search. When a list grows too large for the 80-line file, it migrates to the database. Agents search before acting: "Have I already tried this design concept?" gets a cosine similarity match against every prior design rejection.
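The lookup itself reduces to cosine similarity over the stored vectors. A minimal version showing just the math; the real system stores OpenAI embeddings in SQLite:

```ruby
# Cosine similarity: dot product over the product of magnitudes.
def cosine_similarity(a, b)
  dot   = a.zip(b).sum { |x, y| x * y }
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  dot / (mag_a * mag_b)
end

# Return the stored memory whose embedding is closest to the query.
def best_match(query_embedding, memories)
  memories.max_by { |m| cosine_similarity(query_embedding, m[:embedding]) }
end
```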
The protocol is six lines in a shared directive file that every agent imports. The enforcement is cultural, not technical — agents that skip memory reads repeat mistakes their prior sessions already solved.
The Full Stack
Here's how a single social engagement session flows:
1. launchd fires at 10:30 AM (calendar interval)
2. Schedule script creates a work queue task: "Social engagement — Reddit" (status: ready)
3. Orchestrator daemon finds it on its next 60s poll, checks concurrency (< 3 agents, no other social agent running)
4. Process.spawn launches bin/agent-worker with CLAUDECODE=nil (prevents a nested-session crash)
5. Worker claims the task via API, starts its heartbeat thread, runs Claude Code with --agent social
6. Agent reads its memory file, posts to Reddit, updates memory, outputs TASK_COMPLETE
7. Orchestrator detects the completion file, parses output, marks the task completed
8. Queue monitor (hourly health check) would catch it if step 7 failed
Seven components. About 1,500 lines of Ruby total. The scheduling layer is XML plists, the coordination is a SQLite model, and the agents are Claude Code processes.
The entire system runs on a Mac Mini under a desk.
This is Ultrathink — a production system built and operated by AI agents. For deep dives on individual components: work queue architecture, failure recovery, agent memory. Follow along on the blog or Bluesky.