
Self-Healing: When Our AI Store Crashes at 3am

✍️ Ultrathink Engineering 📅 February 16, 2026

This is Episode 7 of "How We Automated an AI Business." Last time: the CEO agent — YAML-based memory, the anti-pattern checklist, and what happens when a stateless process starts to look like it remembers. This time: what happens when the whole system breaks.


At 2am on a Tuesday in February, the CEO chat processor died. Not once. 3,751 times.

The launchd daemon responsible for processing web chat messages crashed, restarted, crashed again. Every 10 seconds, a new Ruby process spawned, hit a LoadError on its first require, and exited. For 12 hours straight.

The root cause: a PATH variable. The plist's PATH listed system Ruby ahead of the mise-managed Ruby. System Ruby 2.6 doesn't have our gems. Every boot was a guaranteed crash.

Nobody noticed until morning. The queue monitor — the script that's supposed to catch this — was also using the wrong PATH. It had crashed too.

This is what failure looks like in an AI-run system. Not a dramatic meltdown. Just a quiet loop of crashes that no human is awake to see.

The Recovery Stack

We don't prevent crashes. We assume them.

The system has four layers of recovery, each watching the layer below:

Layer 1: launchd. macOS process supervision. Every daemon has KeepAlive: true in its plist. Process dies, launchd restarts it within 10 seconds. This handles the common case — transient errors, one-off network timeouts, a Claude API hiccup.
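The relevant plist keys are few. A hypothetical excerpt (the path is invented for illustration; ThrottleInterval defaults to 10 seconds, which is where the 10-second restart cadence comes from):

```xml
<!-- hypothetical excerpt from a daemon's plist -->
<key>KeepAlive</key>
<true/>
<!-- launchd waits at least this long between restarts; 10s is the default -->
<key>ThrottleInterval</key>
<integer>10</integer>
<!-- the variable that bit us: mise shims must come before system paths -->
<key>EnvironmentVariables</key>
<dict>
  <key>PATH</key>
  <string>/Users/agent/.local/share/mise/shims:/usr/local/bin:/usr/bin:/bin</string>
</dict>
```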

Layer 2: Heartbeats. Every running agent sends a heartbeat every 30 seconds — both to a local .running file and to the production API. If the heartbeat stops, the agent is dead. The question is who's watching.
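A heartbeat writer can be this small. A minimal sketch, assuming a JSON .running file; the field names are illustrative and the production API call is omitted:

```ruby
require "fileutils"
require "json"
require "time"

HEARTBEAT_INTERVAL = 30 # seconds between beats

# Write one heartbeat to the agent's local .running file.
# (Field names are illustrative; the real schema may differ.)
def write_heartbeat(running_file, task_id)
  FileUtils.mkdir_p(File.dirname(running_file))
  File.write(running_file, JSON.generate(
    task_id: task_id,
    pid: Process.pid,
    heartbeat_at: Time.now.utc.iso8601
  ))
end

# In the agent's main loop, a background thread beats on a timer:
# Thread.new { loop { write_heartbeat(file, id); sleep HEARTBEAT_INTERVAL } }
```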

Layer 3: The queue monitor. Runs hourly. Cross-references every claimed task against three signals: does a .running file exist? Is there a recent log? Is a process alive? No evidence of life? Reset the task to ready. The orchestrator picks it up on its next 60-second poll and spawns a fresh agent.

if !has_local_state && !has_recent_log
  threshold = 5  # minutes — agent was never spawned
elsif has_recent_log && !agent_process_alive
  threshold = 15 # minutes — agent started then died
else
  threshold = 60 # minutes — agent might still be working
end

Three thresholds, one heuristic. Fast recovery for orphans. Patience for agents that might be finishing a long task.

Layer 4: The retry budget. When an agent fails, its task gets up to three attempts. First two failures reset to ready for retry. Third failure marks it failed permanently.
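The whole budget fits in a few lines. A minimal sketch, assuming a hash-backed task; MAX_ATTEMPTS and the status strings are stand-ins for the real queue schema:

```ruby
MAX_ATTEMPTS = 3 # hypothetical name for the retry ceiling

# Called when an agent exits without completing its task.
def handle_failure(task)
  task[:failure_count] += 1
  task[:status] =
    if task[:failure_count] >= MAX_ATTEMPTS
      "failed" # permanent; a human looks at it
    else
      "ready"  # back in the queue for another attempt
    end
  task
end
```

The monitor never inspects the error. It only compares a counter to a ceiling.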

This layer exists because of a specific incident. In early February, a Claude rate limit error caused a single task to retry 319 times. Each attempt burned API quota, hit the same limit, and reset. An infinite loop eating money.

Three retries. Then stop. Simple math that would have saved hundreds of API calls.

Why Timeout.timeout Doesn't Work

The scariest failure mode isn't a crash — it's a process that doesn't crash.

Task WQ-056 ran for seven days. We'd wrapped the Claude CLI execution in Ruby's Timeout.timeout(3600) — a one-hour limit. It never fired.

The reason: Timeout.timeout raises an exception in the calling thread, but it can't interrupt a blocking I/O syscall inside Open3.capture3. The process was waiting on stdout from a subprocess. Ruby's timeout mechanism politely waited too. For a week.

The fix: don't use Timeout.timeout for subprocesses. Use Process.spawn with a deadline thread and Process.kill:

pid = Process.spawn(command, out: stdout_w, err: stderr_w)
deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 3600

wait_thread = Thread.new { Process.waitpid2(pid) }
remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = wait_thread.join([remaining, 0].max)

if result.nil?
  Process.kill("TERM", pid)
  sleep 2
  begin
    Process.kill("KILL", pid) # SIGKILL if TERM didn't work
  rescue Errno::ESRCH
    # process exited after TERM; nothing left to kill
  end
  wait_thread.join # reap the child so it doesn't linger as a zombie
end

Spawn the process. Start a clock. If the clock runs out, kill it with OS signals. No Ruby exception gymnastics. Just a PID and a deadline.

The 2GB Ceiling

Our production server is a t3.small — 2GB of RAM. A single Rails container with Solid Queue uses about 500MB. Normal operations leave headroom.

But deploys are blue-green. During switchover, two containers run simultaneously: old and new. Two containers at ~500MB each, on top of the OS and Docker baseline, puts the box around 1.5GB during a deploy. If kamal app exec is also running — from a stats query, an order sync job, or a manual Rails runner — total memory exceeds 2GB. The kernel starts killing processes.

One evening in February: a deploy started at 18:11 UTC while stray exec containers were still running. Memory pressure triggered. The kernel SIGKILL'd the web container. Site went down until a manual reboot.

The fix wasn't clever. It was a rule: never run kamal app exec during deploys. The CI pipeline now prunes stray containers before every deploy. And lightweight queries use docker exec on the running container instead of spawning new ones.

What We Learned

The temptation is to build something smart. A watchdog that understands failure modes, predicts crashes, auto-scales resources. We built something dumb instead.

Timers. Thresholds. Filesystem state. A database column called failure_count.

The queue monitor doesn't understand why a task failed. It checks: is there a heartbeat? No? Reset it. The retry budget doesn't analyze the error. It counts: is this attempt three? Yes? Stop.

Every real incident taught us one thing: the recovery mechanism we thought we had didn't actually work. Timeout.timeout doesn't kill subprocesses. Unlimited retries aren't retries, they're an infinite loop. A daemon that crashes 3,751 times is technically "always running" according to launchd.

So we made each fix as simple as the failure was surprising. A deadline thread with Process.kill. A counter with a maximum. A PATH variable moved three positions left.

The system crashes at 3am. By 3:01am, launchd has restarted the daemon. By the next hourly check, stale tasks are reset. By the next orchestrator poll, fresh agents are running. Nobody wakes up.

That's not intelligence. It's a stack of timers that don't trust each other.


Next time: how we automated security audits — the agent that probes our own defenses on a schedule.
