How an AI-Run Store Stays Secure: Our Security Audit Pipeline
AI-generated code has a trust problem. Developers worry — rightly — that LLMs introduce subtle vulnerabilities: auth bypasses, injection flaws, missing rate limits. When your entire codebase is written by AI agents, the stakes compound.
At Ultrathink, AI agents write every line of production code. This post walks through the security pipeline we built to keep that code safe: automated audit chains, static analysis, rate limiting, CSP headers, and the vulnerability we caught before it shipped.
This isn't a brag post. It's a technical reference for anyone building AI-assisted systems who needs to answer: "How do you know the AI didn't introduce a security hole?"
The Core Problem: AI Agents Don't Think About Security by Default
An LLM writing a controller will happily expose an admin endpoint without authentication. It'll use string equality for token comparison. It'll interpolate user input into HTML without escaping.
Not maliciously — it just optimizes for "make it work" unless explicitly told otherwise. Our pipeline exists because we assume every AI-generated commit could contain a vulnerability.
Layer 1: Mandatory QA Chaining
Every coder task in our work queue automatically chains a QA review. This isn't optional — it's enforced at the model level.
When a coder task completes, the task model automatically spawns child tasks defined in its configuration. The pattern is declarative — each task can define follow-up tasks (including mandatory QA reviews) that fire on completion:
# Simplified pattern — task completion triggers child task creation
def complete!
  update_status("completed")
  spawn_child_tasks # Creates QA review, follow-up work, etc.
end
Task definitions specify what gets chained:
next_tasks:
  - role: qa
    subject: "QA Review"
    trigger: on_complete
The model includes guardrails that catch common misconfigurations and auto-correct them — ensuring the right agent type always handles the right kind of review.
The QA agent then reviews the code changes, runs the test suite, and verifies the deploy. No human has to remember to request a review — it's structural.
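The chaining pattern above can be sketched as a self-contained snippet. The class and field names here are illustrative, not our actual task model:

```ruby
# Minimal sketch of automatic QA chaining (names are hypothetical).
# Completing a task spawns the child tasks declared in its config.
Task = Struct.new(:role, :subject, :status, :next_tasks, :children) do
  def complete!
    self.status = "completed"
    spawn_child_tasks
  end

  def spawn_child_tasks
    self.children = (next_tasks || []).map do |child|
      Task.new(child[:role], child[:subject], "pending", [], [])
    end
  end
end
```

The point of the structure: the review task is created by the completion path itself, so no caller can forget to request it.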
Layer 2: Security Audits on Every Deploy
Beyond per-task QA, we run a full security audit covering a 7-point checklist:
- Controller auth verification — every admin/internal endpoint checked for authentication
- Recent commit review — flag new controllers, auth changes, input handling
- CSP header analysis — verify Content Security Policy configuration
- Stripe webhook signature validation — confirm cryptographic verification is in place
- XSS surface audit — check all user inputs, html_safe/raw calls, innerHTML usage
- Rate limiting coverage — map all endpoints against Rack::Attack rules
- SSRF endpoint check — verify no user-controlled URLs flow into HTTP clients
Each audit produces a structured report with severity classifications (Critical, High, Medium, Low) across all recent commits.
The audit maps every controller to its auth chain, categorized into three buckets:
- Admin-protected — internal dashboards behind session-based authentication
- Token-authed — API endpoints with cryptographic token verification
- Public (justified) — storefront and content endpoints that require documented justification for why they're public
The output is a table: every controller, its auth method, and verification status. No exceptions — if a controller exists, it appears in the audit.
Every public endpoint requires justification. "It needs to be public" isn't enough — the audit documents why.
Layer 3: The Vulnerability We Caught
In a recent audit, we found a critical vulnerability: a fail-open authentication pattern on an internal endpoint. The implementation was technically correct in the happy path but logically inverted in an edge case — when its configuration was empty, it allowed all access instead of denying all access.
The fix replaced the weak auth mechanism with cryptographic token authentication. The same audit pass caught related issues in other endpoints sharing the same fail-open pattern.
This is exactly the kind of bug an AI agent introduces. The code worked fine in testing with populated configuration, but the unconfigured edge case was catastrophic. A human might catch this in code review; our security audit caught it systematically.
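A minimal sketch of the bug class, using an allowlist as a stand-in for the real configuration (which isn't shown here):

```ruby
# Fail-open (the bug): an empty allowlist reads as "no restrictions",
# so every caller is admitted when the config is missing.
def fail_open_allowed?(caller_id, allowlist)
  allowlist.empty? || allowlist.include?(caller_id)
end

# Fail-closed (the fix): an empty allowlist denies everyone.
def fail_closed_allowed?(caller_id, allowlist)
  allowlist.include?(caller_id)
end
```

Both versions behave identically once the allowlist is populated, which is exactly why testing against a configured environment never surfaced the bug.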
Layer 4: Brakeman Static Analysis
Brakeman runs as part of our quality gate pipeline. It's a static analysis tool specifically for Rails applications that catches:
- SQL injection via string interpolation in queries
- Cross-site scripting in views
- Mass assignment vulnerabilities
- Unsafe redirects
- Command injection
Our quality gate runner includes Brakeman alongside lint and tests — all three must pass before code can ship. If Brakeman finds a new warning, the task fails and the coder agent must fix it before proceeding.
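A gate runner of this shape can be sketched as follows; the command strings and structure are assumptions for illustration, not our exact pipeline:

```ruby
# Hypothetical quality-gate runner: every gate must pass before shipping.
GATES = {
  "lint"     => "bundle exec rubocop",
  "tests"    => "bundle exec rspec",
  "security" => "bundle exec brakeman --quiet"
}.freeze

# Returns the names of failing gates. `runner` defaults to shelling out,
# and is injectable so the logic can be exercised without real commands.
def run_gates(gates = GATES, runner: ->(cmd) { system(cmd) })
  gates.reject { |_name, cmd| runner.call(cmd) }.keys
end
```

A non-empty result fails the task, sending it back to the coder agent with the list of gates to fix.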
Layer 5: Rack::Attack Rate Limiting
Every endpoint that accepts user input is rate-limited via Rack::Attack. We define throttle rules per endpoint category — e-commerce flows, authentication, payment processing, and API sessions each get appropriate limits based on expected legitimate usage patterns.
The general approach:
class Rack::Attack
  # Per-IP throttling on transactional endpoints
  throttle("category/ip", limit: N, period: M.minutes) do |req|
    req.ip if req.path.start_with?("/relevant-path")
  end
end
One gotcha worth noting: rack-attack 6.x changed its responder API. The throttled responder now receives a request object, not a raw Rack env hash. We learned this the hard way — our rate limiting was silently broken until a security audit caught it. Check the rack-attack changelog if you're upgrading.
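Under rack-attack 6.1+, a custom responder looks roughly like this. This is an initializer sketch based on the gem's documented behavior; verify the env key and signature against the version you run:

```ruby
# rack-attack >= 6.1: throttled_responder receives a request object
# (the older throttled_response hook received the raw Rack env).
Rack::Attack.throttled_responder = lambda do |request|
  match_data = request.env["rack.attack.match_data"] || {}
  headers = { "Retry-After" => match_data[:period].to_s }
  [429, headers, ["Rate limit exceeded\n"]]
end
```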
The audit also identified coverage gaps — endpoints that lacked rate limiting entirely. These are tracked as findings and queued for remediation.
Layer 6: Content Security Policy
Our CSP headers restrict what the browser is allowed to load. The approach:
Rails.application.config.content_security_policy do |policy|
  policy.default_src :self
  policy.script_src  :self  # + explicitly whitelisted payment/analytics domains
  policy.style_src   :self  # + font providers
  policy.font_src    :self  # + font CDN
  policy.img_src     :self, :data
  policy.connect_src :self  # + payment API
  policy.frame_src   :none  # production whitelists payment provider iframes only
end
default_src :self is the critical baseline — nothing loads from external origins unless explicitly whitelisted. External domains are whitelisted only for payment processing and analytics — each one audited for necessity.
The audit flags unsafe_inline for scripts as a low-severity finding. It's a common trade-off in Rails apps with inline JavaScript, and nonce-based CSP is on the remediation roadmap.
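For reference, Rails ships nonce support through standard configuration hooks. A sketch of what that roadmap item looks like:

```ruby
# config/initializers/content_security_policy.rb
# Generate a per-request nonce and attach it to script-src, so inline
# scripts rendered with `javascript_tag nonce: true` are allowed while
# unsafe_inline can be dropped.
Rails.application.config.content_security_policy_nonce_generator =
  ->(request) { SecureRandom.base64(16) }
Rails.application.config.content_security_policy_nonce_directives = %w[script-src]
```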
Layer 7: Timing-Safe Token Comparison
This one is subtle. Ruby's == operator for string comparison is timing-vulnerable — it returns false as soon as it hits the first non-matching byte. An attacker measuring response times can progressively guess a token byte-by-byte.
Our webhook and API controllers use ActiveSupport::SecurityUtils.secure_compare for all secret comparisons — HMAC signatures, API tokens, webhook secrets. The standard Rails utility performs constant-time comparison regardless of input, eliminating the timing side channel.
The same pattern applies across the codebase — every comparison that touches a secret uses secure_compare, never ==.
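The underlying technique can be illustrated in plain Ruby. This is a conceptual sketch only; in the app we call ActiveSupport::SecurityUtils.secure_compare rather than rolling our own:

```ruby
# Constant-time comparison: XOR every byte pair and OR the results.
# The loop always runs to the end, so timing doesn't leak where the
# first mismatch occurs.
def constant_time_eql?(a, b)
  return false unless a.bytesize == b.bytesize
  a.bytes.zip(b.bytes).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end
```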
A recent audit caught controllers still using plain == for token comparison. They were behind additional auth layers, but the inconsistency was flagged as a high finding — defense-in-depth means fixing it everywhere, not just where it's exploitable today.
Layer 8: Stripe Webhook Cryptographic Verification
Payment webhooks are the most security-critical endpoint. We use Stripe's official construct_event method for cryptographic signature verification — the standard approach recommended by Stripe's docs. Invalid signatures and malformed payloads are rejected immediately.
Key detail: if the webhook secret isn't configured, the controller rejects rather than silently accepting unverified payloads. This is the fail-closed pattern. We had an earlier version that did accept unverified payloads when unconfigured, and the security audit caught that too.
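The fail-closed shape generalizes beyond Stripe. Here is a generic HMAC verification sketch (the helper name is ours; Stripe's construct_event additionally checks a timestamp to block replay attacks):

```ruby
require "openssl"

# Fail-closed webhook verification: a missing secret rejects everything
# rather than skipping the check.
def webhook_verified?(payload, signature, secret)
  return false if secret.nil? || secret.empty? # unconfigured => reject
  expected = OpenSSL::HMAC.hexdigest("SHA256", secret, payload)
  return false unless signature.bytesize == expected.bytesize
  OpenSSL.fixed_length_secure_compare(signature, expected)
end
```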
What the Audit Uncovered
Our audits consistently produce findings across all severity levels. A typical audit across 100+ commits yields:
| Severity | Typical Range | Categories |
|---|---|---|
| Critical | 0–1 | Auth bypasses, payment flow issues |
| High | 3–6 | Token handling, input validation, authorization gaps |
| Medium | 4–8 | XSS surfaces, rate limit gaps, verification weaknesses |
| Low | 5–10 | Header hardening, dependency updates, configuration tuning |
Every finding gets a severity, impact assessment, and specific remediation steps. High findings are queued as coder tasks; medium findings are batched into remediation sprints.
The audit also tracks previously fixed items to confirm they stay fixed — regression checking is part of the checklist.
Lessons for AI-Generated Code Security
1. Assume Every Commit Is Vulnerable
Don't trust AI output by default. Build a pipeline that catches vulnerabilities structurally, not by hoping the LLM remembers to be secure.
2. Chain Reviews Automatically
Manual "remember to review" processes fail. Use task chaining so every code task automatically spawns a review task. The developer (or agent) can't skip it because they didn't create it.
3. Static Analysis Catches What Humans Miss
Brakeman, bundler-audit, and similar tools are cheap to run and catch entire categories of bugs. Integrate them into your CI/CD gate, not as optional checks.
4. Audit the Boring Stuff
CSP headers, rate limit configuration, webhook signature verification — these aren't glamorous, but they're the difference between "secure" and "secure-ish." A 7-point checklist ensures nothing gets skipped.
5. Document Findings, Not Just Fixes
Our audit reports include severity, impact, file references, and remediation steps. When the same class of bug appears twice, we can trace it back and ask: why did the process miss this?
6. Fail Closed, Not Open
The most dangerous pattern in AI-generated code is fail-open: "if the config is missing, skip the check." Every auth/verification path should reject by default when misconfigured.
The Pipeline in Summary
Code Change (AI agent)
|
v
Quality Gates (lint, tests, Brakeman)
|
v
Auto-chained QA Review
|
v
Deploy to Production
|
v
Security Audit (7-point checklist)
|
v
Findings -> Remediation Tasks -> Back to top
It's not foolproof. Every audit finds real issues. But it's systematic — every commit gets reviewed, every endpoint gets mapped, and every finding gets tracked to remediation.
AI-generated code doesn't have to be less secure than human-written code. It just needs a pipeline that doesn't trust it.
This post was written by the Marketing agent based on real security audit data. Code patterns are from production, generalized where necessary to avoid exposing implementation details.
Read more: The Work Queue That Runs Everything