Seventy Percent of Everything Gets Rejected
This is Episode 3 of "How We Automated an AI Business." Last time: the work queue — a state machine on a database table that coordinates ten agents. This time: what happens when the agents produce garbage.
The work queue solved coordination. Agents claim tasks, send heartbeats, chain QA reviews. The machinery runs itself.
But machinery doesn't have taste.
In our second week, the designer agent produced ten sticker designs in one session. Nine were text on a colored circle — monospace font, flat shape, zero illustration. The kind of thing a developer sees and immediately files under "AI slop." One was a kawaii rubber duck with debugging goggles. That one shipped.
The rejection rate was 90%. And that was the good session — at least one design passed. Other batches produced nothing usable at all.
The Taste Problem
AI image generators are remarkable at producing output. They are terrible at knowing when that output is mediocre. An agent given the task "design a sticker for vim users" will produce something in seconds. It will have text. It will have colors. It will be technically complete. And a human will look at it and feel nothing.
The problem isn't capability — it's that "technically complete" and "worth buying" are separated by a gap no prompt can bridge. You can tell the model to make something "fun" or "eye-catching" or "the kind of thing developers put on laptops." It will try. The output will be competent. Competent doesn't sell.
So we built automated gates. Not to replace taste — to approximate its most obvious signals and catch the worst offenders before they reach the catalog.
Gate 1: bin/design-qa — The Technical Filter
Every design artifact runs through bin/design-qa before upload. It's a Ruby script that shells out to Python (Pillow) and ImageMagick to check things a human shouldn't have to squint at:
Transparency check. Apparel designs must print on fabric, not on a white rectangle. The script extracts the alpha channel and measures mean opacity. If an image has no transparent pixels, it fails — a t-shirt with a solid background will print as a visible box on the shirt.
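Here's roughly what that check looks like in Pillow. A minimal sketch: the function name and the log strings are mine, not the actual bin/design-qa internals.

```python
from PIL import Image, ImageStat

def check_transparency(path):
    """Fail any design with no transparent pixels; mean opacity is just for logging."""
    img = Image.open(path).convert("RGBA")
    alpha = img.getchannel("A")
    mean_opacity = ImageStat.Stat(alpha).mean[0] / 255   # 1.0 means fully opaque
    has_transparency = any(a < 255 for a in alpha.getdata())
    if not has_transparency:
        return False, f"solid background, mean opacity {mean_opacity:.2f}"
    return True, f"ok, mean opacity {mean_opacity:.2f}"
```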
Dimension check. Each product type has a spec: t-shirts are 4500x5400px, mugs are 2700x1050px, stickers are square at 1664px. The script validates aspect ratio and absolute dimensions against tolerances. Wrong dimensions mean the design gets cropped or stretched in print.
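The dimension gate is even simpler. A sketch using the specs above; the 2% tolerance is an assumption, not the script's real number.

```python
# Specs from the post; the tolerance value is illustrative.
PRINT_SPECS = {
    "tshirt":  (4500, 5400),
    "mug":     (2700, 1050),
    "sticker": (1664, 1664),
}

def check_dimensions(img, product_type, tolerance=0.02):
    """Validate both aspect ratio and absolute size against the product spec."""
    want_w, want_h = PRINT_SPECS[product_type]
    w, h = img.size
    aspect_off = abs(w / h - want_w / want_h) / (want_w / want_h)
    size_off = max(abs(w - want_w) / want_w, abs(h - want_h) / want_h)
    return aspect_off <= tolerance and size_off <= tolerance
```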
Visual complexity analysis (stickers only). This is the slop detector. The script samples pixels across the image, buckets colors into 32-level bins, and counts unique color groups. Then it runs gradient detection — comparing adjacent pixels to measure edge density.
- colors < 25 AND edges < 8% → SLOP (text on flat shape)
- colors < 40 AND edges < 12% → LOW COMPLEXITY (needs illustration)
- colors > 100 AND edges > 20% → illustrated, detailed
A rubber duck with goggles: 193 color buckets, 34% edge density. Text reading "git commit" on a green circle: 18 color buckets, 6% edges. The numbers don't lie — flat shapes with monospace text have a distinctive fingerprint, and it's trivially detectable.
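The detector itself fits in a screen of Python. A rough sketch: the sampling stride and per-pixel edge threshold are illustrative parameters, not the script's actual values.

```python
def complexity_stats(img, step=8, edge_threshold=30):
    """Count coarse color buckets and measure edge density on a sampled grid."""
    rgb = img.convert("RGB")
    w, h = rgb.size
    px = rgb.load()
    buckets = set()
    edges = samples = 0
    for y in range(0, h, step):
        for x in range(0, w, step):
            r, g, b = px[x, y]
            buckets.add((r // 32, g // 32, b // 32))   # 32-level bins per channel
            if x + step < w:                           # compare against the next sample to the right
                r2, g2, b2 = px[x + step, y]
                samples += 1
                if abs(r - r2) + abs(g - g2) + abs(b - b2) > edge_threshold:
                    edges += 1
    return len(buckets), (edges / samples if samples else 0.0)

def classify(colors, edge_density):
    if colors < 25 and edge_density < 0.08:
        return "SLOP"
    if colors < 40 and edge_density < 0.12:
        return "LOW COMPLEXITY"
    if colors > 100 and edge_density > 0.20:
        return "ILLUSTRATED"
    return "OK"
```

Run the duck and the green circle through classify and you get ILLUSTRATED and SLOP respectively, exactly the split a human reviewer would make.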
Background removal verification. When converting designs between product types — sticker to mug, for instance — we remove backgrounds using edge-based flood fill. The script compares opaque pixel counts before and after: if more than 10% of the artwork's pixels vanished, the removal was destructive. This catches the common failure mode where threshold-based removal eats internal black outlines.
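The verification step reduces to counting opaque pixels twice. A sketch, assuming Pillow images and the 10% threshold described above:

```python
def removal_was_destructive(before, after, max_loss=0.10):
    """Flag background removal that ate more than 10% of the artwork's opaque pixels."""
    def opaque_count(img):
        return sum(1 for a in img.convert("RGBA").getchannel("A").getdata() if a > 0)
    before_px = opaque_count(before)
    after_px = opaque_count(after)
    if before_px == 0:
        return True   # nothing to preserve means something already went wrong upstream
    return (before_px - after_px) / before_px > max_loss
```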
Gate 2: The Quality Chain — Designer → Product → QA
Technical checks catch dimensional errors and flat shapes. They don't catch ugly. For that, we chain three agents.
Every new product passes through a three-agent pipeline. The designer creates the artifact and self-rates it on a 1-5 scale. The product agent applies a taste gate before uploading: "Would a developer spend their own money on this?" The QA agent reviews the final product page — mockup images, positioning, text legibility at print size.
Any agent in the chain can kill a design. The rejection target is 70-90% of concepts. If most designs pass, the bar is too low.
Every product syncs to the database as hidden: true. It stays invisible until the full chain passes and an agent explicitly sets hidden: false. No design reaches the storefront by accident.
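A minimal sketch of that default, assuming a SQL-backed products table with hypothetical column names. The important property is that the sync path can never publish; only the explicit post-QA call flips the flag.

```python
def sync_product(db, product):
    # Upserts always arrive hidden; a re-sync never touches the hidden flag.
    db.execute(
        """
        INSERT INTO products (external_id, title, hidden)
        VALUES (?, ?, TRUE)
        ON CONFLICT (external_id) DO UPDATE SET title = excluded.title
        """,
        (product["external_id"], product["title"]),
    )

def publish(db, external_id):
    # Called only after the designer -> product -> QA chain has passed.
    db.execute("UPDATE products SET hidden = FALSE WHERE external_id = ?", (external_id,))
```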
Gate 3: The Slop Checklist
Before any design ships, the responsible agent runs through a checklist. Not in their head — literally written in our design philosophy doc and enforced by convention:
- Can you name the specific meme or joke it references?
- Is text three lines or fewer, readable at print size?
- No AI-mangled characters or typos?
- Would you wear/display this without embarrassment?
- Stickers: has an illustrated graphic element, not just text?
- Stickers: single connected shape for die-cut?
If any answer is no, it doesn't ship. "Fine" or "okay" is a no. The bar is "someone sees this and reaches for their wallet."
What This Actually Catches
The catalog started at 72 items. After running the quality pipeline for a week, it's 36. Half the catalog was killed retroactively — designs that passed initial review but failed the stricter bar we developed after seeing what "AI slop" actually looks like in production.
Specific patterns we now auto-reject:
- Text-on-circle stickers. The original sin. If the design would work as a tweet, it's not a sticker.
- Poster-layout stickers. Illustration on top, text in a rectangle below. A sticker die-cuts to the shape of the graphic — poster layout defeats that.
- Black-rectangle apparel. AI generates images on backgrounds. Without transparency enforcement, you get a visible box printed on the shirt.
- Mangled text. AI image generators reliably misspell words over 6 characters. "Internal Server Error" comes back as "Interral Servor Eror." We now use Pillow for all text rendering and reserve AI generation for illustration only (sketched below).
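That last swap, rendering text with Pillow instead of letting the model draw letters, looks roughly like this; the font path, sizes, and colors are placeholders.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_layer(text, size=(1664, 1664), font_path="JetBrainsMono-Bold.ttf"):
    """Render sticker text on a transparent canvas so spelling is exact."""
    img = Image.new("RGBA", size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 160)
    # Center the text using the bounding box Pillow reports.
    left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
    x = (size[0] - (right - left)) / 2 - left
    y = (size[1] - (bottom - top)) / 2 - top
    draw.text((x, y), text, font=font, fill=(20, 20, 20, 255))
    return img
```

The text layer can then be composited over the generated illustration, so the model never has to spell anything.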
The Uncomfortable Part
Automated quality gates are a filter, not a source of quality. They prevent the worst output from shipping. They don't generate the best output.
The rubber duck sticker passed every automated check. It also happened to be genuinely good — a character with personality, a visual joke that works at sticker size, clean enough to print well. Nothing in our pipeline made it good. The designer agent produced it and got lucky, or got it right, or some combination.
The gap between "passes all gates" and "someone wants to buy this" is still mostly luck and iteration. We reject 70% and ship the remaining 30%, and some of that 30% still isn't great. The gates just ensure none of it is embarrassing.
Three scripts, a checklist, and an honest question: would someone spend their own money on this? That's the whole quality system. It's crude. It works better than having no system at all.
Next time: what happens when AI agents try to build community on Reddit — karma walls, automod rejections, and why 105 comments produced zero sales. Episode 4 coming soon.