
2026-04-24 / 12 MIN READ

The image feedback ledger that became a prompt taxonomy

Across 21 review rounds and roughly 500 generations, a small JSON ledger grew into an AI image prompt taxonomy. Here is what it caught and what it missed.

Twelve months ago I set up a dumb JSON file to remember which AI image prompts I had approved and which I had rejected. It became the most valuable piece of creative infrastructure in the whole image pipeline. After 21 review rounds and close to 500 generated images, the file stopped being a memory aid and started acting like a taxonomy: here are the prompts my model can actually render, here are the ones it cannot, and here are the cues that turn a MAYBE into a GOOD on the reroll.

This is the retrospective. What I shipped, what worked, what didn't, and where the whole thing is still wrong.

What I shipped (the ledger and the tooling around it)

The ledger is a single JSON file at .image-feedback-ledger.json. One top-level slugs object keyed by article slug, each with a history array of {round, verdict, note} entries plus a cues_to_preserve list and a status field. As of this writing it holds 106 slugs and 142 verdict entries across 21 review rounds. The ledger is one component in a larger pipeline; the full wiring between the Mac prompt desk and the Windows GPU box is covered in the hub post.
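The post describes the shape but not the file itself, so here is a minimal sketch of what one slug's entry might look like under that schema. The slug, notes, and cue text are taken from examples later in the post; the `status` value is an assumption.

```python
import json

# Hypothetical ledger contents matching the described schema:
# top-level "slugs" object -> per-slug {history, cues_to_preserve, status}
ledger = {
    "slugs": {
        "capi-purchase-event-hash": {
            "history": [
                {"round": 3, "verdict": "MAYBE",
                 "note": "I want these shapes to take up more of the frame"},
                {"round": 7, "verdict": "GOOD",
                 "note": "frame fill fixed"},
            ],
            "cues_to_preserve": [
                "tight macro zoom so the hero fills at least "
                "two-thirds of the frame"
            ],
            # "approved" is an assumed status value, not confirmed by the post
            "status": "approved",
        }
    }
}

# The whole thing round-trips through json, which is all the persistence
# layer needs to do.
serialized = json.dumps(ledger, indent=2)
```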

[Figure: ledger category taxonomy, 500 reviews across 21 rounds. Works: atmospheric landscape (51% hit rate), monolith in nature (25%), dune interior (39%), iridescent primitive (23%). Never works: spaceship (0 of 14 good), hooded human (2 of 15), dune vehicle (0 of 16), looks-like metaphor.]

The two stacks of the taxonomy as pulled from the ledger: simple single-subject forms on the left, multi-element or metaphor-heavy prompts on the right. Hit rates are from a single operator's Z-Image Turbo reviews, rounds 8 through 18.

Three functions do most of the work. record_verdict writes a new entry and runs the note through a regex cue extractor that derives short reusable prompt fragments. apply_cues takes the base prompt for a slug and appends the preserved cues before the next render. parse_verdict_paste reads a block I paste from my phone and returns a list of structured entries ready to record.
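The post names `record_verdict` and `apply_cues` but does not show their bodies, so this is a minimal sketch under the schema described above. The two `CUE_RULES` patterns are illustrative stand-ins for the real rule pack (which grew to about thirty rules), built from note phrases quoted later in the post.

```python
import re

# Illustrative cue rules only: (regex over the free-text note, reusable
# prompt fragment to preserve). The real pack has ~30 of these.
CUE_RULES = [
    (r"take up more of the frame|doesn'?t take up enough",
     "tight macro zoom so the hero fills at least two-thirds of the frame"),
    (r"too centered",
     "off-center composition, subject on a rule-of-thirds line"),
]

def extract_cues(note):
    """Derive short reusable prompt fragments from a verdict note."""
    return [cue for pat, cue in CUE_RULES if re.search(pat, note, re.I)]

def record_verdict(ledger, slug, round_no, verdict, note):
    """Append a verdict entry and fold any extracted cues into the slug."""
    entry = ledger["slugs"].setdefault(
        slug, {"history": [], "cues_to_preserve": [], "status": "pending"})
    entry["history"].append(
        {"round": round_no, "verdict": verdict, "note": note})
    for cue in extract_cues(note):
        if cue not in entry["cues_to_preserve"]:
            entry["cues_to_preserve"].append(cue)
    if verdict == "GOOD":
        entry["status"] = "approved"

def apply_cues(ledger, slug, base_prompt):
    """Append the preserved cues to the base prompt before the next render."""
    cues = ledger["slugs"].get(slug, {}).get("cues_to_preserve", [])
    return ", ".join([base_prompt] + cues)
```

The key design point survives even in this toy version: the note is the source of truth, and the cue list is just a lossy regex projection of it.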

The loop runs like this. On my Windows box the dGPU renders a round of heroes overnight. In the morning I open a review page on my phone, tap through each image, and dictate a one-line verdict per slug in the paste format: - slug [GOOD|MAYBE|BAD] -- free text. Back at the Mac I paste the block into the tool; it parses, records, extracts cues, and saves. The next generation cycle picks up the cues and reshapes the prompt. The Mac side of this loop is the same desk where I run a short AI review block first thing every day, which is why adding image triage cost me no new habit.
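The paste format is strict enough to pin down with one regex. This is a sketch of what `parse_verdict_paste` might look like, not the actual implementation; the strict-brackets, strict-double-dash behavior described later in the post falls out of the pattern naturally, because malformed lines simply fail to match and are skipped.

```python
import re

# One verdict per line: "- slug [GOOD|MAYBE|BAD] -- free text"
LINE_RE = re.compile(
    r"^-\s+(?P<slug>[a-z0-9-]+)"      # kebab-case article slug
    r"\s+\[(?P<verdict>GOOD|MAYBE|BAD)\]"  # bracketed three-bucket verdict
    r"\s+--\s+(?P<note>.*)$"          # double dash, then the dictated note
)

def parse_verdict_paste(block):
    """Turn a phone paste block into structured entries; skip bad lines."""
    entries = []
    for line in block.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            entries.append(m.groupdict())
    return entries
```

A line missing its brackets or double dash just does not parse, which surfaces dictation damage at paste time instead of polluting the ledger.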

Why a ledger instead of vibes. A few hundred images is already too many to hold in working memory. My MAYBE verdict on round 3 of capi-purchase-event-hash said "I want these shapes to take up more of the frame." By round 7 I had forgotten I ever said that. The JSON remembered. The cue extractor tagged the slug with "tight macro zoom so the hero fills at least two-thirds of the frame" and the next generation carried that note forward automatically. The generation itself runs through a safety wrapper that keeps the Mac from swap-thrashing under concurrent Flux loads, so rounds can queue deep without me babysitting the laptop.

What the AI image prompt taxonomy says works

Across the ledger, the categories that landed at a usable hit rate all share a shape: one subject, clear placement, abstract enough that the model does not have to interpret a metaphor.

Atmospheric single-subject fantasy landscapes hit 77 of 150, a 51% rate. The winners were things like aurora tundra, salt flats mirror, slot canyon, obsidian lava field, ice chasm depth, valley of standing stones, red mesa citadel. No people, no vehicles, no architecture beyond a dominant natural form.

Monoliths landed at 27 of 110, a 25% rate overall but tightly clustered. Monoliths in nature with a single earthly placement worked (mesa top, cliff edge, glacier peak, mirror lake). Multi-monolith formations worked (triangle, ring, parade line, pyramidal stack). Abstract voids worked (aurora void, wormhole opening, liquid mirror plane). Everything else in the monolith category underperformed.

Dune came in at 26 of 66, a 39% rate, splitting cleanly into three buckets: desert landscapes, architectural interiors like atreides palace and sietch interior pool, and planetary orbital views. None of the Dune winners included vehicles. None had more than one subject.

Iridescent glass macros (the brand aesthetic I use for hero images on this site) sat at 7 of 30, a 23% hit rate, with the same pattern underneath. Simple geometric primitives won: single cube, triangular prism, flat horizontal slab, thin layer stack, grove of spikes. Complex or organic forms lost. The glass-macro result echoes something I found when comparing the prompt behaviors of Z-Image Turbo and Flux Schnell directly: Z-Image prefers the primitive, Flux tolerates more surface drama.

The common thread, stated as bluntly as I can state it. The model renders one thing well. When I give it one thing to render, it renders. When I give it a metaphor or a diorama or a small-human-in-a-big-scene, it renders something that looks AI-generated in the bad way.

What the taxonomy says never works

The negative side of the taxonomy is more useful than the positive side. Knowing what to avoid saves batch time.

Spaceships: 0 for 14. Absolute fail across every variant I tried: starships in orbit, dropships in atmosphere, frigates in dock, abandoned generation ships. Never prompt Z-Image Turbo for a spacecraft as the primary subject.

Hooded human figures in portals or caves: 2 for 15. The model cannot hold a small human silhouette and complex portal geometry at the same time; one of them always comes out warped.

Dune vehicles: 0 for 16. Ornithopters, frigates, dropships, heighliners, harvesters, convoys. All bad. The moment the prompt asks for a vehicle the render loses cohesion.

"Looks-like" metaphor prompts: dragon-spine ridge, kraken island, serpentine reef, sky whale skeleton, leviathan skeleton. The model reads the animal name and renders the literal animal instead of a ridge or island that evokes it. Banned until a different model handles it better.

Small architectural specifics on distant cliffs: cliffside monastery network, bridged monastery peaks, hidden mountain temple, glacier temple ruins. Small buildings on large geography do not resolve at 1024px. Either the buildings get smeared or the geography does.

What worked in the ledger itself, and what didn't

The review-on-the-phone flow worked. The format is - slug-name [GOOD|MAYBE|BAD] -- free text note. I can tap through 40 images in five minutes while waiting for coffee, and the paste block parses clean because the regex is strict about the brackets and the double dash.

The regex cue extractor worked better than I expected. The rules started at eight patterns in round 1 and grew to about thirty by round 6 as I noticed recurring notes. "Doesn't take up enough of the frame" became a cue. "It's too centered" became a cue. When I see a new phrase repeatedly in MAYBE notes I add a rule, and the next generation carries the cue forward on every affected slug without me re-editing prompts.
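Noticing "a new phrase repeatedly in MAYBE notes" is itself automatable. This helper is hypothetical, not part of the actual tooling: it counts recurring bigrams across notes so candidate cue rules surface themselves instead of relying on the operator's memory.

```python
from collections import Counter

def candidate_phrases(notes, min_count=3):
    """Surface recurring two-word phrases in verdict notes as candidates
    for new cue-extractor rules."""
    counts = Counter()
    for note in notes:
        words = note.lower().split()
        # count adjacent word pairs (bigrams) per note
        for a, b in zip(words, words[1:]):
            counts[f"{a} {b}"] += 1
    return [phrase for phrase, n in counts.most_common() if n >= min_count]
```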

Per-slug memory persistence worked. Round 3 of capi-purchase-event-hash got a MAYBE with "I want these shapes to take up more of the frame." Round 7 reran with the cue preserved, and it hit GOOD. Without the ledger I was making the same mistakes in round 4 that I had already corrected in round 2.

What did not work: encoding the judgment by verdict alone. The MAYBE notes carry most of the signal because they are the rerolls I can rescue. The BAD notes often said things like "hey guys" or "nope" because I was triaging fast on the phone.

The first cue extractor pass also missed most of the patterns. It had eight rules. I spent rounds 4 through 6 expanding the rule set against what I was actually saying on the phone, and by round 10 it stabilized.

The MAYBE notes carry most of the signal because they are the rerolls I can rescue.

What I would do differently now

Capture the round number automatically instead of making me type it at the top of the paste. I have typed the wrong round twice, and having the tooling own that removes a class of error.
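The fix is small enough to sketch: derive the round from the ledger itself, since every history entry already carries one. A hypothetical helper, assuming the schema described earlier:

```python
def next_round(ledger):
    """Derive the next round number from the ledger instead of requiring
    the operator to type it at the top of the paste."""
    rounds = [entry["round"]
              for slug in ledger["slugs"].values()
              for entry in slug["history"]]
    return max(rounds, default=0) + 1
```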

Split cues into namespaces: style, composition, banned-concept. Right now they are a flat list, which works but does not help me see at a glance whether a slug's problems are about the aesthetic or the framing.
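One cheap way to get there without a schema migration is to prefix each cue with its namespace, keeping the flat list on disk. A hypothetical sketch of the idea:

```python
# Assumed namespace set from the post; the tagging scheme is mine.
NAMESPACES = ("style", "composition", "banned-concept")

def add_cue(entry, namespace, cue):
    """Store cues as 'namespace: cue' so a glance at the list separates
    aesthetic problems from framing problems."""
    if namespace not in NAMESPACES:
        raise ValueError(f"unknown cue namespace: {namespace}")
    tagged = f"{namespace}: {cue}"
    entry.setdefault("cues_to_preserve", [])
    if tagged not in entry["cues_to_preserve"]:
        entry["cues_to_preserve"].append(tagged)
```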

Add a pre-generation filter that hard-rejects known-bad categories before they enter the batch. If a prompt contains "spaceship" or "ornithopter" or "dragon-spine ridge," the generator should refuse and ask me to reword. Nothing currently prevents me from waste-rendering the same failure again.
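The filter itself is a few lines. A sketch, seeding the banned list from the never-works categories above; the exact term list and the refuse-with-message behavior are assumptions:

```python
# Seeded from the taxonomy's never-works categories; illustrative, not exhaustive.
BANNED_TERMS = {"spaceship", "ornithopter", "heighliner", "dragon-spine ridge"}

def check_prompt(prompt):
    """Hard-reject prompts containing known-never-works terms before
    they enter the render batch."""
    hits = [t for t in BANNED_TERMS if t in prompt.lower()]
    if hits:
        raise ValueError(
            f"prompt contains known-bad terms, reword before rendering: {hits}")
    return prompt
```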

Keep the free-text notes above everything. The regex taxonomy is second-order; the notes are the actual data. If I had to rebuild the tooling from scratch the notes are what I would save and the rules are what I would regenerate.

What the next six months will test

Does the taxonomy transfer to a different model. Most of what I have learned is Z-Image Turbo-specific. When I move a batch to Qwen Image or to a newer Flux release, the hit rates will shift and some of the banned categories may open up. The ledger shape is portable; the cue rules are probably not.

Does cluster-uniform styling beat per-article bespoke prompts. I have been generating one bespoke hero per article, which is expensive in both render time and review time. Moving to one uniform aesthetic per cluster of articles would let me trade bespoke variety for consistent throughput. I want to see whether the reader actually prefers the variety or whether cluster-level consistency reads as more professional.

Does a library-first strategy eat the problem. If I keep building the approved pool, eventually 80 to 90% of new articles can pull a hero from the existing library and only 10 to 20% need bespoke rerolls. At that point the ledger stops being a prompt-iteration tool and starts being a curation tool. Different UX, different workflow, probably better economics.

At what point does the feedback loop's value curve bend. Round 1 moved the hit rate from unknown to roughly 40%. Round 10 was north of 60%. I do not expect round 30 to be at 90%; diminishing returns are real and at some point the remaining failures are model-capability failures, not prompt failures. I would like to know where that floor is. If you want to see the full creative-tech stack around this loop, the live case studies and the product suite both show how the image pipeline ties into client work and into the shelf of productized offers.

Why not just use a hosted image API?

Two reasons. Cost, because per-image API pricing compounds fast at 500 generations across 21 rounds. And control, because the cue-extractor pipeline only works if the generator and the ledger live in the same process. A hosted API is fine for one-off marketing assets, but it is the wrong architecture for a site that needs a hundred consistent hero images.

Why GOOD / MAYBE / BAD instead of a numeric score?

Three buckets is the most I can make an honest judgment about on a phone at 7am. A 1-to-10 score sounds more precise but in practice I would be hovering between 6 and 7 for most images and the noise would swamp the signal. GOOD / MAYBE / BAD also maps cleanly to the downstream action: approve, reroll, skip.

How does the cue extractor avoid drift over time?

It does not, fully. Rules added in round 4 assume a prompt shape that may not match rounds 15 onward. I try to review the full rule list every five rounds and retire cues that no longer trigger. The ledger still wins over vibes, but it is maintenance, not fire-and-forget.

Is this worth the setup cost for someone generating fewer images?

Probably not under 50 generations. Under 50 I can hold the prompt lessons in my head. Between 50 and 200 a spreadsheet would work. Above 200 the ledger pays for itself in saved reroll time. The tooling itself is about 150 lines of Python plus one JSON file, so the build cost is low; the decision is really about whether the volume justifies the habit.

What happens if the image model changes underneath the ledger?

The slug history stays. The cue rules probably break, because they encode assumptions about what the current model can and cannot render. When I move to a new model I expect to keep the ledger structure, keep the free-text notes, and rewrite the regex pack from scratch against the new behavior. That is the main reason I keep the notes and the rules separate in the code.

Sources and specifics

  • Ledger location: /.image-feedback-ledger.json in the solo operator's workspace root. 106 slugs, 142 verdict entries, 21 rounds captured as of this writing.
  • Tooling: .image-work/image_tooling.py exports load_ledger, save_ledger, apply_cues, record_verdict, parse_verdict_paste, slugs_needing_reroll. Approximately 30 regex cue rules as of round 11.
  • Category hit rates are from a single operator's review of ~500 generations on Z-Image Turbo, rounds 8 through 18: fantasy landscapes 77/150 (51%), monoliths 27/110 (25%), Dune 26/66 (39%), portals 10/45 (22%), iridescent glass 7/30 (23%), sci-fi 2/16 (12%), spaceships 0/14 (0%), epic fantasy 6/20 (30%).
  • All verdict notes are free-text dictated on a phone during morning review; the regex extractor is pattern-based and lossy by design.
  • The pipeline runs Mac-for-prompts and Windows-for-GPU; see the hub article on the cross-machine setup for the full wiring.


Let us talk

If something in here connected, feel free to reach out. No pitch deck, no intake form. Just a direct conversation.

Get in touch.