
2026-04-24 / 13 MIN READ

The Mac-to-Windows GPU pipeline for local image AI

A decision log on running Flux, Z-Image, and Qwen on a local AI image generation pipeline: Mac writes the prompts, a Windows RTX box does the render.

Two Sunday afternoons ago I had a fork I couldn't ignore. The image pipeline for this site was going to need a thousand renders before it stopped. The Mac kept crashing under Flux. fal.ai would cost about as much as a used bike. And the Windows gaming PC with the RTX 3070 sat ten feet away, mostly idle. I had to pick one.

This is the retrospective on that fork. What the three options actually cost, what I chose and why, and when I'd still reach for the other two.

[Interactive figure: a three-option decision tree for a local AI image generation pipeline, scored on cost, speed, and reliability at production volume. Option C is marked as chosen: electricity-only cost at volume, 17.8s warm renders, a weekend build, and the Mac stays free to think.]
The fork: three ways to run a local AI image generation pipeline

I needed hundreds of bespoke hero images for the article inventory on this site. Not templates with the slug swapped in. Bespoke prompts per article, per model's preferences, with a feedback ledger that captured Michael-review verdicts and fed the cues back into the next round. Volume was going to be roughly 200 unique prompts a month for the next quarter, plus rerolls, plus whatever the next brand direction needed.

Three options were on the table:

Option A: keep paying fal.ai (or Replicate, or the Black Forest Labs endpoint directly) per image. Predictable, no infrastructure, best-in-class model access the day it ships. The meter runs every time.

Option B: run mflux on the Mac. Free per image, fast iteration, prompts live next to the MDX files. Works until it doesn't; an M4 Max with the wrong concurrency settings will page itself into the ground on Flux dev.

Option C: dispatch from the Mac to the Windows box over SSH. The M4 Max writes prompts and handles review. The RTX 3070 runs the diffusion models under ComfyUI. SSH moves JSON and PNGs between them.

I ran the numbers and a week-long pilot on each before committing. The short version: Option C won for this volume at this price point. The longer version is worth writing down because volume, price, and hardware are all going to change, and the decision will get re-made.

Option A: keep paying fal.ai per image

fal.ai runs Flux, Z-Image, Qwen Image, and half a dozen other diffusion models behind a single API. Flux Schnell was roughly $0.04-0.05 per image at volume during my evaluation. Submit a JSON request, get back a URL, done. Their uptime during the evaluation was flawless. Their model catalog is wider than anything I could realistically keep current on locally.

What it would have cost me at this volume: call it 2,000 images through the pipeline (200/month times ten months, accounting for rerolls and brand-direction shifts). Even at the low end, that's $80-100. With the cloud-API spot-price variance I saw during tests, closer to $120-150. Not large on an absolute scale. Not nothing, either.

The real cost of Option A wasn't the bill. It was the lack of local control over the prompt tooling loop. I already had a ledger running on disk that captured per-slug review verdicts and fed cues forward. I had a safety wrapper for mflux on the Mac. I had serial runners and prompt JSON files and git history on all of it. Option A would have meant rebuilding that glue against an HTTPS API, and more importantly, it would have left the generation half of the loop opaque. I couldn't A/B the seed on my own timing. I couldn't checkpoint intermediate latents. I couldn't swap the model with a config change.

When A is the right call: small batches where the infrastructure payoff is negative. A twenty-image campaign for a single client. Anything where the cloud meter is dominated by your time-to-first-render. And anything that needs a model you can't run on consumer hardware; the Option A endpoint will always have the newest thing before I do.

Option B: run mflux on the Mac

mflux is a native MLX port of Flux Schnell and Flux dev. On an M4 Max with enough unified memory, it works. Schnell at 4 steps, 1344x768, takes about 30-40 seconds warm on my machine. That's faster than fal.ai when the endpoint is cold. It's also free.

The first time I tried to run it in parallel, the Mac locked up. Not slow. Locked. Three mflux-generate processes, each trying to hold Flux dev in memory, caused the OS to page aggressively, swap filled up, and the machine became unresponsive for about 90 seconds. I ended up shipping the mflux safety wrapper that stopped the Mac from swap-thrashing specifically so this wouldn't happen again. The wrapper enforces one mflux process at a time, gates on system memory pressure before starting, and cools down 15 seconds between runs.
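The wrapper's three behaviors (one process at a time, a memory gate, a cooldown) fit in a few dozen lines. A minimal sketch, assuming Unix file locking and illustrative paths; the memory-pressure check is a stub here, where the real wrapper consults macOS memory pressure before starting:

```python
import fcntl
import os
import subprocess
import time

LOCK_PATH = "/tmp/mflux.lock"       # illustrative paths; the real wrapper
STAMP_PATH = "/tmp/mflux.last-run"  # keeps its own state on disk
COOLDOWN_S = 15

def cooldown_remaining(now=None):
    """Seconds to wait before the next run is allowed."""
    now = time.time() if now is None else now
    try:
        last = os.path.getmtime(STAMP_PATH)
    except FileNotFoundError:
        return 0.0
    return max(0.0, COOLDOWN_S - (now - last))

def memory_pressure_ok():
    """Stub: the real wrapper gates on system memory pressure here."""
    return True

def run_serial(cmd):
    """Run one mflux-generate at a time: exclusive lock, gate, cooldown."""
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until the previous run exits
        if not memory_pressure_ok():
            raise RuntimeError("memory pressure too high, refusing to start")
        time.sleep(cooldown_remaining())   # enforce the 15s gap between runs
        rc = subprocess.call(cmd)
        open(STAMP_PATH, "w").close()      # refresh the cooldown stamp
        return rc
```

The lock makes serialization a property of the system rather than of my discipline, which is the part that matters at 11pm.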

After the wrapper, Option B became stable for low-to-medium volume. What it still cost: serial-only generation meant 20 images took about 15 minutes wall-clock. Rerolling a batch of 10 after review was another 8-10 minutes. More importantly, mflux only supports Flux. No Z-Image. No Qwen Image. For some article types Flux Schnell was the right call; for others the review ledger was screaming for Z-Image Turbo, which is materially faster and reads cleaner on certain subjects.

When B is the right call: low-volume solo work, Flux-only needs, willing to serialize, Mac with enough unified memory that the pager doesn't kill you. The safety wrapper is the hard prerequisite. Without it, Option B will eventually take down a working session.

Option C: dispatch from Mac to a Windows GPU box over SSH

The Windows workstation in the corner had an RTX 3070 with 8GB VRAM. I had Pinokio installed on it from an earlier season of AI video work. Pinokio runs ComfyUI with a one-click install. ComfyUI talks to a wider set of diffusion models than mflux, including Z-Image Turbo, Qwen Image, and the full Flux family. The VRAM budget on an 8GB card is tight for Flux dev but fine for Flux Schnell, Z-Image Turbo, and Qwen Image at typical hero-image resolutions.

The pipeline I built:

  1. Mac writes a prompts JSON (prompts-batch-N.json) with per-slug seeds, dimensions, and text.
  2. Mac pushes the JSON to the Windows box via git, or via scp when the batch is exploratory.
  3. Windows runs a Python script that builds the ComfyUI workflow graph for the chosen model, submits each prompt, saves the PNG with the slug as the filename, and commits the renders directory.
  4. Mac pulls the renders, builds a review HTML, asks me to paste back a verdict, parses the verdict, and updates the feedback ledger.
  5. The ledger's per-slug cues get injected into the next prompts JSON.
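For concreteness, the batch file in step 1 can be as small as a list of per-slug entries. The field names below are illustrative, not the pipeline's actual schema:

```python
import json

# Illustrative shape for a prompts-batch-1.json; the real schema is the pipeline's.
batch = {
    "model": "z-image-turbo",
    "prompts": [
        {
            "slug": "example-article",
            "seed": 42,
            "width": 1344,
            "height": 768,
            "text": "hero image prompt text, with per-slug cues from the ledger",
        }
    ],
}

with open("prompts-batch-1.json", "w") as f:
    json.dump(batch, f, indent=2)
```

Keeping the slug in the batch file is what lets step 3 name each PNG without any extra bookkeeping.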

Z-Image Turbo came in at 17.8 seconds warm on the RTX 3070 during the smoke test. Flux Schnell runs a few seconds slower. Cold start is about 30 seconds, and it amortizes across a warm batch. A ten-prompt batch completes in roughly three and a half minutes end-to-end, which is faster than my review throughput anyway.
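Those numbers reconcile: one cold start plus ten warm renders lands right at the observed wall-clock.

```python
cold_start_s = 30.0   # first render after the model loads
warm_render_s = 17.8  # Z-Image Turbo on the RTX 3070, measured warm
batch_size = 10

total_s = cold_start_s + batch_size * warm_render_s
print(f"{total_s:.0f}s ≈ {total_s / 60:.1f} min")  # 208s ≈ 3.5 min
```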

Build cost: a weekend to wire it up cleanly. The ComfyUI graph-as-code pattern took the longest; the rest is Python, SSH, and git. Ongoing cost: the Windows box draws about 350W while rendering and idles at around 80W. Electricity is not a line item I track closely, but it's not zero.
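The graph-as-code pattern amounts to building ComfyUI's node dictionary in Python and POSTing it to the server's /prompt endpoint. A trimmed sketch under assumptions: the host name is made up, and the node wiring is illustrative rather than a complete, working Flux or Z-Image graph:

```python
import json
import urllib.request

COMFY_URL = "http://windows-box:8188"  # assumed host; 8188 is ComfyUI's default port

def build_graph(text, seed, width, height):
    """Return a (truncated, illustrative) ComfyUI API-format graph.

    A real graph needs the full chain: checkpoint loader, CLIP encode,
    empty latent, KSampler (where the seed goes), VAE decode, SaveImage.
    """
    return {
        "5": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "6": {"class_type": "CLIPTextEncode",
              "inputs": {"text": text, "clip": ["4", 1]}},
        # ... loader, sampler, decode, and save nodes elided ...
    }

def submit(graph):
    """POST the graph to ComfyUI, which queues it and returns a prompt_id."""
    body = json.dumps({"prompt": graph}).encode()
    req = urllib.request.Request(f"{COMFY_URL}/prompt", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]
```

Once the graph is a plain dict, swapping models really is a config change: a different loader node, same submission code.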

What Option C still costs: network dependency between the two machines. When the fiber link flapped last month I fell back to Option B for a batch. The Windows box has to stay on; a reboot loses me ten minutes while I log back in and restart Pinokio. And VeraCrypt, which I use to keep the model weights on an encrypted drive, adds a mount step after reboot that I've forgotten often enough to put a reminder in my session-state doc.

What I chose and why

I chose C. Three reasons, in order:

Volume. Even at fal.ai's friendly pricing, 2,000 images was a real number. Building Option C took a weekend. The break-even was roughly the first 400 images, which I hit in about six weeks.

Control. Option C preserves the prompt-tooling loop I'd already built. The ledger, the safety wrapper (which I still use for Mac-local emergencies), the serial runner, the verdict parser. All of that keeps running. The GPU is a dumb worker with a stable API; I don't care whether it's local or cloud as long as the tooling layer stays intact. In practice, local was easier to integrate because there was no HTTP auth to think about.

Model selection. Option C gave me Z-Image Turbo specifically. The comparison between Z-Image Turbo and Flux Schnell on prompt divergence ran on live article prompts over two weeks and materially changed which model I reach for by article archetype. Option B couldn't have run that comparison. Option A could have, but at a meter.

The same pre-flight discipline that governs this pipeline shows up elsewhere in the studio: the MDX component prop audit I run before dispatching parallel writer agents is the same shape of gate, just on a different asset type. And the lab side of the practice runs on the same cadence; the 20 card-flip variants shipped last week were reviewed through the exact HTML grid the Windows box outputs dump into.


The Mac is still where the interesting work happens. It writes the prompts, ingests the ledger, runs the review HTML, holds the MDX and the commits. The Windows box is dispatched to, not orchestrated on. That split matters because it means the daily workflow feels unchanged from when I was running Option B on the Mac alone; it's just faster and broader. The pattern is one of the load-bearing pieces in the Operator's Stack I use across the practice; image generation is just one workload the split-machine idea applies to.

What I'd revisit

I'd revisit Option A if the next interesting model needs more VRAM than I have. Flux dev at full resolution was already tight on 8GB; something larger and I'm buying a new GPU or renting one by the hour. The cloud meter wins at that threshold.

I'd revisit Option B if I moved to a larger unified memory configuration, or to whichever Mac silicon generation first ships enough memory to run Flux dev comfortably in parallel. The Mac-as-GPU path has a real ceiling right now, but the ceiling is rising.

I would not revisit the parallel-Mac approach that preceded this whole arc. That's what broke and sent me to the safety wrapper in the first place. The feedback ledger that turned 500 prompt reviews into a taxonomy only exists because the serial-by-default wrapper made the runs survivable. Parallel-on-the-Mac isn't a hardware problem I can solve with more RAM; it's a default that was wrong for this workload. Option C moved parallelism to the right machine.

I'll also revisit C itself if the cluster-uniform work I'm shipping now reveals a batch-orchestration problem I haven't hit yet. Scaling the output to cluster-uniform hero images is the next test of whether the dispatch model holds up at fifty images per evening. I think it will. I thought the Mac could handle parallel Flux, too.

Frequently asked questions

Do I need a Windows box to run a local AI image generation pipeline?

No. Option B (Mac-local mflux with a safety wrapper) is a fine starting point for solo work. The Windows dispatch only makes sense when you need a model mflux doesn't support (Z-Image Turbo, Qwen Image) or when volume justifies moving generation off the workstation you're writing on. If you have a spare GPU box, use it. If you don't, start with the wrapper on the Mac and move later.

Why not use an RTX on a cloud GPU instead of a Windows workstation?

I considered it. RunPod and Vast.ai both rent RTX time by the hour. For my volume, a spare-room Windows box I already owned beat per-hour rental math within the first month. If I didn't own the hardware, the decision flips. Cloud GPU with persistent storage is the right Option C variant for anyone who doesn't have an idle workstation in the house.

How do you handle the SSH authentication and the network dependency?

Key-based SSH with a dedicated key for this purpose. The Mac has a shell function that pushes a prompts JSON, triggers the runner on the other end, polls for completion, and pulls the renders back. When the network is down, the script fails cleanly and I know to fall back to mflux on the Mac. I don't try to make it resilient across an outage; the fallback is a conscious switch, not an automatic retry.
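That shell function's shape, reproduced as a Python sketch; the host alias and remote paths are hypothetical, and the real thing is a shell function rather than this script:

```python
import subprocess

HOST = "winbox"  # hypothetical SSH alias backed by the dedicated key

def build_steps(batch_json, remote_dir="pipeline"):
    """The three-step dispatch: push the batch, run it, pull the renders."""
    return [
        ["scp", batch_json, f"{HOST}:{remote_dir}/"],
        ["ssh", HOST, "python", f"{remote_dir}/run_batch.py", batch_json],
        ["scp", "-r", f"{HOST}:{remote_dir}/renders/", "renders/"],
    ]

def dispatch(batch_json):
    for cmd in build_steps(batch_json):
        try:
            subprocess.run(cmd, check=True, timeout=3600)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
            # Fail cleanly, no automatic retry: falling back to Mac-local
            # mflux is a conscious switch, made by a human.
            raise SystemExit(f"dispatch failed at {cmd[0]}: {exc}")
```

`check=True` is the whole resilience strategy: any step that fails stops the run loudly instead of producing a half-pulled renders directory.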

Does the Windows box need to stay on 24/7?

Mine does, because it's also a gaming rig and a streaming server. For image work specifically it only needs to be on during render batches, which are maybe two hours a day. If power draw matters, wake-on-LAN from the Mac before a batch is a ten-line script and cuts idle draw to near zero.
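Wake-on-LAN really is about ten lines: a UDP broadcast of six 0xFF bytes followed by the target's MAC address repeated sixteen times. A sketch with a placeholder MAC:

```python
import socket

def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("MAC must be 6 bytes")
    return b"\xff" * 6 + raw * 16  # 102 bytes total

def wake(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the packet on the LAN; port 9 (discard) is conventional."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))

# wake("aa:bb:cc:dd:ee:ff")  # placeholder MAC for the Windows box's NIC
```

The BIOS and NIC both need wake-on-LAN enabled for this to do anything, which is a one-time settings chore.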

What about context window management during these long sessions?

That's a real one. A full generation-and-review session can burn through a Claude Code context window faster than the renders finish. I'm explicit about writing session-state documents and using paste-handoff blocks across compaction so the pipeline survives a context rotation without losing state. The Mac-side tooling is persistent on disk for exactly this reason; the Claude session is transient, the ledger and runners are not.

Sources and specifics

  • Pipeline shipped the bespoke hero image inventory for this site across Q1 and Q2 2026. Volume estimate of roughly 200 unique prompts per month plus rerolls; exact count is in the per-slug feedback ledger on disk.
  • Benchmark timings are from my own machines: Z-Image Turbo at 17.8 seconds warm on an RTX 3070 via ComfyUI, and Flux Schnell at roughly 30-40 seconds warm on an M4 Max via mflux. Cold starts add about 30 seconds either way.
  • fal.ai reference pricing was roughly $0.04-0.05 per Flux Schnell image during my evaluation period. This will vary by provider, model, and contract.
  • The Windows workstation is an RTX 3070 with 8GB VRAM running ComfyUI under Pinokio. Model weights stay on a VeraCrypt-encrypted drive. The dispatch is plain SSH with a dedicated key.
  • The Mac side runs a serial mflux wrapper (one process at a time, memory-pressure gated, 15s cooldown) plus a runner script that applies per-slug cues from the feedback ledger before each batch.
  • The pipeline's output is visible across the hero images on the case studies and portfolio pages; every article on this site uses it.

