Image seven of a twenty-image batch, and my Mac stopped responding. Beach ball. Finder frozen. pkill wouldn't reach the process. A hard power cycle was the only way out. When it booted back up, I had an empty seat at the prompt and a question: why did the generator take the whole machine down instead of just dying itself?
The answer became a two-file safety wrapper I now run every time. A shell script for the single-image call, a Python runner for the queue, and a set of gates that refuse to start when the system cannot afford another copy of the model in memory.
The incident: what I watched happen
I was running mflux-generate from two terminals at once, kicking out hero images for an article batch. Flux Schnell resident on Apple Silicon wants roughly 14 to 16 GB of unified memory depending on resolution. The first invocation was fine. The second one started by the time the first was halfway through denoising, and the two processes began fighting for the same pool.
Around image seven, memory_pressure dropped below 20% free. Swap usage started climbing fast. I watched the cursor start to stutter and thought, OK, I'll kill one of these. The problem was that by the time I'd decided to kill it, I couldn't.
Once macOS is past a certain swap threshold (somewhere around 85% used in my experience), the kernel stops scheduling anything interactive fast enough to matter. A terminal I already had open would print half a character every few seconds. kill -9 returned to the shell but the target process kept running. Finder was frozen. Dock was frozen. Activity Monitor wouldn't open.
The only way out was holding the power button. When the Mac came back up, the output file for image seven was a zero-byte corpse. Two Claude Code sessions were gone. No disk corruption, which was lucky.
Timeline
Rough local times for the incident and the fix.
- T+0 min - First batch run launched from a Claude Code session. memory_pressure reports 45% free. Feels fine.
- T+8 min - Second mflux-generate call fires from another agent window. Both are now competing for unified memory.
- T+14 min - Swap usage at 87% per sysctl vm.swapusage. Cursor begins to lag.
- T+16 min - Machine unresponsive. kill calls in a held-open terminal return but do nothing.
- T+20 min - Hold power button. Hard reboot.
- T+30 min - Booted. Verified no disk corruption. Image seven file is empty.
- T+45 min - Wrote the first version of the memory gate: a memory_pressure pre-flight and a pgrep assertion. Decided the next run would be strictly serial.
- T+4 hr - Ran the first 20-image batch through the wrapper. Zero crashes. Zero timeouts. One legitimate mflux failure (unrelated), recovered cleanly.
Root cause: the swap cliff on unified memory
On Apple Silicon, system RAM and GPU memory are the same pool. Flux Schnell in MLX form wants a large chunk of it while it's generating. Two concurrent invocations don't split the model; each one wants its own resident copy. The OS has no way to tell it no, so it pages heavily to swap.
Past about 85% swap used, I've seen three different Apple Silicon machines behave the same way: the system becomes functionally unreachable. Not quite deadlocked, but close enough that kill -9 from a held-open terminal doesn't arrive soon enough to help. The kernel won't preempt a high-priority Python process that's holding GPU pages.
“The model does not crash. The OS stops scheduling anything else.”
The crash was not a bug in mflux. It was an absence of guardrails around it. Once I stopped thinking about the crash as a generator problem and started thinking about it as a scheduling problem, the fix wrote itself.
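The cliff is easy to watch approach if you're polling for it. Here's a minimal monitoring sketch I could have been running in a spare terminal; it parses the same `sysctl -n vm.swapusage` output the runner uses (function names are mine, and the 85% threshold is the empirical one from above):

```python
import re
import subprocess
import time

def parse_swap_pct(swapusage: str) -> float:
    """Percent of swap used, parsed from `sysctl -n vm.swapusage` output,
    e.g. 'total = 2048.00M  used = 1843.25M  free = 204.75M  (encrypted)'."""
    m = re.search(r"total = ([\d.]+)M.*?used = ([\d.]+)M", swapusage)
    total, used = float(m.group(1)), float(m.group(2))
    return 100 * used / total if total else 0.0

def watch(threshold: float = 85.0, interval: int = 10) -> None:
    """Print swap usage every `interval` seconds; flag the cliff loudly."""
    while True:
        out = subprocess.check_output(["sysctl", "-n", "vm.swapusage"], text=True)
        pct = parse_swap_pct(out)
        flag = "  << past the cliff, kill the batch NOW" if pct >= threshold else ""
        print(f"swap {pct:.0f}% used{flag}")
        time.sleep(interval)
```

The point of the loop is that you see 85% coming at 70% and 80%, while the machine can still execute a pkill for you.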
What I changed: the mflux safety wrapper
I split the work into two files. One shell script that knows how to run exactly one image safely. One Python runner that knows how to walk a queue, gate entries, and bail out before the machine is in trouble.
The inner script: one image per invocation
mflux_serial.sh has a single job: start if and only if it's safe to start, run in the foreground, and cool down before returning. It refuses to launch if another mflux process is already running, if dimensions aren't 64-aligned (Flux is picky about this and will OOM on misaligned tensors), or if system memory is below the 25% free threshold reported by memory_pressure.
```bash
#!/usr/bin/env bash
set -euo pipefail

# Positional args, in the order the runner passes them
PROMPT_FILE="$1"; OUTPUT="$2"; WIDTH="$3"; HEIGHT="$4"; SEED="$5"

# Assert no mflux already running
if pgrep -f mflux-generate >/dev/null; then
  echo "ERROR: mflux-generate already running. Aborting." >&2
  exit 1
fi

# Memory check via memory_pressure (not vm_stat alone)
FREE_PCT=$(memory_pressure 2>/dev/null | \
  awk -F: '/System-wide memory free percentage/{gsub(/[ %]/,"",$2); print $2; exit}')
if [[ -z "${FREE_PCT:-}" ]] || (( FREE_PCT < 25 )); then
  echo "ERROR: only ${FREE_PCT:-unknown}% memory free. Close apps and retry." >&2
  exit 1
fi

mflux-generate \
  --base-model schnell \
  --model black-forest-labs/FLUX.1-schnell \
  --steps 4 --seed "$SEED" \
  --height "$HEIGHT" --width "$WIDTH" \
  --prompt-file "$PROMPT_FILE" --output "$OUTPUT"

sleep 15  # built-in cooldown before returning
```
memory_pressure is the right tool here, not vm_stat | grep "Pages free". The latter reads too narrowly on macOS, because the OS treats most RAM as active cache. memory_pressure accounts for inactive and purgeable pages, which is closer to what you actually have available for a new workload.
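If you'd rather do the same pre-flight from Python than from awk, the parse is one regex over the same `System-wide memory free percentage` line the shell script greps for. A sketch (function names and the optional-output signature are mine):

```python
import re
import subprocess
from typing import Optional

def memory_free_pct(output: Optional[str] = None) -> Optional[int]:
    """Extract the free percentage from `memory_pressure` output.

    If `output` is None, shell out to the real tool (macOS only).
    Returns None when the line is missing, so callers fail closed."""
    if output is None:
        output = subprocess.check_output(["memory_pressure"], text=True)
    m = re.search(r"System-wide memory free percentage:\s*(\d+)%", output)
    return int(m.group(1)) if m else None

def safe_to_start(output: Optional[str] = None, threshold: int = 25) -> bool:
    """Gate matching the shell script: unknown or below-threshold means no."""
    pct = memory_free_pct(output)
    return pct is not None and pct >= threshold
```

Failing closed on a missing line matters: if memory_pressure isn't available or its format changes, you want the gate to refuse, not to assume headroom.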
The outer runner: queue, swap gate, timeout, abort
runner_v3.py reads a JSON list of prompts, calls the inner script for each one, and adds a layer of gates the shell can't easily do on its own: swap-percent checks with drain waits, a hard subprocess timeout with pkill recovery, and a two-failure-streak abort.
```python
import re
import subprocess
import time

PRE_IMAGE_SWAP_LIMIT = 95  # percent; only gate on true thrash
EXTRA_COOLDOWN = 30        # on top of the script's built-in 15s

def swap_percent_used():
    # Parse `sysctl -n vm.swapusage`: "total = 2048.00M  used = ...M  free = ...M"
    out = subprocess.check_output(['sysctl', '-n', 'vm.swapusage'], text=True)
    m = re.search(r"total = ([\d.]+)M .*used = ([\d.]+)M", out)
    total, used = float(m.group(1)), float(m.group(2))
    return 100 * used / total if total else 0

def wait_for_swap_breathing_room():
    # Up to six 60-second waits (~6 minutes) for swap to drain below the limit
    for _ in range(6):
        pct = swap_percent_used()
        if pct < PRE_IMAGE_SWAP_LIMIT:
            return True
        print(f"  swap {pct:.0f}% used -- waiting 60s for drain")
        time.sleep(60)
    return False
```
The timeout is the part I think about the most. If mflux-generate stalls (I've seen it hang once on a corrupted prompt file, and once for no reason I ever diagnosed), the outer subprocess.run kills it after 360 seconds and then explicitly runs pkill -9 -f mflux-generate for cleanup. That combination has never failed me.
```python
# fail_streak is initialized to 0 before the queue loop;
# sys, subprocess, and time are imported at the top of the runner.
try:
    r = subprocess.run(
        ['bash', SCRIPT, 'prompt.txt', str(out), str(w), str(h), str(seed)],
        capture_output=True, text=True, timeout=360)
    rc = r.returncode
except subprocess.TimeoutExpired:
    rc = 124  # conventional timeout exit code
    subprocess.run(['pkill', '-9', '-f', 'mflux-generate'])
    time.sleep(15)

if rc != 0:
    fail_streak += 1
    if fail_streak >= 2:
        print("ABORT: 2 consecutive failures.")
        sys.exit(1)
else:
    fail_streak = 0
```
Two consecutive failures is the abort condition. If a single generation times out or errors, that's recoverable and worth a cooldown. If two in a row fail, something about the environment is wrong (swap won't drain, model file corrupt, GPU driver in a weird state) and continuing will just waste time. Stop, investigate, restart.
The 45-second total cooldown between images (15 seconds built into the script, 30 more in the runner) is longer than it needs to be on a cold machine. It's calibrated for the back-half of a 20-image run, when the inactive page cache is bloated and needs time to flush. Too short a cooldown and you start the next image with less headroom than you think.
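If the fixed 45 seconds bothers you on a cold machine, the cooldown can be scaled to how much headroom you actually have. A sketch of one way to do it; the ramp shape and constants are my assumption, not anything I've calibrated, and free_pct is whatever your memory_pressure check returns:

```python
def cooldown_seconds(free_pct: float, base: int = 15, max_extra: int = 60) -> int:
    """Scale the inter-image cooldown by current memory headroom.

    At 50%+ free, just the base cooldown. As free memory falls toward
    the 25% start gate, linearly add up to `max_extra` seconds so the
    inactive page cache has time to flush before the next image."""
    if free_pct >= 50:
        return base
    # Linear ramp: 0 extra at 50% free, full max_extra at 25% free or below
    squeeze = min(max(50 - free_pct, 0), 25) / 25
    return base + round(max_extra * squeeze)
```

The idea is to pay the long cooldown only on the back half of a batch, when the cache is bloated, instead of on every image.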
The escape hatch from another terminal
The one thing I keep in muscle memory: if anything feels wrong, open a fresh terminal and run pkill -9 -f mflux-generate. If that returns instantly, the runner will see the failure, cool down, and continue or abort cleanly. If it doesn't return, you're already past the point where software can help, and you're back to the power button. Better to pkill early.
What I would do differently
Three things, in order of how much time they would have saved.
First, I would write the guardrails before the first batch, not after. I had written mflux prompts by hand a dozen times without issues. The jump to batch processing was the first time the generator ran more than twice back to back. That's exactly when a wrapper is worth the thirty minutes to write, because the cost of finding out the hard way is a reboot and some lost work.
Second, I would not start by tuning mflux flags. My first instinct after the crash was to look at --steps, --quantize, smaller resolutions, anything that would shrink the memory footprint. All of that turned out to be beside the point. The model was not the problem; concurrent scheduling was. Bad model flags can make the problem smaller, but they can't make it go away. A single concurrent call at any setting will eventually crash a loaded machine.
Third, I would have paired this with the Mac-to-Windows hybrid pipeline sooner. The Mac is great for prompt iteration and small one-off generations. For batches larger than five or six images, the Mac-to-Windows hybrid I eventually settled on offloads the queue to a dGPU box that doesn't share memory with the desktop session. That's now the default. The safety wrapper is what I run when I'm generating locally anyway (quick tests, single images, or when the Windows box is busy), not when I'm producing a finished batch.
A small anti-feature worth calling out: the runner does not retry on timeout. If a generation fails, it cools down and moves to the next one. The logic is that a retry on an already-stressed machine is the exact failure mode that cost me a reboot; better to log the failure and keep walking. You can re-run the failures in a follow-up pass once the environment is clean.
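That follow-up pass is easy to make mechanical: record each queue item's outcome as you go, then write the failures back out as a fresh queue. A sketch; the result-record shape here is my assumption, not the runner's actual log format:

```python
import json

def failed_items(results: list) -> list:
    """Return the queue entries whose generation did not exit 0.

    Each result record is assumed to look like {"item": {...}, "rc": int}."""
    return [r["item"] for r in results if r.get("rc", 1) != 0]

def write_retry_queue(results: list, path: str) -> int:
    """Dump failures to a new JSON queue file; returns how many were written."""
    failures = failed_items(results)
    with open(path, "w") as f:
        json.dump(failures, f, indent=2)
    return len(failures)
```

Point the runner at the retry file once the machine is cold again and swap has drained; the wrapper's gates apply to the retry pass exactly as they did to the first one.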
FAQ
Why use memory_pressure instead of vm_stat?
vm_stat reports "Pages free" narrowly. macOS keeps most of the machine's RAM as active cache, so vm_stat will show you single-digit percentages free on a healthy system with plenty of usable memory. memory_pressure accounts for inactive and purgeable pages, which is closer to what a new workload can actually claim. The 25% threshold in the script is calibrated against that.
Why 85% swap as the cliff threshold?
That's empirical across three Apple Silicon machines I've run this on. Below 85%, the system stays responsive even under heavy pressure. Above it, interactive scheduling degrades quickly enough that kill commands from a held-open terminal may not arrive in time. The runner actually gates at 95% and waits for drain, but my personal warning signal is 85% - if I see that during a batch, I stop and investigate.
Can you run mflux in parallel on a Mac Studio with 64 GB?
With enough memory headroom, yes. Two concurrent Flux Schnell invocations on a 64 GB machine can fit without paging if nothing else substantial is running. I still don't recommend it. The unified memory architecture means any spike from another app (a browser tab opening a heavy page, a Claude Code agent loading a large file) can push you over the edge in a second, and the recovery path is still the same. Serial is cheaper than the one time it fails.
What does the 45-second cooldown actually do?
Two things. First, it lets the Python process fully release any GPU pages it was holding, so the next invocation starts from a clean allocation instead of fighting the previous run's tail state. Second, it gives macOS time to flush inactive pages from the cache, which is what the memory_pressure reading is measuring. On a cold machine 45 seconds is overkill. On image fifteen of a batch it's barely enough.
How do I adapt this for a different Flux model or for Z-Image Turbo?
The shell script is model-agnostic; change the --model flag and the memory threshold. Z-Image Turbo has a smaller resident footprint than Flux Schnell, so you can drop the memory_pressure gate to 20% or 18%. The Python runner is entirely generic - it shells out to whatever script you give it. For a walkthrough of how the two models differ in behavior and prompt fidelity, see the side-by-side of the two local models I actually use.
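In practice that per-model adaptation is just a lookup table the runner consults before calling the script. A sketch of how I'd structure it; the Z-Image threshold is the guess from above, and the model flag strings here are placeholders, not verified identifiers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelGates:
    model_flag: str    # value passed to the script's --model
    min_free_pct: int  # memory_pressure gate before starting
    timeout_s: int     # outer subprocess timeout

GATES = {
    "flux-schnell": ModelGates("black-forest-labs/FLUX.1-schnell", 25, 360),
    # Smaller resident footprint, so a looser memory gate (placeholder flag)
    "z-image-turbo": ModelGates("z-image-turbo-placeholder", 20, 360),
}

def gates_for(name: str) -> ModelGates:
    """Look up the gate set for a model; KeyError on unknown names is deliberate."""
    return GATES[name]
```

Keeping the thresholds in data rather than in the script means a new model is one dict entry, not a fork of the wrapper.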
Sources and specifics
- Crash occurred on Apple Silicon during a batch hero-image run for this site, Q1 2026. Two mflux-generate processes were active when swap climbed past 85%.
- Gate thresholds in the current runner: memory_pressure >= 25% free to start, vm.swapusage < 95% to proceed, 240s mflux internal tolerance with a 360s outer timeout, 45s total cooldown (15s in the script + 30s in the runner), 2-consecutive-fail abort.
- Scripts live at .image-work/runner_v3.py in the project and mflux_serial.sh in the bzk-image-flux skill. Both are checked in and used on every local batch.
- Flux Schnell memory footprint on MLX is roughly 14 to 16 GB resident depending on resolution. Z-Image Turbo is smaller but I still gate it through the wrapper.
- Recovery pattern on timeout: subprocess.TimeoutExpired -> pkill -9 -f mflux-generate -> 15s sleep -> continue to the next queue item if the fail streak is below 2.
- The broader pipeline (Mac for prompts and review, Windows dGPU for batch generation, feedback ledger for prompt iteration) is documented in the hub post on the full image-generation stack.
- For how I actually turn hundreds of image reviews into a reusable prompt taxonomy, see the feedback ledger and prompt taxonomy post.
- The decision to run strictly serial rather than parallel maps onto a broader argument about when agent and process parallelism is actually worth it.
- If you're curious why I run the generator foreground instead of in a background job, the pattern is the same one I use for any agent work: background agents should actually run in the background, but mflux is not an agent, and foreground is safer here.
- The full local-tooling practice lives in the Operator's Stack product, which documents the scripts, skills, and daily patterns that keep this kind of pipeline running without eating the week.
