feat: tristate --vae-tiling (off|on|auto) with automatic OOM fallback by RapidMark · Pull Request #1621 · leejet/stable-diffusion.cpp

RapidMark · 2026-06-08T12:33:12Z

On memory-constrained backends, integrated GPUs especially, a full-image VAE decode requests a single compute buffer larger than the backend's maximum single-buffer allocation size, and sd.cpp hard-fails instead of falling back to the tiling it already supports. Today the user has to know to pass --vae-tiling up front; a wrong guess crashes the run at the very end, after sampling has already completed.

Repro

AMD Radeon 8060S (Strix Halo, RDNA3.5 iGPU, 128 GB unified memory), Vulkan backend, Flux Krea-dev Q4 at 1024×1024, without --vae-tiling:

[INFO ] stable-diffusion.cpp:4482 - sampling completed, taking 10.24s
ggml_vulkan: Device memory allocation of size 2415919104 failed.
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
[ERROR] ggml_extend.hpp:1905 - vae: failed to allocate the compute buffer
[ERROR] vae.hpp:238 - vae decode compute failed
[ERROR] stable-diffusion.cpp:4202 - decode_first_stage failed for latent 1
[ERROR] main.cpp:793 - generate failed

The ~8.5 GB single-shot VAE decode buffer exceeds the iGPU's Vulkan per-buffer limit. The card has ample total memory (it shares 128 GB system RAM) — the failure is the per-buffer ceiling, not capacity. The whole gen is lost after a successful sampling pass.

Root cause

VAE::decode (src/model/vae/vae.hpp) only tiles when tiling_params.enabled is set by the caller (--vae-tiling); otherwise it does a single-shot decode. That single allocation hits the backend's per-buffer ceiling enforced in ggml-vulkan.cpp (Requested buffer size exceeds device buffer size limit). There is no automatic fallback.

Proposed change

Make --vae-tiling a tristate and add a reactive fallback for the new default:

| value | behavior |
| --- | --- |
| `off` | never tile; fail if the untiled buffer doesn't fit |
| `on` | always tile (the previous `--vae-tiling` behavior) |
| `auto` | (default) try untiled; on allocation failure, retry once tiled |

sd_tiling_params_t gains a bool auto_tile field, appended at the end of the struct so the C ABI stays backward-compatible (existing positional initializers keep compiling; the new default is auto_tile = true).
Reactive fallback in VAE::decode. When tiling isn't already requested and the untiled _compute returns empty (the compute-buffer allocation failed), free the buffer and retry once with tiling enabled, reusing the existing tiled path (get_tile_sizes + tiled_compute). A default tile size (32) fits comfortably. Logged at WARN so the fallback is visible. CPU stays the ultimate fallback if even a tiled buffer can't allocate.
CLI parses off|on|auto; bare --vae-tiling (no value) remains backward-compatible (= on). auto_tile round-trips through the JSON gen-params load/save.
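A minimal sketch of how the optional-value parsing described above could behave. The names `VaeTilingMode`, `ParsedTiling`, `mode_from_string`, and `parse_vae_tiling` are hypothetical and not taken from the PR's diff; the only behavior assumed from the text is that `off|on|auto` are the accepted values and that a bare `--vae-tiling` means `on`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names for illustration; not the PR's actual identifiers.
enum class VaeTilingMode { Off, On, Auto };

struct ParsedTiling {
    VaeTilingMode mode;
    bool consumed_value;  // true if the token after --vae-tiling was used as its value
};

// Map an explicit value to a mode; returns false for anything else.
bool mode_from_string(const std::string& s, VaeTilingMode& out) {
    if (s == "off")  { out = VaeTilingMode::Off;  return true; }
    if (s == "on")   { out = VaeTilingMode::On;   return true; }
    if (s == "auto") { out = VaeTilingMode::Auto; return true; }
    return false;
}

// args[i] is "--vae-tiling". Peek at args[i + 1] and consume it only if it is
// a recognized value; otherwise treat the flag as bare (legacy meaning: on).
ParsedTiling parse_vae_tiling(const std::vector<std::string>& args, std::size_t i) {
    VaeTilingMode mode = VaeTilingMode::On;
    if (i + 1 < args.size() && mode_from_string(args[i + 1], mode)) {
        return {mode, true};
    }
    return {VaeTilingMode::On, false};
}
```

With this shape, `--vae-tiling -o out.png` still parses as `on` and leaves `-o` for the next option, while `--vae-tiling auto` consumes `auto` as the value.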

I chose reactive (retry on the real allocation failure) over proactive (estimate the buffer size and compare against ggml_backend_buft_get_max_size()) deliberately: a size estimate is VAE-architecture-specific (peak activation differs across SD/SDXL/Flux/Wan/LTX VAEs), so a hardcoded bytes-per-pixel constant would be brittle, whereas retrying on the actual _compute failure is correct for every VAE and every backend with no tuning. The cost is one failed allocation attempt (a single WARN line) before the retry.
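The reactive control flow can be sketched independently of ggml. Everything below is a hypothetical stand-in rather than the code in `VAE::decode`: `Image`, `DecodeFn`, `VaeTilingMode`, and `decode_with_fallback` are illustrative names, and an empty result is assumed to mean "the compute-buffer allocation failed", as described above.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-ins: an image is a flat float buffer, and an empty
// result signals that the compute-buffer allocation failed.
using Image    = std::vector<float>;
using DecodeFn = std::function<Image()>;

enum class VaeTilingMode { Off, On, Auto };

Image decode_with_fallback(VaeTilingMode mode, const DecodeFn& untiled, const DecodeFn& tiled) {
    assert(untiled && tiled);  // both decode paths must be provided
    if (mode == VaeTilingMode::On) {
        return tiled();  // tiling requested up front, no untiled attempt
    }
    Image out = untiled();
    if (out.empty() && mode == VaeTilingMode::Auto) {
        // The untiled allocation failed: retry exactly once on the tiled path.
        out = tiled();
    }
    return out;  // still empty here means the decode failed
}
```

The retry happens at most once, so `auto` costs one failed attempt in the worst case, and `off` keeps its fail-fast behavior because the retry branch is gated on the mode.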

Validation (fresh build off current master)

Same box, same gen, all three modes:

--vae-tiling off   -> fails at decode (8.5 GB buffer exceeds the device limit), exit 1
--vae-tiling auto  -> WARN "vae: untiled decode buffer exceeded the backend limit;
                      retrying with tiling"; latent decoded in 6.94s; image saved, exit 0
--vae-tiling on    -> tiles from the start (unchanged legacy path)

So auto turns a hard crash into a fast, correct GPU decode with zero user knobs, while off/on keep deterministic control for anyone who wants it. The tiled decode (~6.9 s on the GPU) is also far faster than the usual workaround of routing the VAE to CPU (~29.5 s) to dodge the OOM, and is visually equivalent at 0.5 tile overlap (no seams).

This helps any constrained device, not just iGPUs (an 8 GB discrete card at high resolution hits the same per-buffer wall today). encode() has the same shape and could get the identical fallback, but this PR scopes to decode, where the failure actually occurs (decode works at output resolution; encode at the smaller latent resolution).

VAE decode can fail on integrated / low-VRAM GPUs because the untiled compute buffer exceeds the backend's maximum single-buffer allocation (e.g. Vulkan maxBufferSize), even when total memory is plentiful. sd.cpp already supports tiling that keeps each compute buffer small, but it had to be requested up front with --vae-tiling; users hit a hard failure instead of the working path that was one flag away.

Make --vae-tiling a tristate:

off - never tile (fail if the untiled buffer doesn't fit)
on - always tile (previous --vae-tiling behavior)
auto - (default) try untiled; if the compute buffer can't be allocated, free it and retry once with tiling

Implemented by appending a `bool auto_tile` to sd_tiling_params_t (kept at the end of the struct so the C ABI stays backward-compatible) and a single fallback branch in VAE::decode. Bare `--vae-tiling` with no value remains backward-compatible (= on). auto_tile round-trips through the JSON gen-params load/save.

Validated on an AMD Radeon 8060S iGPU (Flux Krea Q4, 1024x1024, Vulkan): --vae-tiling off fails at decode (8.5 GB buffer exceeds the device limit), --vae-tiling auto logs the retry and completes by tiling, --vae-tiling on tiles from the start.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wbruna · 2026-06-08T15:18:55Z

I chose reactive (retry on the real allocation failure) over proactive (estimate the buffer size and compare against ggml_backend_buft_get_max_size()) deliberately: a size estimate is VAE-architecture-specific (peak activation differs across SD/SDXL/Flux/Wan/LTX VAEs), so a hardcoded bytes-per-pixel constant would be brittle

Wouldn't it be possible to check against the real value, calculated from the graph before the allocation?

…es review)

Reviewer (wbruna) asked why retry-on-failure rather than checking the real buffer size from the graph up front. Good point: ggml can plan the exact compute-buffer size with no allocation.

Add an opt-in probe to GGMLRunner: when set_probe_compute_buffer_fits(true), alloc_compute_buffer measures the planned size via ggml_gallocr_reserve_n_size (no_alloc planning, zero allocation) and, if it exceeds ggml_backend_buft_get_max_size(), returns false BEFORE the real reserve -- so the backend never emits its raw "allocation failed" error on the AUTO success path.

VAE::decode enables the probe only around the untiled _compute in AUTO mode; the reactive output.empty()->tile path stays as the backstop for a genuine runtime OOM (planned size fits the max, but the device is full). get_max_size() is SIZE_MAX on CPU, so this no-ops there.

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024x1024): --vae-tiling auto now logs only the INFO "retrying with tiling" + completes (no allocation-failed spew); off still fails; on still tiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RapidMark · 2026-06-08T23:36:47Z

Good call — done (pushed just now).

Instead of retrying on failure, the AUTO path now measures the planned compute-buffer size up front with ggml_gallocr_reserve_n_size (no-alloc planning, zero allocation) and compares it to ggml_backend_buft_get_max_size() before allocating; if it won't fit, it goes straight to tiling — no bytes-per-pixel estimate. On CPU get_max_size() is SIZE_MAX, so it no-ops there.

I kept the original retry-on-empty as a backstop for a genuine runtime OOM (planned size fits the max, but the device is actually full). Net effect on the auto path: the backend no longer prints its raw "allocation failed" error — just an INFO line and the tiled decode.
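A minimal sketch of the combined decision on the auto path, written without any ggml calls so it stays self-contained. `DecodePath`, `planned_buffer_fits`, and `choose_auto_path` are hypothetical names; `planned_bytes` stands in for the size ggml plans for the compute buffer, `max_bytes` for the backend's per-buffer maximum, and `runtime_alloc_ok` for whether the real allocation succeeds afterwards.

```cpp
#include <cstddef>
#include <cstdint>

enum class DecodePath { Untiled, Tiled };

// Probe: compare the planned size to the per-buffer maximum without allocating.
bool planned_buffer_fits(std::size_t planned_bytes, std::size_t max_bytes) {
    return planned_bytes <= max_bytes;
}

// Decide which path the auto mode ends up on.
DecodePath choose_auto_path(std::size_t planned_bytes, std::size_t max_bytes, bool runtime_alloc_ok) {
    if (!planned_buffer_fits(planned_bytes, max_bytes)) {
        // Proactive: the buffer can never fit, so go straight to tiling and
        // never trigger the backend's allocation error.
        return DecodePath::Tiled;
    }
    if (!runtime_alloc_ok) {
        // Backstop: the plan fits the limit, but the device is actually full.
        return DecodePath::Tiled;
    }
    return DecodePath::Untiled;
}
```

When the maximum is `SIZE_MAX`, as stated above for CPU, the probe always passes and only the runtime backstop can route the decode to tiling.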

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024²): --vae-tiling auto now logs only the INFO "retrying with tiling" and completes; off still fails; on still tiles.

stduhpf · 2026-06-09T09:22:56Z

I think having a fallback to VAE tiling is a very welcome addition, but I have some small issues with the user experience. Changing the syntax of the --vae-tiling arg from a flag to a tristate option will break previously working commands, and I think we could implement the same feature without breaking anything.

For example we could add a --vae-tiling-auto flag.

Alternatively, set "auto" tiling as default and add something like a --no-auto-vae-tiling to disable it.

RapidMark wants to merge 2 commits into leejet:master from CloudhandsAI:cloudhands/vae-auto-tiling
