Gate static pin_memory on the host-RAM budget (fixes AMD/ROCm large-model load stall, #13730) #14525
On AMD/ROCm a clean launch stalls at "Requested to load LTXAV" while system RAM fills and spills to swap, even though VRAM sits at ~65%. It is host-side pinned-memory exhaustion, not VRAM pressure. partially_load() pins every offloaded weight via the static path pin_memory, which ignores ensure_pin_registerable()'s result and unconditionally cudaHostRegisters up to MAX_PINNED_MEMORY (0.90*RAM on Linux). Those pins are only reclaimable from is_dynamic() models by free_registrations, and dynamic VRAM is off by default on AMD, so they are never reclaimable. Page-locked RAM is not swappable, so the loader exhausts RAM and thrashes. The dynamic-VRAM pin path (comfy/pinned_memory.py) already guards this with ensure_pin_budget/ensure_pin_registerable; only the static path was missing it. Gate pin_memory the same way and skip pinning when the budget cannot be met (the weight stays in pageable RAM, still correct, just not page-locked). Behavior is unchanged when RAM is ample and under --high-ram.
hey, this actually makes quite a lot of sense, I will try to test on my setup when I have some free time
Overview
Fixes #13730. On AMD/ROCm a clean launch stalls at "Requested to load LTXAV" while system RAM fills and spills to swap, even though VRAM sits at ~65%. It's host-side pinned-memory exhaustion, not VRAM.
Root cause
partially_load() pins every offloaded weight via the static path pin_memory, which:
- ignores ensure_pin_registerable()'s result and unconditionally cudaHostRegisters, up to MAX_PINNED_MEMORY = 0.90 * RAM (27 GiB on a 30 GiB box);
- creates pins that free_registrations can only reclaim from is_dynamic() models, and dynamic VRAM is off by default on AMD, so they're never reclaimable.
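The 27 GiB figure follows directly from the 0.90 cap. A quick check of the arithmetic is sketched below; the helper name and the integer-percent form are illustrative assumptions, not ComfyUI code, and only the 0.90 fraction and the 30 GiB box come from the description above.

```python
GIB = 1024 ** 3

def pinned_ceiling_bytes(total_ram_bytes: int, percent: int = 90) -> int:
    """Upper bound on page-locked bytes when the cap is a fixed percentage of RAM.

    Hypothetical helper for illustration; integer math avoids float rounding.
    """
    return total_ram_bytes * percent // 100

total = 30 * GIB                      # the 30 GiB test box
ceiling = pinned_ceiling_bytes(total) # what the static path may page-lock
headroom = total - ceiling            # what is left for everything else

print(f"pin ceiling: {ceiling / GIB:.1f} GiB, headroom: {headroom / GIB:.1f} GiB")
```

With only about 3 GiB left over for the OS, the Python process and page cache, a load that pins right up to the ceiling leaves very little that can be swapped out.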
Page-locked RAM isn't swappable, so the loader exhausts RAM and thrashes. The dynamic path (pinned_memory.py) already guards this with ensure_pin_budget/ensure_pin_registerable; only the static path was missing it. (--high-ram bypasses the budget, so it doesn't help here.)
Change (+8 / −1, comfy/model_management.py)
Gate the static path on the host-RAM budget, like the dynamic path already does; skip pinning when it can't be met (the weight stays in pageable RAM, which is still correct, just not page-locked).
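The shape of such a gate can be sketched in a few lines. This is a toy model under stated assumptions, not the actual diff: PinBudget, can_register and pin_if_budget_allows are hypothetical names, and the real ensure_pin_budget/ensure_pin_registerable signatures are not shown in this PR text.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PinBudget:
    """Toy stand-in for a host-RAM pin budget (hypothetical, not ComfyUI's API)."""
    limit_bytes: int
    pinned_bytes: int = 0

    def can_register(self, nbytes: int) -> bool:
        # True only if pinning nbytes more keeps us within the budget.
        return self.pinned_bytes + nbytes <= self.limit_bytes

def pin_if_budget_allows(budget: PinBudget, nbytes: int,
                         register: Callable[[int], None]) -> bool:
    """Pin only when the budget check passes.

    `register` is whatever performs the page-locking. When the check fails
    we return False and the caller leaves the weight in pageable RAM.
    """
    if not budget.can_register(nbytes):
        return False
    register(nbytes)
    budget.pinned_bytes += nbytes
    return True

# Usage: the second request would exceed the budget, so it is declined.
demo_budget = PinBudget(limit_bytes=10)
demo_registered = []
first = pin_if_budget_allows(demo_budget, 6, demo_registered.append)
second = pin_if_budget_allows(demo_budget, 6, demo_registered.append)
```

The key property is that a declined pin is not an error: the transfer still works from pageable memory, it just loses the page-locked fast path.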
Unchanged when RAM is ample or under --high-ram.
Validation: real server + /prompt, single-variable A/B
RX 7900 GRE 16 GiB (gfx1100), 30 GiB RAM, ROCm 7.2. Real ltx-2.3-22b-dev-fp8 + gemma_3_12B via a real workflow POSTed to a clean-launch server's /prompt. Watchdog kills the server if RAM drops below ~1 GiB. Only the 8-line gate differs.
Without — min MemAvailable 0.7 GiB, swap 1.8 GiB, killed at 102s; never finished (the stall).
With — load completes past the stall point:
Reproduce it yourself (real workflow + driver)
Put ltx-2.3-22b-dev-fp8.safetensors in models/diffusion_models/ and gemma_3_12B_it_fp4_mixed.safetensors in models/text_encoders/.
ltx_workflow.json (the real graph POSTed to /prompt):
{
  "1": {"class_type": "CLIPLoader", "inputs": {"clip_name": "gemma_3_12B_it_fp4_mixed.safetensors", "type": "ltxv"}},
  "2": {"class_type": "CLIPTextEncode", "inputs": {"clip": ["1", 0], "text": "a cat walking in a garden, cinematic"}},
  "3": {"class_type": "CLIPTextEncode", "inputs": {"clip": ["1", 0], "text": "blurry, low quality"}},
  "4": {"class_type": "UNETLoader", "inputs": {"unet_name": "ltx-2.3-22b-dev-fp8.safetensors", "weight_dtype": "default"}},
  "5": {"class_type": "EmptyLTXVLatentVideo", "inputs": {"width": 768, "height": 512, "length": 97, "batch_size": 1}},
  "6": {"class_type": "KSampler", "inputs": {"model": ["4", 0], "seed": 0, "steps": 1, "cfg": 3.0, "sampler_name": "euler", "scheduler": "normal", "positive": ["2", 0], "negative": ["3", 0], "latent_image": ["5", 0], "denoise": 1.0}},
  "7": {"class_type": "SaveLatent", "inputs": {"samples": ["6", 0], "filename_prefix": "ltx13730"}}
}
Driver (driver.py): boots a clean-launch server, POSTs the workflow, and SIGKILLs the server if RAM drops below ~1 GiB so the box can't freeze.
Run python driver.py on baseline → RAM_FLOOR_KILL; apply the patch and rerun → PROMPT_COMPLETED with "Prompt executed in ..." in server.log. (The minimal latent triggers an unrelated sampling-stage shape error after the load; the point is that the load now passes.)
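For readers who want to build a similar safety net, the RAM-floor watchdog idea can be sketched as below. This is not the driver used for the numbers above; it assumes Linux's /proc/meminfo with a MemAvailable line, takes an already-started subprocess.Popen for the server, and the SERVER_EXITED label is a made-up name (RAM_FLOOR_KILL is the label mentioned above).

```python
import subprocess
import time

ONE_GIB_KIB = 1024 * 1024  # /proc/meminfo reports values in kB (KiB)

def mem_available_kib(meminfo_text: str) -> int:
    """Extract MemAvailable (in KiB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")

def below_floor(meminfo_text: str, floor_kib: int = ONE_GIB_KIB) -> bool:
    """True when available RAM has dropped under the floor."""
    return mem_available_kib(meminfo_text) < floor_kib

def watch(server: subprocess.Popen, floor_kib: int = ONE_GIB_KIB,
          interval_s: float = 1.0, meminfo_path: str = "/proc/meminfo") -> str:
    """Poll available RAM and kill the server before the machine can freeze."""
    while True:
        if server.poll() is not None:   # server already exited on its own
            return "SERVER_EXITED"      # hypothetical label
        with open(meminfo_path) as f:
            text = f.read()
        if below_floor(text, floor_kib):
            server.kill()               # SIGKILL on POSIX
            return "RAM_FLOOR_KILL"
        time.sleep(interval_s)
```

Keeping the parsing separate from the polling loop makes the threshold logic easy to check without starting a server.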
A question
This is a minimal budget gate. A deeper option is to make the static pins reclaimable under pressure (as the dynamic path's free_registrations/_steal_pin are) so the loader releases pins on demand instead of declining new ones. Happy to go that way instead/in addition if you prefer.
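To make the difference between the two options concrete, here is a rough sketch of the "reclaim instead of decline" behavior. It is a toy model only: ReclaimablePins and its oldest-first eviction are assumptions for illustration and say nothing about how free_registrations or _steal_pin actually choose what to release.

```python
from collections import OrderedDict

class ReclaimablePins:
    """Toy pin registry that frees older pins to admit a new one (hypothetical)."""

    def __init__(self, limit_bytes: int):
        self.limit_bytes = limit_bytes
        self.pins = OrderedDict()  # name -> nbytes, oldest first

    def pinned_bytes(self) -> int:
        return sum(self.pins.values())

    def pin(self, name: str, nbytes: int) -> bool:
        # A request larger than the whole budget can never fit; decline it.
        if nbytes > self.limit_bytes:
            return False
        # Release the oldest pins until the new one fits. Real code would
        # unregister the host memory here rather than just drop the entry.
        while self.pinned_bytes() + nbytes > self.limit_bytes:
            self.pins.popitem(last=False)
        self.pins[name] = nbytes
        return True
```

Compared with the budget gate, this keeps the newest weights page-locked at the cost of extra unregister work under pressure.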
Caveat: the test board is a GRE 16 GiB vs the reporter's XTX 24 GiB; same gfx1100/RDNA3/ROCm 7.2 and roughly the same amount of RAM. The smaller VRAM just triggers offload sooner.
AI usage disclosure: this change was prepared with AI assistance; a human reviewed and verified it and
can explain every line.