Gate static pin_memory on the host-RAM budget (fixes AMD/ROCm large-model load stall, #13730) #14525

Open
liminfei-amd wants to merge 1 commit into Comfy-Org:master from liminfei-amd:amd-rocm/13730-pin-budget-gate

Conversation

@liminfei-amd

Overview

Fixes #13730. On AMD/ROCm a clean launch stalls at "Requested to load LTXAV" while system RAM fills and
spills to swap, even though VRAM sits at ~65%. The cause is host-side pinned-memory exhaustion, not
VRAM pressure.

Root cause

partially_load() pins every offloaded weight via the static path pin_memory, which:

  • ignores ensure_pin_registerable()'s result and unconditionally calls cudaHostRegister, up to
    MAX_PINNED_MEMORY = 0.90 * RAM (27 GiB on a 30 GiB box);
  • produces pins that free_registrations can only reclaim from is_dynamic() models — and dynamic VRAM
    is off by default on AMD, so they're never reclaimable.

Page-locked RAM isn't swappable, so the loader exhausts RAM and thrashes. The dynamic path
(pinned_memory.py) already guards this with ensure_pin_budget / ensure_pin_registerable; only the
static path was missing it. (--high-ram bypasses the budget, so it doesn't help here.)
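To make the numbers above concrete, here is a back-of-envelope calculation. It assumes only the 0.90 factor and the 30 GiB box quoted above; the helper name `pinned_cap_bytes` is illustrative and is not part of ComfyUI.

```python
GIB = 1024 ** 3

def pinned_cap_bytes(total_ram_bytes: int, fraction: float = 0.90) -> float:
    """Upper bound on page-locked host memory for a given RAM size (illustrative)."""
    return total_ram_bytes * fraction

total = 30 * GIB                 # the test box described in this PR
cap = pinned_cap_bytes(total)    # what the static path may pin before stopping
headroom = total - cap           # what is left for everything that must stay swappable

print(f"cap={cap / GIB:.1f} GiB, headroom={headroom / GIB:.1f} GiB")
```

With the cap at roughly 27 GiB, only about 3 GiB remains for the OS, the Python process, and any pageable weights, which is consistent with the swap thrashing described above.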

Change (+8 / −1, comfy/model_management.py)

Gate the static path on the host-RAM budget, like the dynamic path already does; skip pinning when it
can't be met (the weight stays in pageable RAM — correct, just not page-locked).

-    ensure_pin_registerable(size)
+    if not ensure_pin_budget(size) or not ensure_pin_registerable(size):
+        return False

Unchanged when RAM is ample or under --high-ram.
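The shape of the gate can be sketched with a self-contained toy model. This is not ComfyUI's code: `PinBudget`, its fields, and the two predicate bodies are hypothetical stand-ins, and the real `ensure_pin_budget` / `ensure_pin_registerable` may apply different rules. The sketch only shows the control flow: check both conditions first, and decline to pin if either fails.

```python
from dataclasses import dataclass

@dataclass
class PinBudget:
    """Toy stand-in for the host-RAM pin budget (hypothetical, not ComfyUI's API)."""
    max_pinned: int      # cap on total page-locked bytes
    available_ram: int   # host RAM currently free
    reserve: int         # amount that must stay free and swappable (assumed rule)
    pinned: int = 0      # running total of pinned bytes

    def within_budget(self, size: int) -> bool:
        # Assumed rule: pinning must leave at least `reserve` free.
        return self.available_ram - size >= self.reserve

    def registerable(self, size: int) -> bool:
        # Assumed rule: pinning must not push the running total over the cap.
        return self.pinned + size <= self.max_pinned

    def try_pin(self, size: int) -> bool:
        """Return True if the weight was pinned, False if it stays pageable."""
        if not self.within_budget(size) or not self.registerable(size):
            return False  # declined: the weight is still usable, just not page-locked
        self.pinned += size
        self.available_ram -= size
        return True

# Units are arbitrary here (think GiB).
budget = PinBudget(max_pinned=27, available_ram=10, reserve=2)
results = [budget.try_pin(6), budget.try_pin(6), budget.try_pin(2)]
print(results)  # [True, False, True]
```

The second request is declined because it would eat into the reserve, while a later, smaller request still succeeds. Declining is per weight, so the loader keeps pinning whatever still fits.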

Validation — real server + /prompt, single-variable A/B

RX 7900 GRE 16 GiB (gfx1100), 30 GiB RAM, ROCm 7.2. Real ltx-2.3-22b-dev-fp8 + gemma_3_12B via a
real workflow POSTed to a clean-launch server's /prompt. Watchdog kills the server if RAM drops below
~1 GiB. Only the 8-line gate differs.

Without — min MemAvailable 0.7 GiB, swap 1.8 GiB, killed at 102s; never finished (the stall).

With — load completes past the stall point:

Requested to load LTXAV
loaded partially; 14685.35 MB loaded, 9151.30 MB offloaded
Prompt executed in 91.94 seconds

Reproduce it yourself (real workflow + driver)

Put ltx-2.3-22b-dev-fp8.safetensors in models/diffusion_models/ and
gemma_3_12B_it_fp4_mixed.safetensors in models/text_encoders/.

ltx_workflow.json (the real graph POSTed to /prompt):

{
  "1": {"class_type": "CLIPLoader", "inputs": {"clip_name": "gemma_3_12B_it_fp4_mixed.safetensors", "type": "ltxv"}},
  "2": {"class_type": "CLIPTextEncode", "inputs": {"clip": ["1", 0], "text": "a cat walking in a garden, cinematic"}},
  "3": {"class_type": "CLIPTextEncode", "inputs": {"clip": ["1", 0], "text": "blurry, low quality"}},
  "4": {"class_type": "UNETLoader", "inputs": {"unet_name": "ltx-2.3-22b-dev-fp8.safetensors", "weight_dtype": "default"}},
  "5": {"class_type": "EmptyLTXVLatentVideo", "inputs": {"width": 768, "height": 512, "length": 97, "batch_size": 1}},
  "6": {"class_type": "KSampler", "inputs": {"model": ["4", 0], "seed": 0, "steps": 1, "cfg": 3.0, "sampler_name": "euler", "scheduler": "normal", "positive": ["2", 0], "negative": ["3", 0], "latent_image": ["5", 0], "denoise": 1.0}},
  "7": {"class_type": "SaveLatent", "inputs": {"samples": ["6", 0], "filename_prefix": "ltx13730"}}
}

Driver — boots a clean-launch server, POSTs the workflow, and SIGKILLs if RAM drops below ~1 GiB so the
box can't freeze:

import subprocess, time, json, os, signal, sys, urllib.request, psutil
FLOOR, BOOT_TIMEOUT, RUN_TIMEOUT, PORT = 1.0, 180, 300, 8188
env = dict(os.environ, HIP_VISIBLE_DEVICES="0")
log = open("server.log", "w")
# clean launch = AMD default pinned ON / dynamic OFF
srv = subprocess.Popen([sys.executable, "main.py", "--listen", "127.0.0.1",
                        "--port", str(PORT), "--disable-auto-launch"],
                       env=env, stdout=log, stderr=subprocess.STDOUT)
mem = lambda: psutil.virtual_memory().available / 1024**3
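# wait for the server; give up (and clean up) if it never answers
t0 = time.time()
while time.time() - t0 < BOOT_TIMEOUT:
    try:
        urllib.request.urlopen("http://127.0.0.1:%d/system_stats" % PORT, timeout=2); break
    except Exception:
        if srv.poll() is not None: sys.exit("server died at boot")
        time.sleep(1)
else:
    # loop ended without break: the server never came up, so don't leave it running
    srv.send_signal(signal.SIGKILL); sys.exit("server did not come up within BOOT_TIMEOUT")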
# POST the real workflow
wf = json.load(open("ltx_workflow.json"))
req = urllib.request.Request("http://127.0.0.1:%d/prompt" % PORT,
                             data=json.dumps({"prompt": wf}).encode(),
                             headers={"Content-Type": "application/json"})
pid = json.load(urllib.request.urlopen(req, timeout=15))["prompt_id"]
# watchdog RAM floor + poll /history
t1, mn, outcome = time.time(), 99, "?"
while True:
    a = mem(); mn = min(mn, a)
    if a < FLOOR:
        srv.send_signal(signal.SIGKILL); outcome = "RAM_FLOOR_KILL"; break
    if time.time() - t1 > RUN_TIMEOUT:
        srv.send_signal(signal.SIGKILL); outcome = "STALL_TIMEOUT"; break
    try:
        h = json.load(urllib.request.urlopen("http://127.0.0.1:%d/history/%s" % (PORT, pid), timeout=3))
        if pid in h and h[pid]["status"].get("completed"):
            outcome = "PROMPT_COMPLETED"; break
    except Exception:
        pass
    time.sleep(0.5)
print("OUTCOME=%s min_memavail=%.1fG elapsed=%.0fs" % (outcome, mn, time.time() - t1))
srv.send_signal(signal.SIGKILL)

Run python driver.py on baseline → RAM_FLOOR_KILL; apply the patch and rerun → PROMPT_COMPLETED
with Prompt executed in ... in server.log. (The minimal latent triggers an unrelated
sampling-stage shape error after the load — the point is the load now passes.)
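When comparing baseline and patched runs, it can help to pull the load figures out of server.log automatically. The following is a minimal sketch that assumes the log lines have exactly the form quoted above ("loaded partially; ... MB loaded, ... MB offloaded" and "Prompt executed in ... seconds"); the function name `summarize_log` is made up for this example.

```python
import re

# Patterns match the log lines quoted in this PR description.
LOAD_RE = re.compile(r"loaded partially; ([\d.]+) MB loaded, ([\d.]+) MB offloaded")
DONE_RE = re.compile(r"Prompt executed in ([\d.]+) seconds")

def summarize_log(text: str) -> dict:
    """Extract load/offload sizes and prompt time; fields stay None if absent."""
    summary = {"loaded_mb": None, "offloaded_mb": None, "seconds": None}
    for line in text.splitlines():
        m = LOAD_RE.search(line)
        if m:
            summary["loaded_mb"] = float(m.group(1))
            summary["offloaded_mb"] = float(m.group(2))
        m = DONE_RE.search(line)
        if m:
            summary["seconds"] = float(m.group(1))
    return summary

sample = """Requested to load LTXAV
loaded partially; 14685.35 MB loaded, 9151.30 MB offloaded
Prompt executed in 91.94 seconds"""
summary = summarize_log(sample)
print(summary)
```

On a baseline run that is killed at the RAM floor, all three fields would be expected to stay `None`, since the load never reaches those log lines.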

A question

This is a minimal budget gate. A deeper option is to make the static pins reclaimable under
pressure (as the dynamic path's free_registrations / _steal_pin are) so the loader releases pins on
demand instead of declining new ones. Happy to go that way instead/in addition if you prefer.
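To illustrate the difference between the two options, here is a toy sketch of the "reclaim under pressure" alternative. It is a hypothetical design for discussion only: `ReclaimablePins` and its oldest-first eviction policy are invented for this example and do not describe how free_registrations or _steal_pin actually work.

```python
from collections import OrderedDict

class ReclaimablePins:
    """Toy registry that releases the oldest pins to make room (hypothetical)."""

    def __init__(self, max_pinned: int):
        self.max_pinned = max_pinned
        self.pins = OrderedDict()  # key -> size, oldest first

    @property
    def pinned(self) -> int:
        return sum(self.pins.values())

    def pin(self, key: str, size: int) -> bool:
        if size > self.max_pinned:
            return False  # can never fit, even with everything released
        self.pins.pop(key, None)  # re-pinning a key replaces its old entry
        # Release the oldest pins until the new one fits under the cap.
        while self.pinned + size > self.max_pinned:
            self.pins.popitem(last=False)  # a real loader would unregister the memory here
        self.pins[key] = size
        return True

reg = ReclaimablePins(max_pinned=10)
reg.pin("a", 6)
reg.pin("b", 6)          # does not fit alongside "a", so "a" is released
print(list(reg.pins))    # ['b']
```

The budget gate in this PR declines the new pin and keeps the old ones; a scheme like the one sketched here does the opposite, trading extra bookkeeping for keeping the most recently used weights page-locked.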

Caveat: the test board is an RX 7900 GRE (16 GiB) versus the reporter's RX 7900 XTX (24 GiB). Both are
gfx1100/RDNA3 on ROCm 7.2 with roughly the same amount of RAM; the smaller VRAM just triggers offload sooner.


AI usage disclosure: this change was prepared with AI assistance; a human reviewed and verified it and
can explain every line.

On AMD/ROCm a clean launch stalls at "Requested to load LTXAV" while system RAM
fills and spills to swap, even though VRAM sits at ~65%. It is host-side
pinned-memory exhaustion, not VRAM pressure.

partially_load() pins every offloaded weight via the static path pin_memory,
which ignores ensure_pin_registerable()'s result and unconditionally
cudaHostRegisters up to MAX_PINNED_MEMORY (0.90*RAM on Linux). Those pins are
only reclaimable from is_dynamic() models by free_registrations, and dynamic
VRAM is off by default on AMD, so they are never reclaimable. Page-locked RAM
is not swappable, so the loader exhausts RAM and thrashes.

The dynamic-VRAM pin path (comfy/pinned_memory.py) already guards this with
ensure_pin_budget/ensure_pin_registerable; only the static path was missing it.
Gate pin_memory the same way and skip pinning when the budget cannot be met
(the weight stays in pageable RAM, still correct, just not page-locked).
Behavior is unchanged when RAM is ample and under --high-ram.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7a1cd199-75b9-4acf-a037-7e02fa745591

📥 Commits

Reviewing files that changed from the base of the PR and between a590d60 and 8d8e0c6.

📒 Files selected for processing (1)
  • comfy/model_management.py

📝 Walkthrough

In comfy/model_management.py, the pin_memory() function gains two early-exit checks before it attempts CUDA host registration. Calls to ensure_pin_budget and ensure_pin_registerable are inserted at the top of the function; if either returns a falsy value, pin_memory() immediately returns False and leaves the tensor in pageable RAM without ever reaching cudaHostRegister.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: gating pin_memory on host-RAM budget to fix AMD/ROCm model load stalls, with the specific issue reference. |
| Description check | ✅ Passed | The description thoroughly documents the root cause, the specific code change, validation methodology, and results on real hardware, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | The PR directly addresses issue #13730 by implementing budget gates on the static pin_memory path to prevent host-RAM exhaustion during large model loading on AMD/ROCm systems. |
| Out of Scope Changes check | ✅ Passed | The changeset is narrowly scoped to the identified root cause: adding budget/registerable checks to the static pin_memory path in comfy/model_management.py, with no extraneous modifications. |


@maxevilmind


hey, this actually makes quite a lot of sense, I will try to test on my setup when I have some free time


Development

Successfully merging this pull request may close these issues.

LTX 2.3 FP8/Q4KM stalls during Requested to load LTXAV on RX 7900 XTX + ROCm unless dynamic VRAM / pinned memory / async offload are disabled
