
Fix TensorRT runtime input buffer lifetimes#4247

Open
SandSnip3r wants to merge 1 commit into pytorch:main from SandSnip3r:fix-runtime-buffer-lifetime-v2

Conversation

@SandSnip3r
Collaborator

@SandSnip3r SandSnip3r commented May 11, 2026

Description

This fixes a C++ runtime lifetime issue where temporary formatted input buffers, especially .contiguous() copies of non-contiguous inputs, could be destroyed before TensorRT finished using the bound input addresses. In the failing case, the CUDA caching allocator could then reuse that freed input storage for an output buffer with the same shape and dtype, causing input/output aliasing and large numerical corruption in monolithic TRT engines such as FLUX.2-klein-9B.

The fix moves per-execution input and shape-tensor storage onto the TRTEngine so the buffers remain alive through enqueueV3, records CUDA stream usage for active input tensors, and clears the retained references only after execution has been launched. This also makes the runtime behavior more robust for non-contiguous inputs across the standard C++ runtime path, CUDA graphs, and output allocator mode.

The added regression coverage exercises a non-contiguous bf16 input whose output has the same shape, matching the allocator/lifetime pattern that exposed the original bug. Verification included the new runtime test plus the FLUX transformer numerical repro and the full image-generation path.

Type of change

Bug fix

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

@meta-cla meta-cla Bot added the cla signed label May 11, 2026
@github-actions github-actions Bot added the component: tests, component: core, and component: runtime labels May 11, 2026
@github-actions github-actions Bot requested a review from cehongwang May 11, 2026 20:55
@SandSnip3r SandSnip3r changed the title from "Fix TensorRT runtime input buffer lifetimes and add regression covera…" to "Fix TensorRT runtime input buffer lifetimes" May 11, 2026
@SandSnip3r SandSnip3r force-pushed the fix-runtime-buffer-lifetime-v2 branch 4 times, most recently from e887772 to 5c38ea1 May 11, 2026 23:08
Collaborator

@narendasan narendasan left a comment


I think it makes sense to use the engine object to manage lifetimes. We might want to make the active buffer management functions methods of the engine object.

Also it would be good to make sure the rename works correctly with IOutputAllocator / pre_allocated_outputs, which I think reuse some of the cudagraphs system.

uint64_t outputs = _out_binding_names.size();
out_binding_names.resize(outputs);
output_buffers.resize(outputs);
cudagraph_output_staging_buffers.resize(outputs);
Collaborator


Does this system also get used for the pre_allocated outputs / output allocator?

Collaborator Author


No

Comment thread core/runtime/execute_engine.cpp Outdated
compiled_engine->active_shape_tensor_values.clear();
}

void reset_active_input_tensors(c10::intrusive_ptr<TRTEngine> compiled_engine) {
Collaborator


Do you want to just make this a method of the engine object? I don't think it needs to get lifted into Python or anything.

Collaborator Author


Done

Comment thread core/runtime/execute_engine.cpp Outdated
return false;
}

void clear_active_input_tensors(c10::intrusive_ptr<TRTEngine> compiled_engine) {
Collaborator


same here

@SandSnip3r SandSnip3r force-pushed the fix-runtime-buffer-lifetime-v2 branch from 5c38ea1 to d8fa084 May 11, 2026 23:33
@SandSnip3r
Collaborator Author

Also it would be good to make sure the rename works correctly with IOutputAllocator / pre_allocated_outputs which I think reuse some of the cudagraphs system

I think that path uses compiled_engine->output_allocator->getBuffers() which is not going to use these cuda graph staging buffers.

Fix TensorRT runtime input buffer lifetimes and add regression coverage for non-contiguous inputs aliasing matching-shape outputs.
@SandSnip3r SandSnip3r force-pushed the fix-runtime-buffer-lifetime-v2 branch from d8fa084 to f062cec May 12, 2026 00:05
@github-actions github-actions Bot requested a review from lanluo-nvidia May 12, 2026 17:06
@SandSnip3r SandSnip3r mentioned this pull request May 12, 2026
7 tasks