Fix TensorRT runtime input buffer lifetimes #4247
Conversation
Force-pushed e887772 to 5c38ea1
narendasan left a comment:
I think this makes sense to me to use the engine object to manage lifetimes. We might want to make the active buffer management functions methods of the engine object.
Also it would be good to make sure the rename works correctly with IOutputAllocator / pre_allocated_outputs which I think reuse some of the cudagraphs system
uint64_t outputs = _out_binding_names.size();
out_binding_names.resize(outputs);
output_buffers.resize(outputs);
cudagraph_output_staging_buffers.resize(outputs);
Does this system also get used for the pre_allocated outputs / output allocator?
compiled_engine->active_shape_tensor_values.clear();
}

void reset_active_input_tensors(c10::intrusive_ptr<TRTEngine> compiled_engine) {
Do you want to just make this a method of the engine object. I dont think it needs to get lifted into python or anything
return false;
}

void clear_active_input_tensors(c10::intrusive_ptr<TRTEngine> compiled_engine) {
Force-pushed 5c38ea1 to d8fa084
I think that path uses …
…ge for non-contiguous inputs aliasing matching-shape outputs.
Force-pushed d8fa084 to f062cec
Description
This fixes a C++ runtime lifetime issue where temporary formatted input buffers, especially .contiguous() copies of non-contiguous inputs, could be destroyed before TensorRT finished using the bound input addresses. In the failing case, the CUDA caching allocator could then reuse that freed input storage for an output buffer with the same shape and dtype, causing input/output aliasing and large numerical corruption in monolithic TRT engines such as FLUX.2-klein-9B.
The fix moves per-execution input and shape-tensor storage onto the TRTEngine so the buffers remain alive through enqueueV3, records CUDA stream usage for active input tensors, and clears the retained references only after execution has been launched. This also makes the runtime behavior more robust for non-contiguous inputs across the standard C++ runtime path, CUDA graphs, and output allocator mode.

The added regression coverage exercises a non-contiguous bf16 input whose output has the same shape, matching the allocator/lifetime pattern that exposed the original bug. Verification included the new runtime test plus the FLUX transformer numerical repro and the full image-generation path.
Type of change
Bug fix
Checklist: