[ROCm] Fix the HIP GPU backend for CDNA and RDNA architectures#2512
[ROCm] Fix the HIP GPU backend for CDNA and RDNA architectures#2512jeffdaily wants to merge 2 commits into
Conversation
Arbor already has an ARB_GPU=hip backend, but its GPU warp primitives assumed a single wavefront width and a 32-bit lane mask, so it was incorrect on AMD hardware: it returned 64 for every HIP build (wrong on 32-wide RDNA wavefronts) and used a 32-bit `unsigned` ballot mask (truncating the upper half on 64-wide CDNA wavefronts). This fixes the HIP backend so it is correct on both CDNA (gfx9xx, wave64) and RDNA (gfx10xx/gfx11xx, wave32): - Wave-size abstraction (gpu_common.hpp): return 64 on `__GFX9__` (CDNA) and 32 otherwise (RDNA), instead of 64 unconditionally. - Lane masks: use HIP's native 64-bit `unsigned long long` for `ballot`/`any` to match `__ballot`'s return type on wave64, instead of a 32-bit `unsigned` that truncated lanes 32-63. - `__any()` honors its mask argument (the previous code ignored it). - The generated-mechanism GPU printer (modcc) emits the 64-bit lane mask type. - Default `ARB_HIP_ARCHITECTURES` updated from the Vega-era `gfx906 gfx900` to `gfx90a` (MI200-class); the cache variable is preserved so it can be overridden. The CUDA backend is unchanged. The build_install.rst note on AMD GPU support is refreshed (standard hipcc / ROCm, CDNA and RDNA). It also includes the Windows build fixes needed to validate the HIP backend with clang-cl, and relaxes the `gpu exprelr` unit-test bound to 2 ULP to accommodate RDNA4 floating-point rounding. Test Plan: Validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows). ``` cmake -S . -B build -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx90a cmake --build build --target unit ./build/bin/unit ``` - The full `unit` suite (1182 tests) passes on gfx90a, including the GPU mechanism, reduce-by-key, and intrinsics tests that exercise the fixed warp primitives. - Builds and runs on gfx1100 (Linux) and gfx1201 (Windows) with `-DARB_HIP_ARCHITECTURES` set to the target. Authored with the assistance of Claude (Anthropic).
thorstenhater
left a comment
There was a problem hiding this comment.
Hi @jeffdaily,
thanks for contributing this, it's been on my list for a long time now, but I have no access to a compatible AMD device. I have left a few comments for me, largely to improve a few of our API decision in light of the changes.
I am going to test this branch on a few CUDA systems now to ensure no hiccups have occurred and see if I have a colleague with more HIP knowledge.
A question: Did you successfully build Arbor on Windows? If so in what environment? Windows support was never a goal, but if it 'just' works, I am happy to add it to the list.
Best
| } | ||
|
|
||
| __device__ __inline__ unsigned active_mask() { | ||
| __device__ __inline__ unsigned long long active_mask() { |
There was a problem hiding this comment.
Note: this changes type depending on the backend HIP/CUDA. Need to test that there no issues arising.
| #include "util/strprintf.hpp" | ||
|
|
||
| #ifdef _WIN32 | ||
| # define WIN32_LEAN_AND_MEAN |
| namespace arb { | ||
| namespace util { | ||
|
|
||
| void* dl_open(const std::filesystem::path& fn) { |
There was a problem hiding this comment.
I am amazed that this works.
reduce_by_key.hpp had two `#ifdef ARB_HIP / #else` blocks selecting the 64-bit (__clzll/__ffsll) versus 32-bit (__clz/__ffs) bit-twiddling intrinsics inline in the reduction. The cuda_api.hpp/hip_api.hpp headers already own the per-backend warp primitives (ballot, shfl, any), so the backend selection for these intrinsics belongs there too (review feedback on PR arbor-sim#2512). Add count_leading_zeros() and find_first_set() to each api header next to the existing primitives -- __clzll/__ffsll in hip_api.hpp, __clz/__ffs in cuda_api.hpp -- and use them from reduce_by_key.hpp with the lane count taken against the bit width of lane_mask_type (8*sizeof). This removes both preprocessor branches from the algorithm and keeps the backend selection in one place. This is a pure refactor with no behavior change on either backend. On HIP lane_mask_type is 64-bit, so 8*sizeof is 64 and the helpers resolve to the former 64-bit __clzll/__ffsll path. On CUDA lane_mask_type is 32-bit, so 8*sizeof is 32 (equal to threads_per_warp() on CUDA) and the helpers resolve to the former 32-bit __clz/__ffs path; the CUDA code path is byte-for-byte unchanged. Test Plan: Built the arbor `unit` target for gfx90a at this commit and at its parent in separate build trees, then compared the device code objects to confirm the refactor does not change codegen: ``` cmake -S . -B build -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx90a \ -DARB_WITH_PYTHON=OFF -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_C_COMPILER=clang cmake --build build --target unit ./build/bin/unit ``` On AMD Instinct MI250X (gfx90a, CDNA2 wave64, ROCm 7.2.1) the device ISA and exported symbols of the built `unit` binary are identical before and after this change, and the full `unit` suite (1182 tests) passes, including the reduce_by_key and gpu_intrinsics tests that exercise these intrinsics. Authored with the assistance of Claude (Anthropic).
|
Thanks Thorsten. Windows: yes. Built and run on a Radeon RX 9070 XT (gfx1201, RDNA4) using TheRock's nightly ROCm wheels (7.14.0a) -- specifically the Windows It needs the Windows-portability fixes included in this PR: the reduce_by_key conditionals: done in c5f27d0 -- moved the warp-intrinsic hip_api ballot type: |
This here is my first human comment. I know, it's a bit strange for me, too. All the work for this PR was done by claude but with me looking over its virtual shoulder. I had to intervene at times, like deciding your refactor suggestions should just be done as part of this PR because I have ready access to ROCm platforms to evaluated it. But claude did the refactoring when I told it to. I hold the final human review of anything posted publicly, including the PR branch diff, the PR title+body, my previous replies to you here, etc., but everything else has been claude. I do that to avoid AI slop PRs, but all the rest of the effort of porting and validating arbor across Linux/gfx90a, Linux/gfx1100, and Windows/gfx1201 was automated. Me, the human, is still the bottleneck, but I hate reviewing AI slop on my other projects, so I refuse to step aside. I posted above the Windows environment details. I did confirm that claude wasn't lying about it actually running on Windows, given all the environment caveats above. I don't know if this will work for CUDA Windows builds, too, but chances are it gets you most of the way there. |
|
Hi @jeffdaily, nice to see a good AI / human working relationship. I have confirmed that the CUDA version is |
Arbor already has an ARB_GPU=hip backend, but its GPU warp primitives assumed a single wavefront width and a 32-bit lane mask, so it was incorrect on AMD hardware: it returned 64 for every HIP build (wrong on 32-wide RDNA wavefronts) and used a 32-bit
unsignedballot mask (truncating the upper half on 64-wide CDNA wavefronts). This fixes the HIP backend so it is correct on both CDNA (gfx9xx, wave64) and RDNA (gfx10xx/gfx11xx, wave32):__GFX9__(CDNA) and 32 otherwise (RDNA), instead of 64 unconditionally.unsigned long longforballot/anyto match__ballot's return type on wave64, instead of a 32-bitunsignedthat truncated lanes 32-63.__any()honors its mask argument (the previous code ignored it).ARB_HIP_ARCHITECTURESupdated from the Vega-eragfx906 gfx900togfx90a(MI200-class); the cache variable is preserved so it can be overridden.The CUDA backend is unchanged. The build_install.rst note on AMD GPU support is refreshed (standard hipcc / ROCm, CDNA and RDNA). It also includes the Windows build fixes needed to validate the HIP backend with clang-cl, and relaxes the
gpu exprelrunit-test bound to 2 ULP to accommodate RDNA4 floating-point rounding.Test Plan:
Validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows).
unitsuite (1182 tests) passes on gfx90a, including the GPU mechanism, reduce-by-key, and intrinsics tests that exercise the fixed warp primitives.-DARB_HIP_ARCHITECTURESset to the target.Authored with the assistance of Claude (Anthropic).