[ROCm] Fix the HIP GPU backend for CDNA and RDNA architectures by jeffdaily · Pull Request #2512 · arbor-sim/arbor

jeffdaily · 2026-06-08T19:44:18Z

Arbor already has an ARB_GPU=hip backend, but its GPU warp primitives assumed a single wavefront width and a 32-bit lane mask, so it was incorrect on AMD hardware: it returned 64 for every HIP build (wrong on 32-wide RDNA wavefronts) and used a 32-bit unsigned ballot mask (truncating the upper half on 64-wide CDNA wavefronts). This fixes the HIP backend so it is correct on both CDNA (gfx9xx, wave64) and RDNA (gfx10xx/gfx11xx, wave32):

Wave-size abstraction (gpu_common.hpp): return 64 on __GFX9__ (CDNA) and 32 otherwise (RDNA), instead of 64 unconditionally.
Lane masks: use HIP's native 64-bit unsigned long long for ballot/any to match __ballot's return type on wave64, instead of a 32-bit unsigned that truncated lanes 32-63.
__any() honors its mask argument (the previous code ignored it).
The generated-mechanism GPU printer (modcc) emits the 64-bit lane mask type.
Default ARB_HIP_ARCHITECTURES updated from the Vega-era gfx906 gfx900 to gfx90a (MI200-class); the cache variable is preserved so it can be overridden.

The CUDA backend is unchanged. The build_install.rst note on AMD GPU support is refreshed (standard hipcc / ROCm, CDNA and RDNA). It also includes the Windows build fixes needed to validate the HIP backend with clang-cl, and relaxes the gpu exprelr unit-test bound to 2 ULP to accommodate RDNA4 floating-point rounding.

Test Plan:

Validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows).

cmake -S . -B build -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx90a
cmake --build build --target unit
./build/bin/unit

The full unit suite (1182 tests) passes on gfx90a, including the GPU mechanism, reduce-by-key, and intrinsics tests that exercise the fixed warp primitives.
Builds and runs on gfx1100 (Linux) and gfx1201 (Windows) with -DARB_HIP_ARCHITECTURES set to the target.

Authored with the assistance of Claude (Anthropic).

Arbor already has an ARB_GPU=hip backend, but its GPU warp primitives assumed a single wavefront width and a 32-bit lane mask, so it was incorrect on AMD hardware: it returned 64 for every HIP build (wrong on 32-wide RDNA wavefronts) and used a 32-bit `unsigned` ballot mask (truncating the upper half on 64-wide CDNA wavefronts). This fixes the HIP backend so it is correct on both CDNA (gfx9xx, wave64) and RDNA (gfx10xx/gfx11xx, wave32): - Wave-size abstraction (gpu_common.hpp): return 64 on `__GFX9__` (CDNA) and 32 otherwise (RDNA), instead of 64 unconditionally. - Lane masks: use HIP's native 64-bit `unsigned long long` for `ballot`/`any` to match `__ballot`'s return type on wave64, instead of a 32-bit `unsigned` that truncated lanes 32-63. - `__any()` honors its mask argument (the previous code ignored it). - The generated-mechanism GPU printer (modcc) emits the 64-bit lane mask type. - Default `ARB_HIP_ARCHITECTURES` updated from the Vega-era `gfx906 gfx900` to `gfx90a` (MI200-class); the cache variable is preserved so it can be overridden. The CUDA backend is unchanged. The build_install.rst note on AMD GPU support is refreshed (standard hipcc / ROCm, CDNA and RDNA). It also includes the Windows build fixes needed to validate the HIP backend with clang-cl, and relaxes the `gpu exprelr` unit-test bound to 2 ULP to accommodate RDNA4 floating-point rounding. Test Plan: Validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows). ``` cmake -S . -B build -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx90a cmake --build build --target unit ./build/bin/unit ``` - The full `unit` suite (1182 tests) passes on gfx90a, including the GPU mechanism, reduce-by-key, and intrinsics tests that exercise the fixed warp primitives. - Builds and runs on gfx1100 (Linux) and gfx1201 (Windows) with `-DARB_HIP_ARCHITECTURES` set to the target. Authored with the assistance of Claude (Anthropic).

thorstenhater

Hi @jeffdaily,

thanks for contributing this, it's been on my list for a long time now, but I have no access to a compatible AMD device. I have left a few comments for me, largely to improve a few of our API decision in light of the changes.

I am going to test this branch on a few CUDA systems now to ensure no hiccups have occurred and see if I have a colleague with more HIP knowledge.

A question: Did you successfully build Arbor on Windows? If so in what environment? Windows support was never a goal, but if it 'just' works, I am happy to add it to the list.

Best

thorstenhater · 2026-06-09T06:37:28Z

 }

-__device__ __inline__ unsigned active_mask() {
+__device__ __inline__ unsigned long long active_mask() {


Note: this changes type depending on the backend HIP/CUDA. Need to test that there no issues arising.

thorstenhater · 2026-06-09T06:40:09Z

 #include "util/strprintf.hpp"

+#ifdef _WIN32
+#  define WIN32_LEAN_AND_MEAN


thorstenhater · 2026-06-09T06:41:08Z

 namespace arb {
 namespace util {

 void* dl_open(const std::filesystem::path& fn) {


I am amazed that this works.

reduce_by_key.hpp had two `#ifdef ARB_HIP / #else` blocks selecting the 64-bit (__clzll/__ffsll) versus 32-bit (__clz/__ffs) bit-twiddling intrinsics inline in the reduction. The cuda_api.hpp/hip_api.hpp headers already own the per-backend warp primitives (ballot, shfl, any), so the backend selection for these intrinsics belongs there too (review feedback on PR arbor-sim#2512). Add count_leading_zeros() and find_first_set() to each api header next to the existing primitives -- __clzll/__ffsll in hip_api.hpp, __clz/__ffs in cuda_api.hpp -- and use them from reduce_by_key.hpp with the lane count taken against the bit width of lane_mask_type (8*sizeof). This removes both preprocessor branches from the algorithm and keeps the backend selection in one place. This is a pure refactor with no behavior change on either backend. On HIP lane_mask_type is 64-bit, so 8*sizeof is 64 and the helpers resolve to the former 64-bit __clzll/__ffsll path. On CUDA lane_mask_type is 32-bit, so 8*sizeof is 32 (equal to threads_per_warp() on CUDA) and the helpers resolve to the former 32-bit __clz/__ffs path; the CUDA code path is byte-for-byte unchanged. Test Plan: Built the arbor `unit` target for gfx90a at this commit and at its parent in separate build trees, then compared the device code objects to confirm the refactor does not change codegen: ``` cmake -S . -B build -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx90a \ -DARB_WITH_PYTHON=OFF -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_C_COMPILER=clang cmake --build build --target unit ./build/bin/unit ``` On AMD Instinct MI250X (gfx90a, CDNA2 wave64, ROCm 7.2.1) the device ISA and exported symbols of the built `unit` binary are identical before and after this change, and the full `unit` suite (1182 tests) passes, including the reduce_by_key and gpu_intrinsics tests that exercise these intrinsics. Authored with the assistance of Claude (Anthropic).

jeffdaily · 2026-06-09T17:52:07Z

Thanks Thorsten.

Windows: yes. Built and run on a Radeon RX 9070 XT (gfx1201, RDNA4) using TheRock's nightly ROCm wheels (7.14.0a) -- specifically the Windows rocm_sdk_devel package, with its bundled clang/clang++ (no MSVC for the compile) and CMake + Ninja. With $ROCM pointing at the unpacked rocm_sdk_devel:

cmake .. -G Ninja -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx1201 \
  -DCMAKE_CXX_COMPILER=$ROCM/bin/clang++.exe -DCMAKE_C_COMPILER=$ROCM/bin/clang.exe \
  -DCMAKE_PREFIX_PATH=$ROCM -DCMAKE_POSITION_INDEPENDENT_CODE=OFF \
  -DARB_WITH_PYTHON=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build . --target unit

It needs the Windows-portability fixes included in this PR: the dylib.cpp LoadLibrary path you spotted; an explicit amdhip64.lib link (clang doesn't auto-add the HIP runtime to a shared lib on Windows); qualifying arb::util::unexpected (MSVC's <eh.h> declares a legacy global ::unexpected); and guarding one std::array<T,0> iterator test. With those, the unit suite builds and runs.

reduce_by_key conditionals: done in c5f27d0 -- moved the warp-intrinsic #ifdefs into the gpu api headers (count_leading_zeros/find_first_set on the HIP and CUDA api), so reduce_by_key.hpp is backend-agnostic now.

hip_api ballot type: lane_mask_type is wavefront-sized (64-bit on HIP wave64, the original 32-bit on CUDA) and used consistently; the CUDA path is unchanged. Exercised on gfx90a (wave64) and gfx1100 (wave32).

jeffdaily · 2026-06-09T18:02:54Z

A question: Did you successfully build Arbor on Windows? If so in what environment? Windows support was never a goal, but if it 'just' works, I am happy to add it to the list.

This here is my first human comment. I know, it's a bit strange for me, too. All the work for this PR was done by claude but with me looking over its virtual shoulder. I had to intervene at times, like deciding your refactor suggestions should just be done as part of this PR because I have ready access to ROCm platforms to evaluated it. But claude did the refactoring when I told it to. I hold the final human review of anything posted publicly, including the PR branch diff, the PR title+body, my previous replies to you here, etc., but everything else has been claude. I do that to avoid AI slop PRs, but all the rest of the effort of porting and validating arbor across Linux/gfx90a, Linux/gfx1100, and Windows/gfx1201 was automated. Me, the human, is still the bottleneck, but I hate reviewing AI slop on my other projects, so I refuse to step aside.

I posted above the Windows environment details. I did confirm that claude wasn't lying about it actually running on Windows, given all the environment caveats above. I don't know if this will work for CUDA Windows builds, too, but chances are it gets you most of the way there.

thorstenhater · 2026-06-10T09:43:30Z

Hi @jeffdaily,

nice to see a good AI / human working relationship. I have confirmed that the CUDA version is
good still and found a person with AMD experience and access for a review.

thorstenhater reviewed Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ROCm] Fix the HIP GPU backend for CDNA and RDNA architectures#2512

[ROCm] Fix the HIP GPU backend for CDNA and RDNA architectures#2512
jeffdaily wants to merge 2 commits into
arbor-sim:masterfrom
jeffdaily:moat-port

jeffdaily commented Jun 8, 2026

Uh oh!

thorstenhater left a comment

Uh oh!

thorstenhater Jun 9, 2026

Uh oh!

Uh oh!

Uh oh!

thorstenhater Jun 9, 2026

Uh oh!

thorstenhater Jun 9, 2026

Uh oh!

jeffdaily commented Jun 9, 2026

Uh oh!

jeffdaily commented Jun 9, 2026

Uh oh!

thorstenhater commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jeffdaily commented Jun 8, 2026

Uh oh!

thorstenhater left a comment

Choose a reason for hiding this comment

Uh oh!

thorstenhater Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

thorstenhater Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

thorstenhater Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

jeffdaily commented Jun 9, 2026

Uh oh!

jeffdaily commented Jun 9, 2026

Uh oh!

thorstenhater commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants