Skip to content

[ROCm] Fix the HIP GPU backend for CDNA and RDNA architectures#2512

Open
jeffdaily wants to merge 2 commits into
arbor-sim:masterfrom
jeffdaily:moat-port
Open

[ROCm] Fix the HIP GPU backend for CDNA and RDNA architectures#2512
jeffdaily wants to merge 2 commits into
arbor-sim:masterfrom
jeffdaily:moat-port

Conversation

@jeffdaily

Copy link
Copy Markdown

Arbor already has an ARB_GPU=hip backend, but its GPU warp primitives assumed a single wavefront width and a 32-bit lane mask, so it was incorrect on AMD hardware: it returned 64 for every HIP build (wrong on 32-wide RDNA wavefronts) and used a 32-bit unsigned ballot mask (truncating the upper half on 64-wide CDNA wavefronts). This fixes the HIP backend so it is correct on both CDNA (gfx9xx, wave64) and RDNA (gfx10xx/gfx11xx, wave32):

  • Wave-size abstraction (gpu_common.hpp): return 64 on __GFX9__ (CDNA) and 32 otherwise (RDNA), instead of 64 unconditionally.
  • Lane masks: use HIP's native 64-bit unsigned long long for ballot/any to match __ballot's return type on wave64, instead of a 32-bit unsigned that truncated lanes 32-63.
  • __any() honors its mask argument (the previous code ignored it).
  • The generated-mechanism GPU printer (modcc) emits the 64-bit lane mask type.
  • Default ARB_HIP_ARCHITECTURES updated from the Vega-era gfx906 gfx900 to gfx90a (MI200-class); the cache variable is preserved so it can be overridden.

The CUDA backend is unchanged. The build_install.rst note on AMD GPU support is refreshed (standard hipcc / ROCm, CDNA and RDNA). It also includes the Windows build fixes needed to validate the HIP backend with clang-cl, and relaxes the gpu exprelr unit-test bound to 2 ULP to accommodate RDNA4 floating-point rounding.

Test Plan:

Validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows).

cmake -S . -B build -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx90a
cmake --build build --target unit
./build/bin/unit
  • The full unit suite (1182 tests) passes on gfx90a, including the GPU mechanism, reduce-by-key, and intrinsics tests that exercise the fixed warp primitives.
  • Builds and runs on gfx1100 (Linux) and gfx1201 (Windows) with -DARB_HIP_ARCHITECTURES set to the target.

Authored with the assistance of Claude (Anthropic).

Arbor already has an ARB_GPU=hip backend, but its GPU warp primitives assumed a single wavefront width and a 32-bit lane mask, so it was incorrect on AMD hardware: it returned 64 for every HIP build (wrong on 32-wide RDNA wavefronts) and used a 32-bit `unsigned` ballot mask (truncating the upper half on 64-wide CDNA wavefronts). This fixes the HIP backend so it is correct on both CDNA (gfx9xx, wave64) and RDNA (gfx10xx/gfx11xx, wave32):

- Wave-size abstraction (gpu_common.hpp): return 64 on `__GFX9__` (CDNA) and 32 otherwise (RDNA), instead of 64 unconditionally.
- Lane masks: use HIP's native 64-bit `unsigned long long` for `ballot`/`any` to match `__ballot`'s return type on wave64, instead of a 32-bit `unsigned` that truncated lanes 32-63.
- `__any()` honors its mask argument (the previous code ignored it).
- The generated-mechanism GPU printer (modcc) emits the 64-bit lane mask type.
- Default `ARB_HIP_ARCHITECTURES` updated from the Vega-era `gfx906 gfx900` to `gfx90a` (MI200-class); the cache variable is preserved so it can be overridden.

The CUDA backend is unchanged. The build_install.rst note on AMD GPU support is refreshed (standard hipcc / ROCm, CDNA and RDNA). It also includes the Windows build fixes needed to validate the HIP backend with clang-cl, and relaxes the `gpu exprelr` unit-test bound to 2 ULP to accommodate RDNA4 floating-point rounding.

Test Plan:

Validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows).

```
cmake -S . -B build -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx90a
cmake --build build --target unit
./build/bin/unit
```

- The full `unit` suite (1182 tests) passes on gfx90a, including the GPU mechanism, reduce-by-key, and intrinsics tests that exercise the fixed warp primitives.
- Builds and runs on gfx1100 (Linux) and gfx1201 (Windows) with `-DARB_HIP_ARCHITECTURES` set to the target.

Authored with the assistance of Claude (Anthropic).

@thorstenhater thorstenhater left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jeffdaily,

thanks for contributing this, it's been on my list for a long time now, but I have no access to a compatible AMD device. I have left a few comments for me, largely to improve a few of our API decision in light of the changes.

I am going to test this branch on a few CUDA systems now to ensure no hiccups have occurred and see if I have a colleague with more HIP knowledge.

A question: Did you successfully build Arbor on Windows? If so in what environment? Windows support was never a goal, but if it 'just' works, I am happy to add it to the list.

Best

}

__device__ __inline__ unsigned active_mask() {
__device__ __inline__ unsigned long long active_mask() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this changes type depending on the backend HIP/CUDA. Need to test that there no issues arising.

Comment thread arbor/include/arbor/gpu/reduce_by_key.hpp Outdated
Comment thread arbor/include/arbor/gpu/reduce_by_key.hpp Outdated
Comment thread arbor/util/dylib.cpp
#include "util/strprintf.hpp"

#ifdef _WIN32
# define WIN32_LEAN_AND_MEAN

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:D

Comment thread arbor/util/dylib.cpp
namespace arb {
namespace util {

void* dl_open(const std::filesystem::path& fn) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am amazed that this works.

reduce_by_key.hpp had two `#ifdef ARB_HIP / #else` blocks selecting the
64-bit (__clzll/__ffsll) versus 32-bit (__clz/__ffs) bit-twiddling
intrinsics inline in the reduction. The cuda_api.hpp/hip_api.hpp headers
already own the per-backend warp primitives (ballot, shfl, any), so the
backend selection for these intrinsics belongs there too (review feedback
on PR arbor-sim#2512).

Add count_leading_zeros() and find_first_set() to each api header next to
the existing primitives -- __clzll/__ffsll in hip_api.hpp, __clz/__ffs in
cuda_api.hpp -- and use them from reduce_by_key.hpp with the lane count
taken against the bit width of lane_mask_type (8*sizeof). This removes
both preprocessor branches from the algorithm and keeps the backend
selection in one place.

This is a pure refactor with no behavior change on either backend. On HIP
lane_mask_type is 64-bit, so 8*sizeof is 64 and the helpers resolve to the
former 64-bit __clzll/__ffsll path. On CUDA lane_mask_type is 32-bit, so
8*sizeof is 32 (equal to threads_per_warp() on CUDA) and the helpers
resolve to the former 32-bit __clz/__ffs path; the CUDA code path is
byte-for-byte unchanged.

Test Plan:

Built the arbor `unit` target for gfx90a at this commit and at its parent
in separate build trees, then compared the device code objects to confirm
the refactor does not change codegen:

```
cmake -S . -B build -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx90a \
  -DARB_WITH_PYTHON=OFF -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_C_COMPILER=clang
cmake --build build --target unit
./build/bin/unit
```

On AMD Instinct MI250X (gfx90a, CDNA2 wave64, ROCm 7.2.1) the device ISA
and exported symbols of the built `unit` binary are identical before and
after this change, and the full `unit` suite (1182 tests) passes,
including the reduce_by_key and gpu_intrinsics tests that exercise these
intrinsics.

Authored with the assistance of Claude (Anthropic).
@jeffdaily

Copy link
Copy Markdown
Author

Thanks Thorsten.

Windows: yes. Built and run on a Radeon RX 9070 XT (gfx1201, RDNA4) using TheRock's nightly ROCm wheels (7.14.0a) -- specifically the Windows rocm_sdk_devel package, with its bundled clang/clang++ (no MSVC for the compile) and CMake + Ninja. With $ROCM pointing at the unpacked rocm_sdk_devel:

cmake .. -G Ninja -DARB_GPU=hip -DARB_HIP_ARCHITECTURES=gfx1201 \
  -DCMAKE_CXX_COMPILER=$ROCM/bin/clang++.exe -DCMAKE_C_COMPILER=$ROCM/bin/clang.exe \
  -DCMAKE_PREFIX_PATH=$ROCM -DCMAKE_POSITION_INDEPENDENT_CODE=OFF \
  -DARB_WITH_PYTHON=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build . --target unit

It needs the Windows-portability fixes included in this PR: the dylib.cpp LoadLibrary path you spotted; an explicit amdhip64.lib link (clang doesn't auto-add the HIP runtime to a shared lib on Windows); qualifying arb::util::unexpected (MSVC's <eh.h> declares a legacy global ::unexpected); and guarding one std::array<T,0> iterator test. With those, the unit suite builds and runs.

reduce_by_key conditionals: done in c5f27d0 -- moved the warp-intrinsic #ifdefs into the gpu api headers (count_leading_zeros/find_first_set on the HIP and CUDA api), so reduce_by_key.hpp is backend-agnostic now.

hip_api ballot type: lane_mask_type is wavefront-sized (64-bit on HIP wave64, the original 32-bit on CUDA) and used consistently; the CUDA path is unchanged. Exercised on gfx90a (wave64) and gfx1100 (wave32).

@jeffdaily

Copy link
Copy Markdown
Author

A question: Did you successfully build Arbor on Windows? If so in what environment? Windows support was never a goal, but if it 'just' works, I am happy to add it to the list.

This here is my first human comment. I know, it's a bit strange for me, too. All the work for this PR was done by claude but with me looking over its virtual shoulder. I had to intervene at times, like deciding your refactor suggestions should just be done as part of this PR because I have ready access to ROCm platforms to evaluated it. But claude did the refactoring when I told it to. I hold the final human review of anything posted publicly, including the PR branch diff, the PR title+body, my previous replies to you here, etc., but everything else has been claude. I do that to avoid AI slop PRs, but all the rest of the effort of porting and validating arbor across Linux/gfx90a, Linux/gfx1100, and Windows/gfx1201 was automated. Me, the human, is still the bottleneck, but I hate reviewing AI slop on my other projects, so I refuse to step aside.

I posted above the Windows environment details. I did confirm that claude wasn't lying about it actually running on Windows, given all the environment caveats above. I don't know if this will work for CUDA Windows builds, too, but chances are it gets you most of the way there.

@thorstenhater

Copy link
Copy Markdown
Contributor

Hi @jeffdaily,

nice to see a good AI / human working relationship. I have confirmed that the CUDA version is
good still and found a person with AMD experience and access for a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants