Resolving pytorch-kernels failing CI tests - Issue #321 #332

Open
sreeram-11 wants to merge 3 commits into main from sreeram/py-kernels-resolve-dll

@sreeram-11
Collaborator

@sreeram-11 sreeram-11 commented Jun 2, 2026

  1. Documentation/Instruction updates

    • Windows code fences now use powershell wherever appropriate
    • [Windows]: Set the ROCM_BIN environment variable
    • [Linux]: Avoid hardcoding Python 3.12
    • Changed the Linux and Windows dependency install order so that rocm-sdk-devel matches the ROCm version that torch==2.10.0 actually installs
      • Install torch==2.10.0 torchaudio torchvision
      • Read the installed rocm package version
      • Install rocm[libraries,devel] pinned to that exact version
  2. env-setup-rocm-pytorch-windows

    • Replaced the hardcoded HIPRTC block with a dynamic one that finds whatever non-builtins hiprtc*.dll exists in $ROCM_BIN, then copies it to the name PyTorch expects
    • Added ROCM_BIN to environment
    • Added a tiny JIT sanity check so that failures are caught in the setup step instead of the next test
    • Fixed the pip command
    • Changed the Linux and Windows dependency install order so that rocm-sdk-devel matches the ROCm version that torch==2.10.0 actually installs
  3. vector-addition-jit-windows

    • Replaced the hardcoded HIPRTC block with a dynamic one that finds whatever non-builtins hiprtc*.dll exists in $ROCM_BIN, then copies it to the name PyTorch expects.
  4. matmul-jit-windows

    • Replaced the hardcoded HIPRTC block with a dynamic one that finds whatever non-builtins hiprtc*.dll exists in $ROCM_BIN, then copies it to the name PyTorch expects.
  5. env-setup-rocm-pytorch-linux

    • Changed the Linux and Windows dependency install order so that rocm-sdk-devel matches the ROCm version that torch==2.10.0 actually installs
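
A minimal sketch of the dynamic HIPRTC lookup described in items 2 to 4, written in Python for illustration rather than as the actual workflow step. The helper name `alias_hiprtc_dll` and its `expected_name` parameter are hypothetical; the DLL name PyTorch expects is not stated in this description, so the caller has to supply it.

```python
import shutil
from pathlib import Path


def alias_hiprtc_dll(rocm_bin: Path, expected_name: str) -> Path:
    """Copy the first non-builtins hiprtc*.dll in rocm_bin to expected_name.

    Hypothetical helper: expected_name is whatever DLL name PyTorch looks
    for, which is an assumption the caller must provide.
    """
    target = rocm_bin / expected_name
    # Skip the builtins DLL and the alias itself, so only real sources remain.
    candidates = sorted(
        p for p in rocm_bin.glob("hiprtc*.dll")
        if "builtins" not in p.name.lower() and p.name != expected_name
    )
    if not candidates:
        if target.exists():
            return target  # the expected name is already present
        raise FileNotFoundError(f"no non-builtins hiprtc*.dll found in {rocm_bin}")
    shutil.copyfile(candidates[0], target)
    return target
```

Raising when nothing matches is a deliberate choice here: it surfaces a missing DLL in the setup step instead of in a later JIT test.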

@sreeram-11 sreeram-11 requested a review from danielholanda June 2, 2026 23:29
@adamlam2-amd
Collaborator

is torch 2.10 the correct version?
another thought is that it seems like we are hand-holding the environment setup instead of actually testing it. In general, please make sure that the setup and install steps are tested from blank/clean machines - most of the errors occur during these setups.

Collaborator

@danielholanda danielholanda left a comment

Need confirmation from @adamlam2-amd that this is indeed the recommended install strategy before merging.

Comment on lines +159 to +160
ROCM_VERSION="$(python -c 'import importlib.metadata as m; print(m.version("rocm"))')"
python -m pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ "rocm[libraries,devel]==${ROCM_VERSION}"
Collaborator

@adamlam2-amd can you confirm that this is the recommended path?

Collaborator Author

@sdevinenamd, please review

@sreeram-11
Collaborator Author

> is torch 2.10 the correct version?

@sdevinenamd, can you please confirm?

@adamlam2-amd
Collaborator

so the Windows version of this playbook doesn't work, and the commands haven't really been validated either.

@adamlam2-amd
Collaborator

please wait for #246 to be merged.

@sreeram-11
Collaborator Author

> please wait for #246 to be merged.

It should actually be the other way around: #246 depends on this PR (#332).

Right now, CI is passing only because of the changes in this PR. The main fix here was reordering the install steps:

  • Install torch first
  • Read the version of the rocm package that torch pulled in
  • Then install rocm[libraries,devel]

Previously, we were installing rocm[libraries,devel] first and then torch, which led to a version mismatch. Specifically, rocm-sdk-devel was getting installed at a different version than the torch/ROCm runtime stack.
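
The pinned install from the last step can be sketched in Python roughly as follows. This is an illustration, not the playbook's script: `rocm_install_command` and `installed_rocm_version` are hypothetical helpers, and the index URL is the gfx1151 one from the reviewed lines, which is assumed not to apply to other targets.

```python
import sys
from importlib import metadata

# Index URL copied from the reviewed lines; assumed to be gfx1151-specific.
ROCM_INDEX_URL = "https://rocm.nightlies.amd.com/v2/gfx1151/"


def installed_rocm_version() -> str:
    """Read the version of the rocm package that installing torch pulled in."""
    return metadata.version("rocm")


def rocm_install_command(rocm_version: str) -> list[str]:
    """Build the pip command that pins rocm[libraries,devel] to rocm_version."""
    return [
        sys.executable, "-m", "pip", "install",
        "--index-url", ROCM_INDEX_URL,
        f"rocm[libraries,devel]=={rocm_version}",
    ]
```

With torch already installed, `rocm_install_command(installed_rocm_version())` yields the argument list to hand to `subprocess.run`, so the devel package cannot resolve to a newer nightly than the runtime.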

Example of the mismatch we were seeing:

| Package | Version | Notes |
| --- | --- | --- |
| rocm | 7.13.0a20260502 | |
| rocm-sdk-core | 7.13.0a20260502 | |
| rocm-sdk-libraries-gfx1151 | 7.13.0a20260502 | |
| torch | 2.10.0+rocm7.13.0a20260502 | |
| rocm-sdk-devel | 7.14.0a20260602 | mismatch |

With the updated order, everything stays aligned.
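
One way to catch this kind of drift early is a small check over the installed versions. The sketch below is hypothetical (neither helper is part of this PR) and assumes torch encodes its ROCm build as a `+rocm<version>` local version tag, as in the example above.

```python
import re


def rocm_version_of(version: str) -> str:
    """Return the ROCm version encoded in a package version string.

    For torch-style versions such as "2.10.0+rocm7.13.0a20260502" this is the
    part after "+rocm"; for the rocm* packages it is the version itself.
    """
    match = re.search(r"\+rocm(.+)$", version)
    return match.group(1) if match else version


def find_mismatches(installed: dict[str, str], reference: str = "rocm") -> dict[str, str]:
    """Return packages whose ROCm version differs from the reference package."""
    expected = rocm_version_of(installed[reference])
    return {
        name: version
        for name, version in installed.items()
        if rocm_version_of(version) != expected
    }


# Versions from the mismatch example above.
observed = {
    "rocm": "7.13.0a20260502",
    "rocm-sdk-core": "7.13.0a20260502",
    "rocm-sdk-libraries-gfx1151": "7.13.0a20260502",
    "torch": "2.10.0+rocm7.13.0a20260502",
    "rocm-sdk-devel": "7.14.0a20260602",
}
print(find_mismatches(observed))  # {'rocm-sdk-devel': '7.14.0a20260602'}
```

In a real environment the `installed` mapping could be filled from `importlib.metadata.version` for each package name.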

@sreeram-11
Collaborator Author

> so the Windows version of this playbook doesn't work, and the commands haven't really been validated either.

I was able to run this on Windows; at least the CI tests are all passing there. For this playbook, CI is actually pretty close to real user steps since there's no GUI involved.

Please take a look at PR #265 to see exactly what the tests are validating.

@adamlam2-amd
Collaborator

do you have a windows halobox? please test there to confirm. otherwise, for both satya and myself, we couldn't get it working.

@adamlam2-amd
Collaborator

we are also using torch 2.11

3 participants