From 3b37235ace8143a5d56b57cbf9e7ae5951915ef6 Mon Sep 17 00:00:00 2001 From: Daniel Holanda Date: Mon, 25 May 2026 09:56:15 -0700 Subject: [PATCH 1/3] Follow AMD branding guideline --- playbooks/supplemental/cvml/README.md | 4 ++-- playbooks/supplemental/cvml/playbook.json | 4 ++-- playbooks/supplemental/gaia-agents/playbook.json | 2 +- playbooks/supplemental/llama-factory-finetuning/README.md | 2 +- playbooks/supplemental/ollama-getting-started/playbook.json | 2 +- playbooks/supplemental/pytorch-kernels/README.md | 4 ++-- playbooks/supplemental/pytorch-kernels/platform.md | 2 +- playbooks/supplemental/pytorch-kernels/playbook.json | 4 ++-- 8 files changed, 12 insertions(+), 12 deletions(-) diff --git a/playbooks/supplemental/cvml/README.md b/playbooks/supplemental/cvml/README.md index a4597f24..20e96bad 100644 --- a/playbooks/supplemental/cvml/README.md +++ b/playbooks/supplemental/cvml/README.md @@ -9,11 +9,11 @@ SPDX-License-Identifier: MIT > This playbook uses special tags that GitHub cannot render. Please visit [amd.com/playbooks](https://amd.com/playbooks) to correctly preview this content. -# Local Computer Vision with Ryzen AI NPU +# Local Computer Vision with AMD Ryzen™ AI NPU ## Overview -The [Ryzen AI CVML Library](https://ryzenai.docs.amd.com/en/latest/ryzen_ai_libraries.html#ryzen-ai-cvml-library) is AMD's C++ computer vision and machine learning toolkit that provides powerful, on-device perception capabilities — including depth estimation, face detection, and face mesh tracking. Built on top of the Ryzen AI drivers, the library automatically selects the best available hardware (GPU or NPU) for inference, letting you add AI features to C++ applications without worrying about model training or framework integration. All processing happens locally on your system, making it ideal for privacy-sensitive, low-latency applications. +The [Ryzen AI CVML Library](https://ryzenai.docs.amd.com/en/latest/ryzen_ai_libraries.html#ryzen-ai-cvml-library) is an AMD C++ computer vision and machine learning toolkit that provides powerful, on-device perception capabilities — including depth estimation, face detection, and face mesh tracking. Built on top of the Ryzen AI drivers, the library automatically selects the best available hardware (GPU or NPU) for inference, letting you add AI features to C++ applications without worrying about model training or framework integration. All processing happens locally on your system, making it ideal for privacy-sensitive, low-latency applications. This playbook teaches you how to set up the Ryzen AI CVML Library, build the included sample applications, and run face detection on a sample video. diff --git a/playbooks/supplemental/cvml/playbook.json b/playbooks/supplemental/cvml/playbook.json index 2b8f45a7..423d86a9 100644 --- a/playbooks/supplemental/cvml/playbook.json +++ b/playbooks/supplemental/cvml/playbook.json @@ -1,7 +1,7 @@ { "id": "cvml", - "title": "Local Computer Vision with Ryzen AI NPU", - "description": "Build local perception capabilities using CVML SDK on top of RyzenAI and ROCm", + "title": "Local Computer Vision with AMD Ryzen\u2122 AI NPU", + "description": "Build local perception capabilities using the CVML SDK on top of Ryzen AI and AMD ROCm\u2122 software", "time": 60, "supported_platforms": { "halo": [ diff --git a/playbooks/supplemental/gaia-agents/playbook.json b/playbooks/supplemental/gaia-agents/playbook.json index b0ee94f1..ccad7ba3 100644 --- a/playbooks/supplemental/gaia-agents/playbook.json +++ b/playbooks/supplemental/gaia-agents/playbook.json @@ -1,7 +1,7 @@ { "id": "gaia-agents", "title": "Building Your First Agent with GAIA", - "description": "Build a 100% local AI agent \u2014 no cloud APIs needed. Use the GAIA SDK to create a hardware advisor on your STX Halo", + "description": "Build a 100% local AI agent \u2014 no cloud APIs needed. Use the GAIA SDK to create a hardware advisor on your AMD Ryzen\u2122 AI", "time": 20, "supported_platforms": { "halo": [ diff --git a/playbooks/supplemental/llama-factory-finetuning/README.md b/playbooks/supplemental/llama-factory-finetuning/README.md index 109a141e..e046f79b 100644 --- a/playbooks/supplemental/llama-factory-finetuning/README.md +++ b/playbooks/supplemental/llama-factory-finetuning/README.md @@ -193,7 +193,7 @@ llamafactory-cli train examples/train_lora/qwen3_lora_sft_ci.yaml -After running LLM finetuning, all generated outputs are stored in the "output_dir", including model checkpoint files, configuration files, and training metrics. +After running LLM fine-tuning, all generated outputs are stored in the "output_dir", including model checkpoint files, configuration files, and training metrics.

Qwen3 LoRA Fine-tuning diff --git a/playbooks/supplemental/ollama-getting-started/playbook.json b/playbooks/supplemental/ollama-getting-started/playbook.json index 80762d43..2abca581 100644 --- a/playbooks/supplemental/ollama-getting-started/playbook.json +++ b/playbooks/supplemental/ollama-getting-started/playbook.json @@ -1,7 +1,7 @@ { "id": "ollama-getting-started", "title": "Getting Started with Ollama", - "description": "Install Ollama and run LLMs locally \u2014 chat from the terminal, desktop app, or REST API on your STX Halo\u2122", + "description": "Install Ollama and run LLMs locally \u2014 chat from the terminal, desktop app, or REST API on your AMD Ryzen\u2122 AI", "time": 15, "supported_platforms": { "halo": [ diff --git a/playbooks/supplemental/pytorch-kernels/README.md b/playbooks/supplemental/pytorch-kernels/README.md index 27c18b54..01e6f3a6 100644 --- a/playbooks/supplemental/pytorch-kernels/README.md +++ b/playbooks/supplemental/pytorch-kernels/README.md @@ -9,7 +9,7 @@ SPDX-License-Identifier: MIT > This playbook uses special tags that GitHub cannot render. Please visit [amd.com/playbooks](https://amd.com/playbooks) to correctly preview this content. -# Compile your own GPU kernels for Pytorch+ROCm +# Compile your own GPU kernels for PyTorch + AMD ROCm™ ## Overview @@ -122,7 +122,7 @@ PyTorch also exposes `torch.cuda._compile_kernel()`, a high-level shortcut to JI ## Setup ### Prerequisites - Windows -- Install latest: [AMD Adrenalin Software](https://www.amd.com/en/products/software/adrenalin.html) +- Install latest: [AMD Software: Adrenalin Edition™](https://www.amd.com/en/products/software/adrenalin.html) ### Create a Virtual Environment diff --git a/playbooks/supplemental/pytorch-kernels/platform.md b/playbooks/supplemental/pytorch-kernels/platform.md index a9423fa8..f2b0e123 100644 --- a/playbooks/supplemental/pytorch-kernels/platform.md +++ b/playbooks/supplemental/pytorch-kernels/platform.md @@ -117,7 +117,7 @@ print("HIP available:", torch.cuda.is_available()) ``` ### Prerequisites -- Install latest: [AMD Adrenalin Software](https://www.amd.com/en/products/software/adrenalin.html) +- Install latest: [AMD Software: Adrenalin Edition™](https://www.amd.com/en/products/software/adrenalin.html) ### Install ROCm Python packages via pip ```bash diff --git a/playbooks/supplemental/pytorch-kernels/playbook.json b/playbooks/supplemental/pytorch-kernels/playbook.json index 3dd03cfd..e31c742c 100644 --- a/playbooks/supplemental/pytorch-kernels/playbook.json +++ b/playbooks/supplemental/pytorch-kernels/playbook.json @@ -1,7 +1,7 @@ { "id": "pytorch-kernels", - "title": "Custom GPU Kernels with PyTorch ROCm", - "description": "Write and optimize custom GPU kernels using PyTorch and ROCm on STX Halo\u2122", + "title": "Custom GPU Kernels with PyTorch and AMD ROCm\u2122", + "description": "Write and optimize custom GPU kernels using PyTorch and AMD ROCm\u2122 software on AMD Ryzen\u2122 AI", "time": 120, "supported_platforms": { "halo": [ From e2ac7e3f1cd3975b8f5d26fed6b1e7fe9dafc743 Mon Sep 17 00:00:00 2001 From: Daniel Holanda Date: Mon, 25 May 2026 10:10:03 -0700 Subject: [PATCH 2/3] grammar fixes --- playbooks/supplemental/cvml/README.md | 2 +- .../supplemental/llama-factory-finetuning/README.md | 6 +++--- playbooks/supplemental/pytorch-kernels/README.md | 12 ++++++------ 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/playbooks/supplemental/cvml/README.md b/playbooks/supplemental/cvml/README.md index 20e96bad..2364a6df 100644 --- a/playbooks/supplemental/cvml/README.md +++ b/playbooks/supplemental/cvml/README.md @@ -15,7 +15,7 @@ SPDX-License-Identifier: MIT The [Ryzen AI CVML Library](https://ryzenai.docs.amd.com/en/latest/ryzen_ai_libraries.html#ryzen-ai-cvml-library) is an AMD C++ computer vision and machine learning toolkit that provides powerful, on-device perception capabilities — including depth estimation, face detection, and face mesh tracking. Built on top of the Ryzen AI drivers, the library automatically selects the best available hardware (GPU or NPU) for inference, letting you add AI features to C++ applications without worrying about model training or framework integration. All processing happens locally on your system, making it ideal for privacy-sensitive, low-latency applications. -This playbook teaches you how to set up the Ryzen AI CVML Library, build the included sample applications, and run face detection on a sample video. +This playbook teaches you how to set up the Ryzen AI CVML Library, build the included sample applications, and run face detection on a sample image. ## What You'll Learn diff --git a/playbooks/supplemental/llama-factory-finetuning/README.md b/playbooks/supplemental/llama-factory-finetuning/README.md index e046f79b..84d61e32 100644 --- a/playbooks/supplemental/llama-factory-finetuning/README.md +++ b/playbooks/supplemental/llama-factory-finetuning/README.md @@ -136,7 +136,7 @@ print("PASS: Required LLaMA Factory example files exist") These example configuration files have specified model parameters, fine-tuning method parameters, dataset parameters, evaluation parameters, and more. You can configure them according to your own needs. In this playbook, we will use [qwen3_lora_sft.yaml](https://github.com/hiyouga/LlamaFactory/blob/main/examples/train_lora/qwen3_lora_sft.yaml). **Key parameters explained:** -- `model_name_or_path` - HuggingFace Model name or local model file path. +- `model_name_or_path` - Hugging Face model name or local model file path. - `stage` - Training stage. Options: rm (reward modeling), pt (pretrain), sft (Supervised Fine-Tuning), PPO, DPO, KTO, ORPO. - `do_train` - true for training, false for evaluation - `finetuning_type` - Fine-tuning method. Options: freeze, lora, full @@ -236,7 +236,7 @@ print(f"Found adapter weights: {adapter_weights}") **llamafactory-cli chat** is designed for interactive chat/inference with LLMs (both base models and LoRA-fine-tuned models). LLaMA Factory provides the sample configuration to run inference of fine-tuned models in [examples/inference](https://github.com/hiyouga/LlamaFactory/tree/main/examples/inference). You can also modify this sample configuration to change the settings, such as the inference backend. -Use the following command to test Qwen3 fine-tuned model: +Use the following command to test the Qwen3 fine-tuned model: ```bash llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml @@ -252,7 +252,7 @@ An example chat using the fine-tuned model is shown below: For production use-cases, the pre-trained model and the LoRA adapter need to be merged and exported into a single model. This merged model can be used as a normal Hugging Face model file. LLaMA Factory provides the sample configurations in [examples/merge_lora](https://github.com/hiyouga/LlamaFactory/tree/main/examples/merge_lora). -Use the following command to export Qwen3 fine-tuned model: +Use the following command to export the Qwen3 fine-tuned model: ```bash llamafactory-cli export examples/merge_lora/qwen3_lora_sft.yaml diff --git a/playbooks/supplemental/pytorch-kernels/README.md b/playbooks/supplemental/pytorch-kernels/README.md index 01e6f3a6..298f00b3 100644 --- a/playbooks/supplemental/pytorch-kernels/README.md +++ b/playbooks/supplemental/pytorch-kernels/README.md @@ -48,7 +48,7 @@ This playbook covers two approaches for kernel development: | **C++ Extension** | `CUDAExtension` + pybind11, compile a `.cu` file into a native `.so` and import it | -Both approaches run on AMD GPUs. This is possible because PyTorch's ROCm build maps the entire CUDA API surface to HIP, `torch.cuda`, `CUDAExtension`, and CUDA kernel syntax all work on AMD hardware transparently. You write CUDA-style code; ROCm handles the translation. +Both approaches run on AMD GPUs. This is possible because PyTorch's ROCm build maps the entire CUDA API surface to HIP. `torch.cuda`, `CUDAExtension`, and CUDA kernel syntax all work on AMD hardware transparently. You write CUDA-style code; ROCm handles the translation. --- @@ -95,7 +95,7 @@ These variables are combined to compute a globally unique thread index: int idx = blockIdx.x * blockDim.x + threadIdx.x; ``` -Total threads = `gridDim.x * blockDim.x`. Each thread processes one element independently, this is **data parallelism**. The same operation runs on many elements at once with no inter-thread dependency. +Total threads = `gridDim.x * blockDim.x`. Each thread processes one element independently; this is **data parallelism**. The same operation runs on many elements at once with no inter-thread dependency. --- @@ -669,7 +669,7 @@ the CPU immediately continues executing the next instruction without waiting for pip install --no-build-isolation -v . ``` -`CUDAExtension` is a CUDA build helper from `torch.utils.cpp_extension`. On AMD with ROCm, PyTorch **remaps `CUDAExtension` to use `hipcc`** instead of `nvcc`, so the same `setup.py` that would build a CUDA extension on NVIDIA compiles to AMD GPU code without any changes. This is the key mechanism that makes CUDA extension code portable to AMD: PyTorch's ROCm build intercepts the build path and routes it through the HIP compiler. Produces these in the same directory: +`CUDAExtension` is a CUDA build helper from `torch.utils.cpp_extension`. On AMD with ROCm, PyTorch **remaps `CUDAExtension` to use `hipcc`** instead of `nvcc`, so the same `setup.py` that would build a CUDA extension on NVIDIA compiles to AMD GPU code without any changes. This is the key mechanism that makes CUDA extension code portable to AMD: PyTorch's ROCm build intercepts the build path and routes it through the HIP compiler. This produces the following in the same directory: - `build/`: directory with the `.pyd` files - `add_one_kernel.hip`: the HIP source generated by hipifying the `.cu` file; this is what `hipcc` actually compiled @@ -848,7 +848,7 @@ Each output element is defined as: $$C[row, col] = \sum_{n=0}^{N-1} A[row, n] \cdot B[n, col]$$ -Each output element is assigned to exactly one thread, and threads don't depend on each other's results, thread `(0,0)` and thread `(1,5)` run simultaneously with no coordination. However, within a single thread the dot product is **sequential**: the `n` loop iterates N times, accumulating one multiply-add per step. +Each output element is assigned to exactly one thread, and threads don't depend on each other's results: thread `(0,0)` and thread `(1,5)` run simultaneously with no coordination. However, within a single thread the dot product is **sequential**: the `n` loop iterates N times, accumulating one multiply-add per step. #### Row-Major Memory Layout @@ -1148,7 +1148,7 @@ $code | & $Python - #### Approach B: C++ Extension -The full manual path: write the kernel and Python binding in a `.cu` file, compile it as a native extension, then import and call it from Python. Mirrors the structure of `add_one_kernel.cu` exactly, only the kernel signature and launcher logic differ. +The full manual path: write the kernel and Python binding in a `.cu` file, compile it as a native extension, then import and call it from Python. This mirrors the structure of `add_one_kernel.cu` exactly; only the kernel signature and launcher logic differ. **Files:** @@ -1216,7 +1216,7 @@ Compared to `add_one_launcher` in Walkthrough 1, the launcher here: pip install --no-build-isolation -v . ``` -Produces these in the same directory: +This produces the following in the same directory: - `build/`: directory with the `.pyd` files - `matmul_kernel.hip`: the HIP source generated by hipifying the `.cu` file; this is what `hipcc` actually compiled From bdb137b72bdbccc379ae4a3f42bcd61c1c950f64 Mon Sep 17 00:00:00 2001 From: Daniel Holanda Date: Wed, 27 May 2026 07:55:41 -0700 Subject: [PATCH 3/3] Apply suggestions from code review Co-authored-by: Victoria Godsoe --- playbooks/supplemental/pytorch-kernels/README.md | 8 ++++---- playbooks/supplemental/pytorch-kernels/platform.md | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/playbooks/supplemental/pytorch-kernels/README.md b/playbooks/supplemental/pytorch-kernels/README.md index 298f00b3..a6b1e9b9 100644 --- a/playbooks/supplemental/pytorch-kernels/README.md +++ b/playbooks/supplemental/pytorch-kernels/README.md @@ -9,23 +9,23 @@ SPDX-License-Identifier: MIT > This playbook uses special tags that GitHub cannot render. Please visit [amd.com/playbooks](https://amd.com/playbooks) to correctly preview this content. -# Compile your own GPU kernels for PyTorch + AMD ROCm™ +# Compile your own GPU kernels for PyTorch + AMD ROCm™ Software ## Overview -Write a GPU kernel from scratch, compile it, and launch it on an AMD GPU, then watch utilization spike. This playbook shows how GPU computation actually works: you write the kernel code, and it executes in parallel across thousands of threads. +Write a GPU kernel from scratch, compile it, and launch it on an AMD Radeon™ GPU, then watch utilization spike. This playbook shows how GPU computation actually works: you write the kernel code, and it executes in parallel across thousands of threads. ## What You'll Learn - How GPU kernels work: grids, blocks, threads, and the indexing model that maps them to data -- How AMD's ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification +- How the AMD ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification - How to compile a kernel at runtime using `torch.cuda._compile_kernel` - How to build a native C++ kernel extension with `CUDAExtension` + pybind11, importable from Python - How GPU kernels work: grids, blocks, threads, and the indexing model that maps them to data -- How AMD's ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification +- How the AMD ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification - How to compile a kernel at runtime using `torch.cuda._compile_kernel` - How to build a native C++ kernel extension with `CUDAExtension` + pybind11, importable from Python - How to measure kernel execution time and monitor live GPU utilization with `rocm-smi` diff --git a/playbooks/supplemental/pytorch-kernels/platform.md b/playbooks/supplemental/pytorch-kernels/platform.md index f2b0e123..89571796 100644 --- a/playbooks/supplemental/pytorch-kernels/platform.md +++ b/playbooks/supplemental/pytorch-kernels/platform.md @@ -5,7 +5,7 @@ This document describes the expected platform configurations for running this pl ## Required Frameworks ## Linux -If you're running on a Halo Box, ROCm and PyTorch are preinstalled. You can validate them by running: +If you're running on AMD Ryzen™ AI Halo Developer Platform, AMD ROCm™ software and PyTorch are preinstalled. You can validate them by running: ```bash hipcc --version