diff --git a/playbooks/supplemental/cvml/README.md b/playbooks/supplemental/cvml/README.md index 0c12e2b9..9931dd95 100644 --- a/playbooks/supplemental/cvml/README.md +++ b/playbooks/supplemental/cvml/README.md @@ -9,13 +9,13 @@ SPDX-License-Identifier: MIT > This playbook uses special tags that GitHub cannot render. Please visit [amd.com/playbooks](https://amd.com/playbooks) to correctly preview this content. -# Local Computer Vision with Ryzen AI NPU +# Local Computer Vision with AMD Ryzen™ AI NPU ## Overview -The [Ryzen AI CVML Library](https://ryzenai.docs.amd.com/en/latest/ryzen_ai_libraries.html#ryzen-ai-cvml-library) is AMD's C++ computer vision and machine learning toolkit that provides powerful, on-device perception capabilities — including depth estimation, face detection, and face mesh tracking. Built on top of the Ryzen AI drivers, the library automatically selects the best available hardware (GPU or NPU) for inference, letting you add AI features to C++ applications without worrying about model training or framework integration. All processing happens locally on your system, making it ideal for privacy-sensitive, low-latency applications. +The [Ryzen AI CVML Library](https://ryzenai.docs.amd.com/en/latest/ryzen_ai_libraries.html#ryzen-ai-cvml-library) is an AMD C++ computer vision and machine learning toolkit that provides powerful, on-device perception capabilities — including depth estimation, face detection, and face mesh tracking. Built on top of the Ryzen AI drivers, the library automatically selects the best available hardware (GPU or NPU) for inference, letting you add AI features to C++ applications without worrying about model training or framework integration. All processing happens locally on your system, making it ideal for privacy-sensitive, low-latency applications. -This playbook teaches you how to set up the Ryzen AI CVML Library, build the included sample applications, and run face detection on a sample video. 
+This playbook teaches you how to set up the Ryzen AI CVML Library, build the included sample applications, and run face detection on a sample image. ## What You'll Learn diff --git a/playbooks/supplemental/cvml/playbook.json b/playbooks/supplemental/cvml/playbook.json index 74aa5fdc..567f6144 100644 --- a/playbooks/supplemental/cvml/playbook.json +++ b/playbooks/supplemental/cvml/playbook.json @@ -1,7 +1,7 @@ { "id": "cvml", - "title": "Local Computer Vision with Ryzen AI NPU", - "description": "Build local perception capabilities using CVML SDK on top of RyzenAI and ROCm", + "title": "Local Computer Vision with AMD Ryzen\u2122 AI NPU", + "description": "Build local perception capabilities using the CVML SDK on top of Ryzen AI and AMD ROCm\u2122 software", "time": 60, "supported_platforms": { "halo": [ diff --git a/playbooks/supplemental/gaia-agents/playbook.json b/playbooks/supplemental/gaia-agents/playbook.json index b0ee94f1..ccad7ba3 100644 --- a/playbooks/supplemental/gaia-agents/playbook.json +++ b/playbooks/supplemental/gaia-agents/playbook.json @@ -1,7 +1,7 @@ { "id": "gaia-agents", "title": "Building Your First Agent with GAIA", - "description": "Build a 100% local AI agent \u2014 no cloud APIs needed. Use the GAIA SDK to create a hardware advisor on your STX Halo", + "description": "Build a 100% local AI agent \u2014 no cloud APIs needed. 
Use the GAIA SDK to create a hardware advisor on your AMD Ryzen\u2122 AI", "time": 20, "supported_platforms": { "halo": [ diff --git a/playbooks/supplemental/llama-factory-finetuning/README.md b/playbooks/supplemental/llama-factory-finetuning/README.md index 109a141e..84d61e32 100644 --- a/playbooks/supplemental/llama-factory-finetuning/README.md +++ b/playbooks/supplemental/llama-factory-finetuning/README.md @@ -136,7 +136,7 @@ print("PASS: Required LLaMA Factory example files exist") These example configuration files have specified model parameters, fine-tuning method parameters, dataset parameters, evaluation parameters, and more. You can configure them according to your own needs. In this playbook, we will use [qwen3_lora_sft.yaml](https://github.com/hiyouga/LlamaFactory/blob/main/examples/train_lora/qwen3_lora_sft.yaml). **Key parameters explained:** -- `model_name_or_path` - HuggingFace Model name or local model file path. +- `model_name_or_path` - Hugging Face model name or local model file path. - `stage` - Training stage. Options: rm (reward modeling), pt (pretrain), sft (Supervised Fine-Tuning), PPO, DPO, KTO, ORPO. - `do_train` - true for training, false for evaluation - `finetuning_type` - Fine-tuning method. Options: freeze, lora, full @@ -193,7 +193,7 @@ llamafactory-cli train examples/train_lora/qwen3_lora_sft_ci.yaml -After running LLM finetuning, all generated outputs are stored in the "output_dir", including model checkpoint files, configuration files, and training metrics. +After running LLM fine-tuning, all generated outputs are stored in the "output_dir", including model checkpoint files, configuration files, and training metrics.

Qwen3 LoRA Fine-tuning @@ -236,7 +236,7 @@ print(f"Found adapter weights: {adapter_weights}") **llamafactory-cli chat** is designed for interactive chat/inference with LLMs (both base models and LoRA-fine-tuned models). LLaMA Factory provides the sample configuration to run inference of fine-tuned models in [examples/inference](https://github.com/hiyouga/LlamaFactory/tree/main/examples/inference). You can also modify this sample configuration to change the settings, such as the inference backend. -Use the following command to test Qwen3 fine-tuned model: +Use the following command to test the Qwen3 fine-tuned model: ```bash llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml @@ -252,7 +252,7 @@ An example chat using the fine-tuned model is shown below: For production use-cases, the pre-trained model and the LoRA adapter need to be merged and exported into a single model. This merged model can be used as a normal Hugging Face model file. LLaMA Factory provides the sample configurations in [examples/merge_lora](https://github.com/hiyouga/LlamaFactory/tree/main/examples/merge_lora). 
-Use the following command to export Qwen3 fine-tuned model: +Use the following command to export the Qwen3 fine-tuned model: ```bash llamafactory-cli export examples/merge_lora/qwen3_lora_sft.yaml diff --git a/playbooks/supplemental/ollama-getting-started/playbook.json b/playbooks/supplemental/ollama-getting-started/playbook.json index 80762d43..2abca581 100644 --- a/playbooks/supplemental/ollama-getting-started/playbook.json +++ b/playbooks/supplemental/ollama-getting-started/playbook.json @@ -1,7 +1,7 @@ { "id": "ollama-getting-started", "title": "Getting Started with Ollama", - "description": "Install Ollama and run LLMs locally \u2014 chat from the terminal, desktop app, or REST API on your STX Halo\u2122", + "description": "Install Ollama and run LLMs locally \u2014 chat from the terminal, desktop app, or REST API on your AMD Ryzen\u2122 AI", "time": 15, "supported_platforms": { "halo": [ diff --git a/playbooks/supplemental/pytorch-kernels/README.md b/playbooks/supplemental/pytorch-kernels/README.md index 01af9ead..1056bb36 100644 --- a/playbooks/supplemental/pytorch-kernels/README.md +++ b/playbooks/supplemental/pytorch-kernels/README.md @@ -9,23 +9,23 @@ SPDX-License-Identifier: MIT > This playbook uses special tags that GitHub cannot render. Please visit [amd.com/playbooks](https://amd.com/playbooks) to correctly preview this content. -# Compile your own GPU kernels for Pytorch+ROCm +# Compile your own GPU kernels for PyTorch + AMD ROCm™ software ## Overview -Write a GPU kernel from scratch, compile it, and launch it on an AMD GPU, then watch utilization spike. This playbook shows how GPU computation actually works: you write the kernel code, and it executes in parallel across thousands of threads. +Write a GPU kernel from scratch, compile it, and launch it on an AMD Radeon™ GPU, then watch utilization spike. This playbook shows how GPU computation actually works: you write the kernel code, and it executes in parallel across thousands of threads.
## What You'll Learn - How GPU kernels work: grids, blocks, threads, and the indexing model that maps them to data -- How AMD's ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification +- How the AMD ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification - How to compile a kernel at runtime using `torch.cuda._compile_kernel` - How to build a native C++ kernel extension with `CUDAExtension` + pybind11, importable from Python - How GPU kernels work: grids, blocks, threads, and the indexing model that maps them to data -- How AMD's ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification +- How the AMD ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification - How to compile a kernel at runtime using `torch.cuda._compile_kernel` - How to build a native C++ kernel extension with `CUDAExtension` + pybind11, importable from Python - How to measure kernel execution time and monitor live GPU utilization with `rocm-smi` @@ -48,7 +48,7 @@ This playbook covers two approaches for kernel development: | **C++ Extension** | `CUDAExtension` + pybind11, compile a `.cu` file into a native `.so` and import it | -Both approaches run on AMD GPUs. This is possible because PyTorch's ROCm build maps the entire CUDA API surface to HIP, `torch.cuda`, `CUDAExtension`, and CUDA kernel syntax all work on AMD hardware transparently. You write CUDA-style code; ROCm handles the translation. +Both approaches run on AMD GPUs. This is possible because PyTorch's ROCm build maps the entire CUDA API surface to HIP. `torch.cuda`, `CUDAExtension`, and CUDA kernel syntax all work on AMD hardware transparently. You write CUDA-style code; ROCm handles the translation. --- @@ -95,7 +95,7 @@ These variables are combined to compute a globally unique thread index: int idx = blockIdx.x * blockDim.x + threadIdx.x; ``` -Total threads = `gridDim.x * blockDim.x`. 
Each thread processes one element independently, this is **data parallelism**. The same operation runs on many elements at once with no inter-thread dependency. +Total threads = `gridDim.x * blockDim.x`. Each thread processes one element independently; this is **data parallelism**. The same operation runs on many elements at once with no inter-thread dependency. --- @@ -122,7 +122,7 @@ PyTorch also exposes `torch.cuda._compile_kernel()`, a high-level shortcut to JI ## Setup ### Prerequisites - Windows -- Install latest: [AMD Adrenalin Software](https://www.amd.com/en/products/software/adrenalin.html) +- Install latest: [AMD Software: Adrenalin Edition™](https://www.amd.com/en/products/software/adrenalin.html) ### Create a Virtual Environment @@ -669,7 +669,7 @@ the CPU immediately continues executing the next instruction without waiting for pip install --no-build-isolation -v . ``` -`CUDAExtension` is a CUDA build helper from `torch.utils.cpp_extension`. On AMD with ROCm, PyTorch **remaps `CUDAExtension` to use `hipcc`** instead of `nvcc`, so the same `setup.py` that would build a CUDA extension on NVIDIA compiles to AMD GPU code without any changes. This is the key mechanism that makes CUDA extension code portable to AMD: PyTorch's ROCm build intercepts the build path and routes it through the HIP compiler. Produces these in the same directory: +`CUDAExtension` is a CUDA build helper from `torch.utils.cpp_extension`. On AMD with ROCm, PyTorch **remaps `CUDAExtension` to use `hipcc`** instead of `nvcc`, so the same `setup.py` that would build a CUDA extension on NVIDIA compiles to AMD GPU code without any changes. This is the key mechanism that makes CUDA extension code portable to AMD: PyTorch's ROCm build intercepts the build path and routes it through the HIP compiler. 
This produces the following in the same directory: - `build/`: directory with the `.pyd` files - `add_one_kernel.hip`: the HIP source generated by hipifying the `.cu` file; this is what `hipcc` actually compiled @@ -848,7 +848,7 @@ Each output element is defined as: $$C[row, col] = \sum_{n=0}^{N-1} A[row, n] \cdot B[n, col]$$ -Each output element is assigned to exactly one thread, and threads don't depend on each other's results, thread `(0,0)` and thread `(1,5)` run simultaneously with no coordination. However, within a single thread the dot product is **sequential**: the `n` loop iterates N times, accumulating one multiply-add per step. +Each output element is assigned to exactly one thread, and threads don't depend on each other's results: thread `(0,0)` and thread `(1,5)` run simultaneously with no coordination. However, within a single thread the dot product is **sequential**: the `n` loop iterates N times, accumulating one multiply-add per step. #### Row-Major Memory Layout @@ -1148,7 +1148,7 @@ $code | & $Python - #### Approach B: C++ Extension -The full manual path: write the kernel and Python binding in a `.cu` file, compile it as a native extension, then import and call it from Python. Mirrors the structure of `add_one_kernel.cu` exactly, only the kernel signature and launcher logic differ. +The full manual path: write the kernel and Python binding in a `.cu` file, compile it as a native extension, then import and call it from Python. This mirrors the structure of `add_one_kernel.cu` exactly; only the kernel signature and launcher logic differ. **Files:** @@ -1216,7 +1216,7 @@ Compared to `add_one_launcher` in Walkthrough 1, the launcher here: pip install --no-build-isolation -v . 
``` -Produces these in the same directory: +This produces the following in the same directory: - `build/`: directory with the `.pyd` files - `matmul_kernel.hip`: the HIP source generated by hipifying the `.cu` file; this is what `hipcc` actually compiled diff --git a/playbooks/supplemental/pytorch-kernels/platform.md b/playbooks/supplemental/pytorch-kernels/platform.md index a9423fa8..89571796 100644 --- a/playbooks/supplemental/pytorch-kernels/platform.md +++ b/playbooks/supplemental/pytorch-kernels/platform.md @@ -5,7 +5,7 @@ This document describes the expected platform configurations for running this pl ## Required Frameworks ## Linux -If you're running on a Halo Box, ROCm and PyTorch are preinstalled. You can validate them by running: +If you're running on the AMD Ryzen™ AI Halo Developer Platform, AMD ROCm™ software and PyTorch are preinstalled. You can validate them by running: ```bash hipcc --version @@ -117,7 +117,7 @@ print("HIP available:", torch.cuda.is_available()) ``` ### Prerequisites -- Install latest: [AMD Adrenalin Software](https://www.amd.com/en/products/software/adrenalin.html) +- Install latest: [AMD Software: Adrenalin Edition™](https://www.amd.com/en/products/software/adrenalin.html) ### Install ROCm Python packages via pip ```bash diff --git a/playbooks/supplemental/pytorch-kernels/playbook.json b/playbooks/supplemental/pytorch-kernels/playbook.json index 3dd03cfd..e31c742c 100644 --- a/playbooks/supplemental/pytorch-kernels/playbook.json +++ b/playbooks/supplemental/pytorch-kernels/playbook.json @@ -1,7 +1,7 @@ { "id": "pytorch-kernels", - "title": "Custom GPU Kernels with PyTorch ROCm", - "description": "Write and optimize custom GPU kernels using PyTorch and ROCm on STX Halo\u2122", + "title": "Custom GPU Kernels with PyTorch and AMD ROCm\u2122", + "description": "Write and optimize custom GPU kernels using PyTorch and AMD ROCm\u2122 software on AMD Ryzen\u2122 AI", "time": 120, "supported_platforms": { "halo": [