6 changes: 3 additions & 3 deletions playbooks/supplemental/cvml/README.md
Original file line number Diff line number Diff line change
@@ -9,13 +9,13 @@ SPDX-License-Identifier: MIT
> This playbook uses special tags that GitHub cannot render. Please visit [amd.com/playbooks](https://amd.com/playbooks) to correctly preview this content.
<!-- @github-only:end -->

# Local Computer Vision with Ryzen AI NPU
# Local Computer Vision with AMD Ryzen AI NPU

## Overview

The [Ryzen AI CVML Library](https://ryzenai.docs.amd.com/en/latest/ryzen_ai_libraries.html#ryzen-ai-cvml-library) is AMD's C++ computer vision and machine learning toolkit that provides powerful, on-device perception capabilities — including depth estimation, face detection, and face mesh tracking. Built on top of the Ryzen AI drivers, the library automatically selects the best available hardware (GPU or NPU) for inference, letting you add AI features to C++ applications without worrying about model training or framework integration. All processing happens locally on your system, making it ideal for privacy-sensitive, low-latency applications.
The [Ryzen AI CVML Library](https://ryzenai.docs.amd.com/en/latest/ryzen_ai_libraries.html#ryzen-ai-cvml-library) is an AMD C++ computer vision and machine learning toolkit that provides powerful, on-device perception capabilities — including depth estimation, face detection, and face mesh tracking. Built on top of the Ryzen AI drivers, the library automatically selects the best available hardware (GPU or NPU) for inference, letting you add AI features to C++ applications without worrying about model training or framework integration. All processing happens locally on your system, making it ideal for privacy-sensitive, low-latency applications.

This playbook teaches you how to set up the Ryzen AI CVML Library, build the included sample applications, and run face detection on a sample video.
This playbook teaches you how to set up the Ryzen AI CVML Library, build the included sample applications, and run face detection on a sample image.

## What You'll Learn

4 changes: 2 additions & 2 deletions playbooks/supplemental/cvml/playbook.json
@@ -1,7 +1,7 @@
{
"id": "cvml",
"title": "Local Computer Vision with Ryzen AI NPU",
"description": "Build local perception capabilities using CVML SDK on top of RyzenAI and ROCm",
"title": "Local Computer Vision with AMD Ryzen\u2122 AI NPU",
"description": "Build local perception capabilities using the CVML SDK on top of Ryzen AI and AMD ROCm\u2122 software",
"time": 60,
"supported_platforms": {
"halo": [
2 changes: 1 addition & 1 deletion playbooks/supplemental/gaia-agents/playbook.json
@@ -1,7 +1,7 @@
{
"id": "gaia-agents",
"title": "Building Your First Agent with GAIA",
"description": "Build a 100% local AI agent \u2014 no cloud APIs needed. Use the GAIA SDK to create a hardware advisor on your STX Halo",
"description": "Build a 100% local AI agent \u2014 no cloud APIs needed. Use the GAIA SDK to create a hardware advisor on your AMD Ryzen\u2122 AI",
"time": 20,
"supported_platforms": {
"halo": [
8 changes: 4 additions & 4 deletions playbooks/supplemental/llama-factory-finetuning/README.md
@@ -136,7 +136,7 @@ print("PASS: Required LLaMA Factory example files exist")
These example configuration files have specified model parameters, fine-tuning method parameters, dataset parameters, evaluation parameters, and more. You can configure them according to your own needs. In this playbook, we will use [qwen3_lora_sft.yaml](https://github.com/hiyouga/LlamaFactory/blob/main/examples/train_lora/qwen3_lora_sft.yaml).

**Key parameters explained:**
- `model_name_or_path` - HuggingFace Model name or local model file path.
- `model_name_or_path` - Hugging Face model name or local model file path.
- `stage` - Training stage. Options: rm (reward modeling), pt (pretrain), sft (Supervised Fine-Tuning), PPO, DPO, KTO, ORPO.
- `do_train` - true for training, false for evaluation
- `finetuning_type` - Fine-tuning method. Options: freeze, lora, full
@@ -193,7 +193,7 @@ llamafactory-cli train examples/train_lora/qwen3_lora_sft_ci.yaml
<!-- @test:end -->
<!-- @os:end -->

After running LLM finetuning, all generated outputs are stored in the "output_dir", including model checkpoint files, configuration files, and training metrics.
After running LLM fine-tuning, all generated outputs are stored in the "output_dir", including model checkpoint files, configuration files, and training metrics.

<p align="center">
<img src="assets/qwen3_lora.png" alt="Qwen3 LoRA Fine-tuning" width="600"/>
@@ -236,7 +236,7 @@ print(f"Found adapter weights: {adapter_weights}")

**llamafactory-cli chat** is designed for interactive chat/inference with LLMs (both base models and LoRA-fine-tuned models). LLaMA Factory provides the sample configuration to run inference of fine-tuned models in [examples/inference](https://github.com/hiyouga/LlamaFactory/tree/main/examples/inference). You can also modify this sample configuration to change the settings, such as the inference backend.

Use the following command to test Qwen3 fine-tuned model:
Use the following command to test the Qwen3 fine-tuned model:

```bash
llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml
@@ -252,7 +252,7 @@ An example chat using the fine-tuned model is shown below:

For production use-cases, the pre-trained model and the LoRA adapter need to be merged and exported into a single model. This merged model can be used as a normal Hugging Face model file. LLaMA Factory provides the sample configurations in [examples/merge_lora](https://github.com/hiyouga/LlamaFactory/tree/main/examples/merge_lora).

Use the following command to export Qwen3 fine-tuned model:
Use the following command to export the Qwen3 fine-tuned model:

```bash
llamafactory-cli export examples/merge_lora/qwen3_lora_sft.yaml
@@ -1,7 +1,7 @@
{
"id": "ollama-getting-started",
"title": "Getting Started with Ollama",
"description": "Install Ollama and run LLMs locally \u2014 chat from the terminal, desktop app, or REST API on your STX Halo\u2122",
"description": "Install Ollama and run LLMs locally \u2014 chat from the terminal, desktop app, or REST API on your AMD Ryzen\u2122 AI",
"time": 15,
"supported_platforms": {
"halo": [
22 changes: 11 additions & 11 deletions playbooks/supplemental/pytorch-kernels/README.md
@@ -9,23 +9,23 @@ SPDX-License-Identifier: MIT
> This playbook uses special tags that GitHub cannot render. Please visit [amd.com/playbooks](https://amd.com/playbooks) to correctly preview this content.
<!-- @github-only:end -->

# Compile your own GPU kernels for Pytorch+ROCm
# Compile your own GPU kernels for PyTorch + AMD ROCm™ Software

## Overview

Write a GPU kernel from scratch, compile it, and launch it on an AMD GPU, then watch utilization spike. This playbook shows how GPU computation actually works: you write the kernel code, and it executes in parallel across thousands of threads.
Write a GPU kernel from scratch, compile it, and launch it on an AMD Radeon™ GPU, then watch utilization spike. This playbook shows how GPU computation actually works: you write the kernel code, and it executes in parallel across thousands of threads.

## What You'll Learn

<!-- @os:windows -->
- How GPU kernels work: grids, blocks, threads, and the indexing model that maps them to data
- How AMD's ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification
- How the AMD ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification
- How to compile a kernel at runtime using `torch.cuda._compile_kernel`
- How to build a native C++ kernel extension with `CUDAExtension` + pybind11, importable from Python
<!-- @os:end -->
<!-- @os:linux -->
- How GPU kernels work: grids, blocks, threads, and the indexing model that maps them to data
- How AMD's ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification
- How the AMD ROCm/HIP stack lets you write CUDA-style code that runs on AMD GPUs without modification
- How to compile a kernel at runtime using `torch.cuda._compile_kernel`
- How to build a native C++ kernel extension with `CUDAExtension` + pybind11, importable from Python
- How to measure kernel execution time and monitor live GPU utilization with `rocm-smi`
@@ -48,7 +48,7 @@ This playbook covers two approaches for kernel development:
| **C++ Extension** | `CUDAExtension` + pybind11, compile a `.cu` file into a native `.so` and import it |
<!-- @os:end -->

Both approaches run on AMD GPUs. This is possible because PyTorch's ROCm build maps the entire CUDA API surface to HIP, `torch.cuda`, `CUDAExtension`, and CUDA kernel syntax all work on AMD hardware transparently. You write CUDA-style code; ROCm handles the translation.
Both approaches run on AMD GPUs. This is possible because PyTorch's ROCm build maps the entire CUDA API surface to HIP. `torch.cuda`, `CUDAExtension`, and CUDA kernel syntax all work on AMD hardware transparently. You write CUDA-style code; ROCm handles the translation.
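One way to see this mapping from Python is to check which backend the installed PyTorch build targets. The sketch below is illustrative only: `describe_gpu_backend` is a hypothetical helper name, and it relies on `torch.version.hip` being a version string on ROCm builds and `None` otherwise.

```python
def describe_gpu_backend():
    """Report which GPU backend the installed PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip_version = getattr(torch.version, "hip", None)
    backend = f"ROCm/HIP {hip_version}" if hip_version else "CUDA or CPU-only"

    # On a ROCm build, this torch.cuda call is served by HIP under the hood.
    return f"{backend} build, GPU available: {torch.cuda.is_available()}"


print(describe_gpu_backend())
```

On a ROCm build the same `torch.cuda` namespace is used throughout, which is why the rest of the playbook never needs an AMD-specific Python API.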

---

@@ -95,7 +95,7 @@ These variables are combined to compute a globally unique thread index:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
```

Total threads = `gridDim.x * blockDim.x`. Each thread processes one element independently, this is **data parallelism**. The same operation runs on many elements at once with no inter-thread dependency.
Total threads = `gridDim.x * blockDim.x`. Each thread processes one element independently; this is **data parallelism**. The same operation runs on many elements at once with no inter-thread dependency.
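The indexing formula can be checked on the CPU with a small Python simulation. This is a sketch under stated assumptions, not the playbook's kernel: the function names are hypothetical, the "threads" run sequentially in a loop, and the `idx < n` bounds guard is a common convention assumed here for the case where the last block has spare threads.

```python
def global_thread_indices(grid_dim, block_dim):
    """Enumerate idx = blockIdx.x * blockDim.x + threadIdx.x for a 1D launch."""
    return [
        block_idx * block_dim + thread_idx
        for block_idx in range(grid_dim)
        for thread_idx in range(block_dim)
    ]


def add_one_simulated(data, block_dim=4):
    """Apply the same operation once per simulated thread."""
    n = len(data)
    grid_dim = (n + block_dim - 1) // block_dim  # ceiling division: enough blocks to cover n
    out = list(data)
    for idx in global_thread_indices(grid_dim, block_dim):
        if idx < n:  # assumed guard: spare threads in the last block do nothing
            out[idx] = data[idx] + 1
    return out


indices = global_thread_indices(grid_dim=3, block_dim=4)
print(indices)                                  # [0, 1, 2, ..., 11]: 12 unique indices
print(add_one_simulated([10, 20, 30, 40, 50]))  # [11, 21, 31, 41, 51]
```

Because every index appears exactly once and no iteration reads another iteration's output, the loop body could run in any order, which is the property that lets a GPU execute it in parallel.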

---

@@ -122,7 +122,7 @@ PyTorch also exposes `torch.cuda._compile_kernel()`, a high-level shortcut to JI
## Setup
<!-- @os:windows -->
### Prerequisites - Windows
- Install latest: [AMD Adrenalin Software](https://www.amd.com/en/products/software/adrenalin.html)
- Install latest: [AMD Software: Adrenalin Edition™](https://www.amd.com/en/products/software/adrenalin.html)
<!-- @os:end -->

### Create a Virtual Environment
@@ -669,7 +669,7 @@ the CPU immediately continues executing the next instruction without waiting for
pip install --no-build-isolation -v .
```

`CUDAExtension` is a CUDA build helper from `torch.utils.cpp_extension`. On AMD with ROCm, PyTorch **remaps `CUDAExtension` to use `hipcc`** instead of `nvcc`, so the same `setup.py` that would build a CUDA extension on NVIDIA compiles to AMD GPU code without any changes. This is the key mechanism that makes CUDA extension code portable to AMD: PyTorch's ROCm build intercepts the build path and routes it through the HIP compiler. Produces these in the same directory:
`CUDAExtension` is a CUDA build helper from `torch.utils.cpp_extension`. On AMD with ROCm, PyTorch **remaps `CUDAExtension` to use `hipcc`** instead of `nvcc`, so the same `setup.py` that would build a CUDA extension on NVIDIA compiles to AMD GPU code without any changes. This is the key mechanism that makes CUDA extension code portable to AMD: PyTorch's ROCm build intercepts the build path and routes it through the HIP compiler. This produces the following in the same directory:
<!-- @os:windows -->
- `build/`: directory with the `.pyd` files
- `add_one_kernel.hip`: the HIP source generated by hipifying the `.cu` file; this is what `hipcc` actually compiled
@@ -848,7 +848,7 @@ Each output element is defined as:

$$C[row, col] = \sum_{n=0}^{N-1} A[row, n] \cdot B[n, col]$$

Each output element is assigned to exactly one thread, and threads don't depend on each other's results, thread `(0,0)` and thread `(1,5)` run simultaneously with no coordination. However, within a single thread the dot product is **sequential**: the `n` loop iterates N times, accumulating one multiply-add per step.
Each output element is assigned to exactly one thread, and threads don't depend on each other's results: thread `(0,0)` and thread `(1,5)` run simultaneously with no coordination. However, within a single thread the dot product is **sequential**: the `n` loop iterates N times, accumulating one multiply-add per step.
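The split between parallel threads and the sequential inner loop can be sketched in plain Python. This is a CPU illustration with hypothetical function names, assuming nested lists for the matrices rather than the flat GPU buffers the kernel uses: `matmul_one_thread` plays the role of a single thread, and the outer comprehension stands in for the grid of independent threads.

```python
def matmul_one_thread(A, B, row, col):
    """Work done by the single thread that owns output element C[row, col]."""
    N = len(B)  # shared inner dimension
    acc = 0.0
    for n in range(N):  # sequential: one multiply-add per step
        acc += A[row][n] * B[n][col]
    return acc


def matmul_simulated(A, B):
    """One simulated thread per output element; no thread reads another's result."""
    rows, cols = len(A), len(B[0])
    return [
        [matmul_one_thread(A, B, r, c) for c in range(cols)]
        for r in range(rows)
    ]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_simulated(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

Each call to `matmul_one_thread` depends only on one row of `A` and one column of `B`, so the calls are independent of each other even though the loop inside each call is not.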

#### Row-Major Memory Layout

@@ -1148,7 +1148,7 @@ $code | & $Python -

#### Approach B: C++ Extension

The full manual path: write the kernel and Python binding in a `.cu` file, compile it as a native extension, then import and call it from Python. Mirrors the structure of `add_one_kernel.cu` exactly, only the kernel signature and launcher logic differ.
The full manual path: write the kernel and Python binding in a `.cu` file, compile it as a native extension, then import and call it from Python. This mirrors the structure of `add_one_kernel.cu` exactly; only the kernel signature and launcher logic differ.

**Files:**
<!-- @os:windows -->
@@ -1216,7 +1216,7 @@ Compared to `add_one_launcher` in Walkthrough 1, the launcher here:
pip install --no-build-isolation -v .
```

Produces these in the same directory:
This produces the following in the same directory:
<!-- @os:windows -->
- `build/`: directory with the `.pyd` files
- `matmul_kernel.hip`: the HIP source generated by hipifying the `.cu` file; this is what `hipcc` actually compiled
4 changes: 2 additions & 2 deletions playbooks/supplemental/pytorch-kernels/platform.md
@@ -5,7 +5,7 @@ This document describes the expected platform configurations for running this pl
## Required Frameworks
## Linux

If you're running on a Halo Box, ROCm and PyTorch are preinstalled. You can validate them by running:
If you're running on AMD Ryzen™ AI Halo Developer Platform, AMD ROCm™ software and PyTorch are preinstalled. You can validate them by running:

```bash
hipcc --version
@@ -117,7 +117,7 @@ print("HIP available:", torch.cuda.is_available())
```

### Prerequisites
- Install latest: [AMD Adrenalin Software](https://www.amd.com/en/products/software/adrenalin.html)
- Install latest: [AMD Software: Adrenalin Edition™](https://www.amd.com/en/products/software/adrenalin.html)

### Install ROCm Python packages via pip
```bash
4 changes: 2 additions & 2 deletions playbooks/supplemental/pytorch-kernels/playbook.json
@@ -1,7 +1,7 @@
{
"id": "pytorch-kernels",
"title": "Custom GPU Kernels with PyTorch ROCm",
"description": "Write and optimize custom GPU kernels using PyTorch and ROCm on STX Halo\u2122",
"title": "Custom GPU Kernels with PyTorch and AMD ROCm\u2122",
"description": "Write and optimize custom GPU kernels using PyTorch and AMD ROCm\u2122 software on AMD Ryzen\u2122 AI",
"time": 120,
"supported_platforms": {
"halo": [