8 changes: 8 additions & 0 deletions docs/profiling/INDEX.md
@@ -0,0 +1,8 @@
# Autoresearch run index

One row per profiling run produced by the swordfish-autoresearch chart.
Newest first. PR column links to the draft PR carrying the artifacts.

| timestamp (UTC) | source SHA | shapes | impls | GPU | 8b-b1 marlin TFLOPS | run dir | PR |
|---|---|---|---|---|---|---|---|
| 20260420T010050Z | `9a18569` | voice | fp16,marlin | NVIDIA A100-SXM4-80GB | 0.7 | [`20260420T010050Z/`](./marlin/20260420T010050Z/) | [link](https://github.com/chokevin/swordfish/pull/1) |
6 changes: 6 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/70b-tp2-b1.ncu.csv
@@ -0,0 +1,6 @@
==PROF== Connected to process 1191 (/usr/bin/python3.10)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
{'name': '70b-tp2-b1', 'M': 1, 'N': 8192, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-70b', 'impl': 'fp16', 'ms_mean': 0.3571135997772217, 'ms_p50': 0.3571135997772217, 'ms_p95': 0.3571135997772217, 'ms_min': 0.3571135997772217, 'tflops_mean': 0.1879202137411304, 'error': None, 'speedup_vs_fp16': 1.0}
{'name': '70b-tp2-b1', 'M': 1, 'N': 8192, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-70b', 'impl': 'marlin', 'ms_mean': 0.7097536087036133, 'ms_p50': 0.7097536087036133, 'ms_p95': 0.7097536087036133, 'ms_min': 0.7097536087036133, 'tflops_mean': 0.09455233925837503, 'error': None, 'speedup_vs_fp16': 0.50315150976055}
==PROF== Disconnected from process 1191
==WARNING== No kernels were profiled.
6 changes: 6 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/70b-tp2-b4.ncu.csv
@@ -0,0 +1,6 @@
==PROF== Connected to process 1375 (/usr/bin/python3.10)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
{'name': '70b-tp2-b4', 'M': 4, 'N': 8192, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-70b', 'impl': 'fp16', 'ms_mean': 0.41987199783325196, 'ms_p50': 0.41987199783325196, 'ms_p95': 0.41987199783325196, 'ms_min': 0.41987199783325196, 'tflops_mean': 0.6393268838723712, 'error': None, 'speedup_vs_fp16': 1.0}
{'name': '70b-tp2-b4', 'M': 4, 'N': 8192, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-70b', 'impl': 'marlin', 'ms_mean': 0.6886784076690674, 'ms_p50': 0.6886784076690674, 'ms_p95': 0.6886784076690674, 'ms_min': 0.6886784076690674, 'tflops_mean': 0.3897834649826746, 'error': None, 'speedup_vs_fp16': 0.6096778890663496}
==PROF== Disconnected from process 1375
==WARNING== No kernels were profiled.
6 changes: 6 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/70b-tp2-b8.ncu.csv
@@ -0,0 +1,6 @@
==PROF== Connected to process 1559 (/usr/bin/python3.10)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
{'name': '70b-tp2-b8', 'M': 8, 'N': 8192, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-70b', 'impl': 'fp16', 'ms_mean': 0.43329920768737795, 'ms_p50': 0.43329920768737795, 'ms_p95': 0.43329920768737795, 'ms_min': 0.43329920768737795, 'tflops_mean': 1.2390304493410205, 'error': None, 'speedup_vs_fp16': 1.0}
{'name': '70b-tp2-b8', 'M': 8, 'N': 8192, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-70b', 'impl': 'marlin', 'ms_mean': 0.7029119968414307, 'ms_p50': 0.7029119968414307, 'ms_p95': 0.7029119968414307, 'ms_min': 0.7029119968414307, 'tflops_mean': 0.7637811197026877, 'error': None, 'speedup_vs_fp16': 0.6164345033722984}
==PROF== Disconnected from process 1559
==WARNING== No kernels were profiled.
6 changes: 6 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/8b-b1.ncu.csv
@@ -0,0 +1,6 @@
==PROF== Connected to process 639 (/usr/bin/python3.10)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
{'name': '8b-b1', 'M': 1, 'N': 4096, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-8b', 'impl': 'fp16', 'ms_mean': 0.36798079013824464, 'ms_p50': 0.36798079013824464, 'ms_p95': 0.36798079013824464, 'ms_min': 0.36798079013824464, 'tflops_mean': 0.09118528167569324, 'error': None, 'speedup_vs_fp16': 1.0}
{'name': '8b-b1', 'M': 1, 'N': 4096, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-8b', 'impl': 'marlin', 'ms_mean': 0.665228796005249, 'ms_p50': 0.665228796005249, 'ms_p95': 0.665228796005249, 'ms_min': 0.665228796005249, 'tflops_mean': 0.05044043823944031, 'error': None, 'speedup_vs_fp16': 0.5531642531832628}
==PROF== Disconnected from process 639
==WARNING== No kernels were profiled.
6 changes: 6 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/8b-b4.ncu.csv
@@ -0,0 +1,6 @@
==PROF== Connected to process 823 (/usr/bin/python3.10)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
{'name': '8b-b4', 'M': 4, 'N': 4096, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-8b', 'impl': 'fp16', 'ms_mean': 0.4070079803466797, 'ms_p50': 0.4070079803466797, 'ms_p95': 0.4070079803466797, 'ms_min': 0.4070079803466797, 'tflops_mean': 0.3297668214900272, 'error': None, 'speedup_vs_fp16': 1.0}
{'name': '8b-b4', 'M': 4, 'N': 4096, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-8b', 'impl': 'marlin', 'ms_mean': 0.7766079902648926, 'ms_p50': 0.7766079902648926, 'ms_p95': 0.7766079902648926, 'ms_min': 0.7766079902648926, 'tflops_mean': 0.17282558212441232, 'error': None, 'speedup_vs_fp16': 0.5240842039338968}
==PROF== Disconnected from process 823
==WARNING== No kernels were profiled.
6 changes: 6 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/8b-b8.ncu.csv
@@ -0,0 +1,6 @@
==PROF== Connected to process 1007 (/usr/bin/python3.10)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
{'name': '8b-b8', 'M': 8, 'N': 4096, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-8b', 'impl': 'fp16', 'ms_mean': 0.4288832187652588, 'ms_p50': 0.4288832187652588, 'ms_p95': 0.4288832187652588, 'ms_min': 0.4288832187652588, 'tflops_mean': 0.625894052867858, 'error': None, 'speedup_vs_fp16': 1.0}
{'name': '8b-b8', 'M': 8, 'N': 4096, 'K': 4096, 'group_size': 128, 'priority': 0, 'tag': 'llama-3-8b', 'impl': 'marlin', 'ms_mean': 0.6776383876800537, 'ms_p50': 0.6776383876800537, 'ms_p95': 0.6776383876800537, 'ms_min': 0.6776383876800537, 'tflops_mean': 0.39613378002242333, 'error': None, 'speedup_vs_fp16': 0.6329086819204163}
==PROF== Disconnected from process 1007
==WARNING== No kernels were profiled.
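Review note: the six `.ncu.csv` files above contain Nsight Compute console output, not counter data. Each capture reports `ERR_NVGPUCTRPERM` and ends with `No kernels were profiled.` Below is a minimal sketch of a check that could flag such captures before they are committed. The function name `ncu_capture_problems` is hypothetical and not part of the swordfish harness. The marker prefixes are copied from the files above, and the sample error line is shortened.

```python
def ncu_capture_problems(text: str) -> list[str]:
    """Return the ==ERROR== / ==WARNING== lines found in captured ncu output."""
    problems = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("==ERROR==") or stripped.startswith("==WARNING=="):
            problems.append(stripped)
    return problems


# Sample shaped like the 8b-b1 capture in this PR; the benchmark rows are
# omitted and the error message is shortened for illustration.
sample = "\n".join([
    "==PROF== Connected to process 639 (/usr/bin/python3.10)",
    "==ERROR== ERR_NVGPUCTRPERM - The user does not have permission",
    "==PROF== Disconnected from process 639",
    "==WARNING== No kernels were profiled.",
])

problems = ncu_capture_problems(sample)
for problem in problems:
    print(problem)
```

A non-empty result means the file holds no usable profile and probably should not carry a `.csv` extension.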
26 changes: 26 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/SUMMARY.md
@@ -0,0 +1,26 @@
# Autoresearch run `20260420T010050Z`

- **source SHA:** `9a18569`
- **GPU:** NVIDIA A100-SXM4-80GB (cc 8.0, 79.3 GB)
- **CUDA / torch / triton:** 12.4 / 2.4.0a0+07cecf4168.nv24.05 / 3.0.0
- **shapes:** `voice` **impls:** `fp16,marlin` **repeats:** 5
- **marlin SHA:** `1f25790bdd49fba53106164a24666dade68d7c90`

## Results

| shape | impl | ms_mean | ms_p95 | TFLOPS | speedup vs fp16 | error |
|---|---|---|---|---|---|---|
| 8b-b1 | fp16 | 0.031 | 0.033 | 1.1 | x1.00 | |
| 8b-b1 | marlin | 0.049 | 0.050 | 0.7 | x0.64 | |
| 8b-b4 | fp16 | 0.031 | 0.032 | 4.3 | x1.00 | |
| 8b-b4 | marlin | 0.049 | 0.050 | 2.7 | x0.63 | |
| 8b-b8 | fp16 | 0.031 | 0.032 | 8.6 | x1.00 | |
| 8b-b8 | marlin | 0.050 | 0.052 | 5.4 | x0.62 | |
| 70b-tp2-b1 | fp16 | 0.050 | 0.055 | 1.3 | x1.00 | |
| 70b-tp2-b1 | marlin | 0.050 | 0.050 | 1.4 | x1.01 | |
| 70b-tp2-b4 | fp16 | 0.049 | 0.049 | 5.5 | x1.00 | |
| 70b-tp2-b4 | marlin | 0.049 | 0.050 | 5.4 | x0.99 | |
| 70b-tp2-b8 | fp16 | 0.049 | 0.049 | 10.9 | x1.00 | |
| 70b-tp2-b8 | marlin | 0.049 | 0.049 | 10.9 | x1.00 | |

![roofline](./roofline.png)
43 changes: 43 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/env.txt
@@ -0,0 +1,43 @@
=== profile_marlin.sh @ 20260420T010050Z ===
--- host ---
Linux swordfish-profile-sf-prof-260420-005203-98ckx 6.6.126.1-1.azl3 #1 SMP PREEMPT_DYNAMIC Wed Mar 4 05:04:40 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
--- nvidia-smi ---
Mon Apr 20 01:00:50 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000001:00:00.0 Off | 0 |
| N/A 38C P0 67W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
--- nvcc ---
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
--- nsys ---
NVIDIA Nsight Systems version 2024.2.1.106-242134037904v0
--- ncu ---
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.1.1.0 (build 33998838) (public-release)
--- python / torch / triton / marlin ---
python 3.10.12
torch 2.4.0a0+07cecf4168.nv24.05 cuda 12.4
triton 3.0.0
marlin unknown
--- repo SHA ---
9a185695e6c7089adc44924cd456989316502d0b
223 changes: 223 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/manifest.json
@@ -0,0 +1,223 @@
{
"env": {
"host": "swordfish-profile-sf-prof-260420-005203-98ckx",
"os": "Linux 6.6.126.1-1.azl3",
"python": "3.10.12",
"torch": "2.4.0a0+07cecf4168.nv24.05",
"cuda_available": true,
"timestamp": "2026-04-20T01:00:56+0000",
"gpu_name": "NVIDIA A100-SXM4-80GB",
"gpu_cc": "8.0",
"gpu_mem_gb": 79.3,
"gpu_sm_count": 108,
"torch_cuda": "12.4",
"cudnn": 90100,
"triton": "3.0.0"
},
"rows": [
{
"name": "8b-b1",
"M": 1,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "fp16",
"ms_mean": 0.031485823631286616,
"ms_p50": 0.03128191947937012,
"ms_p95": 0.03333247900009155,
"ms_min": 0.030536320209503174,
"tflops_mean": 1.065699674651606,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "8b-b1",
"M": 1,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "marlin",
"ms_mean": 0.04932159900665283,
"ms_p50": 0.04929855823516846,
"ms_p95": 0.050101118087768556,
"ms_min": 0.04868607997894287,
"tflops_mean": 0.6803192247573715,
"error": null,
"speedup_vs_fp16": 0.6383779979850125
},
{
"name": "8b-b4",
"M": 4,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "fp16",
"ms_mean": 0.031109248161315918,
"ms_p50": 0.03108288049697876,
"ms_p95": 0.031852159500122074,
"ms_min": 0.030641920566558838,
"tflops_mean": 4.3143996056741285,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "8b-b4",
"M": 4,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "marlin",
"ms_mean": 0.049413503646850584,
"ms_p50": 0.04933504104614258,
"ms_p95": 0.04998335838317871,
"ms_min": 0.04883967876434326,
"tflops_mean": 2.7162155705296662,
"error": null,
"speedup_vs_fp16": 0.6295697707179017
},
{
"name": "8b-b8",
"M": 8,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "fp16",
"ms_mean": 0.031138304233551024,
"ms_p50": 0.031077120304107666,
"ms_p95": 0.03157824039459228,
"ms_min": 0.030805120468139647,
"tflops_mean": 8.620747423707328,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "8b-b8",
"M": 8,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "marlin",
"ms_mean": 0.04992921543121338,
"ms_p50": 0.04937983989715576,
"ms_p95": 0.05245567798614502,
"ms_min": 0.048923521041870116,
"tflops_mean": 5.376320330324815,
"error": null,
"speedup_vs_fp16": 0.6236489791522907
},
{
"name": "70b-tp2-b1",
"M": 1,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "fp16",
"ms_mean": 0.0499359998703003,
"ms_p50": 0.04868544101715088,
"ms_p95": 0.05492608070373535,
"ms_min": 0.048542718887329105,
"tflops_mean": 1.3438974722505426,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "70b-tp2-b1",
"M": 1,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "marlin",
"ms_mean": 0.04960268688201904,
"ms_p50": 0.049536638259887696,
"ms_p95": 0.05030464172363281,
"ms_min": 0.04909311771392822,
"tflops_mean": 1.3529280008484166,
"error": null,
"speedup_vs_fp16": 1.006719655914488
},
{
"name": "70b-tp2-b4",
"M": 4,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "fp16",
"ms_mean": 0.04888806343078613,
"ms_p50": 0.048923521041870116,
"ms_p95": 0.04904128074645996,
"ms_min": 0.04871103763580322,
"tflops_mean": 5.490817945366986,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "70b-tp2-b4",
"M": 4,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "marlin",
"ms_mean": 0.04930214405059814,
"ms_p50": 0.04949632167816162,
"ms_p95": 0.04953023910522461,
"ms_min": 0.048799362182617184,
"tflops_mean": 5.4447014662183495,
"error": null,
"speedup_vs_fp16": 0.9916011640510594
},
{
"name": "70b-tp2-b8",
"M": 8,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "fp16",
"ms_mean": 0.049118206977844234,
"ms_p50": 0.049133439064025876,
"ms_p95": 0.04922239780426026,
"ms_min": 0.048924798965454104,
"tflops_mean": 10.93018139367682,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "70b-tp2-b8",
"M": 8,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "marlin",
"ms_mean": 0.04914662361145019,
"ms_p50": 0.04927743911743164,
"ms_p95": 0.04943679809570312,
"ms_min": 0.048578557968139646,
"tflops_mean": 10.923861550377588,
"error": null,
"speedup_vs_fp16": 0.9994217988639339
}
]
}
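Review note: the rows in `SUMMARY.md` look like rounded views of the `rows` entries in this manifest. Below is a minimal sketch of that mapping, assuming three decimals for the millisecond columns, one decimal for TFLOPS, and an `x`-prefixed two-decimal speedup. The helper `summary_cells` is hypothetical, and the chart's actual renderer may differ.

```python
import json

# One entry from the "rows" array of manifest.json above (8b-b1, marlin).
ROW_JSON = """
{"name": "8b-b1", "M": 1, "N": 4096, "K": 4096, "group_size": 128,
 "priority": 0, "tag": "llama-3-8b", "impl": "marlin",
 "ms_mean": 0.04932159900665283, "ms_p50": 0.04929855823516846,
 "ms_p95": 0.050101118087768556, "ms_min": 0.04868607997894287,
 "tflops_mean": 0.6803192247573715, "error": null,
 "speedup_vs_fp16": 0.6383779979850125}
"""


def summary_cells(row: dict) -> list[str]:
    """Format one manifest row as the cells of a SUMMARY.md table row."""
    return [
        row["name"],
        row["impl"],
        f"{row['ms_mean']:.3f}",
        f"{row['ms_p95']:.3f}",
        f"{row['tflops_mean']:.1f}",
        f"x{row['speedup_vs_fp16']:.2f}",
        row["error"] or "",  # JSON null becomes an empty cell
    ]


cells = summary_cells(json.loads(ROW_JSON))
print("| " + " | ".join(cells) + " |")
```

For this row the cells agree with the `8b-b1 | marlin` line in the Results table.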
13 changes: 13 additions & 0 deletions docs/profiling/marlin/20260420T010050Z/results.csv
@@ -0,0 +1,13 @@
name,impl,M,N,K,group_size,priority,tag,ms_mean,ms_p50,ms_p95,ms_min,tflops_mean,speedup_vs_fp16,error
8b-b1,fp16,1,4096,4096,128,0,llama-3-8b,0.031485823631286616,0.03128191947937012,0.03333247900009155,0.030536320209503174,1.065699674651606,1.0,
8b-b1,marlin,1,4096,4096,128,0,llama-3-8b,0.04932159900665283,0.04929855823516846,0.050101118087768556,0.04868607997894287,0.6803192247573715,0.6383779979850125,
8b-b4,fp16,4,4096,4096,128,0,llama-3-8b,0.031109248161315918,0.03108288049697876,0.031852159500122074,0.030641920566558838,4.3143996056741285,1.0,
8b-b4,marlin,4,4096,4096,128,0,llama-3-8b,0.049413503646850584,0.04933504104614258,0.04998335838317871,0.04883967876434326,2.7162155705296662,0.6295697707179017,
8b-b8,fp16,8,4096,4096,128,0,llama-3-8b,0.031138304233551024,0.031077120304107666,0.03157824039459228,0.030805120468139647,8.620747423707328,1.0,
8b-b8,marlin,8,4096,4096,128,0,llama-3-8b,0.04992921543121338,0.04937983989715576,0.05245567798614502,0.048923521041870116,5.376320330324815,0.6236489791522907,
70b-tp2-b1,fp16,1,8192,4096,128,0,llama-3-70b,0.0499359998703003,0.04868544101715088,0.05492608070373535,0.048542718887329105,1.3438974722505426,1.0,
70b-tp2-b1,marlin,1,8192,4096,128,0,llama-3-70b,0.04960268688201904,0.049536638259887696,0.05030464172363281,0.04909311771392822,1.3529280008484166,1.006719655914488,
70b-tp2-b4,fp16,4,8192,4096,128,0,llama-3-70b,0.04888806343078613,0.048923521041870116,0.04904128074645996,0.04871103763580322,5.490817945366986,1.0,
70b-tp2-b4,marlin,4,8192,4096,128,0,llama-3-70b,0.04930214405059814,0.04949632167816162,0.04953023910522461,0.048799362182617184,5.4447014662183495,0.9916011640510594,
70b-tp2-b8,fp16,8,8192,4096,128,0,llama-3-70b,0.049118206977844234,0.049133439064025876,0.04922239780426026,0.048924798965454104,10.93018139367682,1.0,
70b-tp2-b8,marlin,8,8192,4096,128,0,llama-3-70b,0.04914662361145019,0.04927743911743164,0.04943679809570312,0.048578557968139646,10.923861550377588,0.9994217988639339,
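Review note: the derived columns can be sanity-checked from the raw timings. The sketch below assumes `tflops_mean` counts 2·M·N·K floating-point operations per GEMM and that `speedup_vs_fp16` is the fp16 `ms_mean` divided by the implementation's `ms_mean`. Both are assumptions that reproduce the two sample rows, and the harness may compute them differently. The helper `gemm_tflops` is hypothetical.

```python
import csv
import io

# Header plus two rows copied from results.csv above (8b-b1, fp16 and marlin).
RESULTS_CSV = """\
name,impl,M,N,K,group_size,priority,tag,ms_mean,ms_p50,ms_p95,ms_min,tflops_mean,speedup_vs_fp16,error
8b-b1,fp16,1,4096,4096,128,0,llama-3-8b,0.031485823631286616,0.03128191947937012,0.03333247900009155,0.030536320209503174,1.065699674651606,1.0,
8b-b1,marlin,1,4096,4096,128,0,llama-3-8b,0.04932159900665283,0.04929855823516846,0.050101118087768556,0.04868607997894287,0.6803192247573715,0.6383779979850125,
"""


def gemm_tflops(m: int, n: int, k: int, ms: float) -> float:
    """TFLOPS for an (M x K) @ (K x N) GEMM, counting 2*M*N*K FLOPs."""
    return 2 * m * n * k / (ms * 1e-3) / 1e12


rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))

# Baseline latency per shape, taken from the fp16 rows.
fp16_ms = {r["name"]: float(r["ms_mean"]) for r in rows if r["impl"] == "fp16"}

recomputed = []
for r in rows:
    ms = float(r["ms_mean"])
    tflops = gemm_tflops(int(r["M"]), int(r["N"]), int(r["K"]), ms)
    speedup = fp16_ms[r["name"]] / ms
    recomputed.append((r["name"], r["impl"], tflops, speedup))
    print(f"{r['name']} {r['impl']}: {tflops:.4f} TFLOPS, x{speedup:.4f}")
```

Comparing each recomputed pair with the `tflops_mean` and `speedup_vs_fp16` columns gives a quick consistency check on the CSV.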