
[CuTeDSL] Add narrow SM120 NVFP4 79a-style warp MMA/TMA path #3239

Open
alecco wants to merge 13 commits into NVIDIA:main from alecco:sm120-blockscaled-nvfp4

Conversation


alecco commented May 15, 2026

This PR adds the CuTe DSL SM120 MXF4NVF4/NVFP4 helper surface needed to build the Blackwell GeForce warp-MMA/TMA mainloop path from Python.

This is stacked on #3204, so this branch includes those 4 commits. Reviewers should focus on the 9 SM120 NVFP4 changes on top of that base.

This PR supersedes #3185 and #3189.

The immediate downstream motivation is to unblock SM120 NVFP4 support for QuACK by Dao AI Lab.

This PR also contains a temporary workaround for #3236 that was found while working on this.

Supported path

This PR targets the narrow SM120 NVFP4 blockscaled path used by the Blackwell GeForce 79a-style mainloop:

  • A/B: Float4E2M1FN
  • SFA/SFB: Float8E4M3FN
  • scale vector size: sf_vec_size=16
  • accumulator: Float32
  • output epilogue: BF16 D
  • CTA tile: tile_shape_mnk=(128, 128, 128)
  • cluster shape: (1, 1, 1)

The implementation uses external-descriptor TMA for A/B/SFA/SFB. A/B TMA writes the physical ALIGN16B SMEM layout, then CuTe DSL stages that data into the 79a-style consumer SMEM layout. Scale tensors use a rank-4 TMA physical layout and CuTe DSL scale-fragment scratch. The BF16 epilogue uses the S2G TMA store partition path with the TMA tensor returned by make_tiled_tma_atom(...).
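
For orientation, a minimal sketch of that per-tile dataflow, with hypothetical helper names standing in for the real cutlass.utils.gemm.sm120 surface:

```python
# Sketch only: every function name here is a placeholder for the helpers
# added in this PR; see cutlass.utils.gemm.sm120 for the real surface.
def sm120_nvfp4_tile(descs, smem, pipe, state, tile_coord):
    # External-descriptor TMA fills the physical ALIGN16B A/B layout and
    # the rank-4 physical scale layout.
    issue_ab_tma(descs.a, descs.b, smem.phys_ab, pipe, state, tile_coord)
    issue_scale_tma(descs.sfa, descs.sfb, smem.phys_scales, pipe, state, tile_coord)
    pipe.consumer_wait(state)

    # Stage physical SMEM into the 79a-style consumer layout and scale scratch.
    stage_ab_to_consumer(smem.phys_ab, smem.consumer_ab)
    stage_scales_to_scratch(smem.phys_scales, smem.scale_scratch)

    # Bundled (fragment, scale) warp MMA over two K64 substeps of the K128 tile.
    acc = make_f32_accumulators()
    for k64 in range(2):
        frag_a, frag_b = load_ab_fragments(smem.consumer_ab, k64)
        sfa, sfb = load_scale_fragments(smem.scale_scratch, k64)
        bundled_warp_mma(acc, (frag_a, sfa), (frag_b, sfb))

    # BF16 epilogue through the S2G TMA store partition path.
    store_bf16_s2g(acc, tile_coord)
```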

Limitations

This is not a general SM120 GEMM abstraction. Current limitations are intentional:

  • no C/beta path
  • no varlen path
  • no gather path
  • no multicast path
  • no non-128 K tile support
  • no attempt to generalize all SM100 helper APIs to SM120

Performance status

This PR is correctness- and enablement-focused, not performance-optimized.

The current implementation is intended to establish the CuTe DSL surface for the narrow SM120 NVFP4 79a-style path and to make the data-movement, descriptor, consumer-SMEM, scale-fragment, and BF16 S2G store contracts testable from Python. It deliberately prioritizes explicit staging, validation, and focused runtime canaries over optimized scheduling.

Known non-optimized areas include:

  • A/B TMA data is staged through an explicit physical-to-consumer-SMEM bridge.
  • Scale TMA data is staged through explicit physical-to-fragment scratch.
  • The example uses a simple one-stage pipeline scaffold.
  • Producer/consumer scheduling is intentionally minimal.
  • No attempt is made here to tune occupancy, overlap, warp specialization, register pressure, or epilogue throughput.

The goal is to provide a correct and reviewable SM120 NVFP4 foundation for downstream work, including QuACK. Performance tuning should be handled in follow-up work once the helper contracts and descriptor path are accepted.

Changes

  • Add SM120 MXF4NVF4 warp-MMA support in CuTe DSL:

    • bundled (fragment, scale) operands for cute.gemm(...) — see the sketch after this list
    • warp.mma_unpack(...)
    • SFA/SFB register scale-fragment layout and construction helpers
    • validation for the supported SM120 shape, FP4 fragment layout, FP32 accumulator layout, and E4M3 scale layout
  • Add external-descriptor TMA support for SM120:

    • CTA-local rank-2/rank-3/rank-4 issue helpers
    • descriptor acquire fences before TMA issue
    • optional tile mode, optional L2 cache hint, and already-elected issue paths
    • SM120 pipeline wrappers for rank-3/rank-4 descriptor loads
  • Add temporary Driver API tensor-map builders for SM120 NVFP4:

    • rank-3 packed FP4 A/B descriptors using CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B
    • rank-4 UINT8 scale descriptors
    • validation for alignment, extents, strides, box dimensions, and transaction byte counts
    • runtime canaries for descriptor-created TMA copies
  • Add the SM120 NVFP4 GEMM helper layer:

    • cutlass.utils.gemm.sm120
    • TMA transaction byte helpers
    • descriptor contract validation wrappers
    • TMA atom builders for A/B/SFA/SFB
    • external descriptor pointer helpers
    • physical A/B TMA-to-consumer-SMEM bridge
    • scale physical-to-fragment bridge
    • SM120 NVFP4 scale fragment loading helpers
  • Port the 79a-style consumer SMEM bridge:

    • consumer SMEM layouts matching the C++ path
    • position-independent swizzle tensor support
    • A/B physical TMA-to-consumer-SMEM staging
    • tiled-copy-based consumer fragment loading
    • local m_atom / n_atom selection for consumer microtiles
    • multi-warp lane-index coverage
  • Add an SM120 Blackwell GeForce helper façade:

    • SM100-shaped wrapper names for the supported SM120 NVFP4 warp-MMA/TMA path
    • unsupported SM100/TMEM/tcgen05/dense-MMA helpers intentionally omitted
  • Fix and document the tiled S2G TMA store partition path:

    • add cpasync.tma_partition_s2g_tile
    • clarify that S2G TMA stores should partition the TMA tensor returned by make_tiled_tma_atom(...), not a manually tiled GMEM view
    • add BF16 S2G TMA store partition regression coverage
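
As a sketch of the bundled-operand call shape from the first item above (fragment construction elided; the helpers in this PR validate the exact layouts):

```python
import cutlass.cute as cute

# Sketch only: frag_a/frag_b are register-resident Float4E2M1FN fragments,
# sfa/sfb are register-resident Float8E4M3FN scale fragments, and acc is the
# register-resident Float32 accumulator. Bundling (fragment, scale) routes
# cute.gemm through the SM120 blockscaled warp-MMA op; unbundled operands
# stay on the existing variadic GEMM path. Argument order assumes the usual
# cute.gemm(atom, d, a, b, c) spelling.
cute.gemm(tiled_mma, acc, (frag_a, sfa), (frag_b, sfb), acc)
```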

Notes

Development of this code used the CUDA/C++ SM120 reference path in:

examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu

as an oracle, while reusing the existing SM100 CuTe DSL scaffolding where it maps cleanly.

The SM120 path is intentionally narrow. It targets the NVFP4/MXF4NVF4 warp-MMA, external-descriptor TMA, 79a-style consumer SMEM bridge, scale bridge, and BF16 S2G TMA epilogue path needed by the current Blackwell GeForce mainloop work.

Validation

Suggested reviewer commands:

export CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache

python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_warp_mma.py
python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_tma_issue_helpers.py
python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_pipeline_tma_issue_helpers.py
python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_canary.py
python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_mainloop.py
python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_bf16_tma_store_partition.py

python -m pytest -q test/python/CuTeDSL/test_sm120_tma_helpers.py
python -m pytest -q test/python/CuTeDSL/test_sm120_tensormap_manager.py
python -m pytest -q test/python/CuTeDSL/test_sm120_blackwell_geforce_helpers.py

python examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py --compile-only
python examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py --ptx-inspect
python examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py --runtime-check
python examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py --runtime-tma-check

python -m py_compile \
  python/CuTeDSL/cutlass/cute/__init__.py \
  python/CuTeDSL/cutlass/cute/atom.py \
  python/CuTeDSL/cutlass/cute/tensor.py \
  python/CuTeDSL/cutlass/utils/gemm/sm120.py \
  test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_mainloop.py \
  test/examples/CuTeDSL/sm_120a/test_sm120_bf16_tma_store_partition.py \
  test/python/CuTeDSL/test_sm120_blackwell_geforce_helpers.py \
  examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py

git diff --cached --check
git diff --check

CuTeDSL environment

This branch needs the CuTeDSL development runtime payload, not the generated _mlir files shipped with the released 4.4.2 or 4.5.0 wheels.

From the CUTLASS repo root:

python -m pip install -U "nvidia-cutlass-dsl[cu13]==4.5.0.dev0"

Then refresh the editable CuTeDSL generated runtime payload:

python - <<'PY'
from pathlib import Path
import shutil
import site

# Editable repo package and the installed wheel payload to copy from.
repo = Path("python/CuTeDSL").resolve()
payload = Path(site.getusersitepackages()) / "nvidia_cutlass_dsl"
src_cutlass = payload / "python_packages/cutlass"

# Remove stale generated runtime state from previous runs.
for rel in ["cutlass/_mlir", "lib"]:
    dst = repo / rel
    if dst.exists():
        shutil.rmtree(dst)

# Copy the generated _mlir bindings and the runtime shared libraries.
shutil.copytree(src_cutlass / "_mlir", repo / "cutlass/_mlir")
shutil.copytree(payload / "lib", repo / "lib")

# Copy py.typed markers when the wheel ships them.
for rel in ["base_dsl/py.typed", "cute/py.typed"]:
    src = src_cutlass / rel
    if src.exists():
        dst = repo / "cutlass" / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)

(repo / "VERSION.EDITABLE").write_text("4.5.0.dev1\n")
PY

Install the editable package:

python -m pip install -e python/CuTeDSL

Sanity check:

python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_warp_mma.py

Expected result:

11 passed

Do not use nvidia-cutlass-dsl-libs-*==4.5.0 for this branch: its generated _mlir bindings are incompatible with the SM120 blockscaled MMA code in this branch and can fail with errors such as `VectorType object is not iterable`.

Commits

agent added 13 commits on May 15, 2026:
Teach prep_editable_install.py to handle metadata-only nvidia-cutlass-dsl wheels without copying generated runtime files from arbitrary ambient site-packages. Recent nvidia-cutlass-dsl wheels can declare exact nvidia-cutlass-dsl-libs-* companion packages for the generated cutlass._mlir Python payload and runtime shared libraries; mixing files from a different installed provider or version is invalid because libs-base and libs-cu13 install overlapping payload paths.

The script now reads the downloaded nvidia-cutlass-dsl wheel METADATA, extracts the exact Requires-Dist entries for nvidia-cutlass-dsl-libs-*, downloads the selected companion wheel into the same temporary directory, and copies from the extracted downloaded wheels only. The default runtime provider is base, with CUTLASS_DSL_RUNTIME_PROVIDER=cu13 or a matching package extra available for CUDA 13 payload selection.

Before copying, editable setup removes generated runtime state from previous runs: cutlass/_mlir, lib/, and copied py.typed markers. This makes the runtime payload replacement atomic enough for editable installs and prevents stale generated Python or shared libraries from surviving a version/provider change while VERSION.EDITABLE is updated.

After copying, setup validates that cutlass/_mlir, lib/, and at least one runtime shared library are present before writing VERSION.EDITABLE. Missing runtime payloads now fail hard instead of producing an editable install that cannot import cutlass._mlir.

This keeps editable installs reproducible and aligned with the downloaded DSL wheel metadata while still supporting metadata-only packaging layouts. It also keeps the editable-install artifact ignores from the original change.

Validation: python -m py_compile python/CuTeDSL/prep_editable_install.py; ruff check python/CuTeDSL/prep_editable_install.py; git diff --check; temp-only probe resolved default libs-base and explicit cu13 companion wheels; temp-only stale payload probe verified cleanup, fresh copy, and runtime validation
Replace the pytest sharding fallback import of a top-level device_info.compute_capability module with cutlass.base_dsl.runtime.cuda.get_compute_capability_major_minor().

The previous import can resolve to unrelated third-party or environment-provided device_info packages; in our environment it resolved to an installed empty package and caused pytest collection to fail before any CuTe DSL tests ran. The CuTe DSL runtime helper already owns CUDA Driver API capability detection and is also used by the DSL environment manager, so using it here keeps test selection self-contained in the repo.
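
A minimal sketch of the replacement call, assuming the helper returns a (major, minor) tuple:

```python
from cutlass.base_dsl.runtime.cuda import get_compute_capability_major_minor

# Assumption for illustration: the helper returns a (major, minor) tuple
# from the CUDA Driver API, so no third-party device_info package is needed.
major, minor = get_compute_capability_major_minor()
run_sm120_tests = (major, minor) == (12, 0)
```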

Validation: python -m py_compile test/utils/test_sharding.py; ruff check --select F401 test/utils/test_sharding.py.
Remove the base DSL compiler's import-time sys.path mutation. The module already imports _mlir through the package-relative path, so appending its own directory to sys.path is unnecessary global interpreter state.

While touching the compiler module, clean up two ruff-visible issues: avoid an unused CUDA exception binding and keep the TVM FFI availability check lint-clean while preserving the old importability semantics. enable_tvm_ffi now uses importlib.import_module("tvm_ffi") so a broken discoverable installation fails locally instead of passing a find_spec check and failing later.
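
A minimal sketch of that importability check, with the surrounding compiler state elided:

```python
import importlib

def _tvm_ffi_available() -> bool:
    # import_module actually executes the package, so a broken but
    # discoverable tvm_ffi installation raises here instead of passing a
    # find_spec probe and failing later at first use.
    try:
        importlib.import_module("tvm_ffi")
    except ImportError:
        return False
    return True
```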
Add the CuTe DSL surface needed to issue SM120 MXF4NVF4 warp-level MMA from Python. This introduces the narrow operand-bundle hook on MmaOp, routes only supported bundled cute.gemm operands through the SM120 blockscaled op, and keeps unsupported multi-fragment operands on the existing variadic GEMM path.

Add register scale-fragment helpers plus warp.mma_unpack for the NVFP4 E4M3 scale path. The helper validates the supported SM120 shape and physical fragment contract: A/B fragments must be register-resident Float4E2M1FN with the expected packed Int32 footprint, accumulators must be register-resident Float32 with four lanes, and scale fragments must be register-resident Float8E4M3FN with a zero-stride-filtered compact layout for scale_vec::4X.

Keep small compatibility shims needed by the current CuTe DSL bindings without changing broad tensor typing behavior: TypedTensor still prefers LayoutType.get(), value-caster registration only suppresses duplicate registration, and the L2 prefetch enum fallback avoids eager evaluation.

The runtime MMA tests compile specialized canaries for each constexpr scale choice, assert that generated PTX uses the documented mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3 mnemonic, reject the malformed _mma.block_scale form, materialize CUBIN artifacts, and invoke the compiled functions with runtime tensor arguments.

Reviewer test command:

    python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_warp_mma.py

The test covers runtime all-ones and scale-A=2 cases, gemm-bundle equivalence, PTX mnemonic validation, nonuniform compile coverage, and negative validation for malformed operand bundles and fragment layouts.
Add CTA-local SM120 TMA load helpers that issue external tensor-map descriptors directly from CuTe DSL. The public helpers cover 2D, 3D, and 4D loads with optional .tile mode, optional L2 cache-policy operand, and an already_elected fast path for callers that have already restricted execution to one issuing thread.

The helpers intentionally do not build descriptors or assume a coordinate basis. Coordinates are forwarded to the tensor-map instruction and must match the external descriptor's rank and basis. Destination and barrier pointers are validated as shared-memory pointers, descriptor pointers are validated as global or generic pointers, and each issue path emits a tensor-map acquire fence before the TMA instruction so descriptor bytes are visible to the async/TMA proxy.

The emitted PTX stays CTA-local and avoids cluster, multicast, and cta-group forms. The SM120 arch guard accepts the SM120-family target used by these CTA-local instructions.
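
The call shape, sketched with a hypothetical helper name and keyword spellings (the rank-2 and rank-4 variants differ only in coordinate count):

```python
# Hypothetical name and argument spellings, for illustration only.
sm120_tma_load_3d(
    desc_ptr,        # global/generic pointer to the 128-byte tensor map
    smem_dst_ptr,    # destination, validated as a shared-memory pointer
    mbar_ptr,        # completion mbarrier, validated as shared memory
    (k_coord, major_coord, l_coord),  # forwarded verbatim; must match the
                                      # external descriptor's rank and basis
    l2_cache_policy=None,    # optional L2 cache-policy operand
    already_elected=False,   # skip election when one thread already issues
)
```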

Reviewer test command:

    python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_tma_issue_helpers.py

This is a compile-only PTX test using fake tensors. It checks rank-specific instruction strings, cache-policy plus tile mode, election control, coordinate-count misuse, address-space misuse, and verifies forbidden cluster/multicast forms are absent.
Expose a thin SM120-named pipeline layer over the existing TMA async pipeline lifecycle. PipelineTmaWarpMma.create delegates to PipelineTmaAsync.create and returns the SM120 wrapper without introducing a new barrier lifecycle.

Add helpers that combine producer acquire, SM120 external-descriptor TMA issue, and producer commit for rank-3 and rank-4 descriptor loads. The public helper signatures take PipelineTmaWarpMma, and descriptor-memory visibility is handled by the raw SM120 TMA helper, which emits the tensor-map acquire fence before issue.

The helpers keep the coordinate contract explicit: coordinates are passed directly to the underlying tensor-map instruction and must match the external descriptor's rank and basis.
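
In outline, each rank-3 helper composes the lifecycle like this (names illustrative; the real helpers live in the SM120 pipeline layer):

```python
# Sketch of what one rank-3 pipeline helper composes; names are illustrative.
def pipeline_tma_load_3d(pipe, state, desc_ptr, smem_dst_ptr, coords):
    pipe.producer_acquire(state)           # wait for a free pipeline stage
    sm120_tma_load_3d(desc_ptr, smem_dst_ptr,
                      pipe.producer_get_barrier(state),
                      coords)              # raw helper emits the acquire fence
    pipe.producer_commit(state)            # publish the stage to consumers
```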

Reviewer test command:

    python -m pytest -q test/python/CuTeDSL/test_sm120_tma_helpers.py test/examples/CuTeDSL/sm_120a/test_sm120_pipeline_tma_issue_helpers.py

The tests check public exports, verify the SM120 wrapper preserves the base pipeline fields returned by PipelineTmaAsync.create, and compile each wrapper path through to the expected SM120 TMA instruction plus descriptor acquire fence.
Add a temporary SM120 Driver API descriptor path for blockscaled NVFP4 TMA. The builders use ctypes to call cuTensorMapEncodeTiled and return the 128-byte descriptor payload expected by the external-descriptor TMA issue helpers.

The FP4 builder covers the rank-3 (K, major, L) descriptor using cuda.bindings.driver.CUtensorMapDataType.CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B at the ctypes call boundary. The scale builder covers the rank-4 (major, scale_k, tile, L) UINT8 descriptor and distinguishes the padded physical scale-K extent from the logical scale-K box.

The CUDA Driver enum operands are derived directly from cuda.bindings.driver with _cuda_enum_value(...) instead of hardcoded ABI integers. Driver dependency failures use CudaDriverDependencyError, and the libcuda encoder lookup is cached after the first successful initialization. The helpers are documented as an internal workaround until native CuTe descriptor lowering matches the required SM120 NVFP4 tensor-map encoding.

Validation helpers are kept independent from CUDA Driver availability so pure Python tests can run on machines without a usable driver. Validation rejects nonpositive extents, boxes larger than global dimensions, and overlapping/unaligned strides before calling the Driver API.
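
For orientation, the encode call shape through cuda-python's own binding looks roughly like this; the PR's builders call the same Driver entry point through ctypes, and every extent, stride, and box value below is illustrative:

```python
from cuda.bindings import driver

# Illustrative only: the rank-3 (K, major, L) packed FP4 descriptor. The PR
# derives every enum from cuda.bindings.driver rather than hardcoding ABI
# integers; the concrete dims/strides/box values here are made up.
err, tensor_map = driver.cuTensorMapEncodeTiled(
    driver.CUtensorMapDataType.CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B,
    3,                                   # tensor rank
    a_device_ptr,                        # aligned global base address
    [k_extent, major_extent, l_extent],            # global dims
    [major_stride_bytes, layer_stride_bytes],      # global strides (dims 1..2)
    [128, 128, 1],                       # box dims: one tile per copy
    [1, 1, 1],                           # element strides
    driver.CUtensorMapInterleave.CU_TENSOR_MAP_INTERLEAVE_NONE,
    driver.CUtensorMapSwizzle.CU_TENSOR_MAP_SWIZZLE_128B,
    driver.CUtensorMapL2promotion.CU_TENSOR_MAP_L2_PROMOTION_NONE,
    driver.CUtensorMapFloatOOBfill.CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE,
)
```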

Reviewer test command:

    python -m pytest -q test/python/CuTeDSL/test_sm120_tensormap_manager.py

The test covers alignment, pure validation failures, monkeypatched encoder arguments using cuda.bindings-derived enum values, and a real Driver API smoke path when CUDA is available.
Add focused SM120a runtime coverage for the temporary Driver API tensor-map workaround. The canary builds Driver-created descriptor bytes on the host, copies them to a 128-byte-aligned device buffer, and passes that external descriptor to the SM120 TMA issue helpers.

The rank-3 canary covers the 128x128 NVFP4 payload descriptor and validates the exact swizzled physical shared-memory layout, including nonpayload ALIGN16 bytes that must remain untouched. The rank-4 canary covers the scale descriptor path and validates the exact 128B-swizzled payload layout.

Both descriptor paths use deterministic nonzero source patterns, assert an untouched guard tail after the expected TMA footprint, and include nonzero-coordinate cases so shifted coordinates, over-wide copies, and wrong rank interpretation fail deterministically. The placeholder native-descriptor xfail is removed; the remaining repro is the concrete Driver descriptor path.

Reviewer test command:

    python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_canary.py

The runtime canaries skip cleanly when Torch CUDA or CUDA Driver tensor-map encoding is unavailable.
Add a public `cutlass.utils.gemm.sm120` helper layer for the SM120 MXF4NVF4 warp-GEMM path. The helpers centralize the default 128x128x128 tile contract, TMA transaction byte counts, descriptor contract validation, warp MMA construction, explicit TMA physical SMEM layout names, consumer scratch layout names, layout-aware TMA atom creation, external tensor-map descriptor pointers, packed-LDSM helper entry points, SMEM-backed scale fragment source views, and SM120 scale fragment construction.

Track A/B transaction bytes separately from physical A/B SMEM bytes. A 128x128 packed FP4 descriptor transaction transfers 8192 payload bytes, while the 128B-swizzled physical SMEM view consumed by LDSM occupies 16384 byte slots. The TMA atom defaults and LDSM K64 substep offsets use the physical 128-byte K stride so the descriptor-loaded SMEM view matches the load helper's address contract.
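
The split falls directly out of the packing arithmetic:

```python
# 128x128 FP4 elements at 4 bits each: TMA transaction payload bytes.
payload_bytes = 128 * 128 * 4 // 8   # 8192
# The physical view consumed by LDSM uses a 128-byte K stride per major row,
# so the descriptor-loaded tile spans 128 rows * 128 bytes of SMEM slots.
physical_bytes = 128 * 128           # 16384
```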

Make the producer/consumer layout split explicit in the API. Add semantic physical layout names for A/B and scale TMA destinations, LDSM scratch layout helpers, scale fragment scratch layout helpers, scratch view helpers, and staging entry points. Keep the older layout helper names as compatibility aliases with docstrings that point to the TMA physical interpretation. The scale bridge models the rank-4 external descriptor layout by mapping logical payload order `scale_col * 128 + major` through the 16-byte chunk swizzle before writing fragment-compatible scratch order `local_major * 8 + scale_col`. The staging loops are lane-strided when a lane id is supplied, so CTA threads cooperatively populate scratch instead of redundantly writing the whole tile.

Add a focused Blackwell GeForce example that demonstrates the intended atom-based external-descriptor TMA path: build TMA atoms, partition with `cpasync.tma_partition`, acquire one TMA pipeline stage, issue A/B/SFA/SFB `cute.copy(...)` operations with external descriptor pointers, and commit once. The example then waits as a consumer, stages TMA-populated A/B physical SMEM into CUTLASS-owned LDSM scratch, stages TMA-populated scale physical SMEM into CUTLASS-owned scale fragment scratch, loads packed A/B fragments from scratch, loads SFA/SFB fragments from scratch for two K64 substeps, calls bundled `cute.gemm(...)`, and stores BF16 output. PTX inspection covers rank-3 A/B TMA, rank-4 scale TMA, descriptor acquire fences, `ldmatrix.x4`, and the SM120 MXF4NVF4 MMA mnemonic; SASS inspection checks for `LDSM` and `OMMA.SF` when a CUBIN and `nvdisasm` are available. The fake tensors deliberately keep descriptor-rank rest extents materialized so the atom path emits rank-3 A/B and rank-4 scale TMA instructions instead of collapsing to 2D.

Teach the non-exec TMA load atom path to emit a tensor-map acquire fence when callers provide `tma_desc_ptr`, so layout-aware `cute.copy(..., tma_desc_ptr=...)` has the same descriptor visibility behavior as the lower-level external-descriptor helpers. Keep `cpasync.tma_partition` compatible with current and older MLIR binding spellings by trying the `gmem_tensor` keyword first and falling back to `target_tensors` when needed.

Add tests for API exports, transaction-byte helpers, wrapper-level descriptor contract validation, external descriptor override lowering, PTX/SASS shape, physical A/B TMA storage, scale TMA physical-to-fragment scratch staging, and deterministic runtime paths. The compact scale check initializes different SMEM scale bytes for each K64 substep and checks the exact accumulator value. The rank-4 scale TMA bridge check performs external descriptor TMA into physical scale SMEM, stages through CUTLASS scale scratch, loads bundled scale fragments, and checks the exact 192.0 accumulator value. The lower-level end-to-end A/B runtime test stages physical SMEM through the same LDSM scratch helper used by the example before bundled MXF4NVF4 MMA. The atom-copy end-to-end canary uses the exact `cpasync.tma_partition` plus `cute.copy(..., tma_desc_ptr=...)` path for A/B in one acquire/commit window, stages descriptor-loaded physical SMEM into LDSM scratch, and then runs bundled MMA. The physical reference helper is intentionally separate from the TMA paths so nonuniform A/B payloads exercise descriptor swizzling, lane offsets, and the second K64 LDSM substep instead of filling register fragments directly. The descriptor canaries also cover nonzero K and major coordinates.

The example exposes deterministic hardware checks through `--runtime-check` and `--runtime-tma-check` for local validation.
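
In outline, the example's producer step looks like this (setup, shapes, and most keyword spellings abbreviated or assumed):

```python
# Sketch of the atom-based producer step; B/SFA/SFB repeat the same pattern
# inside one acquire/commit window. Partitioning and indexing details elided.
smem_part_a, gmem_part_a = cpasync.tma_partition(tma_atom_a, ...)

pipe.producer_acquire(state)
# tma_desc_ptr overrides the atom's embedded descriptor with the external
# Driver-built tensor map and triggers the tensor-map acquire fence.
cute.copy(
    tma_atom_a,
    gmem_part_a[(None, k_iter)],
    smem_part_a[(None, stage)],
    tma_bar_ptr=pipe.producer_get_barrier(state),
    tma_desc_ptr=desc_a_ptr,
)
pipe.producer_commit(state)
```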

Reviewer commands:

```bash
export CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache
python -m pytest -q test/python/CuTeDSL/test_sm120_blackwell_geforce_helpers.py
python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_mainloop.py
python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_canary.py
python -m pytest -q test/python/CuTeDSL/test_sm120_tensormap_manager.py test/python/CuTeDSL/test_sm120_tma_helpers.py
python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_tma_issue_helpers.py test/examples/CuTeDSL/sm_120a/test_sm120_pipeline_tma_issue_helpers.py
python -m pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_warp_mma.py
python examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py --compile-only
python examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py --ptx-inspect
python examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py --runtime-check
python examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py --runtime-tma-check
```
Port the SM120 MXF4NVF4 A/B consumer contract into the Python CuTe DSL helper layer. Add position-independent swizzle tensor support, a retile_D alias for tiled copies, consumer SMEM layouts, non-transposed A/B LDSM tiled-copy views, and a CUTLASS-owned physical TMA to consumer-SMEM adapter.

Update the Blackwell GeForce mainloop scaffold to keep external descriptor TMA into the physical ALIGN16B layout, then stage through the consumer tensor layout before loading A/B fragments. The adapter writes through the recast consumer layout, rejects nonzero A/B major offsets for local full-tile staging, and keeps global major selection in tensor-map coordinates.

Make the local CTA contract explicit for downstream kernels: add consumer microtile views, add m_atom/n_atom selection to the fragment and load helpers, name public helper parameters lane_idx so callers pass warp-local lane indices rather than CTA thread indices, and rename the adapter destination stage argument to consumer_stage_idx.

Keep the older manual inline-LDSM scratch helpers as compatibility-only helpers outside the public __all__ surface, including their scratch allocation and TMA-physical-to-scratch staging entry points. Keep retile_D documented as verified for this SM120 MXF4NVF4 tiled-copy path rather than as a broad semantic claim.

Add runtime coverage for consumer-only logical A/B probes, full atom-copy external descriptor TMA through LDSM/MMA, stage-1 physical-to-consumer adapter staging, and a 9-warp lane-index canary with one staging warp plus all eight consumer warps. The logical nonuniform A/B cases now run as strict passing tests with exact per-fragment masks, asymmetric low/high-nibble probes cover the typed FP4 post-LDSM shift hook, and the 9-warp canary keeps SFA/SFB fixed at 1.0 while stressing A/B microtile selection. The parity mask remains a smoke canary for ignored m_atom and lane/stage mistakes; separate one-hot logical-M and logical-N probes cover offset aliasing across all local m_atom and n_atom selections.

Add one combined runtime canary that uses atom external descriptor TMA for nonuniform A/B, stages through the consumer SMEM adapter and LDSM path, loads rank-4 scale descriptors through the scale physical-to-fragment bridge, and verifies bundled MXF4NVF4 MMA with nonuniform SFA K64 substep scales.

Reviewer commands:

  export CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache

  pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_mainloop.py test/python/CuTeDSL/test_sm120_blackwell_geforce_helpers.py --tb=short

  python -m py_compile python/CuTeDSL/cutlass/cute/__init__.py python/CuTeDSL/cutlass/cute/atom.py python/CuTeDSL/cutlass/cute/tensor.py python/CuTeDSL/cutlass/utils/gemm/sm120.py test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_mainloop.py test/python/CuTeDSL/test_sm120_blackwell_geforce_helpers.py examples/python/CuTeDSL/blackwell_geforce/sm120_mxf4nvf4_tma_mainloop.py

  git diff --cached --check

  git diff --check
Turn cutlass.utils.blackwell_geforce_helpers into an SM120-facing facade rather than a re-export of cutlass.utils.gemm.sm120. Expose SM100-shaped wrapper names only for the supported SM120 NVFP4 warp-MMA/TMA path, while keeping lower-level implementation names private to cutlass.utils.gemm.sm120.

Add facade entry points for blockscaled tiled MMA validation, physical and consumer A/B SMEM layouts, A/B physical and consumer SMEM view allocation, CTA-local TMA atoms, tiled TMA atom construction, transaction byte queries, external descriptor pointer access, descriptor-contract validation, A/B consumer-fragment loading, and scale fragment scratch staging/loading.

Keep SM120 differences explicit by omission: dense MMA, TMEM/tcgen05 helpers, generic SM100-style SMEM store helpers, and low-level MXF4NVF4 helper names are not exported from the facade.

Add API tests for facade exports, absence of low-level and unsupported stub names, transaction byte wrappers, positive CTA-local TMA atom creation, positive blockscaled MMA wrapper compilation, unsupported dtype and scale-vector validation, and multicast cluster rejection.

Reviewer commands:

  export CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache

  pytest -q test/python/CuTeDSL/test_sm120_blackwell_geforce_helpers.py --tb=short

  pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_mainloop.py test/python/CuTeDSL/test_sm120_blackwell_geforce_helpers.py --tb=short

  python -m py_compile python/CuTeDSL/cutlass/utils/blackwell_geforce_helpers.py test/python/CuTeDSL/test_sm120_blackwell_geforce_helpers.py

  git diff --check
Add a cpasync.tma_partition_s2g_tile helper for rank-3 MN/L epilogue store tiles so callers partition the TMA tensor returned by make_tiled_tma_atom instead of manually tiled ordinary GMEM views.

Document the returned-tensor requirement on make_tiled_tma_atom and wrap S2G partition verifier failures with a targeted diagnostic. The helper also validates tile_coord_mnkl before indexing its L coordinate.

Add an SM120 BF16 regression test covering the direct returned-tensor recipe, the helper recipe, reduce-store partitioning, ordinary-GMEM misuse diagnostics, and short-coordinate validation.
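
The intended recipe, in outline (argument spellings for the new helper are illustrative):

```python
# Keep the TMA tensor returned by make_tiled_tma_atom; do not substitute a
# manually tiled ordinary GMEM view of D.
tma_atom_d, tma_tensor_d = make_tiled_tma_atom(...)

# Illustrative argument spelling for the new helper.
bSG_sD, bSG_gD = cpasync.tma_partition_s2g_tile(
    tma_atom_d, tma_tensor_d, smem_d, tile_shape_mn, tile_coord_mnkl)

cute.copy(tma_atom_d, bSG_sD, bSG_gD)  # S2G TMA store
```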

Reviewer commands:

  export CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache

  pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_bf16_tma_store_partition.py

  pytest -q test/python/CuTeDSL/test_sm120_blackwell_geforce_helpers.py

  pytest -q test/examples/CuTeDSL/sm_120a/test_sm120_mxf4nvf4_tma_mainloop.py