164 changes: 164 additions & 0 deletions docs/finn/pwpolyf.md

Collaborator

Could you please move this into the components subsection? https://github.com/Xilinx/finn/tree/dev/docs/finn/components

@@ -0,0 +1,164 @@
# PWPolyF — Piecewise Polynomial Activation

## Overview

PWPolyF is a hardware activation layer that approximates nonlinear functions
(GELU, SiLU, Sigmoid, Tanh) using piecewise polynomials evaluated via Horner's
method on a chain of DSPFP32 FMA units. With the default polynomial degree of 2,
each PE uses two cascaded DSPs and sustains one element per cycle without any
BRAM. Per-function configuration (clamping behaviour and polynomial
coefficients) is delivered through a SystemVerilog package (`pwpolyf_pkg`)
using a `func_cfg_t` struct.
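
As a rough software illustration of the Horner scheme described above, the sketch below evaluates a degree-2 polynomial with two multiply-add steps, one per FMA stage. The function names are hypothetical and not part of FINN; Python floats are double precision, so results will not be bit-identical to an FP32 DSP chain.

```python
def horner_deg2(c2: float, c1: float, c0: float, x: float) -> float:
    """Evaluate c2*x^2 + c1*x + c0 using two multiply-add steps."""
    acc = c2 * x + c1   # first multiply-add (first FMA stage)
    acc = acc * x + c0  # second multiply-add (second FMA stage)
    return acc


def horner(coeffs, x: float) -> float:
    """General Horner evaluation; coeffs are ordered highest degree first."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc
```

A degree-d polynomial needs d multiply-add steps in this form, which is consistent with the two DSPs quoted for degree 2.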

The input domain is partitioned into `1 + 2*5*(2^K)` segments: one near-zero
region, positive octave sub-segments, and negative mirrors. With the default
K=3 this gives 81 segments. Segment selection reuses the FP32
exponent/mantissa bit-fields directly, matching the RTL implementation.
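
The following sketch shows one way a segment index could be derived from the FP32 bit-fields. The segment count follows the formula above; everything else is an assumption made for illustration: the first octave boundary `E_MIN`, the index ordering (near-zero first, then positive, then negative), and the choice to map out-of-range magnitudes to the last sub-segment. The actual RTL layout may differ.

```python
import struct

NUM_OCTAVES = 5  # matches the factor 5 in the segment-count formula
E_MIN = -2       # hypothetical: the first octave starts at 2**E_MIN


def segment_count(k: int) -> int:
    """Total number of segments: 1 + 2*5*(2^K)."""
    return 1 + 2 * NUM_OCTAVES * (2 ** k)


def segment_index(x: float, k: int = 3) -> int:
    """Map x to a segment index using its FP32 sign/exponent/mantissa fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127  # unbiased exponent
    mantissa = bits & 0x7FFFFF              # 23 fraction bits
    octave = exponent - E_MIN
    if octave < 0:
        return 0                            # near-zero region
    if octave >= NUM_OCTAVES:
        # Assumption: magnitudes beyond the last octave use its last sub-segment
        octave, sub = NUM_OCTAVES - 1, (1 << k) - 1
    else:
        sub = mantissa >> (23 - k)          # top K mantissa bits pick the sub-segment
    per_side = NUM_OCTAVES * (1 << k)
    return 1 + sign * per_side + octave * (1 << k) + sub
```

With `K=3` and the assumed `E_MIN`, `segment_index(0.25)` lands in the first positive segment and `segment_index(-0.25)` in the first negative one.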

Polynomial coefficients are generated at HDL build time by
`generate_coeffs_pkg()` in `pwpolyf_rtl.py`, which fits degree-2 polynomials
to the reference PyTorch functions and writes `pwpolyf_pkg.sv` — a
SystemVerilog package with one `func_cfg_t` struct per activation
(clamping config + coefficient table). K can take any value; it defaults
to 3 when inferred from standard ONNX ops.
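
To give a feel for per-segment coefficient generation, here is a minimal sketch that builds a degree-2 polynomial for one segment. It is not `generate_coeffs_pkg()`: the fitting method used there is not described here, so this example substitutes plain quadratic interpolation through the two endpoints and the midpoint, and it uses `math.tanh` rather than a PyTorch reference. All function names are hypothetical.

```python
import math


def fit_quadratic(f, lo: float, hi: float):
    """Return (c2, c1, c0) of the quadratic through f at lo, midpoint and hi."""
    x0, x1, x2 = lo, 0.5 * (lo + hi), hi
    f0, f1, f2 = f(x0), f(x1), f(x2)
    d01 = (f1 - f0) / (x1 - x0)       # first divided differences
    d12 = (f2 - f1) / (x2 - x1)
    d012 = (d12 - d01) / (x2 - x0)    # second divided difference
    # Expand the Newton form into monomial coefficients
    c2 = d012
    c1 = d01 - d012 * (x0 + x1)
    c0 = f0 - d01 * x0 + d012 * x0 * x1
    return c2, c1, c0


def max_abs_error(f, coeffs, lo: float, hi: float, samples: int = 101) -> float:
    """Largest sampled |p(x) - f(x)| on [lo, hi], with p evaluated via Horner."""
    c2, c1, c0 = coeffs
    worst = 0.0
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        worst = max(worst, abs((c2 * x + c1) * x + c0 - f(x)))
    return worst


# One sub-segment of width 1/16, as obtained when the octave [0.5, 1) is split into 2^3 parts
coeffs = fit_quadratic(math.tanh, 0.5, 0.5625)
err = max_abs_error(math.tanh, coeffs, 0.5, 0.5625)
```

A least-squares or minimax fit would generally give a smaller worst-case error than interpolation; the sketch only illustrates the shape of the per-segment computation.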

## Architecture

PWPolyF is **RTL-only** (no HLS variant). Two export paths are supported:

```
Path A: PiecewisePolyActivation        Path B: nn.GELU / nn.SiLU / etc.
        |  torch.onnx.export                   |  torch.onnx.export
        |  (dynamo=False)                      |  (dynamo=True or False)
        v                                      v
PWPolyF custom ONNX node               Standard ONNX ops (Gelu, Sigmoid,
        |                                Tanh, Sigmoid+Mul for SiLU,
        |                                Div+Erf+Add+Mul+Mul for GELU)
        |                                      |
        +------------- both paths -------------+
                           |
                           |  InferPWPolyFLayer
                           v
        PWPolyF HW op (finn.custom_op.fpgadataflow)
                           |  SpecializeLayers
                           v
        PWPolyF_rtl (finn.custom_op.fpgadataflow.rtl)
                           |  generate_hdl
                           v
        finn-rtllib/pwpolyf/hdl/  SystemVerilog IP
```

### Standard ONNX op inference

`InferPWPolyFLayer` recognises standard ONNX activation ops in addition to
the explicit `PWPolyF` custom op. This allows models that use `nn.GELU`,
`nn.SiLU`, `nn.Sigmoid`, or `nn.Tanh` to be exported with `dynamo=True`
(or `dynamo=False`) and automatically converted to PWPolyF HW layers.

| ONNX op type | Pattern | Maps to |
|---|---|---|
| `Gelu` (opset 20+) | Single node | `func="gelu"` |
| `Div`+`Erf`+`Add`+`Mul`+`Mul` | `x * 0.5 * (1 + erf(x / sqrt(2)))` | `func="gelu"` |
| `Sigmoid` | Single node (standalone) | `func="sigmoid"` |
| `Tanh` | Single node | `func="tanh"` |
| `Sigmoid` + `Mul` | `Mul(x, Sigmoid(x))` | `func="silu"` |
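
For reference, the two decomposed patterns in the table correspond to the closed-form expressions below. This is a plain-Python sketch of the mathematical definitions only, using `math.erf`; it is not taken from the FINN sources.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid (may overflow in math.exp for very negative x)."""
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    """SiLU as the two-node pattern Mul(x, Sigmoid(x))."""
    return x * sigmoid(x)


def gelu_erf(x: float) -> float:
    """GELU as the Erf-based pattern x * 0.5 * (1 + erf(x / sqrt(2)))."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```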

Notes:
- `Gelu` as a single ONNX node requires opset 20 or later. With lower
opsets (including `dynamo=True` which defaults to opset 18), GELU
decomposes into a 5-node Erf-based pattern. Both forms are matched.
- SiLU (`nn.SiLU`) has no standard ONNX op; it decomposes to
`Sigmoid(x) * x`. The transformation detects this two-node pattern.
- Only FLOAT32 inputs are converted. Quantised activations are skipped.
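
The two-node SiLU detection can be pictured with the toy matcher below. It operates on a hypothetical `Node` dataclass rather than real ONNX or QONNX objects and is not the `InferPWPolyFLayer` implementation. It accepts either `Mul` input order and, as an assumption of this sketch, skips a `Sigmoid` whose output feeds more than one consumer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical minimal graph node: op type plus tensor names."""
    op_type: str
    inputs: List[str]
    outputs: List[str]


def find_silu_pairs(nodes: List[Node]) -> List[Tuple[Node, Node]]:
    """Return (Sigmoid, Mul) pairs that together compute x * Sigmoid(x)."""
    matches = []
    for sig in nodes:
        if sig.op_type != "Sigmoid":
            continue
        x, s = sig.inputs[0], sig.outputs[0]
        consumers = [n for n in nodes if s in n.inputs]
        # Assumption: only fuse when the Sigmoid output has a single consumer
        if len(consumers) != 1:
            continue
        mul = consumers[0]
        # Accept both Mul(x, s) and Mul(s, x)
        if mul.op_type == "Mul" and sorted(mul.inputs) == sorted([x, s]):
            matches.append((sig, mul))
    return matches
```

A `Sigmoid` that is not matched here would be left as a standalone node, which corresponds to the `func="sigmoid"` row in the table above.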

## Folding

PWPolyF uses PE parallelism. `NumChannels % PE == 0` must hold.
Each PE instantiates its own polynomial evaluation pipeline (2 DSPs).
`SetFolding` handles PE selection automatically.

| PE | DSPs | Approx LUTs | Cycles (per spatial position) |
|----|------|-------------|-------------------------------|
| 1 | 2 | 200 | NumChannels |
| C | 2C | 200C | 1 |
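
The relationships in the table can be written down as a small helper. This is only a sketch of the arithmetic stated above (2 DSPs and roughly 200 LUTs per PE, `NumChannels / PE` cycles per spatial position); the function names are hypothetical and it does not call into `SetFolding`.

```python
def valid_pe_values(num_channels: int) -> list:
    """All PE values satisfying NumChannels % PE == 0."""
    return [pe for pe in range(1, num_channels + 1) if num_channels % pe == 0]


def folding_estimate(num_channels: int, pe: int) -> dict:
    """Cycles per spatial position and rough DSP/LUT cost for a given PE."""
    if pe < 1 or num_channels % pe != 0:
        raise ValueError("PE must be a positive divisor of NumChannels")
    return {
        "cycles_per_position": num_channels // pe,
        "DSP": 2 * pe,    # two FMA stages per PE
        "LUT": 200 * pe,  # approximate figure from the table
    }
```

For example, 64 channels with `PE=8` would take 8 cycles per spatial position at a cost of 16 DSPs under these figures.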

## Resource estimates

- **DSP:** 2 per PE (two FP32 FMA stages)
- **LUT:** ~200 per PE (segment address decode + control)
- **BRAM/URAM:** 0 (coefficients stored in LUT/registers)

## ONNX export

Two export paths are supported:

1. **`PiecewisePolyActivation` (explicit)** — exports as a single `PWPolyF`
custom op via `torch.autograd.Function.symbolic()`. Requires
`dynamo=False`. Preserves the `K` attribute on the ONNX node.

2. **Standard nn modules** (`nn.GELU`, `nn.SiLU`, `nn.Sigmoid`, `nn.Tanh`) —
export with `dynamo=True` or `dynamo=False`. Produces standard ONNX ops
that `InferPWPolyFLayer` converts to PWPolyF with default `K=3`.

Attributes on the explicit PWPolyF ONNX node:
- `func` (string): one of `gelu`, `silu`, `sigmoid`, `tanh`
- `K` (int): mantissa subdivision bits (default 3)

## Node attributes (HW op)

| Attribute | Type | Description |
|--------------------|--------|------------------------------------------|
| `func` | string | Activation function name |
| `K` | int | Mantissa subdivision bits |
| `NumChannels` | int | Number of channels (last input dim) |
| `PE` | int | Processing elements |
| `inputDataType` | string | Input data type (FLOAT32) |
| `outputDataType` | string | Output data type (FLOAT32) |
| `numInputVectors` | ints | Batch/spatial dimensions |
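
As an illustration of how these attributes might combine, the sketch below derives a folded shape and a stream width. It assumes the convention commonly used for elementwise FINN layers (leading `numInputVectors` dimensions, then `NumChannels / PE`, then `PE`, with a stream carrying `PE` elements of 32 bits each); whether PWPolyF follows exactly this convention is an assumption, and the function names are hypothetical.

```python
def folded_shape(num_input_vectors, num_channels: int, pe: int) -> tuple:
    """Assumed folded shape: (*numInputVectors, NumChannels // PE, PE)."""
    if num_channels % pe != 0:
        raise ValueError("NumChannels must be divisible by PE")
    return tuple(num_input_vectors) + (num_channels // pe, pe)


def stream_width_bits(pe: int, bits_per_element: int = 32) -> int:
    """Assumed stream width: PE parallel FLOAT32 elements."""
    return pe * bits_per_element
```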

## Supported functions

| Function | Negative clamp | Positive behaviour |
|----------|---------------|--------------------|
| GELU | 0.0 | passthrough (y=x) |
| SiLU | 0.0 | passthrough (y=x) |
| Sigmoid | 0.0 | clamp to 1.0 |
| Tanh | -1.0 | clamp to 1.0 |
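
Read as out-of-range behaviour, the table could be modelled as follows. The saturation limit `X_LIMIT` is a hypothetical value chosen for the example, and the interpretation that clamping applies only beyond the fitted range is an assumption; the real thresholds live in the generated `func_cfg_t` structs.

```python
X_LIMIT = 8.0  # hypothetical magnitude beyond which no polynomial segment applies

# (negative clamp value, positive clamp value or None for passthrough y = x)
CLAMP_CFG = {
    "gelu": (0.0, None),
    "silu": (0.0, None),
    "sigmoid": (0.0, 1.0),
    "tanh": (-1.0, 1.0),
}


def saturate(func: str, x: float, limit: float = X_LIMIT):
    """Return the out-of-range output, or None if x should use a polynomial segment."""
    neg_clamp, pos_clamp = CLAMP_CFG[func]
    if x <= -limit:
        return neg_clamp
    if x >= limit:
        return x if pos_clamp is None else pos_clamp
    return None
```

Passthrough on the positive side reflects that GELU and SiLU approach `y = x` for large inputs, while Sigmoid and Tanh approach 1.0.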

## Files

### Python

| File | Purpose |
|------|---------|
| `custom_op/fpgadataflow/pwpolyf.py` | Base HW op (shape, folding, resource estimates, cppsim) |
| `custom_op/fpgadataflow/rtl/pwpolyf_rtl.py` | RTL backend (HDL generation, package generation, rtlsim, IPI) |
| `util/pwpolyf.py` | PyTorch activation module, ONNX export, software simulation |
| `transformation/fpgadataflow/convert_to_hw_layers.py` | `InferPWPolyFLayer` transformation |
| `builder/build_dataflow_steps.py` | Build pipeline integration |
| `transformation/fpgadataflow/set_folding.py` | Folding support (pe_ops list) |

### RTL

| File | Purpose |
|------|---------|
| `finn-rtllib/pwpolyf/hdl/pwpolyf_pkg.sv` | `func_cfg_t` struct per activation (coeffs + clamp config, regenerated per K) |
| `finn-rtllib/pwpolyf/hdl/pwpolyf.sv` | Polynomial evaluation pipeline (Horner chain on DSPFP32) |
| `finn-rtllib/pwpolyf/hdl/queue.sv` | Elastic FIFO for backpressure |
| `finn-rtllib/pwpolyf/hdl/pwpolyf_template_wrapper.v` | AXI-Stream wrapper template |

## Tests

`tests/fpgadataflow/test_fpgadataflow_pwpolyf.py`:

- **cppsim**: all 4 functions x 2 channel counts x 2 spatial shapes x 3 foldings
- **ONNX export**: verifies single-node export for all functions
- **InferPWPolyFLayer**: end-to-end export → transform → execute
- **Standard op inference**: Gelu/Sigmoid/Tanh single-node + SiLU pattern
- **Erf-based GELU inference**: 5-node Erf decomposition pattern matching + execution
- **SiLU edge cases**: reversed Mul input order, multi-consumer Sigmoid
- **Execution correctness**: standard ops produce same output as PiecewisePolyActivation
- **SpecializeLayers**: verifies RTL specialization
- **Resource estimates**: DSP/LUT/BRAM checks across PE values
- **Folded shapes**: input/output/stream width calculations
- **Expected cycles**: cycle count estimation + analysis pass integration
3 changes: 3 additions & 0 deletions docs/finn/reference/folding-constraints.rst
@@ -68,6 +68,9 @@ Constraint Table
   * - Pool
     - PE
     - inp_channels % PE == 0
   * - PWPolyF
     - PE
     - NumChannels % PE == 0
   * - Thresholding
     - PE
     - MH % PE == 0
9 changes: 9 additions & 0 deletions docs/finn/source_code/finn.custom_op.fpgadataflow.rst
@@ -136,6 +136,15 @@ finn.custom\_op.fpgadataflow.pool
   :undoc-members:
   :show-inheritance:

finn.custom\_op.fpgadataflow.pwpolyf
--------------------------------------

.. automodule:: finn.custom_op.fpgadataflow.pwpolyf
   :members:
   :undoc-members:
   :show-inheritance:


finn.custom\_op.fpgadataflow.streamingdataflowpartition
--------------------------------------------------------

8 changes: 8 additions & 0 deletions docs/finn/source_code/finn.custom_op.fpgadataflow.rtl.rst
@@ -45,6 +45,14 @@ finn.custom\_op.fpgadataflow.streamingfifo\_rtl
   :undoc-members:
   :show-inheritance:

finn.custom\_op.fpgadataflow.pwpolyf\_rtl
--------------------------------------------

.. automodule:: finn.custom_op.fpgadataflow.rtl.pwpolyf_rtl
   :members:
   :undoc-members:
   :show-inheritance:

finn.custom\_op.fpgadataflow.thresholding\_rtl
-------------------------------------------------------

9 changes: 9 additions & 0 deletions docs/finn/source_code/finn.util.rst
@@ -188,6 +188,15 @@ finn.util.pytorch
   :show-inheritance:


finn.util.pwpolyf
-------------------

.. automodule:: finn.util.pwpolyf
   :members:
   :undoc-members:
   :show-inheritance:


finn.util.test
---------------------

5 changes: 5 additions & 0 deletions finn-rtllib/pwpolyf/hdl/pwpolyf.abc
@@ -0,0 +1,5 @@
import queue
read_sv pwpolyf_pkg.sv
read_sv pwpolyf.sv
setup_tb pwpolyf_tb
setup_top pwpolyf