1 change: 1 addition & 0 deletions docs/finn/components/index.rst
@@ -10,3 +10,4 @@ This section provides detailed documentation for specific FINN hardware componen
   :maxdepth: 2

   rtl-swg
   pwpolyf
272 changes: 272 additions & 0 deletions docs/finn/components/pwpolyf.rst
@@ -0,0 +1,272 @@
PWPolyF Piecewise Polynomial Activation
=======================================

Overview
--------

PWPolyF is a hardware activation layer that approximates nonlinear functions
(GELU, SiLU, Sigmoid, Tanh) using piecewise polynomials evaluated with
Horner's method on a chain of DSPFP32 FMA units. With the default degree of 2,
this uses two cascaded DSPs and one RAMB18 coefficient ROM per PE, giving
single-cycle-per-element throughput. Per-function configuration, including
clamping behaviour and polynomial coefficients, is delivered through a
SystemVerilog package (``pwpolyf_pkg``) using a ``func_cfg_t`` struct.
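
As a rough illustration of the evaluation scheme described above, the
following Python sketch applies Horner's method to the coefficients of a single
segment. The function name ``horner_eval`` and the highest-degree-first
coefficient ordering are assumptions made for this example and are not taken
from the RTL. Each loop iteration is one multiply-add, so a degree-2 polynomial
needs two of them, in line with the two cascaded DSPs mentioned above.

.. code-block:: python

   def horner_eval(coeffs, x):
       """Evaluate a polynomial with Horner's method.

       coeffs is ordered highest degree first (an assumption of this sketch),
       so [c2, c1, c0] represents c2*x**2 + c1*x + c0.
       """
       acc = coeffs[0]
       for c in coeffs[1:]:
           # One multiply-add per remaining coefficient: acc = acc * x + c
           acc = acc * x + c
       return acc


   # Degree 2: two multiply-add steps, (1*2 + 2)*2 + 3
   print(horner_eval([1.0, 2.0, 3.0], 2.0))  # 11.0

Python floats are double precision, so this sketch does not reproduce the
FP32 rounding of the hardware.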

The input domain is partitioned into ``1 + 2*5*(2^K)`` segments: one near-zero
region, positive octave sub-segments, and negative mirrors. With the default
``K=3`` this gives 81 segments. Segment selection reuses the FP32 exponent and
mantissa bit fields directly, matching the RTL implementation.
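
The segment count and the idea of indexing by FP32 bit fields can be sketched
in Python as follows. Only the count formula comes from the text above. The
segment ordering, the ``lowest_exp`` parameter (the unbiased exponent of the
first octave) and the helper names are hypothetical choices for illustration
and may differ from what the RTL actually does.

.. code-block:: python

   import struct

   NUM_OCTAVES = 5  # the factor 5 in the segment-count formula


   def num_segments(k):
       """1 near-zero segment + 2 signs * 5 octaves * 2**k sub-segments."""
       return 1 + 2 * NUM_OCTAVES * (1 << k)


   def fp32_fields(x):
       """Split x, rounded to FP32, into (sign, biased exponent, mantissa)."""
       bits = struct.unpack("<I", struct.pack("<f", x))[0]
       return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


   def segment_index(x, k=3, lowest_exp=-2):
       """Hypothetical index layout: 0 = near zero, then positive, then negative.

       lowest_exp is an assumed value, not taken from the RTL. Returns None
       when |x| lies beyond the highest octave (clamp/passthrough region).
       """
       sign, exponent, mantissa = fp32_fields(x)
       octave = exponent - 127 - lowest_exp
       if octave < 0:
           return 0
       if octave >= NUM_OCTAVES:
           return None
       sub = mantissa >> (23 - k)  # top k mantissa bits pick the sub-segment
       per_sign = NUM_OCTAVES * (1 << k)
       return 1 + sign * per_sign + octave * (1 << k) + sub


   print(num_segments(3))  # 81

With ``K=3`` and the assumed layout, indices run from 0 to 80, which matches
the 81 segments stated above.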

Polynomial coefficients are generated at HDL build time by
``PWPolyF_rtl._generate_coeffs_pkg()``, which fits polynomials of the
configured degree to the reference PyTorch functions and writes
``pwpolyf_pkg.sv``. Both ``K`` and ``degree`` are configurable. They default to
``K=3`` and ``degree=2`` when inferred from standard ONNX ops.

Architecture
------------

PWPolyF is RTL-only, with no HLS variant, and targets Versal devices only. The
RTL instantiates the Versal DSPFP32 primitive, so UltraScale+ and older parts
must not be specialized to this backend.

Two export paths are supported:

.. code-block:: text

   Path A: PWPolyFActivation              Path B: nn.GELU / nn.SiLU / etc.
     | torch.onnx.export                    | torch.onnx.export
     | (dynamo=False)                       | (dynamo=True or False)
     v                                      v
   PWPolyF custom ONNX node               Standard ONNX ops (Gelu, Sigmoid,
     |                                    Tanh, Sigmoid+Mul for SiLU,
     |                                    Div+Erf+Add+Mul+Mul for GELU)
     |                                      |
     +------------- both paths -------------+
                        |
                InferPWPolyFLayer
                        v
   PWPolyF HW op (finn.custom_op.fpgadataflow)
                        | SpecializeLayers
                        v
   PWPolyF_rtl (finn.custom_op.fpgadataflow.rtl)
                        | generate_hdl
                        v
   finn-rtllib/pwpolyf/hdl/ SystemVerilog IP

Standard ONNX Op Inference
--------------------------

``InferPWPolyFLayer`` recognises standard ONNX activation ops in addition to
the explicit ``PWPolyF`` custom op. This allows models that use ``nn.GELU``,
``nn.SiLU``, ``nn.Sigmoid``, or ``nn.Tanh`` to be exported with ``dynamo=True``
or ``dynamo=False`` and automatically converted to PWPolyF HW layers.

.. list-table::
   :header-rows: 1
   :widths: 20 45 20

   * - ONNX op type
     - Pattern
     - Maps to
   * - ``Gelu`` (opset 20+)
     - Single node
     - ``func="gelu"``
   * - ``Div`` + ``Erf`` + ``Add`` + ``Mul`` + ``Mul``
     - ``x * 0.5 * (1 + erf(x / sqrt(2)))``
     - ``func="gelu"``
   * - ``Sigmoid``
     - Single node (standalone)
     - ``func="sigmoid"``
   * - ``Tanh``
     - Single node
     - ``func="tanh"``
   * - ``Sigmoid`` + ``Mul``
     - ``Mul(x, Sigmoid(x))``
     - ``func="silu"``

``Gelu`` as a single ONNX node requires opset 20 or later. With lower opsets,
including the opset 18 that ``dynamo=True`` export defaults to, GELU decomposes
into a 5-node Erf-based pattern. Both forms are matched. SiLU has no standard
ONNX op and decomposes to ``Sigmoid(x) * x``. Only FLOAT32 inputs are converted.
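
For reference, the functions that the matched patterns compute can be written
in plain Python as below. This is a minimal sketch using only the formulas from
the table above; the function names are chosen for this example and the code is
not part of FINN.

.. code-block:: python

   import math


   def gelu_erf(x):
       """Erf-based GELU: x * 0.5 * (1 + erf(x / sqrt(2)))."""
       return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


   def sigmoid(x):
       """Logistic sigmoid, written to avoid overflow for large |x|."""
       if x >= 0.0:
           return 1.0 / (1.0 + math.exp(-x))
       e = math.exp(x)
       return e / (1.0 + e)


   def silu(x):
       """SiLU as it appears after export: Mul(x, Sigmoid(x))."""
       return x * sigmoid(x)


   for fn in (gelu_erf, sigmoid, math.tanh, silu):
       print(fn.__name__, fn(1.0))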

Folding
-------

PWPolyF uses PE parallelism. ``NumChannels % PE == 0`` must hold. Each PE
instantiates its own polynomial evaluation pipeline with ``degree`` DSPs.
``SetFolding`` handles PE selection automatically.

.. list-table::
   :header-rows: 1
   :widths: 10 10 15 15 15 25

   * - PE
     - Degree
     - DSPs
     - BRAM18s
     - Approx LUTs
     - Cycles per spatial position
   * - 1
     - 2
     - 2
     - 1
     - 200
     - NumChannels
   * - C
     - 2
     - 2C
     - C
     - 200C
     - 1
   * - 1
     - 3
     - 3
     - 2
     - 300
     - NumChannels
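
The folding constraint and the cycle counts in the table can be expressed as
a small Python sketch. The helper names are hypothetical, and the cycle formula
``NumChannels / PE`` is inferred from the two table rows (``PE=1`` and
``PE=C``) rather than stated explicitly.

.. code-block:: python

   def valid_pe_values(num_channels):
       """All PE values that satisfy NumChannels % PE == 0."""
       return [pe for pe in range(1, num_channels + 1) if num_channels % pe == 0]


   def cycles_per_position(num_channels, pe):
       """Cycles per spatial position, assuming NumChannels / PE."""
       if num_channels % pe != 0:
           raise ValueError("NumChannels % PE == 0 must hold")
       return num_channels // pe


   print(valid_pe_values(12))          # [1, 2, 3, 4, 6, 12]
   print(cycles_per_position(64, 1))   # 64
   print(cycles_per_position(64, 64))  # 1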

Resource Estimates
------------------

* DSP: ``degree * PE`` (one FP32 FMA stage per polynomial degree per PE)
* LUT: approximately ``100 * degree * PE`` for segment address decode and
  control
* BRAM18: ``(degree - 1) * PE`` for default ``K=3``. Vivado infers delayed
  coefficient lookups as 32-bit ROMs.
* URAM: 0
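
The estimates above can be collected into one helper, sketched here in
Python. The function name and the returned dictionary keys are made up for this
example and are not the FINN resource-estimation API; the BRAM18 figure is only
stated for the default ``K=3``.

.. code-block:: python

   def estimate_resources(pe, degree=2):
       """Resource estimates from the formulas above (BRAM18 assumes K=3)."""
       return {
           "DSP": degree * pe,           # one FP32 FMA stage per degree per PE
           "LUT": 100 * degree * pe,     # approximate
           "BRAM18": (degree - 1) * pe,  # stated for K=3 only
           "URAM": 0,
       }


   print(estimate_resources(pe=1, degree=2))
   print(estimate_resources(pe=1, degree=3))

For ``PE=1`` these values agree with the degree-2 and degree-3 rows of the
folding table.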

ONNX Export
-----------

Two export paths are supported:

* ``PWPolyFActivation`` exports as a single ``PWPolyF`` custom op via
  ``torch.autograd.Function.symbolic()``. It requires ``dynamo=False`` and
  preserves the ``K`` attribute on the ONNX node.
* Standard PyTorch modules (``nn.GELU``, ``nn.SiLU``, ``nn.Sigmoid``,
  ``nn.Tanh``) export with ``dynamo=True`` or ``dynamo=False`` and produce
  standard ONNX ops that ``InferPWPolyFLayer`` converts to PWPolyF with
  default ``K=3``.

Attributes on the explicit PWPolyF ONNX node are:

* ``func``: one of ``gelu``, ``silu``, ``sigmoid``, ``tanh``
* ``K``: mantissa subdivision bits, default 3

Node Attributes
---------------

.. list-table::
   :header-rows: 1
   :widths: 25 15 45

   * - Attribute
     - Type
     - Description
   * - ``func``
     - string
     - Activation function name
   * - ``K``
     - int
     - Mantissa subdivision bits, default 3
   * - ``degree``
     - int
     - Polynomial degree / FMA stages, default 2
   * - ``NumChannels``
     - int
     - Number of channels in the last input dimension
   * - ``PE``
     - int
     - Processing elements
   * - ``inputDataType``
     - string
     - Input data type, always FLOAT32
   * - ``outputDataType``
     - string
     - Output data type, always FLOAT32
   * - ``numInputVectors``
     - ints
     - Batch/spatial dimensions

Supported Functions
-------------------

.. list-table::
   :header-rows: 1
   :widths: 20 20 30

   * - Function
     - Negative clamp
     - Positive behaviour
   * - GELU
     - 0.0
     - passthrough (``y=x``)
   * - SiLU
     - 0.0
     - passthrough (``y=x``)
   * - Sigmoid
     - 0.0
     - clamp to 1.0
   * - Tanh
     - -1.0
     - clamp to 1.0
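
The table can be read as a per-function rule for inputs outside the range
covered by the polynomial segments. The sketch below encodes it in Python under
that assumption. The dictionary and function names are hypothetical, and the
input magnitude at which the hardware switches from polynomial evaluation to
clamping is not given here, so the sketch only models the out-of-range result.

.. code-block:: python

   # (negative clamp, positive clamp); None means passthrough (y = x)
   OUT_OF_RANGE = {
       "gelu": (0.0, None),
       "silu": (0.0, None),
       "sigmoid": (0.0, 1.0),
       "tanh": (-1.0, 1.0),
   }


   def out_of_range_value(func, x):
       """Result for an input assumed to lie beyond the segmented range."""
       neg_clamp, pos_clamp = OUT_OF_RANGE[func]
       if x < 0.0:
           return neg_clamp
       return x if pos_clamp is None else pos_clamp


   print(out_of_range_value("gelu", 100.0))   # 100.0
   print(out_of_range_value("tanh", -100.0))  # -1.0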

Files
-----

Python files:

.. list-table::
   :header-rows: 1
   :widths: 35 50

   * - File
     - Purpose
   * - ``util/torch_hw_modules.py``
     - PyTorch activation module, ONNX export, software simulation
   * - ``custom_op/fpgadataflow/pwpolyf.py``
     - Base HW op for shape, folding, resource estimates, cppsim
   * - ``custom_op/fpgadataflow/rtl/pwpolyf_rtl.py``
     - RTL backend for HDL generation, package generation, rtlsim, IPI
   * - ``util/pwpolyf.py``
     - Compatibility imports for existing PWPolyF utility users
   * - ``transformation/fpgadataflow/convert_to_hw_layers.py``
     - ``InferPWPolyFLayer`` transformation
   * - ``builder/build_dataflow_steps.py``
     - Build pipeline integration
   * - ``transformation/fpgadataflow/set_folding.py``
     - Folding support

RTL files:

.. list-table::
   :header-rows: 1
   :widths: 35 50

   * - File
     - Purpose
   * - ``finn-rtllib/pwpolyf/hdl/pwpolyf_pkg.sv``
     - ``func_cfg_t`` struct per activation, regenerated per K
   * - ``finn-rtllib/pwpolyf/hdl/pwpolyf.sv``
     - Polynomial evaluation pipeline using a Horner chain on DSPFP32
   * - ``finn-rtllib/pwpolyf/hdl/queue.sv``
     - Elastic FIFO for backpressure
   * - ``finn-rtllib/pwpolyf/hdl/pwpolyf_template_wrapper.v``
     - AXI-Stream wrapper template

Tests
-----

``tests/fpgadataflow/test_fpgadataflow_pwpolyf.py`` covers:

* cppsim for all supported functions, channel counts, spatial shapes, and
  foldings
* ONNX export for the explicit ``PWPolyFActivation`` path
* ``InferPWPolyFLayer`` conversion and execution
* standard op inference for Gelu, Sigmoid, Tanh, SiLU, and Erf-based GELU
* execution correctness against ``PWPolyFActivation``
* Versal-only specialization checks
* resource estimates, folded shapes, and expected cycles
* coefficient package generation for ``K`` and ``degree``
* Vivado HDL generation, RTL simulation, and stitched IP simulation
3 changes: 3 additions & 0 deletions docs/finn/reference/folding-constraints.rst
@@ -68,6 +68,9 @@ Constraint Table
   * - Pool
     - PE
     - inp_channels % PE == 0
   * - PWPolyF
     - PE
     - NumChannels % PE == 0
   * - Thresholding
     - PE
     - MH % PE == 0
9 changes: 9 additions & 0 deletions docs/finn/source_code/finn.custom_op.fpgadataflow.rst
@@ -136,6 +136,15 @@ finn.custom\_op.fpgadataflow.pool
   :undoc-members:
   :show-inheritance:

finn.custom\_op.fpgadataflow.pwpolyf
--------------------------------------

.. automodule:: finn.custom_op.fpgadataflow.pwpolyf
   :members:
   :undoc-members:
   :show-inheritance:


finn.custom\_op.fpgadataflow.streamingdataflowpartition
--------------------------------------------------------

8 changes: 8 additions & 0 deletions docs/finn/source_code/finn.custom_op.fpgadataflow.rtl.rst
@@ -45,6 +45,14 @@ finn.custom\_op.fpgadataflow.streamingfifo\_rtl
   :undoc-members:
   :show-inheritance:

finn.custom\_op.fpgadataflow.pwpolyf\_rtl
--------------------------------------------

.. automodule:: finn.custom_op.fpgadataflow.rtl.pwpolyf_rtl
   :members:
   :undoc-members:
   :show-inheritance:

finn.custom\_op.fpgadataflow.thresholding\_rtl
-------------------------------------------------------

18 changes: 18 additions & 0 deletions docs/finn/source_code/finn.util.rst
@@ -188,6 +188,24 @@ finn.util.pytorch
   :show-inheritance:


finn.util.torch_hw_modules
---------------------------

.. automodule:: finn.util.torch_hw_modules
   :members:
   :undoc-members:
   :show-inheritance:


finn.util.pwpolyf
-------------------

.. automodule:: finn.util.pwpolyf
   :members:
   :undoc-members:
   :show-inheritance:


finn.util.test
---------------------

5 changes: 5 additions & 0 deletions finn-rtllib/pwpolyf/hdl/pwpolyf.abc
@@ -0,0 +1,5 @@
import queue
read_sv pwpolyf_pkg.sv
read_sv pwpolyf.sv
setup_tb pwpolyf_tb
setup_top pwpolyf