A parallel, self-balancing CI pipeline for FINN [CI 8/9] by merkelmarrow · Pull Request #1608 · Xilinx/finn

merkelmarrow · 2026-06-29T19:33:42Z

This is PR 8 of 9 of a series intended to make CI faster and more robust.

This PR overhauls FINN's main Jenkins build job with a parallel build pipeline. Instead of one long serial run, the test matrix is split into shards that run side by side across the available build agents, and the pipeline learns from past runs to keep those shards balanced over time.

The pipeline itself delegates as much logic as possible to Python. All of the real logic (such as which tests run, how they are sharded, how long they took last time, what to keep on disk, how to clean up) lives in the dependency-free finn_ci Python package that can be unit tested easily. The Jenkins side calls into it frequently. This is the main strategy shift.
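A minimal sketch of this split, under stated assumptions: the `plan` subcommand, the `ci_sketch` program name and the two-entry stage table below are hypothetical and are not taken from finn_ci. The point is only that the Python side owns the table and prints JSON, so the Groovy side would just call the command and parse the result.

```python
import argparse
import json
import sys

# Hypothetical stage table for illustration; the real tables live in finn_ci.config.
STAGES = {
    "sanity": ["util", "transform"],
    "end2end": ["board_a", "board_b"],
}


def build_plan(stage: str) -> dict:
    """Return a JSON-serialisable shard plan for one stage."""
    groups = STAGES[stage]
    return {
        "stage": stage,
        "shards": [{"id": i, "group": g} for i, g in enumerate(groups)],
    }


def main(argv=None) -> int:
    """Parse the command line, dispatch to a subcommand, print JSON on stdout."""
    parser = argparse.ArgumentParser(prog="ci_sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    plan = sub.add_parser("plan", help="print the shard plan as JSON")
    plan.add_argument("--stage", choices=sorted(STAGES), required=True)
    args = parser.parse_args(argv)

    if args.command == "plan":
        json.dump(build_plan(args.stage), sys.stdout)
        sys.stdout.write("\n")
    return 0
```

Because `build_plan` is a plain function with no Jenkins dependency, it can be unit tested directly, which is the property the paragraph above relies on.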

Why this matters

Much faster CI. Work that used to run in sequence now runs in parallel, cutting CI times by about 75%.
Self-balancing. The pipeline records how long each group of tests takes, and uses that history to spread work evenly across shards automatically.
Reliability and observability. The diff is very large, but a large share of it is mechanical error handling and error reporting, which is the difference between a CI pipeline that can be repaired easily and a flaky one that gets abandoned. These robustness measures were developed and refined in response to real failure modes observed across more than 80 Jenkins runs over three months.
Testability. CI logic now mainly lives in testable Python, and configuration drift between the pipeline definition, the docs, and the Python source of truth is caught by a unit test.
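To illustrate the self-balancing idea, here is a sketch that spreads test groups across shards using recorded durations. It uses the greedy longest-processing-time heuristic, which is a stand-in chosen for illustration; this description does not say which algorithm finn_ci uses. The function name, the `default_seconds` fallback for groups without history, and the sample data are all assumptions.

```python
import heapq


def balance_shards(groups, history, n_shards, default_seconds=60.0):
    """Assign each group to the currently lightest shard, heaviest groups first.

    groups:  iterable of group names to schedule
    history: mapping of group name -> seconds taken on a previous run
    Groups missing from history get default_seconds (an assumed fallback).
    """
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")

    weights = {g: float(history.get(g, default_seconds)) for g in groups}
    shards = [[] for _ in range(n_shards)]
    # Min-heap of (accumulated seconds, shard index).
    heap = [(0.0, i) for i in range(n_shards)]
    heapq.heapify(heap)

    # Sort by descending weight; the name is a tie-break so the plan is deterministic.
    for name, secs in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + secs, idx))
    return shards


history = {"a": 300, "b": 200, "c": 120, "d": 100}  # "e" has no history yet
print(balance_shards(["a", "b", "c", "d", "e"], history, n_shards=2))
# [['a', 'd'], ['b', 'c', 'e']]
```

In this example the two shards end up with 400 and 380 seconds of estimated work. Feeding each run's measured durations back into `history` is what would keep the split balanced as tests are added or slow down.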

Using it

For most contributors, nothing changes. You still write tests with the same markers, and you can run locally exactly what CI runs, using a regular pytest invocation through Docker. A README guide for contributors and maintainers is included with this change.

For Jenkins operators, a single shared-storage directory is the only additional setting needed. With it set, the pipeline gets the shared image cache, persistent timing history, per-agent download caches and hardware bitstream handoff. Retention is handled automatically, with old images, build artifacts and timing snapshots pruned according to the pruning policy in finn_ci.retention.
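As a rough illustration of what rotating a numbered tree can look like, the sketch below keeps the highest-numbered subdirectories and removes the rest, ignoring entries that vanish underneath it (for example because another agent is pruning the same shared directory). The function name, the `keep` parameter and the numeric-directory layout are assumptions; the actual policy is the one in finn_ci.retention.

```python
import shutil
from pathlib import Path


def rotate_numbered_tree(root: Path, keep: int) -> list:
    """Delete all but the `keep` highest-numbered subdirectories of root.

    Returns the names selected for removal. Non-numeric entries are left alone.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    try:
        entries = list(root.iterdir())
    except FileNotFoundError:
        return []  # nothing to rotate yet

    numbered = sorted(
        (p for p in entries if p.name.isdecimal() and p.is_dir()),
        key=lambda p: int(p.name),
    )
    doomed = numbered[:-keep] if keep else numbered

    removed = []
    for path in doomed:
        # ignore_errors=True tolerates files that disappear mid-walk,
        # e.g. when a concurrent job deletes the same directory.
        shutil.rmtree(path, ignore_errors=True)
        removed.append(path.name)
    return removed
```

Sorting by `int(p.name)` rather than by string matters once build numbers pass 9, since `"10"` sorts before `"2"` lexicographically.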

This change also adds some optional helpers for LSF integration.

Behaviour

All logic has been tried and tested in a real lab environment. Much of the pipeline's value lies in how it handles the failure paths learned from shared NFS, flaky build agents, orphaned LSF jobs, aborted builds and similar problems. On failure, it surfaces the relevant tool logs and per-test failure summaries directly in the build output, and it fails fast with clear errors when something can be caught at the validation stage.
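A per-test failure summary of this kind can be produced from a JUnit XML report with the standard library alone. The sketch below is an illustration, not the printer in finn_ci.failures: the function name, the output format and the `max_chars` limit are assumptions.

```python
import xml.etree.ElementTree as ET


def summarize_failures(junit_xml: str, max_chars: int = 200) -> list:
    """Return one line per failed or errored test case in a JUnit XML report."""
    root = ET.fromstring(junit_xml)
    lines = []
    # iter() finds <testcase> at any depth, so both <testsuites> and
    # <testsuite> roots are handled.
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            message = (node.get("message") or node.text or "").strip()
            first_line = message.splitlines()[0] if message else "(no message)"
            test_id = f"{case.get('classname', '?')}::{case.get('name', '?')}"
            lines.append(f"{kind.upper()} {test_id}: {first_line[:max_chars]}")
    return lines
```

Printing only the first line of each message keeps the build output scannable; the full traceback remains available in the report itself.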

The pipeline is now much more resilient and adapts to its environment: it can safely delegate tool calls to an HPC cluster or run everything locally, adapt to any number of build machines, and use networked storage or fall back to local scratch.

It also lays the groundwork for the HW Jenkinsfile. Successful builds aggregate one package per board with a readiness marker that a later hardware pipeline will pick up (PR 9 of this series).
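One common way to implement such a handoff is to write the package first and publish a small marker file last, atomically, so a consumer never sees a marker for a half-written package. The sketch below assumes a hypothetical `.ready` suffix and marker contents; the real marker format is not described here. It relies on `os.replace`, which renames atomically when source and destination are on the same filesystem.

```python
import json
import os
import tempfile
from pathlib import Path

READY_SUFFIX = ".ready"  # hypothetical marker suffix, for illustration only


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the target directory, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def mark_ready(package: Path) -> Path:
    """Publish a readiness marker next to an already complete package file."""
    if not package.is_file():
        raise FileNotFoundError(package)
    marker = package.with_name(package.name + READY_SUFFIX)
    write_json_atomic(marker, {"package": package.name, "size": package.stat().st_size})
    return marker


def ready_packages(directory: Path) -> list:
    """List package names in directory that have a readiness marker."""
    return sorted(p.name[: -len(READY_SUFFIX)] for p in directory.glob("*" + READY_SUFFIX))
```

A consumer would call `ready_packages` and ignore any package without a marker, which makes interrupted or aborted builds invisible to it.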

Breaking changes

As it stands, Jenkinsfile_HW has not been moved or changed, so that it can continue running with any saved artifacts in the legacy ARTIFACT_DIR until the migration is complete. The CI, Brevitas, and full Jenkinsfiles were moved to the ci/ subdirectory, so their locations will need to be updated in the relevant DSL jobs on the deployment side. This is a breaking change for anyone running Jenkins with FINN. There are no other breaking changes; everything else is added functionality, enabled by configuring a shared storage location (FINN_CI_NFS_ROOT) on your deployment.

Test/review

I recommend reading the included README during review. The finn_ci package is unit tested without a FINN install. To run the tests locally:

python -m pytest tests/util/test_finn_ci_*.py tests/util/test_ci_config_sync.py

Add the FINN build pipeline (ci/Jenkinsfile) and the finn_ci CLI that backs it, building on the sharding core and timing plugin.

The Jenkinsfile carries no matrix logic of its own. It calls "python3 -m finn_ci" for the shard plan, timing state, retention, and LSF parsing, so the board and stage tables stay the single source of truth in finn_ci.config and the Groovy side stays as thin as possible.

The pipeline runs four stages:
- Validate: load the shard plan in one subprocess, prepare a timing snapshot, check the executor budget, reap orphaned LSF jobs and rotate the shared image, artifact, and snapshot trees.
- Build Docker Image: build the image with run-docker.sh and publish it to NFS so the test shards load it instead of rebuilding.
- Run Tests: fan out one parallel branch per shard, aggregate one board zip per (hwTestType, board), refresh the timing master and merge the reports.

Extend the finn_ci package with the CLI (finn_ci.__main__) and three stdlib-only submodules it dispatches to:
- timing: the self-maintaining per-group timing master and the per-shard wall-clock summary.
- retention: numbered-tree rotation for the image, artifact, and snapshot trees, plus pip-cache pruning, tolerant of concurrent deletion on shared NFS.
- lsf: bjobs orphan-job name parsing for the build reaper.

Move the JUnit failure printer into finn_ci.failures and give finn_ci.jsonio an atomic writer. Add ci/common.groovy for the shared Groovy helpers and a set of ci/scripts shell helpers for the zip staging, failure-log capture, LSF summary and image publish.

Rework run-docker.sh for a build-scoped shared image directory. Rename FINN_DOCKER_SHARED_DIR to FINN_DOCKER_SHARED_IMAGE_DIR to better reflect its purpose, add a read-only print-tag subcommand for the publish step, fail fast when prebuilt mode has no usable image, and add an optional host cache for model weights via FINN_DOCKER_CACHE_DIR.

Gate the XRT .deb copy on the file rather than the directory so an empty cache dir no longer breaks the build. Relocate the Jenkinsfiles (except Jenkinsfile_HW, for compatibility reasons mid-migration) under ci/. Install tcsh in the image for the LSF esub scripts, and drop the now-unused pytest-forked dependency.

Add a maintainer guide at ci/README.md and a drift guard test (test_ci_config_sync) that fails if the Jenkinsfile STAGES choices or the README table fall out of step with finn_ci.config. Add unit tests for the CLI, timing, retention, and LSF parsing.

With FINN_CI_NFS_ROOT unset the pipeline still runs in a local fallback mode, building the image on each agent and skipping the shared caches, timing master, and build-to-HW handoff.

Signed-off-by: Marco Blackwell <mblackwe@amd.com>

merkelmarrow wants to merge 1 commit into Xilinx:dev from merkelmarrow:8-ci-build-pipeline-pr
