A parallel, self-balancing CI pipeline for FINN [CI 8/9] #1608

Open
merkelmarrow wants to merge 1 commit into Xilinx:dev from merkelmarrow:8-ci-build-pipeline-pr

This is PR 8 of 9 of a series intended to make CI faster and more robust.

This PR replaces FINN's main Jenkins build job with a parallel build pipeline. Instead of one long serial run, the test matrix is split into shards that run side by side across the available build agents, and the pipeline learns from past runs to keep those shards balanced over time.

The pipeline itself delegates as much logic as possible to Python. All of the real logic (which tests run, how they are sharded, how long they took last time, what to keep on disk, how to clean up) lives in the dependency-free finn_ci Python package, which is easy to unit test. The Jenkins side calls into it frequently rather than carrying that logic itself. This is the main shift in strategy.

Why this matters

  • Much faster CI. Work that used to run in sequence now runs in parallel, cutting CI times by about 75%.
  • Self-balancing. The pipeline records how long each group of tests takes, and uses that history to spread work evenly across shards automatically.
  • Reliability and observability. The diff is very large, but a large proportion of it is mechanical error handling and error observability, which is the difference between a CI pipeline that can be repaired easily and a flaky one that gets abandoned. These robustness measures were developed and refined in response to real failure modes observed across more than 80 Jenkins runs over 3 months.
  • Testability. CI logic now mainly lives in testable Python, and configuration drift between the pipeline definition, the docs, and the Python source of truth is caught by a unit test.
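As a rough illustration of the self-balancing idea, the sketch below assigns test groups to shards with a greedy longest-first heuristic driven by recorded durations. It is not the finn_ci implementation: the function name `balance_shards`, the `default_seconds` fallback for groups with no history, and the group names in the usage note are all hypothetical.

```python
import heapq


def balance_shards(group_durations, num_shards, default_seconds=60.0):
    """Assign test groups to shards, slowest first, always to the lightest shard.

    group_durations maps a group name to its last recorded duration in
    seconds, or None when no timing history exists yet (an assumption of
    this sketch; such groups are costed at default_seconds).
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")

    def cost(seconds):
        return default_seconds if seconds is None else float(seconds)

    # Heap entries are (total_seconds, shard_index), so the lightest shard
    # pops first and ties resolve to the lowest shard index.
    heap = [(0.0, index) for index in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]

    # Sort by descending cost, then by name so the plan is deterministic.
    ordered = sorted(group_durations.items(), key=lambda kv: (-cost(kv[1]), kv[0]))
    for name, seconds in ordered:
        total, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (total + cost(seconds), index))
    return shards
```

With a history of `{"a": 300, "b": 200, "c": 100, "d": 100, "e": None}` and two shards, this places `a` and `d` on one shard (400 s) and `b`, `c` and `e` on the other (360 s with the default applied to `e`).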

Using it

For most contributors, nothing changes. You still write tests with the same markers, and you can run exactly what CI runs locally with a regular pytest invocation through Docker. There is a README guide for contributors and maintainers included with this change.

For Jenkins operators, a single shared-storage directory is the only additional setting needed. With it set, the pipeline gets the shared image cache, persistent timing history, per-agent download caches and hardware bitstream handoff. Retention is handled automatically, with old images, build artifacts and timing snapshots pruned according to the pruning policy in finn_ci.retention.

This change also adds some optional helpers for LSF integration.

Behaviour

All logic has been tried and tested in a real lab environment. Handling for the failure paths encountered there (shared NFS, flaky build agents, orphaned LSF jobs, aborted builds and so on) accounts for much of the pipeline's value. On failure, it surfaces the relevant tool logs and per-test failure summaries directly in the build output, and fails fast with clear errors when something can be caught at the validation stage.

The pipeline is now much more resilient and adapts to its environment: it can safely delegate tool calls to an HPC cluster or run everything locally, adapt to any number of build machines, and use networked storage or fall back to local scratch.

It also lays the groundwork for the HW Jenkinsfile. Successful builds aggregate one package per board with a readiness marker that a later hardware pipeline will pick up (PR 9 of this series).

Breaking changes

As it stands, Jenkinsfile_HW has not been moved or changed, so that it can continue running with any saved artifacts in the legacy ARTIFACT_DIR until the migration is complete. The CI, Brevitas, and full Jenkinsfiles have moved to the ci/ subdirectory, so their locations will need to be updated in the relevant DSL jobs on the deployment side. This is a breaking change for anyone running Jenkins with FINN. There are no other breaking changes; everything else is added functionality, enabled by configuring a shared storage location (FINN_CI_NFS_ROOT) on your deployment.

Test/review

I recommend reading the included README during review. The finn_ci package is unit tested without a FINN install. To run the tests locally:

python -m pytest tests/util/test_finn_ci_*.py tests/util/test_ci_config_sync.py

Add the FINN build pipeline (ci/Jenkinsfile) and the finn_ci CLI that
backs it, building on the sharding core and timing plugin. The
Jenkinsfile carries no matrix logic of its own. It calls
"python3 -m finn_ci" for the shard plan, timing state, retention,
and LSF parsing, so the board and stage tables stay the single source
of truth in finn_ci.config and the Groovy side stays as thin as
possible.

The pipeline runs four stages:

- Validate: load the shard plan in one subprocess, prepare a timing
  snapshot, check the executor budget, reap orphaned LSF jobs and
  rotate the shared image, artifact, and snapshot trees.
- Build Docker Image: build the image with run-docker.sh and publish
  it to NFS so the test shards load it instead of rebuilding.
- Run Tests: fan out one parallel branch per shard, aggregate one
  board zip per (hwTestType, board), refresh the timing master
  and merge the reports.

Extend the finn_ci package with the CLI (finn_ci.__main__) and three
stdlib-only submodules it dispatches to:

- timing: the self-maintaining per-group timing master and the
  per-shard wall-clock summary.
- retention: numbered-tree rotation for the image, artifact, and
  snapshot trees, plus pip-cache pruning, tolerant of concurrent
  deletion on shared NFS.
- lsf: bjobs orphan-job name parsing for the build reaper.
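To make the retention bullet concrete, here is a minimal sketch of numbered-tree rotation that tolerates another agent deleting the same directories at the same time. It only illustrates the idea and is not the code in finn_ci.retention: the function name `prune_numbered_dirs`, the `keep` parameter, and the assumption that each build lives in a directory named after its build number are all hypothetical.

```python
import shutil
from pathlib import Path


def prune_numbered_dirs(root, keep):
    """Remove all but the `keep` highest-numbered subdirectories of `root`.

    Returns the names selected for removal, oldest first.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    try:
        entries = list(Path(root).iterdir())
    except FileNotFoundError:
        # The whole tree may not exist yet, or may have just been removed.
        return []

    # Only purely numeric directory names take part in rotation. is_dir()
    # returns False for an entry that vanished after the listing.
    numbered = sorted(
        (p for p in entries if p.name.isdigit() and p.is_dir()),
        key=lambda p: int(p.name),
    )
    doomed = numbered[:-keep] if keep else numbered

    removed = []
    for path in doomed:
        # ignore_errors=True keeps the prune going if a concurrent pruner on
        # shared storage has already deleted part or all of this tree.
        shutil.rmtree(path, ignore_errors=True)
        removed.append(path.name)
    return removed
```

Sorting by the integer value rather than the string keeps build 10 newer than build 9, and non-numeric siblings are left alone.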

Move the JUnit failure printer into finn_ci.failures and give
finn_ci.jsonio an atomic writer. Add ci/common.groovy for the
shared Groovy helpers and a set of ci/scripts shell helpers for the
zip staging, failure-log capture, LSF summary and image publish.
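For reviewers unfamiliar with the pattern, an atomic JSON writer is commonly built from a temporary file in the destination directory followed by a rename. The sketch below shows that general technique under stated assumptions; it is not the finn_ci.jsonio code, and the name `write_json_atomic` is hypothetical.

```python
import json
import os
import tempfile


def write_json_atomic(path, payload):
    """Write `payload` as JSON to `path` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the destination directory so that the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        # os.replace overwrites the destination in a single rename step.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A reader that opens the target path sees either the previous complete file or the new complete file, which matters when several shards read the same timing data.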

Rework run-docker.sh for a build-scoped shared image directory.
Rename FINN_DOCKER_SHARED_DIR to FINN_DOCKER_SHARED_IMAGE_DIR to
better reflect its purpose, add a read-only print-tag subcommand
for the publish step, fail fast when prebuilt mode has no usable
image, and add an optional host cache for model weights via
FINN_DOCKER_CACHE_DIR. Gate the XRT .deb copy on the file rather
than the directory so an empty cache dir no longer breaks the
build.

Relocate the Jenkinsfiles (except Jenkinsfile_HW, for compat reasons
mid-migration) under ci/. Install tcsh in the image for the LSF
esub scripts, and drop the now-unused pytest-forked dependency.

Add a maintainer guide at ci/README.md and a drift guard test
(test_ci_config_sync) that fails if the Jenkinsfile STAGES choices
or the README table fall out of step with finn_ci.config. Add unit
tests for the CLI, timing, retention, and LSF parsing.
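The drift guard idea can be sketched as follows: pull the choices of a named parameter out of the Jenkinsfile text and compare them with the Python-side list. This is only an illustration and not the contents of test_ci_config_sync. The helper names, the assumption that the parameter is declared as a single-quoted `choice(name: ..., choices: [...])` list, and the stage names in the usage note are all hypothetical.

```python
import re


def extract_choices(jenkinsfile_text, param_name):
    """Return the choices of a declarative `choice(...)` parameter, in order.

    Assumes the form: choice(name: 'NAME', choices: ['a', 'b', ...])
    """
    pattern = re.compile(
        r"choice\s*\(\s*name\s*:\s*'"
        + re.escape(param_name)
        + r"'\s*,\s*choices\s*:\s*\[(.*?)\]",
        re.DOTALL,
    )
    match = pattern.search(jenkinsfile_text)
    if match is None:
        raise ValueError(f"no choice parameter named {param_name!r} found")
    return re.findall(r"'([^']*)'", match.group(1))


def find_drift(expected, actual):
    """Return (missing, extra): items only in expected, items only in actual."""
    missing = [item for item in expected if item not in actual]
    extra = [item for item in actual if item not in expected]
    return missing, extra
```

A guard test would then assert that `find_drift(python_side_stages, extract_choices(text, "STAGES"))` equals `([], [])`, so that adding a stage on one side without the other fails the unit tests rather than a Jenkins run.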

With FINN_CI_NFS_ROOT unset the pipeline still runs in a local
fallback mode, building the image on each agent and skipping the
shared caches, timing master, and build-to-HW handoff.

Signed-off-by: Marco Blackwell <mblackwe@amd.com>