A parallel, self-balancing CI pipeline for FINN [CI 8/9] #1608
Open
merkelmarrow wants to merge 1 commit into
Add the FINN build pipeline (ci/Jenkinsfile) and the finn_ci CLI that backs it, building on the sharding core and timing plugin.

The Jenkinsfile carries no matrix logic of its own. It calls "python3 -m finn_ci" for the shard plan, timing state, retention, and LSF parsing, so the board and stage tables stay the single source of truth in finn_ci.config and the Groovy side stays as thin as possible.

The pipeline runs four stages:
- Validate: load the shard plan in one subprocess, prepare a timing snapshot, check the executor budget, reap orphaned LSF jobs and rotate the shared image, artifact, and snapshot trees.
- Build Docker Image: build the image with run-docker.sh and publish it to NFS so the test shards load it instead of rebuilding.
- Run Tests: fan out one parallel branch per shard, aggregate one board zip per (hwTestType, board), refresh the timing master and merge the reports.

Extend the finn_ci package with the CLI (finn_ci.__main__) and three stdlib-only submodules it dispatches to:
- timing: the self-maintaining per-group timing master and the per-shard wall-clock summary.
- retention: numbered-tree rotation for the image, artifact, and snapshot trees, plus pip-cache pruning, tolerant of concurrent deletion on shared NFS.
- lsf: bjobs orphan-job name parsing for the build reaper.

Move the JUnit failure printer into finn_ci.failures and give finn_ci.jsonio an atomic writer. Add ci/common.groovy for the shared Groovy helpers and a set of ci/scripts shell helpers for the zip staging, failure-log capture, LSF summary and image publish.

Rework run-docker.sh for a build-scoped shared image directory. Rename FINN_DOCKER_SHARED_DIR to FINN_DOCKER_SHARED_IMAGE_DIR to better reflect its purpose, add a read-only print-tag subcommand for the publish step, fail fast when prebuilt mode has no usable image, and add an optional host cache for model weights via FINN_DOCKER_CACHE_DIR. Gate the XRT .deb copy on the file rather than the directory so an empty cache dir no longer breaks the build.

Relocate the Jenkinsfiles (except Jenkinsfile_HW, for compatibility reasons mid-migration) under ci/. Install tcsh in the image for the LSF esub scripts, and drop the now-unused pytest-forked dependency.

Add a maintainer guide at ci/README.md and a drift guard test (test_ci_config_sync) that fails if the Jenkinsfile STAGES choices or the README table fall out of step with finn_ci.config. Add unit tests for the CLI, timing, retention, and LSF parsing.

With FINN_CI_NFS_ROOT unset the pipeline still runs in a local fallback mode, building the image on each agent and skipping the shared caches, timing master, and build-to-HW handoff.

Signed-off-by: Marco Blackwell <mblackwe@amd.com>
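The atomic writer mentioned for finn_ci.jsonio is not shown on this page. As a rough illustration of the usual technique, here is a minimal stdlib-only sketch: write to a temporary file in the destination directory, then rename it over the target. The function name `write_json_atomic` and its signature are assumptions made for this example, not the package's actual API.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path, payload):
    """Write payload as JSON so readers never observe a half-written file.

    Hypothetical helper for illustration; not the real finn_ci.jsonio API.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target: os.replace is only atomic when
    # source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        # Atomically swap the finished file into place.
        os.replace(tmp_name, path)
    except BaseException:
        # Best-effort cleanup of the temp file, then re-raise the error.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A reader that opens the target at any moment sees either the previous complete file or the new complete file, which is the property a shared timing master needs when several builds touch it.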
This is PR 8 of 9 of a series intended to make CI faster and more robust.
This PR overhauls FINN's main Jenkins build job with a parallel build pipeline. Instead of one long serial run, the test matrix is split into shards that run side by side across the available build agents, and the pipeline learns from past runs to keep those shards balanced over time.
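To make the balancing idea concrete, here is a small sketch of one common way to split work by recorded duration: greedy longest-first assignment onto the currently lightest shard. This is an assumption about the general technique, not a copy of the sharding core in `finn_ci`; the function name `plan_shards`, the `default_seconds` fallback and the input shape are hypothetical.

```python
import heapq


def plan_shards(durations, shard_count, default_seconds=60.0):
    """Greedy longest-first split of test groups across shards.

    durations maps a group name to its last recorded seconds, or None when
    no timing exists yet. Illustrative only; names are hypothetical.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Min-heap of (accumulated_seconds, shard_index): lightest shard on top.
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]
    # Groups without history get a default weight so they are still placed.
    weighted = [
        (default_seconds if seconds is None else float(seconds), name)
        for name, seconds in durations.items()
    ]
    # Longest first; ties broken by name to keep the plan deterministic.
    for seconds, name in sorted(weighted, key=lambda item: (-item[0], item[1])):
        total, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (total + seconds, index))
    return shards
```

Feeding each run's measured durations back into the next plan is what lets a scheme like this stay balanced as tests are added or change in cost.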
The pipeline itself delegates as much logic as possible to Python. All of the real logic (such as which tests run, how they are sharded, how long they took last time, what to keep on disk, how to clean up) lives in the dependency-free `finn_ci` Python package, which can be unit tested easily. The Jenkins side calls into it frequently. This is the main strategy shift.
Why this matters
Using it
For most contributors, nothing changes. You still write tests with the same markers, and you can run exactly what CI runs locally with a regular pytest invocation through Docker. There is a README guide for contributors and maintainers included with this change.
For Jenkins operators, a single shared-storage directory is the only additional setting needed. With it set, the pipeline gets the shared image cache, persistent timing history, per-agent download caches and hardware bitstream handoff. Retention is handled automatically, with old images, build artifacts and timing snapshots pruned according to the policy in `finn_ci.retention`.
This change also adds some optional helpers for LSF integration.
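As an illustration of what numbered-tree rotation can look like, the sketch below keeps the highest-numbered subdirectories and deletes the rest, without failing if another build removes the same tree first. The function name `rotate_numbered_tree` and the `keep` parameter are assumptions for this example; the actual policy and API live in `finn_ci.retention` and are not reproduced here.

```python
import shutil
from pathlib import Path


def rotate_numbered_tree(root, keep):
    """Delete all but the `keep` highest-numbered subdirectories of root.

    Returns the names that are gone afterwards. Hypothetical helper, not the
    real finn_ci.retention API.
    """
    if keep < 0:
        raise ValueError("keep must not be negative")
    root = Path(root)
    try:
        entries = list(root.iterdir())
    except FileNotFoundError:
        # The whole tree vanished (or never existed): nothing to rotate.
        return []
    numbered = sorted(
        (e for e in entries if e.name.isdecimal() and e.is_dir()),
        key=lambda e: int(e.name),
    )
    # numbered[:-0] would be empty, so keep == 0 is handled separately.
    doomed = numbered[:-keep] if keep else numbered
    removed = []
    for entry in doomed:
        # ignore_errors: on shared storage another build may be deleting the
        # same directory at the same time, which is not an error here.
        shutil.rmtree(entry, ignore_errors=True)
        if not entry.exists():
            removed.append(entry.name)
    return removed
```

Sorting by the integer value rather than the string keeps build 10 newer than build 9, and non-numeric entries such as a `latest` directory are left alone.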
Behaviour
All logic has been tried and tested in a real lab environment. The failure paths learned from shared NFS, flaky build agents, orphaned LSF jobs, aborted builds etc. are much of the value of the pipeline. On failure, it surfaces the relevant tool logs and per-test failure summaries directly in the build output, and fails fast with clear errors when something can be caught at the validation stage.
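For a sense of how per-test failure summaries can be pulled out of a JUnit report, here is a minimal sketch using only the standard library. It assumes the conventional pytest-style JUnit XML layout (`testcase` elements with optional `failure` or `error` children); the function name `summarize_failures` and the output format are hypothetical and do not reflect the printer in `finn_ci.failures`.

```python
import xml.etree.ElementTree as ET


def summarize_failures(junit_xml):
    """Return one summary line per failed or errored test case.

    Illustrative sketch; not the real finn_ci.failures implementation.
    """
    root = ET.fromstring(junit_xml)
    lines = []
    # iter() finds <testcase> at any depth, with or without <testsuites>.
    for case in root.iter("testcase"):
        for outcome in ("failure", "error"):
            node = case.find(outcome)
            if node is None:
                continue
            # Prefer the message attribute, fall back to the element text.
            message = (node.get("message") or node.text or "").strip()
            first_line = message.splitlines()[0] if message else "(no message)"
            test_id = "::".join(
                part for part in (case.get("classname"), case.get("name")) if part
            )
            lines.append(f"{outcome.upper()} {test_id}: {first_line}")
            break  # report each test case once
    return lines
```

Printing only the first line of each message keeps the build log scannable; the full traceback remains in the report file.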
The pipeline is now much more resilient and adapts to its environment: it can safely delegate tool calls to an HPC cluster or run everything locally, scale to any number of build machines, and use networked storage or fall back to local scratch.
It also lays the groundwork for the HW Jenkinsfile. Successful builds aggregate one package per board with a readiness marker that a later hardware pipeline will pick up (PR 9 of this series).
Breaking changes
As it stands, `Jenkinsfile_HW` has not been moved or changed. This is so that it can continue running with any saved artifacts in the legacy `ARTIFACT_DIR` until the migration is complete. The location of the CI, Brevitas, and full Jenkinsfiles will need to be updated in the relevant DSL jobs on the deployment side (they were moved to the `ci/` subdirectory). This is a breaking change for anyone running Jenkins with FINN. There are no other breaking changes; configuring a shared storage location (`FINN_CI_NFS_ROOT`) on your deployment only adds functionality.
Test/review
I recommend reading the included README during review. The `finn_ci` package is unit tested without a FINN install. To run the tests locally:
python -m pytest tests/util/test_finn_ci_*.py tests/util/test_ci_config_sync.py