Heartbeat backfill #1191
Merged
bbalser approved these changes May 15, 2026
Backfill CLI command for heartbeats
Many of the early heartbeat files contain ~500k valid heartbeats.
Locally, I've been running with --batch-size 20,000,000 so as not to create a snapshot per file. This gives us an Arrow file of ~2.5 GB before upload. The Parquet files in Iceberg are ~32 MB.
Because of the size of the heartbeat files, and the constraints we've been running the backfill jobs with, I would also suggest providing --batch-timeout 30min so we don't roll at the default 1min and create a load of snapshots while waiting on parsing.
(--batch-size and --batch-timeout were also added as args for bans, speedtests, and speedtest-avgs.)
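To illustrate why the two flags interact, here is a rough model of a batch that rolls on whichever limit is hit first. It is only a sketch: BatchPolicy, count_snapshots, and the 40-second parse time per file are hypothetical, and it assumes each roll produces exactly one snapshot, which may not match the real CLI's behaviour.

```python
from dataclasses import dataclass


@dataclass
class BatchPolicy:
    """Hypothetical size-or-timeout roll policy; not the CLI's actual logic."""
    batch_size: int         # rows allowed in a batch before it rolls
    batch_timeout_s: float  # seconds a batch may stay open before it rolls


def count_snapshots(policy, file_rows, parse_seconds_per_file):
    """Count batch rolls for a run of files, assuming one snapshot per roll."""
    snapshots = 0
    rows_in_batch = 0
    batch_age_s = 0.0
    for rows in file_rows:
        rows_in_batch += rows
        batch_age_s += parse_seconds_per_file
        # Roll when either limit is reached, whichever comes first.
        if rows_in_batch >= policy.batch_size or batch_age_s >= policy.batch_timeout_s:
            snapshots += 1
            rows_in_batch = 0
            batch_age_s = 0.0
    if rows_in_batch > 0:
        snapshots += 1  # flush the final partial batch
    return snapshots


files = [500_000] * 80  # ~500k valid heartbeats per file, as in the description
parse_s = 40.0          # assumed parse time per file, purely illustrative

default_timeout = BatchPolicy(batch_size=20_000_000, batch_timeout_s=60)
long_timeout = BatchPolicy(batch_size=20_000_000, batch_timeout_s=30 * 60)

print(count_snapshots(default_timeout, files, parse_s))  # 40
print(count_snapshots(long_timeout, files, parse_s))     # 2
```

Under these assumptions, the 1min timeout rolls a batch every two files, long before the 20,000,000-row limit (40 files of ~500k rows) is reached. With 30min the row limit is what triggers the roll.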