Gate block-manager historical storage behind PROMPP feature flag#389
Open
vporoshok wants to merge 15 commits into
Open
Gate block-manager historical storage behind PROMPP feature flag#389vporoshok wants to merge 15 commits into
vporoshok wants to merge 15 commits into
Conversation
Add compaction plan/result logs, restore missing block-manager TSDB gauges, and cap compaction ranges by max block duration so 2h block setups follow configured bounds. Co-authored-by: Cursor <cursoragent@cursor.com>
Default to the pre-PR-377 historical TSDB path and enable block-manager only with PROMPP_FEATURES=enable_block_manager, keeping PP head+adapter as the write path in both modes. Co-authored-by: Cursor <cursoragent@cursor.com>
Expose prometheus_tsdb_blocks_loaded_by_size to track loaded block size buckets after reload and help diagnose startup spikes caused by unexpected compaction output. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the size-based gauge with prometheus_tsdb_blocks_loaded_by_duration so legacy TSDB exposes the same block-duration view as block-manager during startup diagnostics. Co-authored-by: Cursor <cursoragent@cursor.com>
Add Grafana panels for loaded block layout and normalize loaded-block duration buckets to 5 minutes in both block-manager and legacy TSDB so duration heatmaps are stable and easy to read. Co-authored-by: Cursor <cursoragent@cursor.com>
Expose loaded-block duration labels in minutes with 1-minute rounding for both block-manager and legacy TSDB paths, and update Grafana panels to query duration_minutes for clearer heatmap grouping. Co-authored-by: Cursor <cursoragent@cursor.com>
Mirror legacy tsdb "Found healthy block" output so operators can see the on-disk block layout when the block manager starts, including each block's normalized duration in minutes. Co-authored-by: Cursor <cursoragent@cursor.com>
Run reload (with deletion) and a single compaction pass sequentially in one goroutine, mirroring tsdb's compact/reload loop. This removes the race where the compactor's independent loop could plan/compact blocks that the manager's reload was concurrently deleting (open meta.json: no such file or directory) and the resulting repeated re-compaction of overlapping blocks (CPU/mmap churn). The compactor no longer runs its own goroutine, ticker or shared mutex: it exposes a one-shot Compact() called by Manager after each reload. Also render the compaction plan as a string in logs (go-kit cannot encode []string). Co-authored-by: Cursor <cursoragent@cursor.com>
When the C++ scrape parser rejects a buffer with invalid UTF-8 (in a HELP text or a label value), print the containing line, the buffer size and the line start offset to stdout. The Go-side buffer is mutated in place during parsing, so the offending bytes can only be inspected reliably here. Co-authored-by: Cursor <cursoragent@cursor.com>
After a successful compaction, immediately reload and compact again instead of waiting for the next ticker interval, so multiple pending compactions converge in one tick (mirroring tsdb's compact/reload loop). Compact now reports whether it did any work to drive the loop. Co-authored-by: Cursor <cursoragent@cursor.com>
Revert the mixin panels and example dashboard JSON added for block duration observability; keep only the code-level metrics and logging. Co-authored-by: Cursor <cursoragent@cursor.com>
e93b6cd to
f6bbeb4
Compare
u-veles-a
approved these changes
Jun 26, 2026
| } | ||
| // Reload to load the freshly created block and delete the obsolete | ||
| // parents before planning the next compaction. | ||
| if err := m.reloadBlocks(); err != nil { |
Collaborator
There was a problem hiding this comment.
In stock, they delete the compacted block in case of an unsuccessful reload:
if err := db.reloadBlocks(); err != nil {
errs := tsdb_errors.NewMulti(fmt.Errorf("reloadBlocks blocks: %w", err))
for _, uid := range uids {
if errRemoveAll := os.RemoveAll(filepath.Join(db.dir, uid.String())); errRemoveAll != nil {
errs.Add(fmt.Errorf("delete persisted block after failed db reloadBlocks:%s: %w", uid, errRemoveAll))
}
}
return errs.Err()
}
Mirror tsdb: if the reload that loads a newly compacted block fails, remove that block so a half-applied compaction does not leave orphaned blocks on disk. Compact now returns the created block ULIDs and the manager removes them on reload failure. Co-authored-by: Cursor <cursoragent@cursor.com>
vporoshok
commented
Jun 29, 2026
| // adapter; localStorage stays an empty stub. In agent mode the secondary is | ||
| // still localStorage (agent.DB set later). | ||
| // Server mode supports two historical-block storage schemes selected by | ||
| // --storage.prompp.use-block-manager: |
Collaborator
Author
There was a problem hiding this comment.
using PROMPP_FEATURE is enough, this flag is unneeded
| // describes where it came from (e.g. "help" or "label_value"). It is a | ||
| // diagnostic aid: the Go buffer is mutated in place during parsing, so the | ||
| // offending bytes can only be inspected reliably here, before they are encoded. | ||
| inline void log_invalid_utf8(std::string_view buffer, std::string_view bad, std::string_view kind) noexcept { |
Collaborator
Author
There was a problem hiding this comment.
remove this debug code
Open the historical backend (tsdb.DB or block.Manager) eagerly during setup for both schemes. The historical path does no WAL replay (the PP head + adapter is the only write path), so the open is as cheap as the block manager's initial reload and there is no need to defer it into the run group. Remove the now-unnecessary dbOpen channel and the startup gating on it in the config loader and head manager; the server storage actor only waits for shutdown and closes the fanout. Simplify tsdbHistoricalStorage now that its db is set at construction. Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the scraper invalid-UTF-8 diagnostic logging (log_invalid_utf8 helper and its call sites) that was added only for stage diagnostics. Update the storage-scheme comment to reference the PROMPP_FEATURES=enable_block_manager feature flag instead of the removed CLI flag. Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PROMPP_FEATURES=enable_block_managerWhy
A stage rollout showed a sharp CPU and mmap spike after switching storage behavior, so we need a safer migration path and better visibility before enabling the new scheme by default.
Test plan
devcontainer exec --workspace-folder . --config .devcontainer/arm/devcontainer.json go test -tags stringlabels ./pp/go/storage/block/...devcontainer exec --workspace-folder . --config .devcontainer/arm/devcontainer.json go test -tags stringlabels ./cmd/prometheus/...Made with Cursor