23 commits
7857978
Add lifecycle conductor scan metrics.
benzekrimaha Apr 29, 2026
a1661ea
Propagate conductor scan ids through lifecycle tasks.
benzekrimaha Apr 29, 2026
592d911
Update lifecycle scan dashboards.
benzekrimaha Apr 29, 2026
d064eec
Clarify lifecycle scan alert wording.
benzekrimaha Apr 29, 2026
5561284
Test lifecycle conductor scan monitoring.
benzekrimaha Apr 29, 2026
43d3f90
Propagate lifecycle scan timestamps.
benzekrimaha May 21, 2026
8a19b71
Track lifecycle bucket task pickups.
benzekrimaha May 21, 2026
d7000eb
Refine lifecycle scan alerts.
benzekrimaha May 21, 2026
4520f5a
Refine lifecycle scan dashboard panels.
benzekrimaha May 21, 2026
5d84665
Close lifecycle scans on errors.
benzekrimaha May 21, 2026
13537ec
fixup! Track lifecycle bucket task pickups.
benzekrimaha Jun 1, 2026
7ab1387
fixup! Close lifecycle scans on errors.
benzekrimaha Jun 1, 2026
a24c6c2
fixup! Propagate conductor scan ids through lifecycle tasks.
benzekrimaha Jun 1, 2026
86cfaf1
fixup! Refine lifecycle scan alerts.
benzekrimaha Jun 1, 2026
d8cf09a
fixup! Refine lifecycle scan dashboard panels.
benzekrimaha Jun 1, 2026
c00900b
fixup! Propagate conductor scan ids through lifecycle tasks.
benzekrimaha Jun 1, 2026
0a2db86
fixup! Refine lifecycle scan alerts.
benzekrimaha Jun 1, 2026
32f8197
fixup! Propagate conductor scan ids through lifecycle tasks.
benzekrimaha Jun 1, 2026
516907c
fixup! Propagate conductor scan ids through lifecycle tasks.
benzekrimaha Jun 1, 2026
560d382
fixup! Propagate conductor scan ids through lifecycle tasks.
benzekrimaha Jun 1, 2026
89c323b
fixup! Track lifecycle bucket task pickups.
benzekrimaha Jun 1, 2026
e9353d6
fixup! Close lifecycle scans on errors.
benzekrimaha Jun 1, 2026
de1170d
Clarify lifecycle scan monitoring tradeoffs.
benzekrimaha Jun 1, 2026
2 changes: 2 additions & 0 deletions .github/workflows/alerts.yaml
Expand Up @@ -32,6 +32,8 @@ jobs:
lifecycle_transition_replicas=3
lifecycle_latency_warning_threshold=120
lifecycle_latency_critical_threshold=180
lifecycle_conductor_scan_warning_threshold=120
lifecycle_conductor_scan_critical_threshold=180
github_token: ${{ secrets.GIT_ACCESS_TOKEN }}

- name: Render and test replication
1 change: 1 addition & 0 deletions conf/config.json
Expand Up @@ -246,6 +246,7 @@
}
},
"concurrency": 10,
"scanMetricRetentionS": 86400,
Contributor

The retention default now lives in three places: 86400 here, and DEFAULT_SCAN_METRIC_RETENTION_S = 24 * 60 * 60 duplicated in both LifecycleMetrics.js and LifecycleConfigValidator.js. Suggest a single exported constant (re-used by the validator's .default(...)) so these can't drift apart.
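
A minimal sketch of that suggestion, under assumptions: the shared module path and the helper `resolveScanMetricRetentionS` are hypothetical and not part of this PR. The real validator would pass the constant to joi's `.default(...)`; the plain function below only stands in for that so the sketch runs without dependencies.

```javascript
// Hypothetical shared module, e.g. extensions/lifecycle/constants.js (assumed path).
// Single source of truth for the retention default, in seconds.
const DEFAULT_SCAN_METRIC_RETENTION_S = 24 * 60 * 60;

// Stand-in for the validator's `.default(DEFAULT_SCAN_METRIC_RETENTION_S)`:
// returns the default when the key is absent and rejects values that are
// not positive integers.
function resolveScanMetricRetentionS(bucketProcessorConfig = {}) {
    const value = bucketProcessorConfig.scanMetricRetentionS;
    if (value === undefined) {
        return DEFAULT_SCAN_METRIC_RETENTION_S;
    }
    if (!Number.isInteger(value) || value <= 0) {
        throw new TypeError('scanMetricRetentionS must be a positive integer');
    }
    return value;
}

// The metrics module would derive its millisecond value from the same constant.
const defaultRetentionMs = resolveScanMetricRetentionS() * 1000;
```

With this shape, conf/config.json could either omit the key or keep 86400 as an explicit override; the code-side default would exist once.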

"probeServer": {
"bindAddress": "0.0.0.0",
"port": 8553
4 changes: 4 additions & 0 deletions extensions/lifecycle/LifecycleConfigValidator.js
Expand Up @@ -11,6 +11,8 @@ const {
const { backbeatConsumer: { MAX_QUEUED_DEFAULT } } = require('../../lib/constants');
const { ValidLifecycleRules: supportedLifecycleRules } = require('arsenal').models;

const DEFAULT_SCAN_METRIC_RETENTION_S = 24 * 60 * 60;

const joiSchema = joi.object({
zookeeperPath: joi.string().required(),
bucketTasksTopic: joi.string().required(),
Expand Down Expand Up @@ -52,6 +54,8 @@ const joiSchema = joi.object({
// the processing, no need to add more here to avoid
// overloading the system
concurrency: joi.number().greater(0).default(1),
scanMetricRetentionS: joi.number().integer().positive()
.default(DEFAULT_SCAN_METRIC_RETENTION_S),
probeServer: probeServerJoi.default(),
circuitBreaker: joi.object().optional(),
},
209 changes: 204 additions & 5 deletions extensions/lifecycle/LifecycleMetrics.js
@@ -1,16 +1,42 @@
const { ZenkoMetrics } = require('arsenal').metrics;

const LIFECYCLE_LABEL_ORIGIN = 'origin';
const LIFECYCLE_LABEL_ORIGIN = 'origin';
const LIFECYCLE_LABEL_OP = 'op';
const LIFECYCLE_LABEL_STATUS = 'status';
const LIFECYCLE_LABEL_LOCATION = 'location';
const LIFECYCLE_LABEL_TYPE = 'type';
const LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID = 'conductor_scan_id';

const LIFECYCLE_MARKER_METRICS_LOCATION = '-delete-marker-';

// Keep per-scan series long enough for scraping and debugging recent overlap,
// but remove them from prom-client after a configurable retention interval.
// We intentionally do not cap the number of tracked scan IDs: if overlapping
// scans happen, hiding older IDs would remove the signal this metric provides.
// Prometheus retains scraped scan-id series until TSDB retention expires.
const DEFAULT_SCAN_METRIC_RETENTION_S = 24 * 60 * 60;
const CONDUCTOR_ORIGIN = 'conductor';
const BUCKET_PROCESSOR_ORIGIN = 'bucket_processor';
let scanMetricRetentionMs = DEFAULT_SCAN_METRIC_RETENTION_S * 1000;

// Conductor scheduling heartbeat: timestamp (ms since epoch) of the
// instant the conductor most recently *started* a scan. Use this to
// detect "the conductor is no longer scheduling scans" via the
// LifecycleLateScan alert.
const conductorLatestBatchStartTime = ZenkoMetrics.createGauge({
name: 's3_lifecycle_latest_batch_start_time',
- help: 'Timestamp of latest lifecycle batch start time',
+ help: 'Conductor scheduling heartbeat: ms-since-epoch timestamp of ' +
+ 'the most recent scan start. Use to detect that the conductor is ' +
+ 'still scheduling scans (LifecycleLateScan alert).',
labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

// Conductor scan-end timestamp (ms since epoch) of the last scan that reached
// the listing phase.
const conductorLatestBatchEndTime = ZenkoMetrics.createGauge({
name: 's3_lifecycle_latest_batch_end_time',
help: 'Timestamp (ms since epoch) of the most recent lifecycle ' +
'conductor scan end after the scan reached bucket listing.',
labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

Expand Down Expand Up @@ -50,6 +76,102 @@ const lifecycleLegacyTask = ZenkoMetrics.createCounter({
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_STATUS],
});

const conductorLatestBatchBucketCount = ZenkoMetrics.createGauge({
name: 's3_lifecycle_latest_batch_bucket_count',
help: 'Number of buckets listed in the latest lifecycle conductor batch',
labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

const bucketProcessorScanMessagesProcessed = ZenkoMetrics.createCounter({
name: 's3_lifecycle_bucket_processor_scan_messages_processed_total',
Contributor

Naming nit: this counter is incremented at message receipt, before processing and regardless of success or object count (the JSDoc on onBucketProcessorScanMessageReceived says as much), yet it's named ..._scan_messages_processed_total and the method says ...Received. "processed" overstates what it counts. Suggest ..._scan_messages_received_total to match the semantics and the method name.

help: 'Total number of bucket-tasks topic messages picked up by this ' +
'bucket processor, grouped by conductor scan id. Each message ' +
'corresponds to a single listing slice (initial or continuation), not ' +
'a unique bucket: a bucket with multiple listings (truncated v1, or ' +
'current/noncurrent/orphan splits in v2) increments this counter once ' +
'per slice. Multiple conductor_scan_id label values over the same ' +
'query window indicate that bucket processors recently handled work ' +
'from different scans. Normal operation is expected to expose one ' +
'scan id at a time; scan-id series are removed locally after the ' +
'configured bucket processor retention interval without update to ' +
'avoid unbounded process memory growth. ' +
'Prometheus retains scraped scan-id series until TSDB retention.',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],

The conductor_scan_id label creates a new time series per scan (UUID). While the 24h removeStaleBucketProcessorScanMetrics cleanup bounds prom-client process memory, every unique scan ID still produces a distinct series in the Prometheus TSDB until it becomes stale (5 min after the last scrape). With a typical lifecycle interval of ~6 min, that is ~240 distinct label values/day/pod. This is manageable but worth documenting: if the scan interval is ever shortened (e.g. 1 min), cardinality rises proportionally. Consider adding a note in the metric help string about the expected cardinality bounds and the cleanup mechanism.

— Claude Code

Contributor

Suggested change
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
scan-id series are removed locally some hours after their latest update, to avoid excessively high cardinality.

@benzekrimaha The "24h" should also be configurable (in backbeat), so that the compromise could be tweaked on a platform, based on cronjob value and typical scan duration.

Contributor Author

Done. The local per-scan series cleanup is now configurable with bucketProcessor.scanMetricRetentionS, defaulting to 24h. I also kept the metric help text concise: it says series are removed locally after the configured bucket-processor retention interval and that Prometheus keeps scraped series until TSDB retention.

Comment thread
benzekrimaha marked this conversation as resolved.

The conductor_scan_id label on bucketProcessorScanMessagesProcessed has unbounded cardinality — each scan produces a new UUID. The local cleanup timer limits prom-client memory, but Prometheus TSDB retains every scraped series until its own retention expires. Over weeks/months this can cause significant storage growth and slow queries on the TSDB side.

Consider whether you've estimated the TSDB cardinality impact (e.g. one new scan-id per hour × 30-day retention = 720 label values × replicas). If this is expected and acceptable, no change needed, but it's worth documenting the expected growth rate in a runbook or the alert description.

— Claude Code
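
The estimates quoted in this thread can be reproduced with a small calculation. This is a rough sketch, assuming one new scan id per conductor scan and that every replica handles at least one message from every scan; the function names are hypothetical and the inputs are example values, not measurements.

```javascript
// Rough estimate of distinct conductor_scan_id label values.
// All inputs are assumptions supplied by the caller, not measured values.
function scanIdsPerDay(scanIntervalMinutes) {
    return Math.floor((24 * 60) / scanIntervalMinutes);
}

// Upper bound on scan-id series kept by Prometheus over its retention window,
// assuming every replica handles at least one message from every scan.
function scanIdSeriesOverRetention(scanIntervalMinutes, retentionDays, replicas) {
    return scanIdsPerDay(scanIntervalMinutes) * retentionDays * replicas;
}

console.log(scanIdsPerDay(6));                      // 240
console.log(scanIdSeriesOverRetention(60, 30, 1));  // 720
console.log(scanIdsPerDay(1));                      // 1440
```

The count scales inversely with the scan interval and linearly with both TSDB retention and replica count, which is the growth rate a runbook entry would need to state.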

Contributor Author

This is an intentional trade-off for the overlapping-scan alert: the alert needs to distinguish scan IDs to detect bucket processors handling work from more than one conductor scan. Local prom-client cleanup is configurable through bucketProcessor.scanMetricRetentionS, and the metric help text now states that Prometheus retains scraped scan-id series until TSDB retention. I kept the wording concise to avoid over-explaining this internal troubleshooting detail in user-facing monitoring text.

Comment thread
benzekrimaha marked this conversation as resolved.

The conductor_scan_id label on bucketProcessorScanMessagesProcessed takes a unique UUID value per scan, so every scan creates a new time series. While the cleanup timer in setScanMetricTimeout bounds prom-client's in-process memory, the Prometheus server will retain all scraped series until TSDB retention expires. For a system scanning every few minutes, this could produce hundreds of unique label values per day per pod (about 1,440 at a one-minute interval). Consider whether the LifecycleBucketProcessorMultipleParallelScans alert (which is the main consumer of this cardinality) could be achieved with a lower-cardinality approach, e.g. a gauge tracking the count of distinct active scans rather than one series per scan UUID.

— Claude Code
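
For reference, the lower-cardinality alternative mentioned above could look roughly like the sketch below. It is an illustration only: `ActiveScanTracker` and `activeWindowMs` are hypothetical names, not existing code or config keys, and a real integration would publish `activeCount()` through a single unlabelled gauge.

```javascript
// Tracks which scan ids were seen recently and reports how many are active.
// `activeWindowMs` is an assumed tuning knob, not an existing config key.
class ActiveScanTracker {
    constructor(activeWindowMs) {
        this.activeWindowMs = activeWindowMs;
        this.lastSeen = new Map(); // scan id -> timestamp (ms) of latest message
    }

    // Record a message for `scanId` observed at time `nowMs`.
    touch(scanId, nowMs = Date.now()) {
        this.lastSeen.set(scanId, nowMs);
    }

    // Drop ids not seen within the window, then return the remaining count.
    activeCount(nowMs = Date.now()) {
        for (const [scanId, seenAt] of this.lastSeen) {
            if (nowMs - seenAt > this.activeWindowMs) {
                this.lastSeen.delete(scanId);
            }
        }
        return this.lastSeen.size;
    }
}

const tracker = new ActiveScanTracker(10 * 60 * 1000);
tracker.touch('scan-a', 0);
tracker.touch('scan-b', 5 * 60 * 1000);
console.log(tracker.activeCount(6 * 60 * 1000));   // 2: both scans seen within 10 min
console.log(tracker.activeCount(12 * 60 * 1000));  // 1: scan-a has aged out
```

A value above 1 would correspond to the overlapping-scan condition. The cost of this design is that the gauge no longer shows which scan ids overlapped, and each pod only counts the scans it has seen itself.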

Contributor Author

This is a known trade-off and matches the compromise discussed with François earlier. We keep conductor_scan_id as a label because it makes overlapping scans visible if the expected “one scan at a time” invariant breaks. The prom-client side is bounded by configurable local cleanup, and the help text explicitly notes that Prometheus retains scraped scan-id series until TSDB retention. So I do not think we should replace this with a lower-cardinality gauge in this PR unless we decide to change that accepted trade-off.

Comment thread
benzekrimaha marked this conversation as resolved.
});

const bucketProcessorScanMessageAgeSeconds = ZenkoMetrics.createHistogram({
name: 's3_lifecycle_bucket_processor_scan_message_age_seconds',
Contributor

The help text says the age is measured "when they finish processing in the bucket processor," but onBucketProcessorScanMessageReceived is called at message pickup (in _processBucketEntry, before fetching the bucket lifecycle config or scheduling the task). So this histogram actually measures "elapsed wall-time since the scan started, sampled at dequeue" — a backlog/lag signal, not processing time. Continuation slices also inherit the original scan-start timestamp, so age keeps growing across a long scan regardless of when a given slice was enqueued.

Either fix the help text to describe what's measured, or move the observation to actual task completion if processing time is what we want.

help: 'Age in seconds of bucket-tasks topic messages when they finish ' +
'processing in the bucket processor, measured from the conductor scan ' +
'start timestamp propagated in the message context.',
labelNames: [LIFECYCLE_LABEL_ORIGIN],
buckets: [60, 300, 600, 1800, 3600, 7200, 14400, 28800, 43200, 86400],
});

const scanMetricTimers = new Map();

function removeBucketProcessorScanMetrics(conductorScanId) {
try {
bucketProcessorScanMessagesProcessed.remove({
[LIFECYCLE_LABEL_ORIGIN]: BUCKET_PROCESSOR_ORIGIN,
[LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
});
} catch {
// Best-effort cleanup: metrics are observational only.
}
}

function setScanMetricTimeout(conductorScanId) {
const previousTimer = scanMetricTimers.get(conductorScanId);
if (previousTimer) {
clearTimeout(previousTimer);
}

// Reset retention on every message so an active scan remains observable.
// Cleanup starts only after the scan stops producing bucket-task messages.
const cleanupTimer = setTimeout(() => {
removeBucketProcessorScanMetrics(conductorScanId);
scanMetricTimers.delete(conductorScanId);
}, scanMetricRetentionMs);

scanMetricRetentionMs is a module-level mutable variable shared across all callers. configureLifecycleScanMetricRetention sets it globally, and setScanMetricTimeout reads it when scheduling timers. If multiple LifecycleBucketProcessor instances were ever created in the same process with different retention configs, the last one wins for all timers scheduled afterwards. Timers that are already scheduled keep the delay they were created with, because setTimeout reads the value at scheduling time. This is likely fine for the current single-instance deployment model, but could become a subtle bug if the architecture changes.

— Claude Code
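
If per-instance retention ever becomes necessary, one option is to move the retention value and the timer map into a closure. The sketch below is hypothetical: `createScanMetricCleaner`, `onExpire`, and the injectable `timers` argument are illustrative names, not code from this PR. The `timers` argument exists only so the behaviour can be exercised without waiting on real timers.

```javascript
// Factory that keeps the retention value and timers per instance instead of
// in module-level state. All names here are hypothetical.
function createScanMetricCleaner(retentionMs, onExpire,
    timers = { setTimeout, clearTimeout }) {
    const pending = new Map(); // scan id -> timer handle

    return {
        // (Re)start the retention timer for a scan id on every message.
        touch(scanId) {
            const previous = pending.get(scanId);
            if (previous !== undefined) {
                timers.clearTimeout(previous);
            }
            const handle = timers.setTimeout(() => {
                pending.delete(scanId);
                onExpire(scanId);
            }, retentionMs);
            // Do not keep the process alive just for metric cleanup.
            if (handle && typeof handle.unref === 'function') {
                handle.unref();
            }
            pending.set(scanId, handle);
        },
        // Cancel all timers and expire their series immediately.
        reset() {
            for (const [scanId, handle] of pending) {
                timers.clearTimeout(handle);
                onExpire(scanId);
            }
            pending.clear();
        },
        // Number of scan ids currently awaiting cleanup.
        size() {
            return pending.size;
        },
    };
}
```

Each bucket processor would create its own cleaner with its configured retention and pass a callback that removes the matching counter series, so two instances could no longer overwrite each other's setting.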

if (typeof cleanupTimer.unref === 'function') {
cleanupTimer.unref();
}
scanMetricTimers.set(conductorScanId, cleanupTimer);
}

function observeBucketProcessorScanMessageAge(conductorScanStartTimestamp) {
// Messages produced before this field existed can still be consumed during
// rolling upgrades, so skip invalid timestamps instead of logging noise.
if (typeof conductorScanStartTimestamp !== 'number' ||
!Number.isFinite(conductorScanStartTimestamp) ||
conductorScanStartTimestamp <= 0) {
return;
}
Comment on lines +145 to +149
Contributor

nit: should we make an explicit check, or just catch the error (as is already done in multiple places)?

  • it should not happen after the update
  • it may happen only if the upgrade is triggered during a lifecycle scan
  • catching the error is probably simpler and more bulletproof than trying to cover all cases

Contributor Author
@benzekrimaha benzekrimaha May 21, 2026

I kept the explicit guard here intentionally. Missing/invalid conductorScanStartTimestamp can happen during rolling upgrades for messages produced before this field existed, and that should simply skip the age observation rather than log a metric error. The rest of the metric update is still protected by the outer catch.

Contributor

we are introducing observability:

  • if it does not work, we want to know
  • if upgrade happens during scan, a few (expected) error lines is acceptable

we should also ensure we keep the try/catch block: it is a good safety measure (we often have bad or missing metrics due to bugs in the code)


const ageSeconds = (Date.now() - conductorScanStartTimestamp) / 1000;
if (ageSeconds >= 0) {
Contributor

The ageSeconds >= 0 guard drops the observation entirely when the computed age is negative, rather than clamping it to 0. Since the age is a cross-host subtraction (Date.now() here minus the conductor's Date.now() carried in the message), small negative values are expected and dropping them silently removes the fastest samples, biasing the histogram upward. Suggest observe(..., Math.max(0, ageSeconds)) so those samples still land in the lowest bucket.
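
A minimal sketch of the clamping variant, assuming the explicit timestamp guard is kept as in the PR. `clampedAgeSeconds` is a hypothetical helper name; the caller would pass its result to the histogram's `observe` and skip the observation when it is `null`.

```javascript
// Age of a message in seconds, clamped at zero so that small negative values
// caused by clock skew between hosts still count as "fresh" samples.
// Returns null for missing or invalid timestamps so the caller can skip them.
function clampedAgeSeconds(scanStartTimestampMs, nowMs = Date.now()) {
    if (typeof scanStartTimestampMs !== 'number' ||
        !Number.isFinite(scanStartTimestampMs) ||
        scanStartTimestampMs <= 0) {
        return null;
    }
    return Math.max(0, (nowMs - scanStartTimestampMs) / 1000);
}

console.log(clampedAgeSeconds(1000, 61000)); // 60
console.log(clampedAgeSeconds(2000, 1500));  // 0 (conductor clock 0.5 s ahead)
console.log(clampedAgeSeconds(undefined));   // null
```

A clamped value of 0 falls into the lowest configured bucket (60 seconds), so skewed samples are still counted instead of being dropped.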

bucketProcessorScanMessageAgeSeconds.observe({
[LIFECYCLE_LABEL_ORIGIN]: BUCKET_PROCESSOR_ORIGIN,
}, ageSeconds);
}
}

function resetLifecycleScanMetricCleanupTimers() {
Contributor

It is not clear why we would ever want to clear the timers but not the associated metrics: that would leave many series in place forever, leading to the unbounded labels we want to manage with the timers.

Contributor Author

Timer reset now also removes associated metric series.

scanMetricTimers.forEach((timer, conductorScanId) => {
clearTimeout(timer);
removeBucketProcessorScanMetrics(conductorScanId);
});
scanMetricTimers.clear();
scanMetricRetentionMs = DEFAULT_SCAN_METRIC_RETENTION_S * 1000;
}

function configureLifecycleScanMetricRetention(retentionS) {
// Called during bucket-processor startup before scan messages are consumed.
// Runtime config reload is not supported, so existing timers are not
// rescheduled when this value is set.
scanMetricRetentionMs = retentionS * 1000;
}

const lifecycleS3Operations = ZenkoMetrics.createCounter({
name: 's3_lifecycle_s3_operations_total',
help: 'Total number of S3 operations by the lifecycle processes',
Expand Down Expand Up @@ -113,11 +235,23 @@ class LifecycleMetrics {
}
}

- static onProcessBuckets(log) {
/**
* Update the conductor scheduling heartbeat. Called at the start of
* every conductor scan; consumed by the LifecycleLateScan alert to
* detect that the conductor has stopped scheduling.
*
* @param {Object} log - logger
* @param {number} scanStartTimestamp - scan start timestamp in ms
*/
static onProcessBuckets(log, scanStartTimestamp = Date.now()) {
try {
- conductorLatestBatchStartTime.set({ origin: 'conductor' }, Date.now());
+ conductorLatestBatchStartTime.set(
+ { [LIFECYCLE_LABEL_ORIGIN]: CONDUCTOR_ORIGIN },
+ scanStartTimestamp);
} catch (err) {
- LifecycleMetrics.handleError(log, err, 'LifecycleMetrics.onProcessBuckets');
+ LifecycleMetrics.handleError(log, err, 'LifecycleMetrics.onProcessBuckets', {
+ scanStartTimestamp,
+ });
}
}

Expand Down Expand Up @@ -172,6 +306,69 @@ class LifecycleMetrics {
}
}

/**
* Record metrics at the end of a full conductor scan.
* @param {Object} log - logger
* @param {number} bucketCount - total buckets listed
*/
static onConductorScanComplete(log, bucketCount) {
try {
const endTimestamp = Date.now();
conductorLatestBatchEndTime.set({
[LIFECYCLE_LABEL_ORIGIN]: CONDUCTOR_ORIGIN,
}, endTimestamp);
conductorLatestBatchBucketCount.set({
[LIFECYCLE_LABEL_ORIGIN]: CONDUCTOR_ORIGIN,
}, bucketCount);
} catch (err) {
LifecycleMetrics.handleError(
log, err, 'LifecycleMetrics.onConductorScanComplete', {
bucketCount,
}
);
}
}

/**
* Increment the count of bucket-tasks topic messages picked up by this
* bucket processor for a specific conductor scan. Called before the task
* is dispatched to the scheduler, once per Kafka message regardless of how
* many objects it covers or whether processing eventually succeeds.
*
* Note: this counts messages (initial + continuation/listing slices),
* not unique buckets. Keep one time series per conductor_scan_id so that
* overlapping scans remain visible. Old scan series are removed by a
* timer after the configured scanMetricRetentionS interval without
* update to avoid unbounded prom-client memory growth.
*
* @param {Object} log - logger
* @param {string} conductorScanId - conductor scan id from contextInfo
* @param {number} [conductorScanStartTimestamp] - conductor scan start
* timestamp from contextInfo
*/
static onBucketProcessorScanMessageReceived(
log, conductorScanId, conductorScanStartTimestamp) {
// Old conductor messages produced during rolling upgrades do not have
// a scan id. Do not create a synthetic "undefined" scan-id series.
if (!conductorScanId) {
return;
}
try {
bucketProcessorScanMessagesProcessed.inc({
[LIFECYCLE_LABEL_ORIGIN]: BUCKET_PROCESSOR_ORIGIN,
[LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
});
observeBucketProcessorScanMessageAge(conductorScanStartTimestamp);
setScanMetricTimeout(conductorScanId);
} catch (err) {
LifecycleMetrics.handleError(
log, err,
'LifecycleMetrics.onBucketProcessorScanMessageReceived',
{ conductorScanId, conductorScanStartTimestamp }
);
}
}

static onLifecycleTriggered(log, process, type, location, latencyMs) {
try {
lifecycleTriggerLatency.observe({
Expand Down Expand Up @@ -249,4 +446,6 @@ class LifecycleMetrics {
module.exports = {
LifecycleMetrics,
LIFECYCLE_MARKER_METRICS_LOCATION,
configureLifecycleScanMetricRetention,
resetLifecycleScanMetricCleanupTimers,
};