Changes from 9 commits
2 changes: 2 additions & 0 deletions .github/workflows/alerts.yaml
@@ -32,6 +32,8 @@ jobs:
lifecycle_transition_replicas=3
lifecycle_latency_warning_threshold=120
lifecycle_latency_critical_threshold=180
lifecycle_conductor_scan_warning_threshold=120
lifecycle_conductor_scan_critical_threshold=180
github_token: ${{ secrets.GIT_ACCESS_TOKEN }}

- name: Render and test replication
1 change: 1 addition & 0 deletions conf/config.json
@@ -246,6 +246,7 @@
}
},
"concurrency": 10,
"scanMetricRetentionMs": 86400000,
"probeServer": {
"bindAddress": "0.0.0.0",
"port": 8553
4 changes: 4 additions & 0 deletions extensions/lifecycle/LifecycleConfigValidator.js
@@ -11,6 +11,8 @@ const {
const { backbeatConsumer: { MAX_QUEUED_DEFAULT } } = require('../../lib/constants');
const { ValidLifecycleRules: supportedLifecycleRules } = require('arsenal').models;

const DEFAULT_SCAN_METRIC_RETENTION_MS = 24 * 60 * 60 * 1000;

const joiSchema = joi.object({
zookeeperPath: joi.string().required(),
bucketTasksTopic: joi.string().required(),
@@ -52,6 +54,8 @@ const joiSchema = joi.object({
// the processing, no need to add more here to avoid
// overloading the system
concurrency: joi.number().greater(0).default(1),
scanMetricRetentionMs: joi.number().integer().positive()
.default(DEFAULT_SCAN_METRIC_RETENTION_MS),
probeServer: probeServerJoi.default(),
circuitBreaker: joi.object().optional(),
},
240 changes: 235 additions & 5 deletions extensions/lifecycle/LifecycleMetrics.js
@@ -1,16 +1,60 @@
const { ZenkoMetrics } = require('arsenal').metrics;

const LIFECYCLE_LABEL_ORIGIN = 'origin';
const LIFECYCLE_LABEL_OP = 'op';
const LIFECYCLE_LABEL_STATUS = 'status';
const LIFECYCLE_LABEL_LOCATION = 'location';
const LIFECYCLE_LABEL_TYPE = 'type';
const LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID = 'conductor_scan_id';

const LIFECYCLE_MARKER_METRICS_LOCATION = '-delete-marker-';

// Keep per-scan series long enough for scraping and debugging recent overlap,
// but remove them from prom-client after a configurable retention interval.
// Prometheus retains scraped scan-id series until TSDB retention expires.
const DEFAULT_SCAN_METRIC_RETENTION_MS = 24 * 60 * 60 * 1000;
const CONDUCTOR_ORIGIN = 'conductor';
const BUCKET_PROCESSOR_ORIGIN = 'bucket_processor';
let scanMetricRetentionMs = DEFAULT_SCAN_METRIC_RETENTION_MS;

Module-level mutable state (let scanMetricRetentionMs) shared across tests. The resetLifecycleScanMetricCleanupTimers function handles this for tests, but in production, if configureLifecycleScanMetricRetention is called multiple times (e.g. multiple LifecycleBucketProcessor instances in the same process), the last call wins silently. This is likely fine given the current single-instance architecture, but worth noting.

— Claude Code

Contributor Author

This is process-wide state by design for the current bucket processor deployment model: metrics are process-wide and there is a single lifecycle bucket processor service configuration per process. Tests reset it through resetLifecycleScanMetricCleanupTimers(). If we ever run multiple differently configured lifecycle bucket processors in one process, this should be revisited with instance-scoped metric state.


scanMetricRetentionMs is module-level mutable state shared across all callers in the same process. configureLifecycleScanMetricRetention is called from LifecycleBucketProcessor constructor. If a process ever created two LifecycleBucketProcessor instances with different configs, the last one wins silently. This is probably fine for the current single-instance usage, but worth noting.

— Claude Code


// Conductor scheduling heartbeat: timestamp (ms since epoch) of the
// instant the conductor most recently *started* a scan. Use this to
// detect "the conductor is no longer scheduling scans" via the
// LifecycleLateScan alert; do NOT subtract it from latest_batch_end_time
// to derive the scan duration: while a scan is in progress, end_time is
// from the previous run and start_time has just been refreshed, so the
// difference is negative. Use s3_lifecycle_conductor_last_batch_duration_seconds
// instead.
const conductorLatestBatchStartTime = ZenkoMetrics.createGauge({
name: 's3_lifecycle_latest_batch_start_time',
- help: 'Timestamp of latest lifecycle batch start time',
help: 'Conductor scheduling heartbeat: ms-since-epoch timestamp of ' +
'the most recent scan start. Use to detect that the conductor is ' +
'still scheduling scans (LifecycleLateScan alert). Do NOT use to ' +
'derive scan duration; use ' +
's3_lifecycle_conductor_last_batch_duration_seconds for that.',
labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

// Conductor scan-completion timestamp (ms since epoch) of the last
// successfully completed scan. Useful as a "scan completed at all"
// signal; combine with conductor_last_batch_duration_seconds to know
// "the most recent scan finished N seconds ago and took M seconds".
const conductorLatestBatchEndTime = ZenkoMetrics.createGauge({
name: 's3_lifecycle_latest_batch_end_time',
help: 'Timestamp (ms since epoch) of the most recent successful ' +
'lifecycle conductor scan completion.',
labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

// Duration of the latest conductor scan, computed by the conductor itself
// at scan completion. Exposed as a gauge so dashboards can render the most
// recent batch duration directly, without computing end - start in PromQL
// (which would yield negative values mid-scan, when end is from the
// previous batch and start has just been refreshed).
const conductorLastBatchDurationSeconds = ZenkoMetrics.createGauge({
name: 's3_lifecycle_conductor_last_batch_duration_seconds',
Contributor

Isn't this redundant with s3_lifecycle_latest_batch_start_time and s3_lifecycle_latest_batch_end_time?

If we are really interested in the duration, it should be a histogram.

Contributor Author

I kept the dedicated latest-duration gauge. It is not exactly redundant with start/end timestamps because latest_batch_start_time is refreshed at scan start while latest_batch_end_time is only refreshed when a scan completes. During an in-progress scan, computing end - start can be negative or misleading. The gauge records the last completed scan duration directly. I kept it as a gauge because it represents the latest completed batch duration, not a distribution of per-bucket or per-object durations.

help: 'Duration in seconds of the latest lifecycle conductor batch, ' +
'as measured by the conductor at scan completion.',
labelNames: [LIFECYCLE_LABEL_ORIGIN],
});
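
The mid-scan problem discussed in this thread can be reproduced with a small simulation. This is a minimal sketch, not the PR's code: the `state` object and the `startScan` / `completeScan` helpers are hypothetical stand-ins for the three conductor gauges, and timestamps are passed in explicitly so the result is deterministic.

```javascript
// Hypothetical stand-ins for the three conductor gauges.
const state = { startMs: 0, endMs: 0, lastDurationSeconds: 0 };

// Mirrors the heartbeat: refreshed as soon as a scan starts.
function startScan(nowMs) {
    state.startMs = nowMs;
}

// Mirrors scan completion: end time and duration are set together.
function completeScan(nowMs) {
    state.endMs = nowMs;
    state.lastDurationSeconds = (nowMs - state.startMs) / 1000;
}

startScan(1000000);     // scan 1 starts
completeScan(1090000);  // scan 1 ends 90 s later
startScan(1400000);     // scan 2 starts and is still in progress

// Naive "end - start" now mixes scan 1's end with scan 2's start.
const naiveSeconds = (state.endMs - state.startMs) / 1000;
console.log(naiveSeconds);              // -310
console.log(state.lastDurationSeconds); // 90
```

The dedicated gauge keeps reporting the last completed scan (90 s) while the subtraction goes negative, which is the behaviour the author describes.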

@@ -50,6 +94,102 @@ const lifecycleLegacyTask = ZenkoMetrics.createCounter({
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_STATUS],
});

const conductorLatestBatchBucketCount = ZenkoMetrics.createGauge({
name: 's3_lifecycle_latest_batch_bucket_count',
help: 'Number of buckets listed in the latest lifecycle conductor batch',
labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

const bucketProcessorScanMessagesProcessed = ZenkoMetrics.createCounter({
name: 's3_lifecycle_bucket_processor_scan_messages_processed_total',
help: 'Total number of bucket-tasks topic messages picked up by this ' +
'bucket processor, grouped by conductor scan id. Each message ' +
'corresponds to a single listing slice (initial or continuation), not ' +
'a unique bucket: a bucket with multiple listings (truncated v1, or ' +
'current/noncurrent/orphan splits in v2) increments this counter once ' +
'per slice. Multiple conductor_scan_id label values over the same ' +
'query window indicate that bucket processors recently handled work ' +
'from different scans. Normal operation is expected to expose one ' +
'scan id at a time; scan-id series are removed locally after the ' +
'configured bucket processor retention interval without update to ' +
'avoid unbounded process memory growth. ' +
'Prometheus retains scraped scan-id series until TSDB retention.',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],

Unbounded label cardinality: bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount use conductor_scan_id (a per-scan UUID) as a Prometheus label. Each scan creates new time series that are never cleaned up, causing unbounded memory growth in both the app and Prometheus. Consider tracking only the current scan (without the scan-ID label) and resetting the gauge on each new scan, or using a fixed-cardinality approach.

— Claude Code

Contributor

@benzekrimaha this is an issue.

  • Since we expect only a single scan at a time, this is no problem in Prometheus: we expect backbeat to produce a new series, and Prometheus would see that the previous series is closed and eventually discard it. Fine.
  • However, in the prom-client implementation, every metric stays forever: so we either need to change this scanId label approach, or add some logic in LifecycleMetrics to eventually remove series.

Contributor

@francoisferrand, Apr 22, 2026

Maybe we actually don't need the same approach for both metrics: having a metric whose value is the ScanID (which would be an integer, not a UUID) would provide this info if we simply record the metric when we detect a new ScanID (the "time" is recorded implicitly by Prometheus).

Yet to track each "bucket" event, we may need to increment some counter... and thus would need a timer (in LifecycleMetrics!) to ensure we remove the series once we have not updated it for a "long" time.


The conductor_scan_id label creates a new time series per scan (UUID). While the 24h removeStaleBucketProcessorScanMetrics cleanup bounds prom-client process memory, every unique scan ID still produces a distinct series in the Prometheus TSDB until it becomes stale (5 min after the last scrape). With a typical lifecycle interval of ~6 min, that is ~240 distinct label values/day/pod. This is manageable but worth documenting: if the scan interval is ever shortened (e.g. 1 min), cardinality rises proportionally. Consider adding a note in the metric help string about the expected cardinality bounds and the cleanup mechanism.

— Claude Code
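
The cardinality estimate in this comment is plain arithmetic and can be checked with a short sketch. `scanIdsPerDay` is a hypothetical helper, and it assumes exactly one new scan id per scan interval per pod.

```javascript
const MINUTES_PER_DAY = 24 * 60;

// Number of distinct conductor_scan_id values one pod emits per day,
// assuming exactly one scan (and so one new id) per interval.
function scanIdsPerDay(scanIntervalMinutes) {
    return Math.floor(MINUTES_PER_DAY / scanIntervalMinutes);
}

console.log(scanIdsPerDay(6)); // 240
console.log(scanIdsPerDay(1)); // 1440
```

With a 24 h local retention, the same figure is roughly the number of scan-id series a pod holds in prom-client at any one time.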

Contributor

Suggested change
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
scan-id series are removed locally some hours after their latest update, to avoid too high a cardinality.

@benzekrimaha The "24h" should also be configurable (in backbeat), so that the compromise can be tuned per platform, based on the cronjob value and the typical scan duration.

Contributor Author

Done. The local per-scan series cleanup is now configurable with bucketProcessor.scanMetricRetentionMs, defaulting to 24h. I also kept the metric help text concise: it says series are removed locally after the configured bucket-processor retention interval and that Prometheus keeps scraped series until TSDB retention.


The conductor_scan_id label on bucketProcessorScanMessagesProcessed has unbounded cardinality — each scan produces a new UUID. The local cleanup timer limits prom-client memory, but Prometheus TSDB retains every scraped series until its own retention expires. Over weeks/months this can cause significant storage growth and slow queries on the TSDB side.

Consider whether you've estimated the TSDB cardinality impact (e.g. one new scan-id per hour × 30-day retention = 720 label values × replicas). If this is expected and acceptable, no change needed, but it's worth documenting the expected growth rate in a runbook or the alert description.

— Claude Code

Contributor Author

This is an intentional trade-off for the overlapping-scan alert: the alert needs to distinguish scan IDs to detect bucket processors handling work from more than one conductor scan. Local prom-client cleanup is configurable through bucketProcessor.scanMetricRetentionMs, and the metric help text now states that Prometheus retains scraped scan-id series until TSDB retention. I kept the wording concise to avoid over-explaining this internal troubleshooting detail in user-facing monitoring text.


The conductor_scan_id label on bucketProcessorScanMessagesProcessed can take a unique UUID value per scan. Since each scan generates a new UUID, this creates a new time series for every scan. While the cleanup timer in setScanMetricTimeout bounds prom-client's in-process memory, the Prometheus server will retain all scraped series until TSDB retention expires. For a system scanning every few minutes, this could produce hundreds of unique label values per day. Consider whether the LifecycleBucketProcessorMultipleParallelScans alert (which is the main consumer of this cardinality) could be achieved with a lower-cardinality approach, e.g. a gauge tracking the count of distinct active scans rather than one series per scan UUID.

— Claude Code

Contributor Author

This is a known trade-off and matches the compromise discussed with François earlier. We keep conductor_scan_id as a label because it makes overlapping scans visible if the expected “one scan at a time” invariant breaks. The prom-client side is bounded by configurable local cleanup, and the help text explicitly notes that Prometheus retains scraped scan-id series until TSDB retention. So I do not think we should replace this with a lower-cardinality gauge in this PR unless we decide to change that accepted trade-off.

});

const bucketProcessorScanMessageAgeSeconds = ZenkoMetrics.createHistogram({
name: 's3_lifecycle_bucket_processor_scan_message_age_seconds',
help: 'Age in seconds of bucket-tasks topic messages when they finish ' +
'processing in the bucket processor, measured from the conductor scan ' +
'start timestamp propagated in the message context.',
labelNames: [LIFECYCLE_LABEL_ORIGIN],
buckets: [60, 300, 600, 1800, 3600, 7200, 14400, 28800, 43200, 86400],
});
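
To read the bucket boundaries above, it can help to see which bucket a given message age first lands in. This is an illustrative sketch only: `smallestBucketFor` is a hypothetical helper, not part of prom-client. Prometheus histogram buckets are cumulative, so an observation is also counted in every larger bucket.

```javascript
// Upper bounds (seconds) copied from the histogram definition above.
const BUCKETS = [60, 300, 600, 1800, 3600, 7200, 14400, 28800, 43200, 86400];

// Returns the smallest upper bound that contains the observation, or
// Infinity when it only falls in the implicit +Inf bucket.
function smallestBucketFor(ageSeconds) {
    const bound = BUCKETS.find(upper => ageSeconds <= upper);
    return bound === undefined ? Infinity : bound;
}

console.log(smallestBucketFor(45));     // 60
console.log(smallestBucketFor(4000));   // 7200
console.log(smallestBucketFor(100000)); // Infinity
```

The boundaries run from 1 minute to 24 hours, so messages older than a day are only visible in the +Inf bucket.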

const scanMetricTimers = new Map();

function removeBucketProcessorScanMetrics(conductorScanId) {
try {
bucketProcessorScanMessagesProcessed.remove({
[LIFECYCLE_LABEL_ORIGIN]: BUCKET_PROCESSOR_ORIGIN,
[LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
});
} catch {
// Best-effort cleanup: metrics are observational only.
}
}

function setScanMetricTimeout(conductorScanId) {
const previousTimer = scanMetricTimers.get(conductorScanId);
if (previousTimer) {
clearTimeout(previousTimer);
}

const cleanupTimer = setTimeout(() => {
removeBucketProcessorScanMetrics(conductorScanId);
scanMetricTimers.delete(conductorScanId);
}, scanMetricRetentionMs);

scanMetricRetentionMs is a module-level mutable variable shared across all callers. configureLifecycleScanMetricRetention sets it globally, and setScanMetricTimeout reads it when scheduling timers. If multiple LifecycleBucketProcessor instances were ever created in the same process with different retention configs, the last one wins for all timers scheduled afterwards. Timers already scheduled keep the delay they were created with, because setTimeout reads the delay argument once, when it is called. This is likely fine for the current single-instance deployment model, but could become a subtle bug if the architecture changes.

— Claude Code

if (typeof cleanupTimer.unref === 'function') {
cleanupTimer.unref();
}
scanMetricTimers.set(conductorScanId, cleanupTimer);
}
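
The cleanup pattern used here is a per-key timer that restarts on every update. Below is a minimal standalone sketch of that pattern, assuming Node.js timers; the `series` map and the `touch` helper are hypothetical stand-ins for the prom-client counter and are not the PR's code.

```javascript
// Hypothetical in-memory stand-in for the per-scan metric series.
const series = new Map();
const timers = new Map();

function touch(scanId, retentionMs) {
    series.set(scanId, (series.get(scanId) || 0) + 1);

    // Restart the countdown: only the latest update matters.
    clearTimeout(timers.get(scanId));
    const timer = setTimeout(() => {
        series.delete(scanId);
        timers.delete(scanId);
    }, retentionMs);
    timer.unref(); // do not keep the process alive just for cleanup
    timers.set(scanId, timer);
}

touch('scan-a', 60000);
const firstTimer = timers.get('scan-a');
touch('scan-a', 60000);

console.log(series.get('scan-a'));                // 2
console.log(timers.size);                         // 1
console.log(timers.get('scan-a') !== firstTimer); // true
```

Each scan id keeps at most one pending timer, so the timer map grows with the number of live scan ids, not with the number of messages.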

function observeBucketProcessorScanMessageAge(conductorScanStartTimestamp) {
// Messages produced before this field existed can still be consumed during
// rolling upgrades, so skip invalid timestamps instead of logging noise.
if (typeof conductorScanStartTimestamp !== 'number' ||
!Number.isFinite(conductorScanStartTimestamp) ||
conductorScanStartTimestamp <= 0) {
return;
}
Comment on lines +161 to +165
Contributor

nit: should we make an explicit check, or just catch the error (as already done in multiple places)?

  • it should not happen after the update
  • it may happen only if the upgrade is triggered during a lifecycle scan
  • catching the error is probably simpler and more bulletproof than trying to cover all cases

Contributor Author

@benzekrimaha, May 21, 2026

I kept the explicit guard here intentionally. Missing/invalid conductorScanStartTimestamp can happen during rolling upgrades for messages produced before this field existed, and that should simply skip the age observation rather than log a metric error. The rest of the metric update is still protected by the outer catch.


const ageSeconds = (Date.now() - conductorScanStartTimestamp) / 1000;
if (ageSeconds >= 0) {
bucketProcessorScanMessageAgeSeconds.observe({
[LIFECYCLE_LABEL_ORIGIN]: BUCKET_PROCESSOR_ORIGIN,
}, ageSeconds);
}
}

function clearScanMetricTimers() {
scanMetricTimers.forEach(timer => clearTimeout(timer));
scanMetricTimers.clear();
}

function resetLifecycleScanMetricCleanupTimers() {
clearScanMetricTimers();
scanMetricRetentionMs = DEFAULT_SCAN_METRIC_RETENTION_MS;
}

function configureLifecycleScanMetricRetention(retentionMs) {
if (typeof retentionMs === 'number' &&
Number.isFinite(retentionMs) &&
retentionMs > 0) {
scanMetricRetentionMs = retentionMs;
}
}
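
The "last call wins" behaviour raised in the review threads follows directly from this setter. Here is a minimal sketch of the same validation on a process-wide variable; `configureRetention` and `retentionMs` are hypothetical names that mirror the function above, they are not exports of the module.

```javascript
const DEFAULT_RETENTION_MS = 24 * 60 * 60 * 1000;
let retentionMs = DEFAULT_RETENTION_MS; // process-wide, like the module state

function configureRetention(value) {
    if (typeof value === 'number' && Number.isFinite(value) && value > 0) {
        retentionMs = value;
    }
}

configureRetention(60000);     // first instance
configureRetention(undefined); // missing config: ignored
configureRetention(-5);        // invalid: ignored
const afterInvalid = retentionMs;
console.log(afterInvalid);     // 60000

configureRetention(3600000);   // second instance: last valid call wins
console.log(retentionMs);      // 3600000
```

Invalid values leave the previous setting untouched without raising an error, so a misconfigured second caller is silent rather than fatal.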

const lifecycleS3Operations = ZenkoMetrics.createCounter({
name: 's3_lifecycle_s3_operations_total',
help: 'Total number of S3 operations by the lifecycle processes',
@@ -113,11 +253,26 @@ class LifecycleMetrics {
}
}

- static onProcessBuckets(log) {
/**
* Update the conductor scheduling heartbeat. Called at the start of
* every conductor scan; consumed by the LifecycleLateScan alert to
* detect that the conductor has stopped scheduling. Does NOT mark a
* scan as in progress and is NOT meant to be subtracted from
* latest_batch_end_time to derive a duration: use
* onConductorScanComplete's durationSeconds for that.
*
* @param {Object} log - logger
* @param {number} scanStartTimestamp - scan start timestamp in ms
*/
static onProcessBuckets(log, scanStartTimestamp = Date.now()) {
try {
- conductorLatestBatchStartTime.set({ origin: 'conductor' }, Date.now());
conductorLatestBatchStartTime.set(
{ [LIFECYCLE_LABEL_ORIGIN]: CONDUCTOR_ORIGIN },
scanStartTimestamp);
} catch (err) {
- LifecycleMetrics.handleError(log, err, 'LifecycleMetrics.onProcessBuckets');
LifecycleMetrics.handleError(log, err, 'LifecycleMetrics.onProcessBuckets', {
scanStartTimestamp,
});
}
}

@@ -172,6 +327,79 @@ class LifecycleMetrics {
}
}

/**
* Record metrics at the end of a full conductor scan.
* @param {Object} log - logger
* @param {number} bucketCount - total buckets listed
* @param {number} [durationSeconds] - duration of the scan in seconds,
* as measured by the conductor. When provided and finite, sets the
* s3_lifecycle_conductor_last_batch_duration_seconds gauge. Optional
* for forward-compatibility with callers that do not measure it.
*/
static onConductorScanComplete(log, bucketCount, durationSeconds) {
try {
const endTimestamp = Date.now();
conductorLatestBatchEndTime.set({
[LIFECYCLE_LABEL_ORIGIN]: CONDUCTOR_ORIGIN,
}, endTimestamp);
conductorLatestBatchBucketCount.set({
[LIFECYCLE_LABEL_ORIGIN]: CONDUCTOR_ORIGIN,
}, bucketCount);
if (typeof durationSeconds === 'number' &&
Number.isFinite(durationSeconds) &&
durationSeconds >= 0) {
conductorLastBatchDurationSeconds.set({
[LIFECYCLE_LABEL_ORIGIN]: CONDUCTOR_ORIGIN,
}, durationSeconds);
}
} catch (err) {
LifecycleMetrics.handleError(
log, err, 'LifecycleMetrics.onConductorScanComplete', {
bucketCount,
durationSeconds,
}
);
}
}

/**
* Increment the count of bucket-tasks topic messages picked up by this
* bucket processor for a specific conductor scan. Called before the task
* is dispatched to the scheduler, once per Kafka message regardless of how
* many objects it covers or whether processing eventually succeeds.
*
* Note: this counts messages (initial + continuation/listing slices),
* not unique buckets. Keep one time series per conductor_scan_id so that
* overlapping scans remain visible. Old scan series are removed by a
* timer after the configured scanMetricRetentionMs interval without
* update to avoid unbounded prom-client memory growth.
*
* @param {Object} log - logger
* @param {string} conductorScanId - conductor scan id from contextInfo
* @param {number} [conductorScanStartTimestamp] - conductor scan start
* timestamp from contextInfo
*/
static onBucketProcessorScanMessageReceived(
log, conductorScanId, conductorScanStartTimestamp) {
if (!conductorScanId) {
return;
}
try {
bucketProcessorScanMessagesProcessed.inc({
[LIFECYCLE_LABEL_ORIGIN]: BUCKET_PROCESSOR_ORIGIN,
[LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
});
observeBucketProcessorScanMessageAge(conductorScanStartTimestamp);
setScanMetricTimeout(conductorScanId);
} catch (err) {
LifecycleMetrics.handleError(
log, err,
'LifecycleMetrics.onBucketProcessorScanMessageReceived',
{ conductorScanId, conductorScanStartTimestamp }
);
}
}

static onLifecycleTriggered(log, process, type, location, latencyMs) {
try {
lifecycleTriggerLatency.observe({
@@ -249,4 +477,6 @@ class LifecycleMetrics {
module.exports = {
LifecycleMetrics,
LIFECYCLE_MARKER_METRICS_LOCATION,
configureLifecycleScanMetricRetention,
resetLifecycleScanMetricCleanupTimers,
};