fix: auto-repopulate persistent store after data loss by nbouvrette · Pull Request #664 · launchdarkly/ld-relay

nbouvrette · 2026-05-10T20:45:06Z

Summary

Fixes #141.

The primary motivation is Redis: when Redis restarts without persistence enabled, all data is lost, but ld-relay's streaming connection to LaunchDarkly remains open so no re-initialization is triggered. SDKs using daemon mode (such as PHP, which reads directly from Redis) receive empty flag evaluations indefinitely until ld-relay is manually restarted.

While Redis is the most common case (since it is ephemeral by default), the same class of problem can affect any persistent store -- Consul KV data could be deleted or lost during a restore, and DynamoDB items could be removed or TTL'd. This PR implements the fix for all three supported backends to provide consistent resilience regardless of store choice.

This PR adds three complementary resilience layers to detect and recover from persistent store data loss:

In-memory snapshot — streamUpdatesStoreWrapper now maintains a deep copy of the latest flags and segments, updated on every Init and Upsert from the streaming data source. Out-of-order updates are rejected via version comparison.
Circuit breaker — When Get/GetAll encounter a store read error and a snapshot is available, the wrapper serves from the snapshot and stops hitting the store. This prevents connection pool exhaustion during an outage and ensures clients continue receiving correct evaluations.
Periodic health check — A background goroutine queries the store's $inited sentinel key via a dedicated connection (bypassing the SDK's cache). When the key is missing (indicating data loss), it repopulates the store from the snapshot and clears the circuit breaker.

Architecture

flowchart TB
    subgraph normal["1 - Normal Operation"]
        LD["LaunchDarkly Stream"] -->|Init / Upsert| SW["streamUpdatesStoreWrapper"]
        SW -->|write| Store[(Persistent Store)]
        SW -->|deep copy| Snap["In-Memory Snapshot"]
        SDK["SDK Client Get/GetAll"] --> SW
        SW -->|read| Store
    end

    subgraph failure["2 - Store Failure: Circuit Breaker"]
        SDK2["SDK Client Get/GetAll"] --> SW2["streamUpdatesStoreWrapper"]
        SW2 -->|read fails| Store2[(Store X)]
        SW2 -->|circuit opens, serve from snapshot| Snap2["In-Memory Snapshot"]
    end

    subgraph recovery["3 - Recovery: Health Check"]
        HC["StoreHealthCheck (every 30s)"] -->|"EXISTS $inited"| Store3[(Persistent Store)]
        Store3 -->|key missing| HC
        HC -->|"RepopulateStore(snapshot)"| SW3["streamUpdatesStoreWrapper"]
        SW3 -->|repopulate| Store3
        HC -->|clear circuit breaker| SW3
    end

    normal ~~~ failure
    failure ~~~ recovery

    style Store fill:#d4edda,stroke:#28a745
    style Store2 fill:#f8d7da,stroke:#dc3545
    style Store3 fill:#fff3cd,stroke:#ffc107
    style Snap fill:#d1ecf1,stroke:#17a2b8
    style Snap2 fill:#d1ecf1,stroke:#17a2b8

Supported backends

The health check works with all three persistent store backends:

Backend	Sentinel key	Check mechanism
Redis	`{prefix}:$inited`	`EXISTS` command via dedicated Redigo pool
Consul	`{prefix}/$inited`	`KV.Get` via dedicated Consul client
DynamoDB	partition+sort = `{prefix}:$inited`	`GetItem` via dedicated DynamoDB client

Configuration

A new optional healthCheckInterval field for each backend (default: 30s, set to 0 to disable):

[Redis]
    healthCheckInterval = 30s

[Consul]
    healthCheckInterval = 30s

[DynamoDB]
    healthCheckInterval = 30s

Or via environment variables: REDIS_HEALTH_CHECK_INTERVAL, CONSUL_HEALTH_CHECK_INTERVAL, DYNAMODB_HEALTH_CHECK_INTERVAL.

Scope

The change is fully backwards-compatible — no new required configuration.
The in-memory snapshot adds minimal overhead since ld-relay already holds this data via the SDK's in-memory cache layer.

Test plan

All components were developed using test-driven development — tests were written before the implementation to verify expected behavior independently, ensuring they validate correctness rather than merely confirming existing code paths.

Unit tests for snapshot tracking: Init saves snapshot, Upsert updates snapshot, empty/nil Init does not set snapshotHasData, deep copy isolation, out-of-order version rejection
Unit tests for circuit breaker: opens on Get/GetAll errors, serves from snapshot, skips store on subsequent reads, clears on recovery, returns errors when no snapshot exists
Unit tests for health check: detects missing $inited key, triggers repopulation, clears circuit breaker, handles connection errors gracefully, stops cleanly, no-ops without snapshot
All existing tests pass (go test ./...)
Manual testing with Redis/Consul/DynamoDB restart while ld-relay is connected (recommended for reviewer)

Note

Medium Risk
Changes core persistent-store read behavior by adding snapshot fallback and a circuit breaker, plus a background health check that can repopulate Redis/Consul/DynamoDB automatically. Misconfiguration or edge cases could mask real store errors or cause unexpected write bursts during recovery.

Overview
Adds automatic recovery when a configured persistent store loses data (e.g. Redis restart) by maintaining an in-memory snapshot of the latest dataset and using it to repopulate the store.

Introduces a circuit breaker in streamUpdatesStoreWrapper: on Get/GetAll read errors, Relay serves from the snapshot and stops hitting the store until recovery is detected.

Adds a periodic data store health check (default 30s, configurable/disableable via new healthCheckInterval fields and env vars) that checks the $inited sentinel in Redis/Consul/DynamoDB via new per-backend init checkers; when missing, it repopulates the store and clears the circuit breaker. Documentation and unit tests cover snapshot, circuit breaker, and health-check behavior.

^{Reviewed by Cursor Bugbot for commit f273bed. Bugbot is set up for automated code reviews on this repo. Configure here.}

…ly#141) When Redis restarts and loses data while ld-relay's streaming connection remains open, daemon-mode SDK clients receive empty flag evaluations because no mechanism existed to detect and recover from this state. This adds three complementary resilience layers: 1. In-memory snapshot: the store wrapper now keeps a deep copy of the latest flag/segment data, updated on every Init and Upsert. 2. Circuit breaker: on Redis read errors, Get/GetAll serve from the snapshot and stop hitting Redis until the health check clears it. 3. Periodic health check: a background goroutine checks the Redis $inited sentinel key via a dedicated connection. When the key is missing, it repopulates Redis from the snapshot and clears the circuit breaker. Configuration: [Redis] HealthCheckInterval (default 30s). Co-authored-by: Cursor <cursoragent@cursor.com>

1. Snapshot version regression: updateSnapshotItem now compares versions and skips updates with older data, preventing stale flags from being written to the snapshot on out-of-order upserts. 2. Race condition in Close(): storeHealthCheck is now read under the mutex, and deferredHealthCheckStart checks c.closed before starting the health check to prevent goroutine leaks. Co-authored-by: Cursor <cursoragent@cursor.com>

Repopulation now uses RepopulateStore() which writes directly to the underlying store, bypassing saveSnapshot() and SendAllDataUpdate(). This prevents a race where a streaming Upsert between GetSnapshot() and Init() could be overwritten in both the snapshot and Redis, and avoids broadcasting stale put events to connected SSE clients. Co-authored-by: Cursor <cursoragent@cursor.com>

The health check and circuit breaker now work with all three persistent store backends, not just Redis. Each backend has its own init checker that queries the $inited sentinel key directly: - Redis: EXISTS {prefix}:$inited (via dedicated Redigo pool) - Consul: KV.Get {prefix}/$inited (via dedicated Consul client) - DynamoDB: GetItem {prefix}:$inited (via dedicated DynamoDB client) Health check interval is now configurable per backend via CONSUL_HEALTH_CHECK_INTERVAL and DYNAMODB_HEALTH_CHECK_INTERVAL environment variables (or healthCheckInterval in the config file). Setting the interval to 0 disables the health check entirely. Co-authored-by: Cursor <cursoragent@cursor.com>

When healthCheckInterval is set to 0, return early from createInitChecker before allocating the checker and its connection pool. Previously the checker was created but never stored or closed when the interval guard rejected it. Co-authored-by: Cursor <cursoragent@cursor.com>

cursor

Cursor Bugbot has reviewed your changes using default mode and found 2 potential issues.

^{Reviewed by Cursor Bugbot for commit f273bed. Configure here.}

The function copies collection and item slices but shares the underlying ItemDescriptor.Item pointers. The old name implied full deep copying which was misleading. Added a comment explaining the shared pointer semantics and why it is safe (SDK treats flag/segment objects as immutable). Co-authored-by: Cursor <cursoragent@cursor.com>

keelerm84 · 2026-05-20T16:46:14Z

@nbouvrette I am sorry I haven't responded to this yet. I have been unavailable for most of last week and will be for the remainder of this week.

I will try to take a look at this first thing on Tuesday.

Thank you for your patience with me.

nbouvrette requested a review from a team as a code owner May 10, 2026 20:45

cursor Bot reviewed May 10, 2026

View reviewed changes

Comment thread internal/store/relay_feature_store.go

Comment thread internal/relayenv/env_context_impl.go

cursor Bot reviewed May 10, 2026

View reviewed changes

Comment thread internal/store/store_health_check.go Outdated

nbouvrette and others added 2 commits May 10, 2026 17:04

nbouvrette changed the title ~~fix: repopulate Redis automatically after data loss~~ fix: auto-repopulate persistent store after data loss May 10, 2026

cursor Bot reviewed May 10, 2026

View reviewed changes

Comment thread internal/relayenv/env_context_impl.go

cursor Bot reviewed May 11, 2026

View reviewed changes

Comment thread internal/store/relay_feature_store.go

Comment thread internal/store/relay_feature_store.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: auto-repopulate persistent store after data loss#664

fix: auto-repopulate persistent store after data loss#664
nbouvrette wants to merge 6 commits into
launchdarkly:v8from
nbouvrette:fix/redis-data-loss-repopulation

nbouvrette commented May 10, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

keelerm84 commented May 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nbouvrette commented May 10, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Supported backends

Configuration

Scope

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

keelerm84 commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nbouvrette commented May 10, 2026 •

edited by cursor Bot

Loading

keelerm84 commented May 20, 2026 •

edited

Loading