Skip to content

fix: auto-repopulate persistent store after data loss#664

Open
nbouvrette wants to merge 6 commits into
launchdarkly:v8from
nbouvrette:fix/redis-data-loss-repopulation
Open

fix: auto-repopulate persistent store after data loss#664
nbouvrette wants to merge 6 commits into
launchdarkly:v8from
nbouvrette:fix/redis-data-loss-repopulation

Conversation

@nbouvrette

@nbouvrette nbouvrette commented May 10, 2026

Copy link
Copy Markdown

Summary

Fixes #141.

The primary motivation is Redis: when Redis restarts without persistence enabled, all data is lost, but ld-relay's streaming connection to LaunchDarkly remains open so no re-initialization is triggered. SDKs using daemon mode (such as PHP, which reads directly from Redis) receive empty flag evaluations indefinitely until ld-relay is manually restarted.

While Redis is the most common case (since it is ephemeral by default), the same class of problem can affect any persistent store -- Consul KV data could be deleted or lost during a restore, and DynamoDB items could be removed or TTL'd. This PR implements the fix for all three supported backends to provide consistent resilience regardless of store choice.

This PR adds three complementary resilience layers to detect and recover from persistent store data loss:

  1. In-memory snapshotstreamUpdatesStoreWrapper now maintains a deep copy of the latest flags and segments, updated on every Init and Upsert from the streaming data source. Out-of-order updates are rejected via version comparison.

  2. Circuit breaker — When Get/GetAll encounter a store read error and a snapshot is available, the wrapper serves from the snapshot and stops hitting the store. This prevents connection pool exhaustion during an outage and ensures clients continue receiving correct evaluations.

  3. Periodic health check — A background goroutine queries the store's $inited sentinel key via a dedicated connection (bypassing the SDK's cache). When the key is missing (indicating data loss), it repopulates the store from the snapshot and clears the circuit breaker.

Architecture

flowchart TB
    subgraph normal["1 - Normal Operation"]
        LD["LaunchDarkly Stream"] -->|Init / Upsert| SW["streamUpdatesStoreWrapper"]
        SW -->|write| Store[(Persistent Store)]
        SW -->|deep copy| Snap["In-Memory Snapshot"]
        SDK["SDK Client Get/GetAll"] --> SW
        SW -->|read| Store
    end

    subgraph failure["2 - Store Failure: Circuit Breaker"]
        SDK2["SDK Client Get/GetAll"] --> SW2["streamUpdatesStoreWrapper"]
        SW2 -->|read fails| Store2[(Store X)]
        SW2 -->|circuit opens, serve from snapshot| Snap2["In-Memory Snapshot"]
    end

    subgraph recovery["3 - Recovery: Health Check"]
        HC["StoreHealthCheck (every 30s)"] -->|"EXISTS $inited"| Store3[(Persistent Store)]
        Store3 -->|key missing| HC
        HC -->|"RepopulateStore(snapshot)"| SW3["streamUpdatesStoreWrapper"]
        SW3 -->|repopulate| Store3
        HC -->|clear circuit breaker| SW3
    end

    normal ~~~ failure
    failure ~~~ recovery

    style Store fill:#d4edda,stroke:#28a745
    style Store2 fill:#f8d7da,stroke:#dc3545
    style Store3 fill:#fff3cd,stroke:#ffc107
    style Snap fill:#d1ecf1,stroke:#17a2b8
    style Snap2 fill:#d1ecf1,stroke:#17a2b8
Loading

Supported backends

The health check works with all three persistent store backends:

Backend Sentinel key Check mechanism
Redis {prefix}:$inited EXISTS command via dedicated Redigo pool
Consul {prefix}/$inited KV.Get via dedicated Consul client
DynamoDB partition+sort = {prefix}:$inited GetItem via dedicated DynamoDB client

Configuration

A new optional healthCheckInterval field for each backend (default: 30s, set to 0 to disable):

[Redis]
    healthCheckInterval = 30s

[Consul]
    healthCheckInterval = 30s

[DynamoDB]
    healthCheckInterval = 30s

Or via environment variables: REDIS_HEALTH_CHECK_INTERVAL, CONSUL_HEALTH_CHECK_INTERVAL, DYNAMODB_HEALTH_CHECK_INTERVAL.

Scope

  • The change is fully backwards-compatible — no new required configuration.
  • The in-memory snapshot adds minimal overhead since ld-relay already holds this data via the SDK's in-memory cache layer.

Test plan

All components were developed using test-driven development — tests were written before the implementation to verify expected behavior independently, ensuring they validate correctness rather than merely confirming existing code paths.

  • Unit tests for snapshot tracking: Init saves snapshot, Upsert updates snapshot, empty/nil Init does not set snapshotHasData, deep copy isolation, out-of-order version rejection
  • Unit tests for circuit breaker: opens on Get/GetAll errors, serves from snapshot, skips store on subsequent reads, clears on recovery, returns errors when no snapshot exists
  • Unit tests for health check: detects missing $inited key, triggers repopulation, clears circuit breaker, handles connection errors gracefully, stops cleanly, no-ops without snapshot
  • All existing tests pass (go test ./...)
  • Manual testing with Redis/Consul/DynamoDB restart while ld-relay is connected (recommended for reviewer)

Note

Medium Risk
Changes core persistent-store read behavior by adding snapshot fallback and a circuit breaker, plus a background health check that can repopulate Redis/Consul/DynamoDB automatically. Misconfiguration or edge cases could mask real store errors or cause unexpected write bursts during recovery.

Overview
Adds automatic recovery when a configured persistent store loses data (e.g. Redis restart) by maintaining an in-memory snapshot of the latest dataset and using it to repopulate the store.

Introduces a circuit breaker in streamUpdatesStoreWrapper: on Get/GetAll read errors, Relay serves from the snapshot and stops hitting the store until recovery is detected.

Adds a periodic data store health check (default 30s, configurable/disableable via new healthCheckInterval fields and env vars) that checks the $inited sentinel in Redis/Consul/DynamoDB via new per-backend init checkers; when missing, it repopulates the store and clears the circuit breaker. Documentation and unit tests cover snapshot, circuit breaker, and health-check behavior.

Reviewed by Cursor Bugbot for commit f273bed. Bugbot is set up for automated code reviews on this repo. Configure here.

…ly#141)

When Redis restarts and loses data while ld-relay's streaming connection
remains open, daemon-mode SDK clients receive empty flag evaluations
because no mechanism existed to detect and recover from this state.

This adds three complementary resilience layers:

1. In-memory snapshot: the store wrapper now keeps a deep copy of the
   latest flag/segment data, updated on every Init and Upsert.

2. Circuit breaker: on Redis read errors, Get/GetAll serve from the
   snapshot and stop hitting Redis until the health check clears it.

3. Periodic health check: a background goroutine checks the Redis
   $inited sentinel key via a dedicated connection. When the key is
   missing, it repopulates Redis from the snapshot and clears the
   circuit breaker.

Configuration: [Redis] HealthCheckInterval (default 30s).
Co-authored-by: Cursor <cursoragent@cursor.com>
@nbouvrette nbouvrette requested a review from a team as a code owner May 10, 2026 20:45
Comment thread internal/store/relay_feature_store.go
Comment thread internal/relayenv/env_context_impl.go
1. Snapshot version regression: updateSnapshotItem now compares
   versions and skips updates with older data, preventing stale
   flags from being written to the snapshot on out-of-order upserts.

2. Race condition in Close(): storeHealthCheck is now read under
   the mutex, and deferredHealthCheckStart checks c.closed before
   starting the health check to prevent goroutine leaks.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread internal/store/store_health_check.go Outdated
nbouvrette and others added 2 commits May 10, 2026 17:04
Repopulation now uses RepopulateStore() which writes directly to the
underlying store, bypassing saveSnapshot() and SendAllDataUpdate().

This prevents a race where a streaming Upsert between GetSnapshot()
and Init() could be overwritten in both the snapshot and Redis, and
avoids broadcasting stale put events to connected SSE clients.

Co-authored-by: Cursor <cursoragent@cursor.com>
The health check and circuit breaker now work with all three persistent
store backends, not just Redis. Each backend has its own init checker
that queries the $inited sentinel key directly:

- Redis: EXISTS {prefix}:$inited (via dedicated Redigo pool)
- Consul: KV.Get {prefix}/$inited (via dedicated Consul client)
- DynamoDB: GetItem {prefix}:$inited (via dedicated DynamoDB client)

Health check interval is now configurable per backend via
CONSUL_HEALTH_CHECK_INTERVAL and DYNAMODB_HEALTH_CHECK_INTERVAL
environment variables (or healthCheckInterval in the config file).

Setting the interval to 0 disables the health check entirely.

Co-authored-by: Cursor <cursoragent@cursor.com>
@nbouvrette nbouvrette changed the title fix: repopulate Redis automatically after data loss fix: auto-repopulate persistent store after data loss May 10, 2026
Comment thread internal/relayenv/env_context_impl.go
When healthCheckInterval is set to 0, return early from
createInitChecker before allocating the checker and its
connection pool. Previously the checker was created but
never stored or closed when the interval guard rejected it.

Co-authored-by: Cursor <cursoragent@cursor.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default mode and found 2 potential issues.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit f273bed. Configure here.

Comment thread internal/store/relay_feature_store.go
Comment thread internal/store/relay_feature_store.go
The function copies collection and item slices but shares the
underlying ItemDescriptor.Item pointers. The old name implied
full deep copying which was misleading. Added a comment explaining
the shared pointer semantics and why it is safe (SDK treats
flag/segment objects as immutable).

Co-authored-by: Cursor <cursoragent@cursor.com>
@keelerm84

keelerm84 commented May 20, 2026

Copy link
Copy Markdown
Member

@nbouvrette I am sorry I haven't responded to this yet. I have been unavailable for most of last week and will be for the remainder of this week.

I will try to take a look at this first thing on Tuesday.

Thank you for your patience with me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LD Relay doesn't repopulate Redis when Redis is restarted

2 participants