fix: auto-repopulate persistent store after data loss#664
Open
nbouvrette wants to merge 6 commits into
Open
Conversation
…ly#141) When Redis restarts and loses data while ld-relay's streaming connection remains open, daemon-mode SDK clients receive empty flag evaluations because no mechanism existed to detect and recover from this state. This adds three complementary resilience layers: 1. In-memory snapshot: the store wrapper now keeps a deep copy of the latest flag/segment data, updated on every Init and Upsert. 2. Circuit breaker: on Redis read errors, Get/GetAll serve from the snapshot and stop hitting Redis until the health check clears it. 3. Periodic health check: a background goroutine checks the Redis $inited sentinel key via a dedicated connection. When the key is missing, it repopulates Redis from the snapshot and clears the circuit breaker. Configuration: [Redis] HealthCheckInterval (default 30s). Co-authored-by: Cursor <cursoragent@cursor.com>
1. Snapshot version regression: updateSnapshotItem now compares versions and skips updates with older data, preventing stale flags from being written to the snapshot on out-of-order upserts. 2. Race condition in Close(): storeHealthCheck is now read under the mutex, and deferredHealthCheckStart checks c.closed before starting the health check to prevent goroutine leaks. Co-authored-by: Cursor <cursoragent@cursor.com>
Repopulation now uses RepopulateStore() which writes directly to the underlying store, bypassing saveSnapshot() and SendAllDataUpdate(). This prevents a race where a streaming Upsert between GetSnapshot() and Init() could be overwritten in both the snapshot and Redis, and avoids broadcasting stale put events to connected SSE clients. Co-authored-by: Cursor <cursoragent@cursor.com>
The health check and circuit breaker now work with all three persistent
store backends, not just Redis. Each backend has its own init checker
that queries the $inited sentinel key directly:
- Redis: EXISTS {prefix}:$inited (via dedicated Redigo pool)
- Consul: KV.Get {prefix}/$inited (via dedicated Consul client)
- DynamoDB: GetItem {prefix}:$inited (via dedicated DynamoDB client)
Health check interval is now configurable per backend via
CONSUL_HEALTH_CHECK_INTERVAL and DYNAMODB_HEALTH_CHECK_INTERVAL
environment variables (or healthCheckInterval in the config file).
Setting the interval to 0 disables the health check entirely.
Co-authored-by: Cursor <cursoragent@cursor.com>
When healthCheckInterval is set to 0, return early from createInitChecker before allocating the checker and its connection pool. Previously the checker was created but never stored or closed when the interval guard rejected it. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default mode and found 2 potential issues.
Reviewed by Cursor Bugbot for commit f273bed. Configure here.
The function copies collection and item slices but shares the underlying ItemDescriptor.Item pointers. The old name implied full deep copying which was misleading. Added a comment explaining the shared pointer semantics and why it is safe (SDK treats flag/segment objects as immutable). Co-authored-by: Cursor <cursoragent@cursor.com>
Member
|
@nbouvrette I am sorry I haven't responded to this yet. I have been unavailable for most of last week and will be for the remainder of this week. I will try to take a look at this first thing on Tuesday. Thank you for your patience with me. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Fixes #141.
The primary motivation is Redis: when Redis restarts without persistence enabled, all data is lost, but ld-relay's streaming connection to LaunchDarkly remains open so no re-initialization is triggered. SDKs using daemon mode (such as PHP, which reads directly from Redis) receive empty flag evaluations indefinitely until ld-relay is manually restarted.
While Redis is the most common case (since it is ephemeral by default), the same class of problem can affect any persistent store -- Consul KV data could be deleted or lost during a restore, and DynamoDB items could be removed or TTL'd. This PR implements the fix for all three supported backends to provide consistent resilience regardless of store choice.
This PR adds three complementary resilience layers to detect and recover from persistent store data loss:
In-memory snapshot —
streamUpdatesStoreWrappernow maintains a deep copy of the latest flags and segments, updated on everyInitandUpsertfrom the streaming data source. Out-of-order updates are rejected via version comparison.Circuit breaker — When
Get/GetAllencounter a store read error and a snapshot is available, the wrapper serves from the snapshot and stops hitting the store. This prevents connection pool exhaustion during an outage and ensures clients continue receiving correct evaluations.Periodic health check — A background goroutine queries the store's
$initedsentinel key via a dedicated connection (bypassing the SDK's cache). When the key is missing (indicating data loss), it repopulates the store from the snapshot and clears the circuit breaker.Architecture
flowchart TB subgraph normal["1 - Normal Operation"] LD["LaunchDarkly Stream"] -->|Init / Upsert| SW["streamUpdatesStoreWrapper"] SW -->|write| Store[(Persistent Store)] SW -->|deep copy| Snap["In-Memory Snapshot"] SDK["SDK Client Get/GetAll"] --> SW SW -->|read| Store end subgraph failure["2 - Store Failure: Circuit Breaker"] SDK2["SDK Client Get/GetAll"] --> SW2["streamUpdatesStoreWrapper"] SW2 -->|read fails| Store2[(Store X)] SW2 -->|circuit opens, serve from snapshot| Snap2["In-Memory Snapshot"] end subgraph recovery["3 - Recovery: Health Check"] HC["StoreHealthCheck (every 30s)"] -->|"EXISTS $inited"| Store3[(Persistent Store)] Store3 -->|key missing| HC HC -->|"RepopulateStore(snapshot)"| SW3["streamUpdatesStoreWrapper"] SW3 -->|repopulate| Store3 HC -->|clear circuit breaker| SW3 end normal ~~~ failure failure ~~~ recovery style Store fill:#d4edda,stroke:#28a745 style Store2 fill:#f8d7da,stroke:#dc3545 style Store3 fill:#fff3cd,stroke:#ffc107 style Snap fill:#d1ecf1,stroke:#17a2b8 style Snap2 fill:#d1ecf1,stroke:#17a2b8Supported backends
The health check works with all three persistent store backends:
{prefix}:$initedEXISTScommand via dedicated Redigo pool{prefix}/$initedKV.Getvia dedicated Consul client{prefix}:$initedGetItemvia dedicated DynamoDB clientConfiguration
A new optional
healthCheckIntervalfield for each backend (default: 30s, set to0to disable):Or via environment variables:
REDIS_HEALTH_CHECK_INTERVAL,CONSUL_HEALTH_CHECK_INTERVAL,DYNAMODB_HEALTH_CHECK_INTERVAL.Scope
Test plan
All components were developed using test-driven development — tests were written before the implementation to verify expected behavior independently, ensuring they validate correctness rather than merely confirming existing code paths.
Initsaves snapshot,Upsertupdates snapshot, empty/nilInitdoes not setsnapshotHasData, deep copy isolation, out-of-order version rejectionGet/GetAllerrors, serves from snapshot, skips store on subsequent reads, clears on recovery, returns errors when no snapshot exists$initedkey, triggers repopulation, clears circuit breaker, handles connection errors gracefully, stops cleanly, no-ops without snapshotgo test ./...)Note
Medium Risk
Changes core persistent-store read behavior by adding snapshot fallback and a circuit breaker, plus a background health check that can repopulate Redis/Consul/DynamoDB automatically. Misconfiguration or edge cases could mask real store errors or cause unexpected write bursts during recovery.
Overview
Adds automatic recovery when a configured persistent store loses data (e.g. Redis restart) by maintaining an in-memory snapshot of the latest dataset and using it to repopulate the store.
Introduces a circuit breaker in
streamUpdatesStoreWrapper: onGet/GetAllread errors, Relay serves from the snapshot and stops hitting the store until recovery is detected.Adds a periodic data store health check (default 30s, configurable/disableable via new
healthCheckIntervalfields and env vars) that checks the$initedsentinel in Redis/Consul/DynamoDB via new per-backend init checkers; when missing, it repopulates the store and clears the circuit breaker. Documentation and unit tests cover snapshot, circuit breaker, and health-check behavior.Reviewed by Cursor Bugbot for commit f273bed. Bugbot is set up for automated code reviews on this repo. Configure here.