feat(redis): [proof of concept] add AWS IAM authentication for ElastiCache#691
Draft
aaron-zeisler wants to merge 6 commits into
Adds opt-in support for authenticating to AWS ElastiCache for Redis/Valkey using IAM identities (IRSA / Pod Identity), eliminating the need for long-lived static Redis passwords.

New configuration:
- REDIS_AWS_AUTH (bool): enable IAM auth.
- REDIS_AWS_CACHE_NAME (string): ElastiCache cluster name, signed into the SigV4 presigned authentication URL.

When enabled, the relay generates a fresh SigV4-presigned URL on each new Redis connection and uses it as the AUTH/HELLO password. Token generation is stateless and per-connection, and costs microseconds of CPU; the AWS SDK's credential chain handles IAM credential refresh transparently.

To accommodate the AWS-imposed 12-hour connection lifetime for IAM-authenticated connections, all three Redis client pools internally set MaxConnAge / MaxConnLifetime to 11 hours in IAM mode.

Validation rules (in normalizeRedisConfig):
- REDIS_AWS_AUTH requires REDIS_TLS=true.
- REDIS_AWS_AUTH requires REDIS_USERNAME.
- REDIS_AWS_AUTH requires REDIS_AWS_CACHE_NAME.
- REDIS_AWS_AUTH and REDIS_PASSWORD are mutually exclusive.
- REDIS_AWS_CACHE_NAME is lowercased silently to match the AWS-side canonicalization performed at cache-creation time.

Scope covers all three Redis client construction sites: the SDK persistent data store (redigo), big segments (go-redis/v8), and the auto-config cache (go-redis/v8). Each site fails fast at startup if IAM credentials cannot be retrieved.

Tests:
- Unit tests for new validation rules and config normalization.
- Unit tests for the stateless TokenProvider and credential-error propagation.
- Per-site unit tests covering construction-time error paths.
- Integration test (build tag `integration`) using an in-process RESP-protocol mock with AUTH-once + expired-token semantics; drives the redigo path across multiple simulated 5s token-expiry boundaries with zero observed command failures.
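The validation rules in this commit message can be sketched as a single normalization pass. This is a minimal illustration, not the PR's actual code: the `RedisConfig` struct below is a trimmed-down, hypothetical stand-in, and `normalizeAWSAuth` is an assumed name chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// RedisConfig is a hypothetical, trimmed-down stand-in for Relay's config struct.
type RedisConfig struct {
	TLS          bool   // REDIS_TLS
	Username     string // REDIS_USERNAME
	Password     string // REDIS_PASSWORD
	AWSAuth      bool   // REDIS_AWS_AUTH
	AWSCacheName string // REDIS_AWS_CACHE_NAME
}

// normalizeAWSAuth (assumed name) applies the IAM-mode rules listed in the
// commit message and lowercases the cache name in place.
func normalizeAWSAuth(c *RedisConfig) error {
	if !c.AWSAuth {
		return nil // IAM auth is opt-in; nothing to check otherwise
	}
	if !c.TLS {
		return errors.New("REDIS_AWS_AUTH requires REDIS_TLS=true")
	}
	if c.Username == "" {
		return errors.New("REDIS_AWS_AUTH requires REDIS_USERNAME")
	}
	if c.AWSCacheName == "" {
		return errors.New("REDIS_AWS_AUTH requires REDIS_AWS_CACHE_NAME")
	}
	if c.Password != "" {
		return errors.New("REDIS_AWS_AUTH and REDIS_PASSWORD are mutually exclusive")
	}
	// Match the lowercase form the cache name takes on the AWS side.
	c.AWSCacheName = strings.ToLower(c.AWSCacheName)
	return nil
}

func main() {
	cfg := RedisConfig{TLS: true, Username: "relay", AWSAuth: true, AWSCacheName: "My-Cache"}
	if err := normalizeAWSAuth(&cfg); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("normalized cache name:", cfg.AWSCacheName)
}
```

Returning on the first violated rule keeps the startup error message specific to one misconfiguration at a time.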
NOTE: this branch carries a local `replace` directive in go.mod pointing at a sibling worktree of go-server-sdk-redis-redigo so that the upstream PasswordProvider/MaxConnLifetime additions resolve during development. The replace must be swapped for a pseudo-version pin before this branch is opened for review against main.
…ct URL-embedded passwords

Three correctness fixes surfaced by the multi-agent code review:

1. Redigo data-store path was missing DialUsername in the IAM branch, so redigo sent the single-arg `AUTH <token>` form on the wire instead of the two-arg `AUTH <user> <token>` ACL form that ElastiCache IAM auth requires. Production would have failed to authenticate.
2. The integration-test mock accepted both single-arg and two-arg AUTH, so it would have passed even with the bug from item 1 present. The mock now enforces the two-arg ACL form (matching real ElastiCache behavior) and the test's redigo pool now sends DialUsername alongside DialPassword to mirror the production wiring.
3. The AWSAuth + Password mutual-exclusion guard only inspected REDIS_PASSWORD; a password embedded in REDIS_URL (e.g. `rediss://user:pw@host`) bypassed it and flowed into opts.Password, where go-redis would pipeline a static-password AUTH before OnConnect could fire the IAM auth. Validation now also rejects URL-embedded passwords. As defense-in-depth, the two go-redis wiring sites also zero opts.Password and opts.Username before installing the OnConnect callback.
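The URL-embedded password check from item 3 could look roughly like the following. This is a sketch using only the standard `net/url` package; `hasEmbeddedPassword` is a hypothetical helper name, and the PR's real validation code may be structured differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// hasEmbeddedPassword (hypothetical helper) reports whether a Redis URL
// carries a password in its userinfo part, e.g. rediss://user:pw@host.
func hasEmbeddedPassword(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	if u.User == nil {
		return false, nil // no userinfo at all
	}
	_, set := u.User.Password() // second result is true when a password is present
	return set, nil
}

func main() {
	for _, raw := range []string{
		"rediss://user:pw@host:6379",
		"rediss://user@host:6379",
		"rediss://host:6379",
	} {
		embedded, err := hasEmbeddedPassword(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s -> embedded password: %t\n", raw, embedded)
	}
}
```

A username alone is allowed through here, because IAM mode still needs a user; only the password component is treated as a conflict.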
…ion of IAM-auth branch

Swaps the local `replace` directive for a pseudo-version pin against the upstream branch commit (`6b81558`) so CI can resolve the new `PasswordProvider` and `MaxConnLifetime` builder methods without needing a sibling worktree on disk. Tracks the head of `aaronz/SDK-2472/elasticache-iam-auth` on `go-server-sdk-redis-redigo`. Will be re-pinned to a real release tag once that branch merges and a `v3.x` release is cut.
…on, shared provider, jitter)

Addresses four review findings on the IAM-auth PoC:

- Serverless: add REDIS_AWS_SERVERLESS to emit ResourceType=ServerlessCache in the signed token; without it serverless caches reject auth with an opaque WRONGPASS.
- Region: add optional REDIS_AWS_REGION override and log the resolved region/cache/serverless once at startup, so a region mismatch (also WRONGPASS) is diagnosable.
- Shared provider: SharedTokenProvider memoizes one provider per config fingerprint, collapsing three LoadDefaultConfig calls, credential caches, and startup STS probes into one without changing relay.New.
- Jitter: JitteredMaxConnAge spreads pool recycle over [10h30m, 11h] to avoid a synchronized reconnect storm at the AWS 12h connection ceiling. Staggers across pods/pools; within-pool residual after a mass reconnect is documented and resolved by the planned go-redis v9 migration.
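The jitter window described in the last bullet can be illustrated with a few lines of Go. This is a sketch under stated assumptions: `jitteredMaxConnAge` is a lowercase stand-in written for this example, not the PR's `JitteredMaxConnAge`, and it uses `math/rand` because the jitter has no security role.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	minConnAge = 10*time.Hour + 30*time.Minute // lower bound of the recycle window
	maxConnAge = 11 * time.Hour                // upper bound, below the 12h AWS ceiling
)

// jitteredMaxConnAge (illustrative stand-in) returns a uniformly random
// duration in the closed interval [minConnAge, maxConnAge].
// Non-cryptographic randomness is sufficient for spreading reconnects.
func jitteredMaxConnAge() time.Duration {
	span := int64(maxConnAge - minConnAge)
	// Int63n(n) returns a value in [0, n), so span+1 makes the upper bound inclusive.
	return minConnAge + time.Duration(rand.Int63n(span+1))
}

func main() {
	// Each pool would draw its own value once at construction time.
	for _, pool := range []string{"data store", "big segments", "auto-config cache"} {
		fmt.Printf("%-18s max connection age: %s\n", pool, jitteredMaxConnAge())
	}
}
```

Drawing once per pool staggers pools and pods relative to each other; connections inside one pool still share the same maximum age, which is the residual the commit message mentions.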
Suppress gochecknoglobals on the intentional process-wide shared-provider memo cache and gosec G404 on the non-cryptographic connection-age jitter. Tests/build/vet were green locally; CI gates on `make lint`, which now passes with these suppressions in place (0 issues).
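For context on why a package-level variable is intentional here, a process-wide memo cache keyed by a config fingerprint might be shaped like this. The sketch is illustrative only: `fingerprint`, `tokenProvider`, and `sharedProvider` are hypothetical names, and the real `SharedTokenProvider` also performs a startup probe and logging that are omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// fingerprint is a comparable key built from the settings that affect signing.
type fingerprint struct {
	cacheName  string
	username   string
	region     string
	serverless bool
}

// tokenProvider is a placeholder for the real provider type.
type tokenProvider struct{ fp fingerprint }

// Package-level state is deliberate: every wiring site in the process must
// see the same cache. This is the kind of global a linter would flag.
var (
	providersMu sync.Mutex
	providers   = map[fingerprint]*tokenProvider{}
)

// sharedProvider returns the cached provider for fp, building it at most once.
func sharedProvider(fp fingerprint, build func(fingerprint) (*tokenProvider, error)) (*tokenProvider, error) {
	providersMu.Lock()
	defer providersMu.Unlock()
	if p, ok := providers[fp]; ok {
		return p, nil
	}
	p, err := build(fp)
	if err != nil {
		return nil, err // failures are not cached, so a later call can retry
	}
	providers[fp] = p
	return p, nil
}

func main() {
	builds := 0
	build := func(fp fingerprint) (*tokenProvider, error) {
		builds++
		return &tokenProvider{fp: fp}, nil
	}
	fp := fingerprint{cacheName: "my-cache", username: "relay", region: "us-east-1"}
	a, _ := sharedProvider(fp, build)
	b, _ := sharedProvider(fp, build)
	fmt.Println("same instance:", a == b, "builds:", builds)
}
```

Holding the mutex across `build` serializes construction, which is acceptable when providers are only created at startup.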
… not functionally supported

The serverless IAM flag only adds ResourceType=ServerlessCache to the SigV4 presigned token, which covers the auth handshake. ElastiCache Serverless is cluster-mode-only, but Relay's redigo and go-redis/v8 clients are not cluster-aware, so multi-key data-store and big-segments (Watch/TxPipelined) operations will hit CROSSSLOT errors. Full serverless support is blocked on cluster-aware client work that has not shipped.

Resolve the contradiction in docs/persistent-storage.md (clustered Redis is unsupported vs. the unconditional serverless instructions), add an experimental caveat to the REDIS_AWS_SERVERLESS row in docs/configuration.md, and document the limitation in the AWSServerless config godoc.
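To make the CROSSSLOT limitation concrete: Redis Cluster assigns each key to one of 16384 hash slots using CRC16 of the key (or of its `{hash tag}`, if present), and a multi-key operation whose keys land in different slots is rejected. The sketch below computes slots for a few made-up keys; the key names are purely illustrative and are not Relay's actual key layout.

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 implements the CRC16-CCITT (XMODEM) variant used by Redis Cluster:
// polynomial 0x1021, initial value 0, no reflection.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashSlot returns the cluster slot for a key. If the key contains a
// non-empty {tag}, only the tag is hashed, which lets related keys share a slot.
func hashSlot(key string) uint16 {
	if start := strings.IndexByte(key, '{'); start >= 0 {
		if end := strings.IndexByte(key[start+1:], '}'); end > 0 {
			key = key[start+1 : start+1+end]
		}
	}
	return crc16([]byte(key)) % 16384
}

func main() {
	// Illustrative keys only. Untagged keys usually land in different slots;
	// keys sharing a hash tag always land in the same one.
	for _, k := range []string{"foo", "bar", "{env1}:features", "{env1}:segments"} {
		fmt.Printf("%-18s slot %d\n", k, hashSlot(k))
	}
}
```

A client that is not cluster-aware has no notion of slots, so a transaction touching keys in two slots fails on the server side regardless of how authentication was performed.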
Summary
Adds opt-in AWS IAM authentication for ElastiCache (Redis / Valkey), so operators running Relay Proxy on AWS can authenticate via IAM identities (IRSA, EKS Pod Identity) instead of long-lived static Redis passwords.
Jira: SDK-2472 — child of epic SDK-2471 "Relay Proxy: Support IAM for Elasticache". Resolves #680.
Depends on companion upstream PR launchdarkly/go-server-sdk-redis-redigo#45, which adds the `PasswordProvider` and `MaxConnLifetime` builder methods this branch consumes. `go.mod` pins to a pseudo-version against the upstream branch HEAD; before merge it should be re-pinned to a real release tag once that branch ships in a `v3.x` release.

Background
Long-lived Redis passwords are an operational burden and a security risk. The reporter (#680) asked for IAM-based auth gated behind a config flag, citing #127 (DynamoDB IRSA) as precedent.
DynamoDB was a one-line SDK bump: once the AWS SDK supported IRSA, every DynamoDB API call was transparently signed. ElastiCache is fundamentally different. Redis speaks its own wire protocol, not AWS APIs. Each Redis client must:

- Generate a SigV4-presigned token (`action=connect`, `service=elasticache`).
- Send that token as the password in `AUTH`/`HELLO`.

Relay constructs Redis clients in three places, using two different libraries, all reading the same `RedisConfig`:

- SDK persistent data store: `redigo` via `go-server-sdk-redis-redigo`, in `internal/sdks/data_stores.go`
- Big segments: `go-redis/v8`, in `internal/bigsegments/store_redis.go`
- Auto-config cache: `go-redis/v8`, in `internal/autoconfigcache/redis_store.go`

All three are in scope here; skipping any would leave a partial feature where enabling IAM auth silently breaks one of them.
A consolidated design and decision register lives at `docs/agents/aws-iam-redis-design.md` (gitignored, agent-only) and is mirrored to Confluence.

Changes

Configuration surface (`config/config.go`, `config/config_validation.go`):

- `AWSAuth bool` (`REDIS_AWS_AUTH`) and `AWSCacheName string` (`REDIS_AWS_CACHE_NAME`) on `RedisConfig`.
- `AWSServerless bool` (`REDIS_AWS_SERVERLESS`): experimental, not functionally supported. Set `true` for an ElastiCache Serverless cache; this adds the `ResourceType=ServerlessCache` parameter that the signed token requires (without it, auth fails with an opaque `WRONGPASS`). It covers the auth handshake only: Serverless is cluster-mode-only and Relay's Redis clients are not cluster-aware, so multi-key data-store and big-segments (`Watch`/`TxPipelined`) operations will hit `CROSSSLOT`. Full Serverless support is blocked on the future cluster-aware data-store work (Relay v9 "M4a" milestone). Do not use in production.
- `AWSRegion string` (`REDIS_AWS_REGION`): optional override for the SigV4 signing region. Empty falls back to the standard AWS credential chain (unchanged behavior). A wrong region also surfaces as `WRONGPASS`, so the resolved region is logged at startup (see below).
- `normalizeRedisConfig`: when `AWSAuth=true`, requires `REDIS_TLS=true`, `REDIS_USERNAME`, and `REDIS_AWS_CACHE_NAME`; rejects both `REDIS_PASSWORD` and URL-embedded passwords (`rediss://user:pw@host`); silently lowercases `AWSCacheName` to match AWS-side canonicalization.

Token generation (`internal/awsredisauth/`):

- A `TokenProvider` that SigV4-presigns `https://<cache-name>/?Action=connect&User=<user>` (plus `&ResourceType=ServerlessCache` when serverless) against the `elasticache` service. Token lifetime is parameterized (default 900 s, the AWS-documented max). No caching, no background goroutine: each call retrieves credentials from `aws.Config.Credentials` and re-signs.
- `SharedTokenProvider` memoizes one provider per `(cacheName, username, region, serverless)` fingerprint, so the three wiring sites share a single `LoadDefaultConfig`, credential cache, and fail-fast `Token()` startup probe, without changing `relay.New`'s signature. It logs once at startup: `ElastiCache IAM auth enabled (cache=%s, region=%s, serverless=%t)`.

Wiring: all three Redis client construction sites updated. In IAM mode:

- redigo (data store): `PasswordProvider` (callback invoked per Dial) plus `DialUsername` (ElastiCache IAM requires the ACL `AUTH <user> <token>` form) plus a jittered `MaxConnLifetime`.
- go-redis/v8 (big segments, auto-config cache): `OnConnect` to call `cn.AuthACL(ctx, username, token)` and a jittered `MaxConnAge`.
- `opts.Password`/`opts.Username` are explicitly zeroed before installing `OnConnect` as defense-in-depth: go-redis runs a pipeline-AUTH before `OnConnect` would fire, and a leftover password (e.g., from URL parsing) would otherwise be sent to ElastiCache as static credentials.

Connection recycling / jitter: AWS disconnects IAM-authenticated connections after a hard 12-hour ceiling (active or idle). `JitteredMaxConnAge()` recycles each pool at a uniformly random age in `[10h30m, 11h]`, computed once per pool. This staggers reconnects across pods (independent random draws) and across the three pools within a pod. It does not stagger connections within a single pool created at the same instant (e.g. all reconnecting after a Redis failover). That residual is inherent to a single per-pool max-age (neither redigo nor go-redis/v8 supports per-connection max-age) and is resolved by the planned go-redis v9 migration's native `CredentialsProvider`, which re-auths in place rather than forcing a TCP recycle.

Docs:

- `docs/configuration.md` (env-var reference, incl. `REDIS_AWS_SERVERLESS` / `REDIS_AWS_REGION`) and `docs/persistent-storage.md` (the "AWS IAM authentication for ElastiCache" section with prerequisites, example config, region notes, and the 12-hour connection-lifetime note).
- `REDIS_AWS_SERVERLESS` is documented as experimental / not functionally supported (auth-handshake-only; Serverless is cluster-mode-only and Relay's clients are not cluster-aware), with a prominent caveat that resolves the existing "clustered Redis is unsupported" contradiction.

Tests:

- `ResourceType=ServerlessCache` present/absent and region flow-through.
- `SharedTokenProvider` memoization (same instance per fingerprint, distinct per fingerprint, probe/log once, test reset hook).
- `JitteredMaxConnAge` range bound `[10h30m, 11h]`.
- Integration test (build tag `integration`): an in-process TTL-enforcing RESP mock drives the redigo path through 4 simulated token-expiry boundaries with zero command failures, enforcing the two-arg ACL `AUTH` form so any regression of the `DialUsername` wiring fails it.
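As a rough illustration of the request described under Token generation, the unsigned URL that gets presigned can be assembled with the standard library. This sketch covers only URL construction; the SigV4 presigning step (done with the AWS SDK in the PR) is deliberately left out, and `connectURL` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// connectURL (hypothetical helper) builds the unsigned request URL that would
// then be SigV4-presigned against the "elasticache" service. Signing itself
// is omitted from this sketch.
func connectURL(cacheName, user string, serverless bool) string {
	q := url.Values{}
	q.Set("Action", "connect")
	q.Set("User", user)
	if serverless {
		// Only added for ElastiCache Serverless caches.
		q.Set("ResourceType", "ServerlessCache")
	}
	u := url.URL{
		Scheme:   "https",
		Host:     strings.ToLower(cacheName), // cache names are lowercased
		Path:     "/",
		RawQuery: q.Encode(), // Encode sorts parameters by key
	}
	return u.String()
}

func main() {
	fmt.Println(connectURL("My-Cache", "relay", false))
	fmt.Println(connectURL("My-Cache", "relay", true))
}
```

Note that `url.Values.Encode` emits parameters in sorted key order, so `ResourceType` appears between `Action` and `User` rather than at the end.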