feat(redis): [proof of concept] add AWS IAM authentication for ElastiCache #691

Draft
aaron-zeisler wants to merge 6 commits into v8 from aaronz/SDK-2472/elasticache-iam-auth

aaron-zeisler (Contributor) commented Jun 7, 2026

Summary

Adds opt-in AWS IAM authentication for ElastiCache (Redis / Valkey), so operators running Relay Proxy on AWS can authenticate via IAM identities (IRSA, EKS Pod Identity) instead of long-lived static Redis passwords.

Jira: SDK-2472 — child of epic SDK-2471 "Relay Proxy: Support IAM for Elasticache". Resolves #680.

Status: proof of concept. We do not expect to ship this before Relay Proxy v9. The v9 work will fold this feature into a broader effort to upgrade go-redis to its latest version and consolidate Relay's two Redis libraries down to one (go-redis), at which point the v8-era OnConnect wiring and the manual connection-recycling below are replaced by go-redis v9's native CredentialsProvider (which re-auths in place). This branch is the hardened reference prototype that feeds that plan.

Depends on companion upstream PR launchdarkly/go-server-sdk-redis-redigo#45, which adds the PasswordProvider and MaxConnLifetime builder methods this branch consumes. go.mod pins to a pseudo-version against the upstream branch HEAD; before merge it should be re-pinned to a real release tag once that branch ships in a v3.x release.

Background

Long-lived Redis passwords are an operational burden and a security risk. The reporter (#680) asked for IAM-based auth gated behind a config flag, citing #127 (DynamoDB IRSA) as precedent.

DynamoDB was a one-line SDK bump — once the AWS SDK supported IRSA, every DynamoDB API call was transparently signed. ElastiCache is fundamentally different. Redis speaks its own wire protocol, not AWS APIs. Each Redis client must:

  1. Use AWS credentials to generate a SigV4-presigned URL (action=connect, service=elasticache).
  2. Pass that URL as the password in Redis AUTH / HELLO.
  3. Generate a fresh token for every new connection, because each token is valid for only 15 minutes.
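
As a rough illustration of step 1, the sketch below builds only the unsigned request URL that would then be SigV4-presigned against the elasticache service. The signing itself is left to the AWS SDK and is not shown. The function name, its parameters, and the sample cache and user names are hypothetical; the query keys follow the URL shape quoted under "Token generation" in this description.

```go
package main

import (
	"fmt"
	"net/url"
)

// connectRequestURL builds the unsigned connect request for an ElastiCache
// IAM token. Hypothetical helper for illustration: the real provider in
// internal/awsredisauth presigns this request with SigV4, which is omitted here.
func connectRequestURL(cacheName, user string, serverless bool) string {
	q := url.Values{}
	q.Set("Action", "connect")
	q.Set("User", user)
	if serverless {
		// Only for ElastiCache Serverless caches (see REDIS_AWS_SERVERLESS).
		q.Set("ResourceType", "ServerlessCache")
	}
	u := url.URL{Scheme: "https", Host: cacheName, Path: "/", RawQuery: q.Encode()}
	return u.String()
}

func main() {
	// Sample values are placeholders, not real resources.
	fmt.Println(connectRequestURL("my-cache", "relay-user", false))
}
```

Because nothing is cached in this shape, calling it again for each new connection naturally satisfies step 3.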

Relay constructs Redis clients in three places, using two different libraries, all reading the same RedisConfig:

| Use case | Library | Site |
| --- | --- | --- |
| SDK persistent data store | redigo via go-server-sdk-redis-redigo | internal/sdks/data_stores.go |
| Big segments store | go-redis/v8 | internal/bigsegments/store_redis.go |
| Auto-config cache | go-redis/v8 | internal/autoconfigcache/redis_store.go |

All three are in scope here — skipping any leaves a partial feature where enabling IAM auth silently breaks one of them.

A consolidated design and decision register lives at docs/agents/aws-iam-redis-design.md (gitignored, agent-only) and is mirrored to Confluence.

Changes

Configuration surface (config/config.go, config/config_validation.go):

  • AWSAuth bool (REDIS_AWS_AUTH) and AWSCacheName string (REDIS_AWS_CACHE_NAME) on RedisConfig.
  • AWSServerless bool (REDIS_AWS_SERVERLESS): experimental, not functionally supported. Set to true for an ElastiCache Serverless cache; this adds the ResourceType=ServerlessCache parameter that the signed token requires (without it, auth fails with an opaque WRONGPASS). It covers the auth handshake only: Serverless is cluster-mode-only and Relay's Redis clients are not cluster-aware, so multi-key data-store and big-segments (Watch/TxPipelined) operations will hit CROSSSLOT errors. Full Serverless support is blocked on the future cluster-aware data-store work (Relay v9 "M4a" milestone). Do not use in production.
  • AWSRegion string (REDIS_AWS_REGION): optional override for the SigV4 signing region. When empty, the region is resolved by the standard AWS SDK default configuration chain (unchanged behavior). A wrong region also surfaces as WRONGPASS, so the resolved region is logged at startup (see below).
  • Validation in normalizeRedisConfig: when AWSAuth=true, requires REDIS_TLS=true, REDIS_USERNAME, and REDIS_AWS_CACHE_NAME; rejects both REDIS_PASSWORD and URL-embedded passwords (rediss://user:pw@host); silently lowercases AWSCacheName to match AWS-side canonicalization.
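
The validation rules above could look roughly like the following. This is a minimal sketch, not the code in config/config_validation.go: the redisConfig struct, the validateIAM name, and the error strings are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// redisConfig is a hypothetical stand-in for the relevant RedisConfig fields.
type redisConfig struct {
	URL          string
	TLS          bool
	Username     string
	Password     string
	AWSAuth      bool
	AWSCacheName string
}

// validateIAM applies the IAM-mode rules and lowercases the cache name.
func validateIAM(c *redisConfig) error {
	if !c.AWSAuth {
		return nil
	}
	if !c.TLS {
		return errors.New("REDIS_AWS_AUTH requires REDIS_TLS=true")
	}
	if c.Username == "" {
		return errors.New("REDIS_AWS_AUTH requires REDIS_USERNAME")
	}
	if c.AWSCacheName == "" {
		return errors.New("REDIS_AWS_AUTH requires REDIS_AWS_CACHE_NAME")
	}
	if c.Password != "" {
		return errors.New("REDIS_AWS_AUTH and REDIS_PASSWORD are mutually exclusive")
	}
	// Reject passwords embedded in the URL, e.g. rediss://user:pw@host.
	if u, err := url.Parse(c.URL); err == nil && u.User != nil {
		if _, hasPassword := u.User.Password(); hasPassword {
			return errors.New("REDIS_AWS_AUTH does not allow a password in the Redis URL")
		}
	}
	c.AWSCacheName = strings.ToLower(c.AWSCacheName)
	return nil
}

func main() {
	c := &redisConfig{URL: "rediss://user:pw@host:6379", TLS: true,
		Username: "user", AWSAuth: true, AWSCacheName: "My-Cache"}
	fmt.Println(validateIAM(c)) // rejected: URL carries a password
}
```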

Token generation (internal/awsredisauth/):

  • Stateless TokenProvider that SigV4-presigns https://<cache-name>/?Action=connect&User=<user> (plus &ResourceType=ServerlessCache when serverless) against the elasticache service. Token lifetime parameterized (default 900 s — the AWS-documented max). No caching, no background goroutine — each call retrieves credentials from aws.Config.Credentials and re-signs.
  • SharedTokenProvider memoizes one provider per (cacheName, username, region, serverless) fingerprint, so the three wiring sites share a single LoadDefaultConfig, credential cache, and fail-fast Token() startup probe — without changing relay.New's signature. It logs once at startup: ElastiCache IAM auth enabled (cache=%s, region=%s, serverless=%t).
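
The memoization idea can be sketched as a mutex-guarded map keyed by a comparable fingerprint struct. The type and function names below are hypothetical, and the real SharedTokenProvider additionally loads the AWS config and runs the fail-fast Token() probe, both omitted here; only the log format string is taken from this description.

```go
package main

import (
	"fmt"
	"sync"
)

// fingerprint identifies one provider; all fields are comparable, so the
// struct can be used directly as a map key.
type fingerprint struct {
	cacheName, username, region string
	serverless                  bool
}

// provider is a placeholder for the real token provider.
type provider struct{ fp fingerprint }

var (
	mu        sync.Mutex
	providers = map[fingerprint]*provider{}
)

// sharedProvider returns the same *provider for equal fingerprints and
// logs once, when a provider is first created.
func sharedProvider(fp fingerprint) *provider {
	mu.Lock()
	defer mu.Unlock()
	if p, ok := providers[fp]; ok {
		return p
	}
	p := &provider{fp: fp}
	providers[fp] = p
	fmt.Printf("ElastiCache IAM auth enabled (cache=%s, region=%s, serverless=%t)\n",
		fp.cacheName, fp.region, fp.serverless)
	return p
}

func main() {
	fp := fingerprint{cacheName: "my-cache", username: "relay-user", region: "us-east-1"}
	a, b := sharedProvider(fp), sharedProvider(fp)
	fmt.Println(a == b) // true: the second call reuses the memoized provider
}
```

Keying on the full fingerprint is what lets the three wiring sites share one instance without threading it through constructor signatures.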

Wiring: all three Redis client construction sites updated. In IAM mode:

  • The redigo data-store path uses the upstream PasswordProvider (callback invoked per Dial) plus DialUsername (ElastiCache IAM requires the ACL AUTH <user> <token> form) plus a jittered MaxConnLifetime.
  • The two go-redis paths set OnConnect to call cn.AuthACL(ctx, username, token) and a jittered MaxConnAge. opts.Password / opts.Username are explicitly zeroed before installing OnConnect as defense-in-depth — go-redis runs a pipeline-AUTH before OnConnect would fire, and a leftover password (e.g., from URL parsing) would otherwise be sent to ElastiCache as static credentials.
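
To make the wire-level difference concrete, here is a small sketch of how the two-argument ACL form is encoded in RESP as an array of bulk strings. The respCommand helper and the sample username are illustrative only; redigo and go-redis do this encoding internally.

```go
package main

import (
	"fmt"
	"strings"
)

// respCommand encodes a command as a RESP array of bulk strings.
// Bulk-string lengths are byte counts, which is what len() returns.
func respCommand(args ...string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "*%d\r\n", len(args))
	for _, a := range args {
		fmt.Fprintf(&b, "$%d\r\n%s\r\n", len(a), a)
	}
	return b.String()
}

func main() {
	// Two-argument ACL form: AUTH <user> <token>. "<token>" is a placeholder
	// for the presigned URL; "relay-user" is a hypothetical username.
	fmt.Printf("%q\n", respCommand("AUTH", "relay-user", "<token>"))
	// Single-argument legacy form: AUTH <token>.
	fmt.Printf("%q\n", respCommand("AUTH", "<token>"))
}
```

The first command is a three-element array and the second a two-element array, which is the distinction a mock server can enforce.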

Connection recycling / jitter: AWS disconnects IAM-authenticated connections after a hard 12-hour ceiling (active or idle). JitteredMaxConnAge() recycles each pool at a uniformly random age in [10h30m, 11h], computed once per pool. This staggers reconnects across pods (independent random draws) and across the three pools within a pod. It does not stagger connections within a single pool created at the same instant (e.g. all reconnecting after a Redis failover) — that residual is inherent to a single per-pool max-age (neither redigo nor go-redis/v8 supports per-connection max-age) and is resolved by the planned go-redis v9 migration's native CredentialsProvider, which re-auths in place rather than forcing a TCP recycle.
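
A uniformly random draw from [10h30m, 11h] might be implemented along these lines. This is a sketch under assumptions, not the branch's JitteredMaxConnAge: the lowercase function name and the constants are illustrative, and math/rand is used because the jitter is not security-sensitive.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	minConnAge = 10*time.Hour + 30*time.Minute
	maxConnAge = 11 * time.Hour
)

// jitteredMaxConnAge returns a uniformly random duration in
// [minConnAge, maxConnAge], inclusive at both ends.
func jitteredMaxConnAge() time.Duration {
	span := int64(maxConnAge - minConnAge)
	// Int63n returns a value in [0, n), so add 1 to include the upper bound.
	return minConnAge + time.Duration(rand.Int63n(span+1))
}

func main() {
	// Each pool would draw once at construction and keep the value.
	fmt.Println(jitteredMaxConnAge())
}
```

Drawing once per pool keeps both bounds below the 12-hour ceiling while letting independent pools and pods land on different recycle times.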

Docs: docs/configuration.md (env-var reference, incl. REDIS_AWS_SERVERLESS / REDIS_AWS_REGION) and docs/persistent-storage.md (the "AWS IAM authentication for ElastiCache" section with prerequisites, example config, region notes, and the 12-hour connection-lifetime note). REDIS_AWS_SERVERLESS is documented as experimental / not functionally supported (auth-handshake-only; Serverless is cluster-mode-only and Relay's clients are not cluster-aware), with a prominent caveat that resolves the existing "clustered Redis is unsupported" contradiction.

Tests:

  • Validation rules incl. URL-embedded-password rejection; serverless and region-override config parsing.
  • Token-provider units incl. ResourceType=ServerlessCache present/absent and region flow-through.
  • SharedTokenProvider memoization (same instance per fingerprint, distinct per fingerprint, probe/log once, test reset hook).
  • JitteredMaxConnAge range bound [10h30m, 11h].
  • Per-site wiring tests (credentials-error propagation across the three Redis sites).
  • Integration test under build tag integration: an in-process TTL-enforcing RESP mock drives the redigo path through 4 simulated token-expiry boundaries with zero command failures, enforcing the two-arg ACL AUTH form so any regression of the DialUsername wiring fails it.

Adds opt-in support for authenticating to AWS ElastiCache for
Redis/Valkey using IAM identities (IRSA / Pod Identity), eliminating
the need for long-lived static Redis passwords.

New configuration:
- REDIS_AWS_AUTH (bool) — enable IAM auth.
- REDIS_AWS_CACHE_NAME (string) — ElastiCache cluster name, signed
  into the SigV4 presigned authentication URL.

When enabled, the relay generates a fresh SigV4-presigned URL on each
new Redis connection and uses it as the AUTH/HELLO password. Token
generation is stateless, per-connection, microseconds of CPU; the AWS
SDK's credential chain handles IAM credential refresh transparently.

To accommodate the AWS-imposed 12-hour connection lifetime for IAM-
authenticated connections, all three Redis client pools internally set
MaxConnAge / MaxConnLifetime to 11 hours in IAM mode.

Validation rules (in normalizeRedisConfig):
- REDIS_AWS_AUTH requires REDIS_TLS=true.
- REDIS_AWS_AUTH requires REDIS_USERNAME.
- REDIS_AWS_AUTH requires REDIS_AWS_CACHE_NAME.
- REDIS_AWS_AUTH and REDIS_PASSWORD are mutually exclusive.
- REDIS_AWS_CACHE_NAME is lowercased silently to match the AWS-side
  canonicalization performed at cache-creation time.

Scope covers all three Redis client construction sites: the SDK
persistent data store (redigo), big segments (go-redis/v8), and the
auto-config cache (go-redis/v8). Each site fails fast at startup if
IAM credentials cannot be retrieved.

Tests:
- Unit tests for new validation rules and config normalization.
- Unit tests for the stateless TokenProvider and credential-error
  propagation.
- Per-site unit tests covering construction-time error paths.
- Integration test (build tag `integration`) using an in-process
  RESP-protocol mock with AUTH-once + expired-token semantics; drives
  the redigo path across multiple simulated 5s token-expiry boundaries
  with zero observed command failures.

NOTE: this branch carries a local `replace` directive in go.mod
pointing at a sibling worktree of go-server-sdk-redis-redigo so that
the upstream PasswordProvider/MaxConnLifetime additions resolve during
development. The replace must be swapped for a pseudo-version pin
before this branch is opened for review against main.
…ct URL-embedded passwords

Three correctness fixes surfaced by the multi-agent code review:

1. Redigo data-store path was missing DialUsername in the IAM branch, so
   redigo sent the single-arg `AUTH <token>` form on the wire instead of
   the two-arg `AUTH <user> <token>` ACL form that ElastiCache IAM auth
   requires. Production would have failed to authenticate.

2. The integration-test mock accepted both single-arg and two-arg AUTH,
   so it would have passed even with the bug from item 1 present. The mock now
   enforces the two-arg ACL form (matching real ElastiCache behavior) and
   the test's redigo pool now sends DialUsername alongside DialPassword
   to mirror the production wiring.

3. The AWSAuth + Password mutual-exclusion guard only inspected
   REDIS_PASSWORD; a password embedded in REDIS_URL
   (e.g. `rediss://user:pw@host`) bypassed it and flowed into
   opts.Password, where go-redis would pipeline a static-password AUTH
   before OnConnect could fire the IAM auth. Validation now also rejects
   URL-embedded passwords. As defense-in-depth, the two go-redis wiring
   sites also zero opts.Password and opts.Username before installing the
   OnConnect callback.
…ion of IAM-auth branch

Swaps the local `replace` directive for a pseudo-version pin against the
upstream branch commit (`6b81558`) so CI can resolve the new
`PasswordProvider` and `MaxConnLifetime` builder methods without
needing a sibling worktree on disk.

Tracks the head of `aaronz/SDK-2472/elasticache-iam-auth` on
`go-server-sdk-redis-redigo`. Will be re-pinned to a real release tag
once that branch merges and a `v3.x` release is cut.
@aaron-zeisler aaron-zeisler changed the title feat(redis): add AWS IAM authentication for ElastiCache [proof of concept] feat(redis): add AWS IAM authentication for ElastiCache Jun 8, 2026
…on, shared provider, jitter)

Addresses four review findings on the IAM-auth PoC:

- Serverless: add REDIS_AWS_SERVERLESS to emit ResourceType=ServerlessCache in the signed token; without it serverless caches reject auth with an opaque WRONGPASS.

- Region: add optional REDIS_AWS_REGION override and log the resolved region/cache/serverless once at startup, so a region mismatch (also WRONGPASS) is diagnosable.

- Shared provider: SharedTokenProvider memoizes one provider per config fingerprint, collapsing three LoadDefaultConfig calls, credential caches, and startup STS probes into one without changing relay.New.

- Jitter: JitteredMaxConnAge spreads pool recycle over [10h30m, 11h] to avoid a synchronized reconnect storm at the AWS 12h connection ceiling. Staggers across pods/pools; within-pool residual after a mass reconnect is documented and resolved by the planned go-redis v9 migration.
@aaron-zeisler aaron-zeisler changed the title [proof of concept] feat(redis): add AWS IAM authentication for ElastiCache feat(redis): [proof of concept] add AWS IAM authentication for ElastiCache Jun 8, 2026
Suppress gochecknoglobals on the intentional process-wide shared-provider memo cache and gosec G404 on the non-cryptographic connection-age jitter. Tests/build/vet were green locally; CI gates on make lint, which these now pass (0 issues).
… not functionally supported

The serverless IAM flag only adds ResourceType=ServerlessCache to the SigV4
presigned token, which covers the auth handshake. ElastiCache Serverless is
cluster-mode-only, but Relay's redigo and go-redis/v8 clients are not
cluster-aware, so multi-key data-store and big-segments (Watch/TxPipelined)
operations will hit CROSSSLOT errors. Full serverless support is blocked on
cluster-aware client work that has not shipped.

Resolve the contradiction in docs/persistent-storage.md (clustered Redis is
unsupported vs. the unconditional serverless instructions), add an experimental
caveat to the REDIS_AWS_SERVERLESS row in docs/configuration.md, and document
the limitation in the AWSServerless config godoc.
Development

Successfully merging this pull request may close these issues.

relay proxy : support IAM auth for AWS Elasticache
