
[NO REVIEW][Cosmos] Share PartitionKeyRangeCache across CosmosClients targeting the same account#49560

Open
xinlian12 wants to merge 1 commit into Azure:main from xinlian12:feature/shared-partition-key-range-cache

xinlian12 commented Jun 18, 2026

Member

Description

Today every CosmosClient / CosmosAsyncClient owns its own RxPartitionKeyRangeCache, even when many clients in the same JVM target the same Cosmos account (a common pattern for multi-tenant / multi-credential apps and frameworks that recreate clients). The routing-map data is duplicated N times and /pkranges calls fan out N times for the same containers.

This PR moves the routing-map storage to a process-wide, refcounted registry keyed by service endpoint. The fetching path (which depends on the per-client network stack, auth, collection cache, diagnostics) stays per-client.

Design

Split RxPartitionKeyRangeCache into two layers:

  1. Storage: AsyncCacheNonBlocking<String, CollectionRoutingMap>. Account-level data, naturally shareable. Now obtained from SharedRoutingMapCacheRegistry (process-wide singleton), keyed by the service endpoint URL string.
  2. Fetcher: issues /pkranges; depends on the per-client RxDocumentClientImpl, RxCollectionCache, and diagnostics. Unchanged.

Concurrency model

All registry state transitions go through ConcurrentHashMap.compute(...), which provides atomic per-key check-and-update.
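
A minimal sketch of the refcounted, endpoint-keyed pattern described above. This is an illustration only and not the PR's SharedRoutingMapCacheRegistry: the class name `RefCountedRegistry`, its methods, and the immutable `Entry` holder are hypothetical, and the commit message says the real registry keeps an AtomicInteger refcount rather than replacing entries.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical illustration; names and shape are assumptions, not the SDK's code.
final class RefCountedRegistry<V> {

    // Immutable pair of shared value and current reference count.
    private static final class Entry<T> {
        final T value;
        final int refCount;

        Entry(T value, int refCount) {
            this.value = value;
            this.refCount = refCount;
        }
    }

    private final ConcurrentHashMap<String, Entry<V>> entries = new ConcurrentHashMap<>();

    // Returns the shared value for the endpoint, creating it on first acquire.
    // compute(...) runs the remapping function atomically for this key.
    V acquire(String endpoint, Supplier<V> factory) {
        Entry<V> updated = entries.compute(endpoint, (key, existing) -> {
            if (existing == null) {
                return new Entry<V>(factory.get(), 1);
            }
            return new Entry<V>(existing.value, existing.refCount + 1);
        });
        return updated.value;
    }

    // Drops one reference; returning null from compute(...) removes the mapping,
    // which evicts the entry once the last reference is released.
    void release(String endpoint) {
        entries.compute(endpoint, (key, existing) -> {
            if (existing == null) {
                return null; // nothing to release
            }
            int remaining = existing.refCount - 1;
            return remaining <= 0 ? null : new Entry<V>(existing.value, remaining);
        });
    }

    int refCount(String endpoint) {
        Entry<V> entry = entries.get(endpoint);
        return entry == null ? 0 : entry.refCount;
    }

    int size() {
        return entries.size();
    }
}
```

Because creation, increment, decrement, and eviction all happen inside `compute`, there is no window in which one caller evicts an entry while another is acquiring it for the same key.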

Lifecycle

  • RxPartitionKeyRangeCache ctor acquires from the registry (bumps refcount).
  • RxPartitionKeyRangeCache now implements Closeable; close() releases the refcount and is idempotent (guarded by AtomicBoolean).
  • RxDocumentClientImpl.close() calls LifeCycleUtils.closeQuietly(partitionKeyRangeCache).
  • When the last reference is released, the registry entry is evicted so idle endpoints don't pin memory.
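
The idempotent close() described in the list above can be sketched as follows. The class `SharedHandle` and the bare AtomicInteger standing in for the registry are hypothetical; the sketch only shows how an AtomicBoolean guard ensures a reference is released at most once.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; the AtomicInteger stands in for the shared registry.
final class SharedHandle implements Closeable {
    private final AtomicInteger sharedRefCount;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    SharedHandle(AtomicInteger sharedRefCount) {
        this.sharedRefCount = sharedRefCount;
        sharedRefCount.incrementAndGet(); // acquire on construction
    }

    @Override
    public void close() {
        // Only the first close() wins the compare-and-set, so the
        // reference is released exactly once even under concurrent calls.
        if (closed.compareAndSet(false, true)) {
            sharedRefCount.decrementAndGet();
        }
    }
}
```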

Why this is safe

  • Data sensitivity: PK-range data is account-level metadata; not per-credential, not per-consistency-level, not per-preferred-region.
  • Single-flight: AsyncCacheNonBlocking already enforces a single in-flight fetch per key (AsyncLazyWithRefresh). Sharing the instance strengthens that to "one in-flight /pkranges per (account, container) across all clients" — a stronger guarantee than today.
  • Endpoint identity: Two clients only collide when serviceEndpoint.toString() is equal, which means they target the same account.
  • Back-compat: The two-arg constructor still exists; it resolves the endpoint from the client. Existing tests that mock RxDocumentClientImpl (mock returns null endpoint) silently fall back to an isolated cache — identical behaviour to today.
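
To illustrate the single-flight point, here is a generic sketch of one fetch per key shared by several callers. It is not how AsyncCacheNonBlocking or AsyncLazyWithRefresh are implemented; `SingleFlightCache` is a hypothetical name, and the sketch omits refresh and failure handling.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical illustration of single-flight lookups; not the SDK's cache.
final class SingleFlightCache<V> {
    private final ConcurrentHashMap<String, CompletableFuture<V>> byKey = new ConcurrentHashMap<>();

    // The first caller for a key starts the fetch; later callers, whichever
    // client they belong to, receive the same future and do not fetch again.
    CompletableFuture<V> get(String key, Supplier<CompletableFuture<V>> fetcher) {
        return byKey.computeIfAbsent(key, k -> fetcher.get());
    }
}
```

When two clients share one such instance, the second client's fetcher is never invoked for a key the first client has already requested, which is the behaviour the cross-client sharing test asserts.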

Opt-out

System property COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED=false restores per-client private caches.
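
A sketch of how a default-true boolean system property of this kind can be read. Only the property name comes from this PR; the `SharedCacheFlag` helper and its parsing rules are assumptions and may differ from what Configs.java actually does.

```java
// Hypothetical helper; only the property name is taken from the PR.
final class SharedCacheFlag {
    static final String PROPERTY = "COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED";

    // Sharing is on unless the property is present and does not parse as "true".
    static boolean isSharingEnabled() {
        String raw = System.getProperty(PROPERTY);
        return raw == null || Boolean.parseBoolean(raw.trim());
    }
}
```

The property can be supplied on the JVM command line as `-DCOSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED=false`, or set with `System.setProperty` before any client is created. Whether the SDK reads the value once or on every client construction is not stated here.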

Reference

The Python SDK uses essentially this pattern (sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py) — module-level endpoint-keyed dicts with refcounted cleanup. This PR adapts that to Java idioms: ConcurrentHashMap.compute instead of explicit RLock, Closeable instead of __del__.

Files

| File | Change |
| --- | --- |
| caches/SharedRoutingMapCacheRegistry.java | NEW: process-wide registry singleton |
| caches/RxPartitionKeyRangeCache.java | Endpoint ctor, registry-backed storage, idempotent close() |
| Configs.java | New system property |
| RxDocumentClientImpl.java | Pass endpoint to ctor, release cache on close |
| caches/SharedRoutingMapCacheRegistryTest.java | NEW: 8 unit tests inc. 32-thread concurrency stress |
| caches/RxPartitionKeyRangeCacheTest.java | +3 tests: cross-client sharing, cross-endpoint isolation, close idempotency |
| CHANGELOG.md | Entry under 4.82.0-beta.1 |

Test plan

  • mvn compile (azure-cosmos)
  • mvn checkstyle:check spotbugs:check (azure-cosmos)
  • ✅ All cache unit tests pass (18 cache tests, 0 failures)
  • ChangeFeedFetcherTest passes (14 tests)

Key behavioural tests

  • twoCachesForSameEndpointShareRoutingMapStorage — populates routing map from client A, then asserts client B serves the same lookup with clientB.readPartitionKeyRanges invoked zero times.
  • cachesForDifferentEndpointsDoNotShareStorage — both clients invoke their own readPartitionKeyRanges exactly once.
  • closeIsIdempotent — repeated close() calls do not drive refcount negative.
  • concurrentAcquireAndReleaseProducesConsistentRefcount — 32 threads × 200 ops, refcount ends at 0.
  • disabledFlagReturnsIsolatedCachesAndPreservesRegistryEmpty — opt-out preserves pre-sharing behaviour.
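
The shape of the concurrent acquire/release stress check can be sketched as below. This is not the PR's test: `RefCountStress` is a hypothetical stand-in that keeps a plain per-endpoint counter in a ConcurrentHashMap and uses a placeholder endpoint string.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the registry stress test described above.
final class RefCountStress {

    // Runs paired acquire/release operations and returns the final refcount.
    static int run(int threads, int opsPerThread) {
        ConcurrentHashMap<String, Integer> refCounts = new ConcurrentHashMap<>();
        String endpoint = "https://example-account.invalid:443/"; // placeholder
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                try {
                    start.await(); // release all workers at once
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < opsPerThread; i++) {
                    // acquire: create at 1 or increment atomically
                    refCounts.compute(endpoint, (k, n) -> n == null ? 1 : n + 1);
                    // release: decrement, removing the entry at zero
                    refCounts.compute(endpoint, (k, n) -> (n == null || n <= 1) ? null : n - 1);
                }
            });
        }

        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return refCounts.getOrDefault(endpoint, 0);
    }
}
```

Every worker pairs each acquire with a release, so a correct atomic update leaves the count at zero regardless of interleaving.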

Breaking changes

None. The public API of RxPartitionKeyRangeCache (an implementation package class) gains a new ctor overload and now implements Closeable. The two-arg ctor signature is preserved. No customer-visible APIs change.

…ing the same account

Move the partition-key-range routing-map cache from per-CosmosClient to a
process-wide, refcounted registry keyed by service endpoint. Multiple
CosmosClient / CosmosAsyncClient instances in the same JVM targeting the
same Cosmos account now share a single AsyncCacheNonBlocking instance for
collection -> CollectionRoutingMap, eliminating duplicate routing-map
memory and redundant /pkranges fetches.

Design

- New SharedRoutingMapCacheRegistry (process-wide singleton) holds an
  AsyncCacheNonBlocking per endpoint URL plus an AtomicInteger refcount.
  All state transitions go through ConcurrentHashMap.compute, giving
  atomic per-key check-and-update without a global lock.
- RxPartitionKeyRangeCache: new ctor accepts the service endpoint;
  underlying routingMapCache is obtained from the registry. Implements
  Closeable; close() releases this client's reference and is idempotent.
- RxDocumentClientImpl: passes serviceEndpoint to the cache ctor and
  releases the cache reference in its close() path.
- Opt-out: COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED=false restores
  the pre-sharing behaviour (each client owns a private cache).

Why this is safe

- PK-range data is account-level metadata, not credential-bound.
- AsyncCacheNonBlocking already enforces single-flight per key; sharing
  the instance strengthens that to "single in-flight /pkranges per
  (account, container) across all clients".
- The two-arg back-compat ctor resolves the endpoint from the client, so
  existing mocked tests continue to work (mock returns null endpoint ->
  isolated cache, matching today's behaviour).

Tests

- New SharedRoutingMapCacheRegistryTest: acquire/release sharing,
  refcount eviction, idempotent release, null-endpoint isolation,
  opt-out flag, 32-thread concurrent acquire/release stress.
- New RxPartitionKeyRangeCacheTest cases: two caches at same endpoint
  share storage (verified by mock /pkranges call count = 1, not 2),
  caches at different endpoints stay independent, close() is idempotent.
- Existing 7 RxPartitionKeyRangeCacheTest cases unchanged and passing.

Reference

Pattern matches Python (sdk/cosmos/azure-cosmos/azure/cosmos/_routing/
routing_map_provider.py) which uses module-level endpoint-keyed dicts
with refcounted cleanup. Adapted to Java idioms (ConcurrentHashMap.compute
instead of explicit RLock, Closeable instead of __del__).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 18, 2026 18:42
@xinlian12 xinlian12 requested review from a team and kirankumarkolli as code owners June 18, 2026 18:42

Copilot AI left a comment

Contributor

Pull request overview

This PR reduces duplicated routing-map cache memory and redundant /pkranges requests by sharing the storage layer of RxPartitionKeyRangeCache across CosmosClient / CosmosAsyncClient instances that target the same Cosmos account (keyed by service endpoint), while keeping the per-client fetch path unchanged. The shared cache is managed by a process-wide, refcounted registry and can be disabled via a new system property for opt-out.

Changes:

  • Introduces SharedRoutingMapCacheRegistry (endpoint-keyed, refcounted) to share AsyncCacheNonBlocking<String, CollectionRoutingMap> across clients.
  • Updates RxPartitionKeyRangeCache to acquire shared storage by endpoint and to implement Closeable for refcount release on client shutdown.
  • Wires RxDocumentClientImpl.close() to release the cache reference, adds config flag plumbing, and adds targeted unit tests + changelog entry.
| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java | Passes endpoint into the cache ctor and releases the cache reference during client close. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java | Adds COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED flag (default true). |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/caches/SharedRoutingMapCacheRegistry.java | New process-wide singleton registry for shared routing-map cache storage with refcounted eviction. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCache.java | Splits "storage" vs "fetcher" by sourcing storage from the shared registry and adding close() ref-release. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the new sharing behavior and opt-out property. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/caches/SharedRoutingMapCacheRegistryTest.java | New unit tests validating sharing, eviction, disabled behavior, and concurrency refcount correctness. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCacheTest.java | Adds tests validating cross-client sharing, cross-endpoint isolation, and idempotent close behavior. |

Copilot's findings

  • Files reviewed: 7/7 changed files
  • Comments generated: 1

}

if (this.partitionKeyRangeCache != null) {
    logger.info("Releasing shared PartitionKeyRangeCache reference ...");
xinlian12 changed the title from "[Cosmos] Share PartitionKeyRangeCache across CosmosClients targeting the same account" to "[NO REVIEW][Cosmos] Share PartitionKeyRangeCache across CosmosClients targeting the same account" Jun 18, 2026