[NO REVIEW][Cosmos] Share PartitionKeyRangeCache across CosmosClients targeting the same account #49560
Open
xinlian12 wants to merge 1 commit into
…ing the same account

Move the partition-key-range routing-map cache from per-CosmosClient to a process-wide, refcounted registry keyed by service endpoint. Multiple CosmosClient / CosmosAsyncClient instances in the same JVM targeting the same Cosmos account now share a single AsyncCacheNonBlocking instance for collection -> CollectionRoutingMap, eliminating duplicate routing-map memory and redundant /pkranges fetches.

Design
- New SharedRoutingMapCacheRegistry (process-wide singleton) holds an AsyncCacheNonBlocking per endpoint URL plus an AtomicInteger refcount. All state transitions go through ConcurrentHashMap.compute, giving atomic per-key check-and-update without a global lock.
- RxPartitionKeyRangeCache: new ctor accepts the service endpoint; underlying routingMapCache is obtained from the registry. Implements Closeable; close() releases this client's reference and is idempotent.
- RxDocumentClientImpl: passes serviceEndpoint to the cache ctor and releases the cache reference in its close() path.
- Opt-out: COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED=false restores the pre-sharing behaviour (each client owns a private cache).

Why this is safe
- PK-range data is account-level metadata, not credential-bound.
- AsyncCacheNonBlocking already enforces single-flight per key; sharing the instance strengthens that to "single in-flight /pkranges per (account, container) across all clients".
- The two-arg back-compat ctor resolves the endpoint from the client, so existing mocked tests continue to work (mock returns null endpoint -> isolated cache, matching today's behaviour).

Tests
- New SharedRoutingMapCacheRegistryTest: acquire/release sharing, refcount eviction, idempotent release, null-endpoint isolation, opt-out flag, 32-thread concurrent acquire/release stress.
- New RxPartitionKeyRangeCacheTest cases: two caches at same endpoint share storage (verified by mock /pkranges call count = 1, not 2), caches at different endpoints stay independent, close() is idempotent.
- Existing 7 RxPartitionKeyRangeCacheTest cases unchanged and passing.

Reference
Pattern matches Python (sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py), which uses module-level endpoint-keyed dicts with refcounted cleanup. Adapted to Java idioms (ConcurrentHashMap.compute instead of explicit RLock, Closeable instead of __del__).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
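The registry design in the commit message (endpoint-keyed entries, a refcount per entry, every transition routed through `ConcurrentHashMap.compute`) can be illustrated with a minimal sketch. This is not the PR's `SharedRoutingMapCacheRegistry`: the class name `RefCountedRegistrySketch`, its method names, and the placeholder endpoint are assumptions made for illustration, and a generic value type stands in for `AsyncCacheNonBlocking`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for the registry described above; not the SDK class.
final class RefCountedRegistrySketch<V> {
    // One shared value plus the number of live holders.
    private static final class Entry<V> {
        final V value;
        final AtomicInteger refCount = new AtomicInteger();
        Entry(V value) { this.value = value; }
    }

    private final ConcurrentHashMap<String, Entry<V>> entries = new ConcurrentHashMap<>();

    // Creates the entry on first acquire, otherwise bumps the refcount.
    // A null endpoint is never registered and gets an isolated value.
    V acquire(String endpoint, Supplier<V> factory) {
        if (endpoint == null) {
            return factory.get();
        }
        Entry<V> entry = entries.compute(endpoint, (key, existing) -> {
            Entry<V> e = (existing != null) ? existing : new Entry<>(factory.get());
            e.refCount.incrementAndGet();
            return e;
        });
        return entry.value;
    }

    // Drops one reference; returning null from compute removes the mapping.
    void release(String endpoint) {
        if (endpoint == null) {
            return;
        }
        entries.compute(endpoint, (key, existing) -> {
            if (existing == null) {
                return null; // nothing registered, nothing to release
            }
            return existing.refCount.decrementAndGet() <= 0 ? null : existing;
        });
    }

    int refCount(String endpoint) {
        Entry<V> e = entries.get(endpoint);
        return e == null ? 0 : e.refCount.get();
    }

    public static void main(String[] args) {
        RefCountedRegistrySketch<Object> registry = new RefCountedRegistrySketch<>();
        String endpoint = "https://example.invalid:443/"; // placeholder endpoint
        Object a = registry.acquire(endpoint, Object::new);
        Object b = registry.acquire(endpoint, Object::new);
        System.out.println(a == b); // prints "true": both holders share one instance
        registry.release(endpoint);
        registry.release(endpoint);
        System.out.println(registry.refCount(endpoint)); // prints "0": entry evicted
    }
}
```

Because creation, increment, decrement and eviction all happen inside the `compute` remapping function, two threads racing on the same endpoint cannot observe a half-updated entry, and unrelated endpoints do not contend on a global lock.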
Pull request overview
This PR reduces duplicated routing-map cache memory and redundant /pkranges requests by sharing the storage layer of RxPartitionKeyRangeCache across CosmosClient / CosmosAsyncClient instances that target the same Cosmos account (keyed by service endpoint), while keeping the per-client fetch path unchanged. The shared cache is managed by a process-wide, refcounted registry and can be disabled via a new system property for opt-out.
Changes:
- Introduces `SharedRoutingMapCacheRegistry` (endpoint-keyed, refcounted) to share `AsyncCacheNonBlocking<String, CollectionRoutingMap>` across clients.
- Updates `RxPartitionKeyRangeCache` to acquire shared storage by endpoint and to implement `Closeable` for refcount release on client shutdown.
- Wires `RxDocumentClientImpl.close()` to release the cache reference, adds config flag plumbing, and adds targeted unit tests + changelog entry.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java | Passes endpoint into the cache ctor and releases the cache reference during client close. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java | Adds COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED flag (default true). |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/caches/SharedRoutingMapCacheRegistry.java | New process-wide singleton registry for shared routing-map cache storage with refcounted eviction. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCache.java | Splits “storage” vs “fetcher” by sourcing storage from the shared registry and adding close() ref-release. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the new sharing behavior and opt-out property. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/caches/SharedRoutingMapCacheRegistryTest.java | New unit tests validating sharing, eviction, disabled behavior, and concurrency refcount correctness. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/caches/RxPartitionKeyRangeCacheTest.java | Adds tests validating cross-client sharing, cross-endpoint isolation, and idempotent close behavior. |
Copilot's findings
- Files reviewed: 7/7 changed files
- Comments generated: 1
    }

    if (this.partitionKeyRangeCache != null) {
        logger.info("Releasing shared PartitionKeyRangeCache reference ...");
Description
Today every `CosmosClient` / `CosmosAsyncClient` owns its own `RxPartitionKeyRangeCache`, even when many clients in the same JVM target the same Cosmos account (a common pattern for multi-tenant / multi-credential apps and frameworks that recreate clients). The routing-map data is duplicated N times and `/pkranges` calls fan out N times for the same containers.

This PR moves the routing-map storage to a process-wide, refcounted registry keyed by service endpoint. The fetching path (which depends on the per-client network stack, auth, collection cache, diagnostics) stays per-client.
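The split between shared storage and a per-client fetch path can be sketched as follows. This is an illustration under stated assumptions, not the SDK code: a `ConcurrentHashMap` of `CompletableFuture`s stands in for `AsyncCacheNonBlocking`, a list of strings stands in for `CollectionRoutingMap`, and the names `SharedStorageSketch`, `Storage`, `ClientCache` and `lookup` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class SharedStorageSketch {
    // Shared storage layer: collection id -> in-flight or completed lookup.
    static final class Storage {
        final ConcurrentHashMap<String, CompletableFuture<List<String>>> byCollection =
            new ConcurrentHashMap<>();
    }

    // Per-client layer: each client keeps its own fetch function.
    static final class ClientCache {
        private final Storage storage;
        private final Function<String, CompletableFuture<List<String>>> fetcher;
        final AtomicInteger fetchCalls = new AtomicInteger();

        ClientCache(Storage storage, Function<String, CompletableFuture<List<String>>> fetcher) {
            this.storage = storage;
            this.fetcher = fetcher;
        }

        // Only the first caller for a collection id runs its fetcher;
        // later callers, from any client, reuse the stored future.
        CompletableFuture<List<String>> lookup(String collectionId) {
            return storage.byCollection.computeIfAbsent(collectionId, id -> {
                fetchCalls.incrementAndGet();
                return fetcher.apply(id);
            });
        }
    }

    public static void main(String[] args) {
        Storage shared = new Storage(); // one storage instance per account
        Function<String, CompletableFuture<List<String>>> fakeFetch =
            id -> CompletableFuture.completedFuture(Arrays.asList("range-0", "range-1"));

        ClientCache clientA = new ClientCache(shared, fakeFetch);
        ClientCache clientB = new ClientCache(shared, fakeFetch);

        List<String> fromA = clientA.lookup("collection-rid").join();
        List<String> fromB = clientB.lookup("collection-rid").join();

        System.out.println(fromA == fromB); // prints "true"
        System.out.println(clientA.fetchCalls.get() + " " + clientB.fetchCalls.get()); // prints "1 0"
    }
}
```

The second client is served from storage populated by the first, so its fetcher is never invoked for that collection.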
Design
Split `RxPartitionKeyRangeCache` into two layers:
- Storage: `AsyncCacheNonBlocking<String, CollectionRoutingMap>`. Account-level data, naturally shareable. Now obtained from `SharedRoutingMapCacheRegistry` (process-wide singleton) keyed by the service endpoint URL string.
- Fetcher: the `/pkranges` fetch path; depends on per-client `RxDocumentClientImpl`, `RxCollectionCache`, diagnostics. Unchanged.
Concurrency model
All registry state transitions go through `ConcurrentHashMap.compute(...)`, which provides atomic per-key check-and-update.
Lifecycle
- `RxPartitionKeyRangeCache` ctor acquires from the registry (bumps refcount).
- `RxPartitionKeyRangeCache` now implements `Closeable`; `close()` releases the refcount and is idempotent (guarded by `AtomicBoolean`).
- `RxDocumentClientImpl.close()` calls `LifeCycleUtils.closeQuietly(partitionKeyRangeCache)`.
Why this is safe
- `AsyncCacheNonBlocking` already enforces a single in-flight fetch per key (`AsyncLazyWithRefresh`). Sharing the instance strengthens that to "one in-flight `/pkranges` per `(account, container)` across all clients", a stronger guarantee than today.
- Clients share storage only when `serviceEndpoint.toString()` is equal, which means they target the same account.
- Tests that mock `RxDocumentClientImpl` (mock returns null endpoint) silently fall back to an isolated cache, which is identical behaviour to today.
Opt-out
System property `COSMOS.SHARED_PARTITION_KEY_RANGE_CACHE_ENABLED=false` restores per-client private caches.
Reference
The Python SDK uses essentially this pattern (`sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py`): module-level endpoint-keyed dicts with refcounted cleanup. This PR adapts that to Java idioms: `ConcurrentHashMap.compute` instead of explicit `RLock`, `Closeable` instead of `__del__`.
Files
- `caches/SharedRoutingMapCacheRegistry.java`
- `caches/RxPartitionKeyRangeCache.java` (`close()`)
- `Configs.java`
- `RxDocumentClientImpl.java`
- `caches/SharedRoutingMapCacheRegistryTest.java`
- `caches/RxPartitionKeyRangeCacheTest.java`
- `CHANGELOG.md`
Test plan
- `mvn compile` (azure-cosmos)
- `mvn checkstyle:check spotbugs:check` (azure-cosmos)
- `ChangeFeedFetcherTest` passes (14 tests)
Key behavioural tests
- `twoCachesForSameEndpointShareRoutingMapStorage`: populates routing map from client A, then asserts client B serves the same lookup with `clientB.readPartitionKeyRanges` invoked zero times.
- `cachesForDifferentEndpointsDoNotShareStorage`: both clients invoke their own `readPartitionKeyRanges` exactly once.
- `closeIsIdempotent`: repeated `close()` calls do not drive refcount negative.
- `concurrentAcquireAndReleaseProducesConsistentRefcount`: 32 threads × 200 ops, refcount ends at 0.
- `disabledFlagReturnsIsolatedCachesAndPreservesRegistryEmpty`: opt-out preserves pre-sharing behaviour.
Breaking changes
None. The public API of `RxPartitionKeyRangeCache` (an `implementation` package class) gains a new ctor overload and now implements `Closeable`. The two-arg ctor signature is preserved. No customer-visible APIs change.
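The lifecycle contract described above (the ctor takes a reference, `close()` gives it back exactly once, repeated `close()` calls are harmless) can be sketched with an `AtomicBoolean` guard. This is a minimal illustration, not the PR's code: `IdempotentReleaseSketch` is a hypothetical name, and a bare `AtomicInteger` stands in for the registry's refcount.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder of one reference to a shared, refcounted resource.
final class IdempotentReleaseSketch implements Closeable {
    private final AtomicInteger sharedRefCount; // stand-in for the registry refcount
    private final AtomicBoolean closed = new AtomicBoolean(false);

    IdempotentReleaseSketch(AtomicInteger sharedRefCount) {
        this.sharedRefCount = sharedRefCount;
        sharedRefCount.incrementAndGet(); // acquire on construction
    }

    @Override
    public void close() {
        // compareAndSet succeeds for exactly one caller, so the refcount
        // is decremented once even if close() is called repeatedly or concurrently.
        if (closed.compareAndSet(false, true)) {
            sharedRefCount.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        AtomicInteger refCount = new AtomicInteger();
        IdempotentReleaseSketch holder = new IdempotentReleaseSketch(refCount);
        holder.close();
        holder.close(); // second call is a no-op
        System.out.println(refCount.get()); // prints "0", not "-1"
    }
}
```

Without the guard, a second `close()` would release a reference the holder no longer owns and could evict a cache that other clients are still using.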