Fix #3920: Paginate /glossaryTerms/assets/counts and use bulk aggregation #27934

sonika-shah wants to merge 1 commit into main from
Conversation
…th bulk aggregation

The glossary `/assets/counts` API listed all glossary term FQNs and issued one search count query per term (N+1). For large glossaries this made the call slow and unsuitable for the ontology view, which needs to fetch and render counts in batches.

This change:

- Replaces the per-term loop with a single search aggregation over `tags.tagFQN` and looks up each term's count from the in-memory result, matching the `DataProductRepository` pattern.
- Adds optional `limit` and `offset` query parameters to the API and slices the ordered FQN list before returning, so callers can drive paged loading.
- Returns the total number of glossary terms (before slicing) in the `X-Total-Count` response header for clients that need a total count.
- Adds a typed `getGlossaryTermsAssetCountsPage` helper on the frontend that reads the `X-Total-Count` header for paginated callers.

Closes #3920
```java
private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn() {
  String queryFilter = QueryFilterBuilder.buildGenericAssetsCountFilter(TAGS_FQN, true);
  return inheritedFieldEntitySearch.getAggregatedCountsByField(
      TAGS_FQN, queryFilter, EntityBuilderConstant.MAX_AGGREGATE_SIZE);
}
```
⚠️ Performance: Global aggregation fetched on every paginated call ignores parent filter
Every call to getGlossaryTermsAssetsCount invokes fetchAssetCountsByGlossaryTermFqn() which runs an unfiltered aggregation on tags.tagFQN across ALL indexed entities, bounded only by MAX_AGGREGATE_SIZE (10,000 buckets). This has two implications:
- Pagination doesn't reduce ES work — requesting `limit=2` still triggers the full aggregation. The perf win comes from replacing N searches with 1, but the single aggregation is still global.
- Silent truncation at 10k buckets — the aggregation includes ALL tag FQNs (classification tags + glossary terms). On large deployments with >10k distinct tags, some glossary term FQNs may be missing from the aggregation result, causing them to silently report 0 assets.
Consider scoping the aggregation to only the FQNs in pageFqns (e.g., via a terms filter on the tag FQN field) or at least filtering to the parent prefix. This would both reduce the bucket count and make paginated calls genuinely lighter.
Suggested fix:
```java
private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn(List<String> fqns) {
  // Scope the aggregation to only the requested FQNs
  String queryFilter =
      QueryFilterBuilder.buildFilteredAssetsCountFilter(TAGS_FQN, fqns, true);
  return inheritedFieldEntitySearch.getAggregatedCountsByField(
      TAGS_FQN, queryFilter, fqns.size());
}
```
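The silent-zero failure mode called out above can be illustrated with a small, self-contained sketch (the map contents and term FQNs are hypothetical):

```java
import java.util.Map;

// Illustration only: a term whose bucket fell outside the aggregation's
// 10k-bucket window is indistinguishable from a term with no assets.
public class SilentZero {
  public static void main(String[] args) {
    // Pretend "g.termB" existed in the index but its bucket was truncated.
    Map<String, Integer> aggregated = Map.of("g.termA", 7);
    System.out.println(aggregated.getOrDefault("g.termA", 0)); // 7
    System.out.println(aggregated.getOrDefault("g.termB", 0)); // 0 — silently wrong
  }
}
```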
```java
          description =
              "Maximum number of glossary terms to return. When omitted, all matching terms are returned.")
      @QueryParam("limit")
      @Min(value = 0, message = "limit must be >= 0")
```
💡 Edge Case: limit=0 returns empty result — may confuse callers expecting metadata only
The @Min(0) annotation on the limit parameter allows limit=0. In sliceFqns, this produces an empty sublist, and because pageFqns.isEmpty() is true, the method returns early with an empty counts map (but a valid total). While technically correct, limit=0 is an unusual API contract — some callers might use it expecting to fetch only the total count. If that's intentional, it's fine as-is. If not, consider changing @Min to 1.
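A minimal sketch of the slicing semantics in question — `sliceFqns` itself is not shown in the diff, so the helper below is a hypothetical stand-in:

```java
import java.util.List;

// Hypothetical stand-in for the repository's slicing: with limit=0 the
// sublist is empty, so the endpoint returns an empty counts map while the
// X-Total-Count header still carries the pre-slice total.
public class SliceSketch {
  static List<String> slice(List<String> fqns, Integer limit, Integer offset) {
    int from = Math.min(offset == null ? 0 : offset, fqns.size());
    int to = limit == null ? fqns.size() : Math.min(from + limit, fqns.size());
    return fqns.subList(from, to);
  }

  public static void main(String[] args) {
    List<String> fqns = List.of("g.a", "g.b", "g.c");
    System.out.println(slice(fqns, 0, 0));   // [] — limit=0 is allowed by @Min(0)
    System.out.println(slice(fqns, 2, 2));   // [g.c]
    System.out.println(slice(fqns, 2, 10));  // [] — offset past the end
  }
}
```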
```java
return listAll(getFields("fullyQualifiedName"), new ListFilter(null)).stream()
    .map(GlossaryTerm::getFullyQualifiedName)
    .sorted()
    .collect(Collectors.toList());
```
💡 Quality: listGlossaryTermFqns loads all entities into memory on every call
When parent is null, listGlossaryTermFqns calls listAll(...) which loads every GlossaryTerm entity in the system into memory just to extract FQNs. For large deployments with thousands of glossary terms, this creates GC pressure on every paginated request. Consider using a lighter DAO query that returns only FQNs (a projection), or caching the sorted FQN list with a short TTL.
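One shape the suggested short-TTL cache could take — a sketch under stated assumptions, not the repository's actual code; the loader stands in for the existing `listAll(...)` mapping:

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: cache the sorted FQN list for a short TTL so repeated
// paginated requests don't reload every GlossaryTerm entity. A DAO projection
// that selects only fullyQualifiedName would be an alternative fix.
public class CachedFqnList {
  private final Supplier<List<String>> loader; // e.g. the listAll(...) mapping above
  private final long ttlMillis;
  private volatile List<String> cached;
  private volatile long loadedAt;

  CachedFqnList(Supplier<List<String>> loader, long ttlMillis) {
    this.loader = loader;
    this.ttlMillis = ttlMillis;
  }

  List<String> get() {
    long now = System.currentTimeMillis();
    if (cached == null || now - loadedAt > ttlMillis) {
      cached = loader.get(); // refresh the sorted FQN list
      loadedAt = now;
    }
    return cached;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    CachedFqnList cache =
        new CachedFqnList(() -> { calls[0]++; return List.of("g.a", "g.b"); }, 60_000);
    cache.get();
    cache.get(); // served from cache, loader not invoked again
    System.out.println(cache.get() + " loads=" + calls[0]); // [g.a, g.b] loads=1
  }
}
```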
Code Review
Pull request overview
This PR improves performance of the glossary term asset-counts endpoint by replacing per-term search calls with a single search aggregation, and adds optional pagination (limit/offset) while exposing the pre-slice total via the X-Total-Count response header. The UI REST layer is extended with a typed helper that reads X-Total-Count for paginated callers while preserving the existing function signature for current consumers.
Changes:

- Backend: `/v1/glossaryTerms/assets/counts` now supports `limit`/`offset` and returns `X-Total-Count`.
- Backend: replaces the per-term loop with a bulk aggregation over `tags.tagFQN` and slices results server-side.
- UI + ITs: adds a paginated UI helper and an integration test covering page slicing behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| openmetadata-ui/src/main/resources/ui/src/rest/glossaryAPI.ts | Adds typed params/response helpers, including a getGlossaryTermsAssetCountsPage that reads X-Total-Count. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java | Adds limit/offset query params and returns X-Total-Count header for the asset-counts endpoint. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java | Implements paged term list slicing and replaces per-term counting with a single aggregated counts query. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/GlossaryTermResourceIT.java | Adds an integration test validating pagination slicing behavior for the counts endpoint. |
```java
private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn() {
  String queryFilter = QueryFilterBuilder.buildGenericAssetsCountFilter(TAGS_FQN, true);
  return inheritedFieldEntitySearch.getAggregatedCountsByField(
      TAGS_FQN, queryFilter, EntityBuilderConstant.MAX_AGGREGATE_SIZE);
}
```
```diff
@@ -440,12 +444,24 @@ public Response getAllGlossaryTermsWithAssetsCount(
               "Filter by parent glossary or glossary term FQN. "
                   + "When provided, only returns asset counts for children whose FQN starts with this value.")
           @QueryParam("parent")
-          String parent) {
+          String parent,
+      @Parameter(
+              description =
+                  "Maximum number of glossary terms to return. When omitted, all matching terms are returned.")
+          @QueryParam("limit")
+          @Min(value = 0, message = "limit must be >= 0")
+          @Max(value = 10000, message = "limit must be <= 10000")
+          Integer limit,
+      @Parameter(description = "Offset into the ordered list of glossary terms to start from.")
+          @QueryParam("offset")
+          @Min(value = 0, message = "offset must be >= 0")
+          Integer offset) {
     OperationContext operationContext =
         new OperationContext(entityType, MetadataOperation.VIEW_ALL);
     authorizer.authorize(securityContext, operationContext, getResourceContext());
-    java.util.Map<String, Integer> result = repository.getAllGlossaryTermsWithAssetsCount(parent);
-    return Response.ok(result).build();
+    GlossaryTermRepository.GlossaryTermAssetCountResult result =
+        repository.getGlossaryTermsAssetsCount(parent, limit, offset);
+    return Response.ok(result.counts()).header(TOTAL_COUNT_HEADER, result.total()).build();
```
```java
@Test
void get_assetsCountsPaginationSlicesAndPreservesTotal(TestNamespace ns) throws Exception {
  OpenMetadataClient client = SdkClients.adminClient();

  CreateGlossary createGlossary =
      new CreateGlossary()
          .withName(ns.prefix("asset_count_pagination_glossary"))
          .withDescription("Glossary for asset count pagination test");
  Glossary glossary = client.glossaries().create(createGlossary);

  int termCount = 5;
  java.util.List<String> createdTermFqns = new java.util.ArrayList<>();
  for (int i = 0; i < termCount; i++) {
    CreateGlossaryTerm req =
        new CreateGlossaryTerm()
            .withName(ns.prefix("count_pagination_term_" + i))
            .withGlossary(glossary.getFullyQualifiedName())
            .withDescription("Pagination term " + i);
    createdTermFqns.add(createEntity(req).getFullyQualifiedName());
  }

  String fullCounts = getAssetCounts(client, glossary.getFullyQualifiedName());
  ObjectMapper mapper = new ObjectMapper();
  JsonNode unpaged = mapper.readTree(fullCounts);
  for (String fqn : createdTermFqns) {
    assertTrue(unpaged.has(fqn), "Unpaged response should contain term " + fqn);
  }

  String firstPageBody = getAssetCountsWithPaging(client, glossary.getFullyQualifiedName(), 2, 0);
  JsonNode firstPage = mapper.readTree(firstPageBody);
  assertEquals(
      2,
      firstPage.size(),
      "First page should contain at most `limit` glossary terms when limit=2");

  String secondPageBody =
      getAssetCountsWithPaging(client, glossary.getFullyQualifiedName(), 2, 2);
  JsonNode secondPage = mapper.readTree(secondPageBody);
  assertTrue(
      secondPage.size() <= 2,
      "Second page should contain at most `limit` glossary terms when limit=2");

  java.util.Set<String> firstPageKeys = new java.util.HashSet<>();
  firstPage.fieldNames().forEachRemaining(firstPageKeys::add);
  secondPage
      .fieldNames()
      .forEachRemaining(
          key ->
              assertFalse(
                  firstPageKeys.contains(key),
                  "Pages should not overlap: " + key + " appeared on both pages"));

  String beyondEndBody =
      getAssetCountsWithPaging(client, glossary.getFullyQualifiedName(), 5, 10000);
  JsonNode beyondEnd = mapper.readTree(beyondEndBody);
  assertEquals(0, beyondEnd.size(), "Offset past the end should return an empty asset-count map");
}
```
|
|
🔴 Playwright Results — 1 failure(s), 13 flaky

✅ 3990 passed · ❌ 1 failed · 🟡 13 flaky · ⏭️ 86 skipped

Genuine Failures (failed on all attempts): ❌ (failure details attached as screenshots, omitted)
Summary

- Replaces the per-term search loop in `GlossaryTermResource#getAllGlossaryTermsWithAssetsCount` with a single aggregation over `tags.tagFQN` (matching `DataProductRepository#getAllDataProductsWithAssetsCount`) — fixes the slow response on large glossaries.
- Adds optional `limit` and `offset` query parameters to `/v1/glossaryTerms/assets/counts`. The response body keeps its existing `Map<fqn, count>` shape (sliced when paged), and the total number of glossary terms before slicing is returned in the `X-Total-Count` HTTP header so the ontology view can drive batched loading.
- Adds a `getGlossaryTermsAssetCountsPage` helper in `glossaryAPI.ts` that reads `X-Total-Count` for paginated callers; the existing `getGlossaryTermsAssetCounts(parent)` signature is preserved (now also accepts a `{ parent, limit, offset }` object) so current callers are unaffected.

Issue

Closes open-metadata/openmetadata-collate#3920
Test plan

- `mvn -pl openmetadata-service compile` — passes locally.
- `GlossaryTermResourceIT#get_assetsCountsPaginationSlicesAndPreservesTotal` — creates 5 terms in a glossary and verifies that `limit=2, offset=0` returns at most 2 terms, `limit=2, offset=2` returns at most 2 terms with no overlap with page 1, and an `offset` past the end returns an empty map.
- Existing `get_assetsCountsWithParentFilter` and `get_assetsCountsWithNoParent` tests continue to pass — backwards-compatible default behaviour.
- Call `/v1/glossaryTerms/assets/counts?limit=10&offset=0` and confirm the `X-Total-Count` header.