Skip to content

Fix #3920: Paginate /glossaryTerms/assets/counts and use bulk aggregation#27934

Open
sonika-shah wants to merge 1 commit intomainfrom
fix/3920-glossary-asset-count-pagination
Open

Fix #3920: Paginate /glossaryTerms/assets/counts and use bulk aggregation#27934
sonika-shah wants to merge 1 commit intomainfrom
fix/3920-glossary-asset-count-pagination

Conversation

@sonika-shah
Copy link
Copy Markdown
Collaborator

@sonika-shah sonika-shah commented May 6, 2026

Summary

  • Replaces the per-term search loop in GlossaryTermResource#getAllGlossaryTermsWithAssetsCount with a single aggregation over tags.tagFQN (matching DataProductRepository#getAllDataProductsWithAssetsCount) — fixes the slow response on large glossaries.
  • Adds optional limit and offset query parameters to /v1/glossaryTerms/assets/counts. The response body keeps its existing Map<fqn, count> shape (sliced when paged) and the total number of glossary terms before slicing is returned in the X-Total-Count HTTP header so the ontology view can drive batched loading.
  • Adds a typed getGlossaryTermsAssetCountsPage helper in glossaryAPI.ts that reads X-Total-Count for paginated callers; the existing getGlossaryTermsAssetCounts(parent) signature is preserved (now also accepts a { parent, limit, offset } object) so current callers are unaffected.

Issue

Closes open-metadata/openmetadata-collate#3920

Currently, the asset count API does not support pagination. As a result, it takes longer to respond when handling large data. Additionally, in the ontology view, we need to display terms along with their asset counts in a batched manner to improve overall performance and load all data.

Test plan

  • mvn -pl openmetadata-service compile — passes locally.
  • New integration test GlossaryTermResourceIT#get_assetsCountsPaginationSlicesAndPreservesTotal — creates 5 terms in a glossary, verifies that:
    • the unpaged response contains every created term,
    • limit=2, offset=0 returns at most 2 terms,
    • limit=2, offset=2 returns at most 2 terms with no overlap with page 1,
    • offset past the end returns an empty map.
  • Existing get_assetsCountsWithParentFilter and get_assetsCountsWithNoParent tests continue to pass — backwards-compatible default behaviour.
  • Manual verification: hit /v1/glossaryTerms/assets/counts?limit=10&offset=0 and confirm the X-Total-Count header.

…th bulk aggregation

The glossary `/assets/counts` API listed all glossary term FQNs and issued one
search count query per term (N+1). For large glossaries this made the call slow
and unsuitable for the ontology view, which needs to fetch and render counts in
batches.

This change:
- Replaces the per-term loop with a single search aggregation over `tags.tagFQN`
  and looks up each term's count from the in-memory result, matching the
  `DataProductRepository` pattern.
- Adds optional `limit` and `offset` query parameters to the API and slices the
  ordered FQN list before returning, so callers can drive paged loading.
- Returns the total number of glossary terms (before slicing) in the
  `X-Total-Count` response header for clients that need a total count.
- Adds a typed `getGlossaryTermsAssetCountsPage` helper on the frontend that
  reads the `X-Total-Count` header for paginated callers.

Closes #3920
Copilot AI review requested due to automatic review settings May 6, 2026 09:28
@sonika-shah sonika-shah requested a review from a team as a code owner May 6, 2026 09:28
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels May 6, 2026
Comment on lines +249 to 253
private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn() {
String queryFilter = QueryFilterBuilder.buildGenericAssetsCountFilter(TAGS_FQN, true);
return inheritedFieldEntitySearch.getAggregatedCountsByField(
TAGS_FQN, queryFilter, EntityBuilderConstant.MAX_AGGREGATE_SIZE);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance: Global aggregation fetched on every paginated call ignores parent filter

Every call to getGlossaryTermsAssetsCount invokes fetchAssetCountsByGlossaryTermFqn() which runs an unfiltered aggregation on tags.tagFQN across ALL indexed entities, bounded only by MAX_AGGREGATE_SIZE (10,000 buckets). This has two implications:

  1. Pagination doesn't reduce ES work — requesting limit=2 still triggers the full aggregation. The perf win comes from replacing N searches with 1, but the single aggregation is still global.
  2. Silent truncation at 10k buckets — the aggregation includes ALL tag FQNs (classification tags + glossary terms). On large deployments with >10k distinct tags, some glossary term FQNs may be missing from the aggregation result, causing them to silently report 0 assets.

Consider scoping the aggregation to only the FQNs in pageFqns (e.g., via a terms filter on the tag FQN field) or at least filtering to the parent prefix. This would both reduce the bucket count and make paginated calls genuinely lighter.

Suggested fix:

private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn(List<String> fqns) {
  // Scope the aggregation to only the requested FQNs
  String queryFilter = QueryFilterBuilder.buildFilteredAssetsCountFilter(
      TAGS_FQN, fqns, true);
  return inheritedFieldEntitySearch.getAggregatedCountsByField(
      TAGS_FQN, queryFilter, fqns.size());
}

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

description =
"Maximum number of glossary terms to return. When omitted, all matching terms are returned.")
@QueryParam("limit")
@Min(value = 0, message = "limit must be >= 0")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: limit=0 returns empty result — may confuse callers expecting metadata only

The @Min(0) annotation on the limit parameter allows limit=0. In sliceFqns, this produces an empty sublist, and because pageFqns.isEmpty() is true, the method returns early with an empty counts map (but a valid total). While technically correct, limit=0 is an unusual API contract — some callers might use it expecting to fetch only the total count. If that's intentional, it's fine as-is. If not, consider changing @Min to 1.

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

Comment on lines +230 to +233
return listAll(getFields("fullyQualifiedName"), new ListFilter(null)).stream()
.map(GlossaryTerm::getFullyQualifiedName)
.sorted()
.collect(Collectors.toList());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: listGlossaryTermFqns loads all entities into memory on every call

When parent is null, listGlossaryTermFqns calls listAll(...) which loads every GlossaryTerm entity in the system into memory just to extract FQNs. For large deployments with thousands of glossary terms, this creates GC pressure on every paginated request. Consider using a lighter DAO query that returns only FQNs (a projection), or caching the sorted FQN list with a short TTL.

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 6, 2026

Code Review ⚠️ Changes requested 0 resolved / 3 findings

Replaces per-term search with bulk aggregation and adds pagination to asset counts, but the current implementation performs global aggregation regardless of parent filters and loads all glossary terms into memory. Additionally, the limit=0 parameter handling may cause confusion for callers expecting metadata.

⚠️ Performance: Global aggregation fetched on every paginated call ignores parent filter

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java:249-253

Every call to getGlossaryTermsAssetsCount invokes fetchAssetCountsByGlossaryTermFqn() which runs an unfiltered aggregation on tags.tagFQN across ALL indexed entities, bounded only by MAX_AGGREGATE_SIZE (10,000 buckets). This has two implications:

  1. Pagination doesn't reduce ES work — requesting limit=2 still triggers the full aggregation. The perf win comes from replacing N searches with 1, but the single aggregation is still global.
  2. Silent truncation at 10k buckets — the aggregation includes ALL tag FQNs (classification tags + glossary terms). On large deployments with >10k distinct tags, some glossary term FQNs may be missing from the aggregation result, causing them to silently report 0 assets.

Consider scoping the aggregation to only the FQNs in pageFqns (e.g., via a terms filter on the tag FQN field) or at least filtering to the parent prefix. This would both reduce the bucket count and make paginated calls genuinely lighter.

Suggested fix
private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn(List<String> fqns) {
  // Scope the aggregation to only the requested FQNs
  String queryFilter = QueryFilterBuilder.buildFilteredAssetsCountFilter(
      TAGS_FQN, fqns, true);
  return inheritedFieldEntitySearch.getAggregatedCountsByField(
      TAGS_FQN, queryFilter, fqns.size());
}
💡 Edge Case: limit=0 returns empty result — may confuse callers expecting metadata only

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java:452

The @Min(0) annotation on the limit parameter allows limit=0. In sliceFqns, this produces an empty sublist, and because pageFqns.isEmpty() is true, the method returns early with an empty counts map (but a valid total). While technically correct, limit=0 is an unusual API contract — some callers might use it expecting to fetch only the total count. If that's intentional, it's fine as-is. If not, consider changing @Min to 1.

💡 Quality: listGlossaryTermFqns loads all entities into memory on every call

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java:230-233

When parent is null, listGlossaryTermFqns calls listAll(...) which loads every GlossaryTerm entity in the system into memory just to extract FQNs. For large deployments with thousands of glossary terms, this creates GC pressure on every paginated request. Consider using a lighter DAO query that returns only FQNs (a projection), or caching the sorted FQN list with a short TTL.

🤖 Prompt for agents
Code Review: Replaces per-term search with bulk aggregation and adds pagination to asset counts, but the current implementation performs global aggregation regardless of parent filters and loads all glossary terms into memory. Additionally, the `limit=0` parameter handling may cause confusion for callers expecting metadata.

1. ⚠️ Performance: Global aggregation fetched on every paginated call ignores parent filter
   Files: openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java:249-253

   Every call to `getGlossaryTermsAssetsCount` invokes `fetchAssetCountsByGlossaryTermFqn()` which runs an unfiltered aggregation on `tags.tagFQN` across ALL indexed entities, bounded only by `MAX_AGGREGATE_SIZE` (10,000 buckets). This has two implications:
   
   1. **Pagination doesn't reduce ES work** — requesting `limit=2` still triggers the full aggregation. The perf win comes from replacing N searches with 1, but the single aggregation is still global.
   2. **Silent truncation at 10k buckets** — the aggregation includes ALL tag FQNs (classification tags + glossary terms). On large deployments with >10k distinct tags, some glossary term FQNs may be missing from the aggregation result, causing them to silently report 0 assets.
   
   Consider scoping the aggregation to only the FQNs in `pageFqns` (e.g., via a `terms` filter on the tag FQN field) or at least filtering to the parent prefix. This would both reduce the bucket count and make paginated calls genuinely lighter.

   Suggested fix:
   private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn(List<String> fqns) {
     // Scope the aggregation to only the requested FQNs
     String queryFilter = QueryFilterBuilder.buildFilteredAssetsCountFilter(
         TAGS_FQN, fqns, true);
     return inheritedFieldEntitySearch.getAggregatedCountsByField(
         TAGS_FQN, queryFilter, fqns.size());
   }

2. 💡 Edge Case: limit=0 returns empty result — may confuse callers expecting metadata only
   Files: openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java:452

   The `@Min(0)` annotation on the `limit` parameter allows `limit=0`. In `sliceFqns`, this produces an empty sublist, and because `pageFqns.isEmpty()` is true, the method returns early with an empty counts map (but a valid total). While technically correct, `limit=0` is an unusual API contract — some callers might use it expecting to fetch only the total count. If that's intentional, it's fine as-is. If not, consider changing `@Min` to 1.

3. 💡 Quality: listGlossaryTermFqns loads all entities into memory on every call
   Files: openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java:230-233

   When `parent` is null, `listGlossaryTermFqns` calls `listAll(...)` which loads every `GlossaryTerm` entity in the system into memory just to extract FQNs. For large deployments with thousands of glossary terms, this creates GC pressure on every paginated request. Consider using a lighter DAO query that returns only FQNs (a projection), or caching the sorted FQN list with a short TTL.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves performance of the glossary term asset-counts endpoint by replacing per-term search calls with a single search aggregation, and adds optional pagination (limit/offset) while exposing the pre-slice total via the X-Total-Count response header. The UI REST layer is extended with a typed helper that reads X-Total-Count for paginated callers while preserving the existing function signature for current consumers.

Changes:

  • Backend: /v1/glossaryTerms/assets/counts now supports limit/offset and returns X-Total-Count.
  • Backend: replaces the per-term loop with a bulk aggregation over tags.tagFQN and slices results server-side.
  • UI + ITs: adds a paginated UI helper and an integration test covering page slicing behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
openmetadata-ui/src/main/resources/ui/src/rest/glossaryAPI.ts Adds typed params/response helpers, including a getGlossaryTermsAssetCountsPage that reads X-Total-Count.
openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java Adds limit/offset query params and returns X-Total-Count header for the asset-counts endpoint.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java Implements paged term list slicing and replaces per-term counting with a single aggregated counts query.
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/GlossaryTermResourceIT.java Adds an integration test validating pagination slicing behavior for the counts endpoint.

Comment on lines +249 to 253
private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn() {
String queryFilter = QueryFilterBuilder.buildGenericAssetsCountFilter(TAGS_FQN, true);
return inheritedFieldEntitySearch.getAggregatedCountsByField(
TAGS_FQN, queryFilter, EntityBuilderConstant.MAX_AGGREGATE_SIZE);
}
Comment on lines 428 to +464
@@ -440,12 +444,24 @@ public Response getAllGlossaryTermsWithAssetsCount(
"Filter by parent glossary or glossary term FQN. "
+ "When provided, only returns asset counts for children whose FQN starts with this value.")
@QueryParam("parent")
String parent) {
String parent,
@Parameter(
description =
"Maximum number of glossary terms to return. When omitted, all matching terms are returned.")
@QueryParam("limit")
@Min(value = 0, message = "limit must be >= 0")
@Max(value = 10000, message = "limit must be <= 10000")
Integer limit,
@Parameter(description = "Offset into the ordered list of glossary terms to start from.")
@QueryParam("offset")
@Min(value = 0, message = "offset must be >= 0")
Integer offset) {
OperationContext operationContext =
new OperationContext(entityType, MetadataOperation.VIEW_ALL);
authorizer.authorize(securityContext, operationContext, getResourceContext());
java.util.Map<String, Integer> result = repository.getAllGlossaryTermsWithAssetsCount(parent);
return Response.ok(result).build();
GlossaryTermRepository.GlossaryTermAssetCountResult result =
repository.getGlossaryTermsAssetsCount(parent, limit, offset);
return Response.ok(result.counts()).header(TOTAL_COUNT_HEADER, result.total()).build();
Comment on lines +3013 to +3069
@Test
void get_assetsCountsPaginationSlicesAndPreservesTotal(TestNamespace ns) throws Exception {
OpenMetadataClient client = SdkClients.adminClient();

CreateGlossary createGlossary =
new CreateGlossary()
.withName(ns.prefix("asset_count_pagination_glossary"))
.withDescription("Glossary for asset count pagination test");
Glossary glossary = client.glossaries().create(createGlossary);

int termCount = 5;
java.util.List<String> createdTermFqns = new java.util.ArrayList<>();
for (int i = 0; i < termCount; i++) {
CreateGlossaryTerm req =
new CreateGlossaryTerm()
.withName(ns.prefix("count_pagination_term_" + i))
.withGlossary(glossary.getFullyQualifiedName())
.withDescription("Pagination term " + i);
createdTermFqns.add(createEntity(req).getFullyQualifiedName());
}

String fullCounts = getAssetCounts(client, glossary.getFullyQualifiedName());
ObjectMapper mapper = new ObjectMapper();
JsonNode unpaged = mapper.readTree(fullCounts);
for (String fqn : createdTermFqns) {
assertTrue(unpaged.has(fqn), "Unpaged response should contain term " + fqn);
}

String firstPageBody = getAssetCountsWithPaging(client, glossary.getFullyQualifiedName(), 2, 0);
JsonNode firstPage = mapper.readTree(firstPageBody);
assertEquals(
2,
firstPage.size(),
"First page should contain at most `limit` glossary terms when limit=2");

String secondPageBody =
getAssetCountsWithPaging(client, glossary.getFullyQualifiedName(), 2, 2);
JsonNode secondPage = mapper.readTree(secondPageBody);
assertTrue(
secondPage.size() <= 2,
"Second page should contain at most `limit` glossary terms when limit=2");

java.util.Set<String> firstPageKeys = new java.util.HashSet<>();
firstPage.fieldNames().forEachRemaining(firstPageKeys::add);
secondPage
.fieldNames()
.forEachRemaining(
key ->
assertFalse(
firstPageKeys.contains(key),
"Pages should not overlap: " + key + " appeared on both pages"));

String beyondEndBody =
getAssetCountsWithPaging(client, glossary.getFullyQualifiedName(), 5, 10000);
JsonNode beyondEnd = mapper.readTree(beyondEndBody);
assertEquals(0, beyondEnd.size(), "Offset past the end should return an empty asset-count map");
}
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

Jest test Coverage

UI tests summary

Lines Statements Branches Functions
Coverage: 62%
62.43% (62958/100843) 42.75% (33933/79370) 45.75% (10038/21940)

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 6, 2026

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 6, 2026

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

🔴 Playwright Results — 1 failure(s), 13 flaky

✅ 3990 passed · ❌ 1 failed · 🟡 13 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 297 0 2 4
🔴 Shard 2 746 1 3 8
🟡 Shard 3 752 0 3 7
🟡 Shard 4 774 0 1 18
🟡 Shard 5 686 0 1 41
🟡 Shard 6 735 0 3 8

Genuine Failures (failed on all attempts)

Features/Container.spec.ts › Container page children pagination (shard 2)
�[31mTest timeout of 180000ms exceeded.�[39m
🟡 13 flaky test(s) (passed on retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataProductPersonaCustomization.spec.ts › Data Product - customize tab label should only render if it's customized by user (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Tags term should be consistent for search (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataProductAndSubdomains.spec.ts › Add assets to data product and verify count (shard 4, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Features/AutoPilot.spec.ts › Agents created by AutoPilot should be deleted (shard 6, 1 retry)
  • Pages/GlossaryImportExport.spec.ts › Glossary CSV import preserves typed relations (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants