Fix #3920: Paginate /glossaryTerms/assets/counts and use bulk aggregation by sonika-shah · Pull Request #27934 · open-metadata/OpenMetadata

sonika-shah · 2026-05-06T09:28:35Z

Summary

Replaces the per-term search loop in GlossaryTermResource#getAllGlossaryTermsWithAssetsCount with a single aggregation over tags.tagFQN (matching DataProductRepository#getAllDataProductsWithAssetsCount) — fixes the slow response on large glossaries.
Adds optional limit and offset query parameters to /v1/glossaryTerms/assets/counts. The response body keeps its existing Map<fqn, count> shape (sliced when paged) and the total number of glossary terms before slicing is returned in the X-Total-Count HTTP header so the ontology view can drive batched loading.
Adds a typed getGlossaryTermsAssetCountsPage helper in glossaryAPI.ts that reads X-Total-Count for paginated callers; the existing getGlossaryTermsAssetCounts(parent) signature is preserved (now also accepts a { parent, limit, offset } object) so current callers are unaffected.

Issue

Closes open-metadata/openmetadata-collate#3920

Currently, the asset count API does not support pagination. As a result, it takes longer to respond when handling large data. Additionally, in the ontology view, we need to display terms along with their asset counts in a batched manner to improve overall performance and load all data.

Test plan

mvn -pl openmetadata-service compile — passes locally.
New integration test GlossaryTermResourceIT#get_assetsCountsPaginationSlicesAndPreservesTotal — creates 5 terms in a glossary, verifies that:
- the unpaged response contains every created term,
- limit=2, offset=0 returns at most 2 terms,
- limit=2, offset=2 returns at most 2 terms with no overlap with page 1,
- offset past the end returns an empty map.
Existing get_assetsCountsWithParentFilter and get_assetsCountsWithNoParent tests continue to pass — backwards-compatible default behaviour.
Manual verification: hit /v1/glossaryTerms/assets/counts?limit=10&offset=0 and confirm the X-Total-Count header.

…th bulk aggregation The glossary `/assets/counts` API listed all glossary term FQNs and issued one search count query per term (N+1). For large glossaries this made the call slow and unsuitable for the ontology view, which needs to fetch and render counts in batches. This change: - Replaces the per-term loop with a single search aggregation over `tags.tagFQN` and looks up each term's count from the in-memory result, matching the `DataProductRepository` pattern. - Adds optional `limit` and `offset` query parameters to the API and slices the ordered FQN list before returning, so callers can drive paged loading. - Returns the total number of glossary terms (before slicing) in the `X-Total-Count` response header for clients that need a total count. - Adds a typed `getGlossaryTermsAssetCountsPage` helper on the frontend that reads the `X-Total-Count` header for paginated callers. Closes #3920

gitar-bot · 2026-05-06T09:31:20Z

+  private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn() {
+    String queryFilter = QueryFilterBuilder.buildGenericAssetsCountFilter(TAGS_FQN, true);
+    return inheritedFieldEntitySearch.getAggregatedCountsByField(
+        TAGS_FQN, queryFilter, EntityBuilderConstant.MAX_AGGREGATE_SIZE);
  }


⚠️ Performance: Global aggregation fetched on every paginated call ignores parent filter

Every call to getGlossaryTermsAssetsCount invokes fetchAssetCountsByGlossaryTermFqn() which runs an unfiltered aggregation on tags.tagFQN across ALL indexed entities, bounded only by MAX_AGGREGATE_SIZE (10,000 buckets). This has two implications:

Pagination doesn't reduce ES work — requesting limit=2 still triggers the full aggregation. The perf win comes from replacing N searches with 1, but the single aggregation is still global.

Silent truncation at 10k buckets — the aggregation includes ALL tag FQNs (classification tags + glossary terms). On large deployments with >10k distinct tags, some glossary term FQNs may be missing from the aggregation result, causing them to silently report 0 assets.

Consider scoping the aggregation to only the FQNs in pageFqns (e.g., via a terms filter on the tag FQN field) or at least filtering to the parent prefix. This would both reduce the bucket count and make paginated calls genuinely lighter.

Suggested fix:

private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn(List<String> fqns) { // Scope the aggregation to only the requested FQNs String queryFilter = QueryFilterBuilder.buildFilteredAssetsCountFilter( TAGS_FQN, fqns, true); return inheritedFieldEntitySearch.getAggregatedCountsByField( TAGS_FQN, queryFilter, fqns.size()); }

_{Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion}

gitar-bot · 2026-05-06T09:31:22Z

+              description =
+                  "Maximum number of glossary terms to return. When omitted, all matching terms are returned.")
+          @QueryParam("limit")
+          @Min(value = 0, message = "limit must be >= 0")


💡 Edge Case: limit=0 returns empty result — may confuse callers expecting metadata only

The @Min(0) annotation on the limit parameter allows limit=0. In sliceFqns, this produces an empty sublist, and because pageFqns.isEmpty() is true, the method returns early with an empty counts map (but a valid total). While technically correct, limit=0 is an unusual API contract — some callers might use it expecting to fetch only the total count. If that's intentional, it's fine as-is. If not, consider changing @Min to 1.

_{Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion}

gitar-bot · 2026-05-06T09:31:24Z

+    return listAll(getFields("fullyQualifiedName"), new ListFilter(null)).stream()
+        .map(GlossaryTerm::getFullyQualifiedName)
+        .sorted()
+        .collect(Collectors.toList());


💡 Quality: listGlossaryTermFqns loads all entities into memory on every call

When parent is null, listGlossaryTermFqns calls listAll(...) which loads every GlossaryTerm entity in the system into memory just to extract FQNs. For large deployments with thousands of glossary terms, this creates GC pressure on every paginated request. Consider using a lighter DAO query that returns only FQNs (a projection), or caching the sorted FQN list with a short TTL.

_{Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion}

gitar-bot · 2026-05-06T09:31:25Z

Code Review ⚠️ Changes requested 0 resolved / 3 findings

Replaces per-term search with bulk aggregation and adds pagination to asset counts, but the current implementation performs global aggregation regardless of parent filters and loads all glossary terms into memory. Additionally, the limit=0 parameter handling may cause confusion for callers expecting metadata.

⚠️

Performance: Global aggregation fetched on every paginated call ignores parent filter

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java:249-253

Every call to getGlossaryTermsAssetsCount invokes fetchAssetCountsByGlossaryTermFqn() which runs an unfiltered aggregation on tags.tagFQN across ALL indexed entities, bounded only by MAX_AGGREGATE_SIZE (10,000 buckets). This has two implications:

Pagination doesn't reduce ES work — requesting limit=2 still triggers the full aggregation. The perf win comes from replacing N searches with 1, but the single aggregation is still global.
Silent truncation at 10k buckets — the aggregation includes ALL tag FQNs (classification tags + glossary terms). On large deployments with >10k distinct tags, some glossary term FQNs may be missing from the aggregation result, causing them to silently report 0 assets.

Consider scoping the aggregation to only the FQNs in pageFqns (e.g., via a terms filter on the tag FQN field) or at least filtering to the parent prefix. This would both reduce the bucket count and make paginated calls genuinely lighter.

Suggested fix

private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn(List<String> fqns) {
  // Scope the aggregation to only the requested FQNs
  String queryFilter = QueryFilterBuilder.buildFilteredAssetsCountFilter(
      TAGS_FQN, fqns, true);
  return inheritedFieldEntitySearch.getAggregatedCountsByField(
      TAGS_FQN, queryFilter, fqns.size());
}

💡 Edge Case: limit=0 returns empty result — may confuse callers expecting metadata only

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java:452

The @Min(0) annotation on the limit parameter allows limit=0. In sliceFqns, this produces an empty sublist, and because pageFqns.isEmpty() is true, the method returns early with an empty counts map (but a valid total). While technically correct, limit=0 is an unusual API contract — some callers might use it expecting to fetch only the total count. If that's intentional, it's fine as-is. If not, consider changing @Min to 1.

💡 Quality: listGlossaryTermFqns loads all entities into memory on every call

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java:230-233

When parent is null, listGlossaryTermFqns calls listAll(...) which loads every GlossaryTerm entity in the system into memory just to extract FQNs. For large deployments with thousands of glossary terms, this creates GC pressure on every paginated request. Consider using a lighter DAO query that returns only FQNs (a projection), or caching the sorted FQN list with a short TTL.

🤖 Prompt for agents

Code Review: Replaces per-term search with bulk aggregation and adds pagination to asset counts, but the current implementation performs global aggregation regardless of parent filters and loads all glossary terms into memory. Additionally, the `limit=0` parameter handling may cause confusion for callers expecting metadata.

1. ⚠️ Performance: Global aggregation fetched on every paginated call ignores parent filter
   Files: openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java:249-253

   Every call to `getGlossaryTermsAssetsCount` invokes `fetchAssetCountsByGlossaryTermFqn()` which runs an unfiltered aggregation on `tags.tagFQN` across ALL indexed entities, bounded only by `MAX_AGGREGATE_SIZE` (10,000 buckets). This has two implications:
   
   1. **Pagination doesn't reduce ES work** — requesting `limit=2` still triggers the full aggregation. The perf win comes from replacing N searches with 1, but the single aggregation is still global.
   2. **Silent truncation at 10k buckets** — the aggregation includes ALL tag FQNs (classification tags + glossary terms). On large deployments with >10k distinct tags, some glossary term FQNs may be missing from the aggregation result, causing them to silently report 0 assets.
   
   Consider scoping the aggregation to only the FQNs in `pageFqns` (e.g., via a `terms` filter on the tag FQN field) or at least filtering to the parent prefix. This would both reduce the bucket count and make paginated calls genuinely lighter.

   Suggested fix:
   private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn(List<String> fqns) {
     // Scope the aggregation to only the requested FQNs
     String queryFilter = QueryFilterBuilder.buildFilteredAssetsCountFilter(
         TAGS_FQN, fqns, true);
     return inheritedFieldEntitySearch.getAggregatedCountsByField(
         TAGS_FQN, queryFilter, fqns.size());
   }

2. 💡 Edge Case: limit=0 returns empty result — may confuse callers expecting metadata only
   Files: openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java:452

   The `@Min(0)` annotation on the `limit` parameter allows `limit=0`. In `sliceFqns`, this produces an empty sublist, and because `pageFqns.isEmpty()` is true, the method returns early with an empty counts map (but a valid total). While technically correct, `limit=0` is an unusual API contract — some callers might use it expecting to fetch only the total count. If that's intentional, it's fine as-is. If not, consider changing `@Min` to 1.

3. 💡 Quality: listGlossaryTermFqns loads all entities into memory on every call
   Files: openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java:230-233

   When `parent` is null, `listGlossaryTermFqns` calls `listAll(...)` which loads every `GlossaryTerm` entity in the system into memory just to extract FQNs. For large deployments with thousands of glossary terms, this creates GC pressure on every paginated request. Consider using a lighter DAO query that returns only FQNs (a projection), or caching the sorted FQN list with a short TTL.

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

Copilot

Pull request overview

This PR improves performance of the glossary term asset-counts endpoint by replacing per-term search calls with a single search aggregation, and adds optional pagination (limit/offset) while exposing the pre-slice total via the X-Total-Count response header. The UI REST layer is extended with a typed helper that reads X-Total-Count for paginated callers while preserving the existing function signature for current consumers.

Changes:

Backend: /v1/glossaryTerms/assets/counts now supports limit/offset and returns X-Total-Count.
Backend: replaces the per-term loop with a bulk aggregation over tags.tagFQN and slices results server-side.
UI + ITs: adds a paginated UI helper and an integration test covering page slicing behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
openmetadata-ui/src/main/resources/ui/src/rest/glossaryAPI.ts	Adds typed params/response helpers, including a `getGlossaryTermsAssetCountsPage` that reads `X-Total-Count`.
openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java	Adds `limit`/`offset` query params and returns `X-Total-Count` header for the asset-counts endpoint.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryTermRepository.java	Implements paged term list slicing and replaces per-term counting with a single aggregated counts query.
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/GlossaryTermResourceIT.java	Adds an integration test validating pagination slicing behavior for the counts endpoint.

+  private Map<String, Integer> fetchAssetCountsByGlossaryTermFqn() {
+    String queryFilter = QueryFilterBuilder.buildGenericAssetsCountFilter(TAGS_FQN, true);
+    return inheritedFieldEntitySearch.getAggregatedCountsByField(
+        TAGS_FQN, queryFilter, EntityBuilderConstant.MAX_AGGREGATE_SIZE);
  }


@@ -440,12 +444,24 @@ public Response getAllGlossaryTermsWithAssetsCount(
                  "Filter by parent glossary or glossary term FQN. "
                      + "When provided, only returns asset counts for children whose FQN starts with this value.")
          @QueryParam("parent")
-          String parent) {
+          String parent,
+      @Parameter(
+              description =
+                  "Maximum number of glossary terms to return. When omitted, all matching terms are returned.")
+          @QueryParam("limit")
+          @Min(value = 0, message = "limit must be >= 0")
+          @Max(value = 10000, message = "limit must be <= 10000")
+          Integer limit,
+      @Parameter(description = "Offset into the ordered list of glossary terms to start from.")
+          @QueryParam("offset")
+          @Min(value = 0, message = "offset must be >= 0")
+          Integer offset) {
    OperationContext operationContext =
        new OperationContext(entityType, MetadataOperation.VIEW_ALL);
    authorizer.authorize(securityContext, operationContext, getResourceContext());
-    java.util.Map<String, Integer> result = repository.getAllGlossaryTermsWithAssetsCount(parent);
-    return Response.ok(result).build();
+    GlossaryTermRepository.GlossaryTermAssetCountResult result =
+        repository.getGlossaryTermsAssetsCount(parent, limit, offset);
+    return Response.ok(result.counts()).header(TOTAL_COUNT_HEADER, result.total()).build();


+  @Test
+  void get_assetsCountsPaginationSlicesAndPreservesTotal(TestNamespace ns) throws Exception {
+    OpenMetadataClient client = SdkClients.adminClient();
+
+    CreateGlossary createGlossary =
+        new CreateGlossary()
+            .withName(ns.prefix("asset_count_pagination_glossary"))
+            .withDescription("Glossary for asset count pagination test");
+    Glossary glossary = client.glossaries().create(createGlossary);
+
+    int termCount = 5;
+    java.util.List<String> createdTermFqns = new java.util.ArrayList<>();
+    for (int i = 0; i < termCount; i++) {
+      CreateGlossaryTerm req =
+          new CreateGlossaryTerm()
+              .withName(ns.prefix("count_pagination_term_" + i))
+              .withGlossary(glossary.getFullyQualifiedName())
+              .withDescription("Pagination term " + i);
+      createdTermFqns.add(createEntity(req).getFullyQualifiedName());
+    }
+
+    String fullCounts = getAssetCounts(client, glossary.getFullyQualifiedName());
+    ObjectMapper mapper = new ObjectMapper();
+    JsonNode unpaged = mapper.readTree(fullCounts);
+    for (String fqn : createdTermFqns) {
+      assertTrue(unpaged.has(fqn), "Unpaged response should contain term " + fqn);
+    }
+
+    String firstPageBody = getAssetCountsWithPaging(client, glossary.getFullyQualifiedName(), 2, 0);
+    JsonNode firstPage = mapper.readTree(firstPageBody);
+    assertEquals(
+        2,
+        firstPage.size(),
+        "First page should contain at most `limit` glossary terms when limit=2");
+
+    String secondPageBody =
+        getAssetCountsWithPaging(client, glossary.getFullyQualifiedName(), 2, 2);
+    JsonNode secondPage = mapper.readTree(secondPageBody);
+    assertTrue(
+        secondPage.size() <= 2,
+        "Second page should contain at most `limit` glossary terms when limit=2");
+
+    java.util.Set<String> firstPageKeys = new java.util.HashSet<>();
+    firstPage.fieldNames().forEachRemaining(firstPageKeys::add);
+    secondPage
+        .fieldNames()
+        .forEachRemaining(
+            key ->
+                assertFalse(
+                    firstPageKeys.contains(key),
+                    "Pages should not overlap: " + key + " appeared on both pages"));
+
+    String beyondEndBody =
+        getAssetCountsWithPaging(client, glossary.getFullyQualifiedName(), 5, 10000);
+    JsonNode beyondEnd = mapper.readTree(beyondEndBody);
+    assertEquals(0, beyondEnd.size(), "Offset past the end should return an empty asset-count map");
+  }


github-actions · 2026-05-06T09:51:20Z

Jest test Coverage

UI tests summary

Lines	Statements	Branches	Functions
	62.43% (62958/100843)	42.75% (33933/79370)	45.75% (10038/21940)

sonarqubecloud · 2026-05-06T09:57:50Z

Quality Gate passed for 'open-metadata-ui'

Issues
1068 New issues
0 Accepted issues

Measures
17 Security Hotspots
0.0% Coverage on New Code
3.8% Duplication on New Code

See analysis details on SonarQube Cloud

sonarqubecloud · 2026-05-06T10:30:56Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-05-06T11:52:02Z

🔴 Playwright Results — 1 failure(s), 13 flaky

✅ 3990 passed · ❌ 1 failed · 🟡 13 flaky · ⏭️ 86 skipped

Shard	Passed	Failed	Flaky	Skipped
🟡 Shard 1	297	0	2	4
🔴 Shard 2	746	1	3	8
🟡 Shard 3	752	0	3	7
🟡 Shard 4	774	0	1	18
🟡 Shard 5	686	0	1	41
🟡 Shard 6	735	0	3	8

Genuine Failures (failed on all attempts)

❌ Features/Container.spec.ts › Container page children pagination (shard 2)

�[31mTest timeout of 180000ms exceeded.�[39m

🟡 13 flaky test(s) (passed on retry)

Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
Features/DataProductPersonaCustomization.spec.ts › Data Product - customize tab label should only render if it's customized by user (shard 2, 1 retry)
Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
Features/Table.spec.ts › Tags term should be consistent for search (shard 3, 1 retry)
Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
Pages/DataProductAndSubdomains.spec.ts › Add assets to data product and verify count (shard 4, 1 retry)
Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
Features/AutoPilot.spec.ts › Agents created by AutoPilot should be deleted (shard 6, 1 retry)
Pages/GlossaryImportExport.spec.ts › Glossary CSV import preserves typed relations (shard 6, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings May 6, 2026 09:28

sonika-shah requested a review from a team as a code owner May 6, 2026 09:28

github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels May 6, 2026

Copilot started reviewing on behalf of sonika-shah May 6, 2026 09:29 View session

gitar-bot Bot reviewed May 6, 2026

View reviewed changes

Copilot AI reviewed May 6, 2026

View reviewed changes

sonika-shah temporarily deployed to test May 6, 2026 09:38 — with GitHub Actions Inactive

sonika-shah had a problem deploying to test May 6, 2026 09:38 — with GitHub Actions Failure

sonika-shah temporarily deployed to test May 6, 2026 09:38 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #3920: Paginate /glossaryTerms/assets/counts and use bulk aggregation#27934

Fix #3920: Paginate /glossaryTerms/assets/counts and use bulk aggregation#27934
sonika-shah wants to merge 1 commit intomainfrom
fix/3920-glossary-asset-count-pagination

sonika-shah commented May 6, 2026 •

edited

Loading

Uh oh!

gitar-bot Bot May 6, 2026

Uh oh!

gitar-bot Bot May 6, 2026

Uh oh!

gitar-bot Bot May 6, 2026

Uh oh!

gitar-bot Bot commented May 6, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

github-actions Bot commented May 6, 2026

Uh oh!

sonarqubecloud Bot commented May 6, 2026

Uh oh!

sonarqubecloud Bot commented May 6, 2026

Uh oh!

github-actions Bot commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sonika-shah commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Issue

Test plan

Uh oh!

gitar-bot Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

gitar-bot Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

gitar-bot Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

gitar-bot Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

github-actions Bot commented May 6, 2026

Jest test Coverage

UI tests summary

Uh oh!

sonarqubecloud Bot commented May 6, 2026

Quality Gate passed for 'open-metadata-ui'

Uh oh!

sonarqubecloud Bot commented May 6, 2026

Quality Gate passed for 'open-metadata-ingestion'

Uh oh!

github-actions Bot commented May 6, 2026

🔴 Playwright Results — 1 failure(s), 13 flaky

Genuine Failures (failed on all attempts)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sonika-shah commented May 6, 2026 •

edited

Loading

gitar-bot Bot commented May 6, 2026 •

edited

Loading