Fixes #27950: [Datalake] JSON columns incorrectly typed as STRING for empty dict values by mohittilala · Pull Request #27951 · open-metadata/OpenMetadata

mohittilala · 2026-05-07T03:17:18Z

Describe your changes:

Fixes #27950

Changes in OpenMetadata submodule (datalake_utils.py):

Empty dict/list columns now correctly typed as JSON/ARRAY instead of STRING
Skip ast.literal_eval round-trip for already-parsed dict/list values
get_children handles parsed dicts and JSON strings independently — no more TypeError log spam
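The first two bullets can be illustrated with a minimal sketch. It is not the code in datalake_utils.py: the function name `infer_container_type`, the `_TYPE_NAMES` mapping, and the plain-string return values are assumptions made for illustration, and the real connector works on pandas columns with its own data type enum. The sketch only shows the idea of filtering on "not null" rather than truthiness, and of skipping the string round trip for values that are already parsed.

```python
import ast
from typing import Any, Iterable

# Hypothetical mapping for illustration; the connector uses its own enum.
_TYPE_NAMES = {"dict": "JSON", "list": "ARRAY"}


def infer_container_type(values: Iterable[Any], sample_size: int = 1000) -> str:
    """Return "JSON", "ARRAY", or "STRING" for a column of object values."""
    # Filter on `is not None`, not on truthiness, so {} and [] are kept.
    non_null = [v for v in values if v is not None][:sample_size]
    if not non_null:
        return "STRING"

    seen = set()
    for value in non_null:
        if isinstance(value, (dict, list)):
            # Already parsed: no str() / ast.literal_eval round trip needed.
            seen.add(type(value).__name__)
            continue
        try:
            seen.add(type(ast.literal_eval(str(value))).__name__)
        except (ValueError, SyntaxError):
            # Not a Python literal, so treat it as plain text.
            seen.add("str")

    # Only commit to a container type when every sampled value agrees.
    if len(seen) == 1:
        return _TYPE_NAMES.get(seen.pop(), "STRING")
    return "STRING"
```

With this sketch, a column holding only `{}` and `None` resolves to "JSON", while a column mixing `{}` with free text falls back to "STRING".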

Tests added (tests/unit/utils/test_datalake.py):

Unit tests for fetch_col_types and get_children with parsed objects, empty containers, mixed types
End-to-end tests reading real fixture files through the full _read_json_object → _get_columns pipeline

Type of change:

Checklist:

I have read the CONTRIBUTING document.
My PR title is Fixes <issue-number>: <short explanation>
I have commented on my code, particularly in hard-to-understand areas.
For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Bug fix

I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.


Copilot

Pull request overview

Fixes an ingestion bug in the Datalake connector where JSON-like columns (especially empty {} / [] values coming from single-object JSON files) were incorrectly inferred as STRING, and where parsing children could emit repeated TypeError debug logs.

Changes:

Update column type inference to treat non-null object columns as candidates even when values are falsy containers, and avoid unnecessary ast.literal_eval for already-parsed dict/list values.
Rework JSON children extraction to handle mixed parsed-dict and JSON-string values without TypeError noise.
Add unit + fixture-based tests covering parsed objects, empty containers, and single-object JSON ingestion behavior.
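The second bullet, handling parsed dicts and JSON strings side by side, can be sketched as follows. This is an assumption-laden illustration rather than the PR's `get_children`: the helper names `collect_json_objects` and `merged_keys` are hypothetical. One way a TypeError can arise in this kind of code is passing a dict to `json.loads`, which accepts only str, bytes, or bytearray, so the sketch branches on the value's type before parsing.

```python
import json
from typing import Any, Dict, Iterable, List


def collect_json_objects(values: Iterable[Any]) -> List[Dict[str, Any]]:
    """Gather dict-shaped values from a column of mixed inputs."""
    objects: List[Dict[str, Any]] = []
    for value in values:
        if isinstance(value, dict):
            # Already parsed: use it directly instead of calling json.loads,
            # which would raise TypeError for a dict argument.
            objects.append(value)
        elif isinstance(value, str):
            try:
                parsed = json.loads(value)
            except json.JSONDecodeError:
                continue  # plain text, not JSON
            if isinstance(parsed, dict):
                objects.append(parsed)
        # None, numbers, lists and other types are skipped silently.
    return objects


def merged_keys(values: Iterable[Any]) -> List[str]:
    """Return the union of top-level keys, in first-seen order."""
    keys: Dict[str, None] = {}
    for obj in collect_json_objects(values):
        for key in obj:
            keys.setdefault(key, None)
    return list(keys)
```

For example, `merged_keys([{"a": 1}, '{"b": 2}', "not json", None, {}])` yields `["a", "b"]` without raising or logging for the values that are not JSON objects.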

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `ingestion/src/metadata/utils/datalake/datalake_utils.py` | Fixes type inference for empty dict/list values and makes `get_children` robust to parsed objects vs JSON strings. |
| `ingestion/tests/unit/utils/test_datalake.py` | Adds targeted tests for `fetch_col_types`/`get_children` and fixture-driven single-object JSON parsing. |
| `ingestion/tests/unit/resources/datalake/dbt_manifest.json` | Adds a representative single-object dbt manifest fixture with multiple empty-object fields. |
| `ingestion/tests/unit/resources/datalake/dbt_catalog.json` | Adds a representative single-object dbt catalog fixture with nested dicts and nulls. |


github-actions · 2026-05-07T05:56:52Z

🟡 Playwright Results — all passed (10 flaky)

✅ 4002 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 86 skipped

| Shard | Passed | Flaky | Skipped |
| --- | --- | --- | --- |
| ✅ Shard 1 | 299 | 0 | 4 |
| 🟡 Shard 2 | 748 | 4 | 8 |
| 🟡 Shard 3 | 758 | 3 | 7 |
| ✅ Shard 4 | 775 | 0 | 18 |
| ✅ Shard 5 | 687 | 0 | 41 |
| 🟡 Shard 6 | 735 | 3 | 8 |

🟡 10 flaky test(s) (passed on retry)

Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 2 retries)
Features/IncidentManager.spec.ts › Next, Previous and page indicator (shard 2, 1 retry)
Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
Flow/AddRoleAndAssignToUser.spec.ts › Verify assigned role to new user (shard 3, 1 retry)
Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
Features/AutoPilot.spec.ts › Agents created by AutoPilot should be deleted (shard 6, 1 retry)
Pages/GlossaryImportExport.spec.ts › Glossary CSV import rejects unknown relation type (shard 6, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace


gitar-bot · 2026-05-08T04:52:07Z

Code Review ✅ Approved

Updates datalake utilities to correctly type empty dictionaries and lists as JSON or ARRAY instead of STRING. The implementation includes refined type handling and added unit tests to ensure accurate column inference.


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

mohittilala · 2026-05-08T06:35:59Z

+                    df_row_val_list = col_non_null.values[:1000]
                    parsed_object_datatype_list = []
                    for df_row_val in df_row_val_list:
                        try:
-                            parsed_object_datatype_list.append(type(ast.literal_eval(str(df_row_val))).__name__.lower())
+                            if isinstance(df_row_val, (dict, list)):
+                                parsed_object_datatype_list.append(type(df_row_val).__name__.lower())
+                            else:
+                                parsed_object_datatype_list.append(
+                                    type(ast.literal_eval(str(df_row_val))).__name__.lower()
+                                )


This is pre-existing behavior and out of scope for this PR (which targets the empty-dict typing bug). It is worth a follow-up issue.
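The `isinstance` branch added in the diff above can be exercised in isolation. The sketch below is illustrative only: the helper names `type_name_via_round_trip`, `type_name_direct`, and `safe` are hypothetical and do not appear in the PR. It shows one reason the shortcut can matter beyond speed: a dict whose string form is not a valid Python literal (here, one containing NaN) cannot survive the `str()` then `ast.literal_eval` round trip, even though its type is already known.

```python
import ast
from typing import Any, Callable


def type_name_via_round_trip(value: Any) -> str:
    """Removed line's approach: stringify, then parse with ast.literal_eval."""
    return type(ast.literal_eval(str(value))).__name__.lower()


def type_name_direct(value: Any) -> str:
    """Added lines' approach: trust values that are already dict or list."""
    if isinstance(value, (dict, list)):
        return type(value).__name__.lower()
    return type_name_via_round_trip(value)


def safe(fn: Callable[[Any], str], value: Any) -> str:
    """Run fn and report a parse failure as text instead of raising."""
    try:
        return fn(value)
    except (ValueError, SyntaxError) as exc:
        return f"error: {type(exc).__name__}"


if __name__ == "__main__":
    tricky = {"x": float("nan")}  # str(tricky) contains the bare name nan
    print(safe(type_name_via_round_trip, tricky))
    print(safe(type_name_direct, tricky))
```

For an empty `{}` both helpers agree on "dict", so the shortcut does not change the result there; it only avoids the extra parse and the failure mode shown for `tricky`.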

sonarqubecloud · 2026-05-08T05:53:56Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
87.5% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

fix: datalake JSON columns incorrectly typed as STRING for empty dict values

c14e05b

mohittilala self-assigned this May 7, 2026

Copilot AI review requested due to automatic review settings May 7, 2026 03:17

mohittilala requested a review from a team as a code owner May 7, 2026 03:17

mohittilala added the bug, Ingestion, safe to test, To release, and python labels May 7, 2026

Copilot started reviewing on behalf of mohittilala May 7, 2026 03:17 View session

Copilot AI reviewed May 7, 2026


Comment thread ingestion/src/metadata/utils/datalake/datalake_utils.py Outdated

Comment thread ingestion/tests/unit/utils/test_datalake.py

mohittilala had a problem deploying to test May 7, 2026 03:28 — with GitHub Actions Failure

mohittilala temporarily deployed to test May 7, 2026 03:28 — with GitHub Actions Inactive

fix: wrap df_row_val with str() for strptime and parse calls to satisfy type checker

f17df88

fix: address static check type errors and review comments in datalake utils

afd7898

mohittilala temporarily deployed to test May 7, 2026 06:58 — with GitHub Actions Inactive

mohittilala had a problem deploying to test May 7, 2026 06:58 — with GitHub Actions Failure

mohittilala temporarily deployed to test May 7, 2026 06:58 — with GitHub Actions Inactive

mohittilala temporarily deployed to test May 7, 2026 13:24 — with GitHub Actions Inactive

Restore debug logging, fix dead-code fallback, strengthen tests

a0f0f01

Copilot AI review requested due to automatic review settings May 8, 2026 04:50

Copilot started reviewing on behalf of mohittilala May 8, 2026 04:51 View session

Copilot AI reviewed May 8, 2026


mohittilala temporarily deployed to test May 8, 2026 05:00 — with GitHub Actions Inactive

mohittilala had a problem deploying to test May 8, 2026 05:00 — with GitHub Actions Failure

mohittilala temporarily deployed to test May 8, 2026 05:00 — with GitHub Actions Inactive

mohittilala temporarily deployed to test May 8, 2026 07:22 — with GitHub Actions Inactive

TeddyCr approved these changes May 8, 2026


mohittilala wants to merge 4 commits into main from fix/datalake-json-column-type-detection


3 participants
