Fixes #27950: [Datalake] JSON columns incorrectly typed as STRING for empty dict values #27951
mohittilala wants to merge 4 commits into main
Pull request overview
Fixes an ingestion bug in the Datalake connector where JSON-like columns (especially empty {} / [] values coming from single-object JSON files) were incorrectly inferred as STRING, and where parsing children could emit repeated TypeError debug logs.
Changes:
- Update column type inference to treat non-null object columns as candidates even when values are falsy containers, and avoid unnecessary `ast.literal_eval` for already-parsed `dict`/`list` values.
- Rework JSON children extraction to handle mixed parsed-`dict` and JSON-string values without `TypeError` noise.
- Add unit + fixture-based tests covering parsed objects, empty containers, and single-object JSON ingestion behavior.
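The first change can be illustrated with a minimal sketch. It is not the code in `datalake_utils.py`; the function name `infer_container_type` and the majority-vote fallback are assumptions made for illustration. It shows why a truthiness check drops empty containers, and how an `isinstance` check avoids the `str()` / `ast.literal_eval` round-trip for values that are already parsed.

```python
import ast
from collections import Counter


def infer_container_type(values):
    """Hypothetical helper: guess whether a column holds dicts, lists, or strings."""
    names = []
    for value in values:
        # Skip only real nulls. A truthiness check (`if value:`) would also
        # skip {} and [], which is the kind of bug described above.
        if value is None:
            continue
        if isinstance(value, (dict, list)):
            # Already parsed: use the runtime type directly.
            names.append(type(value).__name__.lower())
            continue
        try:
            # Strings that look like Python/JSON literals are parsed here.
            names.append(type(ast.literal_eval(str(value))).__name__.lower())
        except (ValueError, SyntaxError):
            names.append("str")
    if not names:
        return "str"
    # Assumption: the most common observed type wins.
    return Counter(names).most_common(1)[0][0]


print(infer_container_type([{}, {}, None]))
print(infer_container_type([[], [1, 2]]))
print(infer_container_type(["plain text"]))
```

Under these assumptions, a column containing only `{}` values resolves to `dict` rather than falling through to the string default.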
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `ingestion/src/metadata/utils/datalake/datalake_utils.py` | Fixes type inference for empty dict/list values and makes `get_children` robust to parsed objects vs JSON strings. |
| `ingestion/tests/unit/utils/test_datalake.py` | Adds targeted tests for `fetch_col_types`/`get_children` and fixture-driven single-object JSON parsing. |
| `ingestion/tests/unit/resources/datalake/dbt_manifest.json` | Adds a representative single-object dbt manifest fixture with multiple empty-object fields. |
| `ingestion/tests/unit/resources/datalake/dbt_catalog.json` | Adds a representative single-object dbt catalog fixture with nested dicts and nulls. |
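The `get_children` change described in the table can be sketched roughly as follows. This is an illustration under stated assumptions, not the connector's implementation: `collect_sample_object` and its `limit` parameter are hypothetical names. The idea is to branch on the value's type first, so parsed dicts are never passed to a JSON parser and JSON strings are never treated as mappings.

```python
import json


def collect_sample_object(values, limit=100):
    """Hypothetical helper: merge keys from parsed dicts and JSON strings."""
    merged = {}
    for value in values[:limit]:
        if isinstance(value, dict):
            # Already parsed, including the empty dict: merge as-is.
            merged.update(value)
        elif isinstance(value, str):
            try:
                parsed = json.loads(value)
            except json.JSONDecodeError:
                # Not JSON: skip quietly instead of logging an error per row.
                continue
            if isinstance(parsed, dict):
                merged.update(parsed)
        # Any other type (None, numbers, lists) is ignored in this sketch.
    return merged


sample = collect_sample_object([{"a": 1}, '{"b": 2}', "not json", {}, None])
print(sample)
```

Because each branch only calls operations valid for its type, no `TypeError` is raised for mixed columns, so there is nothing to log.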
🟡 Playwright Results: all passed (10 flaky)
✅ 4002 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 86 skipped
🟡 10 flaky test(s) (passed on retry)
How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace
Code Review ✅ Approved
Updates datalake utilities to correctly type empty dictionaries and lists as JSON or ARRAY instead of STRING. The implementation includes refined type handling and added unit tests to ensure accurate column inference.
      df_row_val_list = col_non_null.values[:1000]
      parsed_object_datatype_list = []
      for df_row_val in df_row_val_list:
          try:
    -         parsed_object_datatype_list.append(type(ast.literal_eval(str(df_row_val))).__name__.lower())
    +         if isinstance(df_row_val, (dict, list)):
    +             parsed_object_datatype_list.append(type(df_row_val).__name__.lower())
    +         else:
    +             parsed_object_datatype_list.append(
    +                 type(ast.literal_eval(str(df_row_val))).__name__.lower()
    +             )
pre-existing behavior, out of scope for this PR (which targets the empty-dict typing bug). Worth a follow-up issue.
Describe your changes:
Fixes #27950
Changes in OpenMetadata submodule (`datalake_utils.py`):
- Empty `{}` / `[]` values are typed as `JSON`/`ARRAY` instead of `STRING`
- No `ast.literal_eval` round-trip for already-parsed `dict`/`list` values
- `get_children` handles parsed dicts and JSON strings independently, so no more `TypeError` log spam
Tests added (`tests/unit/utils/test_datalake.py`):
- `fetch_col_types` and `get_children` with parsed objects, empty containers, mixed types
- `_read_json_object` → `_get_columns` pipeline
Type of change:
- Bug fix
Checklist:
- `Fixes <issue-number>: <short explanation>`