Fixes #27950: [Datalake] JSON columns incorrectly typed as STRING for empty dict values #27951

Open
mohittilala wants to merge 4 commits into main from fix/datalake-json-column-type-detection

Conversation

@mohittilala
Contributor

Describe your changes:

Fixes #27950

Changes in OpenMetadata submodule (datalake_utils.py):

  • Empty dict/list columns now correctly typed as JSON/ARRAY instead of STRING
  • Skip ast.literal_eval round-trip for already-parsed dict/list values
  • get_children handles parsed dicts and JSON strings independently — no more TypeError log spam
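
The first two bullets can be illustrated with a minimal, standalone sketch. It is not the code in datalake_utils.py: `value_type_name`, `infer_column_label`, and `TYPE_LABELS` are hypothetical names, and the STRING fallback is an assumption made only for this example. It shows two ideas: filter on "is not None" rather than truthiness so that `{}` and `[]` survive, and skip the `ast.literal_eval(str(...))` round-trip when the value is already a dict or list.

```python
import ast
from collections import Counter

# Hypothetical mapping from Python type names to column type labels.
TYPE_LABELS = {"dict": "JSON", "list": "ARRAY"}


def value_type_name(value):
    """Return the lowercase Python type name that a cell value represents."""
    if isinstance(value, (dict, list)):
        # Already parsed: no need to stringify and re-parse.
        return type(value).__name__.lower()
    try:
        return type(ast.literal_eval(str(value))).__name__.lower()
    except (ValueError, SyntaxError):
        # Not a Python literal, so treat it as plain text.
        return "str"


def infer_column_label(values):
    """Pick a label from the most common type among the non-null values."""
    # A truthiness filter (`if v`) would drop {} and []; `is not None` keeps them.
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "STRING"
    counts = Counter(value_type_name(v) for v in non_null)
    most_common_name, _ = counts.most_common(1)[0]
    # Assumption for this sketch: anything that is not a dict or list is STRING.
    return TYPE_LABELS.get(most_common_name, "STRING")
```

With this sketch, a column holding only `{}` values is labelled JSON and a column holding only `[]` values is labelled ARRAY, which is the behavior the bullets describe.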

Tests added (tests/unit/utils/test_datalake.py):

  • Unit tests for fetch_col_types and get_children with parsed objects, empty containers, mixed types
  • End-to-end tests reading real fixture files through the full _read_json_object → _get_columns pipeline

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Bug fix

  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

@mohittilala mohittilala self-assigned this May 7, 2026
Copilot AI review requested due to automatic review settings May 7, 2026 03:17
@mohittilala mohittilala requested a review from a team as a code owner May 7, 2026 03:17
@mohittilala mohittilala added the bug, Ingestion, safe to test, To release, and python labels May 7, 2026
Contributor
Contributor

Copilot AI left a comment

Pull request overview

Fixes an ingestion bug in the Datalake connector where JSON-like columns (especially empty {} / [] values coming from single-object JSON files) were incorrectly inferred as STRING, and where parsing children could emit repeated TypeError debug logs.

Changes:

  • Update column type inference to treat non-null object columns as candidates even when values are falsy containers, and avoid unnecessary ast.literal_eval for already-parsed dict/list values.
  • Rework JSON children extraction to handle mixed parsed-dict and JSON-string values without TypeError noise.
  • Add unit + fixture-based tests covering parsed objects, empty containers, and single-object JSON ingestion behavior.
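
As an illustration of the second bullet, here is a minimal sketch of handling parsed dicts and JSON strings on separate branches. It is not the PR's get_children implementation; `collect_key_values` is a hypothetical helper written for this example. One plausible source of TypeError noise is that `json.loads` raises TypeError when given a dict instead of a str, so the sketch checks the value's type before deciding whether to parse it.

```python
import json


def collect_key_values(values):
    """Merge the keys seen across a column's values into one dict.

    Accepts parsed dicts and JSON-encoded strings; skips everything else.
    The first value seen for each key is kept.
    """
    merged = {}
    for value in values:
        if isinstance(value, dict):
            # Already parsed: use it directly, never pass it to json.loads.
            parsed = value
        elif isinstance(value, str):
            try:
                parsed = json.loads(value)
            except json.JSONDecodeError:
                continue  # plain text, not JSON
            if not isinstance(parsed, dict):
                continue  # valid JSON, but not an object
        else:
            continue  # None, numbers, lists, and other non-object values
        for key, child in parsed.items():
            merged.setdefault(key, child)
    return merged
```

Because each branch only sees the type it can handle, a column mixing `{"a": 1}` and `'{"b": 2}'` yields both keys without raising or logging an exception.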

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| ingestion/src/metadata/utils/datalake/datalake_utils.py | Fixes type inference for empty dict/list values and makes get_children robust to parsed objects vs JSON strings. |
| ingestion/tests/unit/utils/test_datalake.py | Adds targeted tests for fetch_col_types/get_children and fixture-driven single-object JSON parsing. |
| ingestion/tests/unit/resources/datalake/dbt_manifest.json | Adds a representative single-object dbt manifest fixture with multiple empty-object fields. |
| ingestion/tests/unit/resources/datalake/dbt_catalog.json | Adds a representative single-object dbt catalog fixture with nested dicts and nulls. |

Comment thread ingestion/src/metadata/utils/datalake/datalake_utils.py Outdated
Comment thread ingestion/tests/unit/utils/test_datalake.py
@github-actions
Contributor

github-actions Bot commented May 7, 2026

🟡 Playwright Results — all passed (10 flaky)

✅ 4002 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 86 skipped

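| Shard | Passed | Failed | Flaky | Skipped |
|---|---|---|---|---|
| ✅ Shard 1 | 299 | 0 | 0 | 4 |
| 🟡 Shard 2 | 748 | 0 | 4 | 8 |
| 🟡 Shard 3 | 758 | 0 | 3 | 7 |
| ✅ Shard 4 | 775 | 0 | 0 | 18 |
| ✅ Shard 5 | 687 | 0 | 0 | 41 |
| 🟡 Shard 6 | 735 | 0 | 3 | 8 |
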
🟡 10 flaky test(s) (passed on retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 2 retries)
  • Features/IncidentManager.spec.ts › Next, Previous and page indicator (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/AddRoleAndAssignToUser.spec.ts › Verify assigned role to new user (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Features/AutoPilot.spec.ts › Agents created by AutoPilot should be deleted (shard 6, 1 retry)
  • Pages/GlossaryImportExport.spec.ts › Glossary CSV import rejects unknown relation type (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings May 8, 2026 04:50
@gitar-bot

gitar-bot Bot commented May 8, 2026

Code Review ✅ Approved

Updates datalake utilities to correctly type empty dictionaries and lists as JSON or ARRAY instead of STRING. The implementation includes refined type handling and added unit tests to ensure accurate column inference.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment on lines +341 to +350
  df_row_val_list = col_non_null.values[:1000]
  parsed_object_datatype_list = []
  for df_row_val in df_row_val_list:
      try:
-         parsed_object_datatype_list.append(type(ast.literal_eval(str(df_row_val))).__name__.lower())
+         if isinstance(df_row_val, (dict, list)):
+             parsed_object_datatype_list.append(type(df_row_val).__name__.lower())
+         else:
+             parsed_object_datatype_list.append(
+                 type(ast.literal_eval(str(df_row_val))).__name__.lower()
+             )

Contributor Author

This is pre-existing behavior and out of scope for this PR, which targets the empty-dict typing bug. Worth a follow-up issue.

@sonarqubecloud

sonarqubecloud Bot commented May 8, 2026

Labels

  • bug: Something isn't working
  • Ingestion
  • python: Pull requests that update python code
  • safe to test: Add this label to run secure Github workflows on PRs
  • To release: Will cherry-pick this PR into the release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Datalake connector: JSON columns incorrectly typed as STRING and TypeError logged when ingesting single-object JSON files

3 participants