Mark JSON-bearing string fields with is_json metadata #111

Merged
adriangb merged 4 commits into datafusion-contrib:main from pydantic:array-json
Apr 25, 2026

Conversation

@adriangb
Collaborator

Summary

  • json_get_array returns List<Utf8> whose items are raw JSON strings, but the inner field was missing the is_json: true field metadata. This adds it on both the declared return type and the produced ListArray (via ListBuilder::with_field), so downstream consumers can detect that the items are JSON.
  • json_get_json returns Utf8 containing a raw JSON sub-document; added is_json: true on the top-level return field via return_field_from_args.
  • Centralized the metadata into a shared is_json_metadata() helper in common.rs and reused it in common_union.rs (was duplicated inline).
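
The shared helper and a consumer-side check might look roughly like the sketch below. This is not the PR's actual code: it uses only `std`, relying on the fact that Arrow field metadata is a string-to-string map. The body of `is_json_metadata` is assumed from the description above, and `is_json_field` is a hypothetical helper for a downstream consumer.

```rust
use std::collections::HashMap;

/// Builds the metadata map that marks a field as carrying JSON text.
/// The `is_json: true` pair comes from the PR description; the body
/// shown here is an assumption, not the crate's source.
pub fn is_json_metadata() -> HashMap<String, String> {
    HashMap::from([("is_json".to_string(), "true".to_string())])
}

/// Hypothetical consumer-side check: returns true only when the
/// field metadata contains `is_json` set to the string "true".
pub fn is_json_field(metadata: &HashMap<String, String>) -> bool {
    metadata.get("is_json").map(String::as_str) == Some("true")
}
```

A consumer would pass the metadata of the relevant field (the inner list item field for json_get_array, the top-level field for json_get_json) to a check like `is_json_field`.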

Audit notes

While auditing, I checked every UDF for missing JSON metadata:

| UDF | Returns JSON? | Status |
|---|---|---|
| json_get_array | Inner Utf8 items are JSON | Fixed |
| json_get_json | Top-level Utf8 is raw JSON | Fixed |
| json_get / json_from_scalar | JsonUnion (array/object members are JSON) | Already correct via union_fields() |
| json_as_text | Mixed: strings are unwrapped, only objects/arrays return JSON | Intentionally not marked (would be misleading) |
| json_object_keys | List of object keys (plain strings) | N/A |
| json_get_str/int/float/bool, json_length, json_contains | Primitives | N/A |

json_as_text is intentionally left unmarked because for Peek::String it returns the unescaped string value (no surrounding quotes), so the output is not consistently valid JSON. Happy to revisit if you'd prefer it marked.
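
A toy model of that mixed behavior, using only `std`, is sketched below. `Value` and `as_text` are hypothetical stand-ins written for illustration; they are not the crate's `Peek` type or its implementation, and only the two cases discussed above are modeled.

```rust
/// Hypothetical stand-in for a parsed JSON value (illustration only).
pub enum Value {
    /// A JSON string, held already unescaped and without surrounding quotes.
    Str(String),
    /// An object or array, held as its raw JSON text.
    Raw(String),
}

/// Toy model of the described behavior: strings come back unwrapped,
/// while objects and arrays come back as raw JSON text.
pub fn as_text(v: &Value) -> String {
    match v {
        Value::Str(s) => s.clone(),
        Value::Raw(raw) => raw.clone(),
    }
}
```

In this model, `as_text(&Value::Str("hello".to_string()))` yields `hello`, which is not a valid JSON document on its own, while the `Raw` case yields valid JSON text. A single is_json flag on the column could not describe both outcomes.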

Test plan

  • cargo test — all 154 tests pass
  • cargo clippy --all-targets — clean
  • New test test_json_get_array_inner_field_is_json_metadata verifies the inner field metadata both in schema and in the produced ListArray
  • New test test_json_get_json_is_json_metadata verifies the top-level field metadata

🤖 Generated with Claude Code

`json_get_array` returned `List<Utf8>` whose items are raw JSON strings,
but the inner field had no `is_json: true` metadata, so downstream
consumers could not detect that the items were JSON.

Also mark `json_get_json`'s top-level `Utf8` return field (it returns
a raw JSON sub-document). Centralizes the metadata construction in a
shared `is_json_metadata()` helper and reuses it from `common_union`.

`json_as_text` is intentionally not marked, since for `Peek::String`
it returns the unescaped string value (no surrounding quotes), so the
output is not consistently valid JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR standardizes how JSON-bearing string outputs are annotated across the JSON UDFs by ensuring Arrow Field metadata includes is_json: true wherever the returned Utf8 (or inner Utf8 list items) contains raw JSON, enabling downstream consumers to reliably detect JSON strings.

Changes:

  • Add a shared is_json_metadata() helper and reuse it for JSON-bearing Field metadata.
  • Mark json_get_array’s inner Utf8 list item field as JSON in both the declared return type and the produced ListArray builder field.
  • Mark json_get_json’s returned Utf8 field as JSON via return_field_from_args, and add regression tests for both UDFs.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/common.rs | Introduces is_json_metadata() helper for consistent is_json field metadata creation. |
| src/common_union.rs | Reuses is_json_metadata() instead of duplicating inline metadata construction for union fields. |
| src/json_get_array.rs | Ensures the list item Field for json_get_array is marked is_json and that produced ListArrays carry the same metadata. |
| src/json_get_json.rs | Sets is_json metadata on the returned field via return_field_from_args. |
| tests/main.rs | Adds tests asserting JSON metadata is present on both schema fields and produced arrays. |

Comment thread src/common.rs Outdated
adriangb and others added 2 commits April 25, 2026 13:15
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
`is_json_metadata()` was ambiguous about direction (check vs. mark).
The helper is a constructor of metadata that marks a field as containing
JSON-encoded data. Renaming to `json_field_metadata()` makes that
unambiguous and stays correct if more keys (e.g. canonical Arrow
extension keys) are added to the returned map in the future.

Move it to `src/common_union.rs`, which already houses JSON-typing
concerns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb merged commit a3d9f62 into datafusion-contrib:main Apr 25, 2026
7 checks passed
@adriangb adriangb deleted the array-json branch April 25, 2026 18:46