Wire JSON ingestion schema extension modules #1215

Open
jwils wants to merge 1 commit into joshuaw/json-ingestion-extension-modules from joshuaw/json-ingestion-api-polish

Collaborator

@jwils jwils commented May 27, 2026

Why

Introduce the JSON ingestion schema-definition extension modules after the indexing extension points exist, but before making JSON ingestion the default implementation.

What

  • Add ElasticGraph::JSONIngestion::SchemaDefinition::APIExtension and supporting factory/results/artifact/state/schema-element extension modules
  • Wire JSON ingestion through the factory so core indexing field references and field types are extended in place
  • Keep the existing core JSON Schema behavior active in this intermediate layer
  • Add doctest support for JSON ingestion schema-definition examples without making the extension the default yet

Risk Assessment

Medium - this adds new extension code, but the default behavior remains the existing core JSON Schema implementation in this PR.

References

  • Stacked on Add JSON ingestion indexing extensions #1204.
  • bundle exec rspec elasticgraph-schema_definition/spec/unit/elastic_graph/schema_definition/json_schema_spec.rb elasticgraph-schema_definition/spec/unit/elastic_graph/schema_definition/json_schema_field_metadata_spec.rb elasticgraph-schema_definition/spec/unit/elastic_graph/schema_definition/indexing/json_schema_with_metadata_spec.rb elasticgraph-schema_definition/spec/unit/elastic_graph/schema_definition/factory_spec.rb passed.
  • script/type_check passed.
  • script/lint passed.

Stack

Current PR is marked with ->.

@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch from 3436470 to ea857eb Compare May 27, 2026 16:22
@jwils jwils force-pushed the joshuaw/json-ingestion-extension-modules branch from b80bef4 to bce83e3 Compare May 27, 2026 16:22
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch from ea857eb to 7af1788 Compare May 27, 2026 18:43
@jwils jwils force-pushed the joshuaw/json-ingestion-extension-modules branch from bce83e3 to 38d7488 Compare May 27, 2026 18:43
@jwils jwils force-pushed the joshuaw/json-ingestion-extension-modules branch from 38d7488 to 65c8468 Compare May 28, 2026 18:35
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch from 7af1788 to 20ee5dd Compare May 28, 2026 18:35
@jwils jwils force-pushed the joshuaw/json-ingestion-extension-modules branch from 65c8468 to d780944 Compare May 28, 2026 18:44
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch 2 times, most recently from 5b67eda to 28d3b58 Compare May 28, 2026 19:01
@jwils jwils force-pushed the joshuaw/json-ingestion-extension-modules branch 2 times, most recently from 08fa741 to 66a50ff Compare May 28, 2026 19:13
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch 2 times, most recently from 4f85649 to fd4651f Compare May 30, 2026 14:07
@jwils jwils force-pushed the joshuaw/json-ingestion-extension-modules branch 2 times, most recently from e904b45 to 7cd0f7d Compare May 30, 2026 14:26
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch 3 times, most recently from 2e3770f to 9ed0d47 Compare May 30, 2026 14:38
@jwils jwils force-pushed the joshuaw/json-ingestion-extension-modules branch from 3d879aa to 01d66b2 Compare May 30, 2026 20:03
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch from 9ed0d47 to a02256c Compare May 30, 2026 20:03
@jwils jwils force-pushed the joshuaw/json-ingestion-extension-modules branch from 01d66b2 to 4d549ec Compare May 30, 2026 20:17
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch from a02256c to 180c7d3 Compare May 30, 2026 20:17
@jwils jwils marked this pull request as ready for review May 31, 2026 03:58
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch from 180c7d3 to cd6256d Compare May 31, 2026 04:17
Collaborator

@myronmarston myronmarston left a comment

I haven't finished reviewing but wanted to submit my feedback so far.

Comment thread config/site/support/doctest_helper.rb Outdated
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch 2 times, most recently from 55175c5 to 16255f2 Compare May 31, 2026 13:28
Collaborator

@myronmarston myronmarston left a comment

Still not done reviewing but here's my next round of feedback.

Collaborator

@myronmarston myronmarston left a comment

Next set of feedback (still not done reviewing!).

# end
def json_schema(**options)
  super
  self.runtime_metadata = runtime_metadata.with(grouping_missing_value_placeholder: inferred_grouping_missing_value_placeholder) unless grouping_missing_value_placeholder_overridden
Collaborator

This line is super long.

...but I also think we can implement this functionality more simply if we follow the old pattern we had:

if (placeholder = inferred_grouping_missing_value_placeholder)
  self.runtime_metadata = runtime_metadata.with(grouping_missing_value_placeholder: placeholder)
end

Previously it happened in initialize after the scalar type has been yielded. Can we do a similar thing from new_scalar_type in factory_extension.rb?

  • In the block, extend the ScalarType with this module
  • Then yield
  • Then call something on this module to do the "post-yield" processing:
    • Validate that the json schema got set
    • Update grouping_missing_value_placeholder

With that approach you wouldn't need to override json_schema. Thoughts?
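The extend, yield, post-yield sequence proposed above can be sketched in isolation. This is a minimal illustration, not the PR's code: `ScalarType`, `Factory`, `SchemaError`, `finalize_json_schema_configuration!`, and the placeholder inference rule are all hypothetical stand-ins, and `runtime_metadata` is a plain Hash here rather than the real metadata object.

```ruby
# Hypothetical stand-ins; none of these are the real ElasticGraph classes.
class SchemaError < StandardError; end

class ScalarType
  attr_reader :name
  attr_accessor :runtime_metadata

  def initialize(name)
    @name = name
    @runtime_metadata = {}
    yield self if block_given?
  end
end

module ScalarTypeExtension
  # Additional user-facing API offered inside the scalar type block.
  def json_schema(**options)
    @json_schema_options = options
  end

  def json_schema_options
    @json_schema_options ||= {}
  end

  # Post-yield step: runs after the user's block has configured the type.
  def finalize_json_schema_configuration!
    if json_schema_options.empty?
      raise SchemaError, "Scalar type `#{name}` lacks `json_schema`."
    end

    if (placeholder = inferred_grouping_missing_value_placeholder)
      self.runtime_metadata = runtime_metadata.merge(grouping_missing_value_placeholder: placeholder)
    end
  end

  private

  # Made-up inference rule, for illustration only.
  def inferred_grouping_missing_value_placeholder
    "(missing)" if json_schema_options[:type] == "string"
  end
end

class Factory
  def new_scalar_type(name)
    ScalarType.new(name) do |type|
      extended_type = type.extend(ScalarTypeExtension) # 1. extend
      yield extended_type if block_given?              # 2. yield to the caller
      extended_type.finalize_json_schema_configuration! # 3. post-yield processing
    end
  end
end

title_type = Factory.new.new_scalar_type("Title") do |t|
  t.json_schema type: "string"
end
```

Because the placeholder is computed once after the block returns, nothing in this sketch needs to override `json_schema` to keep `runtime_metadata` in sync.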

super(name) do |type|
  extended_type = type.extend(SchemaElements::ScalarTypeExtension) # : ::ElasticGraph::SchemaDefinition::SchemaElements::ScalarType & SchemaElements::ScalarTypeExtension
  yield extended_type if block_given?
  extended_type.validate_json_schema_configuration! unless state.initially_registered_built_in_types.empty?
Collaborator

The unless state.initially_registered_built_in_types.empty? seems suspect--previously, ScalarType unconditionally validated the json schema after yielding:

yield self
missing = [
  ("`mapping`" if mapping_options.empty?),
  ("`json_schema`" if json_schema_options.empty?)
].compact
if missing.any?
  raise Errors::SchemaError, "Scalar types require `mapping` and `json_schema` to be configured, but `#{name}` lacks #{missing.join(" and ")}."
end

Also, when elasticgraph-json_ingestion is used, it's important that every built-in scalar type has its JSON schema configured.

Can we do away with the unless state.initially_registered_built_in_types.empty? check?

Edit: I think I see why you did it this way: the JSON schema for the built-in types gets configured via the on_built_in_types hook, which runs after all built-in types have been defined. That ordering can lead to subtle differences in behavior. Previously, logic executed while evaluating the user-defined schema definition could query the json_schema of the built-in scalar types and do computation based on it. Now the json_schema_options won't be set on built-in scalar types while the schema definition is evaluated.

An alternative to consider: instead of configuring the json_schema of each scalar type via the on_built_in_types hook, configure it here:

def new_scalar_type(name)
  super(name) do |type|
    extended_type = type.extend(SchemaElements::ScalarTypeExtension) # : ::ElasticGraph::SchemaDefinition::SchemaElements::ScalarType & SchemaElements::ScalarTypeExtension

    # if `name` is one of the built in types, configure `extended_type.json_schema` here, before yielding
    yield extended_type if block_given?
    extended_type.validate_json_schema_configuration! 
  end
end

Then the validation can be unconditional, and the JSON schema is configured on the built-in types when they are first created like has always been the case.

Collaborator

I think there's a better way to implement this logic. Extension modules are a powerful technique but should ideally only be used when needed. They have some downsides (e.g. they modify the ancestor chain of existing objects you don't own, and multiple extension modules applied to the same object can conflict if they define methods with the same name).
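The two downsides named here can be seen in a small standalone sketch. `Widget`, `AuditExtension`, and `CacheExtension` are invented for illustration and have nothing to do with ElasticGraph.

```ruby
# Hypothetical classes, for illustration only.
class Widget
  def label
    "core"
  end
end

module AuditExtension
  def label
    "audit:#{super}"
  end
end

module CacheExtension
  # Does not call `super`, so it silently hides AuditExtension#label.
  def label
    "cache"
  end
end

widget = Widget.new
widget.extend(AuditExtension)
widget.extend(CacheExtension)

# The most recently applied module sits first in the singleton ancestor chain.
chain = widget.singleton_class.ancestors
result = widget.label
```

The extension applied last wins the method lookup, and whether earlier extensions still run depends entirely on each module remembering to call `super`.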

Generally speaking, I only reach for an extension module when I need to do one of these things:

  • Offer an additional API to users as part of an existing object. Example: offering t.json_schema inside a schema.scalar_type block.
  • Modify the behavior of existing call paths by overriding existing methods. Example: defining IndexExtension#rollover to hook into what happens when rollover is called.

The logic here doesn't fall into either category. It's just internal logic that previously existed on TypeReference for reasons of convenience. There's no reason it still needs to exist on TypeReference, though, particularly since TypeReference isn't part of the EG public API.

Really, we just need a spot for the json_schema_layers logic to live. I believe the computation of json_schema_layers is only needed from FieldExtension#to_indexing_field_reference. Instead of needing a TypeReferenceExtension, we could move this into a JSONSchemaLayers object, e.g. JSONSchemaLayers.for(type) or something.

Thoughts?
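A rough sketch of what a standalone `JSONSchemaLayers.for(type)` could look like follows. Only the name comes from the comment above; everything else is an assumption: this version takes a GraphQL type string rather than a `TypeReference`, and it assumes "layers" means the nullable and array wrappers around a type, listed from outermost to innermost.

```ruby
# Hypothetical sketch; the real computation lives on TypeReference and may differ.
module JSONSchemaLayers
  # Returns the wrapper layers of a GraphQL type string, outermost first.
  # Assumption: a type without a trailing `!` contributes a :nullable layer,
  # and each `[...]` wrapper contributes an :array layer.
  def self.for(type_name)
    layers = []
    remaining = type_name

    loop do
      if remaining.end_with?("!")
        remaining = remaining.delete_suffix("!")
      else
        layers << :nullable
      end

      break unless remaining.start_with?("[") && remaining.end_with?("]")

      layers << :array
      remaining = remaining[1..-2] # peel one list wrapper
    end

    layers
  end
end

layers_for_list = JSONSchemaLayers.for("[String!]")
```

A plain module function like this needs no `extend` call, so the caller (for example `FieldExtension#to_indexing_field_reference`) can compute layers without touching the type object's ancestor chain.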


# Returns the API's `state` narrowed to include this gem's `StateExtension`. Centralizes
# the Steep cast that's needed because Steep can't see the `extend(StateExtension)` applied
# at runtime in `extended`.
Collaborator

Clever!

# @param version [Integer] current version number of the JSON schema artifact
# @return [void]
# @see #enforce_json_schema_version
def json_schema_version(version)
Collaborator

The YARD docs dropped an example that used to be there:

# @example Set the JSON schema version to 1
#   ElasticGraph.define_schema do |schema|
#     schema.json_schema_version 1
#   end

Can you bring that back?

# accidentally provides it as `parent_id`, ElasticGraph would happily ignore the `parent_id` field entirely, because `parentId`
# is allowed to be omitted and `parent_id` would be treated as an extra field. Therefore, we recommend that you only set one of
# these to `true` (or none).
def json_schema_strictness(allow_omitted_fields: false, allow_extra_fields: true)
Collaborator

The YARD docs dropped an example that used to be here:

# @example Allow omitted fields and disallow extra fields
#   ElasticGraph.define_schema do |schema|
#     schema.json_schema_strictness allow_omitted_fields: true, allow_extra_fields: false
#   end

Can you bring it back?

end

# @private
def new_enum_indexing_field_type(...)
Collaborator

Suggested change
- def new_enum_indexing_field_type(...)
+ def new_enum_indexing_field_type(enum_value_names)

Using ... is nice when you have a larger list of arguments, particularly if that list may grow over time...but here it's just one argument and it obscures what's going on.

On new_object_indexing_field_type below I think ... is fine because there's a long list of args.
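To make the readability trade-off concrete, here is a small sketch comparing the two signatures. `BaseFactory` and both subclasses are hypothetical, and the Hash return value is invented purely so there is something to compare.

```ruby
# Hypothetical classes, for illustration only.
class BaseFactory
  def new_enum_indexing_field_type(enum_value_names)
    {kind: :enum, values: enum_value_names}
  end
end

class ForwardingFactory < BaseFactory
  # `...` forwards whatever is passed; the signature says nothing about it.
  def new_enum_indexing_field_type(...)
    super(...).merge(extended: true)
  end
end

class ExplicitFactory < BaseFactory
  # The single argument is visible in the signature and in generated docs.
  def new_enum_indexing_field_type(enum_value_names)
    super(enum_value_names).merge(extended: true)
  end
end

forwarded = ForwardingFactory.new.new_enum_indexing_field_type(%w[RED GREEN])
explicit = ExplicitFactory.new.new_enum_indexing_field_type(%w[RED GREEN])

# Reflection shows the difference: only the explicit version names its parameter.
explicit_params = ExplicitFactory.instance_method(:new_enum_indexing_field_type).parameters
forwarding_params = ForwardingFactory.instance_method(:new_enum_indexing_field_type).parameters
```

Both versions return the same result; the explicit one simply documents itself, which matters most when there is only one argument to name.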

Comment on lines +44 to +49
def new_field(**kwargs, &block)
  super(**kwargs) do |field|
    extended_field = field.extend(SchemaElements::FieldExtension) # : ::ElasticGraph::SchemaDefinition::SchemaElements::Field & SchemaElements::FieldExtension
    block&.call(extended_field)
  end
end
Collaborator

Suggested change
- def new_field(**kwargs, &block)
-   super(**kwargs) do |field|
-     extended_field = field.extend(SchemaElements::FieldExtension) # : ::ElasticGraph::SchemaDefinition::SchemaElements::Field & SchemaElements::FieldExtension
-     block&.call(extended_field)
-   end
- end
+ def new_field(**kwargs)
+   super(**kwargs) do |field|
+     extended_field = field.extend(SchemaElements::FieldExtension) # : ::ElasticGraph::SchemaDefinition::SchemaElements::Field & SchemaElements::FieldExtension
+     yield extended_field if block_given?
+   end
+ end

IIRC, there's a bit of extra overhead inherent in the &block syntax as it forces Ruby to allocate a block object, which isn't required for yield/block_given?. My rule of thumb is to use &block when I'm just passing it through to another method, like we do here:

def build_bool_hash(&block)
  bool_node = Hash.new { |h, k| h[k] = [] } # : stringOrSymbolHash
  bool_node.tap(&block)

...but to use yield instead of block.call and yield if block_given? instead of block&.call(...).

Can you also apply this below?
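The rule of thumb above can be sketched with three tiny methods. The method names are made up for illustration, and the sketch only demonstrates that the styles behave the same; it does not measure the allocation overhead the comment recalls.

```ruby
# Hypothetical helper methods, for illustration only.

# Names the block as a Proc and calls it only if one was given.
def collect_with_block_param(&block)
  items = []
  block&.call(items)
  items
end

# Same behavior without naming the block.
def collect_with_yield
  items = []
  yield items if block_given?
  items
end

# Pass-through case: `&block` is the natural fit when the block is
# simply handed to another method.
def collect_via_tap(&block)
  [].tap(&block)
end

with_param = collect_with_block_param { |items| items << :a << :b }
with_yield = collect_with_yield { |items| items << :a << :b }
via_tap = collect_via_tap { |items| items << :a << :b }

# Both optional-block styles tolerate a missing block.
empty_param = collect_with_block_param
empty_yield = collect_with_yield
```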

end

# @private
def new_scalar_indexing_field_type(...)
Collaborator

Suggested change
- def new_scalar_indexing_field_type(...)
+ def new_scalar_indexing_field_type(scalar_type:)

end

# @private
def new_union_indexing_field_type(...)
Collaborator

Suggested change
- def new_union_indexing_field_type(...)
+ def new_union_indexing_field_type(subtypes_by_name)

@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch 3 times, most recently from a1e2bb3 to e28ff9c Compare June 1, 2026 18:39
@jwils jwils force-pushed the joshuaw/json-ingestion-api-polish branch from e28ff9c to 83f6c2a Compare June 1, 2026 18:57