Skip to content

[FLINK-39623][table-planner] Make CAST(BYTES AS STRING) strict on invalid UTF-8#28125

Open
gustavodemorais wants to merge 3 commits intoapache:masterfrom
confluentinc:cast-fix
Open

[FLINK-39623][table-planner] Make CAST(BYTES AS STRING) strict on invalid UTF-8#28125
gustavodemorais wants to merge 3 commits intoapache:masterfrom
confluentinc:cast-fix

Conversation

@gustavodemorais
Copy link
Copy Markdown
Contributor

@gustavodemorais gustavodemorais commented May 7, 2026

What is the purpose of the change

CAST from BINARY/VARBINARY/BYTES to a CHAR/VARCHAR/STRING type now validates UTF-8 and throws on invalid input instead of silently substituting U+FFFD. A new ExecutionConfigOption restores the prior behaviour for users who need it. Part of FLIP-568.

Brief change log

  • Add ExecutionConfigOptions.TABLE_EXEC_LEGACY_BYTES_TO_STRING_CAST (default false).
  • Update BinaryToStringCastRule with a strict UTF-8 fast path for unbounded targets and a strict decode in the round-trip path; override canFail.
  • Add BINARY_STRING_DATA_FROM_UTF8_BYTES to BuiltInMethods.
  • Update CAST and MAKE_VALID_UTF8 entries in sql_functions.yml and sql_functions_zh.yml.
  • Added release notes to jira ticket

Verifying this change

  • CastRulesTest
  • CastFunctionMiscITCase

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes - new ConfigOption on ExecutionConfigOptions
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes - per-record CAST(BYTES AS STRING) now validates UTF-8; the fast path wraps bytes with one validating walk
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? sql_functions.yml, sql_functions_zh.yml, execution_config_configuration.html, flink-2.4 release notes

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

2.1.117 (Claude Code)

@flinkbot
Copy link
Copy Markdown
Collaborator

flinkbot commented May 7, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Comment thread docs/content/release-notes/flink-2.4.md Outdated
@@ -0,0 +1,41 @@
---
title: "Release Notes - Flink 2.4"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be done by release managers

the release notes for the PR/jira should be put in Release notes of jira issue field and RM could incorporate them into final release notes doc

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But does it harm to start early? I actually like this idea, it makes a RM work easier later.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess the main issue is that when RM prepares release he/she goes through steps defined in https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release

there is a step to look into release notes in jira and nothing about this....
i'm not against this approach if it will be documented

Copy link
Copy Markdown
Contributor Author

@gustavodemorais gustavodemorais May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the note, @snuyanzin. I've added the release notes to the jira ticket and removed from the PR. Also added a short note with this info in Agents.md f860662

I think having it in the PR is nice to give the chance for the release notes to be reviewed directly. However, it's of course a pain to change an existing working process. Do you personally have a preference as someone who has had more experience with releases?

Comment thread docs/data/sql_functions.yml Outdated
- sql: CAST(value AS type)
table: ANY.cast(TYPE)
description: Returns a new value being cast to type type. A CAST error throws an exception and fails the job. When performing a cast operation that may fail, like STRING to INT, one should rather use TRY_CAST, in order to handle errors. If "table.exec.legacy-cast-behaviour" is enabled, CAST behaves like TRY_CAST. E.g., CAST('42' AS INT) returns 42; CAST(NULL AS STRING) returns NULL of type STRING; CAST('non-number' AS INT) throws an exception and fails the job.
description: Returns a new value being cast to type type. A CAST error throws an exception and fails the job. When performing a cast operation that may fail, like STRING to INT, one should rather use TRY_CAST, in order to handle errors. If "table.exec.legacy-cast-behaviour" is enabled, CAST behaves like TRY_CAST. E.g., CAST('42' AS INT) returns 42; CAST(NULL AS STRING) returns NULL of type STRING; CAST('non-number' AS INT) throws an exception and fails the job. Casting BINARY/VARBINARY/BYTES to a CHAR/VARCHAR/STRING type validates that the input is well-formed UTF-8 and throws on invalid sequences. Use MAKE_VALID_UTF8 to substitute the Unicode replacement character `U+FFFD` for invalid bytes, TRY_CAST to return NULL, or set "table.exec.legacy-bytes-to-string-cast" to "true" to restore the prior silent-substitution behavior.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have sql more consistent: in some places it is with back ticks, in some without

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Contributor

@twalthr twalthr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % Sergey's comments

@gustavodemorais
Copy link
Copy Markdown
Contributor Author

Thanks for the reviews, @snuyanzin and @twalthr. Addressed the comments 🙂

@gustavodemorais gustavodemorais marked this pull request as ready for review May 8, 2026 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants