diff --git a/AGENTS.md b/AGENTS.md index daa34497a6628..1c4199d50b8c3 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -196,6 +196,7 @@ Every module from the root pom.xml, organized by function. Flink provides three Key separations: - **Planner vs Runtime:** The table planner generates code and execution plans; the runtime executes them. Changes to planning logic live in `flink-table-planner`; changes to runtime operators live in `flink-table-runtime` or `flink-streaming-java`. +- **Codegen vs hand-written operators:** Per-record expression logic (casts, projections, filters, function calls) is generated at planning time by cast rules in `flink-table-planner/.../functions/casting/` and call generators in `flink-table-planner/.../codegen/calls/`, then compiled by Janino into the surrounding operator class. Operators with fixed structure (joins, aggregations, source/sink runtime) are hand-written Java in `flink-table-runtime` or `flink-streaming-java`. New scalar functions usually only need a `BuiltInFunctionDefinitions` entry plus a `BuiltInScalarFunction` subclass - the planner wires up codegen automatically. New cast behaviour or a custom call shape needs a cast rule or call generator. - **API vs Implementation:** Public API surfaces (`flink-core-api`, `flink-datastream-api`, `flink-table-api-java`) are separate from implementation modules. API stability annotations control what users can depend on. - **ArchUnit enforcement:** `flink-architecture-tests/` contains ArchUnit tests that enforce module boundaries. New violations should be avoided; if unavoidable, follow the freeze procedure in `flink-architecture-tests/README.md`. 
@@ -294,6 +295,7 @@ This section maps common types of Flink changes to the modules they touch and th - Ensure `./mvnw clean verify` passes before opening a PR - Always push to your fork, not directly to `apache/flink` - Rebase onto the latest target branch before submitting +- For user-visible behaviour changes, breaking changes, new SQL features, or new config options: fill in the **Release Notes** field on the JIRA ticket. The release manager consolidates these when cutting a release. The next version's `docs/content/release-notes/flink-X.Y.md` will be generated based on the JIRA tickets, so make sure to fill them in properly. ### AI-assisted contributions diff --git a/docs/data/sql_functions.yml b/docs/data/sql_functions.yml index 4c60a96746c80..53a1ad5dd1897 100644 --- a/docs/data/sql_functions.yml +++ b/docs/data/sql_functions.yml @@ -794,7 +794,7 @@ conditional: conversion: - sql: CAST(value AS type) table: ANY.cast(TYPE) - description: Returns a new value being cast to type type. A CAST error throws an exception and fails the job. When performing a cast operation that may fail, like STRING to INT, one should rather use TRY_CAST, in order to handle errors. If "table.exec.legacy-cast-behaviour" is enabled, CAST behaves like TRY_CAST. E.g., CAST('42' AS INT) returns 42; CAST(NULL AS STRING) returns NULL of type STRING; CAST('non-number' AS INT) throws an exception and fails the job. + description: Returns a new value being cast to type type. A CAST error throws an exception and fails the job. When performing a cast operation that may fail, like STRING to INT, one should rather use TRY_CAST, in order to handle errors. If "table.exec.legacy-cast-behaviour" is enabled, CAST behaves like TRY_CAST. E.g., CAST('42' AS INT) returns 42; CAST(NULL AS STRING) returns NULL of type STRING; CAST('non-number' AS INT) throws an exception and fails the job. 
Casting BINARY/VARBINARY/BYTES to a CHAR/VARCHAR/STRING type validates that the input is well-formed UTF-8 and throws on invalid sequences. Use MAKE_VALID_UTF8 to substitute the Unicode replacement character U+FFFD for invalid bytes, TRY_CAST to return NULL, or set "table.exec.legacy-bytes-to-string-cast" to "true" to restore the prior silent-substitution behavior. - sql: TRY_CAST(value AS type) table: ANY.tryCast(TYPE) description: Like CAST, but in case of error, returns NULL rather than failing the job. E.g., TRY_CAST('42' AS INT) returns 42; TRY_CAST(NULL AS STRING) returns NULL of type STRING; TRY_CAST('non-number' AS INT) returns NULL of type INT; COALESCE(TRY_CAST('non-number' AS INT), 0) returns 0 of type INT. @@ -818,6 +818,8 @@ conversion: description: | Decodes the input as UTF-8, replacing each invalid sequence with the Unicode replacement character `U+FFFD` (rendered as `�`). The substitution is lossy and irreversible. Returns `NULL` if the input is `NULL`. + `MAKE_VALID_UTF8()` can be used in place of `CAST(bytes AS STRING)`, which throws on invalid UTF-8. + E.g., `MAKE_VALID_UTF8(x'48656C6C6F')` returns `'Hello'`; `MAKE_VALID_UTF8(x'80')` returns `'�'` (the `U+FFFD` replacement character). collection: diff --git a/docs/data/sql_functions_zh.yml b/docs/data/sql_functions_zh.yml index 5cfcac4879137..12a952bd9687a 100644 --- a/docs/data/sql_functions_zh.yml +++ b/docs/data/sql_functions_zh.yml @@ -922,8 +922,10 @@ conversion: description: | 返回 value 被转换为类型 type 的新值。CAST错误会抛出异常并导致作业失败。为了处理错误,在使用可能失败的 CAST 操作时,例如 STRING 转换为 INT,建议使用 TRY_CAST 替代。 如果开启了 "table.exec.legacy-cast-behaviour",CAST 行为将变得与 TRY_CAST 一致。 - + 例如, CAST('42' AS INT) 返回 42; CAST(NULL AS STRING) 返回字符串类型的 `NULL`; CAST('non-number' AS INT) 抛出异常且作业失败。 + + Casting BINARY/VARBINARY/BYTES to a CHAR/VARCHAR/STRING type validates that the input is well-formed UTF-8 and throws on invalid sequences. 
Use MAKE_VALID_UTF8 to substitute the Unicode replacement character U+FFFD for invalid bytes, TRY_CAST to return NULL, or set "table.exec.legacy-bytes-to-string-cast" to "true" to restore the prior silent-substitution behavior. - sql: TRY_CAST(value AS type) table: ANY.tryCast(TYPE) description: | @@ -948,6 +950,8 @@ conversion: description: | Decodes the input as UTF-8, replacing each invalid sequence with the Unicode replacement character `U+FFFD` (rendered as `�`). The substitution is lossy and irreversible. Returns `NULL` if the input is `NULL`. + `MAKE_VALID_UTF8()` can be used in place of `CAST(bytes AS STRING)`, which throws on invalid UTF-8. + E.g., `MAKE_VALID_UTF8(x'48656C6C6F')` returns `'Hello'`; `MAKE_VALID_UTF8(x'80')` returns `'�'` (the `U+FFFD` replacement character). collection: diff --git a/docs/layouts/shortcodes/generated/execution_config_configuration.html b/docs/layouts/shortcodes/generated/execution_config_configuration.html index d043932d56e43..e71bf3d7673a9 100644 --- a/docs/layouts/shortcodes/generated/execution_config_configuration.html +++ b/docs/layouts/shortcodes/generated/execution_config_configuration.html @@ -182,6 +182,12 @@
Strict UTF-8 mode is the default: invalid input bytes throw a {@code TableRuntimeException}.
+ * Setting {@link ExecutionConfigOptions#TABLE_EXEC_LEGACY_BYTES_TO_STRING_CAST} to {@code true}
+ * restores the prior behavior, where invalid sequences are silently replaced by {@code U+FFFD}.
*/
class BinaryToStringCastRule extends AbstractNullAwareCodeGeneratorCastRule