diff --git a/docs/source/about/gluten_comparison.md b/docs/source/about/gluten_comparison.md
index 40c6c2741a..52425d5c0d 100644
--- a/docs/source/about/gluten_comparison.md
+++ b/docs/source/about/gluten_comparison.md
@@ -64,7 +64,7 @@ Comet relies on the full Spark SQL test suite (consisting of more than 24,000 te
 integration tests to ensure compatibility with Spark. Features that are known to have compatibility differences with
 Spark are disabled by default, but users can opt in. See the [Comet Compatibility Guide] for more information.
 
-[Comet Compatibility Guide]: /user-guide/latest/compatibility.md
+[Comet Compatibility Guide]: /user-guide/latest/compatibility/index.md
 
 Gluten also aims to provide compatibility with Spark, and includes a subset of the Spark SQL tests in its own test
 suite. See the [Gluten Compatibility Guide] for more information.
diff --git a/docs/source/contributor-guide/adding_a_new_operator.md b/docs/source/contributor-guide/adding_a_new_operator.md
index 82c237f830..de2a73da88 100644
--- a/docs/source/contributor-guide/adding_a_new_operator.md
+++ b/docs/source/contributor-guide/adding_a_new_operator.md
@@ -414,7 +414,7 @@ mod tests {
 
 ### Step 7: Update Documentation
 
-Add your operator to the supported operators list in `docs/source/user-guide/latest/compatibility.md` or similar documentation.
+Add your operator to the supported operators list in `docs/source/user-guide/latest/compatibility/operators.md` or similar documentation.
 
 ## Implementing a Sink Operator
diff --git a/docs/source/contributor-guide/index.md b/docs/source/contributor-guide/index.md
index 0e46f5c821..ba0fc27473 100644
--- a/docs/source/contributor-guide/index.md
+++ b/docs/source/contributor-guide/index.md
@@ -28,7 +28,6 @@ Comet Plugin Overview
 Arrow FFI
 JVM Shuffle
 Native Shuffle
-Parquet Scans
 Development Guide
 Debugging Guide
 Benchmarking Guide
diff --git a/docs/source/contributor-guide/parquet_scans.md b/docs/source/contributor-guide/parquet_scans.md
deleted file mode 100644
index cb94c28514..0000000000
--- a/docs/source/contributor-guide/parquet_scans.md
+++ /dev/null
@@ -1,143 +0,0 @@
-
-
-# Comet Parquet Scan Implementations
-
-Comet currently has two distinct implementations of the Parquet scan operator.
-
-| Scan Implementation     | Notes                  |
-| ----------------------- | ---------------------- |
-| `native_datafusion`     | Fully native scan      |
-| `native_iceberg_compat` | Hybrid JVM/native scan |
-
-The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which
-attempts to use `native_datafusion` first, and falls back to Spark if the scan cannot be converted
-(e.g., due to unsupported features). Most users should not need to change this setting.
-However, it is possible to force Comet to use a particular implementation for all scan operations by setting
-this configuration property to one of the following implementations. For example: `--conf spark.comet.scan.impl=native_datafusion`.
-
-The following features are not supported by either scan implementation, and Comet will fall back to Spark in these scenarios:
-
-- `ShortType` columns, by default. When reading Parquet files written by systems other than Spark that contain
-  columns with the logical type `UINT_8` (unsigned 8-bit integers), Comet may produce different results than Spark.
-  Spark maps `UINT_8` to `ShortType`, but Comet's Arrow-based readers respect the unsigned type and read the data as
-  unsigned rather than signed. Since Comet cannot distinguish `ShortType` columns that came from `UINT_8` versus
-  signed `INT16`, by default Comet falls back to Spark when scanning Parquet files containing `ShortType` columns.
-  This behavior can be disabled by setting `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType`
-  columns are always safe because they can only come from signed `INT8`, where truncation preserves the signed value.
-- Default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
-- Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the
-  V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API.
-- Spark metadata columns (e.g., `_metadata.file_path`)
-- No support for AQE Dynamic Partition Pruning (DPP). Non-AQE DPP is supported.
-
-The following shared limitation may produce incorrect results without falling back to Spark:
-
-- No support for datetime rebasing. When reading Parquet files containing dates or timestamps written before
-  Spark 3.0 (which used a hybrid Julian/Gregorian calendar), dates/timestamps will be read as if they were
-  written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
-  October 15, 1582.
-
-The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
-cause Comet to fall back to Spark (including when using `auto` mode). Note that the `native_datafusion` scan
-requires `spark.comet.exec.enabled=true` because the scan node must be wrapped by `CometExecRule`.
-
-- No support for row indexes
-- No support for reading Parquet field IDs
-- No support for `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions.
-  The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values.
-- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true`
-- Duplicate field names in case-insensitive mode (e.g., a Parquet file with both `B` and `b` columns)
-  are detected at read time and raise a `SparkRuntimeException` with error class `_LEGACY_ERROR_TEMP_2093`,
-  matching Spark's behavior.
-
-The `native_iceberg_compat` scan has the following additional limitation that may produce incorrect results
-without falling back to Spark:
-
-- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values.
-  This may produce incorrect results when non-default values are set. The affected configurations are
-  `spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`,
-  `spark.sql.parquet.inferTimestampNTZ.enabled`, and `spark.sql.legacy.parquet.nanosAsLong`. See
-  [issue #1816](https://github.com/apache/datafusion-comet/issues/1816) for more details.
-
-## S3 Support
-
-The `native_datafusion` and `native_iceberg_compat` Parquet scan implementations completely offload data loading
-to native code. They use the [`object_store` crate](https://crates.io/crates/object_store) to read data from S3 and
-support configuring S3 access using standard [Hadoop S3A configurations](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration) by translating them to
-the `object_store` crate's format.
-
-This implementation maintains compatibility with existing Hadoop S3A configurations, so existing code will
-continue to work as long as the configurations are supported and can be translated without loss of functionality.
-
-#### Additional S3 Configuration Options
-
-Beyond credential providers, the `native_datafusion` and `native_iceberg_compat` implementations support additional
-S3 configuration options:
-
-| Option                          | Description                                                                                          |
-| ------------------------------- | ---------------------------------------------------------------------------------------------------- |
-| `fs.s3a.endpoint`               | The endpoint of the S3 service                                                                       |
-| `fs.s3a.endpoint.region`        | The AWS region for the S3 service. If not specified, the region will be auto-detected.               |
-| `fs.s3a.path.style.access`      | Whether to use path style access for the S3 service (true/false, defaults to virtual hosted style)   |
-| `fs.s3a.requester.pays.enabled` | Whether to enable requester pays for S3 requests (true/false)                                        |
-
-All configuration options support bucket-specific overrides using the pattern `fs.s3a.bucket.{bucket-name}.{option}`.
-
-#### Examples
-
-The following examples demonstrate how to configure S3 access with the `native_datafusion` and `native_iceberg_compat`
-Parquet scan implementations using different authentication methods.
-
-**Example 1: Simple Credentials**
-
-This example shows how to access a private S3 bucket using an access key and secret key. The `fs.s3a.aws.credentials.provider` configuration can be omitted since `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` is included in Hadoop S3A's default credential provider chain.
-
-```shell
-$SPARK_HOME/bin/spark-shell \
-...
---conf spark.comet.scan.impl=native_datafusion \
---conf spark.hadoop.fs.s3a.access.key=my-access-key \
---conf spark.hadoop.fs.s3a.secret.key=my-secret-key
-...
-```
-
-**Example 2: Assume Role with Web Identity Token**
-
-This example demonstrates using an assumed role credential to access a private S3 bucket, where the base credential for assuming the role is provided by a web identity token credentials provider.
-
-```shell
-$SPARK_HOME/bin/spark-shell \
-...
---conf spark.comet.scan.impl=native_datafusion \
---conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider \
---conf spark.hadoop.fs.s3a.assumed.role.arn=arn:aws:iam::123456789012:role/my-role \
---conf spark.hadoop.fs.s3a.assumed.role.session.name=my-session \
---conf spark.hadoop.fs.s3a.assumed.role.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider
-...
-```
-
-#### Limitations
-
-The S3 support of `native_datafusion` and `native_iceberg_compat` has the following limitations:
-
-1. **Partial Hadoop S3A configuration support**: Not all Hadoop S3A configurations are currently supported. Only the configurations listed in the tables above are translated and applied to the underlying `object_store` crate.
-
-2. **Custom credential providers**: Custom implementations of AWS credential providers are not supported. The implementation only supports the standard credential providers listed in the table above. We are planning to add support for custom credential providers through a JNI-based adapter that will allow calling Java credential providers from native code. See [issue #1829](https://github.com/apache/datafusion-comet/issues/1829) for more details.
diff --git a/docs/source/user-guide/latest/compatibility.md b/docs/source/user-guide/latest/compatibility.md
deleted file mode 100644
index affcfec6d8..0000000000
--- a/docs/source/user-guide/latest/compatibility.md
+++ /dev/null
@@ -1,184 +0,0 @@
-
-
-# Compatibility Guide
-
-Comet aims to provide consistent results with the version of Apache Spark that is being used.
-
-This guide offers information about areas of functionality where there are known differences.
-
-## Parquet
-
-Comet has the following limitations when reading Parquet files:
-
-- Comet does not support reading decimals encoded in binary format.
-- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
-
-## ANSI Mode
-
-Comet will fall back to Spark for the following expressions when ANSI mode is enabled. These expressions can be enabled by setting
-`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See
-the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting.
-
-- Average (supports all numeric inputs except decimal types)
-- Cast (in some cases)
-
-There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support.
-
-## Floating-point Number Comparison
-
-Spark normalizes NaN and zero for floating point numbers for several cases. See `NormalizeFloatingNumbers` optimization rule in Spark.
-However, one exception is comparison. Spark does not normalize NaN and zero when comparing values
-because they are handled well in Spark (e.g., `SQLOrderingUtil.compareFloats`). But the comparison
-functions of arrow-rs used by DataFusion do not normalize NaN and zero (e.g., [arrow::compute::kernels::cmp::eq](https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.eq.html#)).
-So Comet adds additional normalization expression of NaN and zero for comparisons, and may still have differences
-to Spark in some cases, especially when the data contains both positive and negative zero. This is likely an edge
-case that is not of concern for many users. If it is a concern, setting `spark.comet.exec.strictFloatingPoint=true`
-will make relevant operations fall back to Spark.
-
-## Incompatible Expressions
-
-Expressions that are not 100% Spark-compatible will fall back to Spark by default and can be enabled by setting
-`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See
-the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting.
-
-### Aggregate Expressions
-
-- **CollectSet**: Comet deduplicates NaN values (treats `NaN == NaN`) while Spark treats each NaN as a distinct value.
-  When `spark.comet.exec.strictFloatingPoint=true`, `collect_set` on floating-point types falls back to Spark unless
-  `spark.comet.expression.CollectSet.allowIncompatible=true` is set.
-
-### Array Expressions
-
-- **SortArray**: Nested arrays with `Struct` or `Null` child values are not supported natively and will fall back to Spark.
-
-### Date/Time Expressions
-
-- **Hour, Minute, Second**: Incorrectly apply timezone conversion to TimestampNTZ inputs. TimestampNTZ stores local
-  time without timezone, so no conversion should be applied. These expressions work correctly with Timestamp inputs.
-  [#3180](https://github.com/apache/datafusion-comet/issues/3180)
-- **TruncTimestamp (date_trunc)**: Produces incorrect results when used with non-UTC timezones. Compatible when
-  timezone is UTC.
-  [#2649](https://github.com/apache/datafusion-comet/issues/2649)
-
-### Struct Expressions
-
-- **StructsToJson (to_json)**: Does not support `+Infinity` and `-Infinity` for numeric types (float, double).
-  [#3016](https://github.com/apache/datafusion-comet/issues/3016)
-
-## Regular Expressions
-
-Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's
-regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but
-this can be overridden by setting `spark.comet.expression.regexp.allowIncompatible=true`.
-
-## Window Functions
-
-Comet's support for window functions is incomplete and known to be incorrect. It is disabled by default and
-should not be used in production. The feature will be enabled in a future release. Tracking issue: [#2721](https://github.com/apache/datafusion-comet/issues/2721).
-
-## Round-Robin Partitioning
-
-Comet's native shuffle implementation of round-robin partitioning (`df.repartition(n)`) is not compatible with
-Spark's implementation and is disabled by default. It can be enabled by setting
-`spark.comet.native.shuffle.partitioning.roundrobin.enabled=true`.
-
-**Why the incompatibility exists:**
-
-Spark's round-robin partitioning sorts rows by their binary `UnsafeRow` representation before assigning them to
-partitions. This ensures deterministic output for fault tolerance (task retries produce identical results).
-Comet uses Arrow format internally, which has a completely different binary layout than `UnsafeRow`, making it
-impossible to match Spark's exact partition assignments.
-
-**Comet's approach:**
-
-Instead of true round-robin assignment, Comet implements round-robin as hash partitioning on ALL columns. This
-achieves the same semantic goals:
-
-- **Even distribution**: Rows are distributed evenly across partitions (as long as the hash varies sufficiently -
-  in some cases there could be skew)
-- **Deterministic**: Same input always produces the same partition assignments (important for fault tolerance)
-- **No semantic grouping**: Unlike hash partitioning on specific columns, this doesn't group related rows together
-
-The only difference is that Comet's partition assignments will differ from Spark's. When results are sorted,
-they will be identical to Spark. Unsorted results may have different row ordering.
-
-## Cast
-
-Cast operations in Comet fall into three levels of support:
-
-- **C (Compatible)**: The results match Apache Spark
-- **I (Incompatible)**: The results may match Apache Spark for some inputs, but there are known issues where some inputs
-  will result in incorrect results or exceptions. The query stage will fall back to Spark by default. Setting
-  `spark.comet.expression.Cast.allowIncompatible=true` will allow all incompatible casts to run natively in Comet, but this is not
-  recommended for production use.
-- **U (Unsupported)**: Comet does not provide a native version of this cast expression and the query stage will fall back to
-  Spark.
-- **N/A**: Spark does not support this cast.
-
-### String to Decimal
-
-Comet's native `CAST(string AS DECIMAL)` implementation matches Apache Spark's behavior,
-including:
-
-- Leading and trailing ASCII whitespace is trimmed before parsing.
-- Null bytes (`\u0000`) at the start or end of a string are trimmed, matching Spark's
-  `UTF8String` behavior. Null bytes embedded in the middle of a string produce `NULL`.
-- Fullwidth Unicode digits (U+FF10–U+FF19, e.g. `123.45`) are treated as their ASCII
-  equivalents, so `CAST('123.45' AS DECIMAL(10,2))` returns `123.45`.
-- Scientific notation (e.g. `1.23E+5`) is supported.
-- Special values (`inf`, `infinity`, `nan`) produce `NULL`.
-
-### String to Timestamp
-
-Comet's native `CAST(string AS TIMESTAMP)` implementation supports all timestamp formats accepted
-by Apache Spark, including ISO 8601 date-time strings, date-only strings, time-only strings
-(`HH:MM:SS`), embedded timezone offsets (e.g. `+07:30`, `GMT-01:00`, `UTC`), named timezone
-suffixes (e.g. `Europe/Moscow`), and the full Spark timestamp year range
-(-290308 to 294247). Note that `CAST(string AS DATE)` is only compatible for years between -262143 BC and 262142 AD due to an underlying library limitation.
-
-### Decimal with Negative Scale to String
-
-Casting a `DecimalType` with a negative scale to `StringType` is marked as incompatible when
-`spark.sql.legacy.allowNegativeScaleOfDecimal` is `false` (the default). When that config is
-disabled, Spark cannot create negative-scale decimals, so Comet falls back to avoid running
-native execution on unexpected inputs.
-
-When `spark.sql.legacy.allowNegativeScaleOfDecimal=true`, the cast is compatible. Comet matches
-Spark's behavior of using Java `BigDecimal.toString()` semantics, which produces scientific
-notation (e.g. a value of 12300 stored as `Decimal(7,-2)` with unscaled value 123 is rendered
-as `"1.23E+4"`).
-
-### Legacy Mode
-
-
-
-### Try Mode
-
-
-
-### ANSI Mode
-
-
-
-See the [tracking issue](https://github.com/apache/datafusion-comet/issues/286) for more details.
diff --git a/docs/source/user-guide/latest/compatibility/expressions/aggregate.md b/docs/source/user-guide/latest/compatibility/expressions/aggregate.md
new file mode 100644
index 0000000000..f017d899d2
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/expressions/aggregate.md
@@ -0,0 +1,34 @@
+
+
+# Aggregate Expressions
+
+## Incompatible Aggregates
+
+- **CollectSet**: Comet deduplicates NaN values (treats `NaN == NaN`) while Spark treats each NaN as a distinct value.
+  When `spark.comet.exec.strictFloatingPoint=true`, `collect_set` on floating-point types falls back to Spark unless
+  `spark.comet.expression.CollectSet.allowIncompatible=true` is set.
+
+## ANSI Mode
+
+Comet will fall back to Spark for the following aggregate expressions when ANSI mode is enabled. These can be enabled by setting `spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See the [Comet Supported Expressions Guide](../../expressions.md) for more information on this configuration setting.
+
+- Average (supports all numeric inputs except decimal types)
+
+There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support.
diff --git a/docs/source/user-guide/latest/compatibility/expressions/array.md b/docs/source/user-guide/latest/compatibility/expressions/array.md
new file mode 100644
index 0000000000..71e911288c
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/expressions/array.md
@@ -0,0 +1,22 @@
+
+
+# Array Expressions
+
+- **SortArray**: Nested arrays with `Struct` or `Null` child values are not supported natively and will fall back to Spark.
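+
+A quick way to check this locally is to `EXPLAIN` a `sort_array` call over an array of structs and confirm that the
+projection stays on the Spark side of the plan. This is an illustrative sketch only; the literal values are made up,
+and `...` stands for whatever other Comet session settings you already use (as in the examples elsewhere in this guide).
+
+```shell
+$SPARK_HOME/bin/spark-sql \
+...
+-e "EXPLAIN SELECT sort_array(array(named_struct('a', 2), named_struct('a', 1)))"
+```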
diff --git a/docs/source/user-guide/latest/compatibility/expressions/cast.md b/docs/source/user-guide/latest/compatibility/expressions/cast.md
new file mode 100644
index 0000000000..76457db8c9
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/expressions/cast.md
@@ -0,0 +1,88 @@
+
+
+# Cast
+
+Cast operations in Comet fall into three levels of support:
+
+- **C (Compatible)**: The results match Apache Spark
+- **I (Incompatible)**: The results may match Apache Spark for some inputs, but there are known issues where some inputs
+  will result in incorrect results or exceptions. The query stage will fall back to Spark by default. Setting
+  `spark.comet.expression.Cast.allowIncompatible=true` will allow all incompatible casts to run natively in Comet, but this is not
+  recommended for production use.
+- **U (Unsupported)**: Comet does not provide a native version of this cast expression and the query stage will fall back to
+  Spark.
+- **N/A**: Spark does not support this cast.
+
+## ANSI Mode Fallback
+
+Cast will fall back to Spark in some cases when ANSI mode is enabled. This can be enabled by setting `spark.comet.expression.Cast.allowIncompatible=true`. See the [Comet Supported Expressions Guide](../../expressions.md) for more information on this configuration setting.
+
+There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support.
+
+## String to Decimal
+
+Comet's native `CAST(string AS DECIMAL)` implementation matches Apache Spark's behavior,
+including:
+
+- Leading and trailing ASCII whitespace is trimmed before parsing.
+- Null bytes (`\u0000`) at the start or end of a string are trimmed, matching Spark's
+  `UTF8String` behavior. Null bytes embedded in the middle of a string produce `NULL`.
+- Fullwidth Unicode digits (U+FF10–U+FF19, e.g. `１２３.４５`) are treated as their ASCII
+  equivalents, so `CAST('１２３.４５' AS DECIMAL(10,2))` returns `123.45`.
+- Scientific notation (e.g. `1.23E+5`) is supported.
+- Special values (`inf`, `infinity`, `nan`) produce `NULL`.
+
+## String to Timestamp
+
+Comet's native `CAST(string AS TIMESTAMP)` implementation supports all timestamp formats accepted
+by Apache Spark, including ISO 8601 date-time strings, date-only strings, time-only strings
+(`HH:MM:SS`), embedded timezone offsets (e.g. `+07:30`, `GMT-01:00`, `UTC`), named timezone
+suffixes (e.g. `Europe/Moscow`), and the full Spark timestamp year range
+(-290308 to 294247). Note that `CAST(string AS DATE)` is only compatible for years between -262143 BC and 262142 AD due to an underlying library limitation.
+
+## Decimal with Negative Scale to String
+
+Casting a `DecimalType` with a negative scale to `StringType` is marked as incompatible when
+`spark.sql.legacy.allowNegativeScaleOfDecimal` is `false` (the default). When that config is
+disabled, Spark cannot create negative-scale decimals, so Comet falls back to avoid running
+native execution on unexpected inputs.
+
+When `spark.sql.legacy.allowNegativeScaleOfDecimal=true`, the cast is compatible. Comet matches
+Spark's behavior of using Java `BigDecimal.toString()` semantics, which produces scientific
+notation (e.g. a value of 12300 stored as `Decimal(7,-2)` with unscaled value 123 is rendered
+as `"1.23E+4"`).
+
+## Legacy Mode
+
+
+
+## Try Mode
+
+
+
+## ANSI Mode
+
+
+
+See the [tracking issue](https://github.com/apache/datafusion-comet/issues/286) for more details.
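+
+The string-cast behaviors above can be exercised directly from `spark-sql`. This is a minimal sketch; the input
+values are illustrative, and `...` stands for the rest of your Comet session configuration:
+
+```shell
+$SPARK_HOME/bin/spark-sql \
+...
+-e "SELECT CAST(' 1.23E+5 ' AS DECIMAL(10,2)), CAST('2020-01-01T12:00:00+07:30' AS TIMESTAMP)"
+```
+
+Casts marked incompatible above fall back to Spark unless `spark.comet.expression.Cast.allowIncompatible=true` is set.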
diff --git a/docs/source/user-guide/latest/compatibility/expressions/datetime.md b/docs/source/user-guide/latest/compatibility/expressions/datetime.md
new file mode 100644
index 0000000000..b18e6f723e
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/expressions/datetime.md
@@ -0,0 +1,27 @@
+
+
+# Date/Time Expressions
+
+- **Hour, Minute, Second**: Incorrectly apply timezone conversion to TimestampNTZ inputs. TimestampNTZ stores local
+  time without timezone, so no conversion should be applied. These expressions work correctly with Timestamp inputs.
+  [#3180](https://github.com/apache/datafusion-comet/issues/3180)
+- **TruncTimestamp (date_trunc)**: Produces incorrect results when used with non-UTC timezones. Compatible when
+  timezone is UTC.
+  [#2649](https://github.com/apache/datafusion-comet/issues/2649)
diff --git a/docs/source/user-guide/latest/compatibility/expressions/index.md b/docs/source/user-guide/latest/compatibility/expressions/index.md
new file mode 100644
index 0000000000..3084a3930b
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/expressions/index.md
@@ -0,0 +1,36 @@
+
+
+# Expression Compatibility
+
+Expressions that are not 100% Spark-compatible will fall back to Spark by default and can be enabled by setting
+`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See
+the [Comet Supported Expressions Guide](../../expressions.md) for more information on this configuration setting.
+
+Compatibility notes are grouped by expression category:
+
+```{toctree}
+:maxdepth: 1
+
+aggregate
+array
+datetime
+struct
+cast
+```
diff --git a/docs/source/user-guide/latest/compatibility/expressions/struct.md b/docs/source/user-guide/latest/compatibility/expressions/struct.md
new file mode 100644
index 0000000000..2a207894cf
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/expressions/struct.md
@@ -0,0 +1,23 @@
+
+
+# Struct Expressions
+
+- **StructsToJson (to_json)**: Does not support `+Infinity` and `-Infinity` for numeric types (float, double).
+  [#3016](https://github.com/apache/datafusion-comet/issues/3016)
diff --git a/docs/source/user-guide/latest/compatibility/floating-point.md b/docs/source/user-guide/latest/compatibility/floating-point.md
new file mode 100644
index 0000000000..ffab055060
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/floating-point.md
@@ -0,0 +1,29 @@
+
+
+# Floating-point Number Comparison
+
+Spark normalizes NaN and zero for floating-point numbers in several cases. See the `NormalizeFloatingNumbers` optimization rule in Spark.
+However, one exception is comparison. Spark does not normalize NaN and zero when comparing values
+because they are handled well in Spark (e.g., `SQLOrderingUtil.compareFloats`). But the comparison
+functions of arrow-rs used by DataFusion do not normalize NaN and zero (e.g., [arrow::compute::kernels::cmp::eq](https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.eq.html#)).
+So Comet adds an additional normalization expression for NaN and zero in comparisons, but results may still differ
+from Spark in some cases, especially when the data contains both positive and negative zero. This is likely an edge
+case that is not of concern for many users. If it is a concern, setting `spark.comet.exec.strictFloatingPoint=true`
+will make relevant operations fall back to Spark.
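+
+A minimal sketch of opting in to the stricter behavior described above (`...` stands for the rest of your Comet
+session configuration):
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+...
+--conf spark.comet.exec.strictFloatingPoint=true
+...
+```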
diff --git a/docs/source/user-guide/latest/compatibility/index.md b/docs/source/user-guide/latest/compatibility/index.md
new file mode 100644
index 0000000000..e178472dc2
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/index.md
@@ -0,0 +1,40 @@
+
+
+# Compatibility Guide
+
+Comet aims to provide consistent results with the version of Apache Spark that is being used.
+
+This guide documents areas where Comet's behavior is known to differ from Spark. Topics are grouped by subsystem:
+
+- **Parquet**: limitations when reading Parquet files (both scan implementations, shared and per-implementation).
+- **Floating-point comparison**: NaN and signed-zero handling in comparisons.
+- **Regular expressions**: differences between the Rust `regex` crate and Java's regex engine.
+- **Operators**: operator-level compatibility notes, including window functions and round-robin partitioning.
+- **Expressions**: per-expression compatibility notes, including cast.
+
+```{toctree}
+:maxdepth: 1
+
+scans
+floating-point
+regex
+operators
+expressions/index
+```
diff --git a/docs/source/user-guide/latest/compatibility/operators.md b/docs/source/user-guide/latest/compatibility/operators.md
new file mode 100644
index 0000000000..a25d1d1db9
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/operators.md
@@ -0,0 +1,51 @@
+
+
+# Operator Compatibility
+
+## Window Functions
+
+Comet's support for window functions is incomplete and known to be incorrect. It is disabled by default and
+should not be used in production. The feature will be enabled in a future release. Tracking issue: [#2721](https://github.com/apache/datafusion-comet/issues/2721).
+
+## Round-Robin Partitioning
+
+Comet's native shuffle implementation of round-robin partitioning (`df.repartition(n)`) is not compatible with
+Spark's implementation and is disabled by default. It can be enabled by setting
+`spark.comet.native.shuffle.partitioning.roundrobin.enabled=true`.
+
+**Why the incompatibility exists:**
+
+Spark's round-robin partitioning sorts rows by their binary `UnsafeRow` representation before assigning them to
+partitions. This ensures deterministic output for fault tolerance (task retries produce identical results).
+Comet uses Arrow format internally, which has a completely different binary layout than `UnsafeRow`, making it
+impossible to match Spark's exact partition assignments.
+
+**Comet's approach:**
+
+Instead of true round-robin assignment, Comet implements round-robin as hash partitioning on ALL columns. This
+achieves the same semantic goals:
+
+- **Even distribution**: Rows are distributed evenly across partitions (as long as the hash varies sufficiently -
+  in some cases there could be skew)
+- **Deterministic**: Same input always produces the same partition assignments (important for fault tolerance)
+- **No semantic grouping**: Unlike hash partitioning on specific columns, this doesn't group related rows together
+
+The only difference is that Comet's partition assignments will differ from Spark's. When results are sorted,
+they will be identical to Spark. Unsorted results may have different row ordering.
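+
+A minimal sketch of opting in to Comet's round-robin behavior described above (`...` stands for the rest of your
+Comet session configuration):
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+...
+--conf spark.comet.native.shuffle.partitioning.roundrobin.enabled=true
+...
+```
+
+With this setting enabled, `df.repartition(n)` is planned as hash partitioning over all columns, as described above.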
diff --git a/docs/source/user-guide/latest/compatibility/regex.md b/docs/source/user-guide/latest/compatibility/regex.md
new file mode 100644
index 0000000000..4d9d5b650c
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/regex.md
@@ -0,0 +1,24 @@
+
+
+# Regular Expressions
+
+Comet uses the Rust `regex` crate for evaluating regular expressions, and this has different behavior from Java's
+regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but
+this can be overridden by setting `spark.comet.expression.regexp.allowIncompatible=true`.
diff --git a/docs/source/user-guide/latest/compatibility/scans.md b/docs/source/user-guide/latest/compatibility/scans.md
new file mode 100644
index 0000000000..27ed20c19e
--- /dev/null
+++ b/docs/source/user-guide/latest/compatibility/scans.md
@@ -0,0 +1,84 @@
+
+
+# Parquet Compatibility
+
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+| Scan Implementation     | Notes                  |
+| ----------------------- | ---------------------- |
+| `native_datafusion`     | Fully native scan      |
+| `native_iceberg_compat` | Hybrid JVM/native scan |
+
+The configuration property `spark.comet.scan.impl` is used to select an implementation. The default setting is
+`spark.comet.scan.impl=auto`, which attempts to use `native_datafusion` first, and falls back to Spark if the scan
+cannot be converted (e.g., due to unsupported features). Most users should not need to change this setting. However,
+it is possible to force Comet to use a particular implementation for all scan operations by setting this
+configuration property to one of the following implementations. For example: `--conf spark.comet.scan.impl=native_datafusion`.
+
+## Shared Limitations
+
+The following features are not supported by either scan implementation, and Comet will fall back to Spark in these scenarios:
+
+- Decimals encoded in binary format.
+- `ShortType` columns, by default. When reading Parquet files written by systems other than Spark that contain
+  columns with the logical type `UINT_8` (unsigned 8-bit integers), Comet may produce different results than Spark.
+  Spark maps `UINT_8` to `ShortType`, but Comet's Arrow-based readers respect the unsigned type and read the data as
+  unsigned rather than signed. Since Comet cannot distinguish `ShortType` columns that came from `UINT_8` versus
+  signed `INT16`, by default Comet falls back to Spark when scanning Parquet files containing `ShortType` columns.
+  This behavior can be disabled by setting `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType`
+  columns are always safe because they can only come from signed `INT8`, where truncation preserves the signed value.
+- Default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
+- Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the
+  V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API.
+- Spark metadata columns (e.g., `_metadata.file_path`)
+- No support for AQE Dynamic Partition Pruning (DPP). Non-AQE DPP is supported.
+
+The following shared limitation may produce incorrect results without falling back to Spark:
+
+- No support for datetime rebasing. When reading Parquet files containing dates or timestamps written before
+  Spark 3.0 (which used a hybrid Julian/Gregorian calendar), dates/timestamps will be read as if they were
+  written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
+  October 15, 1582.
+
+## `native_datafusion` Limitations
+
+The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
+cause Comet to fall back to Spark (including when using `auto` mode). Note that the `native_datafusion` scan
+requires `spark.comet.exec.enabled=true` because the scan node must be wrapped by `CometExecRule`.
+
+- No support for row indexes
+- No support for reading Parquet field IDs
+- No support for `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions.
+  The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values.
+- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true`
+- Duplicate field names in case-insensitive mode (e.g., a Parquet file with both `B` and `b` columns)
+  are detected at read time and raise a `SparkRuntimeException` with error class `_LEGACY_ERROR_TEMP_2093`,
+  matching Spark's behavior.
+
+## `native_iceberg_compat` Limitations
+
+The `native_iceberg_compat` scan has the following additional limitation that may produce incorrect results
+without falling back to Spark:
+
+- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values.
+  This may produce incorrect results when non-default values are set. The affected configurations are
+  `spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`,
+  `spark.sql.parquet.inferTimestampNTZ.enabled`, and `spark.sql.legacy.parquet.nanosAsLong`. See
+  [issue #1816](https://github.com/apache/datafusion-comet/issues/1816) for more details.
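+
+A minimal sketch of forcing a particular scan implementation instead of `auto` (`...` stands for the rest of your
+Comet session configuration; most users can keep the default):
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+...
+--conf spark.comet.exec.enabled=true \
+--conf spark.comet.scan.impl=native_datafusion
+...
+```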
diff --git a/docs/source/user-guide/latest/datasources.md b/docs/source/user-guide/latest/datasources.md
index daae6955dd..b79831d804 100644
--- a/docs/source/user-guide/latest/datasources.md
+++ b/docs/source/user-guide/latest/datasources.md
@@ -174,6 +174,14 @@ Or use `spark-shell` with HDFS support as described [above](#using-experimental-
 
 ## S3
 
+The `native_datafusion` and `native_iceberg_compat` Parquet scan implementations completely offload data loading
+to native code. They use the [`object_store` crate](https://crates.io/crates/object_store) to read data from S3 and
+support configuring S3 access using standard [Hadoop S3A configurations](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration) by translating them to
+the `object_store` crate's format.
+
+This implementation maintains compatibility with existing Hadoop S3A configurations, so existing code will
+continue to work as long as the configurations are supported and can be translated without loss of functionality.
+
 #### Root CA Certificates
 
 One major difference between Spark and Comet is the mechanism for discovering Root
@@ -199,3 +207,58 @@ AWS credential providers can be configured using the `fs.s3a.aws.credentials.pro
 | `com.amazonaws.auth.WebIdentityTokenCredentialsProvider`<br/>`software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider` | Authenticate using web identity token file | None |
 
 Multiple credential providers can be specified in a comma-separated list using the `fs.s3a.aws.credentials.provider` configuration, just as Hadoop AWS supports. If `fs.s3a.aws.credentials.provider` is not configured, Hadoop S3A's default credential provider chain will be used. All configuration options also support bucket-specific overrides using the pattern `fs.s3a.bucket.{bucket-name}.{option}`.
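+
+For example, the bucket-specific override pattern can scope credentials to a single bucket (a sketch only; the bucket
+name and key values are placeholders, and `...` stands for the rest of your Comet session configuration):
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+...
+--conf spark.hadoop.fs.s3a.bucket.my-bucket.access.key=my-access-key \
+--conf spark.hadoop.fs.s3a.bucket.my-bucket.secret.key=my-secret-key
+...
+```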
+
+#### Additional S3 Configuration Options
+
+Beyond credential providers, the `native_datafusion` and `native_iceberg_compat` implementations support additional
+S3 configuration options:
+
+| Option                          | Description                                                                                          |
+| ------------------------------- | ---------------------------------------------------------------------------------------------------- |
+| `fs.s3a.endpoint`               | The endpoint of the S3 service                                                                       |
+| `fs.s3a.endpoint.region`        | The AWS region for the S3 service. If not specified, the region will be auto-detected.               |
+| `fs.s3a.path.style.access`      | Whether to use path style access for the S3 service (true/false, defaults to virtual hosted style)   |
+| `fs.s3a.requester.pays.enabled` | Whether to enable requester pays for S3 requests (true/false)                                        |
+
+All configuration options support bucket-specific overrides using the pattern `fs.s3a.bucket.{bucket-name}.{option}`.
+
+#### Examples
+
+The following examples demonstrate how to configure S3 access with the `native_datafusion` and `native_iceberg_compat`
+Parquet scan implementations using different authentication methods.
+
+**Example 1: Simple Credentials**
+
+This example shows how to access a private S3 bucket using an access key and secret key. The `fs.s3a.aws.credentials.provider` configuration can be omitted since `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` is included in Hadoop S3A's default credential provider chain.
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+...
+--conf spark.comet.scan.impl=native_datafusion \
+--conf spark.hadoop.fs.s3a.access.key=my-access-key \
+--conf spark.hadoop.fs.s3a.secret.key=my-secret-key
+...
+```
+
+**Example 2: Assume Role with Web Identity Token**
+
+This example demonstrates using an assumed role credential to access a private S3 bucket, where the base credential for assuming the role is provided by a web identity token credentials provider.
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+...
+--conf spark.comet.scan.impl=native_datafusion \
+--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider \
+--conf spark.hadoop.fs.s3a.assumed.role.arn=arn:aws:iam::123456789012:role/my-role \
+--conf spark.hadoop.fs.s3a.assumed.role.session.name=my-session \
+--conf spark.hadoop.fs.s3a.assumed.role.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider
+...
+```
+
+#### Limitations
+
+The S3 support of `native_datafusion` and `native_iceberg_compat` has the following limitations:
+
+1. **Partial Hadoop S3A configuration support**: Not all Hadoop S3A configurations are currently supported. Only the configurations listed in the tables above are translated and applied to the underlying `object_store` crate.
+
+2. **Custom credential providers**: Custom implementations of AWS credential providers are not supported. The implementation only supports the standard credential providers listed in the table above. We are planning to add support for custom credential providers through a JNI-based adapter that will allow calling Java credential providers from native code. See [issue #1829](https://github.com/apache/datafusion-comet/issues/1829) for more details.
diff --git a/docs/source/user-guide/latest/datatypes.md b/docs/source/user-guide/latest/datatypes.md
index 6d79393485..7bc4d66168 100644
--- a/docs/source/user-guide/latest/datatypes.md
+++ b/docs/source/user-guide/latest/datatypes.md
@@ -22,7 +22,7 @@ Comet supports the following Spark data types.
 
 Refer to the [Comet Compatibility Guide] for information about data type support in scans and other
 operators.
 
-[Comet Compatibility Guide]: compatibility.md
+[Comet Compatibility Guide]: compatibility/index.md
 
diff --git a/docs/source/user-guide/latest/expressions.md b/docs/source/user-guide/latest/expressions.md
index dacd916454..a90070892f 100644
--- a/docs/source/user-guide/latest/expressions.md
+++ b/docs/source/user-guide/latest/expressions.md
@@ -316,4 +316,4 @@ Comet supports using the following aggregate functions within window contexts wi
 | UnscaledValue | Yes |  |
 
 [Comet Configuration Guide]: configs.md
-[Comet Compatibility Guide]: compatibility.md
+[Comet Compatibility Guide]: compatibility/expressions/index.md
diff --git a/docs/source/user-guide/latest/index.rst b/docs/source/user-guide/latest/index.rst
index d5df4d014a..95806f1dd3 100644
--- a/docs/source/user-guide/latest/index.rst
+++ b/docs/source/user-guide/latest/index.rst
@@ -34,7 +34,7 @@ Comet $COMET_VERSION User Guide
    Supported Operators
    Supported Expressions
    Configuration Settings
-   Compatibility Guide <compatibility>
+   Compatibility Guide <compatibility/index>
    Tuning Guide
    Metrics Guide
    Iceberg Guide
diff --git a/docs/source/user-guide/latest/operators.md b/docs/source/user-guide/latest/operators.md
index 77ba84e4f7..1b8f78d9c6 100644
--- a/docs/source/user-guide/latest/operators.md
+++ b/docs/source/user-guide/latest/operators.md
@@ -45,4 +45,4 @@ not supported by Comet will fall back to regular Spark execution.
 | UnionExec  | Yes |  |
 | WindowExec | No  | Disabled by default due to known correctness issues. |
 
-[Comet Compatibility Guide]: compatibility.md
+[Comet Compatibility Guide]: compatibility/operators.md
diff --git a/spark/src/main/scala/org/apache/comet/GenerateDocs.scala b/spark/src/main/scala/org/apache/comet/GenerateDocs.scala
index 18cda2037e..ce3b42e78d 100644
--- a/spark/src/main/scala/org/apache/comet/GenerateDocs.scala
+++ b/spark/src/main/scala/org/apache/comet/GenerateDocs.scala
@@ -42,7 +42,7 @@ object GenerateDocs {
   def main(args: Array[String]): Unit = {
     val userGuideLocation = args(0)
     generateConfigReference(s"$userGuideLocation/configs.md")
-    generateCompatibilityGuide(s"$userGuideLocation/compatibility.md")
+    generateCompatibilityGuide(s"$userGuideLocation/compatibility/expressions/cast.md")
   }
 
   private def generateConfigReference(filename: String): Unit = {