
Commit b8d8fbe

andygrove, claude, mbutrovich, and comphead authored
docs: Update Parquet scan documentation (#3433)
* docs: remove all mentions of native_comet scan
* update
* prettier
* docs: improve parquet_scans.md accuracy and completeness
  Fix grammar, add encryption fallback and native_iceberg_compat hard-coded config limitations, clarify that the S3
  section applies to both scan implementations, and remove orphaned link references.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* update config docs
* prettier
* docs: clarify parquet scan limitations and fallback behavior
  Clarify which limitations fall back to Spark vs which may produce incorrect results. Add missing documented
  limitations for native_datafusion (DPP, input_file_name, metadata columns). Fix misleading wording for
  ignoreCorruptFiles/ignoreMissingFiles. Note that auto mode currently always selects native_iceberg_compat.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: remove redundant fallback language in native_datafusion section
  The section intro already states all limitations fall back to Spark, so individual bullet points don't need to
  repeat it.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: separate fallback limitations from incorrect-results limitations
  Restructure shared and per-scan limitation lists into two clear categories: features that fall back to Spark (safe)
  and issues that may produce incorrect results without falling back. Remove redundant "Comet falls back to Spark"
  from individual bullets where the section intro already states it.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix
* update
* remove encryption from unsupported list, move DPP to common list
* Update docs/source/contributor-guide/parquet_scans.md
  Co-authored-by: Oleks V <comphead@users.noreply.github.com>
* address feedback
* address feedback

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Matt Butrovich <mbutrovich@users.noreply.github.com>
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
1 parent e77c9bd · commit b8d8fbe

4 files changed

Lines changed: 64 additions & 91 deletions


common/src/main/scala/org/apache/comet/CometConf.scala

Lines changed: 6 additions & 8 deletions
```diff
@@ -125,16 +125,14 @@ object CometConf extends ShimCometConf {
   val SCAN_AUTO = "auto"
 
   val COMET_NATIVE_SCAN_IMPL: ConfigEntry[String] = conf("spark.comet.scan.impl")
-    .category(CATEGORY_SCAN)
+    .category(CATEGORY_PARQUET)
     .doc(
-      "The implementation of Comet Native Scan to use. Available modes are " +
+      "The implementation of Comet's Parquet scan to use. Available scans are " +
       s"`$SCAN_NATIVE_DATAFUSION`, and `$SCAN_NATIVE_ICEBERG_COMPAT`. " +
-      s"`$SCAN_NATIVE_DATAFUSION` is a fully native implementation of scan based on " +
-      "DataFusion. " +
-      s"`$SCAN_NATIVE_ICEBERG_COMPAT` is the recommended native implementation that " +
-      "exposes apis to read parquet columns natively and supports complex types. " +
-      s"`$SCAN_AUTO` (default) chooses the best scan.")
-    .internal()
+      s"`$SCAN_NATIVE_DATAFUSION` is a fully native implementation, and " +
+      s"`$SCAN_NATIVE_ICEBERG_COMPAT` is a hybrid implementation that supports some " +
+      "additional features, such as row indexes and field ids. " +
+      s"`$SCAN_AUTO` (default) chooses the best available scan based on the scan schema.")
     .stringConf
     .transform(_.toLowerCase(Locale.ROOT))
     .checkValues(Set(SCAN_NATIVE_DATAFUSION, SCAN_NATIVE_ICEBERG_COMPAT, SCAN_AUTO))
```
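To make the updated `spark.comet.scan.impl` description concrete, here is a minimal Scala sketch of selecting a scan implementation from application code. It assumes Comet is already installed and enabled; the application name and input path are placeholders, not part of this commit.

```scala
import org.apache.spark.sql.SparkSession

object ScanImplExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("comet-scan-impl-example") // hypothetical app name
      // "auto" (the default) lets Comet pick the best available scan based on the scan schema;
      // "native_datafusion" or "native_iceberg_compat" force a specific implementation.
      .config("spark.comet.scan.impl", "native_datafusion")
      .getOrCreate()

    // Any Parquet read after this point uses the selected scan implementation,
    // falling back to Spark where the documented limitations apply.
    spark.read.parquet("/tmp/example.parquet").show() // hypothetical path
    spark.stop()
  }
}
```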

docs/source/contributor-guide/ffi.md

Lines changed: 4 additions & 3 deletions
```diff
@@ -177,9 +177,10 @@ message Scan {
 
 #### When ownership is NOT transferred to native:
 
-If the data originates from `native_comet` scan (deprecated, will be removed in a future release) or from
-`native_iceberg_compat` in some cases, then ownership is not transferred to native and the JVM may re-use the
-underlying buffers in the future.
+If the data originates from a scan that uses mutable buffers (such as Iceberg scans using the [hybrid Iceberg reader]),
+then ownership is not transferred to native and the JVM may re-use the underlying buffers in the future.
+
+[hybrid Iceberg reader]: https://datafusion.apache.org/comet/user-guide/latest/iceberg.html#hybrid-reader
 
 It is critical that the native code performs a deep copy of the arrays if the arrays are to be buffered by
 operators such as `SortExec` or `ShuffleWriterExec`, otherwise data corruption is likely to occur.
```

docs/source/contributor-guide/parquet_scans.md

Lines changed: 54 additions & 66 deletions
```diff
@@ -19,71 +19,60 @@ under the License.
 
 # Comet Parquet Scan Implementations
 
-Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use
-a particular implementation for all scan operations by setting this configuration property to one of the following
-implementations.
-
-| Implementation          | Description                                                                                                                                                                                                   |
-| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `native_comet`          | **Deprecated.** This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future.                         |
-| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan.                                            |
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat` use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The `native_iceberg_compat` implementation uses DataFusion's native Parquet reader for data columns, bypassing Comet's mutable buffer infrastructure entirely. However, partition columns still use `ConstantColumnReader`, which relies on Comet's mutable buffers that are reused across batches. This means native operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following limitations:
-
-- When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8`
-  (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but
-  Comet's Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet
-  cannot distinguish `ShortType` columns that came from `UINT_8` versus signed `INT16`, by default Comet falls back to
-  Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting
-  `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` columns are always safe because they can
-  only come from signed `INT8`, where truncation preserves the signed value.
-- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
-- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading
-  Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar),
-  the `native_comet` implementation can detect these legacy values and either throw an exception or read them without
-  rebasing. The DataFusion-based implementations do not have this detection capability and will read all dates/timestamps
-  as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
-  October 15, 1582.
-- No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`,
-  Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet
-  will fall back to `native_comet` when V2 is enabled.
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+| Scan Implementation     | Notes                  |
+| ----------------------- | ---------------------- |
+| `native_datafusion`     | Fully native scan      |
+| `native_iceberg_compat` | Hybrid JVM/native scan |
+
+The configuration property
+`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which
+currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting.
+However, it is possible to force Comet to use a particular implementation for all scan operations by setting
+this configuration property to one of the following implementations. For example: `--conf spark.comet.scan.impl=native_datafusion`.
+
+The following features are not supported by either scan implementation, and Comet will fall back to Spark in these scenarios:
+
+- `ShortType` columns, by default. When reading Parquet files written by systems other than Spark that contain
+  columns with the logical type `UINT_8` (unsigned 8-bit integers), Comet may produce different results than Spark.
+  Spark maps `UINT_8` to `ShortType`, but Comet's Arrow-based readers respect the unsigned type and read the data as
+  unsigned rather than signed. Since Comet cannot distinguish `ShortType` columns that came from `UINT_8` versus
+  signed `INT16`, by default Comet falls back to Spark when scanning Parquet files containing `ShortType` columns.
+  This behavior can be disabled by setting `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType`
+  columns are always safe because they can only come from signed `INT8`, where truncation preserves the signed value.
+- Default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
+- Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the
+  V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API.
+- Spark metadata columns (e.g., `_metadata.file_path`)
+- No support for Dynamic Partition Pruning (DPP)
+
+The following shared limitation may produce incorrect results without falling back to Spark:
+
+- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When
+  reading Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid
+  Julian/Gregorian calendar), dates/timestamps will be read as if they were written using the Proleptic Gregorian
+  calendar. This may produce incorrect results for dates before October 15, 1582.
+
+The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
+cause Comet to fall back to Spark.
 
 - No support for row indexes
-- `PARQUET_FIELD_ID_READ_ENABLED` is not respected [#1758]
-- There are failures in the Spark SQL test suite [#1545]
-- Setting Spark configs `ignoreMissingFiles` or `ignoreCorruptFiles` to `true` is not compatible with Spark
+- No support for reading Parquet field IDs
+- No support for `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions.
+  The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values.
+- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true`
 
-## S3 Support
-
-There are some differences in S3 support between the scan implementations.
-
-### `native_comet` (Deprecated)
+The `native_iceberg_compat` scan has the following additional limitation that may produce incorrect results
+without falling back to Spark:
 
-> **Note:** The `native_comet` scan implementation is deprecated and will be removed in a future release.
+- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values.
+  This may produce incorrect results when non-default values are set. The affected configurations are
+  `spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`,
+  `spark.sql.parquet.inferTimestampNTZ.enabled`, and `spark.sql.legacy.parquet.nanosAsLong`. See
+  [issue #1816](https://github.com/apache/datafusion-comet/issues/1816) for more details.
 
-The `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
-is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A
-configurations works the same way as in vanilla Spark.
-
-### `native_datafusion` and `native_iceberg_compat`
+## S3 Support
 
 The `native_datafusion` and `native_iceberg_compat` Parquet scan implementations completely offload data loading
 to native code. They use the [`object_store` crate](https://crates.io/crates/object_store) to read data from S3 and
```
9685
#### Additional S3 Configuration Options
9786

98-
Beyond credential providers, the `native_datafusion` implementation supports additional S3 configuration options:
87+
Beyond credential providers, the `native_datafusion` and `native_iceberg_compat` implementations support additional
88+
S3 configuration options:
9989

10090
| Option | Description |
10191
| ------------------------------- | -------------------------------------------------------------------------------------------------- |
@@ -108,7 +98,8 @@ All configuration options support bucket-specific overrides using the pattern `f
10898

10999
#### Examples
110100

111-
The following examples demonstrate how to configure S3 access with the `native_datafusion` Parquet scan implementation using different authentication methods.
101+
The following examples demonstrate how to configure S3 access with the `native_datafusion` and `native_iceberg_compat`
102+
Parquet scan implementations using different authentication methods.
112103

113104
**Example 1: Simple Credentials**
114105

@@ -140,11 +131,8 @@ $SPARK_HOME/bin/spark-shell \
140131

141132
#### Limitations
142133

143-
The S3 support of `native_datafusion` has the following limitations:
134+
The S3 support of `native_datafusion` and `native_iceberg_compat` has the following limitations:
144135

145136
1. **Partial Hadoop S3A configuration support**: Not all Hadoop S3A configurations are currently supported. Only the configurations listed in the tables above are translated and applied to the underlying `object_store` crate.
146137

147138
2. **Custom credential providers**: Custom implementations of AWS credential providers are not supported. The implementation only supports the standard credential providers listed in the table above. We are planning to add support for custom credential providers through a JNI-based adapter that will allow calling Java credential providers from native code. See [issue #1829](https://github.com/apache/datafusion-comet/issues/1829) for more details.
148-
149-
[#1545]: https://github.com/apache/datafusion-comet/issues/1545
150-
[#1758]: https://github.com/apache/datafusion-comet/issues/1758

docs/source/contributor-guide/roadmap.md

Lines changed: 0 additions & 14 deletions
```diff
@@ -51,20 +51,6 @@ with benchmarks that benefit from this feature like TPC-DS. This effort can be t
 [#3349]: https://github.com/apache/datafusion-comet/pull/3349
 [#3510]: https://github.com/apache/datafusion-comet/issues/3510
 
-### Removing the native_comet scan implementation
-
-The `native_comet` scan implementation is now deprecated and will be removed in a future release ([#2186], [#2177]).
-This is the original scan implementation that uses mutable buffers (which is incompatible with best practices around
-Arrow FFI) and does not support complex types.
-
-Now that the default `auto` scan mode uses `native_iceberg_compat` (which is based on DataFusion's `DataSourceExec`),
-we can proceed with removing the `native_comet` scan implementation, and then improve the efficiency of our use of
-Arrow FFI ([#2171]).
-
-[#2186]: https://github.com/apache/datafusion-comet/issues/2186
-[#2171]: https://github.com/apache/datafusion-comet/issues/2171
-[#2177]: https://github.com/apache/datafusion-comet/issues/2177
-
 ## Ongoing Improvements
 
 In addition to the major initiatives above, we have the following ongoing areas of work:
```
