Commit 3f572eb

docs: Generate expression compatibility docs from code (#4057)
1 parent 7dd3c4d commit 3f572eb

16 files changed: 327 additions & 99 deletions

File tree

docs/source/contributor-guide/adding_a_new_expression.md

Lines changed: 63 additions & 2 deletions
````diff
@@ -70,10 +70,12 @@ object CometUnhex extends CometExpressionSerde[Unhex] {
 }
 ```
 
-The `CometExpressionSerde` trait provides three methods you can override:
+The `CometExpressionSerde` trait provides several methods you can override:
 
 - `convert(expr: T, inputs: Seq[Attribute], binding: Boolean): Option[Expr]` - **Required**. Converts the Spark expression to protobuf. Return `None` if the expression cannot be converted.
-- `getSupportLevel(expr: T): SupportLevel` - Optional. Returns the level of support for the expression. See "Using getSupportLevel" section below for details.
+- `getSupportLevel(expr: T): SupportLevel` - Optional. Returns the level of support for the expression at planning time, based on a specific expression instance. See "Using getSupportLevel" section below for details.
+- `getIncompatibleReasons(): Seq[String]` - Optional. Returns reasons why this expression may produce different results than Spark. Used to generate the Compatibility Guide. See "Documenting Incompatible and Unsupported Reasons" below.
+- `getUnsupportedReasons(): Seq[String]` - Optional. Returns reasons why this expression may not be supported by Comet (for example, unsupported data types or format strings). Used to generate the Compatibility Guide. See "Documenting Incompatible and Unsupported Reasons" below.
 - `getExprConfigName(expr: T): String` - Optional. Returns a short name for configuration keys. Defaults to the Spark class name.
 
 For simple scalar functions that map directly to a DataFusion function, you can use the built-in `CometScalarFunction` implementation:
````
````diff
@@ -208,6 +210,65 @@ When the query planner encounters an expression:
 
 Any notes provided will be logged to help with debugging and understanding why an expression was not used.
 
+#### Documenting Incompatible and Unsupported Reasons
+
+In addition to `getSupportLevel`, which governs runtime planning decisions, the serde trait exposes two static documentation methods:
+
+- `getIncompatibleReasons(): Seq[String]` - Reasons the expression may produce different results than Spark.
+- `getUnsupportedReasons(): Seq[String]` - Reasons the expression, or certain usages of it, may not be supported by Comet.
+
+These methods do not affect runtime behavior. They are called by `GenerateDocs` (`spark/src/main/scala/org/apache/comet/GenerateDocs.scala`) when building the user-facing Compatibility Guide pages under `docs/source/user-guide/latest/compatibility/expressions/` (for example, `math.md`, `datetime.md`, `array.md`, `aggregate.md`, `struct.md`). Each reason is rendered as a bullet in the corresponding page.
+
+Key differences from `getSupportLevel`:
+
+- **No expression instance.** Both methods take no arguments, so they describe the expression in general rather than a specific call site. Use `getSupportLevel` for checks that depend on data types, argument values, or other per-instance details.
+- **Markdown-friendly.** Each returned string is written to a Markdown document, so you can embed backticks, links, and line breaks. Keep each reason self-contained, since they are rendered as separate bullets.
+- **Regenerated by CI.** The lists are collected by `GenerateDocs` and published by CI on every merge to `main`. The generated Markdown is not committed to the repo, so you do not need to regenerate or commit it yourself. The reasons do not have to match the `notes` passed to `Compatible`, `Incompatible`, or `Unsupported`, but keeping them consistent avoids confusing users.
+
+##### Example: Incompatibility note
+
+```scala
+object CometStructsToJson extends CometExpressionSerde[StructsToJson] {
+
+  override def getIncompatibleReasons(): Seq[String] = Seq(
+    "Does not support `+Infinity` and `-Infinity` for numeric types (float, double)." +
+      " (https://github.com/apache/datafusion-comet/issues/3016)")
+
+  override def getSupportLevel(expr: StructsToJson): SupportLevel =
+    Incompatible(
+      Some(
+        "Does not support Infinity/-Infinity for numeric types" +
+          " (https://github.com/apache/datafusion-comet/issues/3016)"))
+
+  // ... convert ...
+}
+```
+
+##### Example: Unsupported note
+
+```scala
+object CometSortArray extends CometExpressionSerde[SortArray] {
+
+  override def getUnsupportedReasons(): Seq[String] = Seq(
+    "Nested arrays with `Struct` or `Null` child values are not supported natively and will" +
+      " fall back to Spark.")
+
+  // ... convert ...
+}
+```
+
+##### Example: Enumerating supported values
+
+When the expression only supports a known set of argument values, build the reason from that set so the documentation stays in sync with the code:
+
+```scala
+override def getUnsupportedReasons(): Seq[String] = Seq(
+  "Only the following formats are supported:" +
+    supportedFormats.keys.toSeq.sorted
+      .map(k => s"`$k`")
+      .mkString("\n - ", "\n - ", ""))
+```
+
 #### Adding Spark-side Tests for the New Expression
 
 It is important to verify that the new expression is correctly recognized by the native execution engine and matches the expected Spark behavior. The preferred way to add test coverage is to write a SQL test file using the SQL file test framework. This approach is simpler than writing Scala test code and makes it easy to cover many input combinations and edge cases.
````
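The enumerating pattern above can be exercised on its own. Below is a minimal, self-contained sketch, where `EnumerateSupportedValues` and its `supportedFormats` map are hypothetical stand-ins for whatever table the real serde maintains:

```scala
// Self-contained sketch of the "enumerating supported values" pattern.
// `supportedFormats` is a hypothetical stand-in for the serde's real table;
// only the keys matter when building the documentation string.
object EnumerateSupportedValues {
  private val supportedFormats: Map[String, String] =
    Map("yyyy" -> "year", "MM" -> "month", "dd" -> "day")

  // Builds one Markdown bullet listing every supported format, sorted so the
  // generated documentation is stable across runs.
  def getUnsupportedReasons(): Seq[String] = Seq(
    "Only the following formats are supported:" +
      supportedFormats.keys.toSeq.sorted
        .map(k => s"`$k`")
        .mkString("\n - ", "\n - ", ""))
}
```

The returned string embeds newlines, which render as nested Markdown list items under the bullet that `GenerateDocs` emits for each reason.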

docs/source/user-guide/latest/compatibility/expressions/aggregate.md

Lines changed: 2 additions & 13 deletions
```diff
@@ -19,16 +19,5 @@ under the License.
 
 # Aggregate Expressions
 
-## Incompatible Aggregates
-
-- **CollectSet**: Comet deduplicates NaN values (treats `NaN == NaN`) while Spark treats each NaN as a distinct value.
-  When `spark.comet.exec.strictFloatingPoint=true`, `collect_set` on floating-point types falls back to Spark unless
-  `spark.comet.expression.CollectSet.allowIncompatible=true` is set.
-
-## ANSI Mode
-
-Comet will fall back to Spark for the following aggregate expressions when ANSI mode is enabled. These can be enabled by setting `spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See the [Comet Supported Expressions Guide](../../expressions.md) for more information on this configuration setting.
-
-- Average (supports all numeric inputs except decimal types)
-
-There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support.
+<!--BEGIN:EXPR_COMPAT[aggregate]-->
+<!--END:EXPR_COMPAT-->
```

docs/source/user-guide/latest/compatibility/expressions/array.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -19,4 +19,5 @@ under the License.
 
 # Array Expressions
 
-- **SortArray**: Nested arrays with `Struct` or `Null` child values are not supported natively and will fall back to Spark.
+<!--BEGIN:EXPR_COMPAT[array]-->
+<!--END:EXPR_COMPAT-->
```

docs/source/user-guide/latest/compatibility/expressions/datetime.md

Lines changed: 2 additions & 6 deletions
```diff
@@ -19,9 +19,5 @@ under the License.
 
 # Date/Time Expressions
 
-- **Hour, Minute, Second**: Incorrectly apply timezone conversion to TimestampNTZ inputs. TimestampNTZ stores local
-  time without timezone, so no conversion should be applied. These expressions work correctly with Timestamp inputs.
-  [#3180](https://github.com/apache/datafusion-comet/issues/3180)
-- **TruncTimestamp (date_trunc)**: Produces incorrect results when used with non-UTC timezones. Compatible when
-  timezone is UTC.
-  [#2649](https://github.com/apache/datafusion-comet/issues/2649)
+<!--BEGIN:EXPR_COMPAT[datetime]-->
+<!--END:EXPR_COMPAT-->
```

docs/source/user-guide/latest/compatibility/expressions/index.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -31,6 +31,7 @@ Compatibility notes are grouped by expression category:
 aggregate
 array
 datetime
+math
 struct
 cast
 ```
````
Lines changed: 23 additions & 0 deletions

```diff
@@ -0,0 +1,23 @@
+<!---
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Math Expressions
+
+<!--BEGIN:EXPR_COMPAT[math]-->
+<!--END:EXPR_COMPAT-->
```

docs/source/user-guide/latest/compatibility/expressions/struct.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -19,5 +19,5 @@ under the License.
 
 # Struct Expressions
 
-- **StructsToJson (to_json)**: Does not support `+Infinity` and `-Infinity` for numeric types (float, double).
-  [#3016](https://github.com/apache/datafusion-comet/issues/3016)
+<!--BEGIN:EXPR_COMPAT[struct]-->
+<!--END:EXPR_COMPAT-->
```
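For illustration, here is roughly what the generated region of the struct page would contain after `GenerateDocs` runs, assuming `CometStructsToJson` reports the incompatibility shown in this commit (the wording between the markers comes from the serde's `getIncompatibleReasons` and the preamble written by `writeExpressionCompatNotes`):

```markdown
<!--BEGIN:EXPR_COMPAT[struct]-->

### StructsToJson

The following incompatibilities cause `StructsToJson` to fall back to Spark by default. Set `spark.comet.expression.StructsToJson.allowIncompatible=true` to enable Comet acceleration despite these differences.

- Does not support `+Infinity` and `-Infinity` for numeric types (float, double). (https://github.com/apache/datafusion-comet/issues/3016)
<!--END:EXPR_COMPAT-->
```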

spark/src/main/scala/org/apache/comet/GenerateDocs.scala

Lines changed: 78 additions & 0 deletions
```diff
@@ -39,10 +39,48 @@ object GenerateDocs {
 
   private val publicConfigs: Set[ConfigEntry[_]] = CometConf.allConfs.filter(_.isPublic).toSet
 
+  /** (expression class simple name, incompatible reasons, unsupported reasons) */
+  private type CategoryNotes = Seq[(String, Seq[String], Seq[String])]
+
+  /**
+   * Mapping from expression category to the compatibility guide page where that category's
+   * auto-generated notes should be written, along with a function that produces the notes for
+   * that category from the serde maps in `QueryPlanSerde`.
+   */
+  private def categoryPages: Map[String, (String, () => CategoryNotes)] = Map(
+    "array" -> ("compatibility/expressions/array.md",
+      () =>
+        QueryPlanSerde.arrayExpressions.toSeq.map { case (cls, serde) =>
+          (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons())
+        }),
+    "datetime" -> ("compatibility/expressions/datetime.md",
+      () =>
+        QueryPlanSerde.temporalExpressions.toSeq.map { case (cls, serde) =>
+          (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons())
+        }),
+    "math" -> ("compatibility/expressions/math.md",
+      () =>
+        QueryPlanSerde.mathExpressions.toSeq.map { case (cls, serde) =>
+          (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons())
+        }),
+    "struct" -> ("compatibility/expressions/struct.md",
+      () =>
+        QueryPlanSerde.structExpressions.toSeq.map { case (cls, serde) =>
+          (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons())
+        }),
+    "aggregate" -> ("compatibility/expressions/aggregate.md",
+      () =>
+        QueryPlanSerde.aggrSerdeMap.toSeq.map { case (cls, serde) =>
+          (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons())
+        }))
+
   def main(args: Array[String]): Unit = {
     val userGuideLocation = args(0)
     generateConfigReference(s"$userGuideLocation/configs.md")
     generateCompatibilityGuide(s"$userGuideLocation/compatibility/expressions/cast.md")
+    for ((category, (page, notesFn)) <- categoryPages) {
+      generateExpressionCompatNotes(s"$userGuideLocation/$page", category, notesFn())
+    }
   }
 
   private def generateConfigReference(filename: String): Unit = {
@@ -121,6 +159,46 @@ object GenerateDocs {
     w.close()
   }
 
+  private def generateExpressionCompatNotes(
+      filename: String,
+      category: String,
+      notes: CategoryNotes): Unit = {
+    val beginTag = s"<!--BEGIN:EXPR_COMPAT[$category]-->"
+    val lines = readFile(filename)
+    val w = new BufferedOutputStream(new FileOutputStream(filename))
+    for (line <- lines) {
+      w.write(s"${line.stripTrailing()}\n".getBytes)
+      if (line.trim == beginTag) {
+        writeExpressionCompatNotes(w, notes)
+      }
+    }
+    w.close()
+  }
+
+  private def writeExpressionCompatNotes(w: BufferedOutputStream, notes: CategoryNotes): Unit = {
+    val sorted = notes.sortBy(_._1).filter { case (_, incompat, unsupported) =>
+      incompat.nonEmpty || unsupported.nonEmpty
+    }
+    for ((name, incompat, unsupported) <- sorted) {
+      w.write(s"\n### $name\n".getBytes)
+      if (incompat.nonEmpty) {
+        w.write(
+          (s"\nThe following incompatibilities cause `$name` to fall back to Spark by default." +
+            s" Set `spark.comet.expression.$name.allowIncompatible=true` to enable Comet" +
+            " acceleration despite these differences.\n\n").getBytes)
+        for (reason <- incompat) {
+          w.write(s"- $reason\n".getBytes)
+        }
+      }
+      if (unsupported.nonEmpty) {
+        w.write("\nThe following cases are not supported by Comet:\n\n".getBytes)
+        for (reason <- unsupported) {
+          w.write(s"- $reason\n".getBytes)
+        }
+      }
+    }
+  }
+
   private def writeCastMatrixForMode(w: BufferedOutputStream, mode: CometEvalMode.Value): Unit = {
     val sortedTypes = CometCast.supportedTypes.sortBy(_.typeName)
     val typeNames = sortedTypes.map(_.typeName.replace("(10,2)", ""))
```
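The marker-replacement step in `generateExpressionCompatNotes` can be illustrated with an in-memory sketch. The `MarkerInjection.inject` helper below is hypothetical; the real code streams the file back out through a `BufferedOutputStream` rather than returning a sequence:

```scala
// In-memory sketch of the BEGIN-marker injection performed by
// generateExpressionCompatNotes: every line is copied through unchanged, and
// the generated notes are spliced in immediately after the BEGIN tag.
object MarkerInjection {
  def inject(lines: Seq[String], category: String, generated: Seq[String]): Seq[String] = {
    val beginTag = s"<!--BEGIN:EXPR_COMPAT[$category]-->"
    lines.flatMap { line =>
      if (line.trim == beginTag) line +: generated else Seq(line)
    }
  }
}
```

Like the real implementation, this sketch assumes the region between the markers is empty when it runs; that holds because the generated pages are produced by CI from clean sources rather than committed.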

spark/src/main/scala/org/apache/comet/serde/CometAggregateExpressionSerde.scala

Lines changed: 20 additions & 0 deletions
```diff
@@ -39,6 +39,26 @@ trait CometAggregateExpressionSerde[T <: AggregateFunction] {
   */
  def getExprConfigName(expr: T): String = expr.getClass.getSimpleName
 
+  /**
+   * Get documentation for usages where this expression may be incompatible with Spark. This is
+   * called from GenerateDocs when generating the Compatibility Guide. Each reason should be
+   * written in Markdown and may span multiple lines.
+   *
+   * @return
+   *   List of reasons, defaulting to an empty list.
+   */
+  def getIncompatibleReasons(): Seq[String] = Seq.empty
+
+  /**
+   * Get documentation for usages where this expression is not supported by Comet. This is called
+   * from GenerateDocs when generating the Compatibility Guide. Each reason should be written in
+   * Markdown and may span multiple lines.
+   *
+   * @return
+   *   List of reasons, defaulting to an empty list.
+   */
+  def getUnsupportedReasons(): Seq[String] = Seq.empty
+
   /**
    * Determine the support level of the expression based on its attributes.
    *
```

spark/src/main/scala/org/apache/comet/serde/CometExpressionSerde.scala

Lines changed: 20 additions & 0 deletions
```diff
@@ -37,6 +37,26 @@ trait CometExpressionSerde[T <: Expression] {
   */
  def getExprConfigName(expr: T): String = expr.getClass.getSimpleName
 
+  /**
+   * Get documentation for usages where this expression may be incompatible with Spark. This is
+   * called from GenerateDocs when generating the Compatibility Guide. Each reason should be
+   * written in Markdown and may span multiple lines.
+   *
+   * @return
+   *   List of reasons, defaulting to an empty list.
+   */
+  def getIncompatibleReasons(): Seq[String] = Seq.empty
+
+  /**
+   * Get documentation for usages where this expression is not supported by Comet. This is called
+   * from GenerateDocs when generating the Compatibility Guide. Each reason should be written in
+   * Markdown and may span multiple lines.
+   *
+   * @return
+   *   List of reasons, defaulting to an empty list.
+   */
+  def getUnsupportedReasons(): Seq[String] = Seq.empty
+
   /**
    * Determine the support level of the expression based on its attributes.
    *
```