diff --git a/docs/source/contributor-guide/adding_a_new_expression.md b/docs/source/contributor-guide/adding_a_new_expression.md index 10af50e069..26a520b40d 100644 --- a/docs/source/contributor-guide/adding_a_new_expression.md +++ b/docs/source/contributor-guide/adding_a_new_expression.md @@ -70,10 +70,12 @@ object CometUnhex extends CometExpressionSerde[Unhex] { } ``` -The `CometExpressionSerde` trait provides three methods you can override: +The `CometExpressionSerde` trait provides several methods you can override: - `convert(expr: T, inputs: Seq[Attribute], binding: Boolean): Option[Expr]` - **Required**. Converts the Spark expression to protobuf. Return `None` if the expression cannot be converted. -- `getSupportLevel(expr: T): SupportLevel` - Optional. Returns the level of support for the expression. See "Using getSupportLevel" section below for details. +- `getSupportLevel(expr: T): SupportLevel` - Optional. Returns the level of support for the expression at planning time, based on a specific expression instance. See "Using getSupportLevel" section below for details. +- `getIncompatibleReasons(): Seq[String]` - Optional. Returns reasons why this expression may produce different results than Spark. Used to generate the Compatibility Guide. See "Documenting Incompatible and Unsupported Reasons" below. +- `getUnsupportedReasons(): Seq[String]` - Optional. Returns reasons why this expression may not be supported by Comet (for example, unsupported data types or format strings). Used to generate the Compatibility Guide. See "Documenting Incompatible and Unsupported Reasons" below. - `getExprConfigName(expr: T): String` - Optional. Returns a short name for configuration keys. Defaults to the Spark class name. 
For simple scalar functions that map directly to a DataFusion function, you can use the built-in `CometScalarFunction` implementation: @@ -208,6 +210,65 @@ When the query planner encounters an expression: Any notes provided will be logged to help with debugging and understanding why an expression was not used. +#### Documenting Incompatible and Unsupported Reasons + +In addition to `getSupportLevel`, which governs runtime planning decisions, the serde trait exposes two static documentation methods: + +- `getIncompatibleReasons(): Seq[String]` - Reasons the expression may produce different results than Spark. +- `getUnsupportedReasons(): Seq[String]` - Reasons the expression, or certain usages of it, may not be supported by Comet. + +These methods do not affect runtime behavior. They are called by `GenerateDocs` (`spark/src/main/scala/org/apache/comet/GenerateDocs.scala`) when building the user-facing Compatibility Guide pages under `docs/source/user-guide/latest/compatibility/expressions/` (for example, `math.md`, `datetime.md`, `array.md`, `aggregate.md`, `struct.md`). Each reason is rendered as a bullet in the corresponding page. + +Key differences from `getSupportLevel`: + +- **No expression instance.** Both methods take no arguments, so they describe the expression in general rather than a specific call site. Use `getSupportLevel` for checks that depend on data types, argument values, or other per-instance details. +- **Markdown-friendly.** Each returned string is written to a Markdown document, so you can embed backticks, links, and line breaks. Keep each reason self-contained, since they are rendered as separate bullets. +- **Regenerated by CI.** The lists are collected by `GenerateDocs` and published by CI on every merge to `main`. The generated Markdown is not committed to the repo, so you do not need to regenerate or commit it yourself. 
The reasons do not have to match the `notes` passed to `Compatible`, `Incompatible`, or `Unsupported`, but keeping them consistent avoids confusing users. + +##### Example: Incompatibility note + +```scala +object CometStructsToJson extends CometExpressionSerde[StructsToJson] { + + override def getIncompatibleReasons(): Seq[String] = Seq( + "Does not support `+Infinity` and `-Infinity` for numeric types (float, double)." + + " (https://github.com/apache/datafusion-comet/issues/3016)") + + override def getSupportLevel(expr: StructsToJson): SupportLevel = + Incompatible( + Some( + "Does not support Infinity/-Infinity for numeric types" + + " (https://github.com/apache/datafusion-comet/issues/3016)")) + + // ... convert ... +} +``` + +##### Example: Unsupported note + +```scala +object CometSortArray extends CometExpressionSerde[SortArray] { + + override def getUnsupportedReasons(): Seq[String] = Seq( + "Nested arrays with `Struct` or `Null` child values are not supported natively and will" + + " fall back to Spark.") + + // ... convert ... +} +``` + +##### Example: Enumerating supported values + +When the expression only supports a known set of argument values, build the reason from that set so the documentation stays in sync with the code: + +```scala +override def getUnsupportedReasons(): Seq[String] = Seq( + "Only the following formats are supported:" + + supportedFormats.keys.toSeq.sorted + .map(k => s"`$k`") + .mkString("\n - ", "\n - ", "")) +``` + #### Adding Spark-side Tests for the New Expression It is important to verify that the new expression is correctly recognized by the native execution engine and matches the expected Spark behavior. The preferred way to add test coverage is to write a SQL test file using the SQL file test framework. This approach is simpler than writing Scala test code and makes it easy to cover many input combinations and edge cases. 
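The generator renders each non-empty reason list as bullets under a per-expression heading. Given the `CometStructsToJson` example above, the section that `writeExpressionCompatNotes` emits into `struct.md` would look roughly like this (an illustrative sketch of the generated output, derived from the formatting code in `GenerateDocs`, not a committed file):

```markdown
### StructsToJson

The following incompatibilities cause `StructsToJson` to fall back to Spark by default. Set `spark.comet.expression.StructsToJson.allowIncompatible=true` to enable Comet acceleration despite these differences.

- Does not support `+Infinity` and `-Infinity` for numeric types (float, double). (https://github.com/apache/datafusion-comet/issues/3016)
```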
diff --git a/docs/source/user-guide/latest/compatibility/expressions/aggregate.md b/docs/source/user-guide/latest/compatibility/expressions/aggregate.md index f017d899d2..8d15eea43d 100644 --- a/docs/source/user-guide/latest/compatibility/expressions/aggregate.md +++ b/docs/source/user-guide/latest/compatibility/expressions/aggregate.md @@ -19,16 +19,5 @@ under the License. # Aggregate Expressions -## Incompatible Aggregates - -- **CollectSet**: Comet deduplicates NaN values (treats `NaN == NaN`) while Spark treats each NaN as a distinct value. - When `spark.comet.exec.strictFloatingPoint=true`, `collect_set` on floating-point types falls back to Spark unless - `spark.comet.expression.CollectSet.allowIncompatible=true` is set. - -## ANSI Mode - -Comet will fall back to Spark for the following aggregate expressions when ANSI mode is enabled. These can be enabled by setting `spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See the [Comet Supported Expressions Guide](../../expressions.md) for more information on this configuration setting. - -- Average (supports all numeric inputs except decimal types) - -There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support. + + diff --git a/docs/source/user-guide/latest/compatibility/expressions/array.md b/docs/source/user-guide/latest/compatibility/expressions/array.md index 71e911288c..c7f2569b40 100644 --- a/docs/source/user-guide/latest/compatibility/expressions/array.md +++ b/docs/source/user-guide/latest/compatibility/expressions/array.md @@ -19,4 +19,5 @@ under the License. # Array Expressions -- **SortArray**: Nested arrays with `Struct` or `Null` child values are not supported natively and will fall back to Spark. 
+ + diff --git a/docs/source/user-guide/latest/compatibility/expressions/datetime.md b/docs/source/user-guide/latest/compatibility/expressions/datetime.md index b18e6f723e..e07bde3c23 100644 --- a/docs/source/user-guide/latest/compatibility/expressions/datetime.md +++ b/docs/source/user-guide/latest/compatibility/expressions/datetime.md @@ -19,9 +19,5 @@ under the License. # Date/Time Expressions -- **Hour, Minute, Second**: Incorrectly apply timezone conversion to TimestampNTZ inputs. TimestampNTZ stores local - time without timezone, so no conversion should be applied. These expressions work correctly with Timestamp inputs. - [#3180](https://github.com/apache/datafusion-comet/issues/3180) -- **TruncTimestamp (date_trunc)**: Produces incorrect results when used with non-UTC timezones. Compatible when - timezone is UTC. - [#2649](https://github.com/apache/datafusion-comet/issues/2649) + + diff --git a/docs/source/user-guide/latest/compatibility/expressions/index.md b/docs/source/user-guide/latest/compatibility/expressions/index.md index 3084a3930b..2c4742f36d 100644 --- a/docs/source/user-guide/latest/compatibility/expressions/index.md +++ b/docs/source/user-guide/latest/compatibility/expressions/index.md @@ -31,6 +31,7 @@ Compatibility notes are grouped by expression category: aggregate array datetime +math struct cast ``` diff --git a/docs/source/user-guide/latest/compatibility/expressions/math.md b/docs/source/user-guide/latest/compatibility/expressions/math.md new file mode 100644 index 0000000000..6d8905adf4 --- /dev/null +++ b/docs/source/user-guide/latest/compatibility/expressions/math.md @@ -0,0 +1,23 @@ + + +# Math Expressions + + + diff --git a/docs/source/user-guide/latest/compatibility/expressions/struct.md b/docs/source/user-guide/latest/compatibility/expressions/struct.md index 2a207894cf..1eaaf4a5e2 100644 --- a/docs/source/user-guide/latest/compatibility/expressions/struct.md +++ b/docs/source/user-guide/latest/compatibility/expressions/struct.md @@ 
-19,5 +19,5 @@ under the License. # Struct Expressions -- **StructsToJson (to_json)**: Does not support `+Infinity` and `-Infinity` for numeric types (float, double). - [#3016](https://github.com/apache/datafusion-comet/issues/3016) + + diff --git a/spark/src/main/scala/org/apache/comet/GenerateDocs.scala b/spark/src/main/scala/org/apache/comet/GenerateDocs.scala index ce3b42e78d..32c8c0fdcd 100644 --- a/spark/src/main/scala/org/apache/comet/GenerateDocs.scala +++ b/spark/src/main/scala/org/apache/comet/GenerateDocs.scala @@ -39,10 +39,48 @@ object GenerateDocs { private val publicConfigs: Set[ConfigEntry[_]] = CometConf.allConfs.filter(_.isPublic).toSet + /** (expression class simple name, incompatible reasons, unsupported reasons) */ + private type CategoryNotes = Seq[(String, Seq[String], Seq[String])] + + /** + * Mapping from expression category to the compatibility guide page where that category's + * auto-generated notes should be written, along with a function that produces the notes for + * that category from the serde maps in `QueryPlanSerde`. 
+ */ + private def categoryPages: Map[String, (String, () => CategoryNotes)] = Map( + "array" -> ("compatibility/expressions/array.md", + () => + QueryPlanSerde.arrayExpressions.toSeq.map { case (cls, serde) => + (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons()) + }), + "datetime" -> ("compatibility/expressions/datetime.md", + () => + QueryPlanSerde.temporalExpressions.toSeq.map { case (cls, serde) => + (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons()) + }), + "math" -> ("compatibility/expressions/math.md", + () => + QueryPlanSerde.mathExpressions.toSeq.map { case (cls, serde) => + (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons()) + }), + "struct" -> ("compatibility/expressions/struct.md", + () => + QueryPlanSerde.structExpressions.toSeq.map { case (cls, serde) => + (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons()) + }), + "aggregate" -> ("compatibility/expressions/aggregate.md", + () => + QueryPlanSerde.aggrSerdeMap.toSeq.map { case (cls, serde) => + (cls.getSimpleName, serde.getIncompatibleReasons(), serde.getUnsupportedReasons()) + })) + def main(args: Array[String]): Unit = { val userGuideLocation = args(0) generateConfigReference(s"$userGuideLocation/configs.md") generateCompatibilityGuide(s"$userGuideLocation/compatibility/expressions/cast.md") + for ((category, (page, notesFn)) <- categoryPages) { + generateExpressionCompatNotes(s"$userGuideLocation/$page", category, notesFn()) + } } private def generateConfigReference(filename: String): Unit = { @@ -121,6 +159,46 @@ object GenerateDocs { w.close() } + private def generateExpressionCompatNotes( + filename: String, + category: String, + notes: CategoryNotes): Unit = { + val beginTag = s"" + val lines = readFile(filename) + val w = new BufferedOutputStream(new FileOutputStream(filename)) + for (line <- lines) { + w.write(s"${line.stripTrailing()}\n".getBytes) + if (line.trim == 
beginTag) { + writeExpressionCompatNotes(w, notes) + } + } + w.close() + } + + private def writeExpressionCompatNotes(w: BufferedOutputStream, notes: CategoryNotes): Unit = { + val sorted = notes.sortBy(_._1).filter { case (_, incompat, unsupported) => + incompat.nonEmpty || unsupported.nonEmpty + } + for ((name, incompat, unsupported) <- sorted) { + w.write(s"\n### $name\n".getBytes) + if (incompat.nonEmpty) { + w.write( + (s"\nThe following incompatibilities cause `$name` to fall back to Spark by default." + + s" Set `spark.comet.expression.$name.allowIncompatible=true` to enable Comet" + + " acceleration despite these differences.\n\n").getBytes) + for (reason <- incompat) { + w.write(s"- $reason\n".getBytes) + } + } + if (unsupported.nonEmpty) { + w.write("\nThe following cases are not supported by Comet:\n\n".getBytes) + for (reason <- unsupported) { + w.write(s"- $reason\n".getBytes) + } + } + } + } + private def writeCastMatrixForMode(w: BufferedOutputStream, mode: CometEvalMode.Value): Unit = { val sortedTypes = CometCast.supportedTypes.sortBy(_.typeName) val typeNames = sortedTypes.map(_.typeName.replace("(10,2)", "")) diff --git a/spark/src/main/scala/org/apache/comet/serde/CometAggregateExpressionSerde.scala b/spark/src/main/scala/org/apache/comet/serde/CometAggregateExpressionSerde.scala index 0a5a2770b4..ba220dcce8 100644 --- a/spark/src/main/scala/org/apache/comet/serde/CometAggregateExpressionSerde.scala +++ b/spark/src/main/scala/org/apache/comet/serde/CometAggregateExpressionSerde.scala @@ -39,6 +39,26 @@ trait CometAggregateExpressionSerde[T <: AggregateFunction] { */ def getExprConfigName(expr: T): String = expr.getClass.getSimpleName + /** + * Get documentation for usages where this expression may be incompatible with Spark. This is + * called from GenerateDocs when generating the Compatibility Guide. Each reason should be + * written in Markdown and may span multiple lines. + * + * @return + * List of reasons, defaulting to an empty list. 
+ */ + def getIncompatibleReasons(): Seq[String] = Seq.empty + + /** + * Get documentation for usages where this expression is not supported by Comet. This is called + * from GenerateDocs when generating the Compatibility Guide. Each reason should be written in + * Markdown and may span multiple lines. + * + * @return + * List of reasons, defaulting to an empty list. + */ + def getUnsupportedReasons(): Seq[String] = Seq.empty + /** * Determine the support level of the expression based on its attributes. * diff --git a/spark/src/main/scala/org/apache/comet/serde/CometExpressionSerde.scala b/spark/src/main/scala/org/apache/comet/serde/CometExpressionSerde.scala index 20c0343037..afad6f6bb2 100644 --- a/spark/src/main/scala/org/apache/comet/serde/CometExpressionSerde.scala +++ b/spark/src/main/scala/org/apache/comet/serde/CometExpressionSerde.scala @@ -37,6 +37,26 @@ trait CometExpressionSerde[T <: Expression] { */ def getExprConfigName(expr: T): String = expr.getClass.getSimpleName + /** + * Get documentation for usages where this expression may be incompatible with Spark. This is + * called from GenerateDocs when generating the Compatibility Guide. Each reason should be + * written in Markdown and may span multiple lines. + * + * @return + * List of reasons, defaulting to an empty list. + */ + def getIncompatibleReasons(): Seq[String] = Seq.empty + + /** + * Get documentation for usages where this expression is not supported by Comet. This is called + * from GenerateDocs when generating the Compatibility Guide. Each reason should be written in + * Markdown and may span multiple lines. + * + * @return + * List of reasons, defaulting to an empty list. + */ + def getUnsupportedReasons(): Seq[String] = Seq.empty + /** * Determine the support level of the expression based on its attributes.
* diff --git a/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala b/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala index 768e4e0ed6..8114e54b4d 100644 --- a/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala +++ b/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala @@ -46,7 +46,7 @@ import org.apache.comet.shims.CometExprShim */ object QueryPlanSerde extends Logging with CometExprShim { - private val arrayExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( + private[comet] val arrayExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( classOf[ArrayAppend] -> CometArrayAppend, classOf[ArrayCompact] -> CometArrayCompact, classOf[ArrayContains] -> CometArrayContains, @@ -88,7 +88,7 @@ object QueryPlanSerde extends Logging with CometExprShim { classOf[Not] -> CometNot, classOf[Or] -> CometOr) - private val mathExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( + private[comet] val mathExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( classOf[Acos] -> CometScalarFunction("acos"), classOf[Add] -> CometAdd, classOf[Asin] -> CometScalarFunction("asin"), @@ -136,13 +136,14 @@ object QueryPlanSerde extends Logging with CometExprShim { classOf[MapContainsKey] -> CometMapContainsKey, classOf[MapFromEntries] -> CometMapFromEntries) - private val structExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( - classOf[CreateNamedStruct] -> CometCreateNamedStruct, - classOf[GetArrayStructFields] -> CometGetArrayStructFields, - classOf[GetStructField] -> CometGetStructField, - classOf[JsonToStructs] -> CometJsonToStructs, - classOf[StructsToJson] -> CometStructsToJson, - classOf[StructsToCsv] -> CometStructsToCsv) + private[comet] val structExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = + Map( + classOf[CreateNamedStruct] -> CometCreateNamedStruct, + classOf[GetArrayStructFields] -> 
CometGetArrayStructFields, + classOf[GetStructField] -> CometGetStructField, + classOf[JsonToStructs] -> CometJsonToStructs, + classOf[StructsToJson] -> CometStructsToJson, + classOf[StructsToCsv] -> CometStructsToCsv) private val hashExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( classOf[Crc32] -> CometScalarFunction("crc32"), @@ -152,40 +153,41 @@ object QueryPlanSerde extends Logging with CometExprShim { classOf[XxHash64] -> CometXxHash64, classOf[Sha1] -> CometSha1) - private val stringExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( - classOf[Ascii] -> CometScalarFunction("ascii"), - classOf[BitLength] -> CometScalarFunction("bit_length"), - classOf[Chr] -> CometScalarFunction("char"), - classOf[ConcatWs] -> CometConcatWs, - classOf[Concat] -> CometConcat, - classOf[Contains] -> CometScalarFunction("contains"), - classOf[EndsWith] -> CometScalarFunction("ends_with"), - classOf[GetJsonObject] -> CometGetJsonObject, - classOf[InitCap] -> CometInitCap, - classOf[Length] -> CometLength, - classOf[Like] -> CometLike, - classOf[Lower] -> CometLower, - classOf[OctetLength] -> CometScalarFunction("octet_length"), - classOf[RegExpReplace] -> CometRegExpReplace, - classOf[Reverse] -> CometReverse, - classOf[RLike] -> CometRLike, - classOf[StartsWith] -> CometScalarFunction("starts_with"), - classOf[StringInstr] -> CometScalarFunction("instr"), - classOf[StringRepeat] -> CometStringRepeat, - classOf[StringReplace] -> CometScalarFunction("replace"), - classOf[StringRPad] -> CometStringRPad, - classOf[StringLPad] -> CometStringLPad, - classOf[StringSpace] -> CometScalarFunction("space"), - classOf[StringSplit] -> CometStringSplit, - classOf[StringTranslate] -> CometScalarFunction("translate"), - classOf[StringTrim] -> CometScalarFunction("trim"), - classOf[StringTrimBoth] -> CometScalarFunction("btrim"), - classOf[StringTrimLeft] -> CometScalarFunction("ltrim"), - classOf[StringTrimRight] -> CometScalarFunction("rtrim"), - 
classOf[Left] -> CometLeft, - classOf[Right] -> CometRight, - classOf[Substring] -> CometSubstring, - classOf[Upper] -> CometUpper) + private[comet] val stringExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = + Map( + classOf[Ascii] -> CometScalarFunction("ascii"), + classOf[BitLength] -> CometScalarFunction("bit_length"), + classOf[Chr] -> CometScalarFunction("char"), + classOf[ConcatWs] -> CometConcatWs, + classOf[Concat] -> CometConcat, + classOf[Contains] -> CometScalarFunction("contains"), + classOf[EndsWith] -> CometScalarFunction("ends_with"), + classOf[GetJsonObject] -> CometGetJsonObject, + classOf[InitCap] -> CometInitCap, + classOf[Length] -> CometLength, + classOf[Like] -> CometLike, + classOf[Lower] -> CometLower, + classOf[OctetLength] -> CometScalarFunction("octet_length"), + classOf[RegExpReplace] -> CometRegExpReplace, + classOf[Reverse] -> CometReverse, + classOf[RLike] -> CometRLike, + classOf[StartsWith] -> CometScalarFunction("starts_with"), + classOf[StringInstr] -> CometScalarFunction("instr"), + classOf[StringRepeat] -> CometStringRepeat, + classOf[StringReplace] -> CometScalarFunction("replace"), + classOf[StringRPad] -> CometStringRPad, + classOf[StringLPad] -> CometStringLPad, + classOf[StringSpace] -> CometScalarFunction("space"), + classOf[StringSplit] -> CometStringSplit, + classOf[StringTranslate] -> CometScalarFunction("translate"), + classOf[StringTrim] -> CometScalarFunction("trim"), + classOf[StringTrimBoth] -> CometScalarFunction("btrim"), + classOf[StringTrimLeft] -> CometScalarFunction("ltrim"), + classOf[StringTrimRight] -> CometScalarFunction("rtrim"), + classOf[Left] -> CometLeft, + classOf[Right] -> CometRight, + classOf[Substring] -> CometSubstring, + classOf[Upper] -> CometUpper) private val bitwiseExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( classOf[BitwiseAnd] -> CometBitwiseAnd, @@ -197,33 +199,34 @@ object QueryPlanSerde extends Logging with CometExprShim { 
classOf[ShiftLeft] -> CometShiftLeft, classOf[ShiftRight] -> CometShiftRight) - private val temporalExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( - classOf[DateAdd] -> CometDateAdd, - classOf[DateDiff] -> CometDateDiff, - classOf[DateFormatClass] -> CometDateFormat, - classOf[DateFromUnixDate] -> CometDateFromUnixDate, - classOf[Days] -> CometDays, - classOf[Hours] -> CometHours, - classOf[DateSub] -> CometDateSub, - classOf[UnixDate] -> CometUnixDate, - classOf[FromUnixTime] -> CometFromUnixTime, - classOf[LastDay] -> CometLastDay, - classOf[Hour] -> CometHour, - classOf[MakeDate] -> CometMakeDate, - classOf[Minute] -> CometMinute, - classOf[NextDay] -> CometNextDay, - classOf[Second] -> CometSecond, - classOf[TruncDate] -> CometTruncDate, - classOf[TruncTimestamp] -> CometTruncTimestamp, - classOf[UnixTimestamp] -> CometUnixTimestamp, - classOf[Year] -> CometYear, - classOf[Month] -> CometMonth, - classOf[DayOfMonth] -> CometDayOfMonth, - classOf[DayOfWeek] -> CometDayOfWeek, - classOf[WeekDay] -> CometWeekDay, - classOf[DayOfYear] -> CometDayOfYear, - classOf[WeekOfYear] -> CometWeekOfYear, - classOf[Quarter] -> CometQuarter) + private[comet] val temporalExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = + Map( + classOf[DateAdd] -> CometDateAdd, + classOf[DateDiff] -> CometDateDiff, + classOf[DateFormatClass] -> CometDateFormat, + classOf[DateFromUnixDate] -> CometDateFromUnixDate, + classOf[Days] -> CometDays, + classOf[Hours] -> CometHours, + classOf[DateSub] -> CometDateSub, + classOf[UnixDate] -> CometUnixDate, + classOf[FromUnixTime] -> CometFromUnixTime, + classOf[LastDay] -> CometLastDay, + classOf[Hour] -> CometHour, + classOf[MakeDate] -> CometMakeDate, + classOf[Minute] -> CometMinute, + classOf[NextDay] -> CometNextDay, + classOf[Second] -> CometSecond, + classOf[TruncDate] -> CometTruncDate, + classOf[TruncTimestamp] -> CometTruncTimestamp, + classOf[UnixTimestamp] -> CometUnixTimestamp, + classOf[Year] -> 
CometYear, + classOf[Month] -> CometMonth, + classOf[DayOfMonth] -> CometDayOfMonth, + classOf[DayOfWeek] -> CometDayOfWeek, + classOf[WeekDay] -> CometWeekDay, + classOf[DayOfYear] -> CometDayOfYear, + classOf[WeekOfYear] -> CometWeekOfYear, + classOf[Quarter] -> CometQuarter) private val conversionExpressions: Map[Class[_ <: Expression], CometExpressionSerde[_]] = Map( classOf[Cast] -> CometCast) diff --git a/spark/src/main/scala/org/apache/comet/serde/aggregates.scala b/spark/src/main/scala/org/apache/comet/serde/aggregates.scala index 7d78bbe3e5..549776e7c3 100644 --- a/spark/src/main/scala/org/apache/comet/serde/aggregates.scala +++ b/spark/src/main/scala/org/apache/comet/serde/aggregates.scala @@ -151,6 +151,9 @@ object CometCount extends CometAggregateExpressionSerde[Count] { object CometAverage extends CometAggregateExpressionSerde[Average] { + override def getIncompatibleReasons(): Seq[String] = Seq( + "Falls back to Spark in ANSI mode. Supports all numeric inputs except decimal types.") + override def convert( aggExpr: AggregateExpression, avg: Average, @@ -666,6 +669,12 @@ object CometBloomFilterAggregate extends CometAggregateExpressionSerde[BloomFilt object CometCollectSet extends CometAggregateExpressionSerde[CollectSet] { + override def getIncompatibleReasons(): Seq[String] = Seq( + "Comet deduplicates NaN values (treats `NaN == NaN`) while Spark treats each NaN as a" + + s" distinct value. 
When `${COMET_EXEC_STRICT_FLOATING_POINT.key}=true`, `collect_set`" + + " on floating-point types falls back to Spark unless" + + " `spark.comet.expression.CollectSet.allowIncompatible=true` is set.") + override def getSupportLevel(expr: CollectSet): SupportLevel = { if (COMET_EXEC_STRICT_FLOATING_POINT.get() && SupportLevel.containsFloatingPoint(expr.children.head.dataType)) { diff --git a/spark/src/main/scala/org/apache/comet/serde/arrays.scala b/spark/src/main/scala/org/apache/comet/serde/arrays.scala index 14d4536fc1..41dd823cc8 100644 --- a/spark/src/main/scala/org/apache/comet/serde/arrays.scala +++ b/spark/src/main/scala/org/apache/comet/serde/arrays.scala @@ -123,6 +123,10 @@ object CometArrayContains extends CometExpressionSerde[ArrayContains] { object CometSortArray extends CometExpressionSerde[SortArray] { + override def getUnsupportedReasons(): Seq[String] = Seq( + "Nested arrays with `Struct` or `Null` child values are not supported natively and will" + + " fall back to Spark.") + private def supportedSortArrayElementType( dt: DataType, nestedInArray: Boolean = false): Boolean = { diff --git a/spark/src/main/scala/org/apache/comet/serde/datetime.scala b/spark/src/main/scala/org/apache/comet/serde/datetime.scala index 5413e8b439..52b086184b 100644 --- a/spark/src/main/scala/org/apache/comet/serde/datetime.scala +++ b/spark/src/main/scala/org/apache/comet/serde/datetime.scala @@ -180,12 +180,14 @@ object CometQuarter extends CometExpressionSerde[Quarter] with CometExprGetDateF object CometHour extends CometExpressionSerde[Hour] { + val incompatReason: String = "Incorrectly applies timezone conversion to TimestampNTZ inputs" + + " (https://github.com/apache/datafusion-comet/issues/3180)" + + override def getIncompatibleReasons(): Seq[String] = Seq(incompatReason) + override def getSupportLevel(expr: Hour): SupportLevel = { if (expr.child.dataType.typeName == "timestamp_ntz") { - Incompatible( - Some( - "Incorrectly applies timezone conversion to 
TimestampNTZ inputs" + - " (https://github.com/apache/datafusion-comet/issues/3180)")) + Incompatible(Some(incompatReason)) } else { Compatible() } @@ -423,6 +425,10 @@ object CometTruncDate extends CometExpressionSerde[TruncDate] { object CometTruncTimestamp extends CometExpressionSerde[TruncTimestamp] { + override def getIncompatibleReasons(): Seq[String] = Seq( + "Produces incorrect results when used with non-UTC timezones. Compatible when timezone is" + + " UTC. (https://github.com/apache/datafusion-comet/issues/2649)") + val supportedFormats: Seq[String] = Seq( "year", @@ -536,6 +542,15 @@ object CometDateFormat extends CometExpressionSerde[DateFormatClass] { // ISO formats "yyyy-MM-dd'T'HH:mm:ss" -> "%Y-%m-%dT%H:%M:%S") + override def getIncompatibleReasons(): Seq[String] = Seq( + "Non-UTC timezones may produce different results than Spark") + + override def getUnsupportedReasons(): Seq[String] = Seq( + "Only the following formats are supported:" + + supportedFormats.keys.toSeq.sorted + .map(k => s"`$k`") + .mkString("\n - ", "\n - ", "")) + override def getSupportLevel(expr: DateFormatClass): SupportLevel = { // Check timezone - only UTC is fully compatible val timezone = expr.timeZoneId.getOrElse("UTC") diff --git a/spark/src/main/scala/org/apache/comet/serde/math.scala b/spark/src/main/scala/org/apache/comet/serde/math.scala index a01d4cdf9d..2f3a6902d6 100644 --- a/spark/src/main/scala/org/apache/comet/serde/math.scala +++ b/spark/src/main/scala/org/apache/comet/serde/math.scala @@ -164,13 +164,17 @@ object CometUnhex extends CometExpressionSerde[Unhex] with MathExprBase { object CometAbs extends CometExpressionSerde[Abs] with MathExprBase { + val unsupportedReason: String = "Only integral, floating-point, and decimal types are supported" + + override def getUnsupportedReasons(): Seq[String] = Seq(unsupportedReason) + override def getSupportLevel(expr: Abs): SupportLevel = { expr.child.dataType match { case _: NumericType => Compatible() case _ => // 
Spark supports NumericType, DayTimeIntervalType, and YearMonthIntervalType - Unsupported(Some("Only integral, floating-point, and decimal types are supported")) + Unsupported(Some(unsupportedReason)) } } diff --git a/spark/src/main/scala/org/apache/comet/serde/structs.scala b/spark/src/main/scala/org/apache/comet/serde/structs.scala index 449d0fc5b9..a9fffcf6f7 100644 --- a/spark/src/main/scala/org/apache/comet/serde/structs.scala +++ b/spark/src/main/scala/org/apache/comet/serde/structs.scala @@ -105,6 +105,10 @@ object CometGetArrayStructFields extends CometExpressionSerde[GetArrayStructFiel object CometStructsToJson extends CometExpressionSerde[StructsToJson] { + override def getIncompatibleReasons(): Seq[String] = Seq( + "Does not support `+Infinity` and `-Infinity` for numeric types (float, double)." + + " (https://github.com/apache/datafusion-comet/issues/3016)") + override def getSupportLevel(expr: StructsToJson): SupportLevel = Incompatible( Some(