Commit f3b08bc

Authored by andygrove and claude
fix: [df52] timestamp nanos precision loss with nanosAsLong (#3502)
When Spark's `LEGACY_PARQUET_NANOS_AS_LONG=true` converts TIMESTAMP(NANOS) to LongType, the PhysicalExprAdapter detects a type mismatch between the file's Timestamp(Nanosecond) and the logical Int64. The DefaultAdapter creates a CastColumnExpr, which SparkPhysicalExprAdapter then replaces with Spark's Cast expression.

Spark's Cast postprocess for Timestamp→Int64 unconditionally divides by MICROS_PER_SECOND (10^6), assuming microsecond precision. But the values are nanoseconds, so the raw value 1668537129123534758 becomes 1668537129123, losing sub-millisecond precision.

Fix: route Timestamp→Int64 casts through CometCastColumnExpr (which uses spark_parquet_convert → Arrow cast) instead of Spark Cast. Arrow's cast correctly reinterprets the raw i64 value without any division.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
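The precision loss described above can be illustrated with plain i64 arithmetic; this is a standalone sketch (the constant name MICROS_PER_SECOND mirrors the commit message, and the sample value is the one quoted there), not the actual Comet or Spark code paths:

```rust
fn main() {
    const MICROS_PER_SECOND: i64 = 1_000_000;

    // Raw Parquet TIMESTAMP(NANOS) value from the commit description.
    let raw_nanos: i64 = 1_668_537_129_123_534_758;

    // Spark's Cast postprocess assumes the i64 holds microseconds and
    // divides to produce seconds, truncating sub-millisecond digits
    // when the value is actually nanoseconds.
    let spark_cast_result = raw_nanos / MICROS_PER_SECOND;

    // Arrow's Timestamp(Nanosecond) -> Int64 cast reinterprets the raw
    // i64 bits without any division, so nothing is lost.
    let arrow_cast_result = raw_nanos;

    assert_eq!(spark_cast_result, 1_668_537_129_123);
    assert_eq!(arrow_cast_result, 1_668_537_129_123_534_758);
    println!("spark cast: {spark_cast_result}");
    println!("arrow cast: {arrow_cast_result}");
}
```

The division drops the trailing `534758`, which is exactly the sub-millisecond precision loss the fix avoids by routing this cast through the Arrow path.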
1 parent dc2b9a4 commit f3b08bc

1 file changed: native/core/src/parquet/schema_adapter.rs
Lines changed: 11 additions & 4 deletions
@@ -302,23 +302,30 @@ impl SparkPhysicalExprAdapter {
         let physical_type = cast.input_field().data_type();
         let target_type = cast.target_field().data_type();

-        // For complex nested types (Struct, List, Map) and Timestamp timezone
-        // mismatches, use CometCastColumnExpr with spark_parquet_convert which
-        // handles field-name-based selection, reordering, nested type casting,
-        // and metadata-only timestamp timezone relabeling correctly.
+        // For complex nested types (Struct, List, Map), Timestamp timezone
+        // mismatches, and Timestamp→Int64 (nanosAsLong), use CometCastColumnExpr
+        // with spark_parquet_convert which handles field-name-based selection,
+        // reordering, nested type casting, metadata-only timestamp timezone
+        // relabeling, and raw value reinterpretation correctly.
         //
         // Timestamp mismatches (e.g., Timestamp(us, None) -> Timestamp(us, Some("UTC")))
         // occur when INT96 Parquet timestamps are coerced to Timestamp(us, None) by
         // DataFusion but the logical schema expects Timestamp(us, Some("UTC")).
         // Using Spark's Cast here would incorrectly treat the None-timezone values as
         // local time (TimestampNTZ) and apply a timezone conversion, but the values are
         // already in UTC. spark_parquet_convert handles this as a metadata-only change.
+        //
+        // Timestamp→Int64 occurs when Spark's `nanosAsLong` config converts
+        // TIMESTAMP(NANOS) to LongType. Spark's Cast would divide by MICROS_PER_SECOND
+        // (assuming microseconds), but the values are nanoseconds. Arrow cast correctly
+        // reinterprets the raw i64 value without conversion.
         if matches!(
             (physical_type, target_type),
             (DataType::Struct(_), DataType::Struct(_))
                 | (DataType::List(_), DataType::List(_))
                 | (DataType::Map(_, _), DataType::Map(_, _))
                 | (DataType::Timestamp(_, _), DataType::Timestamp(_, _))
+                | (DataType::Timestamp(_, _), DataType::Int64)
         ) {
             let comet_cast: Arc<dyn PhysicalExpr> = Arc::new(
                 CometCastColumnExpr::new(
