Commit f0652aa

andygrove and claude authored
fix: [df52] route timestamp timezone mismatches through spark_parquet_convert (#3494)
INT96 Parquet timestamps are coerced to Timestamp(us, None) by DataFusion, but the logical schema expects Timestamp(us, Some("UTC")). The schema adapter was routing this mismatch through Spark's Cast expression, which incorrectly treats None-timezone values as TimestampNTZ (local time) and applies a timezone conversion. This caused results to be shifted by the session timezone offset (e.g., -5h45m for Asia/Kathmandu).

Route Timestamp->Timestamp mismatches through CometCastColumnExpr, which delegates to spark_parquet_convert and handles this as a metadata-only timezone relabel without modifying the underlying values.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
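To make the distinction concrete, here is an illustrative Rust sketch (not the Comet implementation; `TimestampColumn`, `relabel_utc`, and `cast_as_local` are hypothetical names) modeling a timestamp column as microseconds since the Unix epoch plus an optional timezone label, and contrasting the two ways a `None -> Some("UTC")` mismatch could be resolved:

```rust
// Hypothetical model of a timestamp column: values stored relative to UTC,
// with an optional timezone label carried as metadata.
#[derive(Debug, Clone, PartialEq)]
struct TimestampColumn {
    micros: Vec<i64>,         // microseconds since the Unix epoch
    timezone: Option<String>, // metadata-only timezone label
}

/// The correct path for this case (what spark_parquet_convert effectively
/// does here): relabel the timezone without touching the stored values.
fn relabel_utc(col: &TimestampColumn) -> TimestampColumn {
    TimestampColumn {
        micros: col.micros.clone(),
        timezone: Some("UTC".to_string()),
    }
}

/// The buggy Cast path, in effect: treat None-timezone values as local
/// wall-clock time (TimestampNTZ) and convert them to UTC by subtracting
/// the session timezone offset, corrupting values that were already UTC.
fn cast_as_local(col: &TimestampColumn, session_offset_micros: i64) -> TimestampColumn {
    TimestampColumn {
        micros: col
            .micros
            .iter()
            .map(|v| v - session_offset_micros)
            .collect(),
        timezone: Some("UTC".to_string()),
    }
}

fn main() {
    // INT96 data coerced by DataFusion to Timestamp(us, None).
    let physical = TimestampColumn {
        micros: vec![1_700_000_000_000_000],
        timezone: None,
    };
    // Asia/Kathmandu is UTC+5:45, as in the commit message's example.
    let kathmandu_offset = (5 * 3600 + 45 * 60) * 1_000_000i64;

    let relabeled = relabel_utc(&physical);
    let shifted = cast_as_local(&physical, kathmandu_offset);

    // Metadata-only relabel: values are untouched.
    assert_eq!(relabeled.micros, physical.micros);
    // Cast path: every value ends up shifted by -5h45m.
    assert_eq!(shifted.micros[0], physical.micros[0] - kathmandu_offset);
    println!("relabeled tz = {:?}", relabeled.timezone);
}
```

The point of the sketch is that both paths produce the same schema, `Timestamp(us, Some("UTC"))`, but only the relabel path preserves the stored instants; the cast path silently rewrites them.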
1 parent 2482c43 commit f0652aa

1 file changed: native/core/src/parquet/schema_adapter.rs (12 additions & 3 deletions)
```diff
@@ -194,14 +194,23 @@ impl SparkPhysicalExprAdapter {
         let physical_type = cast.input_field().data_type();
         let target_type = cast.target_field().data_type();

-        // For complex nested types (Struct, List, Map), use CometCastColumnExpr
-        // with spark_parquet_convert which handles field-name-based selection,
-        // reordering, and nested type casting correctly.
+        // For complex nested types (Struct, List, Map) and Timestamp timezone
+        // mismatches, use CometCastColumnExpr with spark_parquet_convert which
+        // handles field-name-based selection, reordering, nested type casting,
+        // and metadata-only timestamp timezone relabeling correctly.
+        //
+        // Timestamp mismatches (e.g., Timestamp(us, None) -> Timestamp(us, Some("UTC")))
+        // occur when INT96 Parquet timestamps are coerced to Timestamp(us, None) by
+        // DataFusion but the logical schema expects Timestamp(us, Some("UTC")).
+        // Using Spark's Cast here would incorrectly treat the None-timezone values as
+        // local time (TimestampNTZ) and apply a timezone conversion, but the values are
+        // already in UTC. spark_parquet_convert handles this as a metadata-only change.
         if matches!(
             (physical_type, target_type),
             (DataType::Struct(_), DataType::Struct(_))
                 | (DataType::List(_), DataType::List(_))
                 | (DataType::Map(_, _), DataType::Map(_, _))
+                | (DataType::Timestamp(_, _), DataType::Timestamp(_, _))
         ) {
             let comet_cast: Arc<dyn PhysicalExpr> = Arc::new(
                 CometCastColumnExpr::new(
```