Shekharrajak
diff --git a/docs/iceberg-serialization-optimization-analysis.md b/docs/iceberg-serialization-optimization-analysis.md
@@ -0,0 +1,306 @@
+<!---
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Iceberg Serialization Optimization Analysis
+
+**GitHub Issue:** [#3456](https://github.com/apache/datafusion-comet/issues/3456)  
+**Date:** 2026-02-20  
+**Branch:** `feature/iceberg-serialization-optimizations-3456`
+
+## Executive Summary
+
+PR #3298 made Iceberg serialization **~50% faster** through reflection caching and deduplication. Subsequent PRs #3349 and #3443 refactored the code significantly, and **most of these optimizations were lost**. This analysis identifies which of them can be re-applied to the current codebase.
+
+---
+
+## PR #3298 Original Optimizations
+
+### 1. ReflectionCache Case Class
+**Status: ❌ Removed**
+
+PR #3298 introduced a comprehensive `ReflectionCache` that loaded the required Iceberg classes and resolved their methods once, then reused them for every task:
+
+```scala
+case class ReflectionCache(
+    // Iceberg classes (loaded once)
+    contentScanTaskClass: Class[_],
+    fileScanTaskClass: Class[_],
+    contentFileClass: Class[_],
+    deleteFileClass: Class[_],
+    schemaParserClass: Class[_],
+    // ... many more
+    
+    // Cached methods (looked up once)
+    fileMethod: java.lang.reflect.Method,
+    startMethod: java.lang.reflect.Method,
+    deletesMethod: java.lang.reflect.Method,
+    // ... 20+ cached methods
+)
+```
+
+**Impact:** Avoided 30,000+ `Class.forName()` and `getMethod()` calls per query.
+
+### 2. Partition Spec Deduplication by Object Identity
+**Status: ❌ Reverted to JSON string comparison**
+
+PR #3298 used object identity for deduplication:
+```scala
+// PR #3298: Object identity (fast)
+partitionSpecToPoolIndex = mutable.HashMap[AnyRef, Int]()
+
+// Current code: JSON string (slow)
+partitionSpecToPoolIndex = mutable.HashMap[String, Int]()
+```
+
+**Impact:** Avoided redundant `toJson()` calls for duplicate specs.
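+
+To illustrate the idea, here is a minimal sketch of identity-keyed pooling. `FakeSpec`, `specPool`, and `poolIndexFor` are hypothetical names invented for this example, not code from PR #3298. The sketch uses `java.util.IdentityHashMap` so that lookups are by reference regardless of how the key class defines `equals`.
+
+```scala
+import java.util.IdentityHashMap
+import scala.collection.mutable.ArrayBuffer
+
+object IdentityPoolSketch {
+  // Hypothetical stand-in for an Iceberg PartitionSpec; toJson() is the costly call.
+  final class FakeSpec(val id: Int) {
+    var toJsonCalls = 0
+    def toJson(): String = { toJsonCalls += 1; s"""{"spec-id":$id}""" }
+  }
+
+  // Pool of serialized specs plus an identity-keyed index into it.
+  val specPool = ArrayBuffer.empty[String]
+  val specToPoolIndex = new IdentityHashMap[AnyRef, Integer]()
+
+  // Returns the pool index for `spec`, serializing it only the first time
+  // this exact object instance is seen.
+  def poolIndexFor(spec: FakeSpec): Int = {
+    val existing = specToPoolIndex.get(spec)
+    if (existing != null) existing.intValue()
+    else {
+      specPool += spec.toJson()
+      val idx = specPool.size - 1
+      specToPoolIndex.put(spec, Integer.valueOf(idx))
+      idx
+    }
+  }
+
+  // 1,000 tasks that all share one spec instance.
+  val shared = new FakeSpec(0)
+  val indices = (1 to 1000).map(_ => poolIndexFor(shared))
+}
+```
+
+In this sketch `toJson()` runs once for the shared instance, and every task receives pool index 0.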
+
+### 3. Partition Type Deduplication by Spec Identity
+**Status: ❌ Reverted to JSON string comparison**
+
+A partition type is derived from its spec, so the same spec always yields the same partition type. PR #3298 cached this relationship, keyed by the spec object:
+```scala
+// PR #3298: Spec identity → type index
+partitionTypeToPoolIndex = mutable.HashMap[AnyRef, Int]()
+
+// Current code: JSON string comparison
+partitionTypeToPoolIndex = mutable.HashMap[String, Int]()
+```
+
+### 4. Field ID Mapping Cache
+**Status: ❌ Removed**
+
+PR #3298 cached `buildFieldIdMapping()` results by schema identity:
+```scala
+// PR #3298
+val fieldIdMappingCache = mutable.HashMap[AnyRef, Map[String, Int]]()
+val nameToFieldId = fieldIdMappingCache.getOrElseUpdate(
+    schema,
+    IcebergReflection.buildFieldIdMapping(schema))
+
+// Current code: Called every task iteration (line 878)
+val nameToFieldId = IcebergReflection.buildFieldIdMapping(schema)
+```
+
+**Impact:** `buildFieldIdMapping()` performs reflection once per column. With 30K tasks and 10 columns each, that is 300K redundant reflection calls.
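+
+The effect of the cache can be shown with a small, self-contained sketch. `FakeSchema` and the local `buildFieldIdMapping` are hypothetical stand-ins (the real method lives in `IcebergReflection` and uses reflection); only the `getOrElseUpdate` pattern is taken from the snippet above.
+
+```scala
+import scala.collection.mutable
+
+object FieldIdCacheSketch {
+  // Hypothetical stand-in for an Iceberg Schema (column name -> field ID).
+  // A plain class does not override equals, so map lookups are by reference.
+  final class FakeSchema(val columns: Seq[(String, Int)])
+
+  // Counts how often the expensive mapping is actually built.
+  var buildCalls = 0
+
+  // Stand-in for IcebergReflection.buildFieldIdMapping().
+  def buildFieldIdMapping(schema: FakeSchema): Map[String, Int] = {
+    buildCalls += 1
+    schema.columns.toMap
+  }
+
+  val fieldIdMappingCache = mutable.HashMap[AnyRef, Map[String, Int]]()
+
+  // Builds the mapping on first sight of a schema, then reuses it.
+  def mappingFor(schema: FakeSchema): Map[String, Int] =
+    fieldIdMappingCache.getOrElseUpdate(schema, buildFieldIdMapping(schema))
+
+  // 30,000 tasks that all reference the same schema object.
+  val schema = new FakeSchema(Seq("id" -> 1, "name" -> 2))
+  val mappings = (1 to 30000).map(_ => mappingFor(schema))
+}
+```
+
+Here the mapping is built once and the remaining 29,999 lookups are plain hash-map hits.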
+
+### 5. Delete Files Method Caching
+**Status: ⚠️ Partially present**
+
+The current code caches some methods, but not all of the delete-related ones:
+- `deletesMethod` is NOT cached (called via `IcebergReflection.getDeleteFilesFromTask()`)
+- `contentMethod`, `specIdMethod`, `equalityIdsMethod` are looked up per-delete-file
+
+---
+
+## Current Code State
+
+### What's Preserved ✅
+
+In `serializePartitions()` (lines 773-789), some method caching exists:
+```scala
+// Load Iceberg classes once
+val contentScanTaskClass = Class.forName(...)
+val fileScanTaskClass = Class.forName(...)
+
+// Cache method lookups
+val fileMethod = contentScanTaskClass.getMethod("file")
+val startMethod = contentScanTaskClass.getMethod("start")
+val lengthMethod = contentScanTaskClass.getMethod("length")
+val residualMethod = contentScanTaskClass.getMethod("residual")
+val taskSchemaMethod = fileScanTaskClass.getMethod("schema")
+val toJsonMethod = schemaParserClass.getMethod("toJson", schemaClass)
+```
+
+### What's Missing ❌
+
+| Optimization | Location | Impact |
+|-------------|----------|--------|
+| `deletesMethod` caching | `extractDeleteFilesList()` | Per-task overhead |
+| `specMethod` caching | `serializePartitionData()` line 318 | Per-task overhead |
+| `partitionTypeMethod` caching | `serializePartitionData()` line 353 | Per-task overhead |
+| `fieldsMethod` caching | `serializePartitionData()` line 357 | Per-task overhead |
+| `fieldIdMethod`, `nameMethod`, `isOptionalMethod` | `serializePartitionData()` lines 390-397 | Per-field per-task |
+| `contentMethod`, `specIdMethod`, `equalityIdsMethod` | `extractDeleteFilesList()` | Per-delete-file |
+| Field ID mapping cache | `serializePartitions()` line 878 | Per-task schema reflection |
+| Object identity deduplication | Pool index maps | Per-task JSON serialization |
+
+---
+
+## Recommended Optimizations
+
+### Priority 1: Restore ReflectionCache (High Impact)
+
+Create a comprehensive cache in `IcebergReflection.scala`:
+
+```scala
+import java.lang.reflect.Method
+
+case class ReflectionCache(
+    // Classes
+    contentScanTaskClass: Class[_],
+    fileScanTaskClass: Class[_],
+    contentFileClass: Class[_],
+    deleteFileClass: Class[_],
+    schemaParserClass: Class[_],
+    schemaClass: Class[_],
+    partitionSpecParserClass: Class[_],
+    partitionSpecClass: Class[_],
+    structTypeClass: Class[_],
+    nestedFieldClass: Class[_],
+    structLikeClass: Class[_],
+    
+    // ContentScanTask methods
+    fileMethod: Method,
+    startMethod: Method,
+    lengthMethod: Method,
+    partitionMethod: Method,
+    residualMethod: Method,
+    
+    // FileScanTask methods
+    taskSchemaMethod: Method,
+    deletesMethod: Method,
+    specMethod: Method,
+    
+    // ContentFile methods
+    fileLocationMethod: Method,
+    
+    // DeleteFile methods
+    deleteContentMethod: Method,
+    deleteSpecIdMethod: Method,
+    deleteEqualityIdsMethod: Method,
+    
+    // Schema methods
+    schemaToJsonMethod: Method,
+    
+    // PartitionSpec methods
+    partitionSpecToJsonMethod: Method,
+    partitionTypeMethod: Method,
+    
+    // StructType/NestedField methods
+    structTypeFieldsMethod: Method,
+    nestedFieldTypeMethod: Method,
+    nestedFieldIdMethod: Method,
+    nestedFieldNameMethod: Method,
+    nestedFieldIsOptionalMethod: Method,
+    
+    // StructLike methods
+    structLikeGetMethod: Method
+)
+
+def createReflectionCache(): ReflectionCache = {
+    // Load all classes and methods once
+    // ...
+}
+```
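+
+The body of `createReflectionCache()` is left open above. As a rough illustration of its shape, the sketch below builds a much smaller cache against JDK classes, so it runs without Iceberg on the classpath. `MiniReflectionCache`, `createMiniReflectionCache`, and `firstElement` are hypothetical names, and `java.util.List` merely stands in for the Iceberg interfaces.
+
+```scala
+import java.lang.reflect.Method
+
+object ReflectionCacheSketch {
+  // Hypothetical, much smaller cache: JDK classes stand in for Iceberg ones.
+  case class MiniReflectionCache(
+      listClass: Class[_],
+      sizeMethod: Method,
+      elementAtMethod: Method)
+
+  // Resolve classes and methods once; later calls reuse the Method handles.
+  def createMiniReflectionCache(): MiniReflectionCache = {
+    val listClass = Class.forName("java.util.List")
+    MiniReflectionCache(
+      listClass = listClass,
+      sizeMethod = listClass.getMethod("size"),
+      elementAtMethod = listClass.getMethod("get", classOf[Int]))
+  }
+
+  // A helper that receives the cache instead of looking methods up itself.
+  def firstElement(list: AnyRef, cache: MiniReflectionCache): Option[AnyRef] = {
+    val size = cache.sizeMethod.invoke(list).asInstanceOf[Int]
+    if (size > 0) Some(cache.elementAtMethod.invoke(list, Int.box(0))) else None
+  }
+}
+```
+
+The helper signature mirrors the intent of passing one cache object down the call chain rather than several `Class[_]` parameters.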
+
+### Priority 2: Restore Field ID Mapping Cache (Medium Impact)
+
+In `serializePartitions()`:
+
+```scala
+// Add cache before the loop
+val fieldIdMappingCache = mutable.HashMap[AnyRef, Map[String, Int]]()
+
+// Inside the loop, replace:
+val nameToFieldId = IcebergReflection.buildFieldIdMapping(schema)
+
+// With:
+val nameToFieldId = fieldIdMappingCache.getOrElseUpdate(
+    schema,
+    IcebergReflection.buildFieldIdMapping(schema))
+```
+
+### Priority 3: Restore Object Identity Deduplication (Medium Impact)
+
+Change pool index map types:
+
+```scala
+// From:
+val partitionTypeToPoolIndex = mutable.HashMap[String, Int]()
+val partitionSpecToPoolIndex = mutable.HashMap[String, Int]()
+
+// To:
+val partitionTypeToPoolIndex = mutable.HashMap[AnyRef, Int]()  // Spec identity
+val partitionSpecToPoolIndex = mutable.HashMap[AnyRef, Int]()  // Object identity
+```
+
+Then update `serializePartitionData()` to use the spec object as the key instead of the JSON string.
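+
+One point worth checking when switching the key type: a `mutable.HashMap[AnyRef, Int]` compares keys with `equals`/`hashCode`, so it behaves as an identity map only if the key class does not override them. The sketch below shows the difference using a hypothetical `SpecLike` case class; whether the Iceberg classes override `equals` is not established here and should be verified.
+
+```scala
+import java.util.IdentityHashMap
+import scala.collection.mutable
+
+object KeySemanticsSketch {
+  // Hypothetical key type; case classes override equals/hashCode by value.
+  case class SpecLike(id: Int)
+
+  val a = SpecLike(1)
+  val b = SpecLike(1) // distinct instance, equal by value
+
+  // Keyed by equals/hashCode: `a` and `b` collapse into one entry.
+  val byEquals = mutable.HashMap[AnyRef, Int]()
+  byEquals(a) = 0
+  byEquals(b) = 1
+
+  // Keyed by reference: `a` and `b` stay separate.
+  val byIdentity = new IdentityHashMap[AnyRef, Integer]()
+  byIdentity.put(a, Integer.valueOf(0))
+  byIdentity.put(b, Integer.valueOf(1))
+}
+```
+
+Either behaviour deduplicates correctly for pooling purposes; the difference is whether each lookup costs a reference comparison or a full `hashCode`/`equals` evaluation.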
+
+### Priority 4: Pass Cache to Helper Methods (Low Impact)
+
+Update method signatures to accept the cache:
+
+```scala
+// From:
+private def extractDeleteFilesList(
+    task: Any,
+    contentFileClass: Class[_],
+    fileScanTaskClass: Class[_]): Seq[...] 
+
+// To:
+private def extractDeleteFilesList(
+    task: Any,
+    cache: ReflectionCache): Seq[...]
+```
+
+---
+
+## Estimated Performance Impact
+
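+These figures come from the PR #3298 benchmark (30,000 tasks). Re-applying the optimizations is expected to recover similar gains, but this has not yet been measured on the current code: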
+
+| Metric | Before Optimization | After Optimization | Improvement |
+|--------|--------------------|--------------------|-------------|
+| Serialization time | 34,425 ms | 16,618 ms | **52% faster** |
+| Reflection calls | ~1M+ | ~100 | **99.99% reduction** |
+| JSON serializations | ~60K | ~100 | **99.8% reduction** |
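+
+For reference, the "52% faster" figure is the relative reduction in serialization time, which corresponds to roughly a 2× speedup:
+
+```math
+\text{time reduction} = \frac{34{,}425 - 16{,}618}{34{,}425} \approx 0.517,
+\qquad
+\text{speedup} = \frac{34{,}425}{16{,}618} \approx 2.07\times
+```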
+
+---
+
+## Implementation Plan
+
+1. **Phase 1:** Add `ReflectionCache` to `IcebergReflection.scala`
+2. **Phase 2:** Update `serializePartitions()` to create cache once and pass to helpers
+3. **Phase 3:** Update `extractDeleteFilesList()` to use cache
+4. **Phase 4:** Update `serializePartitionData()` to use cache
+5. **Phase 5:** Add field ID mapping cache
+6. **Phase 6:** Restore object identity deduplication for pools
+7. **Phase 7:** Add a benchmark test to validate the improvements
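+
+For Phase 7, a first sanity check could look like the sketch below, which times per-task method lookup against a single cached lookup. Everything in it is an assumption made for illustration: `ReflectionBenchSketch`, `timed`, and the use of `String.length` in place of Iceberg methods. It is a single-shot timing without JIT warm-up, so a real validation would need a proper harness such as JMH.
+
+```scala
+object ReflectionBenchSketch {
+  // Times `body` once and returns (result, elapsed milliseconds).
+  def timed[A](body: => A): (A, Double) = {
+    val start = System.nanoTime()
+    val result = body
+    (result, (System.nanoTime() - start) / 1e6)
+  }
+
+  val tasks = 30000
+  val sample: AnyRef = "iceberg"
+
+  // Uncached: resolve the class and method again for every task.
+  def uncached(): Long = {
+    var total = 0L
+    var i = 0
+    while (i < tasks) {
+      val m = Class.forName("java.lang.String").getMethod("length")
+      total += m.invoke(sample).asInstanceOf[Int]
+      i += 1
+    }
+    total
+  }
+
+  // Cached: resolve once, then only invoke per task.
+  def cached(): Long = {
+    val m = Class.forName("java.lang.String").getMethod("length")
+    var total = 0L
+    var i = 0
+    while (i < tasks) {
+      total += m.invoke(sample).asInstanceOf[Int]
+      i += 1
+    }
+    total
+  }
+
+  def main(args: Array[String]): Unit = {
+    val (a, uncachedMs) = timed(uncached())
+    val (b, cachedMs) = timed(cached())
+    assert(a == b) // both variants must compute the same result
+    println(f"uncached: $uncachedMs%.1f ms, cached: $cachedMs%.1f ms")
+  }
+}
+```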
+
+---
+
+## Files to Modify
+
+1. `spark/src/main/scala/org/apache/comet/iceberg/IcebergReflection.scala`
+   - Add `ReflectionCache` case class
+   - Add `createReflectionCache()` method
+
+2. `spark/src/main/scala/org/apache/comet/serde/operator/CometIcebergNativeScan.scala`
+   - Update `serializePartitions()` to use cache
+   - Update `extractDeleteFilesList()` signature and implementation
+   - Update `serializePartitionData()` signature and implementation
+   - Add field ID mapping cache
+   - Change pool index map types to use object identity
+
+---
+
+## References
+
+- PR #3298: Original optimizations (merged Jan 2026)
+- PR #3349: Per-partition plan building refactor (merged Feb 2026)  
+- PR #3443: Validation cleanup (merged Feb 2026)
+- Issue #3456: This analysis task