## Background

PR #2521 added memory reservation debug logging (the `spark.comet.debug.memory` config and the `LoggingPool` wrapper). That PR also contained Python scripts for parsing and visualizing the memory debug logs, but those scripts were not merged. This issue tracks adding the analysis/visualization scripts as a follow-up.
## Log Format

When `spark.comet.debug.memory=true` is set, the `LoggingPool` produces log lines like:

```
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(257820416) returning Err
[Task 486] MemoryPool[ExternalSorterMerge[6]].shrink(10485760)
[Task 486] MemoryPool[ExternalSorterMerge[6]].try_grow(68928) returning Ok
```
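These lines can be captured with a single regular expression; a minimal sketch (the exact pattern used in #2521 may differ — note the lazy consumer group, since consumer names like `ExternalSorter[6]` contain brackets themselves):

```python
import re

# Hypothetical pattern; the regex in the actual script may differ.
LINE_RE = re.compile(
    r"\[Task (?P<task>\d+)\] MemoryPool\[(?P<consumer>.+?)\]"
    r"\.(?P<method>\w+)\((?P<size>\d+)\)(?: returning (?P<result>Ok|Err))?"
)

m = LINE_RE.search(
    "[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok"
)
print(m.group("task"), m.group("consumer"), m.group("method"),
      m.group("size"), m.group("result"))
# → 486 ExternalSorter[6] try_grow 256232960 Ok
```

The trailing `returning Ok|Err` group is optional so that `shrink` lines, which carry no result, still match (with `result` as `None`).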
## Proposed Scripts

### 1. `dev/scripts/mem_debug_to_csv.py` — Parse logs to CSV

Parses the Spark executor/worker log file, filters by task ID, and tracks cumulative memory allocation per consumer (operator).

Key details from the #2521 implementation:

- Uses a regex to parse lines matching `[Task <id>] MemoryPool[<consumer>].<method>(<size>)`
- Tracks a running total per consumer: `grow`/`try_grow` add to the allocation, `shrink` subtracts
- For `try_grow` failures (line contains "Err"), the allocation is not updated but the row is annotated with an `ERR` label
- Outputs CSV with columns: `name`, `size`, `label`
- Accepts `--task <id>` to filter to a specific Spark task and `--file <path>` for the log file
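Put together, a parsing loop matching these details might look like the following sketch (assumed structure, incorporating the review feedback below on labels, f-strings, the optional `--task`, and a leading `shrink`; the merged script may differ):

```python
#!/usr/bin/env python3
"""Hypothetical sketch of dev/scripts/mem_debug_to_csv.py."""
import argparse
import re
import sys
from collections import defaultdict

LINE_RE = re.compile(
    r"\[Task (?P<task>\d+)\] MemoryPool\[(?P<consumer>.+?)\]"
    r"\.(?P<method>\w+)\((?P<size>\d+)\)(?: returning (?P<result>Ok|Err))?"
)

def to_csv(lines, task=None, out=sys.stdout):
    # defaultdict(int) handles a consumer whose first event is a shrink
    alloc = defaultdict(int)
    print("name,size,label", file=out)
    for line in lines:
        m = LINE_RE.search(line)
        if m is None or (task is not None and m.group("task") != str(task)):
            continue
        consumer, method, size = m.group("consumer"), m.group("method"), int(m.group("size"))
        label = ""
        if m.group("result") == "Err":
            label = "ERR"               # failed try_grow: annotate, leave total unchanged
        elif method == "shrink":
            alloc[consumer] -= size
        else:                           # grow / successful try_grow
            alloc[consumer] += size
        # f-string avoids the extra spaces that print(a, ",", b) would emit
        print(f"{consumer},{alloc[consumer]},{label}", file=out)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="executor/worker log file")
    parser.add_argument("--task", type=int, default=None, help="optional Spark task id filter")
    args = parser.parse_args()
    with open(args.file) as f:
        to_csv(f, task=args.task)
```

Emitting exactly one row per event (with the `label` column empty except on failures) keeps the CSV trivially consumable by the plotting script.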
### 2. `dev/scripts/plot_memory_usage.py` — Visualize memory usage

Reads the CSV output and produces a stacked area chart showing memory usage over time by consumer (operator).

Key details from the #2521 implementation:

- Uses pandas and matplotlib
- Creates a time index from row order (each row = one sequential event)
- Pivots the data so each consumer is a column, forward-filling missing values
- Renders a stacked area chart (`plt.stackplot`)
- Annotates `try_grow` failures with red vertical dashed lines labeled "ERR"
- Saves the chart as a PNG (same path as the CSV but with a `_chart.png` suffix)
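The plotting side could be sketched roughly as below, assuming the `name`/`size`/`label` CSV columns described above (the merged script may structure this differently):

```python
#!/usr/bin/env python3
"""Hypothetical sketch of dev/scripts/plot_memory_usage.py."""
import sys

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot(csv_path: str) -> str:
    # keep_default_na=False so an empty label reads as "" rather than NaN
    df = pd.read_csv(csv_path, keep_default_na=False)
    df["event"] = range(len(df))  # time index derived from row order
    # One column per consumer; forward-fill so each series stays continuous
    wide = df.pivot(index="event", columns="name", values="size").ffill().fillna(0)
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.stackplot(wide.index, wide.T.values, labels=list(wide.columns))
    # Mark failed try_grow calls with red vertical dashed lines
    for event in df.loc[df["label"] == "ERR", "event"]:
        ax.axvline(event, color="red", linestyle="--")
    ax.set_xlabel("event")
    ax.set_ylabel("bytes")
    ax.legend(loc="upper left")
    out_path = csv_path.rsplit(".", 1)[0] + "_chart.png"
    fig.savefig(out_path)
    return out_path

if __name__ == "__main__":
    plot(sys.argv[1])
```

Note the sketch uses `DataFrame.ffill()` rather than the deprecated `fillna(method='ffill')`, per the review feedback below.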
## Suggestions from PR #2521 Code Review

The following review feedback should be incorporated:

- Use a `#!/usr/bin/env python3` shebang and make the scripts executable (`chmod +x`)
- Fix CSV formatting — use f-strings (`f"{consumer},{alloc[consumer]}"`) instead of `print(consumer, ",", alloc[consumer])` to avoid extra spaces around values
- Fix `ERR` label handling — the original implementation printed two rows for the same event on `try_grow` failure (one with the `ERR` label, one without). Use a label variable so only one row is printed per event
- Handle the first occurrence being a `shrink` — the original code assumed the first event for a consumer is always `grow`/`try_grow`, but the first event could be a `shrink`
- Fix the `--task` argument — `int(None)` raises a `TypeError` when `--task` is not provided; make it optional or a positional argument
- Consider making `--file` a positional argument for simpler CLI usage
- Use `pandas.DataFrame.ffill()` instead of the deprecated `fillna(method='ffill')` (deprecated since pandas 2.1.0)
- Consider logging backtraces — when the backtrace feature is enabled, it could be useful to log backtraces on every call (not just errors) to trace precise allocation origins. This was suggested as an optional `trace!`-level enhancement to the Rust `LoggingPool`
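On the pandas point, the replacement is a one-line change; a quick illustration:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

# Deprecated since pandas 2.1.0:
#   s.fillna(method="ffill")

# Preferred:
print(s.ffill().tolist())  # → [1.0, 1.0, 1.0, 4.0]
```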
## Example Workflow

```shell
# Step 1: Run Spark with memory debug logging enabled
spark-submit --conf spark.comet.debug.memory=true ...

# Step 2: Parse the log and generate a CSV for a specific task
python3 dev/scripts/mem_debug_to_csv.py --task 486 /path/to/executor/log > /tmp/mem.csv

# Step 3: Generate a chart
python3 dev/scripts/plot_memory_usage.py /tmp/mem.csv
```
## Reference