Skip to content

fix(rasterx): self-contained tile.raster bytes (no VRT payloads)#29

Merged
mjohns-databricks merged 7 commits into
mainfrom
vrt-payload
May 20, 2026
Merged

fix(rasterx): self-contained tile.raster bytes (no VRT payloads)#29
mjohns-databricks merged 7 commits into
mainfrom
vrt-payload

Conversation

@mjohns-databricks
Copy link
Copy Markdown
Collaborator

@mjohns-databricks mjohns-databricks commented May 20, 2026

Summary

Three RasterX operations historically returned tiles whose metadata("driver") claimed VRT even though the on-disk file was a materialized GTiff. That mis-tag propagated through RasterDriver.writeToBytes (which keys the tempfile extension AND the -of flag in the inner gdal_translate call off metadata.driver), so the serialized tile.raster payload ended up as VRT XML referencing a /vsimem/ tempfile only reachable on the producing executor. Single-node tests passed by accident (the tempfile existed locally); multi-executor clusters hit file not found when the VRT was opened elsewhere — surfaced in notebooks/examples/eo-series/03. Gridded EO Data.ipynb after rst_merge_agg(...) on h3 k-rings.

Affected functions (all flow through one of three internal ops with the same bug pattern):

Internal op SQL / API functions
MergeRasters gbx_rst_merge, gbx_rst_merge_agg
MergeBands gbx_rst_frombands
PixelCombineRasters gbx_rst_derivedband, gbx_rst_derivedband_agg, gbx_rst_combineavg, gbx_rst_combineavg_agg

Two-layer fix:

  1. GDALTranslate.executeTranslate — record result.GetDriver().getShortName (output) in newOptions for both "driver" and "format", not raster.GetDriver().getShortName (input). Truthful metadata for every caller, not just the VRT-using ones. Falls back to input driver if result is null (translate error path) so existing error semantics are preserved.
  2. RasterDriver.writeToBytes — defensively coerce driver=VRT (whether from metadata or from the in-memory Dataset's GetDriver) to GTiff before computing the tempfile extension and assembling the translate options. Plus a post-write sniff: if the returned bytes start with <VRTDataset, log to stderr and re-translate via a recovery path. Belt-and-braces against future regressions.

Regression coverage: new RST_NoVrtPayloadTest asserts (a) tile.raster bytes don't start with <VRTDataset, (b) bytes start with GTiff magic (II*\0 or MM\0*), (c) tile.metadata("driver") isn't "VRT" — exercised once per internal op via its public SQL/API entry point (rst_merge_agg, rst_frombands, rst_combineavg_agg).

Docs (docs/docs/beta-release-notes.mdx):

  • The VRT fix is added as a new bullet in "What's new in v0.3.0" alongside the other raster-correctness items (rst_clip CRS axis-order, /vsimem/ hardening).

v0.3.0 release status: the previous v0.3.0 release/tag was deleted upstream (no one was on it yet); once this PR merges, v0.3.0 will be re-cut from main and package-geobrix-artifacts.yml will publish it with the full asset set (JAR, wheel, GDAL tarball, sidecar, init script, docs zip). The original release-flow infrastructure landed in #28.

Test plan

  • Run RST_NoVrtPayloadTest locally and on CI — all three sub-tests green:
    gbx:test:scala --suite 'com.databricks.labs.gbx.rasterx.expressions.RST_NoVrtPayloadTest'
    
  • Run the existing RST_AggEvalTest to confirm no regression in the unchanged aggregator path.
  • Reproduce the original bug on a multi-executor cluster pre-fix vs. post-fix:
    • Run notebooks/examples/eo-series/03. Gridded EO Data.ipynb against an attached multi-worker cluster.
    • Pre-fix (prior main jar): plot_raster(kring_df.select("tile.raster").first()[0]) returns VRT XML.
    • Post-fix (this PR's jar): plot_raster(...) renders a real raster image; kring_df.select("tile.metadata").first()[0]["driver"] reports GTiff.
  • After merge: cut the v0.3.0 tag, run package-geobrix-artifacts.yml against it, confirm all six release assets attach to the v0.3.0 release page.

Context for the release-flow infrastructure this PR's fix ships through: #28

This pull request and its description were written by Isaac.

Three RasterX operations — MergeRasters (gbx_rst_merge, gbx_rst_merge_agg),
MergeBands (gbx_rst_frombands), and PixelCombineRasters (gbx_rst_derivedband,
gbx_rst_derivedband_agg, gbx_rst_combineavg, gbx_rst_combineavg_agg) — were
returning tiles whose metadata("driver") claimed VRT even though the on-disk
file was a materialized GTiff. That mis-tag propagated through
RasterDriver.writeToBytes (which keys the temp-file extension AND the -of
flag in the inner gdal_translate call off metadata.driver), so the
serialized tile.raster payload ended up as VRT XML referencing a /vsimem/
tempfile only reachable on the producing executor. Single-node tests
passed by accident (the tempfile existed locally); multi-executor clusters
hit `file not found` when the VRT was opened elsewhere.

Two layers of fix:

1. GDALTranslate.executeTranslate now records result.GetDriver().getShortName
   (output) in newOptions for both "driver" and "format" keys, not
   raster.GetDriver().getShortName (input). Truthful metadata for every
   caller, not just the VRT-using ones. Falls back to input driver if
   result is null (translate error path) so existing error semantics
   are preserved.

2. RasterDriver.writeToBytes defensively coerces driver=VRT (whether
   reported via metadata or via the in-memory Dataset's GetDriver) to
   GTiff before computing the tempfile extension and assembling the
   translate options. Plus a post-write sniff: if the returned bytes
   start with `<VRTDataset`, logs to stderr and re-translates via a
   recovery path. Belt-and-braces against future regressions.

Regression coverage: new src/test/scala/.../RST_NoVrtPayloadTest.scala
asserts (a) tile.raster bytes don't start with `<VRTDataset`, (b) bytes
start with GTiff magic (II*\0 or MM\0*), (c) tile.metadata("driver")
isn't VRT — for each of the three internal operations via their public
SQL/API entry points (rst_merge_agg, rst_frombands, rst_combineavg_agg).

docs/docs/beta-release-notes.mdx:
- New "What's new in v0.3.1" section documenting the fix.
- ":::warning Known issue in v0.3.0" callout listing the 7 affected
  functions and the multi-executor symptom; recommends v0.3.1 upgrade.

Co-authored-by: Isaac
v0.3.0 has been deleted upstream and will be re-cut from this branch
once merged — no one was on it yet. So the VRT-payload fix lands as a
v0.3.0 highlight rather than a v0.3.1 patch:

- Drop the "What's new in v0.3.1" section.
- Drop the ":::warning Known issue in v0.3.0" callout (the issue is
  gone in the re-cut v0.3.0).
- Add the VRT-fix bullet to "What's new in v0.3.0" alongside the
  other raster-correctness items (rst_clip CRS, /vsimem/ hardening).
- Revert the Current Version banner from 0.3.1 back to 0.3.0.

Co-authored-by: Isaac
Two correctness fixes after running locally inside geobrix-dev:

- rst_frombands takes a single Column (Array of band tile structs), not
  Column varargs. Wrapped the three band columns in array(...).

- rst_combineavg_agg in isolation hits a `No such file or directory`
  error reading back the freshly-written VRT under
  NodeFilePathUtil.rootPath. Mirror the rst_clip preamble that
  RST_AggEvalTest uses — touches the same staging path setup that
  PixelCombineRasters relies on later. Both tests now go green.

All three sub-tests pass: rst_merge_agg, rst_frombands, rst_combineavg_agg.

Co-authored-by: Isaac
…ot_file

The "Band Stacking + Clipping" notebook hit a matplotlib warning:
`Clipping input data to the valid range for imshow with RGB data
([0..1] for floats or [0..255] for integers). Got range [3658..65535].`
Sentinel-2 / Landsat / most EO products are UInt16; raw values lie far
outside matplotlib's [0..255] RGB integer range, so imshow clips
everything above 255 to white and the rendered image is washed out.

This is what most EO viewers (QGIS, EO Browser, sentinelhub-py) handle
by default — a per-band percentile stretch. Refactor:

- Extract _decimated_read, _needs_percentile_stretch, _percentile_stretch,
  and _render helpers from the previously-duplicated logic in
  plot_raster + plot_file.
- _needs_percentile_stretch triggers only on integer rasters with max
  > 255; float rasters and small-range integers pass through untouched.
- _percentile_stretch goes 2nd-98th per band, preserves the mask on
  MaskedArray input so nodata pixels (65535 Sentinel-2 saturation, 0
  fill, etc.) keep their transparency, and outputs float32 in [0, 1]
  for clean imshow rendering.
- _render picks 'viridis' for single-band and lets rasterio.plot.show
  auto-RGB for multi-band; suffixes the title with the decimation
  factor only when actually downsampled (avoids the previous
  "Scale: 1/1x" cosmetic when no decimation happened).
- plot_raster now also reads masked=True (was previously inconsistent
  with plot_file).

Surfaces during multi-executor validation of #29 — EO Series notebook
04 ("Band Stacking + Clipping") now renders the stacked tile correctly
instead of clipping to white.

Co-authored-by: Isaac
The beta release notes already cover the v0.3.0 VRT-payload fix as a
"What's new" bullet, but readers consulting the RasterX package overview
or the tile schema later won't find the invariant without digging into
release notes.

Adds a short "Tile payload" section to docs/docs/packages/rasterx.mdx
between Function Categories and Usage Examples, stating:
  - Every RasterX function returns a self-contained, in-memory raster
    (GTiff by default).
  - Safe to serialize between Spark stages / executors, persist to
    Delta, hand off to rasterio / gdal, or write via the gdal writer.
  - Never an XML reference to a per-executor /vsimem/ tempfile.
  - Names the 7 VRT-internal functions that materialize to GTiff before
    returning, with the diagnostic hook tile.metadata.driver == "GTiff".
  - Cross-link to the beta release notes for the historical fix.

Co-authored-by: Isaac
…g fixes

End-to-end re-run of the four EO Series example notebooks against
this PR's GeoBrix jar on a multi-executor cluster — confirms the VRT
fix, the plot_raster percentile-stretch upgrade, and the grid-based
band-shape filter all behave on real Sentinel-2 data.

Notebook-side changes (high level — most of the diff is rendered cell
outputs / embedded images now that plot_raster produces real images
instead of clipped-to-white UInt16 noise):

- Suppress the harmless line_magic_sanitizer UserWarning emitted by
  Databricks's IPython preprocessing on `%run` invocations; small
  warnings.filterwarnings block at the top of each notebook.
- 04. Band Stacking + Clipping: replace the byte-size band-shape
  filter (.filter("red_size == nir_size")) with the pixel-grid
  equivalent using rst_width / rst_height. Byte size varied
  band-to-band with DEFLATE compression even when raster dimensions
  were identical, so the old filter was over-strict (~8 rows passed
  vs. ~19,816 with the dimension check).

Cluster output captured in the notebook cells confirms: post-fix,
plot_raster renders real RGB images from rst_frombands → rst_clip →
rst_merge_agg chains across executors, tile.metadata.driver reports
GTiff (not VRT), and the band-stacking → gdal-writer round-trip
completes without missing-file errors.

Co-authored-by: Isaac
The doc-tests.yml workflow has been broken by recent doc-structure
changes; the badge sits in the README banner showing red. Dropping it
rather than chasing the workflow back to green right now — the static
`documentation` shield on the next line still surfaces the published
docs site, so no reader-facing capability is lost.

Codecov badge stays in place; the "unknown" rendering is a separate
issue tracked off-band (likely a codecov.io repo-activation or token
question, not a code change).

Co-authored-by: Isaac
@mjohns-databricks mjohns-databricks merged commit 0dd7158 into main May 20, 2026
6 checks passed
@mjohns-databricks mjohns-databricks deleted the vrt-payload branch May 20, 2026 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant