fix(rasterx): self-contained tile.raster bytes (no VRT payloads)#29
Merged
Conversation
Three RasterX operations — MergeRasters (gbx_rst_merge, gbx_rst_merge_agg),
MergeBands (gbx_rst_frombands), and PixelCombineRasters (gbx_rst_derivedband,
gbx_rst_derivedband_agg, gbx_rst_combineavg, gbx_rst_combineavg_agg) — were
returning tiles whose metadata("driver") claimed VRT even though the on-disk
file was a materialized GTiff. That mis-tag propagated through
RasterDriver.writeToBytes (which keys the temp-file extension AND the -of
flag in the inner gdal_translate call off metadata.driver), so the
serialized tile.raster payload ended up as VRT XML referencing a /vsimem/
tempfile only reachable on the producing executor. Single-node tests
passed by accident (the tempfile existed locally); multi-executor clusters
hit `file not found` when the VRT was opened elsewhere.
Two layers of fix:
1. GDALTranslate.executeTranslate now records result.GetDriver().getShortName
(output) in newOptions for both "driver" and "format" keys, not
raster.GetDriver().getShortName (input). Truthful metadata for every
caller, not just the VRT-using ones. Falls back to input driver if
result is null (translate error path) so existing error semantics
are preserved.
2. RasterDriver.writeToBytes defensively coerces driver=VRT (whether
reported via metadata or via the in-memory Dataset's GetDriver) to
GTiff before computing the tempfile extension and assembling the
translate options. Plus a post-write sniff: if the returned bytes
start with `<VRTDataset`, logs to stderr and re-translates via a
recovery path. Belt-and-braces against future regressions.
Regression coverage: new src/test/scala/.../RST_NoVrtPayloadTest.scala
asserts (a) tile.raster bytes don't start with `<VRTDataset`, (b) bytes
start with GTiff magic (II*\0 or MM\0*), (c) tile.metadata("driver")
isn't VRT — for each of the three internal operations via their public
SQL/API entry points (rst_merge_agg, rst_frombands, rst_combineavg_agg).
docs/docs/beta-release-notes.mdx:
- New "What's new in v0.3.1" section documenting the fix.
- ":::warning Known issue in v0.3.0" callout listing the 7 affected
functions and the multi-executor symptom; recommends v0.3.1 upgrade.
Co-authored-by: Isaac
v0.3.0 has been deleted upstream and will be re-cut from this branch once merged — no one was on it yet. So the VRT-payload fix lands as a v0.3.0 highlight rather than a v0.3.1 patch: - Drop the "What's new in v0.3.1" section. - Drop the ":::warning Known issue in v0.3.0" callout (the issue is gone in the re-cut v0.3.0). - Add the VRT-fix bullet to "What's new in v0.3.0" alongside the other raster-correctness items (rst_clip CRS, /vsimem/ hardening). - Revert the Current Version banner from 0.3.1 back to 0.3.0. Co-authored-by: Isaac
Two correctness fixes after running locally inside geobrix-dev: - rst_frombands takes a single Column (Array of band tile structs), not Column varargs. Wrapped the three band columns in array(...). - rst_combineavg_agg in isolation hits a `No such file or directory` error reading back the freshly-written VRT under NodeFilePathUtil.rootPath. Mirror the rst_clip preamble that RST_AggEvalTest uses — touches the same staging path setup that PixelCombineRasters relies on later. Both tests now go green. All three sub-tests pass: rst_merge_agg, rst_frombands, rst_combineavg_agg. Co-authored-by: Isaac
…ot_file The "Band Stacking + Clipping" notebook hit a matplotlib warning: `Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers). Got range [3658..65535].` Sentinel-2 / Landsat / most EO products are UInt16; raw values lie far outside matplotlib's [0..255] RGB integer range, so imshow clips everything above 255 to white and the rendered image is washed out. This is what most EO viewers (QGIS, EO Browser, sentinelhub-py) handle by default — a per-band percentile stretch. Refactor: - Extract _decimated_read, _needs_percentile_stretch, _percentile_stretch, and _render helpers from the previously-duplicated logic in plot_raster + plot_file. - _needs_percentile_stretch triggers only on integer rasters with max > 255; float rasters and small-range integers pass through untouched. - _percentile_stretch goes 2nd-98th per band, preserves the mask on MaskedArray input so nodata pixels (65535 Sentinel-2 saturation, 0 fill, etc.) keep their transparency, and outputs float32 in [0, 1] for clean imshow rendering. - _render picks 'viridis' for single-band and lets rasterio.plot.show auto-RGB for multi-band; suffixes the title with the decimation factor only when actually downsampled (avoids the previous "Scale: 1/1x" cosmetic when no decimation happened). - plot_raster now also reads masked=True (was previously inconsistent with plot_file). Surfaces during multi-executor validation of #29 — EO Series notebook 04 ("Band Stacking + Clipping") now renders the stacked tile correctly instead of clipping to white. Co-authored-by: Isaac
The beta release notes already cover the v0.3.0 VRT-payload fix as a
"What's new" bullet, but readers consulting the RasterX package overview
or the tile schema later won't find the invariant without digging into
release notes.
Adds a short "Tile payload" section to docs/docs/packages/rasterx.mdx
between Function Categories and Usage Examples, stating:
- Every RasterX function returns a self-contained, in-memory raster
(GTiff by default).
- Safe to serialize between Spark stages / executors, persist to
Delta, hand off to rasterio / gdal, or write via the gdal writer.
- Never an XML reference to a per-executor /vsimem/ tempfile.
- Names the 7 VRT-internal functions that materialize to GTiff before
returning, with the diagnostic hook tile.metadata.driver == "GTiff".
- Cross-link to the beta release notes for the historical fix.
Co-authored-by: Isaac
…g fixes
End-to-end re-run of the four EO Series example notebooks against
this PR's GeoBrix jar on a multi-executor cluster — confirms the VRT
fix, the plot_raster percentile-stretch upgrade, and the grid-based
band-shape filter all behave on real Sentinel-2 data.
Notebook-side changes (high level — most of the diff is rendered cell
outputs / embedded images now that plot_raster produces real images
instead of clipped-to-white UInt16 noise):
- Suppress the harmless line_magic_sanitizer UserWarning emitted by
Databricks's IPython preprocessing on `%run` invocations; small
warnings.filterwarnings block at the top of each notebook.
- 04. Band Stacking + Clipping: replace the byte-size band-shape
filter (.filter("red_size == nir_size")) with the pixel-grid
equivalent using rst_width / rst_height. Byte size varied
band-to-band with DEFLATE compression even when raster dimensions
were identical, so the old filter was over-strict (~8 rows passed
vs. ~19,816 with the dimension check).
Cluster output captured in the notebook cells confirms: post-fix,
plot_raster renders real RGB images from rst_frombands → rst_clip →
rst_merge_agg chains across executors, tile.metadata.driver reports
GTiff (not VRT), and the band-stacking → gdal-writer round-trip
completes without missing-file errors.
Co-authored-by: Isaac
The doc-tests.yml workflow has been broken by recent doc-structure changes; the badge sits in the README banner showing red. Dropping it rather than chasing the workflow back to green right now — the static `documentation` shield on the next line still surfaces the published docs site, so no reader-facing capability is lost. Codecov badge stays in place; the "unknown" rendering is a separate issue tracked off-band (likely a codecov.io repo-activation or token question, not a code change). Co-authored-by: Isaac
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three RasterX operations historically returned tiles whose
metadata("driver")claimedVRTeven though the on-disk file was a materialized GTiff. That mis-tag propagated throughRasterDriver.writeToBytes(which keys the tempfile extension AND the-offlag in the innergdal_translatecall offmetadata.driver), so the serializedtile.rasterpayload ended up as VRT XML referencing a/vsimem/tempfile only reachable on the producing executor. Single-node tests passed by accident (the tempfile existed locally); multi-executor clusters hitfile not foundwhen the VRT was opened elsewhere — surfaced innotebooks/examples/eo-series/03. Gridded EO Data.ipynbafterrst_merge_agg(...)on h3 k-rings.Affected functions (all flow through one of three internal ops with the same bug pattern):
MergeRastersgbx_rst_merge,gbx_rst_merge_aggMergeBandsgbx_rst_frombandsPixelCombineRastersgbx_rst_derivedband,gbx_rst_derivedband_agg,gbx_rst_combineavg,gbx_rst_combineavg_aggTwo-layer fix:
GDALTranslate.executeTranslate— recordresult.GetDriver().getShortName(output) innewOptionsfor both"driver"and"format", notraster.GetDriver().getShortName(input). Truthful metadata for every caller, not just the VRT-using ones. Falls back to input driver ifresultis null (translate error path) so existing error semantics are preserved.RasterDriver.writeToBytes— defensively coercedriver=VRT(whether from metadata or from the in-memory Dataset'sGetDriver) toGTiffbefore computing the tempfile extension and assembling the translate options. Plus a post-write sniff: if the returned bytes start with<VRTDataset, log to stderr and re-translate via a recovery path. Belt-and-braces against future regressions.Regression coverage: new
RST_NoVrtPayloadTestasserts (a)tile.rasterbytes don't start with<VRTDataset, (b) bytes start with GTiff magic (II*\0orMM\0*), (c)tile.metadata("driver")isn't"VRT"— exercised once per internal op via its public SQL/API entry point (rst_merge_agg,rst_frombands,rst_combineavg_agg).Docs (
docs/docs/beta-release-notes.mdx):rst_clipCRS axis-order,/vsimem/hardening).v0.3.0 release status: the previous v0.3.0 release/tag was deleted upstream (no one was on it yet); once this PR merges, v0.3.0 will be re-cut from main and
package-geobrix-artifacts.ymlwill publish it with the full asset set (JAR, wheel, GDAL tarball, sidecar, init script, docs zip). The original release-flow infrastructure landed in #28.Test plan
RST_NoVrtPayloadTestlocally and on CI — all three sub-tests green:RST_AggEvalTestto confirm no regression in the unchanged aggregator path.notebooks/examples/eo-series/03. Gridded EO Data.ipynbagainst an attached multi-worker cluster.mainjar):plot_raster(kring_df.select("tile.raster").first()[0])returns VRT XML.plot_raster(...)renders a real raster image;kring_df.select("tile.metadata").first()[0]["driver"]reportsGTiff.package-geobrix-artifacts.ymlagainst it, confirm all six release assets attach to the v0.3.0 release page.Context for the release-flow infrastructure this PR's fix ships through: #28
This pull request and its description were written by Isaac.