Skip to content

geotiff: validate HTTP range responses (#1735)#1744

Merged
brendancol merged 2 commits into
mainfrom
issue-1735
May 12, 2026
Merged

geotiff: validate HTTP range responses (#1735)#1744
brendancol merged 2 commits into
mainfrom
issue-1735

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • _HTTPSource.read_range previously returned the response body without checking status, Content-Range, or length. urllib3 does not raise on non-2xx by default, and a server that ignored Range returned the full object as 200 — both slipped through silently.
  • Add _validate_range_response covering status, Content-Range parse + offset match, and body-length vs Content-Range agreement.
  • Short reads near EOF are still accepted: the validator only requires Content-Range start to equal the requested start, and the body length to equal what the server itself reported via Content-Range.
  • Wired into both the urllib3 and stdlib urllib paths.

Closes #1735.

Test plan

  • New test_http_range_validation_1735.py covers a server that ignores Range (status 200 at non-zero start), a server with mismatched Content-Range, a truncated body, and a well-formed 206.
  • Full HTTP-tagged geotiff suite still passes (64/64).

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026
@brendancol brendancol requested a review from Copilot May 12, 2026 20:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the GeoTIFF HTTP range reader by validating that server responses actually satisfy Range requests, preventing silent misreads (wrong offsets), truncated bodies, and non-partial responses from being treated as valid byte windows.

Changes:

  • Added _HTTPSource._validate_range_response to validate status codes, parse/verify Content-Range, and check body length consistency.
  • Integrated the validation into both urllib3 and stdlib urllib transport paths in _HTTPSource.read_range.
  • Added regression tests that spin up local HTTP servers to simulate misbehaving range responses (issue #1735).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
xrspatial/geotiff/_reader.py Adds and wires in HTTP range-response validation for status/headers/body-length correctness.
xrspatial/geotiff/tests/test_http_range_validation_1735.py Adds loopback-server regression tests covering ignored ranges, mismatched Content-Range, short bodies, and a well-formed 206.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 682 to 685
def read_range(self, start: int, length: int) -> bytes:
end = start + length - 1
headers = {'Range': f'bytes={start}-{end}'}
if self._pool is not None:
)
if len(data) < min(length, 1):
# Empty body but caller wanted bytes.
raise OSError("HTTP 200 response body is empty.")
_HTTPSource.read_range previously handed the response body to the
caller without checking the status code, Content-Range header, or
returned byte length. Three failure modes slipped through silently:

- urllib3 by default does not raise on non-2xx; a 4xx/5xx body
  reached the codec as if it were the requested range.
- A server that ignored Range and returned the whole object as a 200
  with no Content-Range was trusted for any non-zero start, handing
  the codec wrong-offset bytes.
- A truncated body, or one whose Content-Range header did not match
  the request, surfaced as an opaque decode error far from the real
  cause.

Add _validate_range_response that checks status, parses Content-Range,
and confirms the body length matches what Content-Range advertises.
Short reads near EOF stay legitimate: the validator only requires the
Content-Range to start at the requested offset and the body length to
match what the server itself reported.

Applied to both the urllib3 and stdlib urllib code paths.

Adds regression coverage in test_http_range_validation_1735.py.
Two Copilot review follow-ups on PR #1744:

- _HTTPSource.read_range now short-circuits to b'' for length <= 0,
  matching the convention used by other source implementations like
  _BytesIOSource. Without the guard, Range: bytes=<start>-<start-1>
  goes on the wire, which is invalid and either trips a 416 from a
  well-behaved server or pulls down arbitrarily large bytes from a
  misbehaving one.

- _validate_range_response now slices the response to length bytes
  when status == 200, Content-Range is missing, and the body is
  longer than requested. The implicit contract of read_range is
  "at most length bytes", so a 16 KiB header prefetch against a
  2 GiB object should not drag the whole thing into memory just
  because the server ignored Range.

Tests cover both fixes.
@brendancol brendancol merged commit a4d410f into main May 12, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: HTTP range reader does not validate response status, Content-Range, or length

2 participants