
Fix DecompressMemMapReader.read returning b'' before EOF (#120) #135

Open

mvanhorn wants to merge 1 commit into gitpython-developers:master from mvanhorn:fix/120-empty-read-before-eof

Conversation

mvanhorn (Contributor) commented May 8, 2026

Summary

Closes #120

DecompressMemMapReader.read(N) could return b'' before _br == _s when N was small enough that the underlying zlib decompress object's decompress(indata, N) call consumed input bytes without producing any output (e.g. while ingesting the zlib header / dictionary frames). Callers using the standard while chunk := stream.read(N) idiom therefore terminated at the first empty chunk -- before reaching the actual end of the uncompressed object. Reproduced with chunk sizes 1, 4, 16 against a 13 KB stream of compressible data.
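
For context, the empty-output behavior comes straight from zlib: a decompress object can consume input (here, just the two-byte zlib header) without emitting any output yet. A minimal stdlib-only repro:

import zlib

data = b"hello world! " * 1000
zdata = zlib.compress(data)

d = zlib.decompressobj()
# Feed only the 2-byte zlib header while asking for at most 1 output byte:
out = d.decompress(zdata[:2], 1)
assert out == b""     # input was consumed, but no output produced yet
assert not d.eof      # ...and the stream is nowhere near its end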

Why the original guard existed

The previous if dcompdat and ... guard at the recursion site was added so that compressed_bytes_read() could drive read() in a scrub loop that intentionally resets _br to 0 while the inner zlib object is already past EOF. Without that guard, read() would recurse forever during the scrub, because _br < _s stays true.
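
The hazard is easy to see with a bare decompress object: once it has hit EOF, every further decompress() call yields b'' and makes no progress, which is exactly the state the scrub puts the reader in. A stdlib-only illustration:

import zlib

d = zlib.decompressobj()
d.decompress(zlib.compress(b"payload"))
assert d.eof

# Past EOF, further calls return b'' and consume nothing -- a refill
# strategy keyed only on "_br < _s" can never make progress from here:
for _ in range(3):
    assert d.decompress(b"", 16) == b""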

Fix

Convert the recursive "refill to size" branch into an iterative loop around the existing decompress block. The loop terminates when:

  1. len(dcompdat) >= size (caller's request satisfied), OR
  2. _br >= _s (uncompressed object fully read), OR
  3. Inner zlib produced an empty chunk AND (zlib has hit EOF, OR the input window has run off self._m, OR no compressed bytes were consumed on this turn).

Condition 3 preserves the compressed_bytes_read() scrub safety: once the inner zip is at EOF, the loop breaks instead of looping forever. It also bounds the truncated-stream case (zip.eof never becomes true) by the input-exhaustion guard, and bounds the crafted "many empty deflate blocks" attack -- previously this could blow Python's recursion depth at ~1500+ stored-block headers because each recursion only consumed a handful of input bytes; the iterative form walks the stream forward without growing the stack.
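
For illustration, here is the same termination logic written as a standalone loop over a bare zlib.decompressobj, with the reader's _br/_s/_m state reduced to local bookkeeping. This is a sketch of the loop shape, not the patched gitdb code:

import zlib

def read_chunk(d, zdata, pos, size, window=64):
    """Pull up to `size` decompressed bytes out of decompressobj `d`,
    looping instead of recursing. Returns (chunk, new_pos)."""
    out = b""
    while len(out) < size:                    # 1: request not yet satisfied
        indata = d.unconsumed_tail
        if not indata:
            indata = zdata[pos:pos + window]  # slide the input window forward
            pos += len(indata)
        piece = d.decompress(indata, size - len(out))
        out += piece
        if d.eof:                             # 2: stream fully decoded
            break
        if not piece and not indata:          # 3: no output AND no input left
            break                             #    (truncated stream) -- stop
    return out, pos

data = b"hello world! " * 1000
zdata = zlib.compress(data)
d, pos, out = zlib.decompressobj(), 0, b""
while True:
    chunk, pos = read_chunk(d, zdata, pos, 16)
    if not chunk:
        break
    out += chunk
assert out == data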

Tests

  • New: test_decompress_reader_chunked_read_does_not_terminate_early reads a 13 KB highly-compressible stream with chunk sizes 1, 4, 16, 64 and asserts that the full payload is returned and that _br == _s (a sketch follows the test run below).
  • Existing: gitdb/test/ (24 tests) all pass, including test_decompress_reader_special_case and test_pack -- both exercise the compressed_bytes_read scrub path.
$ pytest gitdb/test
======================== 24 passed, 1 skipped in 7.10s ========================
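
For reference, a sketch of what the new regression test plausibly looks like (the actual fixture and assertions in gitdb/test may differ):

import zlib
from gitdb import DecompressMemMapReader

def test_decompress_reader_chunked_read_does_not_terminate_early():
    data = b"hello world! " * 1000          # ~13 KB, highly compressible
    zdata = zlib.compress(data)
    for chunk_size in (1, 4, 16, 64):
        r = DecompressMemMapReader(zdata, close_on_deletion=False,
                                   size=len(data))
        out = b""
        while chunk := r.read(chunk_size):  # standard read-until-empty idiom
            out += chunk
        assert out == data
        assert r._br == r._s                # object fully consumed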

Manual smoke

import zlib
from gitdb import DecompressMemMapReader

data = b"hello world! " * 1000              # 13 KB, highly compressible
zdata = zlib.compress(data)
for chunk_size in (1, 4, 16, 64, 4096):
    r = DecompressMemMapReader(zdata, close_on_deletion=False, size=len(data))
    out = b""
    # The standard read-until-empty idiom that previously terminated early:
    while chunk := r.read(chunk_size):
        out += chunk
    assert out == data, chunk_size

(Before the fix, this fails for chunk_size in (1, 4, 16).)

Truncated input is handled the same as before: read() returns whatever was decoded so far, then b'', instead of looping or raising. The crafted recursion-depth attack (~5000 empty deflate blocks ahead of one valid block) now decodes correctly instead of raising RecursionError.
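
The truncated-stream contract holds at the zlib layer as well, which a short stdlib-only check makes concrete (the reader analogue returns the decoded prefix, then b''):

import zlib

data = b"hello world! " * 1000
trunc = zlib.compress(data)[:-8]     # chop the stream's tail (incl. checksum)

d = zlib.decompressobj()
partial = d.decompress(trunc)        # decodes what it can, does not raise
assert data.startswith(partial)
assert not d.eof                     # end of stream is never reached...
assert d.decompress(b"") == b""      # ...subsequent reads just return b''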

Closes gitpython-developers#120

DecompressMemMapReader.read(N) could return b'' mid-stream when zlib
consumed input without producing output on a single decompress call
(small N, header / dictionary frames in flight). The original
`if dcompdat and ...` guard at the recursion site skipped the
"refill to size" recursion in that case, so a caller using the
standard idiom

    while chunk := stream.read(4096):
        yield chunk

terminated at the first empty chunk -- before _br == _s.

The guard exists for compressed_bytes_read(), which manipulates
_br=0 and then drains the inner zip past its EOF. Recursing there
would loop forever because the inner zip is already done.

The fix uses zlib's own `eof` attribute (available on standard
zlib.Decompress objects since Python 3.3) to distinguish:

  - dcompdat empty AND zip not at EOF -> still digesting, recurse
  - dcompdat empty AND zip at EOF     -> compressed_bytes_read
                                         scrub or genuine EOF; do
                                         not recurse.

`getattr(_zip, 'eof', False)` keeps the conservative behavior
when running against a custom zlib object that does not expose
the attribute.

Adds a regression test that reads with chunk_size in
{1, 4, 16, 64} from a 13 KB highly-compressible stream. With the
old guard, the chunk_size <= 16 cases stopped at byte 0; the new
test asserts they read all 13000 bytes.

The full existing test suite (24 tests) still passes, including
test_decompress_reader_special_case and test_pack which exercise
the compressed_bytes_read scrub path that the original guard
existed to protect.