
Fix DecompressMemMapReader.read returning b'' before EOF (#120) #135

Open

mvanhorn wants to merge 1 commit into gitpython-developers:master from mvanhorn:fix/120-empty-read-before-eof

Conversation

mvanhorn (Contributor) commented May 8, 2026

Summary

Closes #120

DecompressMemMapReader.read(N) could return b'' before _br == _s when N was small enough that the underlying zlib decompress object's decompress(indata, N) call consumed input bytes without producing any output (e.g. while ingesting the zlib header / dictionary frames). Callers using the standard while chunk := stream.read(N) idiom therefore terminated at the first empty chunk -- before reaching the actual end of the uncompressed object. Reproduced with chunk sizes 1, 4, 16 against a 13 KB stream of compressible data.
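
For context, the empty-output behavior comes straight from zlib: a decompress object can consume input (here, just the two-byte zlib header) without emitting any output yet. A minimal stdlib-only repro:

import zlib

data = b"hello world! " * 1000
zdata = zlib.compress(data)

d = zlib.decompressobj()
# Feed only the 2-byte zlib header while asking for at most 1 output byte:
out = d.decompress(zdata[:2], 1)
assert out == b""     # input was consumed, but no output produced yet
assert not d.eof      # ...and the stream is nowhere near its end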

Why the original guard existed

The previous if dcompdat and ... guard at the recursion site was added so that compressed_bytes_read() could drive read() in a scrub loop that intentionally resets _br to 0 while the inner zlib object is already past EOF. Without that guard, read() would recurse forever during the scrub, because _br < _s stays true.
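
The hazard is easy to see with a bare decompress object: once it has hit EOF, every further decompress() call yields b'' and makes no progress, which is exactly the state the scrub puts the reader in. A stdlib-only illustration:

import zlib

d = zlib.decompressobj()
d.decompress(zlib.compress(b"payload"))
assert d.eof

# Past EOF, further calls return b'' and consume nothing -- a refill
# strategy keyed only on "_br < _s" can never make progress from here:
for _ in range(3):
    assert d.decompress(b"", 16) == b""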

Fix

Convert the recursive "refill to size" branch into an iterative loop around the existing decompress block. The loop terminates when:

  1. len(dcompdat) >= size (caller's request satisfied), OR
  2. _br >= _s (uncompressed object fully read), OR
  3. Inner zlib produced an empty chunk AND (zlib has hit EOF, OR the input window has run off self._m, OR no compressed bytes were consumed on this turn).

Condition 3 preserves the compressed_bytes_read() scrub safety: once the inner zip is at EOF, the loop breaks instead of looping forever. It also bounds the truncated-stream case (zip.eof never becomes true) by the input-exhaustion guard, and bounds the crafted "many empty deflate blocks" attack -- previously this could blow Python's recursion depth at ~1500+ stored-block headers because each recursion only consumed a handful of input bytes; the iterative form walks the stream forward without growing the stack.
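
For illustration, here is the same termination logic written as a standalone loop over a bare zlib.decompressobj, with the reader's _br/_s/_m state reduced to local bookkeeping. This is a sketch of the loop shape, not the patched gitdb code:

import zlib

def read_chunk(d, zdata, pos, size, window=64):
    """Pull up to `size` decompressed bytes out of decompressobj `d`,
    looping instead of recursing. Returns (chunk, new_pos)."""
    out = b""
    while len(out) < size:                    # 1: request not yet satisfied
        indata = d.unconsumed_tail
        if not indata:
            indata = zdata[pos:pos + window]  # slide the input window forward
            pos += len(indata)
        piece = d.decompress(indata, size - len(out))
        out += piece
        if d.eof:                             # 2: stream fully decoded
            break
        if not piece and not indata:          # 3: no output AND no input left
            break                             #    (truncated stream) -- stop
    return out, pos

data = b"hello world! " * 1000
zdata = zlib.compress(data)
d, pos, out = zlib.decompressobj(), 0, b""
while True:
    chunk, pos = read_chunk(d, zdata, pos, 16)
    if not chunk:
        break
    out += chunk
assert out == data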

Tests

  • New: test_decompress_reader_chunked_read_does_not_terminate_early reads a 13 KB highly-compressible stream with chunk sizes 1, 4, 16, 64 and asserts that the full payload is returned and that _br == _s (a sketch follows the test run below).
  • Existing: gitdb/test/ (24 tests) all pass, including test_decompress_reader_special_case and test_pack -- both exercise the compressed_bytes_read scrub path.
$ pytest gitdb/test
======================== 24 passed, 1 skipped in 7.10s ========================
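
For reference, a sketch of what the new regression test plausibly looks like (the actual fixture and assertions in gitdb/test may differ):

import zlib
from gitdb import DecompressMemMapReader

def test_decompress_reader_chunked_read_does_not_terminate_early():
    data = b"hello world! " * 1000          # ~13 KB, highly compressible
    zdata = zlib.compress(data)
    for chunk_size in (1, 4, 16, 64):
        r = DecompressMemMapReader(zdata, close_on_deletion=False,
                                   size=len(data))
        out = b""
        while chunk := r.read(chunk_size):  # standard read-until-empty idiom
            out += chunk
        assert out == data
        assert r._br == r._s                # object fully consumed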

Manual smoke

import zlib
from gitdb import DecompressMemMapReader

data = b"hello world! " * 1000              # 13 KB, highly compressible
zdata = zlib.compress(data)
for chunk_size in (1, 4, 16, 64, 4096):
    r = DecompressMemMapReader(zdata, close_on_deletion=False, size=len(data))
    out = b""
    # The standard read-until-empty idiom that previously terminated early:
    while chunk := r.read(chunk_size):
        out += chunk
    assert out == data, chunk_size

(Before the fix, this fails for chunk_size in (1, 4, 16).)

Truncated input is handled the same as before: read() returns whatever was decoded so far, then b'', instead of looping or raising. The crafted recursion-depth attack (~5000 empty deflate blocks ahead of one valid block) now decodes correctly instead of raising RecursionError.
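
The truncated-stream contract holds at the zlib layer as well, which a short stdlib-only check makes concrete (the reader analogue returns the decoded prefix, then b''):

import zlib

data = b"hello world! " * 1000
trunc = zlib.compress(data)[:-8]     # chop the stream's tail (incl. checksum)

d = zlib.decompressobj()
partial = d.decompress(trunc)        # decodes what it can, does not raise
assert data.startswith(partial)
assert not d.eof                     # end of stream is never reached...
assert d.decompress(b"") == b""      # ...subsequent reads just return b''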

Closes gitpython-developers#120

DecompressMemMapReader.read(N) could return b'' mid-stream when zlib
consumed input without producing output on a single decompress call
(small N, header / dictionary frames in flight). The original
`if dcompdat and ...` guard at the recursion site skipped the
"refill to size" recursion in that case, so a caller using the
standard idiom

    while chunk := stream.read(4096):
        yield chunk

terminated at the first empty chunk -- before _br == _s.

The guard exists for compressed_bytes_read(), which manipulates
_br=0 and then drains the inner zip past its EOF. Recursing there
would loop forever because the inner zip is already done.

The fix uses zlib's own `eof` attribute (available on standard
zlib.Decompress objects since Python 3.3) to distinguish:

  - dcompdat empty AND zip not at EOF -> still digesting, recurse
  - dcompdat empty AND zip at EOF     -> compressed_bytes_read
                                         scrub or genuine EOF; do
                                         not recurse.

`getattr(_zip, 'eof', False)` keeps the conservative behavior
when running against a custom zlib object that does not expose
the attribute.

Adds a regression test that reads with chunk_size in
{1, 4, 16, 64} from a 13 KB highly-compressible stream. With the
old guard, the chunk_size <= 16 cases stopped at byte 0; the new
test asserts they read all 13000 bytes.

The full existing test suite (24 tests) still passes, including
test_decompress_reader_special_case and test_pack which exercise
the compressed_bytes_read scrub path that the original guard
existed to protect.