Fix DecompressMemMapReader.read returning b'' before EOF (#120) #135
Open
mvanhorn wants to merge 1 commit into gitpython-developers:master from
Conversation
Closes gitpython-developers#120

`DecompressMemMapReader.read(N)` could return `b''` mid-stream when zlib consumed input without producing output on a single decompress call (small N, header / dictionary frames in flight). The original `if dcompdat and ...` guard at the recursion site skipped the "refill to size" recursion in that case, so a caller using the standard idiom `while chunk := stream.read(4096): yield chunk` terminated at the first empty chunk, before `_br == _s`.

The guard exists for `compressed_bytes_read()`, which manipulates `_br = 0` and then drains the inner zip past its EOF. Recursing there would loop forever because the inner zip is already done.

The fix uses zlib's own `eof` attribute (available on standard `zlib.Decompress` objects since Python 3.6) to distinguish:

- dcompdat empty AND zip not at EOF -> still digesting, recurse
- dcompdat empty AND zip at EOF -> `compressed_bytes_read` scrub or genuine EOF; do not recurse

`getattr(_zip, 'eof', False)` keeps the conservative behavior when running against a custom zlib object that does not expose the attribute.

Adds a regression test that reads with chunk_size in {1, 4, 16, 64} from a 13 KB highly-compressible stream. With the old guard, the chunk_size <= 16 cases stopped at byte 0; the new test asserts they read all 13000 bytes.

The full existing test suite (24 tests) still passes, including `test_decompress_reader_special_case` and `test_pack`, which exercise the `compressed_bytes_read` scrub path that the original guard existed to protect.
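The zlib behavior this relies on can be demonstrated standalone with the plain `zlib` module (this is not gitdb code, just an illustration of the semantics):

```python
# Demonstration: a zlib decompressor can consume input without producing
# output, and its `eof` attribute reports whether the stream finished.
# So b'' from a single decompress call does NOT imply end-of-stream.
import zlib

payload = b"x" * 13000          # highly compressible, like the test stream
comp = zlib.compress(payload)

d = zlib.decompressobj()

# Feed only the 2-byte zlib header: input is consumed, no output yet.
out = d.decompress(comp[:2])
assert out == b""
assert d.eof is False           # mid-stream: empty output, not EOF

# Feed the rest: output appears and eof flips to True.
out += d.decompress(comp[2:])
assert out == payload
assert d.eof is True
```

This is why `eof` is a reliable tiebreaker: an empty result with `eof == False` means the decompressor is still digesting input, while an empty result with `eof == True` means the stream is genuinely done.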
Summary
Closes #120
`DecompressMemMapReader.read(N)` could return `b''` before `_br == _s` when `N` was small enough that the underlying `zlib.decompress(indata, N)` consumed input bytes without producing output (e.g. while ingesting the zlib header / dictionary frames). Callers using the standard `while chunk := stream.read(N)` idiom terminated at the first empty chunk, before reaching the actual end of the uncompressed object. Reproduced with chunk sizes 1, 4, 16 against a 13 KB stream of compressible data.

Why the original guard existed
The previous `if dcompdat and ...` guard at the recursion site was added so that `compressed_bytes_read()` could drive `read()` in a scrub-loop that intentionally manipulates `_br = 0` while the inner zlib object is already past EOF. Without that guard, the scrub loop would recurse forever because `_br < _s` stays true.

Fix
Convert the recursive "refill to size" branch into an iterative loop around the existing decompress block. The loop terminates when:
1. `len(dcompdat) >= size` (caller's request satisfied), OR
2. `_br >= _s` (uncompressed object fully read), OR
3. the inner zip is at EOF, OR
4. input is exhausted: no bytes left in `self._m`, or no compressed bytes were consumed on this turn.

Condition 3 preserves the `compressed_bytes_read()` scrub safety: once the inner zip is at EOF, the loop breaks instead of looping forever. It also bounds the truncated-stream case (`zip.eof` never becomes true) by the input-exhaustion guard, and bounds the crafted "many empty deflate blocks" attack: previously this could blow Python's recursion depth at ~1500+ stored-block headers because each recursion only consumed a handful of input bytes; the iterative form walks the stream forward without growing the stack.

Tests
`test_decompress_reader_chunked_read_does_not_terminate_early` reads a 13 KB highly-compressible stream with chunk sizes 1, 4, 16, 64 and asserts the full payload is returned and `_br == _s`. The existing tests under `gitdb/test/` (24 tests) all pass, including `test_decompress_reader_special_case` and `test_pack`; both exercise the `compressed_bytes_read` scrub path.

Manual smoke
(Pre-fix this fails for `chunk_size in (1, 4, 16)`.)

Truncated input is handled the same as before: `read()` returns whatever was decoded so far, then `b''`, instead of looping or raising. The crafted recursion-depth attack (~5000 empty deflate blocks ahead of one valid block) now decodes correctly instead of raising `RecursionError`.
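The iterative refill loop and the chunked-read behavior described above can be sketched in a self-contained form. The `ChunkedZlibReader` class below is hypothetical, not gitdb's actual reader (which additionally manages memory-mapped windows and the scrub path); it only mirrors the loop's termination conditions using the same `_br` / `_s` naming:

```python
# Minimal sketch (NOT gitdb code) of the iterative "refill to size" loop:
# decompress repeatedly instead of recursing, so small chunked reads never
# see a premature b'' and crafted inputs cannot grow the Python stack.
import zlib


class ChunkedZlibReader:
    def __init__(self, compressed: bytes, uncompressed_size: int):
        self._m = compressed                 # compressed input buffer
        self._cws = 0                        # cursor into self._m
        self._zip = zlib.decompressobj()
        self._br = 0                         # uncompressed bytes returned
        self._s = uncompressed_size          # total uncompressed size

    def read(self, size: int) -> bytes:
        dcompdat = b""
        # Loop until the request is satisfied or the object is fully read.
        while len(dcompdat) < size and self._br + len(dcompdat) < self._s:
            if getattr(self._zip, "eof", False):
                break                        # inner zip done: never loop here
            indata = self._m[self._cws:self._cws + size]
            if not indata and not self._zip.unconsumed_tail:
                break                        # input exhausted (truncated stream)
            self._cws += len(indata)
            # Re-feed bytes zlib has seen but not yet processed.
            dcompdat += self._zip.decompress(
                self._zip.unconsumed_tail + indata, size - len(dcompdat))
        self._br += len(dcompdat)
        return dcompdat


# Mirror of the regression test: small chunk sizes must drain the full
# 13 KB payload instead of stopping at the first empty decompress result.
payload = b"a" * 13000
comp = zlib.compress(payload)
for n in (1, 4, 16, 64):
    reader = ChunkedZlibReader(comp, len(payload))
    got = b""
    while chunk := reader.read(n):
        got += chunk
    assert got == payload
```

The key difference from the recursive form is that every pass through the loop body advances the input cursor or drains `unconsumed_tail`, so progress is bounded by the input length rather than by Python's recursion limit.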