A misdeclared Content-Type renames files whose bytes prove the URL extension right #478
Merged
A URL whose extension maps to a specific type but is served with a disagreeing specific Content-Type was always renamed after the wire (photo.jpg served as image/png became photo.png). The contested verdict (#480) is now settled by the leading body bytes: magic proving the extension's own type keeps it, anything inconclusive trusts the wire as before, and the #267 soft-404 guard is unchanged.

New htssniff.c covers the magic-sniffable part of the supported MIME set (images, A/V containers by RIFF subtype and ftyp brand, zip/OLE document containers, archives, fonts, conservative text prefixes). hts_wait_delayed waits for a sniffable head (or EOF) only on contested verdicts; the head is read from the live backing slot (memory, url_sav, or the compressed-stream tmpfile, inflated in memory). Update runs never re-read bytes: they reproduce the previous run's verdict from the recorded X-Save name (cache_read_including_broken grows a return_save), so names never churn across updates or upgrades. Non-delayed mode never sniffs; its HEAD probe has no body on the first run. Also unlock the waiter's slot on the user-cancel abort.

Tests flip the #480 contract pins to the sniffed outcomes (wrongtype/bigtype/packed/mutant keep their extension, lie.png stays png), add -#test=sniff table rows, and pin the recorded-verdict proxy in 01_zlib-savename-cached (kept out of the MSan job: uninstrumented zlib). All discriminate against the pre-sniff binary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
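To picture what "A/V containers by RIFF subtype and ftyp brand" means in the commit message above, here is a minimal sketch. It is not the htssniff.c code: the function name `sniff_container`, the handful of subtypes and brands, and the MIME strings returned are all illustrative assumptions. The point is only that a container's outer magic (`RIFF`, `ftyp`) is not enough by itself, and the 4-byte subtype or brand at offset 8 decides the type.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from htssniff.c.
   Returns a MIME string when the head positively identifies a container
   subtype, or NULL when the bytes are inconclusive. */
static const char *sniff_container(const unsigned char *head, size_t n) {
  /* Both layouts need 12 bytes: 4 magic/size + 4 size/magic + 4 subtype. */
  if (n < 12)
    return NULL;

  /* RIFF: "RIFF" <32-bit size> <4-byte form type> */
  if (memcmp(head, "RIFF", 4) == 0) {
    if (memcmp(head + 8, "WAVE", 4) == 0) return "audio/wave";
    if (memcmp(head + 8, "AVI ", 4) == 0) return "video/avi";
    if (memcmp(head + 8, "WEBP", 4) == 0) return "image/webp";
    return NULL; /* unknown RIFF form: stay inconclusive */
  }

  /* ISO base media: <32-bit box size> "ftyp" <4-byte major brand> */
  if (memcmp(head + 4, "ftyp", 4) == 0) {
    if (memcmp(head + 8, "avif", 4) == 0) return "image/avif";
    if (memcmp(head + 8, "qt  ", 4) == 0) return "video/quicktime";
    return NULL; /* unknown brand: stay inconclusive */
  }

  return NULL;
}
```

Returning NULL for an unrecognized subtype mirrors the conservative stance described in the commit message: an inconclusive head never overrides the wire type.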
When a server declares a wrong but specific Content-Type, httrack renames the file after the wire: photo.jpg served as image/png becomes photo.png. Browsers resolve this exact conflict the other way: the WHATWG MIME Sniffing Standard computes the type from the magic bytes in image, audio/video and font contexts, and Chrome and Firefox both treat a mislabeled resource as its actual format. So a mirror currently names such files in a way no browser agrees with, which bites once the mirror is re-served with extension-derived types.

The contested verdict named by #480 now follows the same rule, more conservatively than the spec: magic proving the URL extension's own type keeps it, anything inconclusive trusts the wire as before (a file is never renamed to a third, sniffed type), and the #267 soft-404 guard is unchanged. The htssniff.c table follows the spec's pattern tables where it defines them (including the Windows-cursor variant of image/x-icon) and extends the same discipline to the rest of httrack's wider MIME set. Gzip'd bodies are inflated in memory, and update runs reproduce the previous run's verdict from the recorded save name (X-Save), so names never churn across updates or binary upgrades.

The diff is the pure feature on top of #479/#480: the sniff table, its plumbing in the naming decision, a wait for a sniffable body head on contested verdicts, and the recorded-verdict fallback. The tests are explicit flips of #480's contract pins (wrongtype/bigtype/packed/mutant keep their extension, lie.png stays .png) plus -#test=sniff table rows; every flip fails on the pre-sniff binary, and the suite passes under ASan+UBSan and MSan.
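The contested-verdict rule described above can be sketched in a few lines of C. This is an illustration under stated assumptions, not the code in this PR: `magic_table`, `head_proves_type` and `settle_contested` are hypothetical names, and the table holds only three well-known signatures rather than the htssniff.c set. It shows the two properties the description stresses: the extension's type wins only on a positive magic match, and the result is always one of the two contested types, never a third sniffed one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical signature table (illustrative subset, not htssniff.c). */
typedef struct {
  const char *mime;
  const char *magic;
  size_t len;
} magic_entry;

static const magic_entry magic_table[] = {
  { "image/jpeg", "\xFF\xD8\xFF", 3 },
  { "image/png", "\x89PNG\r\n\x1A\n", 8 },
  { "image/gif", "GIF8", 4 },
};

/* Returns 1 only when the head positively matches the magic of `mime`.
   A short head or an unknown type is inconclusive and returns 0. */
static int head_proves_type(const char *mime, const unsigned char *head,
                            size_t n) {
  size_t i;
  for (i = 0; i < sizeof(magic_table) / sizeof(magic_table[0]); i++) {
    const magic_entry *e = &magic_table[i];
    if (strcmp(e->mime, mime) == 0 && n >= e->len &&
        memcmp(head, e->magic, e->len) == 0)
      return 1;
  }
  return 0;
}

/* Contested verdict: the URL extension maps to ext_mime, the server sent a
   different specific wire_mime. Keep the extension's type only if the body
   head proves it; otherwise trust the wire as before. */
static const char *settle_contested(const char *ext_mime,
                                    const char *wire_mime,
                                    const unsigned char *head, size_t n) {
  return head_proves_type(ext_mime, head, n) ? ext_mime : wire_mime;
}
```

With this sketch, a JPEG body behind photo.jpg served as image/png settles on image/jpeg, while an HTML error page or a truncated head behind the same URL settles on the wire type.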