
A misdeclared Content-Type renames files whose bytes prove the URL extension right #478

Merged
xroche merged 1 commit into master from p1-3-mime-sniff
Jul 4, 2026
xroche (Owner) commented Jul 3, 2026

When a server declares a wrong but specific Content-Type, httrack renames the file after the wire: photo.jpg served as image/png becomes photo.png. Browsers resolve this exact conflict the other way: the WHATWG MIME Sniffing Standard computes the type from the magic bytes in image, audio/video and font contexts, and Chrome and Firefox both treat a mislabeled resource as its actual format. So a mirror currently names such files in a way no browser agrees with, which bites once the mirror is re-served with extension-derived types. The contested verdict named by #480 now follows the same rule, more conservatively than the spec: magic proving the URL extension's own type keeps the extension, anything inconclusive trusts the wire as before (a file is never renamed to a third, sniffed type), and the #267 soft-404 guard is unchanged. The htssniff.c table follows the spec's pattern tables where it defines them (including the Windows-cursor variant of image/x-icon) and extends the same discipline to the rest of httrack's wider MIME set. Gzip'd bodies are inflated in memory. Update runs reproduce the previous run's verdict from the recorded save name (X-Save), so names never churn across updates or binary upgrades.
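
For readers who want the rule in code form, here is a minimal sketch of the contested-verdict decision described above. It is not the code in htssniff.c: the function names `sniff_head` and `keep_url_extension` are hypothetical, the table is cut down to two signatures (PNG and JPEG), and the #267 soft-404 guard is not modeled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not httrack's API: return the MIME type proven by
   the leading body bytes, or NULL when the head is inconclusive. */
const char *sniff_head(const unsigned char *head, size_t len) {
  static const unsigned char png[] = {0x89, 'P', 'N', 'G',
                                      0x0D, 0x0A, 0x1A, 0x0A};
  static const unsigned char jpg[] = {0xFF, 0xD8, 0xFF};

  if (len >= sizeof(png) && memcmp(head, png, sizeof(png)) == 0)
    return "image/png";
  if (len >= sizeof(jpg) && memcmp(head, jpg, sizeof(jpg)) == 0)
    return "image/jpeg";
  return NULL;
}

/* Hypothetical helper: returns 1 to keep the URL extension, 0 to name the
   file after the wire Content-Type.
   ext_mime:  type implied by the URL extension (e.g. .jpg -> image/jpeg)
   wire_mime: type declared by the server */
int keep_url_extension(const char *ext_mime, const char *wire_mime,
                       const unsigned char *head, size_t len) {
  const char *sniffed;

  if (strcmp(ext_mime, wire_mime) == 0)
    return 1; /* nothing contested */

  sniffed = sniff_head(head, len);

  /* Only magic that proves the extension's own type overrides the wire.
     An inconclusive head, or a head proving some third type, falls back
     to the wire, so a file is never renamed to the sniffed type. */
  return sniffed != NULL && strcmp(sniffed, ext_mime) == 0;
}
```

Under these assumptions, photo.jpg served as image/png with a JPEG head keeps its .jpg name, while the same URL with an HTML or truncated head is still renamed after the wire.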

The diff is the pure feature on top of #479/#480: the sniff table, its plumbing in the naming decision, a wait for a sniffable body head on contested verdicts, and the recorded-verdict fallback. The tests are explicit flips of #480's contract pins (wrongtype/bigtype/packed/mutant keep their extension, lie.png stays .png) plus -#test=sniff table rows; every flip fails on the pre-sniff binary, and the suite passes under ASan+UBSan and MSan.

A URL whose extension maps to a specific type but is served with a
disagreeing specific Content-Type was always renamed after the wire
(photo.jpg served as image/png became photo.png). The contested
verdict (#480) is now settled by the leading body bytes: magic proving
the extension's own type keeps it, anything inconclusive trusts the
wire as before, and the #267 soft-404 guard is unchanged.

New htssniff.c covers the magic-sniffable part of the supported MIME
set (images, A/V containers by RIFF subtype and ftyp brand, zip/OLE
document containers, archives, fonts, conservative text prefixes).
hts_wait_delayed waits for a sniffable head (or EOF) only on contested
verdicts; the head is read from the live backing slot (memory,
url_sav, or the compressed-stream tmpfile, inflated in memory). Update
runs never re-read bytes: they reproduce the previous run's verdict
from the recorded X-Save name (cache_read_including_broken grows a
return_save), so names never churn across updates or upgrades.
Non-delayed mode never sniffs; its HEAD probe has no body on the
first run. Also unlock the waiter's slot on the user-cancel abort.
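
As an illustration of what "A/V containers by RIFF subtype and ftyp brand" can look like, here is a minimal sketch. It is not the table in htssniff.c: the function name `sniff_container` is hypothetical, the brand list is deliberately tiny, and the MIME strings are assumptions that may differ from the ones httrack uses.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not httrack's API: classify a body head by its
   container signature. Returns NULL when the head is too short or the
   subtype/brand is not recognised (i.e. inconclusive). */
const char *sniff_container(const unsigned char *p, size_t n) {
  if (n < 12)
    return NULL; /* need the 4-byte tag, 4-byte size and 4-byte subtype */

  /* RIFF: "RIFF" <size:4> <form type:4> */
  if (memcmp(p, "RIFF", 4) == 0) {
    if (memcmp(p + 8, "WEBP", 4) == 0) return "image/webp";
    if (memcmp(p + 8, "WAVE", 4) == 0) return "audio/wave";
    if (memcmp(p + 8, "AVI ", 4) == 0) return "video/avi";
    return NULL; /* RIFF, but an unknown form type */
  }

  /* ISO base media: <box size:4> "ftyp" <major brand:4> */
  if (memcmp(p + 4, "ftyp", 4) == 0) {
    if (memcmp(p + 8, "avif", 4) == 0) return "image/avif";
    if (memcmp(p + 8, "qt  ", 4) == 0) return "video/quicktime";
    if (memcmp(p + 8, "mp4", 3) == 0 || memcmp(p + 8, "isom", 4) == 0)
      return "video/mp4";
    return NULL; /* ftyp box, but an unlisted brand */
  }

  return NULL;
}
```

The outer tag alone is never enough here: a RIFF or ftyp head with an unlisted subtype stays inconclusive, which under the rule above means the wire type wins.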

Tests flip the #480 contract pins to the sniffed outcomes (wrongtype/
bigtype/packed/mutant keep their extension, lie.png stays png), add
-#test=sniff table rows, and pin the recorded-verdict proxy in
01_zlib-savename-cached (kept out of the MSan job: uninstrumented
zlib). All discriminate against the pre-sniff binary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
xroche force-pushed the p1-3-mime-sniff branch from a0ef2eb to 8d51f0e on July 4, 2026 07:49
xroche merged commit a3f04bd into master on Jul 4, 2026
15 checks passed
