Name the contested case in extension naming and pin the current contract #480
Merged
Behavior-preserving refactor of wire_patches_ext: the decision becomes a three-way wire_ext_verdict (ext kept / wire wins / contested), with the contested case, a specific declared type disagreeing with a specific URL extension, named explicitly instead of falling through. Today a contested verdict trusts the wire, unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com>
…aming -#test=savename gains body= (leading body bytes via a temp url_sav file) and cached= (a real one-entry cache, reopened read-only, whose stored body is PNG magic); new rows and 01_zlib-savename-cached.test pin that naming never depends on content or on the previously recorded save name, only on headers. e2e fixtures (wrongtype.jpg served as image/png, a gzip variant, a 16 KiB body, content that changes between crawls) pin the wire-wins outcome across fresh and update passes. Any future content-based tie-break must flip these rows explicitly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com>
xroche added a commit that referenced this pull request on Jul 4, 2026
A URL whose extension maps to a specific type but is served with a disagreeing specific Content-Type was always renamed after the wire (photo.jpg served as image/png became photo.png). The contested verdict (#480) is now settled by the leading body bytes: magic proving the extension's own type keeps it, anything inconclusive trusts the wire as before, and the #267 soft-404 guard is unchanged. New htssniff.c covers the magic-sniffable part of the supported MIME set (images, A/V containers by RIFF subtype and ftyp brand, zip/OLE document containers, archives, fonts, conservative text prefixes). hts_wait_delayed waits for a sniffable head (or EOF) only on contested verdicts; the head is read from the live backing slot (memory, url_sav, or the compressed-stream tmpfile, inflated in memory). Update runs never re-read bytes: they reproduce the previous run's verdict from the recorded X-Save name (cache_read_including_broken grows a return_save), so names never churn across updates or upgrades. Non-delayed mode never sniffs; its HEAD probe has no body on the first run. Also unlock the waiter's slot on the user-cancel abort. Tests flip the #480 contract pins to the sniffed outcomes (wrongtype/ bigtype/packed/mutant keep their extension, lie.png stays png), add -#test=sniff table rows, and pin the recorded-verdict proxy in 01_zlib-savename-cached (kept out of the MSan job: uninstrumented zlib). All discriminate against the pre-sniff binary. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com>
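The tie-break described in this commit message can be pictured with a minimal sketch. It assumes only two signatures (PNG and JPEG) and uses hypothetical function names; it is not taken from htssniff.c, which covers a much wider MIME set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map leading body bytes to a MIME type, or NULL when
 * the head is inconclusive. Only two well-known signatures are shown. */
static const char *sketch_sniff_magic(const unsigned char *head, size_t len) {
  static const unsigned char png[8] = {0x89, 'P', 'N', 'G',
                                       0x0D, 0x0A, 0x1A, 0x0A};
  static const unsigned char jpg[3] = {0xFF, 0xD8, 0xFF};

  if (len >= sizeof png && memcmp(head, png, sizeof png) == 0)
    return "image/png";
  if (len >= sizeof jpg && memcmp(head, jpg, sizeof jpg) == 0)
    return "image/jpeg";
  return NULL; /* inconclusive */
}

/* Hypothetical helper for a contested verdict: returns 1 when the magic
 * proves the extension's own type (extension kept), 0 otherwise (the wire
 * is trusted, as before). */
static int sketch_contested_keeps_ext(const char *ext_mime,
                                      const unsigned char *head, size_t len) {
  const char *sniffed = sketch_sniff_magic(head, len);
  return sniffed != NULL && strcmp(sniffed, ext_mime) == 0;
}
```

In this sketch, photo.jpg served as image/png keeps its .jpg extension only when the body really starts with JPEG magic; a PNG body or an HTML error page leaves the decision with the wire.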
Refactor plus test coverage for the extension-naming decision, no behavior change. The wire-type-vs-extension choice in htsname.c was a single opaque function with early returns; it becomes an explicit three-way verdict (`wire_ext_verdict`: extension kept / wire wins / contested), naming the case where a specific declared type disagrees with a specific URL extension. A contested verdict trusts the wire, as today.

The naming contract was untested and largely implicit; it is now pinned so any future change has to flip a test on purpose.
`-#test=savename` gains `body=` (leading body bytes) and `cached=` (a real one-entry cache, reopened read-only, whose stored body is deliberately PNG magic), with rows asserting that naming depends only on headers, never on body content or on the previously recorded save name. New e2e fixtures (`wrongtype.jpg` served as `image/png`, a gzip variant, a 16 KiB body, content that changes between crawls) pin the wire-wins outcome across fresh and update passes.
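The three-way verdict described above can be sketched as follows. This is an illustration under stated assumptions, not the code in htsname.c: the enumerator and function names are hypothetical, and the notion of a "specific" type is reduced here to "anything but `application/octet-stream`", which may differ from the real rule.

```c
#include <string.h>

/* Hypothetical enumerator names; the PR only confirms a three-way
 * wire_ext_verdict (extension kept / wire wins / contested). */
typedef enum {
  SKETCH_EXT_KEPT,
  SKETCH_WIRE_WINS,
  SKETCH_CONTESTED
} sketch_verdict;

/* Assumption: "specific" means a non-empty type other than the generic
 * application/octet-stream fallback. The real test may differ. */
static int sketch_is_specific(const char *mime) {
  return mime != NULL && mime[0] != '\0' &&
         strcmp(mime, "application/octet-stream") != 0;
}

static sketch_verdict sketch_classify(const char *wire_mime,
                                      const char *ext_mime) {
  if (!sketch_is_specific(wire_mime))
    return SKETCH_EXT_KEPT;  /* assumed: a generic wire type does not rename */
  if (!sketch_is_specific(ext_mime))
    return SKETCH_WIRE_WINS; /* assumed: no specific extension to defend */
  if (strcmp(wire_mime, ext_mime) == 0)
    return SKETCH_EXT_KEPT;  /* both sides agree */
  return SKETCH_CONTESTED;   /* two specific types disagree */
}

/* Under this PR a contested verdict still trusts the wire, so the
 * boolean outcome only distinguishes "extension kept" from the rest. */
static int sketch_wire_patches_ext(const char *wire_mime,
                                   const char *ext_mime) {
  return sketch_classify(wire_mime, ext_mime) != SKETCH_EXT_KEPT;
}
```

Naming the contested case separately costs nothing today, since it collapses to the wire-wins outcome, but it gives a later tie-break a single, explicit place to hook in.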