Detect OSV degenerate tag matches via canary query#50
Conversation
OSV's normalize_tag collapses any string with no decimal digit to "", which then fuzzy-matches any digit-free tag in a CVE's affected list (icu has tags like "last-cvs-commit" and "latest"). Send a digit-free canary alongside each real query; if both return identical non-empty sets, OSV did not meaningfully resolve the real tag. Skip and warn. Also adds regression tests pinning affects_version? against real-shaped icu CVE records, and verifies batch slicing at the 500-package boundary. See #41 and google/osv.dev#5230.
There was a problem hiding this comment.
Pull request overview
Adds a defensive “canary” query to OSV batch requests to detect OSV’s degenerate digit-free tag normalization behavior (issue #41 / google/osv.dev#5230) and avoids reporting misleading fallback matches, alongside new regression tests to pin local version-filter behavior and batch alignment.
Changes:
- Interleave a digit-free canary version query with each real OSV batch query and skip results when OSV appears to have fallen back.
- Reduce effective batch slicing to 500 packages per request (to keep within OSV’s 1000-query batch limit after doubling).
- Add regression tests for
affects_version?behavior on real-shaped ICU OSV records and for batch interleaving/slicing/index alignment.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| lib/brew/vulns/osv_client.rb | Interleaves canary queries into query_batch and applies skip logic when real+canary results match. |
| test/brew/test_osv_client.rb | Adds tests for canary interleaving, skip behavior, set comparison, and 500-package slicing/index alignment. |
| test/brew/test_vulnerability.rb | Adds regression tests for affects_version? using ICU-shaped OSV/CVE records and separator normalization guard. |
| test/brew/test_cli.rb | Adds end-to-end regression tests covering local filtering and batch request payload/index alignment invariants. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| real_ids = (real["vulns"] || []).map { |v| v["id"] }.sort | ||
| canary_ids = (canary["vulns"] || []).map { |v| v["id"] }.sort | ||
|
|
||
| if !canary_ids.empty? && real_ids == canary_ids | ||
| warn "Warning: OSV could not resolve #{pkg[:name] || pkg[:repo_url]} " \ | ||
| "tag #{pkg[:version].inspect}; skipping (results matched " \ | ||
| "default-branch fallback)" | ||
| results[global_idx] = [] |
There was a problem hiding this comment.
The canary-skip heuristic currently triggers for any version when the real and canary vuln ID sets match. That can produce false negatives for normal digit-bearing tags (e.g., if a vuln truly affects all versions, or if the tag legitimately has the same vuln set as default-branch/fallback), since it will be skipped even though OSV may have resolved the tag correctly. Consider gating the canary comparison (and ideally the extra canary query) to only run when the real version string contains no decimal digits, which is the specific upstream degeneracy described in the comment above CANARY_VERSION.
| response = post("/querybatch", { queries: queries }) | ||
| batch_results = response["results"] || [] | ||
|
|
||
| batch_results.each_with_index do |result, idx| | ||
| global_idx = batch_idx * BATCH_SIZE + idx | ||
| results[global_idx] = result["vulns"] || [] | ||
| batch.each_with_index do |pkg, idx| | ||
| global_idx = batch_idx * slice_size + idx | ||
| real = batch_results[idx * 2] || {} | ||
| canary = batch_results[idx * 2 + 1] || {} | ||
|
|
There was a problem hiding this comment.
query_batch assumes OSV returns a results array aligned 1:1 with the submitted queries. If the API returns fewer/more entries (or nils), the idx*2 indexing will silently drop or misattribute vulnerabilities. Consider validating that response["results"] is an Array of the expected length (queries.length) and raising ApiError (or at least warning) when it isn’t, to avoid producing incorrect scans.
| packages = [{ repo_url: "https://github.com/a/a", version: "v9.9.9", name: "a" }] | ||
|
|
||
| err = capture_stderr { @results = @client.query_batch(packages) } | ||
|
|
||
| assert_equal [[]], @results | ||
| assert_match(/OSV could not resolve a tag "v9.9.9"/, err) |
There was a problem hiding this comment.
This test uses a digit-bearing version ("v9.9.9"), but the upstream OSV degeneracy described for the canary is specifically about digit-free tags collapsing to "". Using a digit-free version here (and/or asserting that the canary comparison is only applied to digit-free versions) would better match the intended failure mode and avoid baking overly-broad skipping behavior into the test suite.
| packages = [{ repo_url: "https://github.com/a/a", version: "v9.9.9", name: "a" }] | |
| err = capture_stderr { @results = @client.query_batch(packages) } | |
| assert_equal [[]], @results | |
| assert_match(/OSV could not resolve a tag "v9.9.9"/, err) | |
| packages = [{ repo_url: "https://github.com/a/a", version: "v", name: "a" }] | |
| err = capture_stderr { @results = @client.query_batch(packages) } | |
| assert_equal [[]], @results | |
| assert_match(/OSV could not resolve a tag "v"/, err) |
Investigating #41 turned up a permanent OSV quirk:
normalize_tagcollapses any string with no decimal digit to"", which then fuzzy-matches any digit-free tag sitting in a CVE's affected-versions list. The icu repo has tags likelast-cvs-commitandlatest, so querying for"deadbeef"returns seven 2017-era CVEs while"deadbeef0"returns zero. Filed upstream as google/osv.dev#5230.This PR sends a digit-free canary version alongside each real query in the batch. If real and canary return identical non-empty result sets, OSV did not meaningfully resolve the real tag — skip that formula with a stderr warning instead of reporting whatever the degenerate match produced. Cost is doubling the query count inside the same HTTP request; batch slicing now cuts at 500 packages instead of 1000.
Also adds regression tests for the local
affects_version?filter using real-shaped icu CVE records (CVE-2016-6293, OSV-2023-1328), batch index alignment, and slicing at the 500-package boundary.The reporter's specific #41 case appears to have been transient bad data from google/osv.dev#5103 — every flagged CVE was re-indexed in the days after that PR landed. The canary doesn't defend against bad upstream
affected.versionscontent, but the local-filter regression tests pin what we can verify.