Privacy 2025 queries #4178
Merged
Commits
39 commits
6aebd35 dates updated (max-ostapenko)
943ae28 query updates (max-ostapenko)
451cca9 sheet exporter update (max-ostapenko)
e394bb6 ID update (max-ostapenko)
2a7db9b formatting (max-ostapenko)
cd572c8 lint (max-ostapenko)
bb5959a Refactor origin trial functions for improved readability and structure (max-ostapenko)
579fb49 Merge remote-tracking branch 'origin/main' into privacy-sql-2025 (max-ostapenko)
55cec4f lint (max-ostapenko)
c3a2ee7 lint (max-ostapenko)
489d07d Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
7640ee0 Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
bd7506d make bq_to_sheets.ipynb runnable and add deps to requirements (max-ostapenko)
08aa531 Refactor privacy queries and utilities; make bq_to_sheets runnable (max-ostapenko)
c2566e6 Potential fix for code scanning alert no. 640: Unused import (max-ostapenko)
42da6ad Remove unused json import (max-ostapenko)
c88867e Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
28d240b Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
8e543ef Add SQL scripts for tracking first-party and third-party cookies; rem… (max-ostapenko)
c195c07 lint (max-ostapenko)
5c300e4 Refactor SQL scripts for IAB TCF v2 and client hints; streamline quer… (max-ostapenko)
8768f22 Review and apply sql pivots (max-ostapenko)
951a7d2 Remove deprecated SQL scripts, and add new scripts for tracker distri… (max-ostapenko)
1c31d62 Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
9ab94bd updated metrics (max-ostapenko)
864fddd formatting (max-ostapenko)
2b523a9 Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
c14742b Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
8c4e816 3p cookie domains (max-ostapenko)
220d0b8 switch the columns for a chart (max-ostapenko)
edc9fb1 exclude android.clients.google.com (max-ostapenko)
d280cd3 fix order by (max-ostapenko)
0d23ecb lint (max-ostapenko)
1096577 Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
6c24d45 split requirements.txt (max-ostapenko)
78acbd8 Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
8d9e84a lint (max-ostapenko)
f521a83 Merge branch 'main' into privacy-sql-2025 (max-ostapenko)
1e3a9ab Merge branch 'privacy-sql-2025' of https://github.com/HTTPArchive/alm… (max-ostapenko)
32 changes: 18 additions & 14 deletions
sql/2024/privacy/number_of_websites_with_related_origin_trials.sql
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| -- noqa: disable=PRS | ||
| -- Detection logic explained: | ||
| -- https://github.com/privacycg/proposals/issues/6 | ||
| -- https://github.com/privacycg/nav-tracking-mitigations/blob/main/bounce-tracking-explainer.md | ||
| | ||
| WITH redirect_requests AS ( | ||
| FROM `httparchive.crawl.requests` | ||
| |> WHERE | ||
| date = '2025-07-01' AND | ||
| --rank = 1000 AND | ||
| SAFE.INT64(summary.status) BETWEEN 300 AND 399 AND | ||
| index <= 2 | ||
| |> JOIN UNNEST(response_headers) AS header | ||
| |> WHERE LOWER(header.name) = 'location' | ||
| |> SELECT | ||
| client, | ||
| url, | ||
| index, | ||
| NET.REG_DOMAIN(header.value) AS location_domain, | ||
| page | ||
| ), | ||
| | ||
| -- Find the first navigation redirect | ||
| navigation_redirect AS ( | ||
| FROM redirect_requests | ||
| |> WHERE | ||
| index = 1 AND | ||
| NET.REG_DOMAIN(page) = NET.REG_DOMAIN(url) AND | ||
| NET.REG_DOMAIN(url) != location_domain | ||
| |> SELECT | ||
| client, | ||
| page, | ||
| location_domain AS bounce_domain | ||
| ), | ||
| | ||
| -- Find the second navigation redirect | ||
| bounce_redirect AS ( | ||
| FROM redirect_requests | ||
| |> WHERE | ||
| index = 2 AND | ||
| NET.REG_DOMAIN(page) != NET.REG_DOMAIN(url) AND | ||
| NET.REG_DOMAIN(url) != location_domain | ||
| |> SELECT | ||
| client, | ||
| url, | ||
| page, | ||
| location_domain AS bounce_redirect_location_domain | ||
| ), | ||
| | ||
| -- Combine the first and second navigation redirects | ||
| bounce_sequences AS ( | ||
| FROM navigation_redirect AS nav | ||
| |> JOIN bounce_redirect AS bounce | ||
| ON | ||
| nav.client = bounce.client AND | ||
| nav.page = bounce.page | ||
| |> AGGREGATE COUNT(DISTINCT nav.page) AS pages_count | ||
| GROUP BY nav.client, bounce_domain | ||
| ), | ||
| | ||
| pages_total AS ( | ||
| FROM `httparchive.crawl.pages` | ||
| |> WHERE date = '2025-07-01' --AND rank = 1000 | ||
| |> AGGREGATE COUNT(DISTINCT page) AS total_pages GROUP BY client | ||
| ) | ||
| | ||
| FROM bounce_sequences | ||
| |> JOIN pages_total USING (client) | ||
| |> EXTEND pages_count / total_pages AS pages_pct | ||
| |> DROP total_pages | ||
| |> PIVOT( | ||
| ANY_VALUE(pages_count) AS cnt, | ||
| ANY_VALUE(pages_pct) AS pages_pct | ||
| FOR client IN ('desktop', 'mobile') | ||
| ) | ||
| |> RENAME cnt_mobile AS mobile, cnt_desktop AS desktop | ||
| |> ORDER BY mobile + desktop DESC | ||
| |> LIMIT 100 |
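
The two-step detection in the query above (a first request on the page's own site that redirects off-site, followed by a second off-site request that redirects onward) can be illustrated outside BigQuery. This is a minimal Python sketch, not the PR's code. It assumes a hypothetical `reg_domain` helper that naively keeps the last two host labels, whereas the query relies on `NET.REG_DOMAIN`, and a hypothetical list of `(index, url, location)` tuples for 3xx responses. Like the query, it does not check that the second request's domain equals the first redirect's target.

```python
from urllib.parse import urlsplit


def reg_domain(url):
    """Naive stand-in for NET.REG_DOMAIN: the last two labels of the host.

    Assumption for illustration only; it is wrong for suffixes such as co.uk.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def find_bounce_domain(page, redirects):
    """Return the bounce domain for a page, or None if no sequence is found.

    redirects: iterable of (index, url, location) tuples for 3xx responses,
    where index is the request's position in the page load.
    """
    bounce_domain = None
    second_hop = False
    for index, url, location in redirects:
        target = reg_domain(location)
        if not target:
            # Relative Location headers have no domain and never match.
            continue
        same_site = reg_domain(page) == reg_domain(url)
        leaves_domain = reg_domain(url) != target
        if index == 1 and same_site and leaves_domain:
            # Mirrors the navigation_redirect CTE.
            bounce_domain = target
        elif index == 2 and not same_site and leaves_domain:
            # Mirrors the bounce_redirect CTE.
            second_hop = True
    return bounce_domain if second_hop else None
```

A page only counts when both hops are present, which corresponds to the inner join between `navigation_redirect` and `bounce_redirect`.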
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,48 @@ | ||
| -- noqa: disable=PRS | ||
| WITH totals AS ( | ||
| FROM `httparchive.crawl.pages` | ||
| |> WHERE date = '2025-07-01' AND is_root_page --AND rank = 1000 | ||
| |> AGGREGATE COUNT(*) AS total_websites GROUP BY client | ||
| ), | ||
| | ||
| /* Get Accept-CH Headers */ | ||
| headers AS ( | ||
| FROM `httparchive.crawl.requests` | ||
| |> WHERE date = '2025-07-01' AND is_root_page AND is_main_document --AND rank = 1000 | ||
| |> JOIN UNNEST(response_headers) AS header | ||
| |> WHERE LOWER(header.name) = 'accept-ch' | ||
| |> LEFT JOIN UNNEST(SPLIT(LOWER(header.value), ',')) AS header_value | ||
| |> SELECT client, page, header_value | ||
| | ||
| ), | ||
| | ||
| /* Get Accept-CH Meta Tags */ | ||
| meta_tags AS ( | ||
| FROM `httparchive.crawl.pages` | ||
| |> WHERE date = '2025-07-01' AND is_root_page --AND rank = 1000 | ||
| |> JOIN UNNEST(JSON_QUERY_ARRAY(custom_metrics.other.almanac.`meta-nodes`.nodes)) AS meta_node | ||
| |> EXTEND | ||
| LOWER(SAFE.STRING(meta_node.`http-equiv`)) AS tag_name, | ||
| |> WHERE tag_name = 'accept-ch' | ||
| |> LEFT JOIN UNNEST(SPLIT(LOWER(SAFE.STRING(meta_node.content)), ',')) AS tag_value | ||
| |> SELECT client, page, tag_value | ||
| ) | ||
| | ||
| FROM headers | ||
| |> FULL OUTER JOIN meta_tags USING (client, page) | ||
| |> JOIN totals USING (client) | ||
| |> EXTEND TRIM(COALESCE(header_value, tag_value)) AS value | ||
| |> AGGREGATE | ||
| COUNT(DISTINCT page) AS number_of_pages, | ||
| COUNT(DISTINCT page) / ANY_VALUE(total_websites) AS pct_pages | ||
| GROUP BY client, value | ||
| |> PIVOT( | ||
| ANY_VALUE(number_of_pages) AS pages_count, | ||
| ANY_VALUE(pct_pages) AS pct | ||
| FOR client IN ('desktop', 'mobile') | ||
| ) | ||
| |> RENAME | ||
| pct_mobile AS mobile, | ||
| pct_desktop AS desktop | ||
| |> ORDER BY pages_count_mobile + pages_count_desktop DESC | ||
| |> LIMIT 200 | ||
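
The value handling in the query above (lowercase the `Accept-CH` value, split on commas, trim each token, then count distinct pages per token) can be sketched in Python as follows. The function names and the `(page, value)` input shape are assumptions made for illustration, not part of the PR.

```python
from collections import defaultdict


def accept_ch_tokens(value):
    """Lowercase, split on commas, and trim, like the LOWER/SPLIT/TRIM chain."""
    return [token.strip() for token in value.lower().split(",")]


def pages_per_hint(page_values):
    """Count distinct pages per client hint token.

    page_values: iterable of (page, raw Accept-CH value) pairs, which may
    come from either the response header or the meta tag.
    """
    pages = defaultdict(set)
    for page, value in page_values:
        for token in accept_ch_tokens(value):
            pages[token].add(page)
    return {token: len(seen) for token, seen in pages.items()}
```

Using a set per token plays the role of `COUNT(DISTINCT page)`, so a page that declares the same hint in both the header and a meta tag is counted once.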
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| -- noqa: disable=PRS | ||
| WITH base_totals AS ( | ||
| SELECT | ||
| client, | ||
| COUNT(DISTINCT root_page) AS total_websites | ||
| FROM `httparchive.crawl.pages` | ||
| WHERE date = '2025-07-01' | ||
| --AND rank = 1000 | ||
| GROUP BY client | ||
| ), | ||
| | ||
| accept_ch_headers AS ( | ||
| SELECT DISTINCT | ||
| client, | ||
| root_page | ||
| FROM `httparchive.crawl.requests`, | ||
| UNNEST(response_headers) response_header | ||
| WHERE | ||
| date = '2025-07-01' AND | ||
| is_main_document = TRUE AND | ||
| --rank = 1000 AND | ||
| LOWER(response_header.name) = 'accept-ch' | ||
| ), | ||
| | ||
| accept_ch_meta AS ( | ||
| SELECT DISTINCT | ||
| client, | ||
| root_page | ||
| FROM ( | ||
| SELECT | ||
| client, | ||
| root_page, | ||
| custom_metrics.other.almanac AS metrics | ||
| FROM `httparchive.crawl.pages` | ||
| WHERE date = '2025-07-01' | ||
| --AND rank = 1000 | ||
| ), | ||
| UNNEST(JSON_QUERY_ARRAY(metrics.`meta-nodes`.nodes)) AS meta_node | ||
| WHERE LOWER(SAFE.STRING(meta_node.`http-equiv`)) = 'accept-ch' | ||
| ), | ||
| | ||
| -- Combine both sources | ||
| all_accept_ch AS ( | ||
| SELECT client, root_page FROM accept_ch_headers | ||
| UNION DISTINCT | ||
| SELECT client, root_page FROM accept_ch_meta | ||
| ) | ||
| | ||
| FROM all_accept_ch | ||
| |> JOIN base_totals USING (client) | ||
| |> AGGREGATE | ||
| COUNT(DISTINCT all_accept_ch.root_page) AS number_of_websites, | ||
| COUNT(DISTINCT all_accept_ch.root_page) / ANY_VALUE(base_totals.total_websites) AS pct_websites | ||
| GROUP BY all_accept_ch.client | ||
| |> PIVOT( | ||
| ANY_VALUE(number_of_websites) AS websites_count, | ||
| ANY_VALUE(pct_websites) AS pct | ||
| FOR client IN ('desktop', 'mobile') | ||
| ) | ||
| |> RENAME pct_mobile AS mobile, pct_desktop AS desktop | ||
| |> ORDER BY websites_count_mobile + websites_count_desktop DESC |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| /* Most common cookie names, by number of domains on which they appear. | ||
| Goal is to identify common trackers that use first-party cookies across sites. | ||
| */ | ||
| | ||
| FROM `httparchive.crawl.pages` | ||
| |> WHERE date = '2025-07-01' -- AND rank = 1000 | ||
| |> EXTEND COUNT(DISTINCT NET.HOST(root_page)) OVER (PARTITION BY client) AS total_domains | ||
| |> JOIN UNNEST(JSON_QUERY_ARRAY(custom_metrics.cookies)) AS cookie | ||
| |> EXTEND | ||
| NET.HOST(root_page) AS firstparty_domain, | ||
| NET.HOST(SAFE.STRING(cookie.domain)) AS cookie_domain, | ||
| SAFE.STRING(cookie.name) AS cookie_name | ||
| |> WHERE ENDS_WITH('.' || firstparty_domain, '.' || cookie_domain) | ||
| |> AGGREGATE | ||
| COUNT(DISTINCT firstparty_domain) AS domain_count, | ||
| COUNT(DISTINCT firstparty_domain) / ANY_VALUE(total_domains) AS pct_domains | ||
| GROUP BY client, cookie_name | ||
| |> PIVOT ( | ||
| ANY_VALUE(domain_count) AS domain_count, | ||
| ANY_VALUE(pct_domains) AS pct_domains | ||
| FOR client IN ('desktop', 'mobile') | ||
| ) | ||
| |> RENAME | ||
| pct_domains_mobile AS mobile, | ||
| pct_domains_desktop AS desktop | ||
| |> ORDER BY domain_count_mobile + domain_count_desktop DESC | ||
| |> LIMIT 1000 |
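
The first-party test in the query above, `ENDS_WITH('.' || firstparty_domain, '.' || cookie_domain)`, is a suffix match on dot-prefixed hosts. A minimal Python sketch of the same predicate follows; the function name is hypothetical and the inputs are assumed to be bare hostnames with no leading dot.

```python
def is_first_party(page_host, cookie_host):
    """True when cookie_host equals page_host or is a parent domain of it.

    Prefixing both sides with a dot forces the match to fall on a label
    boundary, so 'badexample.com' does not match 'example.com'.
    """
    return ("." + page_host).endswith("." + cookie_host)
```

The third-party cookie query applies the negation of this predicate to the same rows.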
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| FROM `httparchive.crawl.pages` | ||
| |> WHERE date = '2025-07-01' -- AND rank = 1000 | ||
| |> EXTEND COUNT(DISTINCT NET.HOST(root_page)) OVER (PARTITION BY client) AS total_domains | ||
| |> JOIN UNNEST(JSON_QUERY_ARRAY(custom_metrics.cookies)) AS cookie | ||
| |> EXTEND | ||
| NET.HOST(root_page) AS firstparty_domain, | ||
| NET.HOST(SAFE.STRING(cookie.domain)) AS cookie_domain, | ||
| NET.HOST(SAFE.STRING(cookie.domain)) || ' / ' || SAFE.STRING(cookie.name) AS cookie_details | ||
| |> WHERE NOT ENDS_WITH('.' || firstparty_domain, '.' || cookie_domain) | ||
| |> AGGREGATE | ||
| COUNT(DISTINCT firstparty_domain) AS domain_count, | ||
| COUNT(DISTINCT firstparty_domain) / ANY_VALUE(total_domains) AS pct_domains | ||
| GROUP BY client, cookie_details | ||
| |> PIVOT ( | ||
| ANY_VALUE(domain_count) AS domain_count, | ||
| ANY_VALUE(pct_domains) AS pct_domains | ||
| FOR client IN ('desktop', 'mobile') | ||
| ) | ||
| |> RENAME | ||
| pct_domains_mobile AS mobile, | ||
| pct_domains_desktop AS desktop | ||
| |> ORDER BY domain_count_mobile + domain_count_desktop DESC | ||
| |> LIMIT 1000 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| -- Pages that use DNT feature | ||
| | ||
| FROM `httparchive.blink_features.usage` | ||
| |> WHERE | ||
| date = '2025-07-01' AND | ||
| --rank <= 10000 AND | ||
| feature = 'NavigatorDoNotTrack' | ||
| |> SELECT DISTINCT | ||
| client, | ||
| rank, | ||
| num_urls, | ||
| pct_urls | ||
| |> PIVOT ( | ||
| ANY_VALUE(num_urls) AS pages_count, | ||
| ANY_VALUE(pct_urls) AS pct | ||
| FOR client IN ('desktop', 'mobile') | ||
| ) | ||
| |> RENAME pct_mobile AS mobile, pct_desktop AS desktop | ||
| |> ORDER BY rank ASC |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| -- noqa: disable=PRS | ||
| -- Percent of websites using a fingerprinting library based on wappalyzer category | ||
| | ||
| WITH base_totals AS ( | ||
| SELECT | ||
| client, | ||
| COUNT(DISTINCT root_page) AS websites_total | ||
| FROM httparchive.crawl.pages | ||
| WHERE date = '2025-07-01' | ||
| GROUP BY client | ||
| ) | ||
| | ||
| FROM httparchive.crawl.pages, | ||
| UNNEST(technologies) AS technology, | ||
| UNNEST(technology.categories) AS category | ||
| |> WHERE | ||
| date = '2025-07-01' AND | ||
| category = 'Browser fingerprinting' | ||
| |> AGGREGATE | ||
| COUNT(DISTINCT root_page) AS websites_count | ||
| GROUP BY client, technology.technology | ||
| |> JOIN base_totals USING (client) | ||
| |> EXTEND websites_count / websites_total AS websites_pct | ||
| |> DROP websites_total | ||
| |> PIVOT( | ||
| ANY_VALUE(websites_count) AS websites_count, | ||
| ANY_VALUE(websites_pct) AS pct | ||
| FOR client IN ('desktop', 'mobile') | ||
| ) | ||
| |> RENAME websites_count_mobile AS mobile, websites_count_desktop AS desktop | ||
| |> ORDER BY mobile + desktop DESC |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| -- Counts of CMPs using IAB Transparency & Consent Framework | ||
| -- cf. https://github.com/InteractiveAdvertisingBureau/GDPR-Transparency-and-Consent-Framework/blob/master/TCFv2/IAB%20Tech%20Lab%20-%20CMP%20API%20v2.md--tcdata | ||
| -- CMP vendor list: https://iabeurope.eu/cmp-list/ | ||
| | ||
| FROM `httparchive.crawl.pages` | ||
| |> WHERE date = '2025-07-01' --AND rank = 1000 | ||
| |> EXTEND | ||
| SAFE.INT64(custom_metrics.privacy.iab_tcf_v2.data.cmpId) AS cmpId, | ||
| COUNT(DISTINCT page) OVER (PARTITION BY client) AS total_pages | ||
| |> AGGREGATE | ||
| COUNT(0) AS number_of_pages, | ||
| COUNT(0) / ANY_VALUE(total_pages) AS pct_pages | ||
| GROUP BY client, cmpId | ||
| |> PIVOT ( | ||
| ANY_VALUE(number_of_pages) AS pages_count, | ||
| ANY_VALUE(pct_pages) AS pct | ||
| FOR client IN ('desktop', 'mobile') | ||
| ) | ||
| |> RENAME pct_mobile AS mobile, pct_desktop AS desktop | ||
| |> ORDER BY pages_count_mobile + pages_count_desktop DESC |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| -- noqa: disable=PRS | ||
| -- Counts of countries for publishers using IAB Transparency & Consent Framework | ||
| -- cf. https://github.com/InteractiveAdvertisingBureau/GDPR-Transparency-and-Consent-Framework/blob/master/TCFv2/IAB%20Tech%20Lab%20-%20CMP%20API%20v2.md--tcdata | ||
| -- "Country code of the country that determines the legislation of | ||
| -- reference. Normally corresponds to the country code of the country | ||
| -- in which the publisher's business entity is established." | ||
| | ||
| WITH base_totals AS ( | ||
| SELECT | ||
| client, | ||
| COUNT(DISTINCT root_page) AS total_websites | ||
| FROM `httparchive.crawl.pages` | ||
| WHERE date = '2025-07-01' --AND rank = 1000 | ||
| GROUP BY client | ||
| ), | ||
| | ||
| base_data AS ( | ||
| SELECT | ||
| client, | ||
| root_page, | ||
| UPPER(SAFE.STRING(custom_metrics.privacy.iab_tcf_v2.data.publisherCC)) AS publisherCC | ||
| FROM `httparchive.crawl.pages` | ||
| WHERE | ||
| date = '2025-07-01' AND --rank = 1000 AND | ||
| JSON_TYPE(custom_metrics.privacy.iab_tcf_v2.data) = 'object' | ||
| ) | ||
| | ||
| FROM base_data | ||
| |> AGGREGATE | ||
| COUNT(DISTINCT root_page) AS number_of_pages | ||
| GROUP BY client, publisherCC | ||
| |> JOIN base_totals USING (client) | ||
| |> EXTEND number_of_pages / total_websites AS pct_of_pages | ||
| |> DROP total_websites | ||
| |> PIVOT( | ||
| ANY_VALUE(number_of_pages) AS pages_count, | ||
| ANY_VALUE(pct_of_pages) AS pct | ||
| FOR client IN ('desktop', 'mobile') | ||
| ) | ||
| |> RENAME pct_mobile AS mobile, pct_desktop AS desktop | ||
| |> ORDER BY pages_count_mobile + pages_count_desktop DESC |