Privacy 2025 queries #4178

Merged
max-ostapenko merged 39 commits into main from privacy-sql-2025 on Jan 14, 2026

@max-ostapenko (Contributor) commented Aug 1, 2025

Makes progress on #4083

Tracking and Technologies

  • the most common tracker categories deployed on websites
  • websites by technology and technology category, focusing on privacy-related categories such as analytics and advertising

Cookie Analysis

  • the most common first-party cookie names across domains
  • the most common third-party cookie names and domains

Other Privacy Metrics

  • the most common referrer policies
  • bounce tracking domains

Privacy Compliance Frameworks

  • CMPs using the IAB Transparency and Consent Framework v2
  • the distribution of US Privacy String values for websites using the IAB US Privacy Framework
  • publishers by country using the IAB TCF v2
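
For readers unfamiliar with the US Privacy String mentioned above: as far as I understand the IAB format, it is a four-character value such as `1YNN`, made of a version digit followed by three flags (notice given, opt-out of sale, LSPA coverage), each `Y`, `N`, or `-` for not applicable. The sketch below is only an illustration of how such values could be decoded and tallied; it is not the query in this PR, the names `USPrivacy` and `parse_us_privacy` are hypothetical, and treating the flags as case-insensitive is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Flag characters: Y = yes, N = no, "-" = not applicable.
_FLAG_VALUES = {"Y": True, "N": False, "-": None}


@dataclass(frozen=True)
class USPrivacy:
    """Decoded US Privacy String (hypothetical helper type)."""
    version: int
    notice_given: Optional[bool]
    opted_out_of_sale: Optional[bool]
    lspa_covered: Optional[bool]


def parse_us_privacy(value: str) -> Optional[USPrivacy]:
    """Return the decoded string, or None if it is malformed."""
    if len(value) != 4 or value[0] not in "0123456789":
        return None
    flags = value[1:].upper()  # assumption: flags are case-insensitive
    if any(ch not in _FLAG_VALUES for ch in flags):
        return None
    notice, opt_out, lspa = (_FLAG_VALUES[ch] for ch in flags)
    return USPrivacy(int(value[0]), notice, opt_out, lspa)


def distribution(values):
    """Count how often each raw value occurs, keeping malformed ones separate."""
    return Counter(v if parse_us_privacy(v) else "(malformed)" for v in values)
```

Grouping malformed values under one bucket keeps the distribution readable when sites emit placeholder or truncated strings.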

@max-ostapenko changed the title from "Privacy 2025" to "Privacy 2025 queries" on Aug 1, 2025
@tunetheweb added the "analysis" (Querying the dataset) label on Aug 18, 2025
Comment thread sql/util/haveibeenpwned.py Fixed
max-ostapenko and others added 3 commits October 20, 2025 20:52
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@tunetheweb (Member) commented:

@max-ostapenko what's the latest with this? It's still marked as draft.

@max-ostapenko marked this pull request as ready for review on January 11, 2026 23:35
Copilot AI review requested due to automatic review settings January 11, 2026 23:35
Copilot AI left a comment

Pull request overview

This pull request implements Privacy 2025 queries for the HTTP Archive almanac, covering tracking technologies, cookie analysis, privacy metrics, and compliance frameworks. The PR adds multiple new SQL queries to analyze privacy-related data from the July 2025 crawl and updates supporting Python utilities for data processing and export to Google Sheets.

Changes:

  • Added 19 new SQL queries for privacy analysis (trackers, cookies, IAB frameworks, referrer policies, etc.)
  • Updated utility scripts for WhoTracksMe data and Have I Been Pwned breach data
  • Enhanced the BigQuery-to-Sheets notebook with improved error handling and local development support
  • Added new Python dependencies for data processing (tabulate, gspread, ipykernel, db-dtypes)

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `src/requirements.txt` | Added dependencies for Jupyter notebook and Google Sheets integration |
| `sql/util/whotracksme_trackers.py` | Updated date to 2025-07-01 for new crawl data |
| `sql/util/haveibeenpwned.py` | Refactored breach data retrieval with updated schema and TRUNCATE mode |
| `sql/util/bq_writer.py` | Removed CSV source format specification from BigQuery load config |
| `sql/util/bq_to_sheets.ipynb` | Major refactor with improved Colab compatibility and error handling |
| `sql/2025/privacy/*.sql` | 19 new SQL queries analyzing trackers, cookies, privacy frameworks, and compliance |
| `sql/2024/privacy/number_of_websites_with_related_origin_trials.sql` | Refactored origin trial parsing function |

Comment thread sql/util/bq_to_sheets.ipynb
Comment thread sql/2025/privacy/client_hints_top.sql Outdated
Comment thread sql/util/haveibeenpwned.py
Comment thread sql/util/haveibeenpwned.py
Comment thread sql/util/haveibeenpwned.py Outdated
Comment thread sql/util/bq_writer.py
Comment thread sql/2025/privacy/tracker_technologies_top.sql Outdated
@max-ostapenko (Contributor, Author) commented:

@tunetheweb how does it look?
I'm ready to merge.

@tunetheweb (Member) left a comment:

LGTM on the SQL from a (very!) quick look, but one question on the requirements.txt changes.

Comment thread src/requirements.txt Outdated
@max-ostapenko merged commit d0f3b7a into main on Jan 14, 2026
13 checks passed
@max-ostapenko deleted the privacy-sql-2025 branch on January 14, 2026 21:00
Labels

analysis Querying the dataset

4 participants