The queries are split by the section where they are used:
design/is about foundries and families,development/is about tools and technologies, andperformance/is about hosting and serving.
Each file name starts with one of the following prefixes indicating the primary subject of the corresponding analysis:
fonts_is about font files,pages_is about HTML pages,scripts_is about JavaScript scripts, andstyles_is about CSS style sheets.
The prefix is followed by the property studied given in singular, potentially extended one or several suffixes narrowing down the scope, as in fonts_size_by_table.sql and pages_link_relation.sql.
Each query starts with a preamble indicating the section, question, and normalization type, as illustrated below:
-- Section: Performance
-- Question: What is the distribution of the file size broken down by table?
-- Normalization: PagesMany queries rely on temporary functions for convenience and clarity. The functions that appear in several queries are extracted into a common file called common.sql. Whenever any of the functions defined in common.sql is used by a query, the query has the following pseudo-directive at the top:
-- INCLUDE https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/{year}/fonts/common.sqlThe pseudo-directive has to be replaced with the content of common.sql prior to executing the query in question.
In addition, queries generally have parameters, as in @date, so as to be able to run them for different configurations. The values for the parameters will have to be supplied upon execution.
All the above is taken take of automatically if the queries are executed using execute.py, which we discuss next.
The queries can be executed using the execute.py script. The results are first saved in local CSV files sitting next to the SQL files and then uploaded to the spreadsheet. In the spreadsheet, for each query, a separate sheet is created and named after the question the query answers, which is given in its preamble. If the CSV file already exists, the corresponding query is not executed. If cell A1 is already populated, the corresponding sheet is not updated.
First, ensure that the Application Default Credentials authorization strategy is configured, and that the HTTP Archive project is used as the quota project:
gcloud auth application-default login \
--scopes https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/spreadsheets
gcloud auth application-default set-quota-project httparchiveSecond, install the Python prerequisites for the script:
pip install -r requirements.txtThe script can be run for all or a subset of the queries as illustrated below:
python execute.py
python execute.py design/*.sql
python execute.py development/fonts_*.sqlBy default, it operates in a dry-run mode: it does not run the queries but prints an estimate of the amount of data that would be processed by each query. To actually run the queries, pass the --no-dry-run option as follows:
python execute.py --no-dry-run
python execute.py --no-dry-run design/*.sql
python execute.py --no-dry-run development/fonts_*.sql