From d896e22d29dec508ccea87d6b097607ed5310d50 Mon Sep 17 00:00:00 2001 From: Chris Green <75560394+chr156r33n@users.noreply.github.com> Date: Tue, 30 Dec 2025 16:32:02 +0000 Subject: [PATCH 01/10] Update seo.md SEO 2025 Chapter --- src/content/en/2025/seo.md | 1311 +++++++++++++++++++++++++++++++++++- 1 file changed, 1295 insertions(+), 16 deletions(-) diff --git a/src/content/en/2025/seo.md b/src/content/en/2025/seo.md index 38f17a689f7..6ee25bb0e34 100644 --- a/src/content/en/2025/seo.md +++ b/src/content/en/2025/seo.md @@ -1,20 +1,1299 @@ --- #See https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Authors'-Guide#metadata-to-add-at-the-top-of-your-chapters -title: SEO -description: SEO chapter of the 2025 Web Almanac covering crawlability, indexability, page experience, on-page SEO, links, AMP, internationalization, and more. -hero_alt: Hero image of various web pages beneath a search field with Web Almanac characters shine a light on the pages and make various checks. -authors: [] -reviewers: [] -analysts: [] -editors: [] -translators: [] +title: SEO +description: SEO chapter of the 2025 Web Almanac covering crawlability, indexability, page experience, on-page SEO, links, AMP, internationalization, and more. +hero_alt: Hero image of various web pages beneath a search field with Web Almanac characters shine a light on the pages and make various checks. +authors: [Amaka Chukwuma, Chris Green, Sophie Brannon] +reviewers: [Jamie Indigo] +analysts: [Augustin Delporte, Chris Green] +editors: [Michael Lewittes, Montserrat Cano, Sharon McClintic] +translators: [] results: https://docs.google.com/spreadsheets/d/1MoWoxogYWH6fv5r485EttvVgJuw7dMzzcot66X3MWu4/edit -featured_quote: ... -featured_stat_1: ... -featured_stat_label_1: ... -featured_stat_2: ... -featured_stat_label_2: ... -featured_stat_3: ... -featured_stat_label_3: ... -doi: ... + +augustin_delporte_bio: Technical SEO expert specializing in the data portion of things. 
Augustin has more than a decade of experience in the industry and has worked both agency and client side. + +amaka_chulwuma_bio: Amaka is an SEO and content strategist who has spent the last seven years shaping how brands show up online. She has worked with agencies in the UK, US, and Australia, including Whitecoat SEO and Switch Key Digital, where she builds content systems, technical SEO foundations, and search-led storytelling for clients in legal, health care, home services, and B2B. Her work reflects a mix of clarity, thoughtful strategy, and empathy. She brings those same qualities into her life outside work, especially when spending time with her daughter, who remains her favourite part of every day. + +chris_green_bio: Chris Green is a Technical Director at Torque Partnership and a search veteran of 15+ years. He advises Fortune 500 companies on search strategy and the evolving relationship between brands, algorithms, and AI systems. + +jamie_indigo_bio: Jamie Indigo isn't a robot, but speaks bot. As director of technical SEO at Cox Automotive, they study how search engines crawl, render, and index the web. Jamie loves to tame wild JavaScript and optimize rendering strategies. When not working, they like horror movies, graphic novels, and terrorizing lawful good paladins in Dungeons & Dragons. + +michael_lewittes_bio: Michael Lewittes is the founder of Ranktify, a software company that improves the quality and trustworthiness of content so that it can rise higher in search engine results as well as the LLMs. Michael previously founded and sold Gossip Cop to a PE-backed publisher, as well as wrote for and edited more than 75,000 articles for several major U.S. publications. This is the third time he's worked on the Web Almanac. + +montserrat_cano_bio: Montserrat Cano is an integrated digital manager, specialised in SEO and project management for product and ecommerce MC. International Digital Marketing. 
Montserrat brings a strategic outlook and more than 20 years' experience to drive business results in both in-house and contract roles. A business and university trainer, she also mentors professionals to democratise digital and SEO, and strengthen the industry. This is her second time contributing to the Web Almanac. + +sharon_mcclintic_bio: Sharon McClintic is a B2B SaaS content and campaigns specialist currently based in England. As the Senior Content Marketing Lead at Lumar (formerly Deepcrawl), she oversees both editorial content and ad campaigns. With a background that bridges both business strategy and creative writing, she's enthusiastic about bringing an editorial mindset to B2B communications. She holds an MBA in marketing, an MA in creative writing, and undergraduate degrees in journalism and literature, alongside 15+ years of international marketing experience across both the US and UK. You can connect with Sharon on LinkedIn. + +sophie_brannon_bio: Sophie Brannon is the co-founder & director of StudioHawk US, based in Atlanta. With over a decade of SEO experience spanning agency, in-house, and consultancy roles, she has worked across a wide range of industries including eCommerce, finance, gaming, health, and SaaS. Sophie brings a strategic, data-driven approach to organic growth, helping brands scale sustainably through technical SEO, user experience testing, content, and digital PR. She's passionate about making SEO accessible and mentoring the next generation of search marketers. + +featured_quote: As AI search reshapes how content is discovered, the web's fundamentals matter more than ever, and reassuringly, the data suggests those foundations are holding firm. + +featured_stat_1: 2.10% +featured_stat_label_1: 2.10% of mobile sites employ llms.txt files +featured_stat_2: 92% +featured_stat_label_2: HTTPS adoption reached ~91.7% (desktop) and ~91.5% (mobile), an important step up from ~89%.
+featured_stat_3: 50% +featured_stat_label_3: Structured data adoption reached 50% of all pages +doi: "" --- + +## Introduction {#introduction} + +Search Engine Optimization (SEO) continues to play a central role in how information is discovered and understood online. It encompasses the technical, structural, and content practices that determine whether a website can be effectively crawled, indexed, and surfaced in search results. + +Strong SEO foundations not only support visibility in traditional search engines but are becoming increasingly important as AI systems begin to interpret and summarize web content in new ways. + +This 2025 SEO chapter of the Web Almanac draws its data and insights from the HTTP Archive crawl, Lighthouse reports, Chrome User Experience (CrUX) reports, and custom metrics. Our goal is to document how the technical state of the web is evolving and to identify the factors that most influence organic visibility today. + +While many SEO metrics have stabilized across the web, the context surrounding them is changing rapidly. The rise of AI crawlers, emergence of `llms.txt` and a growing emphasis on machine readability suggest that optimization is no longer only about being *found* by bots, but about being *understood* by them. How will these changing contexts in online search influence optimization moving forward and how do the latest data points already reflect AI's influence on SEO? + +## Crawlability & indexability {#crawlability-&-indexability} + +For web content to gain visibility in search results, it must first be crawled and indexed by search engine crawlers. Crawlability determines whether bots can find and access a page, while indexability defines whether that page is eligible to appear in search results. Together, these concepts form the foundational elements of search visibility. A page cannot rank or be served to users if it cannot first be found and understood by the bots. 
Similarly, content cannot be cited on AI platforms unless a site is indexable. + +[Google's documentation](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt) clarifies that compliance with crawling rules depends not only on the presence of a `robots.txt` file but also on how correctly this file is structured. Search engines also apply practical limits and standardized caching behavior to ensure directives can be parsed efficiently and interpreted consistently. + +At the same time, as [Cloudflare explains](https://www.cloudflare.com/learning/bots/what-is-robots-txt/), `robots.txt` functions more as a code of conduct than a command that will always be obeyed. While reputable bots respect these signals, others may ignore them entirely. This mix of cooperation and unpredictability defines the modern crawling environment and sets the stage for examining how sites actually manage crawler access. + +### **Robots.txt** {#robots.txt} + +Serving as the web's de facto "visitors' center" for crawlers, the `robots.txt` file is where bots learn which parts of a site are open or restricted. Since the [IETF's standardization](https://www.ietf.org/about/) of the Robots Exclusion Protocol ([RFC 9309](https://datatracker.ietf.org/doc/html/rfc9309)) three years ago, its syntax, caching behavior, and error handling have been clearly defined, providing a stable framework for how crawlers interpret access rules. + +Efforts to refine that framework are ongoing. In late 2024, the [IETF introduced](https://garyillyes.github.io/ietf-rep-ext/draft-illyes-repext.html) a working draft known as REPext, which builds on RFC 9309 by exploring page-level crawl controls through response headers and HTML `meta` tags, an approach that could make future implementations more granular and flexible. + +For now, however, the `robots.txt` file remains the foundation of crawl management. Most websites now serve a valid file, with only a small minority omitting it entirely. 
Among those that do, site owners typically favor simple, universal directives rather than complex, bot-specific rules. The sections that follow examine how these preferences appear in the 2025 data. + +#### **Robots.txt status codes** {#robots.txt-status-codes} + +In 2025, 84.88% (desktop) and 84.94% (mobile) of requests for `robots.txt` files returned valid 200 status codes, up from 83.5% (desktop) and 83.9% (mobile) in 2024\. The steady rise suggests ongoing trickle-down from the 2022 standardization and wider CMS defaults that serve valid `robots.txt` files. Importantly, however, a 200 response only confirms that a file exists. It does not guarantee that its directives are correct or beneficial to the site. + +The mobile–desktop gap has effectively disappeared when it comes to valid (200) `robots.txt` status codes, with mobile now having just a 0.06% lead compared to desktop. This mirrors the industry's move away from separate m-dot sites toward responsive design with unified configurations. + +The rate of 404 errors for `robots.txt` files declined to 13.33% (desktop) and 13.21% (mobile) from 14.3% and 14.1% in 2024\. Fewer missing files imply that more sites are explicitly serving `robots.txt` files rather than leaving them absent, which would otherwise default to unrestricted crawling. + +Timeouts are \~1.0% (0.97% desktop; 1.05% mobile), 403 responses are \~0.5%, and 5xx are \~0.1%. Although uncommon, [5xx responses](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#:~:text=Other%20errors-,A%20robots.,treated%20as%20a%20server%20error.) on a `robots.txt` file can cause search engines to temporarily treat the site as blocked until a cached file or a later successful fetch is available. + +{{ figure_markup( + image="", + caption="\`robots.txt\` status codes.", + description="Bar chart showing the distribution of HTTP status codes returned when accessing `robots.txt` files. 
A 200 status code (success) is returned for 84.88% of desktop sites and 84.94% of mobile sites. A 404 status code (not found) is returned for 13.33% of desktop sites and 13.21% of mobile sites. A 403 status code (forbidden) is returned for 0.52% of desktop sites and 0.53% of mobile sites. A 500 status code (server error) is returned for 0.10% of desktop sites and 0.09% of mobile sites.", + chart_url="[https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=1134590836\&format=interactive](https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=1134590836&format=interactive)", + sheets_gid="1895020036", + sql_file="robots-txt-status-codes \-2025.sql" + ) +}} + +#### **Robots.txt file size** {#robots.txt-file-size} + +Nearly all `robots.txt` files stay well under size limits ([Google enforces a 500 KB](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#:~:text=Google%20enforces%20a%20robots.,the%20size%20of%20the%20robots.) parsing cutoff) and comply with standards such as not serving an empty file. + +A small share of sites serve completely empty `robots.txt` files, now 1.82% on desktop and 1.71% on mobile, slightly up from 2024\. While most [major crawlers treat an empty file as permissive](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#:~:text=For%20the%20first%2012%20hours,checking%20for%20a%20new%20version\).), the standard ([RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html)) doesn't define this behavior explicitly, leaving room for inconsistent handling by lesser-known bots. A safer approach is to either return a valid file or a 404 if no restrictions are intended. + +In 2025, 97.59% of desktop and 97.51% of mobile files were under 100 KB, only slightly down from 97.80% (desktop) and 97.82% (mobile) in 2024\. 
The minor change year over year points to a stable and mature implementation pattern. + +Files between 100–200 KB accounted for 0.34% (desktop) and 0.33% (mobile) `robots.txt`, and 200–300 KB for just 0.11% (both desktop and mobile) of `robots.txt`, virtually unchanged from last year. Only 0.07% (desktop) and 0.11% of (mobile) sites exceeded the 500 KB parsing cutoff enforced by Google, confirming strong adherence to crawler limits and reinforcing that oversized `robots.txt` files are an edge case. + +Robots.txt file size rarely poses a barrier to crawlability. The more pressing issue continues to be empty or misconfigured files, which introduce uncertainty in how crawlers interpret site rules. + + +{{ figure_markup( + image="", + caption="\`robots.txt\` size.", + description="Bar chart showing the file size distribution of `robots.txt` files. Files with 0-100 bytes account for 97.51% of desktop sites and 97.59% of mobile sites. Files with 100-200 bytes account for 0.34% of desktop sites and 0.33% of mobile sites. Files with 200-300 bytes account for 0.11% of desktop sites and 0.11% of mobile sites. Files with 300-400 bytes account for 0.12% of desktop sites and 0.11% of mobile sites. Files with 400-500 bytes account for 0.03% of desktop sites and 0.03% of mobile sites. 
Files larger than 500 bytes account for 0.07% of desktop sites and 0.11% of mobile sites.", + chart_url="[https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=379008836\&format=interactive](https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=379008836&format=interactive)", + sheets_gid="1895020036", + sql_file="robots-txt-size-2025.sql" + ) +}} + +#### **Robots.txt user agent usage** {#robots.txt-user-agent-usage} + +The catch-all user agent `*` is the most common approach to crawler directives found in `robots.txt` files today. In 2025, it appeared in 77.04% of desktop and 77.14% of mobile files—up from 76.6% (desktop) and 76.9% (mobile) in 2024, and from the mid-70% range in 2022\. The steady rise indicates that most site owners prefer to implement broad, universal rules rather than maintaining complex bot-specific instructions. + +Where specific user agents *are* mentioned in `robots.txt` files, Google's advertising crawler (adsbot-google) and AhrefsBots take the lead again, as they did last year. + +Targeting of `adsbot-google` in `robots.txt` directives rose to 9.82% (desktop) and 9.51% (mobile) in 2025, compared with 9.1% (desktop) and 8.9% (mobile) in 2024\. + +AhrefsBot also remained a leading named crawler, specified in 9.29% of desktop and 9.50% of mobile `robots.txt` files. Combined with AhrefsSiteAudit (4.57% desktop / 4.27% mobile), this reflects the ongoing importance of controlling access by SEO tools, which can generate significant crawl activity. 
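Directives like these can also be sanity-checked programmatically. As an illustrative sketch (the rules and URLs below are hypothetical, not drawn from the dataset), Python's standard-library `urllib.robotparser` shows how an RFC 9309-compliant crawler would combine a universal group with a bot-specific group:

```python
from urllib import robotparser

# Hypothetical robots.txt: a universal rule for all crawlers plus a
# stricter group for a named SEO crawler, as discussed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: AhrefsBot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Unnamed crawlers fall back to the * group...
print(parser.can_fetch("SomeBot", "https://example.com/"))           # True
print(parser.can_fetch("SomeBot", "https://example.com/private/x"))  # False
# ...while the named group overrides * entirely for AhrefsBot.
print(parser.can_fetch("AhrefsBot", "https://example.com/"))         # False
```

Note that this models compliant behavior only; as discussed earlier, not every bot honors the file it is served.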
+ +Other named crawlers that appeared in notable volumes this year include: + +* MJ12Bot (Majestic): 7.31% desktop / 7.28% mobile + +* Googlebot: 6.22% desktop / 6.66% mobile + +* Nutch: 5.03% desktop / 4.81% mobile + +##### **Bingbot rarely named in robots.txt** {#bingbot-rarely-named-in-robots.txt} + +Bingbot ranks 22nd among named user agents in `robots.txt` files and appears in less than 3% of `robots.txt` (2.67% desktop and 2.57% mobile). When a bot appears in `robots.txt` files, it means website managers care enough about that crawler to explicitly control its behavior, either allowing it, restricting it, or setting crawl rates. Low appearance rates suggest benign neglect. Despite Microsoft's massive [investment in AI](https://blogs.microsoft.com/on-the-issues/2025/01/03/the-golden-opportunity-for-american-ai/) and its integration of [ChatGPT into Bing](https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/), the crawler itself hasn't become more prominent in `robots.txt` files. Even with its AI enhancements, Bing's web footprint and importance to site operators remain largely unchanged, a quiet contrast to the rapid rise of AI-focused crawlers like GPTbot named in `robots.txt` files (more on that below). + +##### **Slight growth in use of named user agents** {#slight-growth-in-use-of-named-user-agents} + +Overall, compared to 2024, a slightly larger share of `robots.txt` files in 2025 include directives for named crawlers rather than relying solely on the universal wildcard \*. For instance, MJ12Bot rose from 6.6% (mobile) last year to 7.3% (mobile) in 2025, Googlebot rose from 6.4% (mobile) to 6.7% (mobile), and Nutch from 4.3% (mobile) to 4.81% (mobile) this year. These modest gains point to gradual refinement — more site owners are setting tailored crawl rules where it matters, without moving away from the simplicity of universal controls. 
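Shifts like these can be measured by scanning `robots.txt` files for the user-agent tokens they name. A minimal Python sketch of that counting approach follows; the Almanac's actual analysis runs as SQL over the HTTP Archive dataset, so this function and its sample file are illustrative only:

```python
import re

def named_user_agents(robots_txt: str) -> set[str]:
    """Return the user-agent tokens declared in a robots.txt file,
    lowercased so that 'GoogleBot' and 'googlebot' are counted once."""
    return {
        match.group(1).strip().lower()
        for match in re.finditer(r"(?im)^user-agent:\s*([^#\n]+)", robots_txt)
    }

# Illustrative file mixing the universal wildcard with named crawlers.
sample = """\
User-agent: *
Disallow: /search

User-agent: MJ12Bot
User-agent: Googlebot
Disallow: /internal/
"""

print(named_user_agents(sample))  # {'*', 'mj12bot', 'googlebot'} (set order varies)
```

Aggregating such sets across millions of sites yields per-bot percentages like those reported in this section.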
+ +The continued dominance of `*` alongside rising mentions of specific bots in `robots.txt` suggests a pragmatic balance. Universal directives remain the norm, but targeted rules are added where business concerns justify them. Not all crawlers interpret `*` consistently. Google's AdsBot ignores it, and Applebot falls back to Googlebot rules before applying `*` making explicit targeting necessary in certain cases. + + +{{ figure_markup( + image="", + caption="\`robots.txt\` user agents.", + description="Bar chart showing the most common user agents specified in `robots.txt` files. The wildcard user agent (\\\*) appears in 77.04% of desktop sites and 77.14% of mobile sites. adsbot-google appears in 9.82% of desktop sites and 9.51% of mobile sites. ahrefsbot appears in 9.29% of desktop sites and 9.50% of mobile sites. mj12bot appears in 7.31% of desktop sites and 7.28% of mobile sites. googlebot appears in 6.22% of desktop sites and 6.66% of mobile sites. nutch appears in 5.03% of desktop sites and 4.81% of mobile sites. dotbot appears in 4.58% of desktop sites and 5.02% of mobile sites. adsbot-google-mobile appears in 4.76% of desktop sites and 4.71% of mobile sites. pinterest appears in 4.59% of desktop sites and 4.35% of mobile sites. ahrefssiteaudit appears in 4.57% of desktop sites and 4.27% of mobile sites. 
Additional user agents are shown in the chart.", + chart_url="[https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=315238915\&format=interactive](https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=315238915&format=interactive)", + sheets_gid="1895020036", + sql_file="robots-txt-user-agent-usage-2025.sql" + ) +}} +*Description:* + +{{ figure_markup( + image="", + caption="\`robots.txt\` SEO tool related user agents.", + description="Bar chart showing the prevalence of SEO crawler bots mentioned in `robots.txt` files. AhrefsBot appears in 9.29% of desktop sites and 9.50% of mobile sites. AhrefsSiteAudit appears in 4.57% of desktop sites and 4.27% of mobile sites. MJ12Bot appears in 7.31% of desktop sites and 7.28% of mobile sites. SEMrushBot appears in 3.01% of desktop sites and 3.05% of mobile sites.", + chart_url="[https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=646503191\&format=interactive](https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=646503191&format=interactive)", + sheets_gid="1895020036", + sql_file="robots-txt-user-agent-usage-2025.sql" + ) +}} + +##### **AI crawlers named in robots.txt** {#ai-crawlers-named-in-robots.txt} + +The rise of AI crawlers has transformed `robots.txt` from a traditional search optimization tool into a broader mechanism for managing content permissions. In 2025, directives targeting AI-related bots showed substantial growth from 2024 levels. 
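These AI-focused directives usually take a simple shape. The fragment below is a hypothetical `robots.txt` that leaves ordinary crawling unrestricted while opting named AI crawlers out of content collection; each vendor documents which user-agent token its bot honors:

```
# Ordinary crawlers: unrestricted
User-agent: *
Disallow:

# Named AI crawlers: denied everything
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Blocking Google-Extended, for example, restricts use of a site's content for Gemini training without affecting Googlebot's search crawling, which is exactly the kind of distinction driving this growth.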
+ +The year-over-year comparison reveals accelerating adoption: + +* **GPTBot**: 4.49% (desktop), 4.19% (mobile), up from 2.9% (desktop) and 2.7% (mobile) in 2024, representing a ~55% increase +* **CCBot**: 3.50% (desktop), 3.23% (mobile), up from 2.7% (desktop) and 2.4% (mobile) in 2024 +* **PetalBot**: 3.96% (desktop), 4.38% (mobile) this year; it was not separately tracked in 2024 +* **ClaudeBot**: 3.64% (desktop), 3.43% (mobile), up from 1.9% (desktop) and 1.6% (mobile) in 2024, nearly doubling + +Notably, 2024's data included broader categories like "anthropic-ai" (2.0% desktop and 1.7% mobile last year) and "chatgpt-user" (2.0% desktop and 1.7% mobile last year), while our 2025 data shows more specific bot targeting. + +Other entrants in the AI crawler space this year include **Amazonbot** (3.34% desktop, 3.0% mobile), **Google-Extended** (3.37% desktop, 2.96% mobile), **PerplexityBot** (2.81% desktop, 2.65% mobile), **ChatGPT-User** (2.84% desktop, 2.50% mobile), **FacebookBot** (2.86% desktop, 2.46% mobile), and **Meta-ExternalAgent** (2.81% desktop, 2.48% mobile). + +**While overall user agent targeting has grown only gradually year over year, the adoption of AI crawlers has been far more abrupt.** GPTBot's roughly 55% growth since 2024, alongside ClaudeBot's near-doubling and measurable adoption of PetalBot and others, represents one of the fastest expansions of robots.txt directives for named user agents in recent memory, moving from a marginal presence in 2023 to multi-percent adoption rates by 2025. + +This shift introduces new complexity for site owners. Instead of only asking, "Should this page be indexed for search?" they must now also ask, "Should this content be used to train AI models?" These are distinct considerations with different business and ethical implications. + +Robots.txt has become a dual-purpose control point, balancing visibility in search with protection against large-scale data harvesting.
This trend is likely to intensify as AI models and crawlers proliferate. + +{{ figure_markup( + image="", + caption="\`robots.txt\` AI-related user agents.", + description="Bar chart showing the growth in named user agent usage in `robots.txt` files. GPTBot appears in 4.49% of desktop sites and 4.19% of mobile sites. PetalBot appears in 3.96% of desktop sites and 4.38% of mobile sites. ClaudeBot appears in 3.64% of desktop sites and 3.43% of mobile sites. CCBot appears in 3.50% of desktop sites and 3.23% of mobile sites. AmazonBot appears in 3.34% of desktop sites and 3.00% of mobile sites. Google-Extended appears in 3.37% of desktop sites and 2.96% of mobile sites. PerplexityBot appears in 2.81% of desktop sites and 2.65% of mobile sites. ChatGPT-User appears in 2.84% of desktop sites and 2.50% of mobile sites. FacebookBot appears in 2.86% of desktop sites and 2.46% of mobile sites. Meta-ExternalAgent appears in 2.81% of desktop sites and 2.48% of mobile sites.", + chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=725000646&format=interactive", + sheets_gid="1895020036", + sql_file="robots-txt-user-agent-usage-2025.sql" + ) +}} + +## llms.txt {#llms.txt} + +The `llms.txt` file has been proposed as a new standard "to provide information to help LLMs use a website at inference time" (per [llmstxt.org](http://llmstxt.org)). This text file contains a highly simplified version of the website's content in Markdown format, with a view toward making it easier for LLMs to ingest and subsequently use in generated responses. + +This standard has, it must be noted, not been adopted widely and has become a point of controversy within the broader SEO industry.
Google has often stated that it does not use `llms.txt`, and [no Google service currently does](https://www.seroundtable.com/google-ai-llms-txt-39607.html). Anthropic, however, has [taken a lead on `llms.txt`](https://docs.claude.com/llms-full.txt), raising optimism that the format may evolve into a reliable mechanism for managing and optimizing content utilization during model inference. + +### llms.txt adoption rate {#llms.txt,-adoption-rate} + +Regardless of whether using `llms.txt` proves to be a valid approach for SEO or AI optimization, the introduction of a new file format and proposed standard like this could influence how websites are built and optimized moving forward. + +As part of the 2025 Almanac, we have introduced monitors to assess the level of `llms.txt` adoption across the web. + +{{ figure_markup( + image="", + caption="Valid llms.txt", + description="Bar chart showing the adoption of `llms.txt` files, which appear in 2.13% of desktop sites and 2.10% of mobile sites. This leaves 97.87% of desktop sites and 97.90% of mobile sites without an `llms.txt` file.", + chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=637355278&format=interactive", + sheets_gid="1895020036", + sql_file="llms-status-2025.sql" + ) +}} + +We can see that the adoption rate is relatively low at 2.13% (desktop) and 2.10% (mobile), but these figures are arguably still notable given how new the `llms.txt` format is and that a total of 583k valid files were identified across all sites analyzed (desktop and mobile combined). + +Digging into the content of these `llms.txt` files, we see some clues about the adoption of this standard so far, and about what is likely to move it forward in the future.
+ +* 39.6% of `llms.txt` files are related to All in One SEO +* 3.6% of `llms.txt` files are related to Yoast SEO + +This was gleaned from the comments these CMS extensions leave (by default) in the files they generate, and it shows that a significant share of website owners with an `llms.txt` file in place are having it generated by their CMS or add-on extensions. We therefore cannot be sure whether every file represents a conscious endorsement of the `llms.txt` standard or an unintentional inclusion. + +Over the last 9+ months (since January 2025), [interest in this emerging standard has grown](https://trends.google.com/trends/explore?q=llms.txt&hl=en-GB), though its future still hinges on recognition from just one or two key AI companies. The 2026 numbers will be interesting to see, nonetheless. + +## Robots directives {#robots-directives} + +A robots directive provides page-level control over how a specific page is indexed and displayed in search results. While they have a similar function to robots.txt files, the two serve different purposes: + +* Robots *directives* influence **indexing and serving** + +* robots.txt governs **crawling** + +For a directive to be applied, the crawler must be able to access the page. If a page is blocked by robots.txt, its directives may never be seen or obeyed. + +#### Robots directives implementation {#robots-directives-implementation} + +There are two main ways to implement robots directives: + +1. Using a [meta robots](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/meta/name/robots) tag (placed within the `<head>`
section of a webpage) +2. Using an [x-robots](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/X-Robots-Tag) HTTP header + +The method you choose depends on your specific use case, as well as the means and methods at your disposal. + +Meta robots tags are primarily for HTML pages and are widely supported by most major CMSs, either natively or through add-ons, as well as by other common, well-supported frameworks. + +X-Robots-Tag headers have a major advantage in that they can be implemented for non-HTML file types, such as PDFs or other documents. However, setting them is often not as easy for CMS users, so they may not always be an option. + +{{ figure_markup( + image="", + caption="Robots directive implementation", + description="Bar chart showing robots directive implementation methods across websites. Meta Robots directives are used on 47.0% of desktop sites and 47.9% of mobile sites. X-Robots-Tag directives are used on 0.6% of desktop sites and 0.7% of mobile sites. Both Meta Robots and X-Robots-Tag directives are used together on 0.4% of desktop sites and 0.4% of mobile sites.", + chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=736143902&format=interactive", + sheets_gid="1895020036", + sql_file="seo-stats-2025.sql" + ) +}} + +By a large margin, Meta robots is the most widely used method for implementing robots directives, appearing on 47% of desktop pages and 47.9% of mobile pages. X-Robots-Tag comes in a distant second at 0.6% (desktop) and 0.7% (mobile). + +The number of pages implementing meta robots increased from the 45.5% (desktop) and 46.2% (mobile) recorded last year to 47% (desktop) and 47.9% (mobile) in 2025, showing growth year-over-year (YoY).
In contrast, X-Robots-Tag implementation has stayed roughly the same YoY. Inner pages are more likely to use a robots directive (50%) than home pages (46%). + +Some pages (0.4% on both desktop and mobile) have both Meta robots and X-Robots-Tag implemented at the same time. This figure is stable from last year, but the practice is not widely recommended, as it increases the likelihood of generating conflicting signals between the two methods of implementation. + +#### Robots directives rules {#robots-directives-rules} + +The method of implementation is only part of the picture when it comes to robots directives; [the directive rules](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives) determine how the page or document should be handled. + +Rules can be added to either Meta robots or X-Robots-Tag as comma-separated values. + +For our study of directive rules, we relied on the rendered HTML. + +{{ figure_markup( + image="", + caption="Robots directive rules", + description="Bar chart showing the implementation of robots directives across pages. The follow directive appears in 64.0% of desktop sites and 60.5% of mobile sites. The index directive appears in 69.0% of desktop sites and 59.3% of mobile sites. The nofollow directive appears in 2.4% of desktop sites and 2.8% of mobile sites. The noindex directive appears in 3.5% of desktop sites and 2.4% of mobile sites. The max-image-preview directive appears in 5.0% of desktop sites and 2.8% of mobile sites. The max-snippet directive appears in 2.2% of desktop sites and 1.1% of mobile sites. The max-video-preview directive appears in 1.6% of desktop sites and 0.8% of mobile sites. The noarchive directive appears in 2.5% of desktop sites and 1.8% of mobile sites. The nosnippet directive appears in 0.1% of desktop sites and 0.1% of mobile sites. The notranslate directive appears in 0.0% of desktop sites and 0.0% of mobile sites.
Additional directives are shown in the chart. +", + chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=779209791&format=interactive", + sheets_gid="1895020036", + sql_file="robots-meta-usage-2025.sql" + ) +}} + +*Follow* and *index*, the two most-used rules, are the default values (and are ignored by Google) in the absence of the opposing rules, *noindex* and *nofollow*. + +Their inclusion means that robots should index the page and follow the links from it. Their mobile usage of 60.5% (follow) and 59.3% (index) suggests these two rules are usually found together and are generally complementary. + +A likely cause of this high number of technically unnecessary Meta robots rules is Yoast SEO, [which applies "index,follow" by default](https://developer.yoast.com/features/seo-tags/meta-robots/functional-specification/#:~:text=Unless%20otherwise%20defined%20by%20the%20user%20\(or%20via%20page/template/filtering%20logic\)%2C%20%7B%7Bvalues%7D%7D%20outputs%20index%2C%20follow.). Yoast has approximately 16% adoption (desktop and mobile) when looking at home page use of SEO tools/plugins and, of [all identified SEO tools](https://www.wappalyzer.com/technologies/seo/), it accounts for nearly 70% of usage. + +Nofollow and noindex, the next two most-used Meta robots rules, appear at a considerably lower frequency, showing up on 2.8% and 2.4% of mobile pages, respectively. + +{{ figure_markup( + image="", + caption="SEO Tools", + description="Bar chart showing the most common SEO tools and plugins detected across websites. Yoast SEO appears in 15.96% of desktop sites and 15.49% of mobile sites.
RankMath SEO appears in 3.56% of desktop sites and 3.60% of mobile sites. All in One SEO appears in 2.96% of desktop sites and 2.92% of mobile sites. Yoast SEO Premium appears in 1.42% of desktop sites and 1.30% of mobile sites. Ahrefs appears in 0.28% of desktop sites and 0.25% of mobile sites. The SEO Framework appears in 0.18% of desktop sites and 0.15% of mobile sites. SEOmatic appears in 0.07% of desktop sites and 0.05% of mobile sites. Avada SEO appears in 0.06% of desktop sites and 0.06% of mobile sites. Yoast SEO for Shopify appears in 0.03% of desktop sites and 0.02% of mobile sites. BrightEdge appears in 0.01% of desktop sites and 0.01% of mobile sites. Additional SEO tools and plugins are shown in the chart.", + chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=1233372001&format=interactive", + sheets_gid="1895020036", + sql_file="wordpress_seo_plugin-2025.sql" + ) +}} + +Another element of Meta robots directives is the "name" attribute, which lets us address directives to a specific robot/crawler user-agent. We can, for example, tailor behavior when some crawlers require specific rules and others do not. + +The most-named crawlers within this attribute are consistent with 2024: Bingbot, MSNbot, Googlebot, and Googlebot-News appear most frequently, alongside the default "robots." + + +{{ figure_markup( + image="", + caption="Robots directive rules by name", + description="A bar chart comparing robots directive rules by crawler named in robots directives for mobile pages. The named targets are the default robots, Bingbot, MSNBot, Googlebot, and Googlebot-News. The values were applied as follows: follow: 94.9%, 78.8%, 61.0%, 75.6%, 54.0%. index: 92.6%, 66.8%, 60.7%, 77.3%, 58.3%.
nofollow: 0.8%, 3.9%, 1.9%, 2.4%, 3.9%. noindex: 1.4%, 5.0%, 3.5%, 3.5%, 9.1%. max_image_preview: 84.4%, 0.5%, 71.1%, 25.2%, 3.8%. max_snippet: 84.4%, 0.5%, 42.1%, 24.9%, 1.6%. max_video_preview: 84.2%, 0.5%, 42.1%, 24.6%, 1.2%. noarchive: 2.0%, 3.5%, 0.9%, 4.5%, 1.1%. nosnippet: 0.1%, 0.1%, 0.1%, 0.8%, 5.5%. notranslate: 0.0%, 0.0%, 0.0%, 1.2%, 0.1%. noimageindex: 0.1%, 0.0%, 0.1%, 0.7%, 0.1%", + chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=1906268341&format=interactive", + sheets_gid="1895020036", + sql_file="robots-meta-usage-2025.sql" + ) +}} + +MSNbot is a legacy crawler that has been replaced by Bingbot. Its continued presence in robots directives suggests a delay in updating or removing outdated crawler names. Additional evidence of this lag can be seen in the fact that newer robots rules, such as "max_image_preview", "max_snippet", and "max_video_preview", are commonly applied to Googlebot and Bingbot, but not to MSNbot. + +### Indexifembedded tag {#indexifembedded-tag} + +A now well-established Meta robots rule, **indexifembedded** is a highly specific tag that lets us specify when we want iframe content to be treated as part of the page in which it is embedded. It must be paired with "noindex" in order to work, and it is a rule that only Google currently supports. + +{{ figure_markup( + image="", + caption="\`indexifembedded\` usage in \`iframe\` content", + description="Bar chart showing the usage of the \`indexifembedded\` rule within embedded \`iframe\` content.
\`iframe\` elements with \`indexifembedded\` usage appear in 88.89% of desktop sites and 87.67% of mobile sites.", + chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=489480572&format=interactive", + sheets_gid="1895020036", + sql_file="robots-meta-usage-2025.sql" + ) +}} + +Within iframe content, the indexifembedded rule is found almost 90% of the time (88.89% on desktop, 87.67% on mobile). Interestingly, after rising in use from 2022 to 2024, its usage declined in 2025 from the 99.9% recorded last year. + +### Invalid `<head>` elements {#invalid--elements} + +Search engine crawlers follow HTML standards when parsing content. One issue they may encounter is invalid HTML elements within the `<head>` of the page. This can cause the `<head>` to be treated as implicitly ending early, with all remaining `<head>` elements then included within the `<body>` of the page. + +Negative impacts to SEO are greatest when important metadata, such as the `title`, `canonical` tags, hreflang, and Meta robots directives, are located after the invalid element (as their inclusion within the `<body>` renders them ineffective). + + +{{ figure_markup( + image="", + caption="Pages with invalid HTML in \`<head>\`", + description="Chart showing pages with invalid HTML in the head section.
Invalid HTML elements are found in 10.10% of desktop sites and 10.33% of mobile sites.", + chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vQUdZ1uaX5U0oLrHlWn8iYc1dhPthw59zy20QFdsYCgky7zaesRm8ctLSxQ9zjlapXCjo6Xd29w_xmB/pubchart?oid=1130119859&format=interactive", + sheets_gid="1895020036", + sql_file="invalid-head-sites-2025.sql" + ) +}} + +Invalid `<head>` elements found this year continue the downward trend we saw in 2024's data; that is, we are once again seeing fewer invalid elements in pages' `<head>` sections year over year. + +In 2025, 10.10% of desktop pages and 10.33% of mobile pages contain invalid `<head>` elements. Compared to 2024's data (10.6% on desktop and 10.9% on mobile), this represents relative drops of 4.7% (desktop) and 5.2% (mobile). + +"Invalid" `<head>` elements are anything included in the page's `<head>` that is not part of the W3C standard. There are eight valid elements that may be used in the `<head>`, according to [Google Search documentation](https://developers.google.com/search/docs/crawling-indexing/valid-page-metadata#use-valid-elements-in-the-head-element). These are: + +* \