Having spent 25 years analysing Google’s results pages, I regard the leak of the PerDocData model as nothing short of a Rosetta Stone.
At Hobo, I’ve always worked from the evidence we had, and to be fair, my analysis concludes that Google spokespeople have been open about this stuff in past briefings, albeit scattered across the web – but this leak provides the blueprint SEOs have been looking for.
This article analyses the PerDocData structure, which can be understood as the comprehensive ‘digital dossier’ Google keeps on every single URL it indexes. It’s the central repository, the master file, that consolidates the vast array of signals we’ve spent our careers trying to influence.
This is no longer theory; this is documented data architecture.
The most critical finding from my analysis confirms a long-standing debate within the SEO community. Google’s ranking process is not a single, monolithic algorithm. Instead, it’s a pipeline.
A URL first achieves a relevance-based ranking, but crucially, it is then subjected to a series of re-ranking systems – internally called “Twiddlers” – that prioritise user-centric and quality-focused signals.
This is the mechanism behind why a keyword-optimised but low-quality page can get an initial foothold but will ultimately fail to maintain visibility.
For me and my team, this architecture fundamentally validates the strategic shift we’ve been advocating for years. The game is no longer just about establishing relevance. To achieve and maintain top rankings, you must prove your value to the subsequent “Twiddlers.”
This means our focus on user experience, content quality, and demonstrable authority isn’t just best practice—it’s a direct response to how Google’s core ranking pipeline is built.
Key Findings Synopsis
- Multi-Layered Ranking Confirmed: The search process involves an initial ranking pass by a system named Mustang, followed by a series of powerful re-ranking functions (“Twiddlers”) that adjust results based on factors like user engagement, freshness, and quality. Success requires optimising for both stages.
- Tiered Indexing System Named: The documentation confirms a tiered indexing system with specific names: “Base, Zeppelins, and Landfills.” A document’s scaledSelectionTierRank determines its position within these tiers, directly impacting its ranking potential.
- Site-Level Authority is a Core Metric: Google calculates and stores potent site-level authority signals, referred to internally as siteAuthority and NSR (Normalized Site Rank). These metrics contextualise the value of any individual page, confirming that the reputation of the entire domain is a critical, non-negotiable component of ranking potential.
- User Clickstream Data is a Direct Ranking Input: The system stores granular, document-level click data, including GoodClicks, BadClicks, and LastLongestClicks. This provides definitive evidence that user engagement and post-click behaviour are direct inputs into ranking systems, validating the long-held hypothesis that user experience directly influences search performance.
- Freshness Signals are Sophisticated and Nuanced: Google’s assessment of content freshness goes far beyond simple publication dates. It employs semantic analysis to understand the temporal context of content and uses a lastSignificantUpdate signal to differentiate between minor edits and substantial revisions, rewarding genuine content improvement.
- Semantic Understanding Has Replaced Keyword Matching: The architecture is built around entity recognition (EntityAnnotations) and machine learning-derived vector embeddings (site2vecEmbeddingEncoded). This marks a definitive shift from a keyword-centric model to one that understands topics, concepts, and the relationships between them, making topical authority a mathematically calculated attribute.
- Internal Link Equity is Calculated via “Simulated Traffic”: A metric named onsiteProminence measures a page’s importance within its own site by simulating user traffic flow from the homepage and other high-traffic pages, confirming the critical role of internal linking strategy.
The Content Warehouse: Google’s Digital Brain
Architectural Context
To comprehend the significance of PerDocData, one must first understand its environment: the Google Content Warehouse (leaked in 2024).
This is not a simple database but a vast, sophisticated Application Programming Interface (API) and toolset designed for storing, managing, and analysing the web at an immense scale. It serves as the central repository where Google processes and organises all information it gathers about web content, acting as the foundational data layer for its search algorithms.
The Search Processing Pipeline
A document’s journey from discovery to being served in search results is a multi-stage pipeline. PerDocData is the data object that is populated and referenced throughout this process.
Crawling & URL Discovery
The process begins with Google discovering URLs through various means, including following links from known pages, processing submitted sitemaps, and other proprietary methods. This is the initial entry point into the system.
Indexing & Storage
Once a URL is discovered and crawled, its content is fetched, rendered, and analysed. The processed document and its associated metadata are then stored in a suite of indexing systems. The documentation points to TeraGoogle as a primary system for long-term storage, with other systems like Alexandria also playing a role.
A critical component of this stage is a system named SegIndexer, which is responsible for placing documents into different tiers within the index. The scaledSelectionTierRank attribute provides a direct window into this system, confirming the long-held theory that Google maintains a tiered index with specific internal names: “Base, Zeppelins, and Landfills.”
A document’s rank within these serving tiers is a language-normalised score, indicating its fractional position within the index quality hierarchy.
This architecture dictates that links from documents residing in higher-quality tiers (like Base) carry significantly more weight than those from lower tiers (like Landfills).
This creates a distinct “link equity economy” where the value of a backlink is determined not just by the authority of the linking page itself, but also by the indexed “neighbourhood” it inhabits.
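As a thought experiment, this tier-weighted “link equity economy” can be sketched in a few lines. The tier names come from the documentation; the numeric weights are entirely my own invention for illustration.

```python
# Hedged sketch: the value a backlink passes, scaled by the index tier of
# the linking page. Tier names are from the leak; the weights are assumed.
TIER_WEIGHT = {"Base": 1.0, "Zeppelins": 0.4, "Landfills": 0.05}

def link_value(source_page_authority, source_tier):
    """Equity passed by a link = source authority scaled by its tier weight."""
    return source_page_authority * TIER_WEIGHT[source_tier]
```

Under this model, two links from pages of identical authority pass very different value depending on the indexed “neighbourhood” of the linking page.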
Initial Ranking (Mustang)
After indexing, the initial scoring and ranking of documents are handled by a primary system called Mustang. This system conducts the first-pass evaluation, creating a provisional set of results based on a multitude of signals stored within the PerDocData object for each document. This stage likely focuses on core relevance and foundational authority signals.
Re-ranking (Twiddlers)
The process does not end with Mustang. The provisional results are passed to a powerful subsequent layer of the system known as “Twiddlers.” These are re-ranking functions that adjust the order of search results after Mustang’s initial ranking is complete. Twiddlers act as a fine-tuning mechanism, applying boosts or demotions based on specific, often dynamic, criteria. Examples referenced in the documentation include a FreshnessTwiddler, which boosts newer content, and a QualityBoost function. Another specific example is the SiteBoostTwiddler, which likely uses site-level signals to adjust rankings.
This multi-stage architecture reveals that search engine optimisation is not about solving for a single algorithm. It is a multi-stage optimisation problem. A document must first possess strong foundational relevance signals to pass the initial Mustang ranking.
Subsequently, it must exhibit the specific qualities – such as demonstrable user engagement, freshness for time-sensitive queries, or exceptional page experience – to be promoted by the various Twiddlers.
A page could rank well in the initial pass but be demoted by a Twiddler if it generates poor user click signals, or fail to be boosted if it is not considered fresh for a query that deserves it. A successful SEO strategy must therefore cater to both stages of this process.
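The two-stage shape of this pipeline can be sketched in Python. Everything here is a toy model under my own assumptions – the names echo Mustang and the Twiddlers, but the scoring logic is invented purely to show the structure, not Google’s actual code.

```python
# Toy two-stage ranking pipeline: an initial relevance pass ("Mustang"),
# then a chain of re-ranking functions ("Twiddlers"). All scoring logic
# here is assumed for illustration.

def mustang_initial_rank(docs, query_terms):
    """First pass: order documents by a naive term-frequency relevance score."""
    def relevance(doc):
        return sum(doc["text"].lower().count(t) for t in query_terms)
    return sorted(docs, key=relevance, reverse=True)

def freshness_twiddler(ranked):
    """Boost fresh documents by moving each one up a slot past stale ones."""
    ranked = list(ranked)
    for i in range(1, len(ranked)):
        if ranked[i]["is_fresh"] and not ranked[i - 1]["is_fresh"]:
            ranked[i - 1], ranked[i] = ranked[i], ranked[i - 1]
    return ranked

def quality_twiddler(ranked):
    """Demote documents below an (assumed) quality threshold to the bottom."""
    good = [d for d in ranked if d["quality"] >= 0.5]
    poor = [d for d in ranked if d["quality"] < 0.5]
    return good + poor

def rank(docs, query_terms, twiddlers):
    ranked = mustang_initial_rank(docs, query_terms)
    for twiddler in twiddlers:  # each twiddler adjusts the previous order
        ranked = twiddler(ranked)
    return ranked

docs = [
    {"url": "a", "text": "seo guide seo tips", "is_fresh": False, "quality": 0.9},
    {"url": "b", "text": "seo seo seo seo", "is_fresh": False, "quality": 0.2},
    {"url": "c", "text": "seo news", "is_fresh": True, "quality": 0.8},
]
final_order = [d["url"] for d in rank(docs, ["seo"], [freshness_twiddler, quality_twiddler])]
```

Note how the keyword-stuffed page “b” wins the first pass on raw relevance, then drops to the bottom once the quality twiddler runs – exactly the pattern described above for optimised-but-low-quality pages.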
Understanding the Vessel: PerDocData as a Protocol Buffer
Technical Primer on Protocol Buffers (Protobuf)
The PerDocData object is structured as a Protocol Buffer, or Protobuf. This is a language-neutral, platform-neutral, and extensible mechanism for serialising structured data, developed and used extensively across Google’s infrastructure. Its selection is not arbitrary; it is critical for operating at Google’s scale. Key characteristics that make it suitable include:
- Efficiency: Protobuf is significantly smaller, faster, and simpler to process than alternative formats like XML or JSON. This allows for compact data storage and extremely fast parsing, which is essential when dealing with trillions of documents.
- Structure: Data schemas, known as “messages,” are strictly defined in .proto files. This enforces strong data typing and a consistent structure, ensuring that different systems interacting with the data do so reliably.
- Extensibility: The Protobuf format is designed for seamless evolution. New fields and data points can be added to the message definition without breaking older systems or invalidating existing data, allowing Google to continuously add new signals to its models without re-architecting the entire system.
The Role of PerDocData
Within the Content Warehouse, the PerDocData model is arguably the most interesting and critical Protobuf message for SEO analysis. It is the primary container for the vast majority of document-level signals used for indexing and serving search results. It is a key component of a larger CompositeDoc message, which aggregates all known information about a single URL. PerDocData is where on-page factors, quality scores, spam signals, freshness metrics, and user engagement data are stored and made accessible to the ranking pipeline.
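Although the leaked definitions are Protobuf messages, the core idea – a strictly typed, extensible per-document container – can be approximated in Python. The handful of fields below is an illustrative subset with names drawn from attributes discussed in this article; it is emphatically not the real schema.

```python
# A tiny stand-in for the idea of PerDocData as a typed container.
# Field selection and types are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PerDocDataSketch:
    """Illustrative subset of document-level signals; the real PerDocData
    is a Protocol Buffer with hundreds of fields."""
    url: str
    site_authority: Optional[int] = None            # cf. siteAuthority
    scaled_selection_tier_rank: Optional[int] = None
    last_significant_update: Optional[int] = None   # unix timestamp
    spam_signals: dict = field(default_factory=dict)

doc = PerDocDataSketch(url="https://example.com/", site_authority=73)
```

The Protobuf property that matters most for this article – new fields can be added without breaking consumers of older records – is mirrored here by every field after `url` being optional with a default.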
High-Impact Modules and Attributes in PerDocData
Authority & Trust
This category includes signals that measure the overall trust, authority, and reputation of a page or an entire domain. These are foundational to Google’s assessment of a source’s reliability.
Content Quality
These attributes focus on the quality, originality, and value of the content on the page itself, separate from site-level metrics.
Spam Detection
This group of signals is dedicated to identifying and filtering manipulative or low-value content designed to cheat the ranking systems.
User Engagement & Behaviour
These signals are derived from user interactions with the search results, providing direct feedback on the relevance and satisfaction of a given page.
Freshness & Timeliness
These attributes help Google determine how important recent information is for a given query and how up-to-date a specific document is.
Semantic & Topical Relevance
This category covers how Google understands the meaning, topics, and intent behind the content on a page, moving beyond simple keywords.
Technical & Page Experience
These signals relate to a page’s technical health, accessibility, and the user’s experience interacting with it, including speed and mobile-friendliness.
Geographic & Language Signals
This group includes attributes essential for local search and serving results to a global audience in the correct language.
Specialised Content & Niche Signals
These are classifiers and data stores for specific types of content that require unique ranking considerations, such as books, videos, or scientific papers.
Quantifying Authority and Trust
Site-Level Authority Signals
The PerDocData model provides clear evidence that Google’s evaluation of a document is heavily contextualised by the authority of the domain on which it resides. This transcends individual page metrics and points to a holistic, site-wide assessment.
The existence of attributes like siteAuthority and references to NSR (Normalized Site Rank) confirms that Google calculates a proprietary, site-level quality score. NSR is described as a sophisticated system for evaluating a website’s overall reliability, integrating a multitude of factors to assign a score that directly influences search rankings.
This definitively proves that while Google representatives correctly state they do not use third-party metrics like Moz’s Domain Authority, they have their own internal, and far more complex, equivalent. The long-debated concept of “domain authority” is therefore not a myth; it is a core, calculated metric within the Content Warehouse. This means that strategic activities aimed at building sitewide trust, brand recognition, and a clean backlink profile have a direct, measurable impact on a data point used in ranking. Further evidence of this holistic evaluation comes from attributes like fireflySiteSignal, an internal project name for another set of site-level signals that contribute to ranking changes.
The PageRankPerDocData module confirms that PageRank, while no longer a public-facing metric, remains a core ranking system. The documentation also references homepagePagerankNs, indicating that the PageRank of a site’s homepage is stored as a distinct and important signal. Furthermore, the historical toolbarPagerank attribute confirms that the public-facing 0-10 score was a stored value, cementing its past importance in the ecosystem. The role of PageRank has evolved significantly from its original conception. It is no longer a simple measure of link volume but has been integrated with anti-spam systems like Penguin to better combat link manipulation. It now serves as a foundational link equity signal that is factored into the broader calculation of a site’s overall authority.
The domainAge and hostAge attributes provide concrete evidence that Google tracks the inception date of hosts and domains, using this data specifically to “sandbox fresh spam.” This confirms that while age itself may not be a direct ranking boost, it is used as a trust signal in spam evaluation.
Finally, the queriesForWhichOfficial attribute is a powerful signal, storing the specific query, country, and language combinations for which a document is considered the definitive “official page.” This is a direct mechanism for ensuring that brand homepages or official entity sites rank for their primary navigational queries.
Translating Quality Guidelines into Data: E-E-A-T and YMYL
Google’s public-facing Search Quality Rater Guidelines provide a conceptual framework for content quality through concepts like E-E-A-T and YMYL. The PerDocData structure reveals how these abstract concepts are likely translated into concrete data points.
E-E-A-T (Experience, Expertise, Authoritativeness, and Trustworthiness) is not a single, direct score but a classification derived from an aggregation of many underlying signals. The various metrics within PerDocData serve as the inputs to a model that determines a page’s E-E-A-T level. For example:
- Authoritativeness and Trustworthiness are likely informed by quantitative signals like siteAuthority, NSR, and the quality and quantity of backlinks as measured by PageRankPerDocData.
- Expertise is likely derived from semantic analysis, including the author’s entity recognition (connecting content to a known expert in the Knowledge Graph via authorObfuscatedGaiaStr) and the site’s topical focus, as measured by site2vecEmbeddingEncoded.
- Experience, the newest addition, is likely assessed through analysis of first-person language, original imagery, and user-generated content signals.
For topics classified as YMYL (Your Money or Your Life)—such as health, finance, and safety—Google’s systems hold content to a significantly higher standard. The documentation provides concrete evidence of this with specific attributes like ymylHealthScore and ymylNewsScore. These fields store the outputs of dedicated classifiers for YMYL content in the health and news verticals. For documents identified as YMYL, the weighting of signals like siteAuthority, author credibility (via entity analysis), and factual accuracy is almost certainly increased dramatically.
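One plausible way to picture this aggregation – sub-signals combined into a quality assessment, with YMYL content held to a higher bar – is sketched below. The signal names mirror the discussion above, but every weight and threshold is an assumption on my part.

```python
# Hypothetical aggregation of quality sub-signals, with a stricter
# threshold for YMYL documents. Weights and thresholds are invented.

def passes_quality_bar(signals, is_ymyl=False):
    """signals: dict of sub-signal scores in [0, 1]."""
    weights = {"site_authority": 0.4, "author_entity": 0.3, "experience": 0.3}
    score = sum(weights[k] * signals.get(k, 0.0) for k in weights)
    threshold = 0.7 if is_ymyl else 0.4  # YMYL pages face a higher bar
    return score >= threshold

middling_site = {"site_authority": 0.6, "author_entity": 0.5, "experience": 0.4}
```

Under these assumed numbers, a middling source clears the bar for an ordinary query but fails it for a YMYL one – the same page, judged more harshly in a higher-stakes vertical.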
Semantic Understanding: Beyond the Keyword
The Shift to Entities
The PerDocData model illustrates a fundamental evolution in Google’s content analysis: a definitive shift from matching keyword strings to understanding real-world entities and concepts.
The EntityAnnotations module is central to this process. It attaches specific Knowledge Graph entities that have been extracted from the page’s content. This transforms a simple document from a collection of words into an interconnected node in a web of knowledge. It allows Google to understand the things a page is about (e.g., the person “Harrison Ford,” the film series “Star Wars”) rather than just the text strings it contains. This process is facilitated by an internal system likely referred to as Webref, which provides the unique machine-readable IDs for entities, enabling the system to disambiguate between concepts with the same name (e.g., Apple the company versus apple the fruit).
Furthering this semantic understanding is the site2vecEmbeddingEncoded attribute. This represents a compressed vector embedding—a numerical representation—of an entire site’s content. In this machine learning model, the site’s collective themes and topics are mapped into a multi-dimensional space. This allows Google to mathematically measure the topical similarity between documents and even entire websites. It provides a quantifiable way to determine a site’s core focus and assess whether a new piece of content is topically consistent with the rest of the domain.
This technical implementation confirms that “topical authority” is not a vague marketing term but an algorithmically calculated concept. A website that maintains a tight focus on a specific set of related topics will generate a more coherent and powerful vector representation in this embedding space. Conversely, if a website focused on finance were to publish an article about gardening, the vector for that new article would be mathematically distant from the site’s established vector. This “topical deviation” can be measured and is likely used as a negative or dilutive signal, providing a technical basis for the long-standing strategic advice to maintain a clear topical focus and prune content that deviates from a website’s core subject area.
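The “topical deviation” measurement described above is, at its core, vector similarity. A minimal sketch with toy three-dimensional embeddings (real embeddings have hundreds of dimensions, and the vectors here are invented):

```python
# Toy demonstration of topical deviation measured as cosine similarity
# between a site-level embedding and a page-level embedding.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

site_vector    = [0.9, 0.1, 0.0]  # a finance-focused site's embedding
finance_page   = [0.8, 0.2, 0.1]  # on-topic new article
gardening_page = [0.1, 0.1, 0.9]  # off-topic new article

on_topic  = cosine_similarity(site_vector, finance_page)
off_topic = cosine_similarity(site_vector, gardening_page)
```

The gardening article’s vector is mathematically distant from the site’s established vector, which is exactly the dilutive signal the paragraph above describes.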
The granularity of analysis extends to the most basic on-page elements. Attributes like originalTitleHardTokenCount and titleHardTokenCountWithoutStopwords show that Google is not just reading titles, but analysing their structure and composition, counting the number of “hard tokens” (meaningful words) they contain.
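What counting “hard tokens” with and without stopwords might look like can be sketched simply; the tokenisation rule and the stopword list below are small samples of my own, not Google’s.

```python
# Hedged sketch of title token counting, mirroring the pair of attributes
# discussed above. The stopword list is an illustrative sample.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is"}

def title_token_counts(title):
    """Return (token count, token count excluding stopwords)."""
    tokens = [t for t in title.lower().split() if t.isalnum()]
    without_stopwords = [t for t in tokens if t not in STOPWORDS]
    return len(tokens), len(without_stopwords)
```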
Ultimately, the heavy reliance on entity annotation and vector embeddings indicates that modern on-page SEO is becoming an exercise in curating a knowledge graph. The primary task is no longer to optimise for keyword density but to clearly define the entities present on a page and their relationships to one another, making the mapping process for Google’s systems as clear and unambiguous as possible. This is achieved through precise language, the use of structured data to explicitly define entities, and a logical internal linking structure that reinforces the relationships between related concepts.
The User as a Ranking Signal: Clicks, Engagement, and Navboost
Direct Evidence of Clickstream Data
The PerDocData model provides irrefutable evidence that user behaviour signals are collected, stored at the document level, and used directly in ranking. This ends years of speculation and confirms that how users interact with search results is a primary input for Google’s systems.
The documentation reveals several core click signals:
- impressions: The total number of times a URL is shown in the search results pages (SERPs), which serves as the denominator for calculating click-through rates.
- GoodClicks and BadClicks: A classification of user clicks that likely distinguishes between a satisfying interaction and a “pogo-stick” event, where the user clicks a result and then immediately returns to the SERP to choose another.
- LastLongestClicks: A particularly powerful signal that identifies the last result a user clicked on in a session and on which they dwelled for a significant period. This strongly implies that the user’s query was successfully answered by that page, making it a potent indicator of relevance and quality.
These signals are the primary inputs for a ranking system known as Navboost, which is hypothesised to be one of the most powerful re-ranking “Twiddlers”. The data flow is unambiguous: users interact with the SERPs, this generates clickstream data, the data is stored in the PerDocData object for the corresponding URL, and systems like Navboost use this data to adjust rankings up or down.
The presence of this granular click data elevates user experience (UX) from a peripheral “good practice” to a direct and measurable ranking factor. A poor on-page experience that causes users to leave quickly will generate BadClicks and short dwell times. These negative signals are recorded in the document’s permanent record and are used to demote its ranking over time. This means that optimising title tags and meta descriptions to win the initial click is only half the battle; the other, equally important half is satisfying the user’s intent post-click to earn the GoodClicks and LastLongestClicks signals.
This click-based re-ranking system effectively functions as a massive, real-time quality control feedback loop. It allows Google to use the collective, demonstrated behaviour of millions of users to fine-tune and validate its own algorithmic rankings. If the initial Mustang algorithm places a document at position #1, but users consistently ignore it and instead award LastLongestClicks to the document at position #3, the system learns that the #3 result is likely a better answer for that query. Over time, this feedback will promote the preferred result. In essence, Google uses its users as the final and most scalable layer of quality raters, constantly refining the SERPs based on real-world preference.
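A hedged sketch of how click classification and a Navboost-style adjustment could work is below. Only the signal names (GoodClicks, BadClicks) come from the documentation; the dwell-time thresholds and the adjustment formula are my assumptions.

```python
# Assumed click classification by dwell time and pogo-sticking, plus a
# toy re-ranking adjustment derived from the aggregated counts.

def classify_click(dwell_seconds, returned_to_serp):
    if returned_to_serp and dwell_seconds < 10:
        return "BadClick"    # pogo-stick straight back to the results page
    if dwell_seconds >= 60:
        return "GoodClick"   # long dwell suggests the query was answered
    return "NeutralClick"

def navboost_adjustment(good_clicks, bad_clicks, impressions):
    """Positive value suggests a boost, negative a demotion."""
    if impressions == 0:
        return 0.0
    return (good_clicks - bad_clicks) / impressions
```

A page that wins clicks but immediately loses visitors accumulates BadClicks, driving its adjustment negative even if its initial relevance score was strong.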
The Pulse of the Web: Freshness and Temporal Signals
Dissecting Freshness Signals
The PerDocData model reveals a sophisticated approach to quantifying the timeliness and relevance of content, moving far beyond a simple reliance on publication dates. Google employs multiple methods to determine a document’s temporal context:
- Date Extraction: The system identifies dates from multiple sources, including the bylineDate (the date explicitly stated in an article’s byline), the syntacticDate (a date parsed from the URL structure or title), and, most importantly, the semanticDate (a date that is understood from the context of the content itself using Natural Language Processing).
- Update Significance: The presence of a lastSignificantUpdate signal is a critical revelation. It indicates that Google’s systems can differentiate between minor cosmetic changes (like fixing a typo) and substantial content revisions. This confirms that simply changing a publication date without making meaningful updates is an ineffective tactic. An update’s “value” is algorithmically determined, likely by comparing document versions and calculating a change-score. If this score passes a certain threshold, the lastSignificantUpdate timestamp is refreshed, making the page eligible for a freshness boost.
- Freshness Scoring: The freshboxArticleScores module stores specific scores from freshness-related classifiers, which are then used by the FreshnessTwiddler to boost timely content. A signal like isHotdoc may be used to flag content that is currently trending or newsworthy.
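How an update’s “value” might be determined can be sketched with a simple word-level change ratio. This is my own illustrative stand-in for whatever diffing Google actually performs, and the 20% threshold is invented.

```python
# Assumed change-score: symmetric difference of word sets relative to
# their union. A typo fix barely moves the ratio; a rewrite moves it a lot.

def is_significant_update(old_text, new_text, threshold=0.2):
    old_words, new_words = set(old_text.split()), set(new_text.split())
    changed = len(old_words ^ new_words)           # words added or removed
    baseline = max(len(old_words | new_words), 1)
    return changed / baseline >= threshold

original = ("google stores a last significant update signal "
            "for every document in its index")
typo_fix = original.replace("stores", "store")
rewrite = ("google stores a last significant update signal plus "
           "freshness scores dates and classifiers for ranking")

minor = is_significant_update(original, typo_fix)
major = is_significant_update(original, rewrite)
```

Under this model, only the genuine revision would refresh a lastSignificantUpdate-style timestamp.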
The existence of a semanticDate signal demonstrates that Google’s NLP capabilities can override explicit dates that may be manipulated. For instance, a publisher could set a bylineDate to the current day, but if the text of the article uses past-tense language to discuss events from several years ago, the semantic analysis will identify the content as old. When a conflict arises, the system will likely trust the semantic interpretation, making it much harder to game freshness signals with misleading timestamps.
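That conflict-resolution behaviour might look something like this; the preference for the semantic estimate follows the reasoning above, while the one-year tolerance is purely my assumption.

```python
# Assumed reconciliation of an explicit byline year with a semantically
# inferred year: when they disagree badly, trust the semantic estimate.

def resolve_document_year(byline_year, semantic_year, tolerance=1):
    if semantic_year is None:
        return byline_year
    if abs(byline_year - semantic_year) > tolerance:
        return semantic_year  # explicit date looks manipulated or wrong
    return byline_year
```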
Query Deserves Freshness (QDF)
These sophisticated freshness signals are not applied universally. They are connected to the long-standing concept of Query Deserves Freshness (QDF). This model dictates that freshness is not a global ranking factor but is heavily weighted for specific types of queries, such as those concerning recent events, regularly recurring events (like elections or conferences), or topics that require frequent updates to remain accurate (like product reviews or technical guides).
The SpamBrain Sentinel: Demotion Signals in SpamPerDocData
Overview of SpamPerDocData
The SpamPerDocData module is the document-level repository for signals related to webspam. It serves as the record of assessments made by Google’s comprehensive, AI-driven anti-spam system, SpamBrain. Launched in 2018, SpamBrain uses machine learning to identify spam patterns, manipulative link schemes, and low-quality content with remarkable accuracy. The scores and flags stored within SpamPerDocData are the direct outputs of this system.
Specific Spam Signals
The data within this module reflects a wide and granular range of spam tactics that Google actively detects and penalises:
- Link Spam: The spamrank attribute specifically measures “the likelihood that this document links to known spammers,” showing that outbound link quality is a measured risk factor.
- Content Spam: The system stores numerous specific content spam scores. These include KeywordStuffingScore, GibberishScore, and SpamWordScore, all represented as 7-bit integers. This demonstrates a multi-faceted approach to identifying low-quality content, moving beyond a single “spam” label to classify the specific type of violation. The OriginalContentScore is used for pages with very little content to measure originality and combat thin content, while spamtokensContentScore specifically measures spam in user-generated content sections.
- Behavioural & Technical Spam: The spamMuppetSignals module is used to store signals related to hacked sites, allowing for query-time identification. The trendspamScore tracks the count of matching queries related to trending spam topics, showing an ability to react to new spam waves.
- Reputation & Behavioural Spam: SpamBrain is continuously updated to combat emerging spam trends. This includes detecting scaled content abuse (mass-producing low-value content), site reputation abuse (“parasite SEO,” where third parties publish on a reputable domain), and the abuse of expired domains.
The reference to a “likelihood of a page being webspam” suggests that spam assessment is probabilistic, not a binary yes/no decision. This allows for a spectrum of penalties. A page engaging in borderline tactics might receive a low-grade spam score that acts as a slight negative weighting in the ranking algorithm. In contrast, a page with blatant and numerous violations would receive a very high spam score, resulting in a severe demotion or complete removal from the index.
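That probabilistic spectrum of penalties can be pictured as a simple banding function. The graded outcomes follow the reasoning above; the score bands themselves are invented for illustration.

```python
# Assumed mapping from a spam probability to a graded action, rather than
# a binary keep/remove decision. Band boundaries are illustrative.

def spam_action(spam_probability):
    if spam_probability >= 0.9:
        return "remove_from_index"
    if spam_probability >= 0.6:
        return "severe_demotion"
    if spam_probability >= 0.3:
        return "slight_negative_weighting"
    return "no_action"
```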
Furthermore, the system is designed to be proactive. SpamBrain can act as a gatekeeper during the indexing process, not just a janitor that cleans up later. Some content detected as spam during the initial crawl is never even added to the index, meaning its SpamPerDocData module is populated with negative signals from the very beginning of its lifecycle, preventing it from ever gaining ranking traction.
Technical Foundations and Page Experience
MobilePerDocData and Mobile-First Indexing
The PerDocData structure includes a dedicated MobilePerDocData module, which stores mobile-friendliness scores and a list of specific compatibility issues for a given URL. The existence of this module confirms that technical performance and mobile usability are not fleeting, real-time calculations but are persistent, foundational attributes of a document stored within the Content Warehouse. This elevates technical SEO from a simple checklist to a fundamental aspect of how Google perceives and categorises a document. Poor mobile performance is a negative data point permanently attached to a URL’s record in the index.
Connecting Internal Data to External Signals
The data stored internally is directly linked to the public-facing Page Experience signals that Google has emphasised in recent years.
- Core Web Vitals (CWV): While metrics like Largest Contentful Paint (LCP), Interaction to Next Paint (INP), and Cumulative Layout Shift (CLS) are measured in the field via the Chrome User Experience Report (CrUX), it is highly probable that this data is ingested, aggregated, and stored as a summary score or classification within PerDocData. The presence of voltData, which contains page UX signals for an internal project named “VOLT,” further confirms that multiple layers of UX signals are collected and contribute to this overall assessment.
- Other Page Experience Factors: Other key components of page experience, such as the use of HTTPS, safe browsing status, and the absence of intrusive interstitials, are also likely stored as flags or scores within the document’s data. The penalty for intrusive interstitials, in particular, is a direct ranking signal, and a page’s compliance is a stored attribute.
It is logical to conclude that these various component scores are aggregated into a single, weighted “page experience score.” Google’s systems favour efficiency, and rather than evaluating multiple separate metrics in real-time for every document in a SERP, it is more scalable to use a pre-calculated, composite score. This score can then be easily applied as a boost or demotion factor by a dedicated Twiddler during the final re-ranking stage.
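A speculative sketch of such a composite score follows; every weight here is an assumption on my part, chosen only to show the pre-computed, weighted-aggregate shape of the idea.

```python
# Hypothetical composite "page experience score" pre-computed from
# component signals. All weights are invented for illustration.

def page_experience_score(cwv_pass_rate, https, no_interstitials, mobile_ok):
    """cwv_pass_rate in [0, 1]; the rest are booleans."""
    score = 0.5 * cwv_pass_rate          # assume CWV carries most weight
    score += 0.2 if https else 0.0
    score += 0.2 if no_interstitials else 0.0
    score += 0.1 if mobile_ok else 0.0
    return round(score, 2)
```

A dedicated Twiddler could then apply this single number as a boost or demotion at re-ranking time, rather than re-evaluating each component per query.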
Specialised Content & Niche Signals
The PerDocData model also reveals that Google does not use a one-size-fits-all approach to document evaluation. It contains numerous modules and attributes tailored for specific content types and purposes.
Commercial Intent (commercialScore)
The commercialScore attribute is a direct measure of a document’s commerciality. A score greater than zero indicates the page “sells something.” This confirms that Google’s systems actively classify pages based on their position in the marketing funnel. This score is likely used to better match pages to queries with clear commercial intent and may also be a factor in how other quality signals are weighted.
Vertical-Specific Data Modules
The presence of dedicated data modules like BookCitationData, videodata, imagedata, BlogData, and scienceDoctype demonstrates that Google applies specialised analysis to different content verticals. A scientific paper is evaluated differently from a blog post, and a video’s properties are distinct from a book’s citation record. This indicates that optimisation strategies should be tailored to the specific content format, as Google is not evaluating them with the same generic lens.
Internationalisation and Localisation (localizedCluster)
For websites that operate in multiple languages or regions, the localizedCluster attribute is highly significant. It stores information about the “relationship of translated and/or localized pages.” This confirms that Google actively attempts to map different language versions of the same content together. Correctly implementing hreflang and other internationalisation signals is therefore critical to help Google build these clusters accurately, ensuring the correct language version is served to the appropriate user and that authority signals are consolidated across different versions of a page.
Advanced Signals and Nuanced Ranking Mechanics
Further analysis of the PerDocData model reveals several highly specific and powerful signals that offer a more granular view into the sophistication of Google’s ranking systems.
Internal Prominence and Simulated Traffic (onsiteProminence)
The onsiteProminence attribute provides a definitive confirmation of how Google calculates internal link equity. The documentation describes this as a measure of a document’s importance within its own site. Crucially, it is “computed by propagating simulated traffic from the homepage and high craps click pages.” This reveals two key insights:
- Google runs a simulation of user flow through a website to determine which pages are most important.
- This simulation starts from key entry points: the homepage and, significantly, pages that already receive high volumes of search clicks.
This confirms that internal links are not treated equally. A link from a high-traffic page to another page on the same site passes more “prominence” than a link from an obscure, rarely visited page. This provides a technical basis for the strategic advice to internally link from your highest-performing pages to other pages you wish to boost.
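“Propagating simulated traffic” from seed pages is structurally similar to a personalised PageRank computation. The sketch below illustrates that idea; the toy link graph, the seed weights, the damping factor, and the iteration count are all my own assumptions, since the leak names the behaviour but not the algorithm.

```python
# Hypothetical sketch of onsiteProminence: a personalised-PageRank-style
# walk seeded at the homepage and at a page with high search clicks.
internal_links = {
    "/": ["/products", "/blog"],
    "/products": ["/products/widget"],
    "/blog": ["/blog/widget-guide"],
    "/blog/widget-guide": ["/products/widget"],
    "/products/widget": [],
}
seeds = {"/": 0.7, "/blog/widget-guide": 0.3}  # homepage + high-click page

def onsite_prominence(links, seeds, damping=0.85, iterations=50):
    """Iteratively push simulated traffic from seed pages along links."""
    scores = {page: seeds.get(page, 0.0) for page in links}
    for _ in range(iterations):
        new = {page: (1 - damping) * seeds.get(page, 0.0) for page in links}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * scores[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
        scores = new
    return scores

scores = onsite_prominence(internal_links, seeds)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page:22s} {score:.3f}")
```

Note how `/products/widget` accumulates traffic from two paths, one of which starts at the high-click blog post, while pages linked only from obscure corners of the site would receive almost nothing. That is the mechanism behind linking from your highest-performing pages.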
Document Intent Classification: The “Asteroid Belt” (asteroidBeltIntents)
The asteroidBeltIntents attribute, an internal project name, points to a highly granular system for document intent classification. This system moves far beyond the traditional SEO model of informational, navigational, and transactional intent. Instead, it appears to assign a list of multiple, specific intents to a single document, each with a corresponding confidence score. This suggests that Google understands that a single page can serve multiple purposes. For example, a product page can be both transactional (“buy this”) and informational (“read reviews,” “compare specifications”). This system allows Google to match a page to a wider and more nuanced range of queries by understanding all the potential user needs it can satisfy.
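The data shape this implies is a list of (intent, confidence) pairs per document. The sketch below shows how such a structure could be queried; the intent labels, scores, and threshold are illustrative assumptions, not values from the leak.

```python
from dataclasses import dataclass

# Hypothetical sketch of multi-intent classification with confidence
# scores, as asteroidBeltIntents implies.
@dataclass
class IntentScore:
    intent: str
    confidence: float  # assumed range 0.0 - 1.0

# One product page, several intents it can satisfy (illustrative values).
product_page_intents = [
    IntentScore("purchase", 0.82),
    IntentScore("compare_specifications", 0.54),
    IntentScore("read_reviews", 0.47),
    IntentScore("find_store", 0.08),
]

def matching_intents(doc_intents, threshold=0.4):
    """Return every intent the document can plausibly satisfy."""
    return [i.intent for i in doc_intents if i.confidence >= threshold]

print(matching_intents(product_page_intents))
```

A single page thus qualifies for transactional and informational queries at the same time, which is exactly the behaviour the attribute’s design suggests.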
Advanced Content & Duplicate Analysis (shingleInfo, bodyWordsToTokensRatio)
The documentation reveals sophisticated methods for content analysis that go beyond simple keyword counting.
- shingleInfo: This attribute confirms the use of “shingling,” a well-established computer science technique for detecting near-duplicate content. The process involves breaking a document down into small, overlapping chunks of text (shingles) and creating a unique fingerprint. By comparing these fingerprints, Google can identify pages that are substantially similar, even if they are not exact copies. This is the technical underpinning of how Google handles duplicate and thin content.
- bodyWordsToTokensRatio: This metric measures the ratio of meaningful words to the total number of “tokens” (words, punctuation, etc.) on a page. The documentation also specifies that this ratio is calculated separately for the beginning of the document (bodyWordsToTokensRatioBegin) and the document as a whole. This suggests a nuanced analysis of content quality and density, with particular attention paid to the content that appears “above the fold.” A low ratio could signal thin, boilerplate, or auto-generated content.
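Both measures are simple enough to demonstrate directly. Below is a minimal sketch of word-level shingling with Jaccard similarity, plus a naive words-to-tokens ratio. The shingle size, tokeniser, and example texts are my own illustrative choices; Google’s production parameters are not documented in the leak.

```python
import re

def shingles(text: str, k: int = 4) -> set:
    """Break text into overlapping k-word chunks (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def words_to_tokens_ratio(text: str) -> float:
    """Share of tokens that are actual words rather than punctuation."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    words = [t for t in tokens if re.match(r"\w", t)]
    return len(words) / len(tokens) if tokens else 0.0

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bend"

# One changed word still yields a very high similarity score.
print(round(jaccard(shingles(doc_a), shingles(doc_b)), 2))
print(words_to_tokens_ratio("Hello, world!"))
```

Real systems fingerprint the shingles (e.g. with MinHash) so that comparisons scale across billions of pages, but the underlying similarity measure is the same.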
SERP Diversity and Host Crowding (crowdingdata)
The presence of a crowdingdata module suggests a system designed to manage search result diversity. This is likely the mechanism that prevents a single domain from dominating the search results for a particular query, a phenomenon often referred to as “host crowding.” By limiting the number of results from any one site, this system ensures users are presented with a variety of sources and perspectives, improving the overall quality of the search experience.
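The re-ranking step this implies is easy to sketch: cap how many results one host may occupy near the top of a ranked list and demote the overflow. The cap of two results per host is an illustrative assumption; the leak does not document crowdingdata’s exact limits.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sketch of host crowding: limit results per domain and push
# the overflow below the diversified set. The cap is an assumption.
def diversify(ranked_urls, max_per_host=2):
    counts = Counter()
    kept, demoted = [], []
    for url in ranked_urls:
        host = urlparse(url).netloc
        if counts[host] < max_per_host:
            counts[host] += 1
            kept.append(url)
        else:
            demoted.append(url)
    return kept + demoted

results = [
    "https://bigsite.com/a",
    "https://bigsite.com/b",
    "https://bigsite.com/c",
    "https://other.com/x",
]
print(diversify(results))
```

Note the third bigsite.com result slips below other.com even though it ranked higher on pure relevance, which is exactly the diversity behaviour users see on real results pages.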
Strategic Synthesis: A Unified SEO Framework for 2025 and Beyond
The analysis of the PerDocData model provides a clear and detailed blueprint of how Google evaluates web pages. It confirms a sophisticated, multi-faceted, and data-driven system that measures and stores information on everything from brand authority and user clicks to content freshness and semantic entities. The era of attempting to manipulate a simplistic algorithm is definitively over. The Content Warehouse documentation reveals a system that is increasingly reliant on measuring real-world authority, genuine user satisfaction, and demonstrable expertise.
Based on this evidence, a unified and sustainable SEO framework for the future must be built upon three core pillars.
Pillar 1: Foundational Authority & Trust
Building a strong siteAuthority and NSR score is the non-negotiable foundation of modern SEO. This requires a long-term commitment to establishing the entire domain as a credible and trustworthy source. Key activities include creating a topically focused site to build a coherent site2vecEmbeddingEncoded, earning high-quality backlinks from other authoritative sites to build PageRankPerDocData, and demonstrating E-E-A-T through clear authorship, transparent business practices, and factually accurate content.
Pillar 2: Content that Satisfies and Engages
Content must be created with the primary goal of satisfying user intent and generating positive click signals (GoodClicks, LastLongestClicks) to perform well in systems like Navboost. This necessitates a profound focus on user experience, from the SERP snippet to the on-page journey. The content itself must provide the best, most comprehensive answer to a user’s query. For topics where timeliness matters, content must be kept demonstrably fresh, triggering the lastSignificantUpdate signal to earn a boost from freshness systems.
Pillar 3: Technical Excellence & Semantic Clarity
A website must be technically flawless to ensure a positive Page Experience score is recorded in its PerDocData. This includes optimising for Core Web Vitals, ensuring mobile-friendliness, and avoiding intrusive elements. Beyond technical performance, content must be structured to clearly communicate its meaning to Google’s machine-learning models. This involves using structured data (Schema.org) to explicitly define entities, employing a logical internal linking strategy to reinforce relationships between concepts and boost onsiteProminence, and writing with clarity to aid the EntityAnnotations process.
Ultimately, the blueprint revealed by the Content Warehouse documentation confirms that sustainable success in Google’s ecosystem is less about manipulating an opaque algorithm and more about building a genuine, authoritative brand that users actively seek, trust, and engage with. Every one of these positive interactions is measured, stored, and used to rank content, creating a system that increasingly rewards authenticity and user value above all else.
Disclosure: I use generative AI when specifically writing about my own experiences, ideas, stories, concepts, tools, tool documentation or research. My tool of choice for this process is Google Gemini Pro 2.5 Deep Research. I have over 20 years' experience writing about accessible website development and SEO (search engine optimisation). This assistance helps ensure our customers have clarity on everything we are involved with and what we stand for. It also ensures that when customers use Google Search to ask a question about Hobo Web software, the answer is always available to them, and it is as accurate and up-to-date as possible. All content was conceived, edited and verified as correct by me (and is under constant development). See my AI policy.
Disclaimer: Any article (like this one) dealing with the Google Content Warehouse leak is going to use a lot of logical inference when putting together a framework for SEOs, as I have done with this article. I urge you to double-check my work and use critical thinking when applying anything from the leaks to your site. My aim with these articles is essentially to confirm that Google does, as it claims, try to identify trusted sites to rank in its index. The aim is to confirm, as firmly as the evidence allows, that white hat SEO has purpose in 2025.