
The PerDocData: An SEO’s Strategic Analysis of Google’s Leaked Core Document Model

Having spent 25 years analysing Google’s results pages, I regard the leak of the PerDocData model as nothing short of a Rosetta Stone.

At Hobo, I’ve always worked from the evidence we had, and to be fair, my analysis concludes that Google spokespeople have been open about much of this in past briefings, albeit scattered across the web. This leak, however, provides the blueprint SEOs have been looking for.

This article analyses the PerDocData structure, which can be understood as the comprehensive ‘digital dossier’ Google keeps on every single URL it indexes. It’s the central repository, the master file, that consolidates the vast array of signals we’ve spent our careers trying to influence.

This is no longer theory; this is documented data architecture.

The most critical finding from my analysis confirms a long-standing debate within the SEO community. Google’s ranking process is not a single, monolithic algorithm. Instead, it’s a pipeline.

A URL first achieves a relevance-based ranking, but crucially, it is then subjected to a series of re-ranking systems – internally called “Twiddlers” – that prioritise user-centric and quality-focused signals.

This is the mechanism behind why a keyword-optimised but low-quality page can get an initial foothold but will ultimately fail to maintain visibility.

For me and my team, this architecture fundamentally validates the strategic shift we’ve been advocating for years. The game is no longer just about establishing relevance. To achieve and maintain top rankings, you must prove your value to the subsequent “Twiddlers.”

This means our focus on user experience, content quality, and demonstrable authority isn’t just best practice—it’s a direct response to how Google’s core ranking pipeline is built.

Key Findings Synopsis

  • Multi-Layered Ranking Confirmed: The search process involves an initial ranking pass by a system named Mustang, followed by a series of powerful re-ranking functions (“Twiddlers”) that adjust results based on factors like user engagement, freshness, and quality. Success requires optimising for both stages.
  • Tiered Indexing System Named: The documentation confirms a tiered indexing system with specific names: “Base, Zeppelins, and Landfills.” A document’s scaledSelectionTierRank determines its position within these tiers, directly impacting its ranking potential.
  • Site-Level Authority is a Core Metric: Google calculates and stores potent site-level authority signals, referred to internally as siteAuthority and NSR (Normalized Site Rank). These metrics contextualise the value of any individual page, confirming that the reputation of the entire domain is a critical, non-negotiable component of ranking potential.
  • User Clickstream Data is a Direct Ranking Input: The system stores granular, document-level click data, including GoodClicks, BadClicks, and LastLongestClicks. This provides definitive evidence that user engagement and post-click behaviour are direct inputs into ranking systems, validating the long-held hypothesis that user experience directly influences search performance.
  • Freshness Signals are Sophisticated and Nuanced: Google’s assessment of content freshness goes far beyond simple publication dates. It employs semantic analysis to understand the temporal context of content and uses a lastSignificantUpdate signal to differentiate between minor edits and substantial revisions, rewarding genuine content improvement.
  • Semantic Understanding Has Replaced Keyword Matching: The architecture is built around entity recognition (EntityAnnotations) and machine learning-derived vector embeddings (site2vecEmbeddingEncoded). This marks a definitive shift from a keyword-centric model to one that understands topics, concepts, and the relationships between them, making topical authority a mathematically calculated attribute.
  • Internal Link Equity is Calculated via “Simulated Traffic”: A metric named onsiteProminence measures a page’s importance within its own site by simulating user traffic flow from the homepage and other high-traffic pages, confirming the critical role of internal linking strategy.

The Content Warehouse: Google’s Digital Brain

Architectural Context

To comprehend the significance of PerDocData, one must first understand its environment: the Google Content Warehouse (leaked in 2024).

This is not a simple database but a vast, sophisticated Application Programming Interface (API) and toolset designed for storing, managing, and analysing the web at an immense scale. It serves as the central repository where Google processes and organises all information it gathers about web content, acting as the foundational data layer for its search algorithms.

The Search Processing Pipeline

A document’s journey from discovery to being served in search results is a multi-stage pipeline. PerDocData is the data object that is populated and referenced throughout this process.

Crawling & URL Discovery

The process begins with Google discovering URLs through various means, including following links from known pages, processing submitted sitemaps, and other proprietary methods. This is the initial entry point into the system.

Indexing & Storage

Once a URL is discovered and crawled, its content is fetched, rendered, and analysed. The processed document and its associated metadata are then stored in a suite of indexing systems. The documentation points to TeraGoogle as a primary system for long-term storage, with other systems like Alexandria also playing a role.

A critical component of this stage is a system named SegIndexer, which is responsible for placing documents into different tiers within the index. The scaledSelectionTierRank attribute provides a direct window into this system, confirming the long-held theory that Google maintains a tiered index with specific internal names: “Base, Zeppelins, and Landfills.”

A document’s rank within these serving tiers is a language-normalised score, indicating its fractional position within the index quality hierarchy.

This architecture dictates that links from documents residing in higher-quality tiers (like Base) carry significantly more weight than those from lower tiers (like Landfills).

This creates a distinct “link equity economy” where the value of a backlink is determined not just by the authority of the linking page itself, but also by the indexed “neighbourhood” it inhabits.
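The documentation does not spell out how tier membership feeds into link valuation, but the “link equity economy” idea can be illustrated with a small, purely hypothetical weighting scheme. The tier names come from the leak; the multipliers and the function below are my own illustrative assumptions.

```python
# Hypothetical weights for the leaked tier names; the real values are unknown.
TIER_WEIGHTS = {"Base": 1.0, "Zeppelins": 0.4, "Landfills": 0.05}

def backlink_value(linking_page_authority: float, linking_page_tier: str) -> float:
    """Illustrative only: discount a link by the index tier of the page it sits on."""
    return linking_page_authority * TIER_WEIGHTS.get(linking_page_tier, 0.0)

# Two links with the same page-level authority but very different effective value.
print(backlink_value(0.8, "Base"))       # 0.8
print(backlink_value(0.8, "Landfills"))  # 0.04
```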

Initial Ranking (Mustang)

After indexing, the initial scoring and ranking of documents are handled by a primary system called Mustang. This system conducts the first-pass evaluation, creating a provisional set of results based on a multitude of signals stored within the PerDocData object for each document. This stage likely focuses on core relevance and foundational authority signals.

Re-ranking (Twiddlers)

The process does not end with Mustang. The provisional results are passed to a powerful subsequent layer of the system known as “Twiddlers.” These are re-ranking functions that adjust the order of search results after Mustang’s initial ranking is complete. Twiddlers act as a fine-tuning mechanism, applying boosts or demotions based on specific, often dynamic, criteria. Examples referenced in the documentation include a FreshnessTwiddler, which boosts newer content, and a QualityBoost function. Another specific example is the SiteBoostTwiddler, which likely uses site-level signals to adjust rankings.

This multi-stage architecture reveals that search engine optimisation is not about solving for a single algorithm. It is a multi-stage optimisation problem. A document must first possess strong foundational relevance signals to pass the initial Mustang ranking.

Subsequently, it must exhibit the specific qualities – such as demonstrable user engagement, freshness for time-sensitive queries, or exceptional page experience—to be promoted by the various Twiddlers.

A page could rank well in the initial pass but be demoted by a Twiddler if it generates poor user click signals, or fail to be boosted if it is not considered fresh for a query that deserves it. A successful SEO strategy must therefore cater to both stages of this process.
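To make the two-stage architecture concrete, here is a minimal, purely illustrative sketch of how a re-ranking pass might sit on top of an initial relevance score. The function names, weights, and signal fields are hypothetical; only the concepts — a Mustang-style first-pass score adjusted by freshness and quality “Twiddlers” — come from the leaked documentation.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    url: str
    relevance: float          # hypothetical first-pass (Mustang-style) score
    freshness: float = 0.0    # 0..1, hypothetical freshness classifier output
    quality: float = 0.5      # 0..1, hypothetical site/page quality score

def freshness_twiddler(doc: Doc, query_deserves_freshness: bool) -> float:
    # Boost newer content only when the query calls for it (QDF-style gating).
    return 1.0 + 0.3 * doc.freshness if query_deserves_freshness else 1.0

def quality_boost(doc: Doc) -> float:
    # Promote above-average quality, demote below-average quality.
    return 0.8 + 0.4 * doc.quality

def rerank(docs: list[Doc], qdf: bool) -> list[Doc]:
    # Apply each "twiddler" as a multiplier on the provisional score, then
    # re-sort; the initial ranking is adjusted, not replaced.
    def final_score(d: Doc) -> float:
        return d.relevance * freshness_twiddler(d, qdf) * quality_boost(d)
    return sorted(docs, key=final_score, reverse=True)

provisional = [
    Doc("example.com/evergreen", relevance=0.92, freshness=0.1, quality=0.9),
    Doc("example.com/news", relevance=0.88, freshness=0.9, quality=0.7),
]
for d in rerank(provisional, qdf=True):
    print(d.url)
```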

Understanding the Vessel: PerDocData as a Protocol Buffer

Technical Primer on Protocol Buffers (Protobuf)

The PerDocData object is structured as a Protocol Buffer, or Protobuf. This is a language-neutral, platform-neutral, and extensible mechanism for serialising structured data, developed and used extensively across Google’s infrastructure. Its selection is not arbitrary; it is critical for operating at Google’s scale. Key characteristics that make it suitable include:

  • Efficiency: Protobuf is significantly smaller, faster, and simpler to process than alternative formats like XML or JSON. This allows for compact data storage and extremely fast parsing, which is essential when dealing with trillions of documents.
  • Structure: Data schemas, known as “messages,” are strictly defined in .proto files. This enforces strong data typing and a consistent structure, ensuring that different systems interacting with the data do so reliably.
  • Extensibility: The Protobuf format is designed for seamless evolution. New fields and data points can be added to the message definition without breaking older systems or invalidating existing data, allowing Google to continuously add new signals to its models without re-architecting the entire system.

The Role of PerDocData

Within the Content Warehouse, the PerDocData model is arguably the most interesting and critical Protobuf message for SEO analysis. It is the primary container for the vast majority of document-level signals used for indexing and serving search results. It is a key component of a larger CompositeDoc message, which aggregates all known information about a single URL. PerDocData is where on-page factors, quality scores, spam signals, freshness metrics, and user engagement data are stored and made accessible to the ranking pipeline.
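The leaked schema is defined in Protocol Buffers, not Python, but the idea of a strictly typed, extensible per-document record can be sketched with a simple dataclass analogue. The field names below appear in the leak; the types are simplified, the grouping is mine, and this is not the actual message definition.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SpamPerDocData:
    # Simplified stand-ins for leaked spam attributes.
    spamrank: int = 0
    keyword_stuffing_score: int = 0
    gibberish_score: int = 0

@dataclass
class PerDocData:
    # A tiny, illustrative slice of the document-level record.
    url: str
    site_authority: Optional[int] = None
    scaled_selection_tier_rank: Optional[int] = None
    last_significant_update: Optional[int] = None  # timestamp, per the leak
    spam: SpamPerDocData = field(default_factory=SpamPerDocData)
    # New fields can be appended later without invalidating old records,
    # mirroring Protobuf's forward-compatible extensibility.

doc = PerDocData(url="https://example.com/guide", site_authority=63)
print(doc)
```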

High-Impact Modules and Attributes in PerDocData

Authority & Trust

This category includes signals that measure the overall trust, authority, and reputation of a page or an entire domain. These are foundational to Google’s assessment of a source’s reliability.

Module / Attribute Data Type Hypothesised Function & Strategic Relevance
PageRank Nested Message Contains the PageRank score, a foundational signal for link equity. Though its calculation has evolved, it remains a core system for understanding link-based authority.
siteAuthority Integer A site-level authority score that contextualises the trust and ranking potential of all pages on the domain. A direct, internal measure of “domain authority”.
nsrDataProto Nested Message Contains the Normalized Site Rank (NSR), a sophisticated site-level quality and reliability score. This is a primary measure of a site’s overall quality.
onsiteProminence Integer Measures a page’s importance within its own site by simulating traffic flow from the homepage and other high-traffic pages. A measure of internal link equity.
queriesForWhichOfficial Nested Message Stores a list of specific queries for which this page is considered the official result, a powerful signal for brand and entity authority.
homepagePagerankNs Integer The PageRank of the site’s homepage, stored as a distinct and important signal.
domainAge / hostAge Integer Tracks the inception date of hosts and domains, used as a trust signal in spam evaluation, particularly to “sandbox fresh spam.”
authorObfuscatedGaiaStr String list Obfuscated ID of the content’s author, linking content to author entities for E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) evaluation.
fireflySiteSignal Nested Message Contains site-level signals for the “Firefly” ranking system. Likely another comprehensive measure of site quality and trust, similar to NSR.
toolbarPagerank Integer A copy of the public-facing PageRank score (0-10) historically shown in the Google Toolbar. While the toolbar is gone, this likely serves as a legacy or parallel authority signal.

 

Content Quality

These attributes focus on the quality, originality, and value of the content on the page itself, separate from site-level metrics.

Module / Attribute Data Type Hypothesised Function & Strategic Relevance
OriginalContentScore Integer (0-512) A score applied to pages with little content, measuring originality. A low score likely indicates thin, duplicate, or low-value content that should be pruned or improved.
shingleInfo Nested Message Contains data from “shingling,” a technique used to create a fingerprint of the document to detect near-duplicate content.
bodyWordsToTokensRatio Number Measures the ratio of meaningful words to total tokens, likely as a signal for content quality and readability. The start of the document is measured separately.
ymylHealthScore / ymylNewsScore Integer Dedicated classifier scores for “Your Money or Your Life” content in health and news verticals, indicating higher quality standards are being met.
titleHardTokenCountWithoutStopwords Integer Counts meaningful words in a title, suggesting analysis of title quality and conciseness for better user understanding and click-through rates.
TagPageScore Integer A score measuring the quality of a “tag page” (a page that aggregates content with a specific tag). A lower score indicates a less useful, potentially thin-content page.

 

Spam Detection

This group of signals is dedicated to identifying and filtering manipulative or low-value content designed to cheat the ranking systems.

Module / Attribute Data Type Hypothesised Function & Strategic Relevance
spambrainData / spambrainTotalDocSpamScore Nested Message / Number Contains a collection of signals from Google’s AI-powered SpamBrain system at both the site and page level. This is a primary defence against webspam.
spamrank Integer (0-65535) A specific score that measures the likelihood that a document links out to known spam sites, penalising pages that associate with bad neighbourhoods.
spamtokensContentScore Number A specific score measuring user-generated content (UGC) spam, crucial for forums, comment sections, and social platforms.
spamMuppetSignals Nested Message Contains signals related to hacked sites for query-time identification, preventing compromised pages from ranking.
KeywordStuffingScore Integer A specific score to detect and penalise the overuse of keywords in content.
GibberishScore Integer A score to identify auto-generated or nonsensical content, filtering out low-quality machine-written text.
trendspamScore Integer Tracks the count of matching queries related to trending spam topics, showing an ability to react to new spam waves in real-time.
spamCookbookAction Nested Message Actions based on “Cookbook recipes,” an internal system for identifying and acting on specific, known spam patterns.
QuarantineInfo Integer A bitmask used to store quarantine-related information, flagging pages for various violations and potentially removing them from the index.
urlPoisoningData Nested Message Contains data used to suppress documents with manipulative URLs (e.g., keyword-stuffed subdomains or URL paths).
IsAnchorBayesSpam Boolean A flag indicating if the page is considered spam by a classifier that specifically analyses anchor text from inbound links. This targets link spam schemes.
uacSpamScore Integer A spam score likely derived from a “User Action Corpus” or user feedback signals (e.g., users blocking a site or reporting spam). A direct measure of user dissatisfaction.

User Engagement & Behaviour

These signals are derived from user interactions with the search results, providing direct feedback on the relevance and satisfaction of a given page.

Module / Attribute Data Type Hypothesised Function & Strategic Relevance
impressions Number Likely stores the total impressions a URL receives in search results. This is a foundational metric for calculating click-through rates (CTR).
GoodClicks, BadClicks Number Measures of positive and negative user clicks. A BadClick likely corresponds to “pogo-sticking,” where a user quickly returns to the SERP, signalling dissatisfaction.
LastLongestClicks Number A powerful signal indicating the last result a user clicked on and dwelled on, suggesting the query was successfully resolved by that page. A strong indicator of relevance and quality.
socialgraphNodeNameFp String A fingerprint related to the Social Graph, likely used in personalised search to surface content from connected entities or authors.

Freshness & Timeliness

These attributes help Google determine how important recent information is for a given query and how up-to-date a specific document is.

Module / Attribute Data Type Hypothesised Function & Strategic Relevance
freshboxArticleScores Nested Message A container for scores from freshness classifiers, including specific scores for news articles and live blogs. Essential for Query Deserves Freshness (QDF) systems.
semanticDateInfo Integer Stores confidence scores for date components (day/month/year) derived from content analysis. Used by “Freshness Twiddler” systems to determine true timeliness.
lastSignificantUpdate String A timestamp (in seconds) indicating the last time the document underwent a substantial content change, distinguishing it from minor cosmetic edits.
timeSensitivity Integer An encoded signal representing the document’s overall time sensitivity, likely influencing how heavily freshness is weighted as a ranking factor for it.
isHotdoc Boolean A flag set by the FreshDocs system to identify a document as being extremely new and trending, likely giving it a significant short-term ranking boost.

Semantic & Topical Relevance

This category covers how Google understands the meaning, topics, and intent behind the content on a page, moving beyond simple keywords.

Module / Attribute Data Type Hypothesised Function & Strategic Relevance
webrefEntities / EntityAnnotations Nested Message Attaches Knowledge Graph entities extracted from the page’s content. This is fundamental to Google’s semantic understanding of what the page is about.
site2vecEmbeddingEncoded String A compressed vector embedding of the entire site. Used by machine learning models to determine the site’s overall theme, measure topical similarity, and identify topical deviations.
asteroidBeltIntents Nested Message An internal system for granular document intent classification, assigning multiple intent scores to a page beyond simple informational/transactional labels.
commercialScore Number A direct measure of a page’s commercial intent, classifying whether it “sells something.” Used to balance informational and commercial results.
topPetacatTaxId Integer The top category ID for the site (from an internal “Petacat” taxonomy), used to determine query/result matching and topical relevance.
mediaOrPeopleEntities Nested Message Identifies the most prominent media or people entities on a page, used in Image Search to ensure result diversity and avoid showing only one person or topic.
fringeQueryPrior Nested Message Contains information used for ranking on “fringe” queries—very rare, long-tail, or obscure searches. This shows Google’s focus on providing relevant results for the entire spectrum of queries.

 

Technical & Page Experience

These signals relate to a page’s technical health, accessibility, and the user’s experience interacting with it, including speed and mobile-friendliness.

Module / Attribute Data Type Hypothesised Function & Strategic Relevance
MobileData Nested Message Stores the mobile-friendliness score and a list of specific mobile compatibility issues. A direct data point for mobile-first indexing.
voltData Nested Message Contains page UX signals for the “VOLT” system, contributing to the overall Page Experience score (likely related to Core Web Vitals).
crowdingdata Nested Message Data used to manage SERP diversity and prevent too many results from the same host (“host crowding”) for a single query.
scaledSelectionTierRank Integer A score determining the document’s position within Google’s tiered index (“Base, Zeppelins, Landfills”), directly impacting its ranking potential and how frequently it is served.
pageregions String Encodes the positional layout of different content regions (e.g., header, footer, body), allowing for more granular analysis (e.g., giving more weight to body content).
servingTimeClusterIds Nested Message Contains IDs used to de-duplicate results in real-time at the moment a search is performed, ensuring a cleaner SERP.

 

Geographic & Language Signals

This group includes attributes essential for local search and serving results to a global audience in the correct language.

Module / Attribute Data Type Hypothesised Function & Strategic Relevance
countryInfo Nested Message Stores country information for the document, helping to determine geographic relevance for country-specific queries.
brainloc Nested Message Contains more granular location information for the document (likely cities, states, etc.), vital for “near me” and local search ranking.
localizedCluster Nested Message Stores information about clusters of translated and/or localized pages, helping Google serve the correct language version of a page to the right user.
rosettaLanguages String List Stores the top document language codes as identified by Google’s “Rosetta” system, ensuring accurate language targeting.

Specialised Content & Niche Signals

These are classifiers and data stores for specific types of content that require unique ranking considerations, such as books, videos, or scientific papers.

Module / Attribute Data Type Hypothesised Function & Strategic Relevance
BookCitationData Nested Message Stores book citation data for a web page, used in academic and book-related search to measure scholarly impact.
videodata / imagedata Nested Message Contains specific metadata and quality signals for video and image content, powering vertical search engines like Google Images and Videos.
scienceDoctype Integer A classifier for scientific documents, used in systems like Google Scholar to identify and rank research papers.
productSitesInfo Nested Message Stores specific information about product-focused websites, likely feeding into shopping and product-review ranking systems.
travelGoodSitesInfo Nested Message Stores specific information about high-quality travel websites, indicating a specialized classifier for the travel vertical.
PremiumData Nested Message A data container for documents classified as “Premium.” This could be for content from high-authority publishers, subscription sources, or partners that undergo special indexing.

 

Quantifying Authority and Trust

Site-Level Authority Signals

The PerDocData model provides clear evidence that Google’s evaluation of a document is heavily contextualised by the authority of the domain on which it resides. This transcends individual page metrics and points to a holistic, site-wide assessment.

The existence of attributes like siteAuthority and references to NSR (Normalized Site Rank) confirms that Google calculates a proprietary, site-level quality score. NSR is described as a sophisticated system for evaluating a website’s overall reliability, integrating a multitude of factors to assign a score that directly influences search rankings.

This definitively proves that while Google representatives correctly state they do not use third-party metrics like Moz’s Domain Authority, they have their own internal, and far more complex, equivalent. The long-debated concept of “domain authority” is therefore not a myth; it is a core, calculated metric within the Content Warehouse. This means that strategic activities aimed at building sitewide trust, brand recognition, and a clean backlink profile have a direct, measurable impact on a data point used in ranking. Further evidence of this holistic evaluation comes from attributes like fireflySiteSignal, an internal project name for another set of site-level signals that contribute to ranking changes.

The PageRankPerDocData module confirms that PageRank, while no longer a public-facing metric, remains a core ranking system. The documentation also references homepagePagerankNs, indicating that the PageRank of a site’s homepage is stored as a distinct and important signal. Furthermore, the historical toolbarPagerank attribute confirms that the public-facing 0-10 score was a stored value, cementing its past importance in the ecosystem. The role of PageRank has evolved significantly from its original conception. It is no longer a simple measure of link volume but has been integrated with anti-spam systems like Penguin to better combat link manipulation. It now serves as a foundational link equity signal that is factored into the broader calculation of a site’s overall authority.

The domainAge and hostAge attributes provide concrete evidence that Google tracks the inception date of hosts and domains, using this data specifically to “sandbox fresh spam.” This confirms that while age itself may not be a direct ranking boost, it is used as a trust signal in spam evaluation.

Finally, the queriesForWhichOfficial attribute is a powerful signal, storing the specific query, country, and language combinations for which a document is considered the definitive “official page.” This is a direct mechanism for ensuring that brand homepages or official entity sites rank for their primary navigational queries.

Translating Quality Guidelines into Data: E-E-A-T and YMYL

Google’s public-facing Search Quality Rater Guidelines provide a conceptual framework for content quality through concepts like E-E-A-T and YMYL. The PerDocData structure reveals how these abstract concepts are likely translated into concrete data points.

E-E-A-T (Experience, Expertise, Authoritativeness, and Trustworthiness) is not a single, direct score but a classification derived from an aggregation of many underlying signals. The various metrics within PerDocData serve as the inputs to a model that determines a page’s E-E-A-T level. For example:

  • Authoritativeness and Trustworthiness are likely informed by quantitative signals like siteAuthority, NSR, and the quality and quantity of backlinks as measured by PageRankPerDocData.
  • Expertise is likely derived from semantic analysis, including the author’s entity recognition (connecting content to a known expert in the Knowledge Graph via authorObfuscatedGaiaStr) and the site’s topical focus, as measured by site2vecEmbeddingEncoded.
  • Experience, the newest addition, is likely assessed through analysis of first-person language, original imagery, and user-generated content signals.

For topics classified as YMYL (Your Money or Your Life)—such as health, finance, and safety—Google’s systems hold content to a significantly higher standard. The documentation provides concrete evidence of this with specific attributes like ymylHealthScore and ymylNewsScore. These fields store the outputs of dedicated classifiers for YMYL content in the health and news verticals. For documents identified as YMYL, the weighting of signals like siteAuthority, author credibility (via entity analysis), and factual accuracy is almost certainly increased dramatically.
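None of this weighting is documented explicitly. The sketch below simply illustrates the inference that many stored signals are aggregated into a trust assessment, with the bar raised for YMYL documents; the function, inputs, and numbers are all hypothetical.

```python
def trust_assessment(site_authority: float, nsr: float,
                     author_entity_confidence: float, is_ymyl: bool) -> float:
    """Illustrative aggregation of site- and author-level signals (all inputs 0..1)."""
    base = 0.4 * site_authority + 0.4 * nsr + 0.2 * author_entity_confidence
    if is_ymyl:
        # Hypothetical: YMYL pages need a higher bar, so the same signals
        # count for less unless they are already strong.
        return base ** 2
    return base

print(trust_assessment(0.9, 0.85, 0.8, is_ymyl=False))  # ~0.86
print(trust_assessment(0.9, 0.85, 0.8, is_ymyl=True))   # ~0.74
```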

Semantic Understanding: Beyond the Keyword

The Shift to Entities

The PerDocData model illustrates a fundamental evolution in Google’s content analysis: a definitive shift from matching keyword strings to understanding real-world entities and concepts.

The EntityAnnotations module is central to this process. It attaches specific Knowledge Graph entities that have been extracted from the page’s content. This transforms a simple document from a collection of words into an interconnected node in a web of knowledge. It allows Google to understand the things a page is about (e.g., the person “Harrison Ford,” the film series “Star Wars”) rather than just the text strings it contains. This process is facilitated by an internal system likely referred to as Webref, which provides the unique machine-readable IDs for entities, enabling the system to disambiguate between concepts with the same name (e.g., Apple the company versus apple the fruit).

Furthering this semantic understanding is the site2vecEmbeddingEncoded attribute. This represents a compressed vector embedding—a numerical representation—of an entire site’s content. In this machine learning model, the site’s collective themes and topics are mapped into a multi-dimensional space. This allows Google to mathematically measure the topical similarity between documents and even entire websites. It provides a quantifiable way to determine a site’s core focus and assess whether a new piece of content is topically consistent with the rest of the domain.

This technical implementation confirms that “topical authority” is not a vague marketing term but an algorithmically calculated concept. A website that maintains a tight focus on a specific set of related topics will generate a more coherent and powerful vector representation in this embedding space. Conversely, if a website focused on finance were to publish an article about gardening, the vector for that new article would be mathematically distant from the site’s established vector. This “topical deviation” can be measured and is likely used as a negative or dilutive signal, providing a technical basis for the long-standing strategic advice to maintain a clear topical focus and prune content that deviates from a website’s core subject area.
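A vector-embedding view of “topical deviation” is easy to illustrate: compare a new page’s embedding with the site’s established centroid. The cosine similarity below is standard maths; the notion that a site-level embedding exists comes from site2vecEmbeddingEncoded, but the toy vectors and the scoring step are a hypothetical sketch, not Google’s method.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions).
site_centroid   = [0.9, 0.1, 0.0]   # a site firmly about finance
finance_article = [0.8, 0.2, 0.1]
gardening_post  = [0.1, 0.1, 0.9]

print(cosine_similarity(site_centroid, finance_article))  # high: topically consistent
print(cosine_similarity(site_centroid, gardening_post))   # low: topical deviation
```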

The granularity of analysis extends to the most basic on-page elements. Attributes like originalTitleHardTokenCount and titleHardTokenCountWithoutStopwords show that Google is not just reading titles, but analysing their structure and composition, counting the number of “hard tokens” (meaningful words) they contain.

Ultimately, the heavy reliance on entity annotation and vector embeddings indicates that modern on-page SEO is becoming an exercise in curating a knowledge graph. The primary task is no longer to optimise for keyword density but to clearly define the entities present on a page and their relationships to one another, making the mapping process for Google’s systems as clear and unambiguous as possible. This is achieved through precise language, the use of structured data to explicitly define entities, and a logical internal linking structure that reinforces the relationships between related concepts.

The User as a Ranking Signal: Clicks, Engagement, and Navboost

Direct Evidence of Clickstream Data

The PerDocData model provides irrefutable evidence that user behaviour signals are collected, stored at the document level, and used directly in ranking. This ends years of speculation and confirms that how users interact with search results is a primary input for Google’s systems.

The documentation reveals several core click signals:

  • impressions: The total number of times a URL is shown in the search results pages (SERPs), which serves as the denominator for calculating click-through rates.
  • GoodClicks and BadClicks: A classification of user clicks that likely distinguishes between a satisfying interaction and a “pogo-stick” event, where the user clicks a result and then immediately returns to the SERP to choose another.
  • LastLongestClicks: A particularly powerful signal that identifies the last result a user clicked on in a session and on which they dwelled for a significant period. This strongly implies that the user’s query was successfully answered by that page, making it a potent indicator of relevance and quality.

These signals are the primary inputs for a ranking system known as Navboost, which is hypothesised to be one of the most powerful re-ranking “Twiddlers”. The data flow is unambiguous: users interact with the SERPs, this generates clickstream data, the data is stored in the PerDocData object for the corresponding URL, and systems like Navboost use this data to adjust rankings up or down.
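The exact Navboost formula is not in the leak, but the direction of the feedback loop can be sketched: classify clicks, then nudge a document’s score up or down accordingly. Everything below — the thresholds, weights, clamping, and function names — is an assumption for illustration only.

```python
from dataclasses import dataclass

@dataclass
class ClickRecord:
    impressions: int
    good_clicks: int
    bad_clicks: int           # e.g. quick returns to the SERP ("pogo-sticking")
    last_longest_clicks: int  # sessions that ended, satisfied, on this result

def click_adjustment(c: ClickRecord) -> float:
    """Illustrative multiplier derived from stored click signals."""
    if c.impressions == 0:
        return 1.0
    satisfaction = (c.good_clicks + 2 * c.last_longest_clicks - c.bad_clicks) / c.impressions
    # Clamp to a modest band so clicks fine-tune rather than dominate rankings.
    return max(0.8, min(1.2, 1.0 + satisfaction))

print(click_adjustment(ClickRecord(1000, 120, 15, 60)))  # > 1.0: boosted
print(click_adjustment(ClickRecord(1000, 20, 90, 2)))    # < 1.0: demoted
```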

The presence of this granular click data elevates user experience (UX) from a peripheral “good practice” to a direct and measurable ranking factor. A poor on-page experience that causes users to leave quickly will generate BadClicks and short dwell times. These negative signals are recorded in the document’s permanent record and are used to demote its ranking over time. This means that optimising title tags and meta descriptions to win the initial click is only half the battle; the other, equally important half is satisfying the user’s intent post-click to earn the GoodClicks and LastLongestClicks signals.

This click-based re-ranking system effectively functions as a massive, real-time quality control feedback loop. It allows Google to use the collective, demonstrated behaviour of millions of users to fine-tune and validate its own algorithmic rankings. If the initial Mustang algorithm places a document at position #1, but users consistently ignore it and instead award LastLongestClicks to the document at position #3, the system learns that the #3 result is likely a better answer for that query. Over time, this feedback will promote the preferred result. In essence, Google uses its users as the final and most scalable layer of quality raters, constantly refining the SERPs based on real-world preference.

The Pulse of the Web: Freshness and Temporal Signals

Dissecting Freshness Signals

The PerDocData model reveals a sophisticated approach to quantifying the timeliness and relevance of content, moving far beyond a simple reliance on publication dates. Google employs multiple methods to determine a document’s temporal context:

  • Date Extraction: The system identifies dates from multiple sources, including the bylineDate (the date explicitly stated in an article’s byline), the syntacticDate (a date parsed from the URL structure or title), and, most importantly, the semanticDate (a date that is understood from the context of the content itself using Natural Language Processing).
  • Update Significance: The presence of a lastSignificantUpdate signal is a critical revelation. It indicates that Google’s systems can differentiate between minor cosmetic changes (like fixing a typo) and substantial content revisions. This confirms that simply changing a publication date without making meaningful updates is an ineffective tactic. An update’s “value” is algorithmically determined, likely by comparing document versions and calculating a change-score. If this score passes a certain threshold, the lastSignificantUpdate timestamp is refreshed, making the page eligible for a freshness boost.
  • Freshness Scoring: The freshboxArticleScores module stores specific scores from freshness-related classifiers, which are then used by the FreshnessTwiddler to boost timely content. A signal like isHotdoc may be used to flag content that is currently trending or newsworthy.

The existence of a semanticDate signal demonstrates that Google’s NLP capabilities can override explicit dates that may be manipulated. For instance, a publisher could set a bylineDate to the current day, but if the text of the article uses past-tense language to discuss events from several years ago, the semantic analysis will identify the content as old. When a conflict arises, the system will likely trust the semantic interpretation, making it much harder to game freshness signals with misleading timestamps.
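The leak names the lastSignificantUpdate signal but not the change-detection method. One plausible, purely illustrative approach is to measure how much of a page’s text actually changed between versions and refresh the timestamp only when a threshold is crossed; the scoring function and threshold below are assumptions.

```python
import time

def change_score(old_text: str, new_text: str) -> float:
    """Crude illustration: fraction of words that differ between two versions."""
    old_words, new_words = set(old_text.split()), set(new_text.split())
    if not old_words and not new_words:
        return 0.0
    return len(old_words ^ new_words) / len(old_words | new_words)

def maybe_refresh_timestamp(old_text: str, new_text: str,
                            last_significant_update: int,
                            threshold: float = 0.2) -> int:
    # Only a substantial revision refreshes the timestamp; fixing a typo does not.
    if change_score(old_text, new_text) >= threshold:
        return int(time.time())
    return last_significant_update
```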

Query Deserves Freshness (QDF)

These sophisticated freshness signals are not applied universally. They are connected to the long-standing concept of Query Deserves Freshness (QDF). This model dictates that freshness is not a global ranking factor but is heavily weighted for specific types of queries, such as those concerning recent events, regularly recurring events (like elections or conferences), or topics that require frequent updates to remain accurate (like product reviews or technical guides).

The SpamBrain Sentinel: Demotion Signals in SpamPerDocData

Overview of SpamPerDocData

The SpamPerDocData module is the document-level repository for signals related to webspam. It serves as the record of assessments made by Google’s comprehensive, AI-driven anti-spam system, SpamBrain. Launched in 2018, SpamBrain uses machine learning to identify spam patterns, manipulative link schemes, and low-quality content with remarkable accuracy. The scores and flags stored within SpamPerDocData are the direct outputs of this system.

Specific Spam Signals

The data within this module reflects a wide and granular range of spam tactics that Google actively detects and penalises:

  • Link Spam: The spamrank attribute specifically measures “the likelihood that this document links to known spammers,” showing that outbound link quality is a measured risk factor.
  • Content Spam: The system stores numerous specific content spam scores. These include KeywordStuffingScore, GibberishScore, and SpamWordScore, all represented as 7-bit integers. This demonstrates a multi-faceted approach to identifying low-quality content, moving beyond a single “spam” label to classify the specific type of violation. The OriginalContentScore is used for pages with very little content to measure originality and combat thin content, while spamtokensContentScore specifically measures spam in user-generated content sections.
  • Behavioural & Technical Spam: The spamMuppetSignals module is used to store signals related to hacked sites, allowing for query-time identification. The trendspamScore tracks the count of matching queries related to trending spam topics, showing an ability to react to new spam waves.
  • Reputation & Behavioural Spam: SpamBrain is continuously updated to combat emerging spam trends. This includes detecting scaled content abuse (mass-producing low-value content), site reputation abuse (“parasite SEO,” where third parties publish on a reputable domain), and the abuse of expired domains.

The reference to a “likelihood of a page being webspam” suggests that spam assessment is probabilistic, not a binary yes/no decision. This allows for a spectrum of penalties. A page engaging in borderline tactics might receive a low-grade spam score that acts as a slight negative weighting in the ranking algorithm. In contrast, a page with blatant and numerous violations would receive a very high spam score, resulting in a severe demotion or complete removal from the index.

Furthermore, the system is designed to be proactive. SpamBrain can act as a gatekeeper during the indexing process, not just a janitor that cleans up later. Some content detected as spam during the initial crawl is never even added to the index, meaning its SpamPerDocData module is populated with negative signals from the very beginning of its lifecycle, preventing it from ever gaining ranking traction.

Technical Foundations and Page Experience

MobilePerDocData and Mobile-First Indexing

The PerDocData structure includes a dedicated MobilePerDocData module, which stores mobile-friendliness scores and a list of specific compatibility issues for a given URL. The existence of this module confirms that technical performance and mobile usability are not fleeting, real-time calculations but are persistent, foundational attributes of a document stored within the Content Warehouse. This elevates technical SEO from a simple checklist to a fundamental aspect of how Google perceives and categorises a document. Poor mobile performance is a negative data point permanently attached to a URL’s record in the index.

Connecting Internal Data to External Signals

The data stored internally is directly linked to the public-facing Page Experience signals that Google has emphasised in recent years.

  • Core Web Vitals (CWV): While metrics like Largest Contentful Paint (LCP), Interaction to Next Paint (INP), and Cumulative Layout Shift (CLS) are measured in the field via the Chrome User Experience Report (CrUX), it is highly probable that this data is ingested, aggregated, and stored as a summary score or classification within PerDocData. The presence of voltData, which contains page UX signals for an internal project named “VOLT,” further confirms that multiple layers of UX signals are collected and contribute to this overall assessment.
  • Other Page Experience Factors: Other key components of page experience, such as the use of HTTPS, safe browsing status, and the absence of intrusive interstitials, are also likely stored as flags or scores within the document’s data. The penalty for intrusive interstitials, in particular, is a direct ranking signal, and a page’s compliance is a stored attribute.

It is logical to conclude that these various component scores are aggregated into a single, weighted “page experience score.” Google’s systems favour efficiency, and rather than evaluating multiple separate metrics in real-time for every document in a SERP, it is more scalable to use a pre-calculated, composite score. This score can then be easily applied as a boost or demotion factor by a dedicated Twiddler during the final re-ranking stage.
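If the component signals are indeed pre-aggregated, a composite score could look something like the sketch below. The Core Web Vitals thresholds are Google’s published “good” values, but the weights, the 0–1 scale, and the function itself are assumptions rather than leaked details.

```python
def page_experience_score(lcp_ms: float, inp_ms: float, cls: float,
                          https: bool, intrusive_interstitial: bool) -> float:
    """Illustrative composite: each component contributes to a 0..1 score."""
    score = 0.0
    score += 0.3 if lcp_ms <= 2500 else 0.0   # LCP "good" threshold (2.5s)
    score += 0.2 if inp_ms <= 200 else 0.0    # INP "good" threshold (200ms)
    score += 0.2 if cls <= 0.1 else 0.0       # CLS "good" threshold (0.1)
    score += 0.2 if https else 0.0
    score += 0.1 if not intrusive_interstitial else 0.0
    return score

print(page_experience_score(2100, 150, 0.05, True, False))  # 1.0: full marks
```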

Specialised Content & Niche Signals

The PerDocData model also reveals that Google does not use a one-size-fits-all approach to document evaluation. It contains numerous modules and attributes tailored for specific content types and purposes.

Commercial Intent (commercialScore)

The commercialScore attribute is a direct measure of a document’s commerciality. A score greater than zero indicates the page “sells something.” This confirms that Google’s systems actively classify pages based on their position in the marketing funnel. This score is likely used to better match pages to queries with clear commercial intent and may also be a factor in how other quality signals are weighted.

Vertical-Specific Data Modules

The presence of dedicated data modules like BookCitationData, videodata, imagedata, BlogData, and scienceDoctype demonstrates that Google applies specialised analysis to different content verticals. A scientific paper is evaluated differently from a blog post, and a video’s properties are distinct from a book’s citation record. This indicates that optimisation strategies should be tailored to the specific content format, as Google is not evaluating them with the same generic lens.

Internationalisation and Localisation (localizedCluster)

For websites that operate in multiple languages or regions, the localizedCluster attribute is highly significant. It stores information about the “relationship of translated and/or localized pages.” This confirms that Google actively attempts to map different language versions of the same content together. Correctly implementing hreflang and other internationalisation signals is therefore critical to help Google build these clusters accurately, ensuring the correct language version is served to the appropriate user and that authority signals are consolidated across different versions of a page.

Advanced Signals and Nuanced Ranking Mechanics

Further analysis of the PerDocData model reveals several highly specific and powerful signals that offer a more granular view into the sophistication of Google’s ranking systems.

Internal Prominence and Simulated Traffic (onsiteProminence)

The onsiteProminence attribute provides a definitive confirmation of how Google calculates internal link equity. The documentation describes this as a measure of a document’s importance within its own site. Crucially, it is “computed by propagating simulated traffic from the homepage and high craps click pages.” This reveals two key insights:

  1. Google runs a simulation of user flow through a website to determine which pages are most important.
  2. This simulation starts from key entry points: the homepage and, significantly, pages that already receive high volumes of search clicks.

This confirms that internal links are not treated equally. A link from a high-traffic page to another page on the same site passes more “prominence” than a link from an obscure, rarely visited page. This provides a technical basis for the strategic advice to internally link from your highest-performing pages to other pages you wish to boost.
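The leak describes propagating simulated traffic but not the algorithm used. The sketch below uses a simple personalised-PageRank-style iteration seeded at the homepage and a high-click page, which is one common way such propagation is implemented; it is an assumption, not the documented method, and the URLs and damping factor are illustrative.

```python
def onsite_prominence(links: dict[str, list[str]], seeds: dict[str, float],
                      damping: float = 0.85, iterations: int = 20) -> dict[str, float]:
    """Propagate 'simulated traffic' from seed pages through internal links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    score = {p: seeds.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - damping) * seeds.get(p, 0.0) for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * score[page] / len(targets)
                for t in targets:
                    nxt[t] += share
        score = nxt
    return score

internal_links = {
    "/": ["/services", "/blog"],
    "/blog": ["/blog/popular-post", "/blog/obscure-post"],
    "/blog/popular-post": ["/services"],   # a high-traffic page linking onwards
}
seeds = {"/": 1.0, "/blog/popular-post": 0.5}  # homepage + a high-click page
scores = onsite_prominence(internal_links, seeds)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```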

Document Intent Classification: The “Asteroid Belt” (asteroidBeltIntents)

The asteroidBeltIntents attribute, an internal project name, points to a highly granular system for document intent classification. This system moves far beyond the traditional SEO model of informational, navigational, and transactional intent. Instead, it appears to assign a list of multiple, specific intents to a single document, each with a corresponding confidence score. This suggests that Google understands that a single page can serve multiple purposes. For example, a product page can be both transactional (“buy this”) and informational (“read reviews,” “compare specifications”). This system allows Google to match a page to a wider and more nuanced range of queries by understanding all the potential user needs it can satisfy.

Advanced Content & Duplicate Analysis (shingleInfo, bodyWordsToTokensRatio)

The documentation reveals sophisticated methods for content analysis that go beyond simple keyword counting.

  • shingleInfo: This attribute confirms the use of “shingling,” a well-established computer science technique for detecting near-duplicate content. The process involves breaking a document down into small, overlapping chunks of text (shingles) and creating a unique fingerprint. By comparing these fingerprints, Google can identify pages that are substantially similar, even if they are not exact copies. This is the technical underpinning of how Google handles duplicate and thin content (a minimal sketch of the technique follows this list).
  • bodyWordsToTokensRatio: This metric measures the ratio of meaningful words to the total number of “tokens” (words, punctuation, etc.) on a page. The documentation also specifies that this ratio is calculated separately for the beginning of the document (bodyWordsToTokensRatioBegin) and the document as a whole. This suggests a nuanced analysis of content quality and density, with particular attention paid to the content that appears “above the fold.” A low ratio could signal thin, boilerplate, or auto-generated content.
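To ground the shingling description above, here is a minimal near-duplicate fingerprint comparison in Python. The shingle length, the Jaccard similarity measure, and the example texts are my own illustrative choices, not details from the leak.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Break a document into overlapping k-word chunks ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity of two fingerprint sets: 1.0 means identical shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old river bank"

similarity = jaccard(shingles(doc_a), shingles(doc_b))
# A high overlap relative to unrelated pages flags the two as near-duplicate candidates.
print(f"{similarity:.2f}")
```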

SERP Diversity and Host Crowding (crowdingdata)

The presence of a crowdingdata module suggests a system designed to manage search result diversity. This is likely the mechanism that prevents a single domain from dominating the search results for a particular query, a phenomenon often referred to as “host crowding.” By limiting the number of results from any one site, this system ensures users are presented with a variety of sources and perspectives, improving the overall quality of the search experience.
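A simple version of host crowding is easy to express: keep the ranked order but cap how many results any one host may contribute to a SERP. The cap value and function below are illustrative assumptions, not the leaked mechanism.

```python
from urllib.parse import urlparse

def apply_host_crowding(ranked_urls: list[str], max_per_host: int = 2) -> list[str]:
    """Keep ranking order but limit how many results each host may occupy."""
    counts: dict[str, int] = {}
    diversified = []
    for url in ranked_urls:
        host = urlparse(url).netloc
        if counts.get(host, 0) < max_per_host:
            diversified.append(url)
            counts[host] = counts.get(host, 0) + 1
    return diversified

serp = [
    "https://bigsite.com/a", "https://bigsite.com/b", "https://bigsite.com/c",
    "https://othersite.com/x", "https://nichesite.com/y",
]
print(apply_host_crowding(serp))
# bigsite.com keeps two slots; its third result is dropped for diversity.
```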

Strategic Synthesis: A Unified SEO Framework for 2025 and Beyond

The analysis of the PerDocData model provides a clear and detailed blueprint of how Google evaluates web pages. It confirms a sophisticated, multi-faceted, and data-driven system that measures and stores information on everything from brand authority and user clicks to content freshness and semantic entities. The era of attempting to manipulate a simplistic algorithm is definitively over. The Content Warehouse documentation reveals a system that is increasingly reliant on measuring real-world authority, genuine user satisfaction, and demonstrable expertise.

Based on this evidence, a unified and sustainable SEO framework for the future must be built upon three core pillars.

Pillar 1: Foundational Authority & Trust

Building a strong siteAuthority and NSR score is the non-negotiable foundation of modern SEO. This requires a long-term commitment to establishing the entire domain as a credible and trustworthy source. Key activities include creating a topically focused site to build a coherent site2vecEmbeddingEncoded, earning high-quality backlinks from other authoritative sites to build PageRankPerDocData, and demonstrating E-E-A-T through clear authorship, transparent business practices, and factually accurate content.

Pillar 2: Content that Satisfies and Engages

Content must be created with the primary goal of satisfying user intent and generating positive click signals (GoodClicks, LastLongestClicks) to perform well in systems like Navboost. This necessitates a profound focus on user experience, from the SERP snippet to the on-page journey. The content itself must provide the best, most comprehensive answer to a user’s query. For topics where timeliness matters, content must be kept demonstrably fresh, triggering the lastSignificantUpdate signal to earn a boost from freshness systems.

Pillar 3: Technical Excellence & Semantic Clarity

A website must be technically flawless to ensure a positive Page Experience score is recorded in its PerDocData. This includes optimising for Core Web Vitals, ensuring mobile-friendliness, and avoiding intrusive elements. Beyond technical performance, content must be structured to clearly communicate its meaning to Google’s machine-learning models. This involves using structured data (Schema.org) to explicitly define entities, employing a logical internal linking strategy to reinforce relationships between concepts and boost onsiteProminence, and writing with clarity to aid the EntityAnnotations process.

Ultimately, the blueprint revealed by the Content Warehouse documentation confirms that sustainable success in Google’s ecosystem is less about manipulating an opaque algorithm and more about building a genuine, authoritative brand that users actively seek, trust, and engage with. Every one of these positive interactions is measured, stored, and used to rank content, creating a system that increasingly rewards authenticity and user value above all else.

Disclosure: I use generative AI when specifically writing about my own experiences, ideas, stories, concepts, tools, tool documentation or research. My tool of choice for this process is Google Gemini Pro 2.5 Deep Research. I have over 20 years’ experience writing about accessible website development and SEO (search engine optimisation). This assistance helps ensure our customers have clarity on everything we are involved with and what we stand for. It also ensures that when customers use Google Search to ask a question about Hobo Web software, the answer is always available to them, and it is as accurate and up-to-date as possible. All content was conceived, edited, and verified as correct by me (and is under constant development). See my AI policy.

Disclaimer: Any article (like this) dealing with the Google Content Warehouse leak is going to use a lot of logical inference when putting together a framework for SEOs, as I have done with this article. I urge you to double-check my work and use critical thinking when applying anything from the leaks to your site. My aim with these articles is essentially to confirm that Google does, as it claims, try to identify trusted sites to rank in its index. The aim is to irrefutably confirm that white hat SEO has purpose in 2025.


 
