
How Human Quality Raters Are Used – New Evidence From the DOJ v. Google Antitrust Trial


DOJ v. Google – Case No. 1:20-cv-03010 (Remedial Phase Opinion)

The court document reveals that scores from human quality raters are a direct and foundational input for training Google’s core ranking models.

Key Takeaways

  • Human quality rater scores are a direct training input for Google’s core ranking models (RankEmbed, RankEmbedBERT).

  • These scores are foundational data, combined with query and click logs, rather than peripheral feedback.

  • Court testimony shows rater-trained models improved Google’s performance on long-tail queries.

  • This contradicts Google’s long-standing public stance that rater scores only serve as indirect benchmarks.

The Role of Human Quality Raters

The DOJ v. Google remedial opinion makes clear that human quality raters are not just external evaluators – their judgments directly shape the very core of Google’s ranking systems. The opinion reveals that the RankEmbed and RankEmbedBERT models, which are central to Google’s AI-based ranking, are trained on two primary sources of data: search logs and human rater scores. This elevates rater input from “guidance” to direct training data.

The testimony of Google’s Vice President of Search, Dr. Pandu Nayak, further highlights their impact: rater-trained RankEmbedBERT models significantly improved Google’s ability to process complex, long-tail queries, where language understanding is essential.

The court emphasised that these scores form a foundational dataset in combination with user interaction logs. The data pipeline for RankEmbed models explicitly relies on the scoring of web pages by raters, embedding their judgments into machine learning systems that decide how billions of pages are ranked.
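To illustrate that pipeline (not to describe Google's actual systems, which are not public), here is a minimal Python sketch of what joining the two data sources named in the opinion might look like. Every class, field and function name below (QueryLogEntry, RaterScore, build_training_examples) is a hypothetical stand-in.

```python
from dataclasses import dataclass

# Hypothetical record types -- the opinion names the two data sources
# (click-and-query logs, human rater scores) but not their schemas.

@dataclass
class QueryLogEntry:
    query: str
    url: str
    clicked: bool  # whether the user clicked this result

@dataclass
class RaterScore:
    url: str
    score: float  # quality judgment assigned by a human rater

def build_training_examples(logs, rater_scores):
    """Join click-and-query data with rater scores, per URL.

    Each resulting example pairs behavioural evidence (clicks)
    with a human quality label -- the two inputs the opinion says
    the RankEmbed models are trained on.
    """
    score_by_url = {r.url: r.score for r in rater_scores}
    examples = []
    for entry in logs:
        if entry.url in score_by_url:
            examples.append({
                "query": entry.query,
                "url": entry.url,
                "clicked": entry.clicked,
                "rater_score": score_by_url[entry.url],
            })
    return examples

logs = [QueryLogEntry("best hiking boots", "example.com/boots", True)]
scores = [RaterScore("example.com/boots", 0.9)]
print(build_training_examples(logs, scores))
```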

This stands in contrast to Google’s public communications, which have long maintained that rater scores do not directly affect site rankings. While technically true at the individual page level, the opinion shows that, in aggregate, rater scores are systemic training inputs that define how the search engine learns to rank. The models built from this data have “directly contributed to Google’s quality edge over competitors,” underscoring just how central rater input is to the evolution of Google Search.

Direct Training Data for Ranking Models

The document explicitly states that human rater scores are one of two primary data sources used to train the RankEmbed and RankEmbedBERT models. These are described as sophisticated, AI-based systems critical to Google’s search quality:

RankEmbed and its later iteration RankEmbedBERT are ranking models that rely on two main sources of data: [redacted]% of 70 days of search logs plus scores generated by human raters and used by Google to measure the quality of organic search results.

Improving Performance on Difficult Queries

The impact of these rater-trained models is significant, particularly in improving Google’s ability to handle complex and less common searches. Testimony from Google’s Vice President of Search confirms:

RankEmbedBERT was again one of those very strong impact things, and it particularly helped with long-tail queries where language understanding is that much more important.

Providing Foundational Data for Machine Learning

The document clarifies that rater scores are not just casual feedback but a fundamental dataset for these AI systems:

The data underlying RankEmbed models is a combination of click-and-query data and scoring of web pages by human raters.

This establishes human judgments as a core component used to teach the models how to rank search results.
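As a rough sketch of that mechanism, the toy example below trains a tiny pairwise ranking model, in the spirit of learning-to-rank losses such as RankNet, using human rater scores as the source of preference labels. The features, scores and linear model are all illustrative assumptions; RankEmbed and RankEmbedBERT are far more sophisticated neural systems. What the sketch shows is the principle: rater preferences become the gradient signal that teaches the model which result should outrank which.

```python
import numpy as np

# Toy documents with 2 hypothetical relevance features each.
features = np.array([[0.9, 0.2],   # doc A
                     [0.4, 0.7],   # doc B
                     [0.1, 0.1]])  # doc C

# Hypothetical human rater scores: A judged best, C worst.
rater_scores = np.array([0.9, 0.6, 0.2])

w = np.zeros(2)   # linear scoring model: score(doc) = features @ w
lr = 0.1

# Every ordered pair (i, j) where raters preferred doc i over doc j.
pairs = [(i, j) for i in range(3) for j in range(3)
         if rater_scores[i] > rater_scores[j]]

for _ in range(200):
    for i, j in pairs:
        s_i, s_j = features[i] @ w, features[j] @ w
        # RankNet-style pairwise logistic loss: model the probability
        # that doc i outranks doc j and push it toward 1, so the
        # raters' preferences become the gradient signal.
        p = 1.0 / (1.0 + np.exp(-(s_i - s_j)))
        w -= lr * (p - 1.0) * (features[i] - features[j])

print("learned ranking:", np.argsort(-(features @ w)))  # expect [0 1 2]
```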

Deviation From Google’s Past Statements

The finding that rater scores are a direct training input for a core ranking model like RankEmbed clarifies, and arguably deviates from, the spirit of Google’s long-held public statements.

Google’s Public Stance

For years, Google has consistently stated that quality rater scores do not directly impact the ranking of any individual website. The company’s official guidance describes the raters’ role as providing feedback to help “benchmark the quality of our results” and “evaluate changes.” This has often been interpreted to mean their influence is indirect — more like feedback that helps engineers tune the system overall, rather than a direct ranking signal.

The Deviation Revealed in Court

The court document clarifies that this influence is far more direct and systemic than previously understood. While a single rater’s score may not manually move a page up or down, the aggregated scores are a foundational dataset used to build and train an automated ranking system that is a core part of the algorithm:

The RankEmbed models trained on that data have directly contributed to the company’s quality edge over competitors.

In essence, Google’s statements are technically correct in that no single rater score directly changes a site’s ranking. But the court’s findings show a direct, systemic link where the collective judgment of raters is used to train core AI ranking models.

This role is far more influential than the “feedback” or “benchmarking” role Google has historically emphasised.
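To make that distinction concrete, here is a small, purely hypothetical calculation in Python: removing any single rater's score barely moves a page's aggregated training label, yet shifting the collective consensus moves the label, and with it the signal the model learns from. The scores and the simple averaging scheme are illustrative assumptions, not Google's methodology.

```python
# Hypothetical quality scores for one page from eight raters.
page_scores = [0.8, 0.7, 0.9, 0.75, 0.85, 0.8, 0.7, 0.9]

label_all = sum(page_scores) / len(page_scores)

# Drop any single rater: the aggregated training label barely moves,
# which is why no individual score "directly" changes a ranking.
minus_one = page_scores[:2] + page_scores[3:]
label_minus_one = sum(minus_one) / len(minus_one)

print(f"aggregate label:   {label_all:.3f}")        # 0.800
print(f"minus one rater:   {label_minus_one:.3f}")  # 0.786

# Yet the aggregate itself is the supervision signal: shift the
# consensus and the model's learned notion of "quality" shifts too.
shifted = [s - 0.3 for s in page_scores]
print(f"shifted consensus: {sum(shifted)/len(shifted):.3f}")  # 0.500
```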

This website uses cookies so that we can provide you with the best user experience possible. Cookie information is stored in your browser and performs functions such as recognising you when you return to our website and helping our team to understand which sections of the website you find most interesting and useful.