Chrome’s New Shopping Classifier
One of our AI SEO hall-of-famers, Olivier de Segonzac from RESONEO, has gained access to Google’s shopping classifier model. We examined the model, reverse-engineered its inference pipeline, and this article covers what we found.
TL;DR
- Newly shipped in Chrome.
- Determines whether a web page is a shopping page or not.
- Every page you visit gets scored.
- Score is stored in Chrome’s history database.
- Used to personalize user experience and recommendations.
- The model splits your page into up to 10 chunks of ~100 words each and truncates every chunk to 64 tokens.
- Roughly half the words never reach the model.
Model Demo
Below is a real-world implementation of the model, tested by loading a shopping-related page and following Chrome’s native 10-passage, 64-tokens-per-passage logic.
The Pipeline
The classifier doesn’t look at raw HTML. It doesn’t look at the DOM directly either. Chrome uses a structured content extraction system called AnnotatedPageContent, accessible via the Chrome DevTools Protocol method Page.getAnnotatedPageContent. This system walks the rendered page and produces a tree of typed content nodes: text, tables, and image captions.
The full pipeline looks like this:
Rendered Page
→ Blink AnnotatedPageContent extraction (5 seconds after load)
→ Text nodes collected from content tree
→ Greedy word-count chunking into passages
→ SentencePiece tokenization (64 tokens per passage)
→ Passage Embedder (TFLite) → 768-dim vectors
→ Mean pooling + title/URL embedding concatenation → 1536-dim input
→ Shopping Classifier (TFLite) → probability score (0 to 1)
How Pages Are Chunked
There is no semantic segmentation. Chrome uses a greedy word counter. Text items from the content tree are accumulated into a passage until the word count reaches 100, then a new passage starts. Items shorter than 5 words are always appended to the current passage rather than starting a new one.
The limits:
- 100 words max per passage
- 5 words min per text item to trigger a new passage
- 10 passages max per page
Everything beyond the first 10 passages is discarded.
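The greedy chunking rule can be sketched in a few lines. This is an approximation of the behavior described above, not Chrome’s actual code; the exact flush condition (e.g. whether a passage flushes at exactly 100 words or only when an item would push it past 100) may differ slightly.

```python
def chunk_passages(text_items, max_words=100, min_item_words=5, max_passages=10):
    """Greedy word-count chunking: accumulate text items into ~100-word
    passages; items under 5 words never start a new passage on their own."""
    passages, current, count = [], [], 0
    for item in text_items:
        n = len(item.split())
        # Start a new passage only if the current one is non-empty, adding
        # this item would exceed the word budget, and the item is long enough.
        if current and count + n > max_words and n >= min_item_words:
            passages.append(" ".join(current))
            current, count = [], 0
        current.append(item)
        count += n
    if current:
        passages.append(" ".join(current))
    return passages[:max_passages]  # everything past 10 passages is discarded
```

Note how the cap interacts with long pages: fifteen 100-word text items produce fifteen passages, but only the first ten survive.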
The Tokenizer Bottleneck
Each passage is tokenized with SentencePiece and then truncated to 64 tokens. An EOS token is appended if there’s room, and shorter sequences are zero-padded.
64 tokens translates to roughly 35–50 English words depending on vocabulary complexity. Product names and brand-heavy text tokenize less efficiently (around 35 words), while natural prose gets closer to 50.
This means each 100-word passage loses roughly half its content at the tokenizer stage. Across 10 passages, the model effectively sees about 400–450 words of a page that may contain thousands.
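The truncate-then-pad step is simple to sketch. The EOS and pad token IDs below are placeholders (SentencePiece models commonly use pad=0, and the EOS ID depends on the vocabulary), and the real tokenizer obviously produces IDs from text rather than receiving them:

```python
def truncate_and_pad(token_ids, seq_len=64, eos_id=2, pad_id=0):
    """Truncate a token sequence to 64 IDs, append EOS only if there's room
    after truncation, then zero-pad to the fixed sequence length."""
    ids = list(token_ids[:seq_len])
    if len(ids) < seq_len:
        ids.append(eos_id)  # EOS is dropped entirely when the passage fills all 64 slots
    return ids + [pad_id] * (seq_len - len(ids))
```

A 100-token passage comes out as exactly its first 64 IDs with no EOS; a 3-token passage comes out as the 3 IDs, an EOS, and 60 zeros.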
The Embedder
The passage embedder (OPTIMIZATION_TARGET_PASSAGE_EMBEDDER) is a TFLite DualEncoder transformer model. It takes int32[1, 64] token IDs as input and outputs a float32[1, 768] embedding vector. The same model embeds both the page passages and the title/URL string.
The title/URL input is constructed by concatenating the page title and URL with a separator: "Page Title - https://example.com/path".
The Classifier
The shopping classifier takes a float32[1, 1536] input vector, which is two 768-dim embeddings concatenated:
- First 768 dimensions: title/URL embedding
- Last 768 dimensions: mean-pooled passage embeddings
Multiple passage embeddings are combined using element-wise mean pooling. This is specified in the model’s metadata (pooling_strategy = POOLING_STRATEGY_MEAN, max_passages = 10).
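Assembling the classifier input is then just mean pooling plus concatenation. A minimal NumPy sketch, assuming the embeddings arrive as flat 768-dim vectors:

```python
import numpy as np

def build_classifier_input(title_url_emb, passage_embs):
    """Concatenate the title/URL embedding with the element-wise mean of
    up to 10 passage embeddings, yielding the 1536-dim classifier input."""
    pooled = np.mean(np.stack(passage_embs, axis=0), axis=0)  # (768,)
    return np.concatenate([title_url_emb, pooled])            # (1536,)
```

The first 768 dimensions carry the title/URL signal, the last 768 the pooled page content, matching the layout described above.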
The output is a single float between 0 and 1 representing the probability that the page is a shopping page.
Testing It
I extracted both models from Chrome and built a Streamlit app that replicates the full pipeline. It uses Selenium to launch Chrome Canary, calls Page.getAnnotatedPageContent via CDP to get the same structured content Chrome uses internally, then runs the chunking, tokenization, embedding, and classification steps.
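Collecting text items from the returned content tree amounts to a recursive walk. The node shape below (`type`, `text`, `children` keys) is a simplified stand-in for illustration; the actual structure returned by Page.getAnnotatedPageContent differs and depends on the Chrome build.

```python
def collect_text_items(node, out=None):
    """Depth-first walk over a simplified AnnotatedPageContent-style tree,
    collecting the text of every text node in document order."""
    if out is None:
        out = []
    if node.get("type") == "text" and node.get("text"):
        out.append(node["text"])
    for child in node.get("children", []):
        collect_text_items(child, out)
    return out
```

In the real app the tree comes from Selenium’s `driver.execute_cdp_cmd("Page.getAnnotatedPageContent", {})`, and the collected items feed directly into the chunking step.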
Results on a few test inputs:
| Input | Score |
|---|---|
| “Breaking news: earthquake hits California coast” | 0.0000 |
| “How to learn Python programming for beginners” | 0.0000 |
| “Wikipedia – History of the Roman Empire” | 0.0000 |
| “BBC Sport – Premier League results and fixtures” | 0.0000 |
| “Amazon.com: Apple iPhone 15 Pro Max 256GB” | 1.0000 |
| “Best deals on laptops this Black Friday – up to 50% off” | 1.0000 |
| dejan.ai | 0.0000 |
| owayo.com/custom-cycling-jerseys.htm | 0.9998 |
The model produces sharp, confident separations despite the lossy input pipeline.
What Chrome Does With the Score
The shopping classification feeds two systems:
Per-page annotation. The score is stored in Chrome’s history database as part of VisitContentAnnotations. This is used by History Journeys to cluster shopping visits together.
User-level segmentation. Scores are aggregated over time by Chrome’s Segmentation Platform into a separate model (OPTIMIZATION_TARGET_SEGMENTATION_SHOPPING_USER). If a user is classified as a “shopping user,” Chrome enables commerce features: price tracking in the omnibox, price drop notifications, shopping insights in the side panel, and shopping cards on the new tab page.
The per-page classifier is a signal collector that builds a user-level shopping profile, which in turn gates which commerce features Chrome presents.
Why This Matters for E-Commerce SEO
If Chrome can’t identify your page as a shopping page from the first ~450 words of visible content, your users won’t see commerce features like price tracking and shopping insights. Navigation menus, cookie banners, and boilerplate that appear early in the DOM consume your token budget before the model reaches your product information. E-commerce sites that bury product signals below heavy navigation and promotional blocks risk being invisible to the classifier entirely.