Official statement
Other statements from this video
- 65:36 Can WordPress Site Kit really enhance your organic SEO?
- 74:07 Can Site Kit truly transform your Search Console data into a winning content strategy?
- 155:26 Is it true that Google indexes the Shadow DOM?
- 257:15 Why do Google search results change depending on when you ask the same query?
- 271:20 Does Google really overlook the scripts and extra content on your pages?
- 326:30 How does Google query billions of pages in less than a second?
- 334:42 How does Google truly identify relevant documents for a query?
Google breaks documents down into tokens during indexing and does not retain all of the raw HTML. Only certain specific HTML elements, the actual words, and their exact positions are stored, as the position of terms directly impacts rankings. This means that any non-strategic HTML optimization can be completely ignored by the index.
What you need to understand
What is tokenization and why does Google use it?
Tokenization is the process by which Google breaks down a document into basic units called tokens. A token can be a word, a part of a word, a number, or even a symbol. This segmentation allows the engine to process and analyze content algorithmically rather than just storing entire pages.
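To make the idea concrete, here is a minimal, naive tokenizer sketch. Google's actual tokenizer is far more sophisticated (subword units, language-specific rules, symbol handling); this only illustrates the principle of reducing a document to basic units.

```python
import re

def tokenize(text: str) -> list[str]:
    # Naive word-level tokenization: lowercase everything and split
    # on any run of non-alphanumeric characters. Real search-engine
    # tokenizers also handle numbers, symbols, and subword pieces.
    return [t for t in re.split(r"[^\w]+", text.lower()) if t]

tokenize("Google breaks documents down into tokens.")
# → ["google", "breaks", "documents", "down", "into", "tokens"]
```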
The benefit for Google? Massive reduction in storage needs and optimization of processing times during ranking. Instead of keeping billions of HTML pages with all their tags, attributes, and scripts, the index retains only what matters: meaningful terms and their positional context.
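The classic data structure behind this is the positional inverted index: instead of storing pages, the engine maps each token to the documents and word positions where it occurs. A toy version (an illustration, not Google's implementation) looks like this:

```python
from collections import defaultdict

def build_positional_index(docs: dict[str, str]) -> dict:
    # Map each token to {doc_id: [word positions]}. Only tokens and
    # their positions survive — none of the original markup does.
    index: dict[str, dict[str, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

idx = build_positional_index({"d1": "seo guide for seo beginners"})
# idx["seo"]["d1"] → [0, 3]
```

Storing a handful of integers per token is dramatically cheaper than keeping full HTML, which is the storage and speed benefit described above.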
Which HTML elements are actually retained?
Gary Illyes notes that certain HTML elements are retained for specific reasons, without detailing exactly which ones. Reasonable candidates are the semantic structure tags: <title>, <h1>-<h6>, <strong>, <a> with its href attribute, along with emphasis and list tags.
Structured data from schema.org, although expressed in JSON-LD rather than HTML markup, is also likely tokenized and retained separately. What disappears? Probably most inline CSS, style attributes, div/span tags without semantic value, and the many custom data-* attributes that serve no ranking purpose.
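A filter along these lines can be sketched with Python's built-in html.parser. The SEMANTIC_TAGS set below is an assumption based on the candidates discussed above, not a confirmed list of what Google keeps:

```python
from html.parser import HTMLParser

# Hypothetical allowlist — the article's guess at semantically
# meaningful tags, NOT a confirmed list from Google.
SEMANTIC_TAGS = {"title", "h1", "h2", "h3", "strong", "em", "a", "li"}

class SemanticFilter(HTMLParser):
    """Keep only tags assumed to carry ranking signal; drop the rest."""
    def __init__(self):
        super().__init__()
        self.kept: list[tuple[str, str | None]] = []

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_TAGS:
            # Keep href on links; class, style, and data-* are discarded.
            self.kept.append((tag, dict(attrs).get("href")))

f = SemanticFilter()
f.feed('<div class="x"><h1>Title</h1><a href="/p" data-id="9">link</a></div>')
# f.kept → [("h1", None), ("a", "/p")]
```

Note how the wrapping div and the data-id attribute vanish entirely, which is the behavior the statement implies for non-semantic markup.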
Why is the position of terms so important?
Google explicitly stores the exact position of each word in the document. This information serves various ranking algorithms: the proximity of terms in multi-word queries, detection of hot zones (title, beginning of paragraphs, anchors), and analysis of positional density.
A term appearing within the first 100 words of a document will likely carry different weight than the same term appearing at position 2000. Passage ranking algorithms heavily rely on this positional data to identify the most relevant sections of long content.
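Two toy functions can illustrate how stored positions feed ranking: a position-based weight (the decay curve here is an invented assumption for illustration) and the minimum distance between two query terms, a classic proximity signal:

```python
def position_weight(position: int, early_cutoff: int = 100) -> float:
    # Toy model: terms in the first 100 words get full weight,
    # later occurrences decay. The exact curve is an assumption;
    # only the general principle (earlier counts more) is claimed.
    return 1.0 if position < early_cutoff else early_cutoff / position

def min_pair_distance(pos_a: list[int], pos_b: list[int]) -> int:
    # Smallest gap between any occurrence of term A and any of term B.
    # Small distances suggest the terms appear in the same passage.
    return min(abs(a - b) for a in pos_a for b in pos_b)

position_weight(50)               # → 1.0
position_weight(2000)             # → 0.05
min_pair_distance([2, 50], [55, 400])  # → 5
```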
- Google tokenizes documents to reduce storage and speed up algorithmic processing
- Only certain HTML elements are kept, likely those with semantic or structural value
- The exact position of words is stored because it directly influences ranking
- The complete raw HTML is not retained in the main index
- HTML optimizations with no semantic impact are likely to be completely ignored
SEO Expert opinion
Is this statement consistent with field observations?
Yes, and it explains many phenomena observed over the years. Tests showing that overloaded HTML code does not directly penalize ranking (as long as the content remains accessible) can be explained by this tokenization: Google simply discards what it doesn't care about.
This also explains why certain cosmetic optimizations — adding span tags with invented class names, sprinkling aria attributes inconsistently, or endlessly nesting divs — have no measurable impact. If Google does not retain these elements, they are invisible to ranking.
What gray areas remain in this statement?
Gary Illyes remains intentionally vague about which specific HTML elements are retained and why. [To verify]: are <em> and <strong> tags really distinguished from <b> and <i> tags during tokenization? What about alt, title, aria-label attributes?
Another point not clarified: how does Google manage modern JavaScript in this process? If content is dynamically injected after rendering, is the position of tokens calculated on the final DOM or on the initial HTML? Can rendering latency affect positional accuracy? [To verify] on complex sites in React or Vue.
In what cases might this logic not apply?
This tokenization pertains to the main search index, but other Google systems may operate differently. Google Discover, Google News, featured snippets, or rich results likely rely on specialized indexing pipelines that retain more HTML structure or metadata.
Interactive elements (forms, buttons) or attributes related to accessibility might also be processed by parallel systems — it’s known that Google uses accessibility as an indirect quality signal. So, beware of oversimplifying: tokenization does not mean that everything else is useless.
Practical impact and recommendations
What should you prioritize optimizing in terms of HTML?
Focus on high-value semantic tags: <title>, <h1>-<h6>, <strong>, <a>, and attributes that carry meaning like href, alt, or structured data. These elements are likely to be retained and analyzed during tokenization.
Place your strategic keywords at the beginning of the document, within the first 100-200 words, and in semantically important areas (titles, beginning of paragraphs, internal link anchors). Position matters, so write accordingly: no vague introductory fluff before getting to the point.
What mistakes should be absolutely avoided?
Stop wasting time on cosmetic HTML optimizations with no semantic value. Multiplying divs with "SEO-friendly" class names, adding made-up data-* attributes, or over-structuring the markup to "help Google" achieves nothing if these elements are never retained.
Also, avoid hiding important content in heavy JavaScript that delays rendering. If Google tokenizes after rendering but your content takes 5 seconds to appear, the calculated position may be skewed, or the content may be partially ignored if the rendering timeout is reached.
How can you check if your HTML structure is being utilized correctly?
Use the URL Inspection Tool in the Search Console and look at the "Crawled" version of your page. Compare the source HTML to the final DOM: if strategic content appears only after JavaScript rendering, measure the time needed and check for consistency.
Also, test the positional density of your key terms using tools like Screaming Frog or custom scripts that calculate the exact position (in terms of number of words since the beginning of the <body>) of each occurrence. If your strategic terms only show up after 1000 words of fluff, it’s a structural problem.
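A custom script of the kind suggested above can be written with only the standard library. This sketch extracts the visible text of <body> (skipping scripts and styles) and reports the word position of each occurrence of a term; it is a simplified stand-in for a Screaming Frog custom extraction, not a reproduction of how Google counts:

```python
import re
from html.parser import HTMLParser

class BodyText(HTMLParser):
    """Collect visible words inside <body>, ignoring <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip = 0
        self.words: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style"):
            self.skip -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip:
            self.words += re.findall(r"\w+", data.lower())

def term_positions(html: str, term: str) -> list[int]:
    # Word index of each occurrence, counted from the start of <body>.
    parser = BodyText()
    parser.feed(html)
    return [i for i, w in enumerate(parser.words) if w == term.lower()]

html = "<html><body><h1>SEO guide</h1><p>An SEO primer</p></body></html>"
term_positions(html, "seo")  # → [0, 3]
```

If this kind of check shows your strategic terms first appearing past position 1000, that is the structural problem the paragraph describes.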
- Favor recognized semantic tags (<h1>, <strong>, <a>) over generic divs
- Place key terms at the beginning of the document and in structural areas
- Clean HTML of attributes and tags without semantic value
- Ensure strategic content is accessible without JavaScript or appears quickly after rendering
- Test the exact position of terms with scraping tools or custom scripts
- Compare the source version and the crawled version by Googlebot in the Search Console
❓ Frequently Asked Questions
Does Google really keep every word on a page, or can it ignore some of them?
Do <strong> and <em> tags still have an SEO impact if Google tokenizes?
If Google discards part of the HTML, why optimize the source code?
Does the position of words count in the final DOM (after JavaScript) or in the source HTML?
How can you know exactly which HTML elements Google retains during tokenization?
🎥 From the same video
Other SEO insights extracted from this same Google Search Central video · duration 434h25 · published on 23/02/2021
🎥 Watch the full video on YouTube →