What does Google say about SEO?

Official statement

Google uses posting lists that identify which documents contain certain keywords. For the query 'oatmeal cookies', for example, the posting lists indicate which documents contain 'oatmeal' and which contain 'cookies'; Google then sends the intersection of these two sets to the serving system.
🎥 Source video

Extracted from a Google Search Central video

⏱ 434h25 💬 EN 📅 23/02/2021 ✂ 8 statements
Watch on YouTube (334:42) →
Other statements from this video (7)
  1. 65:36 Can WordPress Site Kit really enhance your organic SEO?
  2. 74:07 Can Site Kit truly transform your Search Console data into a winning content strategy?
  3. 155:26 Is it true that Google indexes the Shadow DOM?
  4. 257:15 Why do Google search results change depending on when you ask the same query?
  5. 269:23 Does Google really tokenize all your content or does it discard half of the HTML?
  6. 271:20 Does Google really overlook the scripts and extra content on your pages?
  7. 326:30 How does Google query billions of pages in less than a second?
Official statement from 23/02/2021 (5 years ago)
TL;DR

Google relies on posting lists — inverted indexes that list for each keyword all documents that contain it. During a multi-term search, the algorithm calculates the intersection of these lists to isolate the candidate pages. Essentially, this means that the literal presence of the query terms in your content remains a fundamental prerequisite, even if other ranking factors come into play later.

What you need to understand

What is a posting list and why is this mechanism fundamental?

A posting list (or inverted list) is a data structure that associates each unique term in the index with the list of documents in which it appears. When you type "oatmeal cookies", Google checks two separate lists: the one for documents containing "oatmeal" and the one for documents containing "cookies".

The intersection of these two sets — the pages that have both terms — forms the pool of candidates sent to the ranking system. This is a matter of massive preliminary filtering, which drastically reduces the number of documents that need to be evaluated before moving on to complex relevance signals.
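The mechanism can be sketched in a few lines of Python. This is a toy index over invented documents, not Google's implementation: each term maps to the set of document IDs containing it, and a multi-term query intersects those sets.

```python
# Toy inverted index: illustrative only, documents and IDs are invented.

def build_posting_lists(docs):
    """Map each unique term to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def candidates(index, query):
    """Intersect the posting lists of all query terms."""
    lists = [index.get(term, set()) for term in query.lower().split()]
    if not lists:
        return set()
    return set.intersection(*lists)

docs = {
    1: "oatmeal cookies recipe",
    2: "chocolate chip cookies",
    3: "oatmeal breakfast ideas",
}
index = build_posting_lists(docs)
print(candidates(index, "oatmeal cookies"))  # {1} — only doc 1 contains both terms
```

Note how the intersection shrinks the pool before any ranking happens: docs 2 and 3 each match one term, but never reach the next stage.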

Why is Google revealing this internal mechanism now?

Gary Illyes has always been transparent about the fundamentals of indexing, but this statement comes at a time when many practitioners are focusing solely on semantic search and neglecting the literal presence of keywords. Google reminds us that, despite the advancements of BERT and MUM, step zero remains lexical matching.

This does not mean that Google ignores synonyms or rephrasing — additional layers come into play after this initial filter — but without initial matching in the posting lists, a document will not even be considered. It’s a non-negotiable technical prerequisite.

Are synonyms and variations considered in these lists?

Google enriches its posting lists with morphological variants (plurals, conjugations) and, to some extent, with known synonyms. However, this expansion is not unlimited: if you target "hiking shoes" and your page only mentions "mountain boots", there’s no guarantee that Google will establish equivalence at the posting list level.

Language models come into play later in the ranking pipeline to refine relevance. Don’t rely on them to compensate for a total absence of target keywords in your HTML — this is a common mistake since the arrival of generative AI in the SERPs.
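To make the limits of this expansion concrete, here is a toy Python sketch: a deliberately crude stemmer folds plurals into a single posting list, so a singular and a plural query hit the same entry, while distant synonyms remain separate lists. Nothing here reflects Google's actual morphology handling.

```python
# Toy morphological enrichment — a trivial plural rule, nothing like a real stemmer.

def crude_stem(term):
    """Strip a trailing 's' (real systems use far richer linguistic rules)."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def build_enriched_index(docs):
    """Index stemmed terms so 'cookie' and 'cookies' share one posting list."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(crude_stem(term), set()).add(doc_id)
    return index

docs = {1: "oatmeal cookies recipe", 2: "one oatmeal cookie"}
index = build_enriched_index(docs)
# Queries are stemmed the same way, so "oatmeal cookies" matches both docs...
hits = set.intersection(*(index[crude_stem(t)] for t in "oatmeal cookies".split()))
print(hits)  # {1, 2}
# ...but "oatmeal biscuits" matches nothing: "biscuit" has no posting list,
# even though a human would call it a synonym.
```

This is exactly the gap described above: plural folding is cheap and reliable, synonym equivalence is not.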

  • The posting lists are inverted indexes: each keyword points to a list of documents.
  • The intersection of multiple lists drastically reduces the number of candidates before ranking.
  • The literal presence of the query terms remains a technical prerequisite, even if semantic layers exist afterward.
  • Morphological variants are supported, but distant synonyms are not always at the posting list level.
  • Optimizing for the initial lexical matching remains a non-negotiable foundation of on-page SEO.

SEO Expert opinion

Is this statement consistent with real-world observations?

Absolutely. A/B tests on pages where exact keywords are removed or reintroduced show an immediate impact on traffic for long-tail queries. If Google relied solely on semantic understanding, we wouldn’t see such clear variations.

But beware: this does not mean we should revert to keyword stuffing. The presence of the terms must be natural and contextualized — the posting lists qualify you for the competition, but it’s the ranking that determines your position. A document stuffed with keywords without coherence will be filtered out by the subsequent layers.

What nuances should we add to this explanation?

Google doesn’t specify how it handles ambiguous intentions, named entities, or conversational queries here. Posting lists are a lexical filter, not a relevance engine. Once the intersection is calculated, dozens of other signals come into play: authority, freshness, UX, user signals.

Furthermore, Gary Illyes does not clarify whether the posting lists integrate structured metadata (schema.org, meta tags) or are limited to visible text content. [To be checked] — in practice, we observe that title tags and H1s seem weighted differently, suggesting some enrichment beyond raw text.

In what cases might this mechanism be insufficient?

For very low volume queries or neologisms, Google may not have constructed a robust posting list. In these cases, it relies more on language models to guess intent, risking delivering approximate results.

Another limitation: multilingual searches or queries where the user mixes multiple languages. Posting lists are generally compartmentalized by language, and the intersection may fail if Google does not correctly detect the query language. This is a known blind spot, especially for content in regional or minority languages.

Warning: This statement covers only the candidate retrieval phase. It says nothing about the ranking itself or how Google prioritizes documents once the intersection is calculated. Don’t overestimate the weight of lexical matching — it's a necessary but not sufficient condition.

Practical impact and recommendations

What actions should you take to optimize your content?

Start with a lexical audit of your strategic pages: identify target queries and check that the exact terms appear in the HTML (title, H1, body text). Use tools like Screaming Frog or Oncrawl to cross-reference your priority keywords with the indexed content.

Do not limit yourself to synonyms or "intelligent" rephrasings — if your target keyword is "young driver car insurance", make sure this phrase literally appears in your content. Language models do not always compensate for the absence of initial lexical matching.
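A minimal version of such a lexical audit can be scripted with Python's standard library: parse the HTML and check whether the target phrase literally appears in the title, the H1, and the body text. The sample HTML and target phrase below are invented for illustration.

```python
# Hedged sketch of a lexical audit using only the stdlib html.parser.
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect text per zone; 'body' accumulates all text nodes, title/H1 included."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.zones = {"title": "", "h1": "", "body": ""}

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag in ("title", "h1"):
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.zones[self.current] += data
        self.zones["body"] += data

def lexical_audit(html, phrase):
    """Return, per zone, whether the exact phrase appears (case-insensitive)."""
    parser = TextCollector()
    parser.feed(html)
    phrase = phrase.lower()
    return {zone: phrase in text.lower() for zone, text in parser.zones.items()}

html = """<html><head><title>Young driver car insurance quotes</title></head>
<body><h1>Cheap cover for new drivers</h1><p>Compare policies today.</p></body></html>"""
print(lexical_audit(html, "young driver car insurance"))
# The title matches literally, the H1 does not — exactly the gap to flag.
```

A crawler like Screaming Frog does this at scale; the point of the sketch is that the check itself is a literal substring test, not a semantic one.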

What mistakes should you absolutely avoid?

The first mistake: believing that AI-generated content is sufficient because it "understands the subject". If the produced text constantly paraphrases without ever using your target query’s exact terms, you will not appear in the intersection of posting lists for that query.

The second mistake: neglecting query variations. "Running shoes" and "race sneakers" may seem equivalent, but Google builds distinct posting lists. Cover both in your content if you aim for both traffic segments — don’t rely on automatic equivalence.

How can I check if my site properly utilizes this mechanism?

Use Search Console to cross-reference the actual queries generating impressions with your page content. If you have impressions without clicks on strategic terms, you are in the posting lists but poorly ranked — a ranking issue, not a matching issue.

If you have no impressions on a keyword you thought you were targeting, either Google has not indexed it (crawling issue) or your page does not contain the exact term and therefore does not appear in the corresponding posting list. A simple test: search for "site:yourdomain.com exact-keyword" — if no results, you have a lexical matching issue.
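The triage described above can be expressed as a small decision sketch. All page text and Search Console figures here are invented; the thresholds are the simplest possible reading of the diagnostic, not an official rule.

```python
# Rough triage following the diagnostic above: matching vs indexing vs ranking issue.

def diagnose(keyword, page_text, impressions, clicks):
    """Classify a target keyword from its literal presence and GSC figures."""
    present = keyword.lower() in page_text.lower()
    if impressions == 0:
        # Never shown: either the term is absent from the page (matching issue)
        # or the page was not crawled/indexed.
        return "matching issue" if not present else "crawling/indexing issue"
    if clicks == 0:
        # Shown but never clicked: the page is in the posting lists,
        # so this is a ranking/snippet problem, not a matching one.
        return "ranking issue"
    return "ok"

page_text = "Compare young driver car insurance quotes in minutes."
print(diagnose("young driver car insurance", page_text, 120, 0))  # ranking issue
print(diagnose("cheap car cover", page_text, 0, 0))               # matching issue
```

Run this over a Search Console export of your strategic queries and the "matching issue" bucket gives you the lexical gaps to fix first.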

  • Check the literal presence of target keywords in title, H1, and body text
  • Cross-reference Search Console queries with indexed content to identify lexical gaps
  • Cover morphological variants and direct synonyms within the same content
  • Do not delegate lexical optimization to generative AI without SEO proofreading
  • Regularly test with "site:" searches to verify matching
  • Prioritize strategic pages for a thorough lexical audit before scaling
The mechanism of posting lists reminds us that on-page SEO remains fundamental: without initial lexical matching, your content isn’t even in the race. Combine this optimization with a solid content strategy and impeccable technical architecture. These projects can quickly become complex to orchestrate alone, especially at the scale of a site with thousands of pages — enlisting a specialized SEO agency can help industrialize these checks and avoid costly blind spots.

❓ Frequently Asked Questions

Do posting lists take synonyms into account?
Google enriches its posting lists with morphological variants (plurals, conjugations) and some direct synonyms, but this expansion is limited. Don't count on it to compensate for the total absence of a target keyword in your content.
If my page contains only one of the two words of a query, does it still appear?
No. For a multi-term query like "oatmeal cookies", Google computes the intersection of the posting lists. If your page contains only one of the two terms, it will not be in the pool of candidates sent to ranking.
Are meta tags and schema.org taken into account in the posting lists?
Google does not specify this officially, but field observations suggest that the title and H1 are weighted differently from body text, which implies some enrichment of the posting lists beyond raw text. To be verified through testing.
Should I reintroduce my exact keywords even if my content is semantically rich?
Yes. Language models intervene after the posting-list filter. Without initial lexical matching, your page will not even be evaluated by the semantic ranking layers, however relevant it may be conceptually.
How can I check whether Google has properly indexed my target keywords?
Use a "site:yourdomain.com exact-keyword" search in Google. If no results appear, either the term is not present in your indexed HTML or Google has not crawled it. Then cross-check with Search Console to confirm.

