
Official statement

To identify duplicate pages, Google checks the content while ignoring repetitive, value-less parts like boilerplate. 'Soft' errors can disrupt this identification, so it's advisable to return appropriate HTTP errors.
🎥 Source video

Extracted from a Google Search Central video (statement at 3:07)

⏱ 8:02 💬 EN 📅 31/03/2020 ✂ 12 statements
Watch on YouTube (3:07) →
Other statements from this video (11)
  1. 2:35 Why are redirects truly indispensable during a site redesign?
  2. 3:35 Why are redirects critical during a site redesign?
  3. 3:50 Should you really return a 500 code rather than a 200 for an error page?
  4. 4:10 Are rel=canonical tags really a reliable signal for controlling clustering?
  5. 4:46 Is rel=canonical really indispensable to avoid indexing errors?
  6. 5:14 Can localized content be treated as duplicate content by Google?
  7. 5:25 Can hreflang really prevent Google from deduplicating your localized pages?
  8. 5:50 How does Google really choose the representative URL to index?
  9. 6:19 How does Google choose the canonical URL within a cluster of similar pages?
  10. 8:02 Why do contradictory canonical signals sabotage your indexing?
  11. 8:02 What happens when your canonical signals contradict each other?
📅 Official statement from 31/03/2020
TL;DR

Google ignores boilerplate (menus, sidebars, footers) and compares only the main content to pinpoint genuine duplicates. Soft 404 errors, i.e. pages returning a 200 code when they should return a 404 or 410, disrupt this detection and can produce false positives. The result: your unique pages may be treated as duplicates if your HTTP codes are misconfigured.

What you need to understand

What does Google mean by 'content without boilerplate'?

Google does not blindly compare the raw HTML of two pages to decide whether they are duplicates. The engine first extracts what it calls the 'main content' by discarding everything that repeats across multiple pages: navigation, sidebars, footers, recurring headers.

Specifically, if your template includes 600 words of boilerplate and only 150 words change between two pages, Google only looks at those 150 words. If those 150 words are identical or nearly identical across two different URLs, you have a duplicate in the eyes of the algorithm — even if the visual appearance is different.
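
To make this arithmetic concrete, here is a minimal sketch (the pages and the block-based boilerplate heuristic are illustrative; this is not Google's actual algorithm) that strips the blocks shared by every page before measuring similarity:

```python
# Illustrative sketch only, not Google's actual algorithm: drop blocks
# shared by every page (boilerplate), then compare what remains.
from difflib import SequenceMatcher

def main_content(page, all_pages):
    """Keep only the blocks that are not shared by every page."""
    boilerplate = set.intersection(*(set(p) for p in all_pages))
    return " ".join(block for block in page if block not in boilerplate)

pages = [
    ["Home | Products | Contact",
     "Blue cotton t-shirt, 100% organic, machine washable.",
     "© Example Shop - Terms - Privacy"],
    ["Home | Products | Contact",
     "Blue cotton t-shirt, 100% organic fabric, machine washable.",
     "© Example Shop - Terms - Privacy"],
]

a, b = (main_content(p, pages) for p in pages)
# Only the product sentences are compared; the shared template is ignored.
print(SequenceMatcher(None, a, b).ratio())  # high (~0.9): near-duplicate
```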

What are the issues caused by soft 404 errors?

A soft 404 error is a page that no longer exists or contains nothing useful but returns a 200 HTTP code ('all is well') instead of a 404 or 410. Typical cases: empty category pages, search results pages with no results, deleted product pages still accessible behind a generic message.

The trap: all these pages often display the same error message or the same empty template. Google extracts the main content, finds 50 identical words across 20 different URLs, and concludes that there is a cluster of duplicates. The result: these pages are likely to be grouped together, and only one version will be indexed — often not the one you would have chosen.
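
One way to hunt for this trap: probe URLs that should be gone and see whether the server still answers 200. A minimal sketch using the requests library (the URLs are hypothetical):

```python
# Minimal soft-404 probe; URLs are hypothetical. A healthy server should
# answer 404/410 for pages that no longer exist, never 200.
import requests

removed_urls = [
    "https://example.com/products/discontinued-widget",
    "https://example.com/search?q=zero-results-query",
]

for url in removed_urls:
    r = requests.get(url, allow_redirects=False, timeout=10)
    if r.status_code == 200:
        # A 200 on a dead page is a soft 404: Google will extract its thin
        # content and may cluster it with every page showing the same message.
        print(f"Soft 404 suspected: {url}")
    else:
        print(f"OK: {url} -> {r.status_code}")
```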

What is the reasoning behind this approach?

Google aims to save its crawl budget and not pollute its index with redundant content. If the engine detects that 80% of your pages share the same main content, it will naturally reduce the crawl frequency and limit the number of indexed pages.

This reasoning is based on the assumption that duplication is rarely intentional and signals an architecture problem. By ignoring boilerplate, Google tries to focus on the real added value of each page — but this approach assumes that your HTTP codes are correct and that your templates are well thought out.

  • Google compares the main content, not the full HTML or visual layout.
  • Boilerplate is ignored: menus, footers, repeated elements do not count in duplicate detection.
  • Soft 404 errors create false duplicates if they share the same generic message across multiple URLs.
  • Returning correct HTTP codes (404, 410, 301) is essential to prevent Google from confusing errors with legitimate content.
  • The architecture of your templates directly influences Google's ability to distinguish your pages.

SEO Expert opinion

Is this statement consistent with real-world observations?

Yes, but with important nuances. Tests indeed show that Google ignores repeated blocks when comparing content. However, the boundary between 'boilerplate' and 'main content' is not always clear to the algorithm.

We regularly observe cases where information-rich blocks (recurring FAQs, standard comparison tables, partially templated product descriptions) are treated as boilerplate because they repeat across multiple pages. [To be verified]: Google does not publish any precise threshold for what tips 'useful content' into 'ignored boilerplate'. The exact criteria remain completely opaque.

What concrete risks are there if your HTTP codes are misconfigured?

An e-commerce site with 500 permanently out-of-stock product pages, all returning a 200 code with the message 'This product is no longer available', ends up with 500 nearly identical pages in Google's eyes. The engine will then group these URLs, index only a handful, and drastically reduce crawling of the other sections.

Worse still: if these soft 404 pages receive backlinks or are included in your XML sitemap, Google will allocate budget to crawl them regularly — even though they shouldn't exist. You waste your resources on dead URLs, and your true strategic pages may be explored less frequently.

In what cases does this rule become a trap?

Sites with significant pagination or dynamically generated content are particularly exposed. Imagine a listings site: each results page shows 10 listings + 300 words of boilerplate (filters, generic SEO text). If a search returns no results, the page displays 'No listings found' — and Google sees almost no main content.

The result: hundreds of empty results pages are treated as duplicates of each other, even if the URL parameters differ. If you do not block these pages (robots.txt, noindex, 404 code), your site becomes a crawl budget sinkhole.

Warning: this deduplication logic can also affect legitimate pages with little unique content if your template is too verbose.

Practical impact and recommendations

What immediate steps should be taken to avoid false duplicates?

First, audit your HTTP codes. Any page that no longer has a reason to exist should return a 404 or 410, not a 200. Use a crawler like Screaming Frog or Oncrawl to identify pages with little unique content and check their response codes.
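
As a starting point, a sketch that cross-references status codes and word counts from a crawl export (the CSV column names are assumptions; adapt them to your crawler's output):

```python
# Sketch: flag thin pages that still return 200 in a crawl export.
# The column names ("url", "status_code", "word_count") are assumptions.
import csv

THIN = 120  # arbitrary word-count threshold for "almost no main content"

with open("crawl_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if int(row["status_code"]) == 200 and int(row["word_count"]) < THIN:
            # Live page with almost no unique text: it may deserve a
            # 404/410, a 301, or real content instead of a 200.
            print(f"{row['url']}: 200 with {row['word_count']} words")
```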

Next, make sure the ratio of unique content to boilerplate is favorable on each page type. If your template carries 500 words of repeated text and your product pages add only 100 unique words, Google may group them. Shift the balance: trim the boilerplate or enrich the page-specific content.
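
A back-of-the-envelope check of that ratio (the 20% tripwire is an assumption, not a documented Google threshold):

```python
# Rough unique-content ratio check. The 20% tripwire is an assumption,
# not a documented Google limit: use it as an alert, not a rule.
TEMPLATE_WORDS = 500  # words repeated by the template on every page

def unique_ratio(total_words: int) -> float:
    return max(total_words - TEMPLATE_WORDS, 0) / total_words

for total in (600, 750, 1100):
    ratio = unique_ratio(total)
    print(f"{total} words total -> {ratio:.0%} unique"
          + (" (risky)" if ratio < 0.20 else ""))
# 600 -> 17% unique (risky): only ~100 words distinguish the page.
```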

What mistakes should absolutely be avoided?

Never leave empty or nearly empty pages accessible with a 200 code. Typical cases: category pages without products, author pages without articles, orphan tag pages. Either remove them (404/410), redirect them (301), or block them upfront (noindex, robots.txt).
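
A minimal server-side sketch with Flask (the route, slugs, and catalog lookup are hypothetical; the point is the status codes, not the framework):

```python
# Minimal Flask sketch; routes and the catalog lookup are hypothetical.
from flask import Flask, abort, redirect

app = Flask(__name__)

CATALOG = {"blue-tshirt"}                           # live products
GONE = {"old-widget"}                               # removed for good
MOVED = {"legacy-tshirt": "/product/blue-tshirt"}   # old slug -> new URL

@app.route("/product/<slug>")
def product(slug):
    if slug in GONE:
        abort(410)                               # permanently deleted
    if slug in MOVED:
        return redirect(MOVED[slug], code=301)   # permanent redirect
    if slug not in CATALOG:
        abort(404)          # unknown page: no 200 + "unavailable" message
    return f"Product page for {slug}"
```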

Also avoid multiplying identical generic error messages. If you need to display a 'product unavailable' page, customize the content with alternative suggestions or similar products, or redirect directly to a relevant category. The worst scenario: two hundred pages all displaying 'Sorry, this page does not exist' with zero variation.

How to check if your site complies?

Crawl your site and filter pages by content similarity rate. Most SEO tools (Oncrawl, Sitebulb, Botify) offer this feature. Identify clusters of pages with nearly identical content and verify their legitimacy.
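
As a rough local approximation of that similarity filter, a greedy grouping sketch (the 0.9 threshold and the sample pages are illustrative, not any tool's actual metric):

```python
# Greedy near-duplicate grouping on extracted main content. The 0.9
# threshold is an assumption; SEO crawlers use their own metrics.
from difflib import SequenceMatcher

pages = {
    "/search?q=red+sofa":  "No listings found for your search.",
    "/search?q=blue+sofa": "No listings found for your search.",
    "/search?q=sofa":      "12 listings: velvet sofa, corner sofa, ...",
}

clusters = []
for url, text in pages.items():
    for cluster in clusters:
        # Compare against the cluster's first page as its representative.
        if SequenceMatcher(None, pages[cluster[0]], text).ratio() > 0.9:
            cluster.append(url)
            break
    else:
        clusters.append([url])

for c in clusters:
    if len(c) > 1:
        print("Near-duplicate cluster:", c)  # likely only one gets indexed
```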

Then cross-reference with Google Search Console reports: 'Coverage' section, 'Excluded' tab. If you see 'Discovered - currently not indexed' or 'Crawled - currently not indexed' on strategic pages, it often means Google is treating them as duplicates or low-value pages.

  • Configure appropriate HTTP codes: 404 for deleted pages, 410 for permanent deletions, 301 for redirects.
  • Reduce boilerplate in your templates or enhance the unique content for each page.
  • Block or delete empty pages, search results with no results, or pagination without content.
  • Customize error messages to avoid creating clusters of identical pages.
  • Crawl your site regularly to detect low unique content pages.
  • Monitor Search Console reports for mistakenly excluded pages.

This approach from Google relies on clean architecture and consistent HTTP signals. If your site has thousands of pages, generates dynamic content, or uses complex templates, staying compliant can quickly become technical. In that case, bringing in a specialized SEO agency gives you a thorough audit and tailored support to sustainably optimize your crawl budget and indexing.

❓ Frequently Asked Questions

Does Google compare visible content or HTML code to detect duplicates?
Google extracts the visible main content while ignoring boilerplate (menus, footers, sidebars). It compares neither the raw HTML nor the layout, only each page's unique text.
What is a soft 404 error and why is it a problem?
It is a page with no useful content left that returns a 200 code instead of a 404. Google crawls it, compares it with other similar pages, and may group them as duplicates, wasting crawl budget.
Should I delete permanently discontinued product pages or leave them live with a 200 and a message?
Remove them with a 404 or 410 code, or redirect them (301) to a relevant category. Leaving hundreds of pages displaying 'product unavailable' with a 200 code creates artificial duplicates.
Is duplicated content in FAQs or product descriptions a problem?
Yes, if that content repeats across several pages without variation. Google may classify it as boilerplate and compare only the remaining text. If that text is too thin, your pages risk being grouped together.
How can I tell whether Google treats my pages as duplicates?
Check Search Console, Coverage section, Excluded tab. The statuses 'Discovered - currently not indexed' and 'Crawled - currently not indexed' often signal pages considered duplicates or low value.

