How does Google determine the canonical URL among a cluster of similar pages?

Official statement

To choose the representative URL in a cluster, Google employs a machine learning system that takes into account various signals such as site security, secure dependencies, and the configurability of the page to avoid directing users to poor experiences.

Extracted from a Google Search Central video

⏱ 8:02 💬 EN 📅 31/03/2020 ✂ 12 statements

Official statement from March 31, 2020

⚠ A more recent statement exists on this topic: "How does Google actually choose the canonical page in a duplicate cluster?" (Gary Illyes, April 4, 2024)

TL;DR

Google employs a machine learning system to select the representative URL among duplicate pages, relying on signals such as site security, secure dependencies, and user experience quality. In practice, even if you specify a canonical URL, Google may choose another one if it deems it more appropriate. This mechanism explains why your canonical tags are sometimes overridden.

What you need to understand

What is a URL cluster and why does Google need to choose just one?

When multiple pages on your site (or other sites) have nearly identical content, Google groups them into a cluster. This occurs with HTTP/HTTPS variants, URLs with or without www, tracking parameters, separate mobile versions, or poorly configured paginated pages.
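To make the idea of a cluster concrete, here is a minimal sketch that groups the kinds of variants listed above (HTTP/HTTPS, www/non-www, tracking parameters) under a shared key. It is only an illustration of URL normalization, not a description of how Google clusters pages; the `TRACKING_PARAMS` list and the function names are assumptions made for this example.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: an illustrative list of tracking parameters, not an official one.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}


def cluster_key(url):
    """Reduce a URL to a key that ignores scheme, 'www.', and tracking parameters."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep only non-tracking parameters, sorted so parameter order does not matter.
    params = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    path = parts.path or "/"
    query = urlencode(params)
    return f"{host}{path}" + (f"?{query}" if query else "")


def group_variants(urls):
    """Group URLs that share the same cluster key."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[cluster_key(url)].append(url)
    return dict(clusters)
```

Running `group_variants` on a crawl export gives a quick view of which URLs on your own site are likely to compete for the same canonical slot.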

The engine will not index all these variants. It selects a representative URL (canonical) that will be displayed in search results. The other URLs in the cluster are grouped under this main URL — consolidating ranking signals and preventing dilution.

What signals does the algorithm consider for this choice?

Google mentions a machine learning system that analyzes various signals. Site security is mentioned first: HTTPS is favored over HTTP. Secure dependencies (likely external resources loaded over HTTPS) also play a role.

The "configurability of the page" is less clear — it likely refers to the stability of the URL, the absence of chain redirects, the cleanliness of parameters, and the consistency of canonical tags. The ultimate goal remains clear: to avoid sending users to a poor experience (broken, slow, or insecure pages).

Does this logic also apply to duplicated pages across different sites?

Yes, without a doubt. When content is syndicated or copied across multiple domains, Google forms an inter-domain cluster. It then chooses the original source or the most authoritative version based on signals like content age, domain popularity, and backlinks pointing to each version.

This is why a site that scrapes your content won't necessarily steal your rankings, unless its authority far surpasses yours or your own site sends failing technical signals.

Google groups similar URLs into clusters and selects a representative URL
The algorithm favors HTTPS, secure dependencies, and pages that provide a good experience
Your canonical tags are recommendations, not absolute directives
Clusters can form between different domains (syndication, scraping)
Technical stability and security are decisive criteria in this choice

SEO Expert opinion

Does this statement align with what we observe in the field?

Overall, yes. It has been known for years that Google does not always respect the canonicals we indicate. There are numerous scenarios: an AMP version selected even though its canonical points to the desktop page, a URL with parameters chosen instead of the clean version, or even an HTTP page indexed despite the redirect to HTTPS.

Machine learning explains this autonomy: Google makes its own calculations and sometimes determines that your choice is not optimal. The issue is that this logic remains a black box. We do not know precisely how much weight each signal carries, nor how the algorithm arbitrates between an explicit canonical tag and its own preferences.

What grey areas persist in this explanation?

The notion of "configurability of the page" remains terribly vague. Does it include the presence of coherent hreflang tags? The structure of URLs (with or without a trailing slash)? Loading speed? The presence of poorly managed dynamic content? Google provides no usable details.

Similarly, no hierarchy among the signals is specified. If an HTTPS page is slow and poorly configured, and an HTTP version is fast and clean, which one prevails? We assume that security takes precedence, but without certainty. This opacity complicates audits when Google ignores your canonical directives.

In what cases does this logic pose problems for SEOs?

The first problematic case concerns multilingual or multi-regional sites. If Google arbitrarily decides that a .com version is more relevant than a .fr version for a French user, you lose control over user experience. Hreflang tags are meant to manage this, but if the clustering algorithm ignores them, you are stuck.

The second case: migrating from HTTP to HTTPS. Even with perfect 301 redirects and canonicals pointing to HTTPS, some sites see Google continue to index HTTP URLs for weeks. Machine learning can be slow to re-evaluate a long-established cluster.

Warning: If you find that Google systematically ignores your canonicals, do not multiply conflicting directives (canonical + redirect + different sitemap). This muddles the signals and slows down the reevaluation by the algorithm.

Practical impact and recommendations

How can you ensure Google chooses the right canonical URL?

The first rule is absolute consistency among all your signals. Your canonical tag, 301 redirects, XML sitemap, and internal links must all point to the same URL version. If your canonical states HTTPS but your internal links point to HTTP, Google receives conflicting signals.
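The consistency rule above can be checked mechanically once the signals for a page have been collected. The following is a minimal sketch under stated assumptions: the `PageSignals` structure and its field names are hypothetical, and gathering the values (from a crawler, the sitemap, and redirect tests) is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PageSignals:
    """Signals collected for one page (all field names are hypothetical)."""
    canonical_tag: str                       # href of the rel="canonical" link
    redirect_target: Optional[str] = None    # final URL if the page redirects
    sitemap_url: Optional[str] = None        # URL listed in the XML sitemap
    internal_links: List[str] = field(default_factory=list)


def find_conflicts(signals):
    """List every signal that disagrees with the declared canonical URL."""
    expected = signals.canonical_tag
    conflicts = []
    if signals.redirect_target and signals.redirect_target != expected:
        conflicts.append(f"redirect points to {signals.redirect_target}")
    if signals.sitemap_url and signals.sitemap_url != expected:
        conflicts.append(f"sitemap lists {signals.sitemap_url}")
    # Deduplicate and sort so the report is stable across runs.
    for link in sorted(set(signals.internal_links)):
        if link != expected:
            conflicts.append(f"internal link points to {link}")
    return conflicts
```

An empty list means the four signals agree; any entry is a candidate explanation for a canonical that Google does not follow.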

Next, secure your entire site. Switch to HTTPS everywhere, including external resources (images, scripts, CSS). An HTTPS page that loads HTTP dependencies (mixed content) sends a mixed security signal, which may count against it when Google picks the canonical.
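A quick way to spot insecure dependencies is to scan a page's HTML for resources referenced over plain HTTP. This is a rough sketch using only the Python standard library; the set of tags it inspects is an assumption chosen for illustration, and it does not cover resources loaded from CSS or injected by JavaScript.

```python
from html.parser import HTMLParser


class MixedContentFinder(HTMLParser):
    """Collect resource URLs that are loaded over plain HTTP."""

    # Tag -> attribute that triggers a resource fetch (illustrative subset).
    RESOURCE_ATTRS = {"img": "src", "script": "src", "iframe": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        attr_map = dict(attrs)
        # Only stylesheet links are fetched as dependencies; skip canonical, alternate, etc.
        if tag == "link" and "stylesheet" not in (attr_map.get("rel") or "").lower():
            return
        value = attr_map.get(wanted) or ""
        if value.lower().startswith("http://"):
            self.insecure.append(value)


def find_insecure_resources(html):
    """Return the HTTP resource URLs found in an HTML document, in source order."""
    finder = MixedContentFinder()
    finder.feed(html)
    return finder.insecure
```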

What mistakes should you absolutely avoid?

Never allow multiple accessible versions of the same page to coexist. If you have migrated to HTTPS, all HTTP URLs must 301 redirect to HTTPS. No duplicated content should be accessible via both protocols.

Avoid chains of redirects. If A redirects to B which redirects to C, Google might choose B as canonical instead of C. Redirect A directly to C. Likewise, do not point a canonical tag at a URL that itself redirects: this sends an inconsistent signal.
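The A → B → C case can be detected and flattened from a redirect map. The sketch below assumes you already have such a map (source URL to target URL) exported from your server configuration; the function names and the `max_hops` limit are choices made for this example.

```python
def resolve_final_target(url, redirects, max_hops=10):
    """Follow a redirect map until reaching a URL that does not redirect.

    Returns the final URL and the number of hops taken.
    """
    seen = [url]
    while url in redirects:
        url = redirects[url]
        if url in seen:
            raise ValueError(f"redirect loop: {' -> '.join(seen + [url])}")
        seen.append(url)
        if len(seen) - 1 > max_hops:
            raise ValueError("too many redirects")
    return url, len(seen) - 1


def flatten_redirects(redirects):
    """Return a map where every source redirects straight to its final target."""
    return {src: resolve_final_target(src, redirects)[0] for src in redirects}
```

Any source whose hop count is greater than 1 is part of a chain; the flattened map is what the redirect rules should look like after cleanup.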

How to check if your canonical URLs are being respected?

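Use Google Search Console to compare the URLs Google has indexed with those you have declared as canonical. For a given URL, the "URL Inspection" tool shows the user-declared canonical alongside the Google-selected canonical, which makes mismatches easy to spot, although it does not explain why Google made that choice.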

Also, monitor your server logs. If Googlebot continues to crawl URLs that you thought were consolidated, then the clustering is not functioning as intended. This may reveal orphaned internal links or poorly cleaned sitemaps.
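As a starting point for this log analysis, the sketch below counts Googlebot requests for paths outside your canonical set. It assumes access logs in the common "combined" format and identifies Googlebot by user-agent string only, which can be spoofed; the function name and the regular expression are illustrative and may need adjusting to your server's log layout.

```python
import re
from collections import Counter

# Assumption: logs use the "combined" format; adjust the pattern otherwise.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(log_lines, canonical_paths):
    """Count Googlebot requests for paths that are not in the canonical set."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        # Skip unparseable lines and requests from other user agents.
        if not match or "Googlebot" not in match.group("agent"):
            continue
        path = match.group("path")
        if path not in canonical_paths:
            hits[path] += 1
    return hits
```

Paths that keep appearing in the result are worth tracing back to internal links or sitemap entries that still reference them.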

Ensure that HTTPS is enabled throughout your site and its external resources
Verify that canonical, 301 redirects, sitemap, and internal links point to the same URL version
Eliminate all chains of redirects and temporary redirects (302)
Check in Search Console which URL Google has selected as representative
Analyze logs to spot outdated URLs still being crawled by Googlebot
Clean your XML sitemap of all non-canonical URLs
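The last checklist item, cleaning the sitemap, is straightforward to automate. This minimal sketch parses a sitemap that follows the sitemaps.org 0.9 schema and reports the entries that are not in your list of canonical URLs; the function name and the way the canonical set is supplied are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def non_canonical_sitemap_urls(sitemap_xml, canonical_urls):
    """Return sitemap <loc> entries that are not in the canonical URL set."""
    root = ET.fromstring(sitemap_xml)
    listed = [
        loc.text.strip()
        for loc in root.iter(f"{{{SITEMAP_NS}}}loc")
        if loc.text
    ]
    return [url for url in listed if url not in canonical_urls]
```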

Google's selection of the canonical URL relies on a complex machine learning mechanism in which the consistency of your technical signals plays a crucial role. HTTPS, URL cleanliness, absence of chain redirects, and alignment between canonical, sitemap, and internal links are your levers for action. If Google continues to ignore your directives despite this, a thorough technical audit is necessary, and this analysis often requires an expert eye to detect subtle inconsistencies. If managing these optimizations seems complex or time-consuming, working with a specialized SEO agency can help you secure these technical decisions and regain control over your indexed URLs.

❓ Frequently Asked Questions

Does Google always respect the canonical tag I specify?

No. Google treats the canonical tag as a recommendation, not an absolute directive. If its algorithm judges that another URL in the cluster offers a better experience (security, performance, consistency), it will choose that URL as the representative one even if it contradicts your tag.

Why does Google still index my HTTP URLs despite my HTTPS redirects?

Several reasons are possible: internal links still point to HTTP, your sitemap contains HTTP URLs, or external sites continue to link to the old version. The clustering algorithm can also be slow to re-evaluate a long-established cluster.

What does Google mean by "configurability of the page"?

Google remains vague on this term. It presumably includes URL stability, the absence of unnecessary parameters, the consistency of technical tags (canonical, hreflang), and perhaps loading performance. It is a composite signal whose details are not documented.

If a site copies my content, can it steal my Google rankings?

Rarely, unless its domain authority far exceeds yours. Google forms a cluster between the two versions and generally chooses the original source or the most authoritative one. Your technical signals (HTTPS, speed, consistent internal links) improve your chances of being selected.

How can I find out which URL Google has chosen as canonical for my page?

Use the "URL Inspection" tool in Google Search Console. It shows the URL that Google considers representative and displays in search results, even if it differs from the one you specified in your canonical tag.
