Official statement
Google cannot read a rel=canonical tag if the page containing it is blocked via robots.txt. The direct consequence: ranking signals (backlinks, authority, traffic) from duplicate pages are not merged into the canonical version. This configuration error dilutes PageRank and prevents signals from being consolidated across variations of the same page.
What you need to understand
Why can't Google see a canonical if the page is blocked?
The principle is mechanical and non-negotiable. For a search engine to process an HTML directive like rel=canonical, it must first access the source code of the page. The robots.txt file is consulted before any crawling attempt: if a URL is blocked via Disallow, Googlebot does not send any HTTP requests to that resource.
No request means no reading of the HTML. The canonical tag remains invisible, as if it did not exist. Google treats the duplicate pages as separate entities, without consolidating their ranking signals.
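The ordering described above (robots.txt first, HTML second) can be illustrated with a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `is_crawlable` helper are hypothetical. Note also that this module applies simple prefix matching and does not implement Google's wildcard extensions (`*`, `$`), so it only approximates Googlebot's behavior.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every URL under /variants/ is forbidden to all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /variants/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt lets user_agent request the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = "https://example.com/variants/shirt-red"
allowed = "https://example.com/products/shirt"

for url in (blocked, allowed):
    # A crawler that honors robots.txt never requests a blocked URL,
    # so any rel=canonical inside its HTML is never read.
    print(url, is_crawlable(ROBOTS_TXT, url))
```

Under these assumptions the first URL is reported as not crawlable and the second as crawlable: the canonical on the variant page is never seen, whatever it points to.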
What really happens when a canonical is blocked?
Variant pages (UTM parameters, paginated versions, product variations) can still be indexed independently if Google discovers them through other paths such as external links, even though their content is never crawled. They compete against each other in the SERPs instead of pooling their link equity.
Backlinks pointing to these variants do not pass their equity to the master page. The PageRank fragments instead of concentrating. It's a silent hemorrhage of authority, particularly critical on e-commerce sites with thousands of references or editorial platforms with paginated archives.
In which scenarios does this error frequently occur?
First classic case: URLs with tracking parameters (utm_source, fbclid, gclid) that some block reflexively in robots.txt to avoid polluting the logs. However, these URLs often carry canonicals to the clean version.
Second situation: test pages or staging environments publicly accessible, blocked in robots.txt but containing canonicals pointing to production. If Google discovers them via an external link, it will merge nothing. Third trap: separate mobile URLs (m.site.com) mistakenly blocked while they canonicalize to the desktop version.
- Robots.txt blocks crawling before HTML reading: no on-page directive is visible if the page is forbidden
- Ignored canonicals cause PageRank dilution among duplicate variants
- Backlinks to blocked pages do not pass their equity to the canonical version
- Common error on parameter URLs, test environments, and separate mobile versions
- Google indexes variants as independent pages if discovered through other paths
SEO Expert opinion
Is this statement consistent with real-world observations?
Absolutely. I have audited dozens of sites where this robots.txt/canonical conflict created massive duplication clusters. A recent case: an online-only fashion retailer with 12,000 product pages, each available under 4 URLs (colors as parameters). The webmaster had blocked all "?color=" parameters in robots.txt to "clean up the crawl".
Result: 48,000 indexed pages instead of 12,000. The canonicals were invisible to Google. The site cannibalized its own ranking, with sometimes 3 variants of the same product competing for the same query. After correction (removing the robots.txt block + maintaining the canonicals), consolidation was observed within 6 weeks.
What nuance should be made regarding the merging of signals?
Mueller talks about "merging signals," but let's be precise: Google does not apply a strict mathematical merge. It selects a canonical URL and transfers most of the juice to it, but with losses. [To be verified]: Google has never communicated an exact percentage, but tests show that a well-respected canonical transmits about 85-95% of the equity, not 100%.
Another point: even without a blocking robots.txt, Google can choose not to respect your canonical if it deems it incorrect (too much different content between source and target, canonical pointing to a 404 page, etc.). Mueller's statement assumes that the tag is technically valid. If it is not, the robots.txt blocking becomes anecdotal in the face of a deeper structural issue.
In which cases does this rule not apply or become secondary?
If your strategy is precisely to keep pages out of the index entirely (obsolete variants, unwanted duplicate content), blocking in robots.txt and dropping the canonical can be justified. But beware: this is a minority approach, and a Disallow only prevents crawling; the bare URL can still be indexed if it is linked from elsewhere. The orthodox method remains a noindex directive on a page that robots.txt leaves crawlable, not the opposite.
Edge case: e-commerce filter facets generating thousands of combinations. Some prefer to block heavily in robots.txt rather than canonicalize each variant. It's a crawl budget management choice, but you must then accept losing all equity from backlinks to those facets. It's a conscious decision, not a configuration error.
Practical impact and recommendations
What should you prioritize auditing on your site?
First check: cross your robots.txt file with your canonical tags. Export all URLs containing rel=canonical from a Screaming Frog or OnCrawl crawl. At the same time, list all Disallow directives from your robots.txt. Identify overlaps: any URL that is blocked AND carries a canonical is a conflict to resolve.
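The cross-check described above could be sketched as follows. This is an illustration only: it assumes the crawl export has already been loaded into a dictionary mapping each crawled URL to its canonical target (from a crawl configured to ignore robots.txt, otherwise blocked pages would never appear in it). The `find_canonical_conflicts` function and the sample data are hypothetical, and `urllib.robotparser` only does prefix matching, without Google's wildcard syntax.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and crawl export (crawled URL -> canonical target).
ROBOTS_TXT = """\
User-agent: *
Disallow: /colors/
"""

CRAWL_EXPORT = {
    "https://example.com/colors/shirt-red": "https://example.com/shirt",
    "https://example.com/colors/shirt-blue": "https://example.com/shirt",
    "https://example.com/shirt": "https://example.com/shirt",
}


def find_canonical_conflicts(robots_txt, canonicals, user_agent="Googlebot"):
    """List (url, canonical) pairs where the URL carries a canonical but is blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (url, canonical)
        for url, canonical in canonicals.items()
        if not parser.can_fetch(user_agent, url)
    ]


for url, canonical in find_canonical_conflicts(ROBOTS_TXT, CRAWL_EXPORT):
    print(f"CONFLICT: {url} is blocked but canonicalizes to {canonical}")
```

With this sample data, the two `/colors/` URLs are flagged and the clean product URL is not. Each flagged pair is a candidate for either removing the Disallow rule or dropping the canonical.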
Second audit: duplicate URLs reported in Search Console. Go to Coverage > Excluded and look for the status "Duplicate, Google chose different canonical than user". If you see this status, systematically check the robots.txt. In 40% of the cases I have dealt with, this was the root cause.
What correction should be applied depending on the context?
If the page needs to be crawled and consolidated: remove the corresponding Disallow line from robots.txt and keep only the canonical tag. Use the URL Inspection tool in GSC to confirm that Googlebot can access the page and detects the tag. Then request indexing.
If the page must be completely invisible: switch to a noindex directive while leaving the URL crawlable in robots.txt, and remove the canonical (it no longer makes sense on a noindex page). A more radical alternative: delete the page outright and 301 redirect it to the master version. The choice depends on your volume and your ability to manage large-scale redirects.
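Before opening the URL Inspection tool, a quick local check can confirm what a page's HTML actually declares. The sketch below uses Python's standard `html.parser` to extract the rel=canonical target and detect a meta robots noindex. The `HeadDirectives` class, the `read_directives` helper, and the sample HTML are hypothetical, and the check ignores directives sent through the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class HeadDirectives(HTMLParser):
    """Collect the rel=canonical target and any meta robots noindex."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")
        elif tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = (attributes.get("content") or "").lower()
            if "noindex" in content:
                self.noindex = True


def read_directives(html: str):
    """Return (canonical_url_or_None, has_noindex) for an HTML document."""
    parser = HeadDirectives()
    parser.feed(html)
    return parser.canonical, parser.noindex


# Hypothetical page that mixes both signals, which the text above advises against.
SAMPLE_HTML = """
<html><head>
<link rel="canonical" href="https://example.com/shirt">
<meta name="robots" content="noindex, follow">
</head><body></body></html>
"""

canonical, noindex = read_directives(SAMPLE_HTML)
if canonical and noindex:
    print(f"Mixed signals: noindex page also canonicalizes to {canonical}")
```

A page that returns both a canonical and a noindex is the combination to clean up when applying the second correction.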
How can this error be prevented during a migration or development?
Integrate an automated test into your CI/CD pipeline: a script that parses robots.txt and your XML sitemap, then checks that no URL in the sitemap is blocked. For large sites, add a rule comparing crawled canonicals with active Disallow directives.
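One possible shape for such a pipeline check, using only the Python standard library, is sketched below. The `blocked_sitemap_urls` function and the inline robots.txt and sitemap are hypothetical; in a real pipeline both files would be read from the build output or fetched from the deployed site. As elsewhere, `urllib.robotparser` does prefix matching only and does not reproduce Google's wildcard handling.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical inputs for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/shirt</loc></url>
  <url><loc>https://example.com/staging/shirt</loc></url>
</urlset>
"""


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt forbids for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


blocked = blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML)
for url in blocked:
    print(f"Sitemap URL blocked by robots.txt: {url}")
```

In a deployment workflow, the script would typically exit with a non-zero status when `blocked` is not empty, so that the regression fails the build instead of reaching production.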
During migration, explicitly document each line of your robots.txt with a comment indicating why it exists. I've seen far too many legacy Disallow lines whose origins no one knew, maintained out of superstition. Audit your robots.txt every 6 months: it's a critical file that evolves little, so every line counts.
- Crawl the site and export all URLs with rel=canonical
- Cross this list with the Disallow directives in robots.txt to identify conflicts
- Check in Search Console for duplicate pages where Google did not follow your canonical
- Remove unnecessary robots.txt blocks that prevent reading of canonicals
- Replace Disallow in robots.txt with noindex on a crawlable URL for pages to be deindexed
- Automate a non-regression test of robots.txt vs sitemap in your deployment workflow
❓ Frequently Asked Questions
Can you use both a robots.txt Disallow and a canonical tag on the same page?
What happens if a page blocked in robots.txt receives quality backlinks?
How does Google discover a page blocked in robots.txt if it contains a canonical?
Should you prefer robots.txt or noindex to prevent a page from being indexed?
Is a canonical in the XML sitemap read even if the page is blocked in robots.txt?
🎥 From the same video 12
Other SEO insights extracted from this same Google Search Central video · duration 57 min · published on 11/08/2015