Should you block a page with a canonical tag in robots.txt?

Official statement

For Google to respect rel=canonical tags, it must be able to crawl the page with that tag. If a page is blocked by robots.txt, Google won't see the tag and cannot merge the signals from the pages.


🎥 Source video

Extracted from a Google Search Central video

⏱ 57:02 💬 EN 📅 11/08/2015 ✂ 13 statements


Other statements from this video (12)

4:12 Does Google really index JavaScript the same way as classic HTML?
5:43 Should you integrate an RSS feed to speed up the indexing of your content?
14:14 Should you 301-redirect your doorway pages or deindex them with noindex?
17:54 Do URL parameters in Search Console really work the way we think they do?
22:01 Are translations really exempt from duplicate content penalties?
24:19 Merging two sites: does Google really penalize inherited thin content?
32:05 Do links remain as decisive as content for Google ranking?
35:44 Why does Google still show the old domain several months after a migration?
40:00 Do 5xx errors kill your ranking or just your crawl budget?
44:23 Is it really worth investing in an extended validation SSL certificate for SEO?
46:41 Are sitemaps really essential for crawling your site?
52:20 How does Google really test its algorithms on your rankings?

What you need to understand

Why can't Google see a canonical if the page is blocked?

The principle is mechanical and non-negotiable. For a search engine to process an HTML directive like rel=canonical, it must first access the source code of the page. The robots.txt file is consulted before any crawling attempt: if a URL is blocked via Disallow, Googlebot does not send any HTTP requests to that resource.

No request means no HTML is read. The canonical tag stays invisible, as if it did not exist. If Google indexes the duplicate URLs at all, it treats them as separate entities and does not consolidate their ranking signals.
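The order of operations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard library's `urllib.robotparser` stands in for Googlebot's own robots.txt handling, and the function name `canonical_seen_by_crawler`, the class `CanonicalFinder`, and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_seen_by_crawler(robots_txt, url, html, user_agent="Googlebot"):
    """Return the canonical a polite crawler would read, or None if blocked.

    Hypothetical helper: `html` stands for what the server WOULD return.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    # robots.txt is checked first; a disallowed URL is never requested,
    # so the HTML (and the canonical inside it) is never parsed.
    if not rules.can_fetch(user_agent, url):
        return None
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Called with a robots.txt containing `Disallow: /staging/`, a URL under `/staging/` yields `None` even though the HTML carries a valid canonical, while the same HTML on an allowed URL yields the canonical target.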

What really happens when a canonical is blocked?

Variant pages (UTM parameters, paginated versions, product variations) can still be indexed independently if Google discovers them through other paths, even though their content is never crawled. They compete against each other in the SERPs instead of pooling their ranking signals.

Backlinks pointing to these variants do not pass their equity to the master page. PageRank fragments instead of concentrating. It's a silent hemorrhage of authority, particularly critical on e-commerce sites with thousands of products or editorial platforms with paginated archives.

In which scenarios does this error frequently occur?

First classic case: URLs with tracking parameters (utm_source, fbclid, gclid), which some teams reflexively block in robots.txt to keep their logs clean. Yet these URLs often carry canonicals pointing to the clean version.
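To make the tracking-parameter case concrete, here is a minimal sketch of the "clean version" that such a canonical usually points to. The parameter list is an assumption taken from the examples above (utm_*, fbclid, gclid), and `clean_url` is a hypothetical helper name; a real site may need a longer list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters, based on the examples in this article.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def clean_url(url):
    """Drop tracking parameters (and any fragment), keep everything else."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(TRACKING_PREFIXES)
        and name.lower() not in TRACKING_NAMES
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

The tracked URL has to remain crawlable for Google to read a canonical pointing at `clean_url(tracked_url)`; blocking the tracked URL in robots.txt hides that mapping.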

Second situation: publicly accessible test pages or staging environments, blocked in robots.txt but containing canonicals that point to production. If Google discovers them via an external link, it will merge nothing.

Third trap: separate mobile URLs (m.site.com) mistakenly blocked even though they canonicalize to the desktop version.

Robots.txt blocks crawling before HTML reading: no on-page directive is visible if the page is forbidden
Ignored canonicals cause PageRank dilution among duplicate variants
Backlinks to blocked pages do not pass their equity to the canonical version
Common error on parameter URLs, test environments, and separate mobile versions
Google indexes variants as independent pages if discovered through other paths

SEO Expert opinion

Is this statement consistent with real-world observations?

Absolutely. I have audited dozens of sites where this robots.txt/canonical conflict created massive duplication clusters. A recent case: an online-only fashion retailer with 12,000 product pages, each reachable at 4 URLs (colors as parameters). The webmaster had blocked all "?color=" parameters in robots.txt to "clean up the crawl".

Result: 48,000 indexed pages instead of 12,000. The canonicals were invisible to Google. The site cannibalized its own ranking, with sometimes 3 variants of the same product competing for the same query. After correction (removing the robots.txt block + maintaining the canonicals), consolidation was observed within 6 weeks.

What nuance should be made regarding the merging of signals?

Mueller talks about "merging signals," but let's be precise: Google does not apply a strict mathematical merge. It selects a canonical URL and transfers most of the juice to it, but with losses. [To be verified]: Google has never communicated an exact percentage, but tests show that a well-respected canonical transmits about 85-95% of the equity, not 100%.

Another point: even without a robots.txt block, Google can choose not to respect your canonical if it deems it incorrect (content that differs too much between source and target, canonical pointing to a 404 page, etc.). Mueller's statement assumes that the tag is technically valid. If it is not, the robots.txt block becomes anecdotal compared with a deeper structural issue.

In which cases does this rule not apply or become secondary?

If your strategy is precisely to keep pages out of the index entirely (obsolete variants, unwanted duplicate content), blocking in robots.txt AND omitting the canonical can be justified. But beware: this is a minority approach. The orthodox method remains a noindex tag on a page that robots.txt leaves crawlable, not the opposite.

Edge case: e-commerce filter facets generating thousands of combinations. Some prefer to block heavily in robots.txt rather than canonicalize each variant. It's a crawl budget management choice, but you must then accept losing all equity from backlinks to those facets. It's a conscious decision, not a configuration error.

Warning: Do not confuse robots.txt (crawl control) with noindex (indexing control). Blocking a page in robots.txt prevents Google from reading a potential noindex, which makes it useless. To properly deindex a page while allowing Google to follow its links, leave the page crawlable in robots.txt and add a noindex tag.

Practical impact and recommendations

What should you prioritize auditing on your site?

First check: cross-reference your robots.txt file with your canonical tags. Export all URLs containing rel=canonical from a Screaming Frog or OnCrawl crawl. At the same time, list all Disallow directives from your robots.txt. Identify overlaps: any URL that is blocked AND carries a canonical is a conflict to resolve.
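The cross-check above could be scripted roughly as follows. This is a simplified sketch, not a full robots.txt implementation: it only handles Disallow patterns with the `*` wildcard and a trailing `$`, and it ignores Allow rules and per-user-agent groups. The row keys `url` and `canonical` are hypothetical and would need to be mapped to the column names of your crawler's export.

```python
import re
from urllib.parse import urlsplit


def disallow_to_regex(pattern):
    """Translate a Disallow pattern ('*' wildcard, optional trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_blocked(url, disallow_patterns):
    """True if the URL's path (plus query string) matches any Disallow pattern."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # Empty patterns are skipped: an empty Disallow blocks nothing.
    return any(disallow_to_regex(p).match(target) for p in disallow_patterns if p)


def find_conflicts(crawl_rows, disallow_patterns):
    """Return crawl rows whose URL is blocked yet declares a canonical."""
    return [
        row
        for row in crawl_rows
        if row["canonical"] and is_blocked(row["url"], disallow_patterns)
    ]
```

With the pattern `/*?color=`, a row such as `{"url": "https://example.com/dress?color=red", "canonical": "https://example.com/dress"}` is reported as a conflict, while the unparameterized `/dress` URL is not.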

Second audit: duplicate URLs indexed in Search Console. Go to Coverage > Excluded > "Duplicate, Google chose different canonical than user". If you see this status, systematically check the robots.txt. In 40% of the cases I have dealt with, this was the root cause.

What correction should be applied depending on the context?

If the page needs to be crawled and consolidated: remove the corresponding Disallow line from robots.txt and keep only the canonical tag. Use the URL Inspection tool in GSC to confirm that Googlebot can access the page and detects the tag, then request indexing.

If the page must be completely invisible: switch to noindex while leaving the page crawlable in robots.txt, and remove the canonical (it no longer makes sense on a noindex page). A more radical alternative: delete the page outright and 301-redirect it to the master version. The choice depends on your volume and your ability to manage redirects at scale.

How can this error be prevented during a migration or development?

Integrate an automated test into your CI/CD pipeline: a script that parses robots.txt and your XML sitemap, then checks that no URL in the sitemap is blocked. For large sites, add a rule comparing crawled canonicals with active Disallow directives.
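A pipeline check along these lines might look like the sketch below. It relies on Python's standard `urllib.robotparser` and `xml.etree.ElementTree`; the function names are hypothetical, and the robots.txt and sitemap contents are assumed to have been read from your build artifacts beforehand. One caveat: as far as I know, the standard library parser matches rules by plain path prefix and does not interpret Google-style `*` wildcards, so wildcard rules need a dedicated matcher.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt forbids for the given user agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [
        url
        for url in sitemap_urls(sitemap_xml)
        if not rules.can_fetch(user_agent, url)
    ]
```

In a deployment workflow, the build would fail whenever `blocked_sitemap_urls(...)` returns a non-empty list, for instance with `assert not blocked, blocked` inside a test.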

During migration, explicitly document each line of your robots.txt with a comment indicating why it exists. I've seen far too many legacy Disallow lines whose origins no one knew, maintained out of superstition. Audit your robots.txt every 6 months: it's a critical file that evolves little, so every line counts.

Crawl the site and export all URLs with rel=canonical
Cross this list with the Disallow directives in robots.txt to identify conflicts
Check in Search Console for duplicate pages where Google did not follow your canonical
Remove unnecessary robots.txt blocks that prevent reading of canonicals
Replace Disallow in robots.txt with noindex + allow for pages to be deindexed
Automate a non-regression test of robots.txt vs sitemap in your deployment workflow

The rule is simple: if a page has a canonical, it must be crawlable. Any robots.txt prohibition mechanically cancels the directive. This configuration, seemingly basic, generates massive PageRank leaks on poorly audited sites. An accurate diagnosis requires cross-referencing multiple layers of data (crawl, robots.txt, Search Console, backlinks). For complex infrastructures with thousands of variations or multi-domain architectures, support from a specialized SEO agency can help automate these checks and avoid costly mistakes during migrations or redesigns.

❓ Frequently Asked Questions

Can you use both a robots.txt Disallow and a canonical tag on the same page?

Technically yes, but it is pointless and counterproductive. The robots.txt block prevents Google from reading the canonical tag, which becomes invisible. If you want to consolidate signals, leave the page crawlable with only the canonical.

What happens if a page blocked in robots.txt receives quality backlinks?

Those backlinks pass no equity to the canonical version, since Google does not see the canonical tag. The PageRank stays trapped on a URL Google cannot crawl: a dead loss of authority.

How does Google discover a page blocked in robots.txt if it contains a canonical?

Google can discover the URL via an external link, a sitemap, or a referrer. But even if it knows the URL, it does not crawl it because of robots.txt and therefore never reads the canonical. The page remains orphaned in the link graph.

Should you prefer robots.txt or noindex to prevent a page from being indexed?

Use noindex and leave the page crawlable in robots.txt. This lets Google crawl the page, read the noindex directive, and follow its outgoing links. Robots.txt blocks everything, including the reading of the noindex, which creates inconsistencies.

Is a canonical read for a URL listed in the XML sitemap even if the page is blocked in robots.txt?

No. The XML sitemap lists URLs to crawl, but if they are blocked in robots.txt, Google does not crawl them and therefore reads no canonical tag. The sitemap does not override robots.txt.
