
Official statement

If any URLs are blocked by robots.txt, Google does not analyze them and does not see any noindex or canonical tags on those pages. The content and links of these pages blocked by robots.txt remain invisible to Google.
🎥 Source: Google Search Central video · 📅 12/12/2017 · ⏱ 57:02 · 💬 EN · ✂ 14 statements
Watch on YouTube (statement at 13:45) →
Other statements from this video (13)
  1. 2:10 Are your location pages at risk of being penalized as doorway pages?
  2. 5:30 Do Search Console HTTPS alerts really influence your Google ranking?
  3. 6:58 Why does Google add your brand name to page titles?
  4. 11:37 Why does Google deindex pages after an HTTPS migration?
  5. 15:05 Should you really block faceted navigation in robots.txt?
  6. 16:57 Should you report competitors' spam to Google to gain positions?
  7. 19:44 Does noindex really remove the PageRank passed by your internal links?
  8. 25:19 Should you show Googlebot your anti-ad-blocker banners?
  9. 28:26 Should you really optimize your sitemaps to influence Google's crawl?
  10. 30:01 Do long meta descriptions really generate more clicks?
  11. 36:49 Can you really turn an editorial site into a transactional site without an SEO penalty?
  12. 44:22 Should you really hide content from Googlebot to optimize the geolocated experience?
  13. 53:55 Does Googlebot really index all JavaScript content without user interaction?
TL;DR

Google never crawls URLs blocked by robots.txt, preventing it from reading noindex, canonical tags, or exploring links on those pages. This statement confirms that robots.txt blocking is not a method of deindexing. To properly deindex a URL, it must be crawlable and use noindex. Blocking a page intended for deindexing with robots.txt is the classic mistake that prevents Google from processing the exclusion directive.

What you need to understand

Mueller clarifies a common misunderstanding here: many SEOs believe that blocking with robots.txt is enough to deindex a page. This is false.

The robots.txt file controls crawling, not indexing. When Google respects a Disallow directive, it does not visit the URL and therefore cannot read any directives present in the HTML.

What actually happens when a URL is blocked by robots.txt?

Google stops at robots.txt: when a Disallow rule matches, it never sends an HTTP request to that URL.

As a result: it cannot discover a noindex tag, a canonical tag, a 301 redirect, or the outgoing links of that page. The textual content remains invisible. On-page signals simply do not exist for the engine.
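Python's standard-library `urllib.robotparser` applies the same matching rules as a crawler, which makes this behavior easy to demonstrate. A minimal sketch (the rules and URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, supplied as lines
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# A blocked URL is never requested at all, so any noindex or
# canonical tag inside its HTML is never seen by the crawler.
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/public/page.html"))   # True
```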

Why does this confusion persist among practitioners?

Because historically, Google could index URLs blocked by robots.txt when they received backlinks. The URL appeared in the SERPs with a generic "No information available" snippet.

This behavior sowed doubt: some concluded that a robots.txt block prevents indexing, while others learned that a blocked URL can still be indexed. Both statements are partially true, which keeps the ambiguity alive.

What is the practical difference between crawling and indexing?

Crawling is the technical visiting of a URL. Indexing is the decision to store that URL in the index and show it in the results.

A URL can be crawled without being indexed (noindex respected). It can also be indexed without being crawled (if it receives links while Google is unable to fetch it, as with a robots.txt block). But if it is blocked in robots.txt, Google can never crawl it, so it never sees the internal directives that could refine indexing.

  • Robots.txt blocks crawling: Google does not visit the URL and does not see its HTML content
  • Noindex controls indexing: Google crawls the URL, reads the tag, and decides not to index it
  • A robots.txt block prevents reading any directives: noindex, canonical, hreflang, meta robots, server redirects
  • The internal links of the blocked page remain invisible: PageRank does not circulate, internal linking is broken
  • External backlinks remain visible: Google can index the blocked URL if it receives links, but without a snippet or correct title
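The two directives live in different places, which is why one cannot substitute for the other. A sketch of each mechanism (paths and values are hypothetical):

```
# robots.txt — controls crawling: Google never requests matching URLs
User-agent: *
Disallow: /old-section/

# In the page's <head> — controls indexing, readable only if the page is crawlable
<meta name="robots" content="noindex">

# Equivalent HTTP response header (also works for PDFs and other non-HTML files)
X-Robots-Tag: noindex
```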

SEO Expert opinion

Is this statement consistent with field observations?

Yes, perfectly. I have observed hundreds of cases where pages blocked in robots.txt remained indexed despite having a noindex tag present in the HTML.

The SEO blocks the URL, adds noindex, waits for deindexing... which never happens. Google displays the URL in the SERPs with an empty snippet. The noindex was never read because crawling was forbidden beforehand. This is a classic mistake in migration or staging management.

What nuances should be added to this rule?

Mueller says "Google does not analyze them". This is true for HTML content, but Google still analyzes the existence of the URL via backlinks and the XML sitemap.

If a URL blocked in robots.txt is present in the sitemap or receives external links, Google may choose to index it with a generic snippet. Blocking with robots.txt is therefore not a guarantee of non-indexation. It is protection against crawling, nothing more.

Another nuance: some third-party bots (non-Google) ignore robots.txt. The robots.txt file is a directive, not a firewall. A sensitive site should never rely solely on robots.txt to protect confidential content.

When does this rule become a critical issue?

Migrations, redesigns, handling duplicate content. An SEO who blocks old URLs in robots.txt thinking it will force deindexing creates a nightmare.

Old pages remain indexed indefinitely, the canonical links to the new URLs are never read, and PageRank remains blocked. The migration fails because Google cannot follow the directives given to it.

Another problematic case: sites with filtered facets blocked in robots.txt. If these URLs receive backlinks, they get indexed with an empty snippet. The site loses control over the presentation of its pages in the SERPs. [To be verified]: some SEOs report that Google may ignore a robots.txt block if the URL is deemed strategic, but Google has never officially confirmed this practice.

Attention: blocking a page intended for deindexing in robots.txt is the number one mistake in technical SEO. The right method is always: leave crawlable, add noindex, wait for deindexing, then optionally block in robots.txt if necessary.

Practical impact and recommendations

What should be done concretely to deindex a URL?

Remove the robots.txt block if present. Verify that Googlebot can access the URL without restriction.

Add a meta robots noindex tag in the <head> or return an HTTP header X-Robots-Tag: noindex. Submit the URL in Search Console to speed up recrawling. Wait for Google to visit the page, read the noindex, and remove the URL from the index. This process may take several days to several weeks depending on crawl frequency.
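To verify that a noindex directive is actually in place, check both possible locations: the meta tag and the response header. A minimal sketch in Python (the helper name and sample markup are illustrative; fetching the page and its headers is left to you):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.directives.append((a.get("content") or "").lower())

def has_noindex(html, headers=None):
    """True if the page carries noindex via header or meta tag."""
    # X-Robots-Tag applies even to non-HTML resources
    if headers and "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in d for d in parser.directives)

print(has_noindex('<head><meta name="robots" content="noindex"></head>'))  # True
print(has_noindex("<head></head>", {"X-Robots-Tag": "noindex"}))           # True
print(has_noindex("<head></head>"))                                        # False
```

Remember that this check is only meaningful on a URL Googlebot can actually crawl: a perfect noindex behind a robots.txt block is never read.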

What mistakes should absolutely be avoided?

Never block in robots.txt a URL that you want to deindex. It’s counterproductive: Google will never be able to read the noindex.

Never combine a robots.txt Disallow and a noindex on the same URL. The combination is contradictory: you are asking Google not to crawl the page while also giving it a directive it can never read. Choose one or the other, never both.

Avoid blocking entire sections in robots.txt without checking the backlinks. If these URLs receive external links, they may get indexed with an empty snippet and harm the user experience. Audit backlinks before any massive blocking.

How can I check if my site complies with this logic?

Extract all URLs blocked in robots.txt. Cross-reference with indexed URLs via site: or Search Console. Identify blocked URLs that are indexed: this is a sign of a problem.

Check if these URLs have a noindex tag. If yes, remove the robots.txt block so that Google can read it. If not, decide: either you want to index it (remove the block), or you want to deindex it (remove the block, add noindex). There is never a good reason to keep a URL blocked in robots.txt AND indexed.

  • Audit robots.txt: list all Disallow and User-agent rules
  • Extract indexed URLs: use Search Console or a Screaming Frog crawl with GSC data
  • Identify conflicts: URLs blocked in robots.txt but present in the index
  • Check backlinks: use Ahrefs, Majestic, or Search Console to detect links to blocked URLs
  • Correct inconsistencies: remove the robots.txt block, add noindex if necessary, or let it index properly
  • Monitor recrawling: track changes in Search Console to confirm that Google has read the new directives
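The core of the audit above — cross-referencing robots.txt rules against indexed URLs — can be sketched in a few lines of Python (the function name, rules, and URL list are hypothetical; in practice you would export indexed URLs from Search Console):

```python
from urllib.robotparser import RobotFileParser

def find_conflicts(robots_lines, indexed_urls, user_agent="Googlebot"):
    """Return indexed URLs that robots.txt blocks from crawling.

    Such URLs are the red flag described above: Google cannot read
    their noindex or canonical directives, so they may stay indexed
    indefinitely with an empty snippet.
    """
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return [u for u in indexed_urls if not rp.can_fetch(user_agent, u)]

# Hypothetical data: robots.txt rules and an export of indexed URLs
robots = ["User-agent: *", "Disallow: /filters/"]
indexed = [
    "https://example.com/products/shoes",
    "https://example.com/filters/size-42",  # blocked AND indexed: conflict
]
print(find_conflicts(robots, indexed))  # ['https://example.com/filters/size-42']
```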
The rule is simple: robots.txt controls crawling, noindex controls indexing. Never mix the two. To deindex, allow crawling and use noindex. To save crawl budget, only block in robots.txt URLs that you never want indexed AND that receive no backlinks.

These optimizations may seem obvious in theory, but implementing them on a site with thousands of pages often requires a thorough technical audit, specialized tools, and deep expertise to avoid critical errors. If your architecture is complex or you are managing a sensitive migration, personalized guidance from a specialized SEO agency can save you months of accidental deindexing or traffic loss.

❓ Frequently Asked Questions

Can you deindex a URL by blocking it in robots.txt?
No. Blocking a URL in robots.txt prevents Google from crawling it, so it cannot read any noindex tag the page may carry. The correct method is to leave the URL crawlable and add a noindex directive in the HTML or in the HTTP headers.
Why do my URLs blocked in robots.txt still appear in Google?
If a URL blocked in robots.txt receives external backlinks, Google may choose to index it with a generic snippet. A robots.txt block does not prevent indexing; it prevents crawling. To force deindexing, remove the block and add noindex.
What happens if I block a URL that has a canonical pointing to another page?
Google will never see the canonical because it does not crawl the blocked URL. The canonical will not be taken into account, and the blocked URL may remain indexed if it receives links, without Google being able to consolidate it with the canonical version.
Should I block noindex pages in robots.txt to save crawl budget?
This is an advanced practice that can be useful on very large sites, but only after Google has deindexed the pages. First let Google crawl the noindex pages, wait for deindexing, then optionally block them in robots.txt to avoid pointless recrawling.
Does Google follow the internal links of a page blocked in robots.txt?
No. If Google does not crawl the page, it does not discover the links in its HTML. Those links pass no PageRank, and the destination URLs are not discovered through that page. Internal linking is broken.
