Official statement
Google emphasizes a fundamental distinction: robots.txt blocks crawling but does not prevent indexing, while the noindex tag prohibits indexing of a crawled page. In practical terms, a URL blocked by robots.txt can appear in search results if it receives external backlinks. To exclude a page from the index, you must allow it to be crawled and place a noindex tag on it.
What you need to understand
What is the fundamental difference between robots.txt and noindex?
The robots.txt file acts as a barrier at the crawl level. When Googlebot reads a Disallow directive in this file, it refuses to access the concerned URL. However, this crawling prohibition does not mean the URL disappears from Google's radar.
If there are external backlinks pointing to this blocked URL, Google may still decide to index it based solely on the anchor texts and the context of the incoming links. The URL will appear in the results with the note "No information available for this page".
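The scope of the directive can be made concrete with a few lines of Python. A minimal sketch (standard library only; the domain and the Disallow rule are illustrative): a robots.txt parser only answers "may this URL be crawled?", never "is this URL indexed?".

```python
# Minimal sketch: robots.txt gates crawling, nothing else.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("Googlebot", "https://example.com/private/page"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/public/page"))   # True
# "False" only means the URL will not be fetched by a compliant crawler;
# if other sites link to it, Google can still index the bare URL.
```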
How does the noindex tag address this issue?
The meta robots noindex tag (or the HTTP header X-Robots-Tag: noindex) works differently. For Google to read it, it must first crawl the page. Once it has crawled the page, Google sees the noindex instruction and removes the page from its index.
The paradox is this: to prevent indexing, you must allow crawling. If you block a page with robots.txt AND add a noindex to it, Googlebot will never see the noindex. Therefore, the page may end up being indexed if it receives links anyway.
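This dependency can be expressed as a short check. A sketch (standard library only; the URL and the pattern are simplifications) mirroring what Googlebot must do to honor a noindex: fetch the page, then look for the directive in the HTTP header or in the meta robots tag — the very fetch that a robots.txt block forbids.

```python
# Sketch: detect a noindex the way a crawler would, header first, then HTML.
import re
import urllib.request

def has_noindex(url: str) -> bool:
    with urllib.request.urlopen(url) as resp:
        # X-Robots-Tag works for any content type, including PDFs and images.
        if "noindex" in (resp.headers.get("X-Robots-Tag") or "").lower():
            return True
        body = resp.read().decode("utf-8", errors="replace")
    # Simplified pattern: assumes the name attribute precedes content.
    return bool(re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex',
        body, re.IGNORECASE))

print(has_noindex("https://example.com/private-page"))  # hypothetical URL
```

The same function doubles as a spot check after deploying a noindex through a CMS or an SEO plugin.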
Why does this confusion persist among SEOs?
For years, many practitioners believed that robots.txt = deindexation. This mistake comes from a mental simplification: if Googlebot cannot access a page, then it cannot be in the index. False.
Google indexes billions of pages it has never crawled, solely based on external signals. Incoming links are enough to create a footprint in the index. The robots.txt file only delays or complicates information gathering; it does not erase this footprint.
- Robots.txt blocks crawling, not indexing — a URL can be indexed without ever being visited
- Noindex blocks indexing, but requires Googlebot to crawl the page to read the directive
- Combining both (robots.txt + noindex on the same URL) is counterproductive and can lead to unwanted indexing
- To deindex properly: allow crawling, add noindex, wait for deindexation, then optionally block with robots.txt
- The X-Robots-Tag HTTP header is an alternative to the meta robots tag, useful for non-HTML files such as PDFs and images (see the sketch below)
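Since a PDF or an image cannot carry a meta tag, the directive has to travel in the response header. A minimal sketch (Python standard library; the port and the extension rule are illustrative) of a server attaching X-Robots-Tag: noindex to PDF responses:

```python
# Sketch: serve files but mark PDFs as noindex via the HTTP header,
# since non-HTML files cannot carry a meta robots tag.
from http.server import HTTPServer, SimpleHTTPRequestHandler

class NoindexPDFHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        if self.path.lower().endswith(".pdf"):
            self.send_header("X-Robots-Tag", "noindex")
        super().end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), NoindexPDFHandler).serve_forever()
```

In production the same header is usually set in the web server configuration rather than in application code.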
SEO Expert opinion
Is this distinction respected in actual crawling practice?
Yes, and it can be verified in Search Console. Go to the Coverage report, and you will regularly see URLs marked "Indexed, though blocked by robots.txt". These pages are in the index without Googlebot ever having fetched their content. They are there solely from external signals: backlinks, XML sitemaps, mentions on other crawled pages.
The problem arises when sensitive URLs (staging, test pages, duplicate content) are blocked by robots.txt but receive links. Google indexes them with an empty snippet. The result: you think you are protected, but your internal URLs appear publicly.
Why does Google maintain this counter-intuitive operation?
Because robots.txt is a voluntary exclusion protocol, not a security measure. It dates back to 1994, a time when the web was fundamentally different. Google respects this standard while applying its own indexing logic, which prioritizes relevance signals.
If a URL is heavily cited, Google considers it to have informational value, even if it cannot crawl it. This is consistent with its mission: to organize global information, not blindly adhere to the wishes of webmasters. [To be verified]: Google has never published a specific threshold of backlinks needed to trigger indexing without crawling.
What practical errors arise from this misunderstanding?
The most common: blocking pages you want to deindex with robots.txt. I have seen sites block /tag/ or /author/ via robots.txt while still linking to these URLs internally. The result: hundreds of URLs indexed with empty snippets, creating noise in the index and exposing internal structure publicly.
Another mistake: adding a noindex to the HTML of a page already blocked by robots.txt, then wondering why deindexation never happens. Googlebot cannot read a noindex on a page it is not allowed to crawl. You must first remove the robots.txt block, let Google crawl the page and read the noindex, wait for deindexation (which can take weeks), and only then, if necessary, reapply a robots.txt block.
Practical impact and recommendations
How to audit your current configuration?
Start by cross-referencing Google Search Console and your robots.txt file. In the Coverage report, filter for URLs labeled "Indexed, though blocked by robots.txt" or "Discovered – currently not indexed". If you see URLs you thought were protected by robots.txt, this is a red flag.
Then, use a crawler like Screaming Frog or Oncrawl to identify pages containing noindex AND blocked by robots.txt. This combination is a configuration bug: the noindex will never be read, and the page remains indexable through external signals.
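This cross-check can be scripted. A sketch (standard library only; the site and the URL list are placeholders — in practice the list would come from a Screaming Frog or Oncrawl export) that flags the bug described above:

```python
# Sketch: find pages carrying a noindex that Googlebot is not allowed to read.
import urllib.request
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"  # hypothetical site
rp = RobotFileParser(f"{SITE}/robots.txt")
rp.read()

urls = [f"{SITE}/tag/seo/", f"{SITE}/author/admin/"]  # e.g. a crawler export

for url in urls:
    blocked = not rp.can_fetch("Googlebot", url)
    with urllib.request.urlopen(url) as resp:  # we can fetch; Googlebot won't
        html = resp.read().decode("utf-8", errors="replace").lower()
    # Deliberately naive detection; a real audit would parse the HTML.
    has_noindex = 'name="robots"' in html and "noindex" in html
    if blocked and has_noindex:
        print(f"Configuration bug: {url} has a noindex Googlebot cannot read")
```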
What procedure should be followed for clean deindexation?
The correct sequence is non-intuitive but critical. First, remove any Disallow directive concerning the URL from robots.txt. Then, add a meta robots noindex tag (or X-Robots-Tag: noindex in the HTTP header) on the page itself.
Wait for Googlebot to crawl the page, read the noindex, and drop it from the index. This step can take from a few days to several weeks depending on crawl frequency. Check in Search Console, Coverage section, that the URL moves to the "Excluded by 'noindex' tag" status. Only after this confirmed deindexation can you optionally reapply a robots.txt block if you also want to prevent crawling.
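Progress can also be checked programmatically. A hedged sketch using the Search Console URL Inspection API via google-api-python-client (assumptions: the package is installed, OAuth credentials with the Search Console scope are already obtained, and the property is verified; all identifiers are illustrative):

```python
# Sketch: poll the coverage status of a page during a deindexation campaign.
# Assumes OAuth credentials for a verified property are already available.
from googleapiclient.discovery import build

def coverage_state(creds, site_url: str, page_url: str) -> str:
    service = build("searchconsole", "v1", credentials=creds)
    response = service.urlInspection().index().inspect(body={
        "siteUrl": site_url,        # the verified Search Console property
        "inspectionUrl": page_url,  # the page being deindexed
    }).execute()
    # coverageState carries statuses such as "Excluded by 'noindex' tag".
    return response["inspectionResult"]["indexStatusResult"]["coverageState"]
```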
What alternatives are there to manage indexing safely?
For truly sensitive content, never rely on robots.txt alone. Use HTTP authentication (401/403 responses) or a noindex served dynamically, for example on parameterized URLs. Staging sites should live on a separate subdomain behind mandatory authentication.
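As an illustration of the authentication route, a minimal sketch (Python standard library; the credentials, realm, and port are placeholders) of a Basic-auth gate that answers 401 to every unauthenticated request, crawler or human alike:

```python
# Sketch: unlike robots.txt, which is advisory, a 401 actually keeps
# crawlers (and everyone else) out of a staging site.
import base64
from http.server import HTTPServer, SimpleHTTPRequestHandler

EXPECTED = "Basic " + base64.b64encode(b"staging:secret").decode()

class AuthHandler(SimpleHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") != EXPECTED:
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="staging"')
            self.end_headers()
            return
        super().do_GET()

if __name__ == "__main__":
    HTTPServer(("", 8000), AuthHandler).serve_forever()
```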
For WordPress taxonomies (tags, authors, dates), prefer direct noindex rather than blocking with robots.txt. Yoast SEO and Rank Math allow you to noindex by page type. Regularly check that these settings are properly applied in the source code, not just in the admin interface.
- Audit Search Console to identify URLs reported as "Indexed, though blocked by robots.txt"
- Remove any robots.txt directives from pages you want to deindex
- Add noindex (meta tag or HTTP header) on these pages
- Wait for confirmed deindexation before possibly re-blocking crawling
- Use HTTP authentication for truly private content (staging, admin)
- Check that your SEO plugins correctly apply noindex in the source code, not just in their settings
❓ Frequently Asked Questions
Can I combine robots.txt and noindex on the same URL?
How do I deindex pages that are currently blocked by robots.txt?
Why do my URLs blocked by robots.txt appear in Google?
Is the X-Robots-Tag HTTP header more effective than the meta noindex tag?
How long does deindexation take after adding a noindex?