Official statement
Google emphasizes a fundamental distinction: robots.txt blocks crawling but does not prevent indexing, while the noindex tag prohibits indexing of a crawled page. In practical terms, a URL blocked by robots.txt can appear in search results if it receives external backlinks. To exclude a page from the index, you must allow it to be crawled and place a noindex tag on it.
What you need to understand
What is the fundamental difference between robots.txt and noindex?
The robots.txt file acts as a barrier at the crawl level. When Googlebot reads a Disallow directive in this file, it refuses to access the concerned URL. However, this crawling prohibition does not mean the URL disappears from Google's radar.
If there are external backlinks pointing to this blocked URL, Google may still decide to index it based solely on the anchor texts and the context of the incoming links. The URL will appear in the results with the note "No information available for this page".
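The scope of the directive can be made concrete with a few lines of Python. A minimal sketch (standard library only; the domain and the Disallow rule are illustrative): a robots.txt parser only answers "may this URL be crawled?", never "is this URL indexed?".

```python
# Minimal sketch: robots.txt gates crawling, nothing else.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("Googlebot", "https://example.com/private/page"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/public/page"))   # True
# "False" only means the URL will not be fetched by a compliant crawler;
# if other sites link to it, Google can still index the bare URL.
```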
How does the noindex tag address this issue?
The meta robots noindex tag (or the HTTP header X-Robots-Tag: noindex) works differently. For Google to read it, it must first crawl the page. Once it has crawled the page, Google sees the noindex instruction and removes the page from its index.
The paradox is this: to prevent indexing, you must allow crawling. If you block a page with robots.txt AND add a noindex to it, Googlebot will never see the noindex. Therefore, the page may end up being indexed if it receives links anyway.
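This dependency can be expressed as a short check. A sketch (standard library only; the URL and the pattern are simplifications) mirroring what Googlebot must do to honor a noindex: fetch the page, then look for the directive in the HTTP header or in the meta robots tag — the very fetch that a robots.txt block forbids.

```python
# Sketch: detect a noindex the way a crawler would, header first, then HTML.
import re
import urllib.request

def has_noindex(url: str) -> bool:
    with urllib.request.urlopen(url) as resp:
        # X-Robots-Tag works for any content type, including PDFs and images.
        if "noindex" in (resp.headers.get("X-Robots-Tag") or "").lower():
            return True
        body = resp.read().decode("utf-8", errors="replace")
    # Simplified pattern: assumes the name attribute precedes content.
    return bool(re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex',
        body, re.IGNORECASE))

print(has_noindex("https://example.com/private-page"))  # hypothetical URL
```

The same function doubles as a spot check after deploying a noindex through a CMS or an SEO plugin.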
Why does this confusion persist among SEOs?
For years, many practitioners believed that robots.txt = deindexation. This mistake comes from a mental simplification: if Googlebot cannot access a page, then it cannot be in the index. False.
Google indexes billions of pages it has never crawled, solely based on external signals. Incoming links are enough to create a footprint in the index. The robots.txt file only delays or complicates information gathering; it does not erase this footprint.
- Robots.txt blocks crawling, not indexing — a URL can be indexed without ever being visited
- Noindex blocks indexing, but requires Googlebot to crawl the page to read the directive
- Combining both (robots.txt + noindex on the same URL) is counterproductive and can lead to unwanted indexing
- To deindex properly: allow crawling, add noindex, wait for deindexation, then optionally block with robots.txt
- The X-Robots-Tag HTTP header is an alternative to the meta robots tag, useful for non-HTML files such as PDFs and images (see the sketch below)
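Since a PDF or an image cannot carry a meta tag, the directive has to travel in the response header. A minimal sketch (Python standard library; the port and the extension rule are illustrative) of a server attaching X-Robots-Tag: noindex to PDF responses:

```python
# Sketch: serve files but mark PDFs as noindex via the HTTP header,
# since non-HTML files cannot carry a meta robots tag.
from http.server import HTTPServer, SimpleHTTPRequestHandler

class NoindexPDFHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        if self.path.lower().endswith(".pdf"):
            self.send_header("X-Robots-Tag", "noindex")
        super().end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), NoindexPDFHandler).serve_forever()
```

In production the same header is usually set in the web server configuration rather than in application code.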
SEO Expert opinion
Is this distinction respected in actual crawling practice?
Yes, and it can be verified in Search Console. Go to the Coverage report, and you will regularly see URLs marked "Indexed, though blocked by robots.txt". These pages are in the index without Googlebot ever having fetched their content. They are there solely from external signals: backlinks, XML sitemaps, mentions on other crawled pages.
The problem arises when sensitive URLs (staging, test pages, duplicate content) are blocked by robots.txt but receive links. Google indexes them with an empty snippet. The result: you think you are protected, but your internal URLs appear publicly.
Why does Google maintain this counter-intuitive operation?
Because robots.txt is a voluntary exclusion protocol, not a security measure. It dates back to 1994, a time when the web was fundamentally different. Google respects this standard while applying its own indexing logic, which prioritizes relevance signals.
If a URL is heavily cited, Google considers it to have informational value, even if it cannot crawl it. This is consistent with its mission: to organize global information, not blindly adhere to the wishes of webmasters. [To be verified]: Google has never published a specific threshold of backlinks needed to trigger indexing without crawling.
What practical errors arise from this misunderstanding?
The most common: blocking pages you want to deindex with robots.txt. I have seen sites block /tag/ or /author/ via robots.txt while still linking to these URLs internally. The result: hundreds of URLs indexed with empty snippets, creating noise in the index and exposing internal structure publicly.
Another mistake: adding a noindex to the HTML of a page already blocked by robots.txt, then wondering why deindexation never happens. Googlebot cannot read a noindex on a page it is not allowed to crawl. You must first remove the robots.txt block, let Google crawl the page and read the noindex, wait for deindexation (which can take weeks), and only then, if necessary, reapply a robots.txt block.
Practical impact and recommendations
How to audit your current configuration?
Start by cross-referencing Google Search Console and your robots.txt file. In the Coverage report, filter for URLs labeled "Indexed, though blocked by robots.txt" or "Discovered – currently not indexed". If you see URLs you thought were protected by robots.txt, this is a red flag.
Then, use a crawler like Screaming Frog or Oncrawl to identify pages containing noindex AND blocked by robots.txt. This combination is a configuration bug: the noindex will never be read, and the page remains indexable through external signals.
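This cross-check can be scripted. A sketch (standard library only; the site and the URL list are placeholders — in practice the list would come from a Screaming Frog or Oncrawl export) that flags the bug described above:

```python
# Sketch: find pages carrying a noindex that Googlebot is not allowed to read.
import urllib.request
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"  # hypothetical site
rp = RobotFileParser(f"{SITE}/robots.txt")
rp.read()

urls = [f"{SITE}/tag/seo/", f"{SITE}/author/admin/"]  # e.g. a crawler export

for url in urls:
    blocked = not rp.can_fetch("Googlebot", url)
    with urllib.request.urlopen(url) as resp:  # we can fetch; Googlebot won't
        html = resp.read().decode("utf-8", errors="replace").lower()
    # Deliberately naive detection; a real audit would parse the HTML.
    has_noindex = 'name="robots"' in html and "noindex" in html
    if blocked and has_noindex:
        print(f"Configuration bug: {url} has a noindex Googlebot cannot read")
```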
What procedure should be followed for clean deindexation?
The correct sequence is non-intuitive but critical. First, remove any Disallow directive concerning the URL from robots.txt. Then, add a meta robots noindex tag (or X-Robots-Tag: noindex in the HTTP header) on the page itself.
Wait for Googlebot to crawl the page, read the noindex, and drop it from the index. This step can take from a few days to several weeks depending on crawl frequency. Check in Search Console, Coverage section, that the URL moves to the "Excluded by 'noindex' tag" status. Only after this confirmed deindexation can you optionally reapply a robots.txt block if you also want to prevent crawling.
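Progress can also be checked programmatically. A hedged sketch using the Search Console URL Inspection API via google-api-python-client (assumptions: the package is installed, OAuth credentials with the Search Console scope are already obtained, and the property is verified; all identifiers are illustrative):

```python
# Sketch: poll the coverage status of a page during a deindexation campaign.
# Assumes OAuth credentials for a verified property are already available.
from googleapiclient.discovery import build

def coverage_state(creds, site_url: str, page_url: str) -> str:
    service = build("searchconsole", "v1", credentials=creds)
    response = service.urlInspection().index().inspect(body={
        "siteUrl": site_url,        # the verified Search Console property
        "inspectionUrl": page_url,  # the page being deindexed
    }).execute()
    # coverageState carries statuses such as "Excluded by 'noindex' tag".
    return response["inspectionResult"]["indexStatusResult"]["coverageState"]
```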
What alternatives are there to manage indexing safely?
For truly sensitive content, never rely on robots.txt alone. Use HTTP authentication (401/403 responses) or a noindex served dynamically, for example on parameterized URLs. Staging sites should live on a separate subdomain behind mandatory authentication.
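As an illustration of the authentication route, a minimal sketch (Python standard library; the credentials, realm, and port are placeholders) of a Basic-auth gate that answers 401 to every unauthenticated request, crawler or human alike:

```python
# Sketch: unlike robots.txt, which is advisory, a 401 actually keeps
# crawlers (and everyone else) out of a staging site.
import base64
from http.server import HTTPServer, SimpleHTTPRequestHandler

EXPECTED = "Basic " + base64.b64encode(b"staging:secret").decode()

class AuthHandler(SimpleHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") != EXPECTED:
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="staging"')
            self.end_headers()
            return
        super().do_GET()

if __name__ == "__main__":
    HTTPServer(("", 8000), AuthHandler).serve_forever()
```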
For WordPress taxonomies (tags, authors, dates), prefer direct noindex rather than blocking with robots.txt. Yoast SEO and Rank Math allow you to noindex by page type. Regularly check that these settings are properly applied in the source code, not just in the admin interface.
- Audit Search Console to identify URLs reported as "Indexed, though blocked by robots.txt"
- Remove any robots.txt directives from pages you want to deindex
- Add noindex (meta tag or HTTP header) on these pages
- Wait for confirmed deindexation before possibly re-blocking crawling
- Use HTTP authentication for truly private content (staging, admin)
- Check that your SEO plugins correctly apply noindex in the source code, not just in their settings
❓ Frequently Asked Questions
Can I combine robots.txt and noindex on the same URL?
How do I deindex pages that are currently blocked by robots.txt?
Why do my URLs blocked by robots.txt appear in Google?
Is the X-Robots-Tag HTTP header more effective than the meta noindex tag?
How long does deindexation take after adding a noindex?