Official statement
Google clearly states that robots.txt is not designed to prevent a page from being indexed in search results. The robots.txt file only blocks crawling, not inclusion in the index—a fundamental nuance that is often misunderstood. To keep a page out of the SERPs, the noindex directive or authentication are the only reliable methods.
What you need to understand
What is the difference between crawling and indexing?
Crawling refers to the process where Google's bot explores a page to extract its content. Indexing is the decision to include that page in the searchable database of the search engine.
Robots.txt blocks crawling—the bot cannot access the page. But if there are external links pointing to this URL, Google can still index it with the available information (anchor text, link context). As a result, a URL may appear in the SERPs with an empty or generic snippet, even though Google never read the content.
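To make the distinction concrete, here is a minimal sketch (Python standard library, with example.com as a placeholder) of the only question robots.txt can answer: may this URL be crawled? Nothing in that answer determines whether the URL can appear in the index.

```python
# Minimal sketch: robots.txt answers "may Googlebot crawl this URL?" and
# nothing more. The domain and path are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetches and parses the live robots.txt

url = "https://example.com/private/report.html"
if rp.can_fetch("Googlebot", url):
    print("Crawling allowed")
else:
    # Crawling is disallowed, yet external links pointing at this URL can
    # still get it indexed with a snippet built from anchor text alone.
    print("Crawling blocked - indexing is NOT prevented")
```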
Why does this confusion persist among so many practitioners?
Because for years, blocking crawling via robots.txt used to work indirectly for certain pages. If Google didn't crawl, it often didn't index either—but this was never guaranteed.
The problem arises when third-party backlinks signal the existence of a blocked URL. Google then creates a minimal index entry, based solely on external signals. Hence, you find your private URL in the results, with a rough title taken from the anchor text.
In what concrete cases does this situation occur?
Typically on staging environments blocked by robots.txt but linked from an external site, or admin pages erroneously referenced in forums or third-party tools.
Misconfigured CMSs also generate technical URLs blocked from crawling but indexed through contradictory XML sitemaps. Google sees the URL in the sitemap, notes that it is blocked from crawling, but still indexes it if other signals deem it relevant.
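One way to catch this contradiction early is to cross-check the XML sitemap against robots.txt. The sketch below is a hedged example (Python standard library only) that assumes a single sitemap at /sitemap.xml and the default robots.txt location, with example.com as a placeholder.

```python
# Hedged sketch: flag sitemap URLs that robots.txt blocks from crawling.
# Assumes one sitemap at /sitemap.xml; the domain is a placeholder.
import urllib.request
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

rp = RobotFileParser(f"{SITE}/robots.txt")
rp.read()

with urllib.request.urlopen(f"{SITE}/sitemap.xml") as resp:
    tree = ET.parse(resp)

for loc in tree.findall(".//sm:loc", NS):
    url = loc.text.strip()
    if not rp.can_fetch("Googlebot", url):
        # Listed in the sitemap yet blocked from crawling: Google knows the
        # URL exists but cannot read its content.
        print("Contradiction:", url)
```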
- Robots.txt blocks crawling, not indexing—a URL can appear in the SERPs without being crawled
- Noindex is the technical directive to exclude a page from the index (requires Google to crawl the page to read the tag)
- Authentication (login/password) physically prevents access—an extreme but effective method
- If a URL is blocked by robots.txt AND has backlinks, Google can create a partial index entry based on anchor texts
- The combination of robots.txt + noindex is technically contradictory—Google cannot read the noindex if it does not crawl
SEO Expert opinion
Is this statement consistent with real-world observations?
Yes, and it's even a welcome reminder. In practice, we regularly see URLs blocked by robots.txt that still appear as indexed in Google Search Console; the status "Indexed, though blocked by robots.txt" refers precisely to these ambiguous cases.
What is missing in Waisberg's statement is an explanation of the deindexing delay. Switching from robots.txt to noindex on an already indexed page does not guarantee immediate removal from the index—Google must first recrawl the page to read the noindex. [To be verified]: there is no official data on the average duration of this process.
What concrete risks exist for an e-commerce or editorial site?
The classic scenario: a site blocks its filter facets or internal search results pages via robots.txt to preserve crawl budget. If these URLs receive external links (from forums, comparison sites), they can get indexed with empty or misleading snippets.
Result: unintentional cannibalization and dilution of visibility. Google presents a useless technical URL instead of the strategic category page. Worse, these ghost URLs consume crawl budget during periodic recrawl attempts, even if they remain blocked.
In what cases does the rule not strictly apply?
If a URL has no external backlinks and is not listed in any XML sitemaps, blocking it via robots.txt usually suffices to avoid indexing. But it's a gamble—you have no contractual guarantee from Google on this point.
Another exception: purely technical resources (CSS, JS) that you block for performance reasons. Google recommends not blocking these resources, but if you do, they are unlikely to appear as organic results anyway. The problem remains limited to HTML pages intended for users.
Practical impact and recommendations
What concrete actions should be taken on an existing site?
First step: a cross-audit of robots.txt against the Google index. Run a site:yourdomain.com query in Google, then filter for URLs that should be blocked. Compare them with your robots.txt file—any blocked URL that still appears in the SERPs needs a noindex.
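As a rough illustration of that cross-audit, here is a hedged sketch (Python standard library, placeholder domain) that reads the URLs collected from the site: query—stored in a hypothetical indexed_urls.txt file, one URL per line—and reports the ones your robots.txt blocks, i.e. the indexed-yet-blocked URLs that need attention.

```python
# Hedged sketch: cross-audit indexed URLs against robots.txt.
# "indexed_urls.txt" is a hypothetical export (one URL per line) built
# from a site: query; the domain is a placeholder.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

with open("indexed_urls.txt") as f:
    indexed = [line.strip() for line in f if line.strip()]

for url in indexed:
    if not rp.can_fetch("Googlebot", url):
        # Indexed but blocked from crawling: Google cannot read a noindex
        # here, so the block must be lifted long enough for the tag to be seen.
        print("Indexed yet blocked:", url)
```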
Next, open the "Pages" report in Google Search Console, under "Why pages aren't indexed", and look specifically for the "Blocked by robots.txt" label. Pages reported as "Indexed, though blocked by robots.txt", or blocked pages that still record impressions or clicks, are paradoxically indexed despite the blocking.
What mistakes should you avoid when migrating to noindex?
Never remove a robots.txt rule without having first added the noindex to the pages it was protecting, then monitor the recrawl. Otherwise, Google will crawl the newly accessible URLs in bulk and may index unwanted content before you can react.
Avoid serving noindex via the X-Robots-Tag HTTP header on pages blocked by robots.txt—same issue as the meta tag: Google must be able to fetch the HTTP response to read the header. The only viable exception remains server authentication (401/403), which physically blocks access.
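Before trusting a noindex, it is worth checking that Googlebot could actually read it. The sketch below (standard library only, placeholder URLs, deliberately crude body check) looks for a noindex in the X-Robots-Tag header and in the page source, and warns if robots.txt hides that signal from Googlebot.

```python
# Hedged sketch: verify that a noindex signal is actually readable.
# The URLs are placeholders; the body check is deliberately crude.
import urllib.request
from urllib.robotparser import RobotFileParser

url = "https://example.com/old-filter-page/"

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
crawlable = rp.can_fetch("Googlebot", url)

resp = urllib.request.urlopen(url)
header_noindex = "noindex" in (resp.headers.get("X-Robots-Tag") or "").lower()
body_noindex = b"noindex" in resp.read().lower()  # crude meta robots check

if (header_noindex or body_noindex) and not crawlable:
    print("Contradiction: a noindex is served but robots.txt hides it from Googlebot")
elif header_noindex or body_noindex:
    print("OK: noindex is in place and the URL is crawlable, so Google can read it")
else:
    print("No noindex signal found on this URL")
```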
How do you verify that the configuration stays correct over time?
Set up a Search Console alert for newly indexed pages with the label "Blocked by robots.txt". This detects inconsistencies as soon as they occur, especially after CMS updates or migrations.
Regularly test with the URL inspection tool in GSC: it indicates if a page is blocked from crawling but present in the index. For sites with thousands of technical URLs, automate this check via the Search Console API and monitoring scripts.
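As an illustration of that automation, here is a hedged sketch built on the google-api-python-client library and the Search Console URL Inspection API. The service-account file, the property URL, the list of URLs, and the exact response fields read here (robotsTxtState, coverageState) are assumptions to adapt to your own setup.

```python
# Hedged sketch: batch-check URLs with the Search Console URL Inspection API
# and flag the ones Google reports as blocked by robots.txt.
# "service-account.json" and every URL below are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
gsc = build("searchconsole", "v1", credentials=creds)

SITE = "https://example.com/"
urls_to_check = [
    "https://example.com/filter-page/",
    "https://example.com/internal-search/",
]

for url in urls_to_check:
    result = gsc.urlInspection().index().inspect(
        body={"inspectionUrl": url, "siteUrl": SITE}
    ).execute()
    status = result.get("inspectionResult", {}).get("indexStatusResult", {})
    # robotsTxtState should read DISALLOWED when robots.txt blocks crawling;
    # coverageState describes how Google currently treats the URL.
    print(url, status.get("robotsTxtState"), status.get("coverageState"))
```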
- Audit URLs blocked by robots.txt that still appear in site: results
- Replace robots.txt blocking with noindex on all pages to be excluded from the index
- Temporarily remove the robots.txt blocking to allow the noindex to be crawled
- Monitor GSC to detect new "Blocked by robots.txt" pages that are indexed
- Prioritize server authentication for truly sensitive content (admin, staging)
- Never combine robots.txt + noindex on the same URL (technical contradiction)
❓ Frequently Asked Questions
Can robots.txt AND noindex be used on the same page?
How long does it take for a noindex page to drop out of the index?
Is password authentication really necessary for admin pages?
If I block a filter facet via robots.txt, can it still appear in Google Shopping or Google Images?
Which directive should be used for a publicly accessible staging environment?
Source: Google Search Central video · duration 9 min · published on 06/10/2020