Official statement

Robots.txt blocks crawling (Google cannot see the page, but the URL can still appear without content). The meta robots noindex tag allows Google to see the page and remove it completely from search results. To block crawling, use robots.txt. To prevent indexation, use one or the other depending on what is easiest.
🎥 Source video

Extracted from a Google Search Central video

💬 EN 📅 04/07/2022 ✂ 13 statements
Watch on YouTube →
TL;DR

Robots.txt blocks crawling but the URL can still appear in Google without content or description. The meta robots noindex tag allows Google to see the page before removing it completely. For proper deindexation, prioritize noindex over robots.txt.

What you need to understand

What is the fundamental difference between these two methods?

The robots.txt file prevents Googlebot from accessing a page. The bot cannot read the content, analyze meta tags, or follow links. It's a gatekeeper that says "don't enter."

The meta robots noindex tag works the opposite way: it allows Google to crawl the page, read its content, understand its structure — then asks it not to index it. The bot enters, sees that you don't want this page in the index, and removes it or never adds it.
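To make the gatekeeper behavior concrete, here is a minimal sketch using Python's standard `urllib.robotparser`; the domain and paths are hypothetical examples, not taken from the video.

```python
# Minimal sketch of how a crawler evaluates robots.txt rules.
# Domain and paths are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(robots_txt)

# The bot is told "don't enter" /private/; it never reads the content there.
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/products/"))          # True
```

The noindex directive, by contrast, lives inside the page itself (`<meta name="robots" content="noindex">`), so it can only take effect if the crawler is allowed through.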

Why can a URL blocked by robots.txt still appear in search results?

Google can discover a URL through external links or sitemaps without ever crawling the page itself. In this case, the URL can appear in the SERPs with "No information available" — because Google knows the page exists, but has never been able to access it to verify its content.

It's paradoxical but logical: robots.txt blocks crawling, not knowledge of the URL's existence. If you want a page to disappear completely from results, blocking access is not enough.

When should you use one method or the other?

Use robots.txt to save crawl budget on unnecessary resources: heavy JS files, internal PDFs, infinite filter pages, admin zones. The goal is to prevent Google from wasting time on these URLs.

Use meta robots noindex when you want a page completely absent from the index: thank you pages, internal search results, private but accessible pages. Google must be able to read the directive to apply it — so no upstream blocking.

  • Robots.txt: controls crawling, not indexation — the URL can remain visible
  • Meta robots noindex: controls indexation — Google must be able to crawl to read the directive
  • Never combine both: blocking a page in robots.txt prevents Google from seeing the noindex
  • To completely remove a URL, always prioritize noindex
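For illustration, a robots.txt reserved purely for crawl control might look like this (the paths are hypothetical). Pages you want deindexed are deliberately absent: they must stay crawlable so Google can read their `<meta name="robots" content="noindex">` tag.

```
User-agent: *
Disallow: /admin/
Disallow: /internal-search
Disallow: /assets/heavy-scripts/
```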

SEO Expert opinion

Is this distinction actually respected by Google in practice?

Yes, and it's a classic pitfall. We regularly see sites that block sensitive pages in robots.txt — account pages, abandoned carts, search filters — and later discover these URLs in Google with "Description unavailable for this page."

The problem? These URLs are known through backlinks or sitemap leaks. Google indexes them as "existing pages" without ever being able to access them. Result: you have zombie URLs in the index, with no control over their presentation.

What mistakes do we still see too often in the field?

Mistake number one: using robots.txt to block a page you actually want deindexed. Typical example: after a redesign, the old site is blocked in robots.txt to "force" deindexation. Google can no longer crawl it, so it never sees the 301 redirects or the noindex tags. The old URLs remain in the index for months.

Second common mistake: putting noindex on a page and then blocking it in robots.txt "for security." Google can no longer verify that the noindex is still present, and may reindex the page if external signals suggest it is relevant. This remains to be verified at scale, as Google sometimes seems to ignore this rule when a page receives many positive signals.

Warning: if you have used robots.txt to block pages containing sensitive data (emails, names, numbers), temporarily unblock them, add a noindex, wait for complete deindexation, then reblock if necessary. Never leave sensitive URLs sitting in the index unchecked.

Is Google consistent with its own historical recommendations?

For years, Google said "use robots.txt to block indexation" — which was technically wrong but often worked in practice, because Google wouldn't crawl and would eventually remove URLs from the index. Then Search Console started reporting "URLs indexed despite robots.txt blocking."

Today, the message is clear: robots.txt does not guarantee deindexation. This is an acknowledged change in doctrine, and practices must adapt accordingly — especially for large sites that historically managed indexation via robots.txt.

Practical impact and recommendations

What should you audit first on an existing site?

Start by cross-checking two sources: URLs blocked in robots.txt (via your CMS or a Screaming Frog crawl in list mode) and URLs indexed in Google (via site: or Search Console). Any intersection between these two lists is a potential problem.

Next, check pages with noindex: are they accessible to crawling? If you have noindex on a page blocked in robots.txt, the directive is useless — Google cannot read it. Unblock, wait for recrawl, then reblock if necessary to save crawl budget.
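The cross-check itself is a simple set intersection. A hedged sketch, assuming you have already exported both lists (the robots rules and URLs below are invented):

```python
# Hypothetical audit: URLs indexed by Google but blocked in robots.txt
# are "zombie URLs", known to Google yet never crawlable.
from urllib.robotparser import RobotFileParser

robots_lines = [
    "User-agent: *",
    "Disallow: /filters/",
    "Disallow: /admin/",
]
# e.g. exported from Search Console or gathered via a site: check
indexed_urls = {
    "https://example.com/filters/red-shoes",
    "https://example.com/products/red-shoes",
}

rp = RobotFileParser()
rp.parse(robots_lines)

# Indexed but blocked: candidates for unblock + noindex
zombie_urls = {url for url in indexed_urls if not rp.can_fetch("Googlebot", url)}
print(zombie_urls)
```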

How do you fix a misconfigured deindexation?

If pages are blocked in robots.txt but appear in the index: unblock them immediately. Add a <meta name="robots" content="noindex"> tag in the <head>. Then request a recrawl via Search Console's URL Inspection tool to speed up removal.

Once pages are deindexed (verify with site:yoururl.com), you can reblock in robots.txt if you want to save crawl budget. But if there's a risk of indexation through external links, keep the noindex in place — it's safer.
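To verify that the noindex directive is actually present and readable, you can parse the returned HTML. A minimal sketch with Python's standard `html.parser` (the sample document is fabricated):

```python
# Checks whether an HTML document carries <meta name="robots" content="noindex">.
from html.parser import HTMLParser

class RobotsMetaChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True

# Fabricated sample response body
html_doc = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
checker = RobotsMetaChecker()
checker.feed(html_doc)
print(checker.noindex)  # True
```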

What best practices should you adopt to avoid these errors in the future?

  • Never block in robots.txt a page you want to deindex — always use meta robots noindex
  • Reserve robots.txt for crawl budget control: unnecessary resources, heavy files, admin zones without SEO value
  • Regularly audit URLs indexed despite robots.txt blocking via Search Console (Coverage report)
  • Clearly document your indexation strategy: which pages should be crawled, which indexed, which neither
  • Test your directives on a staging environment before deploying them to production
  • Use Search Console to force recrawl of critical pages after directive changes
Robots.txt and meta robots are not interchangeable. The first controls crawl access, the second controls indexation. To remove a page from Google, noindex is the only reliable method. To save crawl budget without risk of residual indexation, combine noindex then robots.txt blocking — in that order.

These mechanisms may seem simple on paper, but their correct implementation at scale on a complex site requires solid technical expertise and a strategic vision of crawl budget. If your architecture generates thousands of URL variants or you manage multiple environments (staging, production, mirrors), bringing in a specialized SEO agency can save you costly mistakes and ensure optimal indexation management.

❓ Frequently Asked Questions

Can you use robots.txt AND meta robots noindex on the same page?
No, it's counterproductive. If you block a page in robots.txt, Google cannot crawl it and will therefore never see the noindex tag. Use one or the other depending on your goal: robots.txt to save crawl budget, noindex to deindex.
Why do my pages blocked in robots.txt still appear in Google?
Because Google knows these URLs through external links, sitemaps, or past crawls. Robots.txt prevents crawling, not knowledge of the URL's existence. To remove these pages, unblock them, add a noindex, wait for deindexation, then reblock if necessary.
Does the HTTP X-Robots-Tag work like the meta robots tag?
Yes: the X-Robots-Tag (sent in the HTTP headers) has exactly the same effect as the meta robots tag. It is useful for non-HTML files (PDFs, images) or when you cannot modify the source code. Google must be able to crawl the resource to read this header.
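For example, on nginx the header can be set for a whole location block. This is a sketch, assuming PDFs under a hypothetical /docs/ path should stay out of the index:

```
location /docs/ {
    add_header X-Robots-Tag "noindex" always;
}
```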
How long does it take for a noindex page to disappear from the index?
It depends on how often the page is crawled. For a frequently crawled page, a few days to two weeks. For a rarely visited page, several months. You can speed up the process by requesting reindexing via Search Console.
Should you remove noindex pages from the XML sitemap?
Yes, it's recommended. Google accepts noindex URLs in sitemaps, but it sends contradictory signals. A sitemap should contain only the pages you want indexed. Noindex pages have no place there.
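A hedged sketch of that filtering at sitemap-generation time, using Python's standard `xml.etree` (the page inventory is invented):

```python
# Generates a sitemap that skips pages flagged noindex.
import xml.etree.ElementTree as ET

pages = [  # invented inventory; in practice this comes from your CMS
    {"loc": "https://example.com/", "noindex": False},
    {"loc": "https://example.com/thank-you", "noindex": True},  # excluded
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in pages:
    if page["noindex"]:
        continue  # noindex pages have no place in the sitemap
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page["loc"]

sitemap_xml = ET.tostring(urlset, encoding="unicode")
print(sitemap_xml)
```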

