Official statement
Google states that a URL blocked by robots.txt cannot be crawled, thus the noindex tag remains invisible to the engine. The URL Removal Tool does not affect crawling or indexing; it is merely a temporary cache. Essentially, you must choose: either block access (robots.txt) or allow crawling for Google to read your indexing directives (noindex). The two approaches are incompatible.
What you need to understand
What's the difference between blocking with robots.txt and deindexing with noindex?
The robots.txt file acts as a barrier upstream: it outright denies a crawler access to a URL. Googlebot stops before even downloading the HTML content. The result: no analysis of the page, no reading of meta tags, no detection of noindex or canonical directives.
The noindex directive, on the other hand, requires the bot to access the page and read either the HTTP response headers (X-Robots-Tag) or the HTML itself (meta robots tag). It is a post-crawl instruction: "Okay, you can read this page, but don't index it." If you block the URL upstream, Google will never see this instruction. The page may remain indexed: orphaned, stagnant, with the snippet "A description for this result is not available because of this site's robots.txt."
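To make the precedence concrete, here is a minimal Python sketch (hypothetical URLs and rules) using the standard library's robots.txt parser: once a path is disallowed, a compliant crawler never downloads the HTML, so any noindex inside it can never be read.

```python
from urllib import robotparser

# Hypothetical robots.txt served at https://example.com/robots.txt
robots_txt = """User-agent: *
Disallow: /staging/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

url = "https://example.com/staging/old-page.html"

if not rp.can_fetch("Googlebot", url):
    # The crawler stops here: the HTML is never downloaded, so a
    # <meta name="robots" content="noindex"> inside the page can never
    # be read. The URL may stay indexed with a bare snippet.
    print("Blocked by robots.txt: any noindex in the HTML is invisible")
else:
    print("Crawl allowed: the noindex directive can be discovered")
```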
Why doesn't the URL Removal Tool change anything about indexing?
The URL Removal Tool in Search Console only hides URLs temporarily, for 90 days. It removes a URL from the search results without altering crawling or the underlying index status. It's an emergency band-aid, not a long-term solution.
Google will continue to crawl the URL according to its usual schedule unless a robots.txt or noindex clearly instructs it otherwise. The tool does not change either the crawl frequency or the actual indexing status. Once the 90 days are up, if nothing has changed on the server side, the page reappears in the SERPs.
What happens if I use robots.txt AND noindex simultaneously?
You create a technical conflict. The robots.txt blocks access, so Google never reads the noindex. The page potentially remains indexed with a degraded snippet. This is a common scenario on poorly configured sites: old staging URLs blocked by robots.txt, carrying a noindex in HTML that Google never gets to read.
Google always prioritizes robots.txt first. If the file says "Disallow," the crawler won't go any further. The noindex becomes moot. To properly deindex, you must allow crawling (remove the Disallow line) and let the bot discover the noindex directive over a few crawl cycles.
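A quick way to verify that a noindex is actually discoverable once the Disallow line is gone is to fetch the URL yourself and check the two places Google looks at. A minimal sketch, standard library only, with a hypothetical URL (a real audit tool would also handle redirects, HTTP errors and arbitrary attribute order in the meta tag):

```python
import re
import urllib.request

def noindex_is_visible(url: str) -> bool:
    """Return True if a crawler that is allowed to fetch this URL
    would find a noindex directive on it."""
    req = urllib.request.Request(url, headers={"User-Agent": "seo-audit-sketch"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        # 1. noindex sent as an HTTP response header (X-Robots-Tag)
        if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
            return True
        # 2. noindex sent as a meta robots tag in the HTML
        html = resp.read().decode("utf-8", errors="replace")
    # Simplified pattern: assumes name="robots" appears before content="..."
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    return bool(meta and "noindex" in meta.group(1).lower())

# Hypothetical staging URL that should be deindexed:
# print(noindex_is_visible("https://example.com/staging/old-page.html"))
```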
- The robots.txt blocks crawl access, preventing any HTML reading.
- The noindex requires a crawl to be detected and applied.
- The removal tool is temporary (90 days) and affects neither crawling nor the underlying index status.
- Combining robots.txt + noindex creates a technical conflict where the noindex remains invisible.
- To properly deindex: allow crawl, let Google read the noindex, then block if necessary.
SEO Expert opinion
Does this statement reflect real-world observations?
Yes, and it's even a classic in SEO audits. We regularly find sites with thousands of URLs blocked by robots.txt but still indexed, displaying the infamous snippet “Description not available.” Google can't read the noindex, so it keeps the URL indexed by default — especially if there are backlinks pointing to it.
The problem gets worse during migrations or redesigns. A URL blocked by robots.txt for months, then unblocked, can take several weeks to be recrawled if the crawl budget is tight. In the meantime, it stays indexed with outdated or empty content. The result: index pollution, potential cannibalization, crawl budget dilution.
When is the robots.txt justifiable for blocking indexing?
Rarely. The robots.txt is mainly used to save crawl budget: infinite facets, dynamic URL parameters, admin areas, unnecessary resources. But to deindex an indexable page (legitimate content you simply don’t want in the SERPs), a noindex is cleaner.
The only case where robots.txt plus residual indexing is tolerable: PDFs or downloadable files you want to keep visible in the index without letting Google crawl their content. External links pointing to them still contribute to the URL's authority. [To be verified] depending on the type of site: some sectors (legal, medical) prefer to block crawling of sensitive documents entirely, even if it sacrifices SEO.
What should I do if a page blocked by robots.txt remains indexed?
First step: remove the Disallow directive from the robots.txt for that URL or directory. Next, add a clean noindex (meta tag or HTTP header X-Robots-Tag). Wait for Googlebot to recrawl — this can take anywhere from a few days to several weeks depending on the crawl budget.
In parallel, use the URL Removal Tool to speed up removal from the visible SERPs, but never rely on it alone. Check in Search Console that the status changes to "Excluded by 'noindex' tag". If nothing changes after 4-6 weeks, request a recrawl via the URL Inspection tool or submit an XML sitemap containing the URL (counterintuitive, but it works to trigger a quick crawl).
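As a sketch of the sitemap trick mentioned above, here is a minimal script that writes an XML sitemap listing the (hypothetical) URLs you want recrawled; you would then submit the file in Search Console's Sitemaps report or reference it in robots.txt with a Sitemap: line.

```python
from xml.sax.saxutils import escape

# Hypothetical URLs that now carry a noindex and need a quick recrawl
urls_to_recrawl = [
    "https://example.com/staging/old-page.html",
    "https://example.com/old-category/",
]

entries = "\n".join(
    f"  <url><loc>{escape(u)}</loc></url>" for u in urls_to_recrawl
)
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n"
    "</urlset>\n"
)

# Write the file, then submit it in Search Console or reference it in robots.txt
with open("recrawl-sitemap.xml", "w", encoding="utf-8") as f:
    f.write(sitemap)
```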
Practical impact and recommendations
How to audit robots.txt / noindex conflicts on a site?
Export the list of URLs blocked by robots.txt from your file (or via Screaming Frog in "List" mode). Cross-reference this list with the URLs actually indexed in Google: use a site:example.com query and filter manually, or combine a Search Console "Coverage > Excluded" export with a Screaming Frog crawl run with "Respect robots.txt" turned off.
Look for URLs that appear both as "Blocked by robots.txt" and as "Indexed": these are your critical conflicts. Check whether a noindex is present in the HTML or the HTTP headers; if so, it is currently invisible to Google. Then decide for each URL: proper deindexing (remove the Disallow rule, keep the noindex) or permanent blocking (keep the robots.txt rule, accept residual indexing).
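One possible way to automate this cross-referencing, sketched in Python with hypothetical file and site names: read an export of indexed URLs (here assumed to be a one-column indexed.csv), test each URL against the live robots.txt, and flag those that are both indexed and blocked. Note that urllib.robotparser only does prefix matching and ignores Google's `*` and `$` wildcard extensions, so wildcard rules must still be reviewed by hand.

```python
import csv
from urllib import robotparser

# Hypothetical site; the robots.txt is fetched and parsed directly.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

conflicts = []
# indexed.csv: one indexed URL per row, first column (e.g. built from a
# Search Console export or a site: sample).
with open("indexed.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if not row:
            continue
        url = row[0].strip()
        if url.startswith("http") and not rp.can_fetch("Googlebot", url):
            # Indexed AND blocked: any noindex on this page is invisible
            conflicts.append(url)

print(f"{len(conflicts)} indexed URLs blocked by robots.txt")
for url in conflicts:
    print(" -", url)
```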
Which method to choose based on content type?
For sensitive or private content: password-protected, paywall, or server-side blocking (401/403). Never rely solely on robots.txt or noindex — a link leak can index the page. For duplicate or low-value content: noindex + canonical if relevant, never robots.txt (Google needs to read your directives).
For technical resources (CSS, JS, images): block NOTHING with robots.txt since 2015 — Google needs these resources for rendering and Core Web Vitals. For facets or URL parameters: use robots.txt if crawl budget is tight + canonical on the main version. For obsolete or archived pages: 301 or 410 depending on the case, never just robots.txt.
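Before deploying a robots.txt meant to save crawl budget, a small sanity check like the following (hypothetical rules and URLs) can confirm that facet paths are blocked while CSS/JS assets stay crawlable. It assumes facets live under a path prefix such as /filter/; parameter-based facets need Google's `*` wildcard syntax, which urllib.robotparser does not evaluate.

```python
from urllib import robotparser

# Hypothetical candidate robots.txt, prefix rules only
candidate_robots = """User-agent: *
Disallow: /filter/
Disallow: /search/
"""

rp = robotparser.RobotFileParser()
rp.parse(candidate_robots.splitlines())

# Expected crawlability per URL: facets blocked, assets and normal pages allowed
checks = {
    "https://example.com/filter/red/size-42/": False,
    "https://example.com/assets/app.css": True,
    "https://example.com/assets/app.js": True,
    "https://example.com/category/shoes/": True,
}

for url, should_be_allowed in checks.items():
    allowed = rp.can_fetch("Googlebot", url)
    status = "OK" if allowed == should_be_allowed else "REVIEW"
    print(f"{status:6} allowed={allowed!s:5} {url}")
```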
What critical errors should be absolutely avoided?
Never block a URL with robots.txt if you want to deindex it properly: that is the recipe for zombie indexing. Never use the URL Removal Tool as a long-term solution: it hides results for 90 days, it is not an indexing directive.
Avoid blanket-blocking /wp-admin/ or /wp-includes/ on WordPress: some plugins load critical CSS/JS from these directories, and blocking them can degrade mobile rendering. Finally, never remove a robots.txt line without assessing the crawl impact: unblocking 50,000 facet URLs at once can saturate your server and dilute the crawl budget away from strategic pages.
- Export URLs blocked by robots.txt and cross-reference with indexed URLs (Search Console + crawl)
- To deindex: remove robots.txt, add noindex, wait for recrawl, check status “Excluded by noindex”
- To block crawl without deindexing: accept residual indexing or use a real server restriction (401/403)
- Never block critical CSS/JS/images — Google needs them for rendering and Core Web Vitals
- Use the URL Removal Tool only as an urgent temporary measure, never as a long-term strategy
- Test all robots.txt changes on a sample before global deployment (crawl budget risk)
❓ Frequently Asked Questions
Can robots.txt be used to deindex a page quickly?
Does the URL Removal Tool replace noindex?
What should you do if a URL blocked by robots.txt is still indexed after several months?
Does blocking CSS/JS resources with robots.txt impact SEO?
Can robots.txt and a canonical be combined on the same URL?
Source: Google Search Central video (43 min), published on 23 August 2019, available on YouTube.