
Official statement

If a page set to noindex is still indexed, verify that it's not being blocked by robots.txt, as this prevents Google from seeing the noindex tag. Remove the block temporarily to allow crawling.
🎥 Source video

Extracted from a Google Search Central video

⏱ 1h01 💬 EN 📅 28/02/2018 ✂ 10 statements
Watch on YouTube (50:14) →
Other statements from this video (9)
  1. 16:24 Does desktop-only content really disappear with mobile-first indexing?
  2. 26:01 How can Search Console's index coverage report reveal your SEO blind spots?
  3. 28:42 Why does Google offer two crawlers in the URL inspection tool?
  4. 44:51 Is cloaking still penalized, even when used to protect sensitive content?
  5. 47:53 Do regional keyword variations still matter for SEO?
  6. 52:53 Are soft 404s really a problem for your SEO?
  7. 53:37 Can A/B testing really hurt your organic rankings?
  8. 53:58 Why aren't your dynamic sitemaps processed by Google?
  9. 57:18 How does Google actually assess the legality and value of reviews shown in rich snippets?
📅 Official statement from 28/02/2018 (8 years ago)
TL;DR

Google clarifies that a page marked as noindex but still indexed may be blocked by robots.txt, preventing Googlebot from seeing the directive. The solution involves temporarily removing the robots.txt block to allow crawling and reading of the noindex. This technical paradox often traps SEOs who think they can combine robots.txt blocks and meta noindex for greater efficiency.

What you need to understand

How can robots.txt prevent noindex from functioning?

The mechanism is counterintuitive yet logical. When a URL is blocked in robots.txt, Googlebot cannot crawl the page to read its HTML content. The meta robots noindex directive is found in the page's source code, inaccessible if crawling is forbidden.

Google then indexes the URL through other signals: external backlinks, mentions in the XML sitemap, or internal links detected elsewhere on the site. The URL appears in search results with the typical note "No information available for this page" since Google knows of its existence without being able to crawl it.
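To make the mechanism concrete, here is a minimal sketch, using Python's standard robots.txt parser and an invented domain and path, of the decision a well-behaved crawler makes before it ever sees the HTML:

```python
# Minimal sketch of the crawl decision (domain, path, and robots.txt contents are illustrative).
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/private/offer.html"  # hypothetical page carrying a meta noindex

if not rp.can_fetch("Googlebot", url):
    # Crawling stops before the HTML is ever requested, so the noindex inside it stays invisible.
    # The URL can still be indexed from external signals (backlinks, sitemap, internal links),
    # which is why it surfaces with "No information is available for this page".
    print("blocked by robots.txt: noindex unreadable, URL may remain indexed")
else:
    print("crawlable: the meta noindex or X-Robots-Tag header can be read and applied")
```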

Does the robots.txt block always take precedence over noindex?

Yes, and this is a strict priority rule in the crawl architecture. The robots.txt file is checked before any attempt to fetch the page. If access is denied, the process stops there.

The noindex directive, whether in a meta HTML tag or in the HTTP header X-Robots-Tag, can only be read if the crawler actually accesses the content. This technical hierarchy explains why certain pages persist in the index despite a properly implemented noindex on the page.
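For reference, these are the two equivalent forms of the directive mentioned above, a meta tag in the HTML head and an HTTP response header; both can only be read once the resource is actually fetched:

```
<meta name="robots" content="noindex">

X-Robots-Tag: noindex
```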

In what scenarios do we observe this issue in practice?

The classic scenario: a site blocks URL parameters or entire directories via robots.txt to save crawl budget, then adds a noindex on those same pages to clean the index. The two directives conflict.

Another common case: migrating a site that retains old robots.txt blocks while the new template systematically adds noindex tags on certain sections. The pages remain visible in Google since the inherited robots.txt blocks the reading of the new directives.
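A hypothetical illustration of the first scenario, with an invented directory name: the Disallow stops Googlebot before it ever reaches the pages that carry the noindex, so the two directives never meet.

```
# robots.txt (crawling blocked for the whole section)
User-agent: *
Disallow: /search-results/
```

Meanwhile the template of every page under /search-results/ also outputs <meta name="robots" content="noindex">, a tag Googlebot can no longer fetch.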

  • Disallow robots.txt blocks crawling before reading the HTML
  • Noindex requires crawling to be detected and applied
  • Google can index a URL without crawling if it receives external signals
  • Effective deindexing takes several weeks after removal of the robots.txt block
  • Blocked pages appear with a characteristic empty snippet

SEO Expert opinion

Is this statement consistent with on-the-ground observations?

Absolutely, and it is one of the most documented technical traps in SEO. Server logs consistently confirm that Googlebot doesn't even attempt to crawl URLs blocked in robots.txt, which makes the noindex impossible to detect.

Google's recommendation to "temporarily remove the block" is nonetheless difficult to apply. It assumes that one can afford a spike in crawling on hundreds or thousands of previously blocked URLs, which can overwhelm a less robust server. [To be verified]: Google does not specify the necessary duration of this unblock nor the timeline before effective deindexing.
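A rough way to check this in your own logs is sketched below in Python, assuming a standard combined log format and an invented blocked path; user-agent strings can be spoofed, so a rigorous audit would also verify the requester's IP.

```python
# Count Googlebot requests under a robots.txt-blocked prefix (path and log file are illustrative).
blocked_prefix = "/search-results/"
hits = 0
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" in line and f'"GET {blocked_prefix}' in line:
            hits += 1
print(f"Googlebot requests under {blocked_prefix}: {hits}")  # stays at 0 while the block is active
```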

What nuances should be added to this directive?

The statement remains silent on one crucial point: what to do with URLs that you want to block AND deindex quickly? The recommended sequence (unblock robots.txt, wait for noindex crawl, reblock) takes several weeks minimum depending on the site's crawl frequency.

For sites with a limited crawl budget, this approach can divert crawl resources from important pages. Google does not offer an alternative to speed up the process, like a signal via Search Console that would force a priority recrawl of the affected pages.

In what cases does this rule not fully apply?

An X-Robots-Tag noindex HTTP header can sometimes be detected during a HEAD request, without a full HTML download. Some SEOs report successful deindexing even with robots.txt active, but this behavior is inconsistent and not officially documented.
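A quick way to see what such a HEAD request exposes, with an invented URL and the third-party requests library assumed to be installed:

```python
import requests  # third-party: pip install requests

# HEAD returns only the response headers, not the body.
resp = requests.head("https://example.com/reports/confidential.pdf", allow_redirects=True, timeout=10)
print(resp.headers.get("X-Robots-Tag", "no X-Robots-Tag header"))
```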

URLs never indexed (new pages) with noindex AND robots.txt blocking do not encounter this problem, since Google has no external signal to discover them. The conflict only concerns pages already present in the index that we attempt to clean.

Warning: removing a robots.txt block, even temporarily, can trigger a surge of crawling. On large sites, coordinate this operation with active server monitoring and schedule it during a low-traffic window.

Practical impact and recommendations

What should be done concretely to deindex these pages?

First step: identify the URLs that remain indexed despite the noindex, using a site:example.com query combined with URL inspection in Search Console. Compare with your robots.txt file to spot conflicts between Disallow directives and indexed pages.

Then, temporarily remove the robots.txt block for the targeted URLs only, not necessarily the entire file. Use the granularity of Disallow directives to limit the impact. Request a recrawl via Search Console to expedite the noindex detection, even if Google does not guarantee immediate processing.
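One way to achieve that granularity, shown here with illustrative directory names: since Google applies the most specific (longest) matching rule, a temporary Allow can open only the section to be deindexed while the broader Disallow stays in place.

```
User-agent: *
Disallow: /archive/
# Temporary carve-out so Googlebot can reach these pages and read their noindex:
Allow: /archive/old-press-releases/
```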

What mistakes should be avoided in this handling?

Never reblock robots.txt before Google has crawled and accounted for the noindex. Check in Search Console that the status properly changes to "Excluded by the noindex tag" before reactivating any block.

Avoid combining noindex and robots.txt blocking as a matter of principle, unless you explicitly want to block the crawling of a resource without caring about its indexing. To deindex properly, the noindex alone is sufficient, and it lets Google crawl the page and read the directive. Robots.txt is meant to save crawl budget, not to control indexing.

How to verify that the problem is resolved?

Use the coverage report in Search Console to track the evolution of pages marked "Excluded by the noindex tag". URLs should disappear from the index within 2-4 weeks following the noindex crawl, depending on the frequency of Googlebot's visits.

Regularly run site:example.com/problematic-path queries to confirm effective deindexing. Server logs should show Googlebot visits on previously blocked URLs, proving that crawling has resumed and that the directive can be read.
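A minimal pre-flight check, with a hypothetical URL and the requests library assumed, that the two conditions are in place before waiting on Google: the URL is crawlable again and still answers with a noindex.

```python
import urllib.robotparser
import requests  # third-party: pip install requests

URL = "https://example.com/archive/old-press-releases/2018-launch.html"  # hypothetical

rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()
crawlable = rp.can_fetch("Googlebot", URL)

resp = requests.get(URL, timeout=10)
noindex_served = (
    "noindex" in resp.headers.get("X-Robots-Tag", "")
    or 'content="noindex"' in resp.text  # crude string check; a real audit would parse the HTML
)

print("crawlable again:", crawlable)            # must be True
print("noindex still served:", noindex_served)  # must be True
```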

  • Audit the robots.txt to identify Disallow conflicts with indexed pages
  • Temporarily remove robots.txt blocks on URLs to be deindexed
  • Ensure that the noindex is correctly present in the meta tag or the X-Robots-Tag HTTP header
  • Request a recrawl using the URL Inspection tool in Search Console
  • Monitor server logs to confirm Googlebot's presence
  • Wait 2-4 weeks and verify effective deindexing via a site: query

Careful management of indexing via noindex and robots.txt requires a sharp technical understanding of crawl mechanisms and continuous monitoring of results. These seemingly simple actions can generate side effects on crawl budget or leave unwanted pages visible for weeks. For complex or critical sites, relying on a specialized SEO agency ensures controlled implementation, with prior auditing, intervention planning, and post-deployment monitoring to secure each step of the process.

❓ Frequently Asked Questions

Can robots.txt and noindex be used simultaneously on the same URL?
Technically yes, but it is ineffective: the robots.txt block prevents Googlebot from reading the noindex. The URL can remain indexed if Google discovers it through external links.
How long should robots.txt remain unblocked for Google to detect the noindex?
It depends on the site's crawl frequency. In general, 2-4 weeks is enough for Googlebot to revisit the URLs and detect the directive. Heavily crawled sites may see results within a few days.
Does an X-Robots-Tag HTTP noindex work better with robots.txt active?
No, the X-Robots-Tag also requires a full HTTP response, and therefore a crawl of the resource. The robots.txt block prevents any request, making the header just as inaccessible as the HTML meta tag.
Should the robots.txt block be removed permanently after deindexing?
Not necessarily. Once the noindex has been detected and the page deindexed, you can re-enable the robots.txt block if you want to save crawl budget. The deindexing will persist because Google has recorded the directive.
How can this problem be avoided during a large-scale index cleanup?
Use noindex alone, without a robots.txt block, to deindex. Once the pages have left the index, optionally add a Disallow if you want to block future crawling. Never combine the two from the start.
🏷 Related Topics
Domain Age & History · Crawl & Indexing · Links & Backlinks
