Does robots.txt or noindex really block your pages from being indexed?

Official statement

Blocking indexing with 'robots.txt' prevents Google from seeing the content but does not stop link tracking to the site. 'Noindex' removes the content from the index but requires the content to be accessible first for the directive to apply.

26:41

🎥 Source video

Extracted from a Google Search Central video

⏱ 1h00 💬 EN 📅 16/03/2017 ✂ 10 statements


What you need to understand

Why isn't robots.txt enough to disallow a page from being indexed?

The robots.txt file acts like a "No Entry" sign placed in front of a door. Googlebot adheres to this instruction and does not crawl the blocked URL. The problem is that this URL can still appear in search results if external links point to it.

Google detects these backlinks, notes that a resource exists at this address, but cannot access the content to verify its nature. The result? An empty snippet with only the URL visible in the SERPs. It's not technically content indexing, but the URL remains in the index.
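The "No Entry" behavior can be reproduced locally. Below is a minimal sketch using Python's standard `urllib.robotparser`; the rules and the `example.com` URLs are hypothetical, and the standard-library parser only approximates Googlebot's own matching (it does not handle wildcards the way Google does).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical URLs: one under the disallowed path, one outside it.
blocked = is_crawl_allowed(ROBOTS_TXT, "https://example.com/private/report.html")
allowed = is_crawl_allowed(ROBOTS_TXT, "https://example.com/blog/post.html")
print(blocked, allowed)
```

Note that this check only answers "may the crawler fetch the page?". It says nothing about whether the URL can be listed in search results, which is precisely the distinction made above.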

How does the noindex directive actually work?

The meta robots noindex tag (or the HTTP header X-Robots-Tag: noindex) explicitly tells Google to remove the page from its index. However, for this instruction to be read and applied, the bot must first be able to crawl the page.

Herein lies the trap: if you block the URL in robots.txt, Googlebot can never reach the HTML code where the noindex tag is located. The instruction remains invisible, thus ineffective. The page blocked in robots.txt with a noindex in the source code could still be listed in the index via its backlinks.
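The trap can be expressed in a few lines. The sketch below, using only Python's standard library, looks for a noindex directive in a page's HTML or in its `X-Robots-Tag` header, then combines it with the crawl permission. The function names are hypothetical, and the header parsing is deliberately simplified (it ignores user-agent-prefixed values such as `googlebot: noindex`).

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in {"robots", "googlebot"}:
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())

def has_noindex(html: str, headers: dict[str, str]) -> bool:
    """True if noindex appears in a robots meta tag or the X-Robots-Tag header."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    # Simplification: plain comma-separated values only.
    header_tokens = {token.strip().lower() for token in header_value.split(",")}
    return "noindex" in parser.directives or "noindex" in header_tokens

def noindex_can_take_effect(crawl_allowed: bool, html: str, headers: dict[str, str]) -> bool:
    """A noindex only counts if the crawler is allowed to fetch the page."""
    return crawl_allowed and has_noindex(html, headers)

# Hypothetical page carrying a meta robots noindex.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(noindex_can_take_effect(True, PAGE, {}))   # page crawlable
print(noindex_can_take_effect(False, PAGE, {}))  # page blocked in robots.txt
```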

What is the impact on link tracking?

An often overlooked but crucial point: robots.txt does not prevent Google from taking into account the links that point to the blocked URL. The engine detects these popularity signals and incorporates them into its link graph, even without accessing the content.

On the other hand, if you use noindex without robots.txt, Google crawls the page, reads the directive, removes the URL from the index AND can follow the links present in the content of that page. The PageRank continues to flow through these outgoing links, which can be strategically useful for intermediate pages in your architecture.

Robots.txt blocks crawling but does not prevent the appearance of the URL in the index if backlinks exist
Noindex effectively removes the page from the index but requires prior crawling to be read
Combining robots.txt AND noindex on the same URL creates a technical conflict: the noindex will never be applied
A URL blocked in robots.txt can still consume crawl budget if Googlebot attempts to access it regularly
Link tracking to a URL blocked in robots.txt remains active, contrary to popular belief
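The points above can be summarized as a small decision table. This is a simplified model of the behavior described in this article, not Google's actual logic; the function name and the outcome labels are hypothetical.

```python
from itertools import product

def expected_outcome(blocked_in_robots: bool, has_noindex: bool, has_backlinks: bool) -> str:
    """Simplified model of how the two mechanisms interact for one URL."""
    if blocked_in_robots:
        # The page is never fetched, so a noindex on it stays unread.
        if has_backlinks:
            return "URL may be listed without a snippet"
        return "content not crawled, URL unlikely to be listed"
    if has_noindex:
        return "crawled, then removed from the index"
    return "crawled and eligible for indexing"

# Print every combination of the three flags.
for blocked, noindex, backlinks in product([True, False], repeat=3):
    print(blocked, noindex, backlinks, "->", expected_outcome(blocked, noindex, backlinks))
```

The key property is visible in the first branch: once a URL is blocked, the value of `has_noindex` no longer changes the outcome.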

SEO Expert opinion

Does this statement align with observations in the field?

Yes, and it is a welcome confirmation of a behavior observed for years. In crawl budget audits, we regularly see URLs blocked in robots.txt that continue to appear in server logs: Googlebot attempts to crawl them periodically, especially if they receive new backlinks.

The point on link tracking is less publicly documented but corresponds to tests conducted on high-volume sites. A page blocked in robots.txt with outgoing internal links does not transmit classic PageRank (since it is not crawled), but the external links pointing to it generate detectable signals of popularity for the algorithm.

What nuances should be added to this rule?

Google specifies that noindex "requires the content to be accessible first," but does not detail the processing delay. In practice, a crawled page with noindex can remain visible in the index for several days or even weeks before it is fully deindexed. The delay probably depends on the crawl budget allocated to the site. [To be verified]

Another gray area: what happens if you block a URL in robots.txt AFTER it has been deindexed via noindex? Theoretically, the already applied noindex should maintain deindexation, but Googlebot can no longer re-crawl to confirm the directive. Some practitioners have observed partial reindexing in this scenario.

In what cases does this logic create problems?

The classic scenario: you inherit a site with thousands of pages blocked in robots.txt that the client wants to "cleanly" deindex. Removing these lines from robots.txt so that Googlebot can read the noindex consumes a large amount of crawl budget on URLs without value.

A workaround that is sometimes suggested: send the HTTP header X-Robots-Tag: noindex in the server response, even for URLs blocked in robots.txt. In principle this cannot work, because Googlebot does not request a disallowed URL and therefore never receives the header. Some field reports suggest that Google might still detect it during occasional checks, but this is unconfirmed and should not be relied upon. [To be verified]

Caution: never block in robots.txt pages with a canonical tag pointing to another URL. The canonical will never be read, creating contradictory signals in the index.

Practical impact and recommendations

What concrete steps should be taken to deindex pages?

The clean method: remove the Disallow rules covering these URLs from robots.txt, add a noindex directive (a meta robots tag in the HTML or an X-Robots-Tag HTTP header), and then let Googlebot crawl the pages. Monitor deindexation in Search Console, in the "Coverage" section or with the URL inspection tool.

To speed up the process on large volumes, submit an XML sitemap containing only the URLs to be deindexed. It is counterintuitive, but it encourages Google to recrawl these pages sooner and read the noindex. Remove the sitemap once deindexation is confirmed.
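Such a temporary sitemap can be generated with a few lines of standard-library Python. This is a minimal sketch: the URLs and the output file name are hypothetical, and only the mandatory `<loc>` element of the sitemap protocol is emitted.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls: list[str]) -> bytes:
    """Return a UTF-8 encoded XML sitemap listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

# Hypothetical URLs that now carry a noindex and are no longer blocked.
urls_to_deindex = [
    "https://example.com/old-page-1",
    "https://example.com/old-page-2",
]
sitemap_bytes = build_sitemap(urls_to_deindex)

# Hypothetical file name; delete the file once deindexation is confirmed.
with open("sitemap-deindex.xml", "wb") as handle:
    handle.write(sitemap_bytes)
```

Keeping these URLs in a dedicated file, separate from the main sitemap, makes it easy to withdraw them later without touching the sitemap of pages you do want indexed.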

What errors should be absolutely avoided in this configuration?

Error number one: blocking entire sections in robots.txt (e.g., /blog/) while adding noindex in the templates. The noindex will never be applied. If backlinks point to these URLs, they will appear in the index with empty snippets.

Error number two: using robots.txt to "hide" duplicate or low-quality content. Google does not see the content but still detects the URL via links. It is better to use noindex + allow in robots.txt, or completely remove the pages with 301 redirects to consolidated content.

How can you check that your configuration is consistent?

Audit your robots.txt line by line: every blocked URL must have a valid technical reason (system files, session parameters, duplicate content managed otherwise). If the goal is deindexation, robots.txt is the wrong tool.

Crawl your site with Screaming Frog or Oncrawl in "Googlebot" mode to identify pages with noindex AND blocked in robots.txt. These conflicts are more common than one might think, especially on CMS platforms with poorly configured SEO plugins. Also check the HTTP headers: some servers send X-Robots-Tag: noindex on already blocked URLs, creating unnecessary redundancy.
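If you export the crawl results, the conflict check itself is simple to script. The sketch below assumes a hypothetical mapping of URL to "has noindex" flag (for example, built from a crawler export) and reuses Python's standard `urllib.robotparser`, which only approximates Googlebot's rule matching.

```python
from urllib.robotparser import RobotFileParser

def find_conflicts(
    robots_txt: str,
    noindex_by_url: dict[str, bool],
    user_agent: str = "Googlebot",
) -> list[str]:
    """Return URLs that carry a noindex but are blocked from crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return sorted(
        url
        for url, has_noindex in noindex_by_url.items()
        if has_noindex and not parser.can_fetch(user_agent, url)
    )

# Hypothetical rules and crawl export, for illustration only.
ROBOTS_RULES = """\
User-agent: *
Disallow: /blog/
"""
crawl_export = {
    "https://example.com/blog/draft": True,     # noindex + blocked: conflict
    "https://example.com/blog/archive": False,  # blocked, no noindex
    "https://example.com/landing/old": True,    # noindex, crawlable: fine
}
conflicts = find_conflicts(ROBOTS_RULES, crawl_export)
print(conflicts)
```

Every URL returned by this function is a page whose noindex Googlebot cannot read, so it is a candidate for removal from robots.txt.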

Remove any URL from robots.txt that you truly want to deindex
Implement noindex via a meta tag or X-Robots-Tag header according to your technical stack
Temporarily submit a sitemap of the URLs to deindex to speed up crawling
Monitor deindexation in Search Console with alerts on coverage changes
Regularly audit robots.txt + noindex conflicts with a technical crawler
Document every line of your robots.txt: why is this URL blocked?

Fine management of indexing requires a precise understanding of crawling mechanisms and available directives. Robots.txt controls access, noindex controls presence in the index: these are complementary levers, not interchangeable. On high-volume sites or complex architectures (marketplaces, media, multilingual e-commerce), these configurations can quickly become critical for crawl budget and organic visibility. In light of these technical issues, working with a specialized SEO agency helps avoid costly mistakes and finely optimize each directive according to your business objectives.

❓ Frequently Asked Questions

Can you use robots.txt AND noindex on the same URL?

Technically yes, but it is ineffective: if robots.txt blocks crawling, Googlebot can never read the noindex directive in the page's code. The URL may remain in the index via its backlinks.

Can a page blocked in robots.txt still appear in Google?

Yes, the URL can appear in the results if external links point to it. Google indexes the existence of the URL but displays an empty snippet because it cannot access the content.

How can you quickly deindex thousands of pages blocked in robots.txt?

Remove the lines from robots.txt, add noindex to the relevant templates, then submit an XML sitemap containing these URLs to encourage crawling. Monitor deindexation in Search Console.

Does PageRank flow through a noindex page?

Yes, if the page is crawlable (not blocked in robots.txt), Google follows the outgoing links and PageRank flows normally. This differs from a page blocked in robots.txt, whose outgoing links are not detected.

What is the impact on crawl budget if I unblock thousands of URLs to apply noindex?

Significant short-term impact: Googlebot will crawl these pages heavily to read the directives. Prioritize in batches and monitor server logs to avoid overload. Crawl budget returns to normal once deindexation is complete.
