
Official statement

To prevent a page from appearing in Google's index, use the meta robots tag or the X-Robots-Tag header, but do not block the page in robots.txt. Blocking in robots.txt prevents Googlebot from seeing your deindexing directives.
🎥 Source video

Extracted from a Google Search Central video (in English), published 04/12/2024 (13 statements).
Other statements from this video (12)
  1. Is the meta robots noindex tag really enough to prevent a page from being indexed?
  2. Can you really steer Googlebot News and Googlebot Search with separate meta robots tags?
  3. Can you really stack several meta robots directives in a single tag?
  4. Can the X-Robots-Tag HTTP header replace the meta robots tag?
  5. Where does the robots.txt file really need to be placed for it to be taken into account?
  6. Do you need a separate robots.txt for each subdomain?
  7. Is the robots.txt file really respected by all search engines?
  8. Should you use wildcards in robots.txt to better control crawling?
  9. Should you really declare your XML sitemap in the robots.txt file?
  10. Why should you never combine robots.txt and meta noindex on the same page?
  11. Does robots.txt really block your pages from being indexed?
  12. Does Google Search Console's robots.txt report really change the game for crawling?
TL;DR

Blocking a page via robots.txt does not deindex it — in fact, it prevents Googlebot from reading your noindex directives. To deindex properly, use meta robots or X-Robots-Tag, never robots.txt. It's a frequent confusion, and one that costs sites dearly in visibility.

What you need to understand

What is the robots.txt trap for deindexing?

The robots.txt file blocks crawling, not indexing. If Googlebot cannot access a page, it also cannot see your noindex tag. Result: the page can remain indexed, with its URL and sometimes a snippet generated from external sources.

Google can index a URL without even crawling the page — based on backlinks, mentions, or cached versions. Blocking the crawl does nothing if the page is already known to the search engine.
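
To make the distinction concrete, here is what a typical crawl block looks like in robots.txt (the path is a hypothetical example). It stops Googlebot from fetching the page, nothing more; the URL itself can still end up in the index:

  # Blocks crawling of this path only; it does not remove the URL from Google's index
  User-agent: *
  Disallow: /private-page.html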

How do meta robots and X-Robots-Tag actually work?

The meta robots noindex tag is placed in the <head> of an HTML page. The X-Robots-Tag is sent by the server as an HTTP response header (useful for PDFs, images, and other non-HTML files). In both cases, Googlebot must be able to crawl the page to read the directive.
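
In practice, the two directives look like this (minimal sketch, hypothetical values): the first lives in the HTML document, the second travels with the HTTP response and therefore also works for files that have no <head> at all.

  <!-- In the <head> of the HTML page -->
  <meta name="robots" content="noindex">

  # Sent by the server as an HTTP response header (works for PDFs, images, etc.)
  X-Robots-Tag: noindex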

Once the directive is detected, Google removes the page from its index on the next pass. If you then block crawling in robots.txt, the directive remains effective — but it's better to let Googlebot come back periodically to verify that it is still in place.

What happens if I block a noindex page in robots.txt?

Googlebot can no longer verify whether the noindex directive is still in place. If you later remove the noindex tag but keep the robots.txt block, the page can end up reindexed against your intent, because Google has no way of confirming what you actually want.

  • robots.txt blocks crawling, not indexing — a frequent confusion.
  • meta robots noindex and X-Robots-Tag are the only reliable methods for deindexing.
  • Googlebot must be able to crawl the page to read deindexing directives.
  • Blocking after deindexing works, but prevents future verification of directives.
  • A page blocked in robots.txt can still appear in SERPs if Google knows about it through other means.

SEO Expert opinion

Is this directive consistent with real-world observations?

Yes — and it's even one of the rare Google statements perfectly aligned with reality. We regularly see websites block sensitive pages (dev, staging, admin) only via robots.txt, then wonder why those pages show up indexed, URL visible, in the SERPs.

The confusion often comes from the fact that robots.txt looks like a way to forbid Google from doing anything with a page. But the file only controls access to the content, not presence in the index. If a URL is mentioned elsewhere on the web, Google can index it without ever crawling it.

What nuances should be applied to this rule?

If a page has never been crawled or known to Google, blocking it in robots.txt is sufficient to prevent future indexing. But as soon as it is discovered — through a backlink, sitemap, internal link — the block becomes counterproductive.

Another case: orphan pages carrying a noindex but not blocked in robots.txt. If Googlebot never finds them (no internal links, no sitemap entry), the noindex directive is never read — Google must first access the page to see the tag.

And let's be honest: some third-party tools (SEO crawlers, scrapers) ignore robots.txt. Blocking a sensitive page only via this file means betting on bot goodwill. You might as well add server authentication if confidentiality is critical.
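
If confidentiality really matters, a password prompt is far more reliable than a Disallow line. As a minimal sketch, assuming an Nginx server and a hypothetical staging hostname:

  # nginx: require HTTP Basic authentication on the whole staging host
  server {
      server_name staging.example.com;

      auth_basic           "Restricted";
      auth_basic_user_file /etc/nginx/.htpasswd;

      # ... rest of the server configuration ...
  }

Bots that ignore robots.txt still cannot read the content behind the login.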

In what cases does this approach cause problems?

Sites with massive duplicate content (e-commerce sites with filters, misconfigured multilingual sites) may want to block certain URLs in robots.txt to save crawl budget. The problem: if these pages are already indexed, the block freezes the situation — it becomes impossible to push a noindex afterward.

The proper solution: first apply the noindex, wait for deindexing (days to weeks depending on crawl frequency), then block in robots.txt if necessary. Or better: use canonical tags to consolidate indexing on the right URLs rather than multiplying blocks, as sketched below.
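
For the duplicate-content scenario, the canonical route looks like this (hypothetical URLs): on each filtered or parameterized variant, declare the main version instead of blocking the variant outright.

  <!-- On /category/shoes/?color=red&sort=price -->
  <link rel="canonical" href="https://www.example.com/category/shoes/">

Googlebot keeps crawling the variants but can consolidate indexing signals on the canonical URL.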

Warning: If you migrate a site and the old domain blocked pages in robots.txt, these pages can be reindexed after migration if the new server doesn't apply the same rules. Always check noindex directives on the server side, not just robots.txt.
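
A quick way to run that server-side check, as a rough Python sketch (the URL is a placeholder; adapt it to the migrated pages you care about):

  # Fetch a migrated URL and confirm the noindex directive is actually served,
  # either as an X-Robots-Tag header or inside the HTML.
  from urllib.request import urlopen

  url = "https://www.example.com/old-blocked-page/"
  with urlopen(url) as resp:
      x_robots = resp.headers.get("X-Robots-Tag", "")
      body = resp.read().decode("utf-8", errors="ignore")

  print("X-Robots-Tag:", x_robots or "(absent)")
  print("noindex found in HTML:", "noindex" in body.lower())

The string check is deliberately crude; a real audit would parse the meta robots tag properly and cover every migrated URL, not just one.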

Practical impact and recommendations

What should you concretely do to deindex a page?

Add <meta name="robots" content="noindex"> in the <head> of the HTML page. For non-HTML files (PDFs, images), configure the X-Robots-Tag: noindex header at the server level (Apache, Nginx, or via CDN rules).
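
At the server level, the header can be attached per file type. A minimal sketch for PDFs, assuming Apache with mod_headers enabled, or alternatively Nginx:

  # Apache (e.g. in the vhost or .htaccess, mod_headers required)
  <FilesMatch "\.pdf$">
      Header set X-Robots-Tag "noindex"
  </FilesMatch>

  # Nginx equivalent
  location ~* \.pdf$ {
      add_header X-Robots-Tag "noindex";
  }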

Verify that the page is not blocked in robots.txt. If it is, remove the block temporarily until Googlebot crawls and reads the noindex directive. Once the page is deindexed (verifiable via site:example.com/url), you can block again if you want to save crawl budget — but it's not mandatory.
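
To check the robots.txt side quickly, Python's standard library ships a robots.txt parser. A small sketch (hypothetical URLs), keeping in mind that it only approximates Google's matching rules:

  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser("https://www.example.com/robots.txt")
  rp.read()

  url = "https://www.example.com/page-to-deindex/"
  if rp.can_fetch("Googlebot", url):
      print("OK: Googlebot can crawl the page and read the noindex directive.")
  else:
      print("Blocked: lift the robots.txt rule or the noindex will never be seen.")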

What errors should you absolutely avoid?

Never block a page in robots.txt thinking it will disappear from the index. The opposite happens: it remains indexed with a visible URL, sometimes with a snippet generated from external links or anchors.

Also avoid combining noindex with a canonical pointing to another page. The two signals contradict each other — Google may favor the canonical and ignore the noindex, so the result is unpredictable. If you want to deindex, do it properly with noindex alone, without a contradictory canonical.
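
Here is the contradictory combination in question, shown only so you can spot it in your templates (hypothetical URL); it should never ship like this:

  <!-- Conflicting signals on the same page: avoid -->
  <meta name="robots" content="noindex">
  <link rel="canonical" href="https://www.example.com/other-page/">

Pick one intent per page: a noindex if the page must leave the index, a canonical if it should transfer its signals to another URL.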

How do you verify that deindexing is working?

Use site:example.com/exact-url in Google. If the page still appears, wait a few days — deindexing is not instantaneous. You can also force a re-crawl via Search Console (URL Inspection → Request indexing).

Check the server logs to confirm that Googlebot is accessing the page. If the bot never visits it, the noindex directive will never be read — and the page will remain indexed indefinitely.
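
Assuming a standard access log (the log path and page are hypothetical), a rough scan like this tells you whether Googlebot has requested the page since the noindex went live:

  # Count Googlebot requests for the page in the access log
  path = "/page-to-deindex/"
  with open("/var/log/nginx/access.log") as log:
      hits = [line for line in log if "Googlebot" in line and path in line]

  print(f"Googlebot requests for {path}: {len(hits)}")

Keep in mind that the user-agent string can be spoofed; a stricter check also verifies that the requesting IP really resolves to Google.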

  • Add meta robots noindex or X-Robots-Tag to pages you want to deindex.
  • Make sure these pages are not blocked in robots.txt.
  • Wait for Googlebot to crawl and process the directive (days to weeks).
  • Verify deindexing with site: in Google or via Search Console.
  • If the page remains indexed, inspect logs to confirm crawling and check for contradictory canonical tags.
  • Once deindexed, you can block in robots.txt to save crawl budget — but it's not mandatory.
The rule is simple: noindex to deindex, robots.txt to block crawling. Confusing the two creates uncontrollable situations where sensitive pages remain visible in SERPs. If you manage a site with thousands of pages, multiple environments (staging, dev), or duplicate content issues, a specialized SEO agency can help you map URLs at risk, audit your indexation directives, and implement a robust architecture — to avoid unpleasant surprises in search results.

❓ Frequently Asked Questions

Can you block a page in robots.txt after deindexing it with noindex?
Yes, but it is only useful for saving crawl budget. Once the page is deindexed, Googlebot no longer needs to access it, except to periodically verify that the noindex directive is still in place.
Can a page blocked in robots.txt appear in the SERPs?
Yes, if Google knows the URL through backlinks or old cached versions. The page will be displayed with its URL and sometimes a snippet generated from external sources, even though Googlebot has never crawled the current content.
Should you remove the robots.txt block before applying a noindex?
Yes, absolutely. If the page is blocked in robots.txt, Googlebot cannot crawl it and read the noindex tag. Remove the block, wait for the crawl and the deindexing, then block again if necessary.
What is the difference between meta robots and X-Robots-Tag?
Meta robots is placed in the HTML (<head>); X-Robots-Tag is an HTTP header configured at the server level. Both have the same effect, but X-Robots-Tag also works for non-HTML files (PDFs, images).
How long does it take for a page to disappear from the index after a noindex?
It depends on your site's crawl frequency. In general, between a few days and a few weeks. You can force a re-crawl via Search Console to speed up the process.
