Official statement
Google may display URLs in its results even if they are blocked by robots.txt, especially when these pages are considered important for users. The robots.txt file controls crawling, not indexing, a crucial distinction that many SEOs still confuse. In practical terms: blocking a URL via robots.txt does not prevent it from appearing in the SERP; it only prevents Google from accessing its content.
What you need to understand
What's the difference between crawl blocking and indexing blocking?
robots.txt controls what Googlebot can crawl, not what it can index. This technical nuance still eludes many practitioners. When you block a URL via robots.txt, you prevent the bot from accessing the content of the page, but you do not prevent Google from keeping that URL in its index.
Google can therefore index the URL itself if it is mentioned elsewhere on the web—typically through external backlinks. The entry in the SERP will then display the URL and a generic snippet such as "A description of this page is not available due to the robots.txt file." Not appealing for the user, but technically indexed.
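As a minimal illustration (the path is hypothetical), a rule like this blocks crawling of a section without saying anything about indexing:

User-agent: *
Disallow: /private/

If /private/login is linked from other sites, Google can still list the bare URL with the generic "no description available" snippet, because the rule only forbids fetching the content, not referencing the address.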
Why does Google index pages blocked by robots.txt?
Google's logic is based on user relevance. If a page generates a lot of external signals—backlinks, mentions, direct searches—Google considers it worthy of appearing in the results, even if the content is not accessible to the bot.
The example of the AdWords login page perfectly illustrates this scenario. This page was blocked by robots.txt, but was so frequently searched and linked that Google kept it in the index. For users, finding the login link was more important than accessing a description of the page.
How does Google decide which blocked pages deserve to be indexed?
The official documentation remains vague on the selection criteria. Google mentions "importance to users" without detailing a specific threshold. It is known that backlinks play a major role—a URL without external links is unlikely to be indexed if it is blocked by robots.txt.
Direct queries (searches for the exact name of the URL or brand) also seem to weigh in. If thousands of people search for "adwords login" every day, Google deems it legitimate to offer the corresponding URL, even if it is blocked from crawling.
- Robots.txt blocks crawling, not indexing—the URL can appear in the SERP without Googlebot having accessed the content
- Google indexes blocked pages deemed important for users, especially through backlinks and direct searches
- The displayed snippet will be generic and uninformative, due to lack of access to the page's content
- To completely prevent indexing, use noindex (meta tag or HTTP header), not robots.txt
- The combination of robots.txt + noindex is impossible—Googlebot cannot read the noindex directive if it does not crawl the page
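To illustrate the last point above, a sketch of the self-defeating combination, with hypothetical paths:

In robots.txt:
User-agent: *
Disallow: /old-products/

On /old-products/item-1.html, inside <head>:
<meta name="robots" content="noindex">

Because crawling of /old-products/ is forbidden, Googlebot never fetches the page and therefore never sees the noindex tag; the URL can remain indexed indefinitely.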
SEO Expert opinion
Is this statement consistent with real-world observations?
Yes, and it has been documented for a long time in official guidelines. The problem is that many SEOs still confuse the two concepts. I regularly encounter audits where sensitive pages (backend, admin, parameters) are only blocked by robots.txt, under the belief that this is enough to remove them from the index.
In practice, these pages often appear in the SERP if they have incoming links—even internal ones. A simple link in a footer, a mention in a sitemap file (ironically), and Google indexes the URL despite the crawl block. The snippet may be empty, but the URL lingers in the results, which raises privacy concerns and dilutes the crawl budget on irrelevant queries.
What nuances should be added to this rule?
The notion of "importance to users" remains subjective and opaque. [To be verified] Google does not publish any quantitative thresholds: how many backlinks? What volume of direct searches? We are flying blind. In my tests, I have seen URLs with 3-4 backlinks from average sites remain indexed for months after robots.txt blocking, while others with a single link from a major site disappeared quickly.
Another point: the speed of de-indexing. When you block an already indexed URL, Google does not remove it immediately. The delay varies greatly—from a few days to several weeks, or even months for historical pages. If you want a quick removal, go through the Search Console with a URL removal request, in addition to the blocking.
In what cases does this rule pose a problem?
The classic scenario: you want to cleanly de-index an entire section of your site (old products, test pages, duplicate content). If you only block via robots.txt, Google will keep the URLs in the index as long as they receive external signals. The result: dozens, even hundreds of "ghost" pages that pollute your presence in the SERP.
The recommended workaround—noindex before robots.txt—creates another timing issue. You first have to allow Googlebot to crawl the pages with the noindex directive, wait for complete de-indexation (checkable in the Search Console), and only then block crawling if necessary. Many sites skip this step and end up stuck with indexed URLs without a clean way to remove them.
Practical impact and recommendations
What should you do concretely to control indexing?
If your goal is to stop indexing, forget robots.txt. Use a noindex meta tag in the <head> of the page, or an HTTP header X-Robots-Tag: noindex for non-HTML files (PDFs, images). These directives explicitly tell Google not to index the resource, even if it is crawlable.
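For reference, a minimal sketch of both forms (the file pattern and nginx syntax below are illustrative assumptions, not the only way to do it):

In the page's <head>:
<meta name="robots" content="noindex">

For non-HTML files, e.g. PDFs served by nginx:
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex";
}

The meta tag only works on HTML documents; the X-Robots-Tag header is the equivalent for any resource type.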
For entire sections (complete directories, URL parameters), a pattern in robots.txt can block crawling—but only after complete de-indexation via noindex. The correct sequence: (1) add noindex, (2) wait for Googlebot to pass and check de-indexation in Search Console, (3) optionally block crawling to save budget.
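As a sketch of step (3), once Search Console confirms the section is out of the index, a directory-level rule (hypothetical paths) stops further crawling:

User-agent: *
Disallow: /old-products/
Disallow: /*?sessionid=

Applied any earlier, these same lines would lock the noindex directive out of Googlebot's reach and freeze the URLs in the index.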
What mistakes should be avoided in managing robots.txt vs indexing?
Mistake number one: blocking already indexed pages via robots.txt in the hope that they will disappear from the SERP. It does not work, or only very slowly and unpredictably. You end up with ghost URLs that occupy the index for weeks.
Second trap: adding noindex on pages blocked by robots.txt. Google cannot read the directive, so it remains ineffective. I've seen sites leaving this configuration for months, convinced that the noindex would do its job. Spoiler: no.
Third mistake: not monitoring the actual index via Search Console. The site: command in Google is unreliable. Use the "Coverage" report in Search Console to identify URLs indexed despite a robots.txt block—it's a warning signal.
How to check if your indexing strategy is correct?
Start with an audit of the current index. In Search Console, filter pages by status "Indexed, not submitted in the sitemap" and cross-reference with your robots.txt file. Any blocked URL that appears here deserves investigation—either you de-index it properly via noindex, or you accept its presence and optimize its snippet.
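A minimal Python sketch of this cross-check, assuming you export the indexed URLs from Search Console into a text file (the domain and file name below are placeholders):

import urllib.robotparser

# Parse the live robots.txt of the site being audited (placeholder domain)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# urls.txt: one indexed URL per line, exported from Search Console
with open("urls.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        # Indexed but disallowed for Googlebot: investigate (proper noindex, or accept it)
        if not rp.can_fetch("Googlebot", url):
            print("Indexed but blocked by robots.txt:", url)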
Also check the incoming backlinks to these blocked pages. If external sites continue to link to URLs you want to remove, it will slow down or prevent de-indexation. Contact webmasters to remove the links, or use the disavow tool as a last resort if it’s spam.
- Use noindex (meta tag or HTTP header) to prevent indexing, never robots.txt alone
- Apply noindex first, check for complete de-indexation, then block crawling if necessary
- Regularly monitor the index via Search Console, not via the site: operator, which is only approximate
- Identify and address backlinks to the pages you want to de-index, as they slow down the process
- For a quick removal of already indexed URLs, use the Search Console's URL removal tool in addition to the blocking
- Never combine robots.txt and noindex simultaneously on the same resources
❓ Frequently Asked Questions
If I block a page via robots.txt, can Google still index it?
How do I completely prevent a page from appearing in Google?
Can robots.txt and noindex be combined on the same page?
How long does it take for a page blocked by robots.txt to disappear from the index?
Why does Google index some blocked pages and not others?
🎥 Source: Google Search Central video · duration 57 min · published on 26/09/2019 · full video available on YouTube