Official statement
Google may display URLs in its results even if they are blocked by robots.txt, especially when these pages are considered important for users. The robots.txt file controls crawling, not indexing, a crucial distinction that many SEOs still confuse. In practical terms: blocking a URL via robots.txt does not prevent it from appearing in the SERP; it only prevents Google from accessing its content.
What you need to understand
What's the difference between crawl blocking and indexing blocking?
robots.txt controls what Googlebot can crawl, not what it can index. This technical nuance still eludes many practitioners. When you block a URL via robots.txt, you prevent the bot from accessing the content of the page, but you do not prevent Google from keeping that URL in its index.
Google can therefore index the URL itself if it is mentioned elsewhere on the web—typically through external backlinks. The entry in the SERP will then display the URL and a generic snippet such as "A description of this page is not available due to the robots.txt file." Not appealing for the user, but technically indexed.
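As a minimal illustration (the path is hypothetical), a rule like this blocks crawling of a section without saying anything about indexing:

User-agent: *
Disallow: /private/

If /private/login is linked from other sites, Google can still list the bare URL with the generic "no description available" snippet, because the rule only forbids fetching the content, not referencing the address.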
Why does Google index pages blocked by robots.txt?
Google's logic is based on user relevance. If a page generates a lot of external signals—backlinks, mentions, direct searches—Google considers it worthy of appearing in the results, even if the content is not accessible to the bot.
The example of the AdWords login page perfectly illustrates this scenario. This page was blocked by robots.txt, but was so frequently searched and linked that Google kept it in the index. For users, finding the login link was more important than accessing a description of the page.
How does Google decide which blocked pages deserve to be indexed?
The official documentation remains vague on the selection criteria. Google mentions "importance to users" without detailing a specific threshold. It is known that backlinks play a major role—a URL without external links is unlikely to be indexed if it is blocked by robots.txt.
Direct queries (searches for the exact name of the URL or brand) also seem to weigh in. If thousands of people search for "adwords login" every day, Google deems it legitimate to offer the corresponding URL, even if it is blocked from crawling.
- Robots.txt blocks crawling, not indexing—the URL can appear in the SERP without Googlebot having accessed the content
- Google indexes blocked pages deemed important for users, especially through backlinks and direct searches
- The displayed snippet will be generic and uninformative, due to lack of access to the page's content
- To completely prevent indexing, use noindex (meta tag or HTTP header), not robots.txt
- The combination of robots.txt + noindex is impossible—Googlebot cannot read the noindex directive if it does not crawl the page
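To illustrate the last point above, a sketch of the self-defeating combination, with hypothetical paths:

In robots.txt:
User-agent: *
Disallow: /old-products/

On /old-products/item-1.html, inside <head>:
<meta name="robots" content="noindex">

Because crawling of /old-products/ is forbidden, Googlebot never fetches the page and therefore never sees the noindex tag; the URL can remain indexed indefinitely.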
SEO Expert opinion
Is this statement consistent with real-world observations?
Yes, and it has been documented for a long time in official guidelines. The problem is that many SEOs still confuse the two concepts. I regularly encounter audits where sensitive pages (backend, admin, parameters) are only blocked by robots.txt, under the belief that this is enough to remove them from the index.
In practice, these pages often appear in the SERP if they have incoming links—even internal ones. A simple link in a footer, a mention in a sitemap file (ironically), and Google indexes the URL despite the crawl block. The snippet may be empty, but the URL lingers in the results, which raises privacy concerns and dilutes the crawl budget on irrelevant queries.
What nuances should be added to this rule?
The notion of "importance to users" remains subjective and opaque. [To be verified] Google does not publish any quantitative thresholds: how many backlinks? What volume of direct searches? We are flying blind. In my tests, I have seen URLs with 3-4 backlinks from average sites remain indexed for months after robots.txt blocking, while others with a single link from a major site disappeared quickly.
Another point: the speed of de-indexing. When you block an already indexed URL, Google does not remove it immediately. The delay varies greatly—from a few days to several weeks, or even months for historical pages. If you want a quick removal, go through the Search Console with a URL removal request, in addition to the blocking.
In what cases does this rule pose a problem?
The classic scenario: you want to cleanly de-index an entire section of your site (old products, test pages, duplicate content). If you only block via robots.txt, Google will keep the URLs in the index as long as they receive external signals. The result: dozens, even hundreds of "ghost" pages that pollute your presence in the SERP.
The recommended workaround—noindex before robots.txt—creates another timing issue. You first have to allow Googlebot to crawl the pages with the noindex directive, wait for complete de-indexation (checkable in the Search Console), and only then block crawling if necessary. Many sites skip this step and end up stuck with indexed URLs without a clean way to remove them.
Practical impact and recommendations
What should you do concretely to control indexing?
If your goal is to stop indexing, forget robots.txt. Use a noindex meta tag in the <head> of the page, or an HTTP header X-Robots-Tag: noindex for non-HTML files (PDFs, images). These directives explicitly tell Google not to index the resource, even if it is crawlable.
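For reference, a minimal sketch of both forms (the file pattern and nginx syntax below are illustrative assumptions, not the only way to do it):

In the page's <head>:
<meta name="robots" content="noindex">

For non-HTML files, e.g. PDFs served by nginx:
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex";
}

The meta tag only works on HTML documents; the X-Robots-Tag header is the equivalent for any resource type.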
For entire sections (complete directories, URL parameters), a pattern in robots.txt can block crawling—but only after complete de-indexation via noindex. The correct sequence: (1) add noindex, (2) wait for Googlebot to pass and check de-indexation in Search Console, (3) optionally block crawling to save budget.
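As a sketch of step (3), once Search Console confirms the section is out of the index, a directory-level rule (hypothetical paths) stops further crawling:

User-agent: *
Disallow: /old-products/
Disallow: /*?sessionid=

Applied any earlier, these same lines would lock the noindex directive out of Googlebot's reach and freeze the URLs in the index.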
What mistakes should be avoided in managing robots.txt vs indexing?
Mistake number one: blocking already indexed pages via robots.txt in the hope that they will disappear from the SERP. It does not work, or only very slowly and unpredictably. You end up with ghost URLs that occupy the index for weeks.
Second trap: adding noindex on pages blocked by robots.txt. Google cannot read the directive, so it remains ineffective. I've seen sites leaving this configuration for months, convinced that the noindex would do its job. Spoiler: no.
Third mistake: not monitoring the actual index via Search Console. The site: command in Google is unreliable. Use the "Coverage" report in Search Console to identify URLs indexed despite a robots.txt block—it's a warning signal.
How to check if your indexing strategy is correct?
Start with an audit of the current index. In Search Console, filter pages by status "Indexed, not submitted in the sitemap" and cross-reference with your robots.txt file. Any blocked URL that appears here deserves investigation—either you de-index it properly via noindex, or you accept its presence and optimize its snippet.
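A minimal Python sketch of this cross-check, assuming you export the indexed URLs from Search Console into a text file (the domain and file name below are placeholders):

import urllib.robotparser

# Parse the live robots.txt of the site being audited (placeholder domain)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# urls.txt: one indexed URL per line, exported from Search Console
with open("urls.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        # Indexed but disallowed for Googlebot: investigate (proper noindex, or accept it)
        if not rp.can_fetch("Googlebot", url):
            print("Indexed but blocked by robots.txt:", url)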
Also check the incoming backlinks to these blocked pages. If external sites continue to link to URLs you want to remove, it will slow down or prevent de-indexation. Contact webmasters to remove the links, or use the disavow tool as a last resort if it’s spam.
- Use noindex (meta tag or HTTP header) to prevent indexing, never robots.txt alone
- Apply noindex first, check for complete de-indexation, then block crawling if necessary
- Regularly monitor the index via Search Console, not via the site: operator, which is only approximate
- Identify and address backlinks to the pages you want to de-index, as they slow down the process
- For a quick removal of already indexed URLs, use the Search Console's URL removal tool in addition to the blocking
- Never combine robots.txt and noindex simultaneously on the same resources
❓ Frequently Asked Questions
If I block a page via robots.txt, can Google still index it?
How do I completely prevent a page from appearing in Google?
Can robots.txt and noindex be combined on the same page?
How long does it take for a page blocked by robots.txt to disappear from the index?
Why does Google index some blocked pages and not others?
🎥 Source: Google Search Central video · duration 57 min · published on 26/09/2019 · full video available on YouTube