
Official statement

Google can index the URL even if its content is blocked by robots.txt, assuming that the content might be relevant based on the internal or external links pointing to it.
🎥 Source video

Extracted from a Google Search Central video

⏱ 58:11 💬 EN 📅 28/11/2019 ✂ 13 statements
Watch on YouTube (9:52) →
Other statements from this video (12)
  1. 2:08 Are JavaScript links really followed by Google?
  2. 3:42 Do you really need to adjust the crawl rate to handle a traffic spike like Black Friday?
  3. 11:01 Should you limit the number of links on the homepage to concentrate PageRank?
  4. 15:03 Do well-ranked category pages really pass authority to the pages they link to?
  5. 15:44 Is SearchAction markup really enough to get the Sitelinks search box?
  6. 20:25 How does Search Console actually calculate the average position of your rich results?
  7. 24:54 Why does Google refuse to name its SERP display formats?
  8. 31:30 Does JavaScript lazy loading really block Google from indexing your content?
  9. 39:29 Do you really need to display a date on every page to rank well?
  10. 39:46 Is CrUX really enough to measure your site's user experience?
  11. 41:00 Is Search Console's mobile-friendly test reliable?
  12. 52:55 Why do dynamic URLs still cause problems for Google?
TL;DR

Google can index a URL even if its content is blocked by robots.txt, relying on external and internal signals pointing to it. The engine assumes that the page might be relevant without having access to the actual content. For SEO, this means that a robots.txt block does not prevent indexing—and that this indexing will occur without control over the title, description, or keywords.

What you need to understand

How does Google index a page whose content is blocked?

When robots.txt blocks access to the content of a URL, Googlebot cannot crawl the page. It cannot see the text, the title/meta tags, or the HTML structure.

However, if this URL receives internal links or external backlinks, Google still discovers it. The engine then decides to index it based solely on the available signals: anchor text, context of the pages pointing to it, domain authority.

The result in the SERPs? An indexed URL, but displayed with a generic snippet such as “No information available for this page” or with the anchor text as a placeholder title. Zero editorial control.

Why does Google index a page it can't read?

Because indexing and crawling are two distinct processes. Indexing is the addition of a URL to the index — not necessarily with its content.

Google considers that if a URL receives links, it potentially has value. The engine prefers to keep it in its index, even if empty, rather than ignore it completely. It's a logic of maximum coverage: better to have an incomplete listing than a black hole in the web graph.

In practice? If your robots.txt file blocks /admin/ but ten external sites link to /admin/dashboard, that URL can end up in the index. Without a description, without a title, but present.
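
To make this concrete, here is a minimal robots.txt sketch of the situation described above (the /admin/ paths are illustrative, not taken from any real site):

```
# robots.txt: blocks crawling of /admin/, but says nothing about indexing
User-agent: *
Disallow: /admin/

# If internal or external links point to /admin/dashboard, Google can still
# index that URL from those signals alone; it simply never fetches the content.
```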

How does this differ from a noindex tag?

The noindex tag explicitly requests Google not to index the page. But for the engine to read this directive, it must be able to crawl the page.

If robots.txt blocks access, Googlebot never sees the noindex tag. Result: the page can be indexed despite the directive, because the bot never had the chance to read it. It's the classic trap: a robots.txt block combined with a noindex tag does not work, because the crawl block prevents the noindex from ever being read.

To properly un-index a page, you must allow crawling (no robots.txt block) and place the noindex tag in the HTML. Once the page is un-indexed, you can re-add the robots.txt block if necessary.
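
As an illustration, the sequence looks like this (a sketch; the page is hypothetical, and the meta tag can only work if Googlebot is allowed to crawl the URL while the tag is in place):

```html
<!-- Step 1: make sure robots.txt does NOT disallow this URL -->
<!-- Step 2: serve the noindex directive in the page's <head> -->
<head>
  <meta name="robots" content="noindex">
</head>
<!-- Step 3: only once Search Console shows the URL as no longer indexed
     can you optionally re-add a Disallow rule in robots.txt -->
```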

  • Robots.txt blocks crawling but does not prevent indexing if the URL receives links
  • Noindex prevents indexing but requires the page to be crawlable so the directive can be read
  • A URL blocked by robots.txt can appear in the SERPs with a generic snippet
  • Google indexes based on external signals (anchors, context) when the content is inaccessible
  • Combining robots.txt + noindex is counterproductive and creates conflicting directives

SEO Expert opinion

Is this statement consistent with field observations?

Yes, and it is even a long-documented truth. In Search Console we regularly observe indexed URLs with the status “Indexed, though blocked by robots.txt”. This is not a bug; it is the engine's normal behavior.

However, Google remains deliberately vague about the exact criteria that trigger this indexing. How many links are needed? What anchor weight? What minimum authority? No numerical data. We remain in the realm of “assuming that the content might be relevant” — a cautious formulation that gives the engine full latitude.

[To be verified]: Google does not specify whether this indexing consumes crawl budget during re-crawl attempts. It is assumed that it does, as the bot periodically checks if robots.txt has changed. But the real impact on high-volume sites remains to be documented.

In what situations does this mechanism really pose a problem?

The major risk is involuntary exposure. Sensitive URLs (admin, staging, test parameters) can surface in the SERPs if they receive links — even internal, even accidental.

Second case: pagination or filter pages blocked by robots.txt to limit crawling. If they receive backlinks (forums, aggregators), they get indexed with an empty snippet. Result: indexed URLs that dilute your presence in the SERPs without adding value.
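
A typical rule looks like the sketch below (the filter= parameter name is illustrative); note that it only stops crawling, not the indexing of filter URLs that earn links:

```
# robots.txt: a common attempt to keep faceted / filter URLs out of the crawl
User-agent: *
Disallow: /*?filter=
Disallow: /*&filter=

# Googlebot will not fetch these URLs, but any that receive backlinks
# can still be indexed with an empty snippet.
```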

If you block URLs in robots.txt to save crawl budget, check in Search Console that those URLs are not being indexed anyway. The block guarantees nothing if links point to them.

Should you rethink your strategy for managing sensitive files?

Let's be honest: robots.txt has never been a security tool. It’s a crawl directive, not a firewall. If you have sensitive content, the only viable protection is server authentication (htaccess, login, IP whitelisting).
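
For example, here is a minimal .htaccess sketch for Apache that puts a sensitive directory behind HTTP Basic Auth (the .htpasswd path is a placeholder, and mod_auth_basic must be enabled on the server):

```apache
# .htaccess placed in the sensitive directory (e.g. /admin/)
# The .htpasswd path below is a placeholder; keep that file outside the web root.
AuthType Basic
AuthName "Restricted area"
AuthUserFile /var/www/.htpasswd
Require valid-user
```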

For non-sensitive content that you want to keep out of the index, the winning combination remains: keep crawling open, place a noindex tag, wait for un-indexing, then optionally add a robots.txt block if you want to limit bot traffic. In this order, never the reverse.

And this is where many sites struggle: managing these trade-offs between crawl budget, strategic indexing, and security requires a holistic view that few teams have in-house. Sequencing errors (blocking before un-indexing) are extremely common.

Practical impact and recommendations

What should you do if blocked URLs appear in the index?

First step: identify the affected URLs in Search Console, under Indexing > Pages. Filter by the status “Indexed, though blocked by robots.txt”. If the list is empty, you're clean. If it contains dozens or hundreds of entries, there is a problem.

Then, analyze the backlinks to these URLs using a tool like Ahrefs, Semrush, or Majestic. Often, the culprit is an internal link forgotten in a menu, footer, or previous campaign. Remove these links if possible — the URL will lose its reason for being indexed.

If the links are external and cannot be removed, temporarily unblock crawling in robots.txt, add a noindex tag in the HTML of the page, wait for un-indexing (a few days to a few weeks), then re-block if necessary.

How can you avoid this problem in the future?

Audit your robots.txt before each production push. List what you are blocking and ask yourself: “Can these URLs receive links?” If yes, the robots.txt block alone will not suffice.
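
As a quick pre-release check, a short script along these lines can list which URLs your robots.txt actually blocks for Googlebot (a sketch: the domain and paths are placeholders to replace with your own):

```python
from urllib.robotparser import RobotFileParser

# Placeholders: replace with your own domain and the paths you intend to block.
SITE = "https://www.example.com"
PATHS_TO_CHECK = [
    "/admin/dashboard",
    "/search?filter=red",
    "/blog/some-article",
]

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the live robots.txt

for path in PATHS_TO_CHECK:
    blocked = not parser.can_fetch("Googlebot", f"{SITE}{path}")
    print(f"{path}: {'blocked from crawling' if blocked else 'crawlable'}")
    if blocked:
        # A crawl block does not prevent indexing if links point to this URL.
        print("   -> can still be indexed if internal or external links point to it")
```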

For sensitive content, implement server authentication (htaccess, OAuth, IP whitelisting). For non-sensitive content that you want to keep out of SEO (facets, filters, parameters), use noindex in the HTML — not robots.txt.

Monitor the “Indexed, though blocked by robots.txt” status in Search Console. If this counter increases, you will know immediately that a problem is developing.

What critical mistakes must absolutely be avoided?

Never block a URL in robots.txt AND rely on a noindex tag on that same URL. The crawl block wins: the bot cannot read the noindex tag, so the URL can remain indexed anyway.

Never assume that robots.txt protects confidential content. It’s a public file, readable by anyone. If /admin/ is listed in robots.txt, you literally signal to hackers where to look.

Do not confuse “not crawled” and “not indexed.” A URL can be indexed without ever being crawled, solely based on external signals. This is exactly what this statement from Google describes.

  • Check the “Indexed, though blocked by robots.txt” status in Search Console monthly
  • Audit the backlinks to blocked URLs to identify the sources of links
  • Remove accidental internal links to blocked pages
  • Use noindex (not robots.txt) to properly un-index non-sensitive content
  • Implement server authentication for truly sensitive content
  • Monitor Search Console for changes in the number of indexed URLs blocked by robots.txt

Fine-grained indexing management (balancing robots.txt, noindex, canonicals, and link signals) requires sharp technical expertise. If your site manages thousands of URLs, complex facets, or sensitive content, these optimizations can quickly become a headache. A specialized SEO agency can help you audit your architecture, correct directive conflicts, and establish solid long-term SEO governance.

❓ Frequently Asked Questions

Does robots.txt prevent a page from being indexed?
No. Robots.txt blocks crawling, but Google can index the URL if it receives links, relying on external signals. The page will appear in the SERPs with a generic snippet.
How do you un-index a page already blocked by robots.txt?
Temporarily unblock crawling in robots.txt, add a noindex tag in the HTML, wait for un-indexing, then re-block if necessary. Never combine a robots.txt block and noindex at the same time.
Can you see in Search Console whether a blocked URL is indexed?
Yes: in the Pages report, filter on the status “Indexed, though blocked by robots.txt”. It indicates that Google has indexed the URL despite the crawl block.
Can internal links cause a blocked URL to be indexed?
Yes. If a URL blocked by robots.txt receives internal links, Google discovers it and may index it. Check your internal linking to avoid links to pages you do not want indexed.
Does robots.txt protect sensitive content?
No. Robots.txt is a public file and a simple crawl directive. To protect sensitive content, use server-side authentication (htaccess, login, IP whitelisting).
🏷 Related Topics
Content · Crawl & Indexing · AI & SEO · Links & Backlinks · Domain Name
