Why does Googlebot keep crawling your noindex pages, and how can you stop it?

Official statement

Googlebot can still crawl pages marked as noindex/nofollow. If this overloads the server, you can block these pages in the robots.txt file.

Source: Google Search Central video (duration 57:08, in English, published November 1, 2016), statement at 7:11.

Official statement from November 1, 2016 (9 years ago)

A more recent statement exists on this topic: "Why does Googlebot persist in crawling your deleted pages with 410 status?" (John Mueller, June 17, 2025).

TL;DR

Googlebot crawls URLs marked noindex or nofollow even when they are not indexed. This practice wastes crawl budget on large sites. To effectively block the crawler, you should use robots.txt, but be cautious: this prevents Google from seeing the noindex directive and can create paradoxical situations where pages remain in the index.

What you need to understand

What is the difference between crawling and indexing a page?

Crawling is the visit; indexing is the recording in Google's database. Googlebot can visit a page without ever adding it to its index. This is exactly what happens with noindex pages.

When you add a noindex directive, you tell Google: "You can read this page, but do not show it in the results". The bot must therefore crawl it to discover this instruction. This is the central paradox that eludes many practitioners.
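To make the paradox concrete, here is a minimal sketch of what a crawler-side noindex check might look like. It assumes the directive arrives either in a meta robots tag or in an X-Robots-Tag response header; the names `RobotsMetaParser` and `has_noindex` are illustrative and not part of any Google tooling, and user-agent-scoped header values are ignored for simplicity. Either way, the check needs the fetched response, which is why the page has to be crawled first.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html, headers=None):
    """Return True if the fetched page carries noindex in its HTML or its headers."""
    parser = RobotsMetaParser()
    parser.feed(html)

    # Header names are case-insensitive, so compare them in lowercase.
    header_value = ""
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            header_value = value.lower()
    header_tokens = {t.strip() for t in header_value.split(",")}

    return "noindex" in parser.directives or "noindex" in header_tokens


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))
```

Both inputs to `has_noindex` only exist once the URL has been requested, so a URL that is never fetched can never reveal its noindex.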

Why does Google continue to visit pages it will never index?

Google crawls these pages to verify that the noindex directive is still present. If you remove the noindex, the engine must quickly detect it to reindex the page.

The bot also follows the outbound links from noindex pages to discover other indexable content. A noindex page can point to important resources that Google does not want to miss. Even with nofollow, Googlebot may choose to follow links purely for discovery purposes.

In which cases does this behavior cause problems?

On a site with tens of thousands of non-indexable pages (archives, filters, user sessions), the crawler wastes server time and resources. Each visit to a noindex page is a request that could have been used to crawl strategic content.

E-commerce sites with explosive faceted filters are particularly affected. A catalog of 5000 products can generate 500,000 filter combinations, all crawlable if not blocked upfront.
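The arithmetic behind this explosion is easy to sketch. The facet names and value counts below are hypothetical, and the model assumes each facet is either unset or set to exactly one value; real catalogs with multi-select filters grow even faster.

```python
from math import prod


def count_filter_urls(facet_sizes):
    """Count distinct filtered URLs when each facet is unset or set to one value."""
    # Each facet contributes (number of values + 1) choices; subtract the
    # single combination where no filter is applied at all.
    return prod(n + 1 for n in facet_sizes.values()) - 1


# Hypothetical facets for one category page.
facets = {"color": 12, "size": 8, "brand": 40, "price_range": 6, "sort": 4}
print(count_filter_urls(facets))
```

With only five modest facets, a single category page already yields well over a hundred thousand crawlable variants, which is how a few thousand products turn into hundreds of thousands of URLs.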

Crawling consumes budget even on pages explicitly marked as non-indexable
Noindex does not block the crawler; it only prevents the content from being indexed
Robots.txt is the only true blocking tool, but it creates a blind spot: Google no longer sees the directives on these pages
Nofollow does not prevent Googlebot from following links; it only signals a preference that the bot may ignore
Excessive crawling slows down the refreshing of important pages on the site

SEO Expert opinion

Is this statement consistent with on-the-ground observations?

Absolutely. Server logs confirm that Googlebot regularly visits noindex URLs, sometimes multiple times a week depending on the site's popularity. Mueller is just formalizing what crawl analysts have been observing for years.

The point about robots.txt is more delicate. Blocking a URL in robots.txt after it has already been indexed can freeze the URL in the index indefinitely. Google can no longer access the page to read the noindex, so it keeps the entry out of caution. This is documented but rarely applied correctly.

What nuances should be added to this recommendation?

Robots.txt should be your first line of defense, not your plan B. If an entire section of the site should never be crawled (admin, internal search, session parameters), block it in robots.txt right away. Do not wait for Google to discover it before placing a noindex.

Mueller's advice is aimed at sites that already have a server load issue. If your infrastructure handles traffic without a problem, allowing Google to crawl noindex pages is not catastrophic. The real challenge is the crawl budget on large sites with millions of pages. [To verify]: Google has never published a numerical threshold defining a "large site" where crawl budget becomes critical.

When does this rule not apply?

If you want to quickly deindex a page already present in the index, robots.txt is counterproductive. You need to let Google crawl the page with noindex until it disappears from the SERPs, and then possibly block it in robots.txt to save crawl budget.

Small sites (fewer than 10,000 pages) should never sacrifice index cleanliness to save crawl budget. Your priority is to remove unnecessary pages from the index, not to protect your server from negligible bot traffic.

Warning: combining noindex with a robots.txt block on the same URL is a common mistake that prevents deindexation, because Googlebot can no longer fetch the page to read the noindex. Choose one or the other based on your objective.
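This conflict can be detected mechanically. The sketch below assumes you already have a list of URLs known to carry noindex (for example from a site crawl) and uses Python's standard `urllib.robotparser`; the function name `find_conflicts` and the example URLs are illustrative. Note that this parser does simple prefix matching and does not implement Google's wildcard extensions, so it is only an approximation of Googlebot's behavior.

```python
from urllib.robotparser import RobotFileParser


def find_conflicts(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the noindex URLs that robots.txt prevents the crawler from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A noindex on a URL the bot cannot fetch will never be read.
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


robots_txt = """\
User-agent: *
Disallow: /search/
"""

noindex_urls = [
    "https://example.com/search/?q=shoes",   # hypothetical internal search page
    "https://example.com/archive/2015/",     # hypothetical archive page
]

print(find_conflicts(robots_txt, noindex_urls))
```

Any URL returned here carries a noindex that Googlebot cannot see, so it needs either the block or the noindex removed, depending on the goal.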

Practical impact and recommendations

What should I do if crawling overloads my server?

Start by analyzing your server logs to identify the most heavily crawled sections. Look for patterns: proliferating URL parameters, endless pagination, redundant filters. Splunk, Screaming Frog Log File Analyser, or even basic Python scripts will do the job.
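As an illustration of the "basic Python script" approach, the sketch below counts Googlebot requests per top-level site section. It assumes access logs in the common combined format and identifies the bot by user-agent string only, which can be spoofed; verifying the requesting IP is out of scope here. The regular expression and the function name `googlebot_hits_by_section` are assumptions to adapt to your own log layout.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches: "METHOD /path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits_by_section(lines):
    """Count Googlebot requests grouped by the first path segment."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        # Drop the query string, then keep only the first path segment.
        path = urlsplit(match.group("path")).path
        segments = [s for s in path.split("/") if s]
        section = "/" + segments[0] if segments else "/"
        counts[section] += 1
    return counts


sample = [
    '66.249.66.1 - - [01/Nov/2016:10:00:00 +0000] "GET /filters/?color=red HTTP/1.1" '
    '200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
for section, hits in googlebot_hits_by_section(sample).most_common():
    print(section, hits)
```

Sections that dominate the count while holding only noindex or parameter-driven pages are the crawl sinks worth investigating.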

Once you've identified crawl sinks, block them in robots.txt if you're certain they hold no SEO value. Typically: /admin/, /cart/, /checkout/, sorting and session parameters. Test the impact on server load for 2-3 weeks before validating the strategy.
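Before deploying such rules, it can help to check a draft against URLs you expect to be blocked and URLs that must stay crawlable. The following is a sketch using Python's standard `urllib.robotparser`; the draft rules mirror the directories named above, while the example URLs and the `audit_rules` helper are hypothetical. Because this parser only does prefix matching, parameter-based rules that rely on Google's wildcard syntax cannot be validated this way.

```python
from urllib.robotparser import RobotFileParser

DRAFT_RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
"""


def audit_rules(robots_txt, must_block, must_allow, user_agent="Googlebot"):
    """Return (wrongly_open, wrongly_blocked) URLs for a draft robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    wrongly_open = [u for u in must_block if parser.can_fetch(user_agent, u)]
    wrongly_blocked = [u for u in must_allow if not parser.can_fetch(user_agent, u)]
    return wrongly_open, wrongly_blocked


wrongly_open, wrongly_blocked = audit_rules(
    DRAFT_RULES,
    must_block=["https://example.com/cart/view", "https://example.com/admin/login"],
    must_allow=["https://example.com/products/shoes", "https://example.com/"],
)
print("Still crawlable but should be blocked:", wrongly_open)
print("Blocked but should be crawlable:", wrongly_blocked)
```

Two empty lists mean the draft matches the intended strategy for the sampled URLs; anything else is worth fixing before the file goes live.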

How should I manage pages that need to disappear from the index?

This is a two-step process. First, remove the robots.txt block if those URLs are currently blocked, and let Google crawl them with noindex for several weeks until deindexation is confirmed in Search Console. Only then can you block them again in robots.txt to save crawl budget.
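The ordering matters more than the individual steps, so it can be useful to write it down as a small decision helper. This is only a sketch of the sequence described above; the function name and the three boolean inputs are assumptions, and you would have to determine their values yourself from Search Console and a site crawl.

```python
def next_deindex_step(in_index, blocked_by_robots, has_noindex):
    """Suggest the next action for a URL that should leave the index."""
    if in_index:
        if blocked_by_robots:
            # Googlebot cannot read noindex on a page it may not fetch.
            return "remove the robots.txt block"
        if not has_noindex:
            return "add noindex to the page"
        return "wait for Googlebot to recrawl and drop the page"
    if not blocked_by_robots:
        return "optionally block in robots.txt to save crawl budget"
    return "nothing left to do"


# A page that is indexed, blocked, and already carries noindex:
print(next_deindex_step(in_index=True, blocked_by_robots=True, has_noindex=True))
```

The helper never suggests blocking while the URL is still indexed, which is the mistake the two-step process is designed to avoid.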

To speed things up, use the URL removal tool in Search Console, but understand that its effect is temporary (about 6 months). The noindex directive must remain in place on the page itself for the removal to be permanent. Never rely on the removal tool alone without a noindex served by the page.

What mistakes should be avoided when optimizing crawl budget?

Never use robots.txt to block a URL whose content you want indexed. It seems obvious, but strategic sections often end up blocked by mistake after a redesign. Audit robots.txt with every major deployment.

Avoid modifying robots.txt and noindex simultaneously on the same URLs. Proceed gradually and measure. A drastic change can cause thousands of pages to disappear from the index or, conversely, keep them in a zombie state for months.

Analyze your logs to quantify the actual crawl on noindex pages
Only block in robots.txt sections with no SEO value that are never meant to be indexed
To deindex, first remove the robots.txt block, apply noindex, wait for deindexation, then block again if necessary
Test any changes to robots.txt on a sample of URLs before generalizing
Monitor crawl rate in Search Console after each structural change
Document your robots.txt/noindex strategy to avoid regressions during redesigns

Managing crawl budget and indexing directives requires a delicate technical balance between resource conservation and visibility in the index. On complex infrastructures with hundreds of thousands of URLs, these trade-offs demand sharp expertise and continuous monitoring. A specialized SEO agency can audit your architecture, analyze your server logs, and implement a tailored strategy to maximize crawl efficiency while protecting your technical resources.

❓ Frequently Asked Questions

If I block a page in robots.txt, can Google still index it?

Yes, if the page receives external backlinks. Google can add it to the index based solely on the anchor text and context of those links, without ever crawling the page itself. This is why blocked URLs sometimes appear in the SERPs with a "No information is available for this page" notice.

Should I put both noindex AND nofollow on low-value pages?

Nofollow does not actually stop Googlebot from following links; it is only a preference signal. If you really want to isolate a page, use noindex and do not link to it from your indexable pages. Nofollow has become a hint, not a strict directive.

How long does Google keep crawling a page after a noindex is added?

Indefinitely, but with decreasing frequency. Google periodically checks that the directive is still present. On an active site, expect visits at least monthly, sometimes weekly depending on the page's internal PageRank.

Can I block Googlebot while allowing other search engines to crawl?

Yes, through user-agent-specific directives in robots.txt (User-agent: Googlebot). But you then fragment your indexing across engines, which complicates monitoring. This is rarely recommended except in very specific content syndication cases.

Is crawl budget a problem for a site with fewer than 50,000 pages?

Rarely. Google states that most sites do not need to worry about it. Crawl budget becomes critical on large platforms with frequently changing content (news, very large e-commerce sites, classified ads). Below 50,000 relatively stable pages, focus first on content quality and link structure.
