Why does Google only crawl a fraction of your known pages?

Official statement

Google has only crawled a portion of known URLs from a site since its inception. If Google crawls 20,000 pages out of 100,000 known (via sitemap), only those 20,000 can be indexed. This number increases as the quality of the site improves. This is not a new phenomenon.

3:11

🎥 Source video

Extracted from a Google Search Central video

⏱ 37:34 💬 EN 📅 12/06/2020 ✂ 18 statements


Other statements from this video (17)

1:06 Why is Google suddenly showing more non-indexed URLs in Search Console?
5:17 Core Web Vitals: why are your lab tests useless for ranking?
9:30 Is a site really responsible, in SEO terms, for its user-generated content?
11:03 Should you really include all your pages in a general sitemap?
12:05 Does crawl budget vary depending on the origin of the content?
13:08 Does Googlebot send an HTTP referrer when crawling your site?
14:09 Does image quality really influence ranking in Google web search?
18:15 How does Google really assess the importance of your pages through internal linking?
20:19 Why can a well-ranked site lose relevance without having made any mistake?
21:53 Are Core Web Vitals really a ranking factor or just a smokescreen?
22:57 Does Discover really work without strict technical criteria?
25:02 Can removing pages from a sitemap limit their crawling by Google?
27:08 Should you really use unavailable_after to manage temporary content?
30:11 Does structured data really influence ranking in Google?
31:45 Why does Google sometimes index your AMP pages before their canonical HTML version?
33:52 Are Core Web Vitals really decisive for Google ranking?
35:51 Does Google really see content loaded dynamically after a user click?

What you need to understand

Does Google really crawl all the URLs it knows?

No, and it never has. Google only crawls a fraction of known URLs from a site, regardless of its size. This reality is often misunderstood: submitting 100,000 pages via sitemap does not guarantee that these pages will be visited by Googlebot.

The search engine performs an active selection based on its perception of the site's quality and the relevance of each URL. If Google determines that 80% of your pages are not valuable, it will not waste time crawling them regularly — or even at all.

What determines the allocated crawl volume?

The crawl budget is not a fixed quota: it’s a dynamic allocation that reflects the trust Google places in your site. The higher the perceived quality, the more resources Googlebot dedicates to exploring your content.

Specifically? A site with unique content, regularly updated, and technically sound will see its crawl volume gradually increase. Conversely, a site filled with duplicate pages, low-quality content, or unnecessary facets will see its budget stagnate — or even regress.

Why has this limitation always existed?

Because crawling the web is costly in server resources, bandwidth, and energy. Google has to prioritize: it cannot visit every page of every site on the web daily, especially when a large share of what it crawls (90% by the estimate used here) is not worth indexing.

This economic constraint forces Google to be selective from the crawl stage. It’s a barrier even before indexing: if a page is never crawled, it cannot compete for SERP rankings. And this is where many SEOs go wrong: they optimize pages that Google simply does not visit.

Google only crawls a portion of known URLs, even via XML sitemap
This crawl volume is proportional to the perceived quality of the site
An uncrawled URL cannot be indexed, regardless of its intrinsic qualities
This limitation has existed since the inception of Google and is not a recent phenomenon
Improving the overall quality of the site tends to increase the allocated crawl budget over time

SEO Expert opinion

Is this statement consistent with real-world observations?

Yes, and it’s even one of the few statements from Google that perfectly aligns with actual SEO audits. On sites with 50,000+ pages, it’s common to see 40% to 60% of the URLs never visited by Googlebot, even after several months online.

The problem is that many SEOs discover this reality too late — after generating thousands of low-value filter or category pages. They then see that Google completely ignores these URLs, without even crawling them once.

Why is Google vague about the exact thresholds?

Because there is no universal rule. The crawled volume depends on dozens of factors: domain history, content popularity, update frequency, technical quality, page depth, server speed, HTTP error rates...

Google does not publish precise numbers, in part so that SEOs cannot game the system. In practice, a typical e-commerce site with 200,000 products will rarely have more than 30% to 50% of its pages crawled regularly. Treat that range as an estimate and verify it on your own project via server logs.

What are the limits of this overall quality logic?

The issue is that Google judges quality at the site-wide level, not page by page at the initial crawl. If 80% of your site is mediocre, even your 20% premium pages might never be crawled simply because they are drowned in the mass.

This is where a mass cleanup strategy comes into play: deindexing or removing weak pages can paradoxically improve the crawl of important pages. Some sites have doubled their organic traffic by removing 60% of their content. This is not a myth; it is regularly observed on large sites.

Warning: If you have a site with 100,000+ pages and stagnant traffic, check your actual crawl rate via server logs as a priority. You will likely discover that Google is ignoring most of your content, and this is often the first optimization lever to pull.

Practical impact and recommendations

How to effectively measure your site's crawl budget?

The first step is to analyze your server logs. Google Search Console provides a partial view, but raw logs show you exactly which URLs are visited, how often, and at what depth.

Then cross-reference this data with your declared XML sitemap. If you have 50,000 submitted URLs but only 10,000 crawled over 30 days, you have a structural issue. Either your content is deemed weak, or your architecture is drowning important pages.
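The cross-check between logs and sitemap can be sketched in a few lines of Python. This is a minimal illustration, not a full log analyzer: it assumes access logs in the common "combined" format, identifies Googlebot by user agent only (which can be spoofed; reverse DNS verification is left out), and uses made-up sample data. The names `sitemap_paths`, `googlebot_paths`, and `crawl_coverage` are hypothetical helpers introduced for this example.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Request path, status and user agent from a "combined" access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_paths(sitemap_xml):
    """Return the set of URL paths declared in an XML sitemap."""
    root = ET.fromstring(sitemap_xml)
    return {
        urlparse(loc.text.strip()).path
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
    }


def googlebot_paths(log_lines):
    """Return the set of paths requested by a user agent containing 'Googlebot'."""
    paths = set()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            # Drop the query string so /b?ref=nav counts as /b.
            paths.add(urlparse(match.group("path")).path)
    return paths


def crawl_coverage(known, crawled):
    """Share of known URLs that were crawled at least once."""
    if not known:
        return 0.0
    return len(known & crawled) / len(known)


# Illustrative sample data (not real logs).
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
  <url><loc>https://example.com/c</loc></url>
  <url><loc>https://example.com/d</loc></url>
</urlset>"""

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'66.249.66.1 - - [12/Jun/2020:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [12/Jun/2020:10:05:00 +0000] "GET /b?ref=nav HTTP/1.1" 200 734 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.7 - - [12/Jun/2020:10:06:00 +0000] "GET /c HTTP/1.1" 200 420 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"',
]

known = sitemap_paths(SAMPLE_SITEMAP)
crawled = googlebot_paths(SAMPLE_LOG)
print(f"Crawl coverage: {crawl_coverage(known, crawled):.0%}")
print("Never crawled:", sorted(known - crawled))
```

On a real site, the same comparison would run over 30 days of logs; the list of never-crawled paths is usually more actionable than the ratio itself.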

What concrete actions can increase crawled volume?

First priority: eliminate low-value pages. Unnecessary facets, duplicate pages, thin content, empty categories — everything that pollutes the crawl without driving traffic should be deindexed or removed.

Next, optimize your internal linking to push strategic pages: an orphan page or one located 8 clicks from the homepage is unlikely to be crawled regularly. Bring your key content within 2-3 clicks maximum through relevant contextual links.
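Click depth can be estimated with a breadth-first search over the internal link graph. The sketch below assumes you already have that graph as a dictionary (for example, exported from a crawler); the graph, the URLs, and the helper names `click_depths` and `deep_and_orphan_pages` are illustrative assumptions, and the 3-click threshold simply mirrors the guideline above.

```python
from collections import deque


def click_depths(links, home):
    """Breadth-first search: minimum number of clicks from `home` to each page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def deep_and_orphan_pages(links, home, all_pages, max_depth=3):
    """Return (pages deeper than max_depth, pages unreachable from home)."""
    depths = click_depths(links, home)
    deep = sorted(page for page, depth in depths.items() if depth > max_depth)
    orphans = sorted(page for page in all_pages if page not in depths)
    return deep, orphans


# Hypothetical internal link graph: page -> pages it links to.
LINKS = {
    "/": ["/category"],
    "/category": ["/category/page-2"],
    "/category/page-2": ["/category/page-3"],
    "/category/page-3": ["/product-x"],
}
ALL_PAGES = ["/", "/category", "/category/page-2", "/category/page-3",
             "/product-x", "/old-landing"]

deep, orphans = deep_and_orphan_pages(LINKS, "/", ALL_PAGES)
print("Too deep:", deep)
print("Orphans:", orphans)
```

Pages that show up in either list are candidates for new contextual links from shallower pages.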

Finally, improve your technical signals: server speed, response time, 4xx/5xx error rates, unnecessary redirects. A slow or unstable server directly lowers your crawl budget, because Google does not want to overload your resources.
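To track these technical signals, one option is to summarize Googlebot's requests by status class and response time. The sketch below assumes the log has already been parsed into `(status, response_ms)` pairs; response time is not part of the default combined log format, so this supposes a custom log field. The helper `crawl_health` and the sample records are hypothetical.

```python
from statistics import median


def crawl_health(records):
    """Summarize (http_status, response_ms) pairs from Googlebot requests.

    Returns the share of responses per status class and the median
    response time in milliseconds.
    """
    records = list(records)
    if not records:
        return {}
    total = len(records)
    classes = {"2xx": 0, "3xx": 0, "4xx": 0, "5xx": 0}
    for status, _ in records:
        key = f"{status // 100}xx"
        if key in classes:
            classes[key] += 1
    report = {key: count / total for key, count in classes.items()}
    report["median_ms"] = median(ms for _, ms in records)
    return report


# Illustrative records (not real measurements).
SAMPLE_RECORDS = [
    (200, 120), (200, 180), (301, 90), (404, 60), (500, 900),
    (200, 150), (200, 200), (503, 1200), (200, 130), (200, 170),
]

report = crawl_health(SAMPLE_RECORDS)
for name, value in report.items():
    print(name, value)
```

Comparing this report before and after an infrastructure change gives a rough view of whether the server is becoming easier or harder for Googlebot to crawl.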

What critical mistakes should you absolutely avoid?

Mistake number one: generating pages en masse without ensuring they will be crawled. Before launching 100,000 product pages or 500,000 filter combinations, verify that your site has the technical and qualitative capacity to handle this volume.

Mistake number two: ignoring signs of excessive crawling. If Google crawls 80% of your pages but only 20% generate traffic, you are wasting budget on unnecessary content. Redirect this budget towards your strategic pages by cleaning up the rest.
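One rough way to quantify this waste is to compare Googlebot hits per URL with organic visits per URL. The sketch below assumes both are available as dictionaries (hits from server logs, visits from an analytics export); the helper `wasted_crawl_share` and all figures are made up for illustration.

```python
def wasted_crawl_share(crawl_hits, organic_visits):
    """Share of Googlebot hits spent on URLs with no organic visits."""
    total_hits = sum(crawl_hits.values())
    if total_hits == 0:
        return 0.0
    wasted = sum(
        hits for url, hits in crawl_hits.items()
        if organic_visits.get(url, 0) == 0
    )
    return wasted / total_hits


# Illustrative data: Googlebot hits and organic visits over the same period.
CRAWL_HITS = {"/a": 10, "/b": 5, "/filter?color=red": 25, "/filter?size=xl": 10}
ORGANIC_VISITS = {"/a": 300, "/b": 40}

share = wasted_crawl_share(CRAWL_HITS, ORGANIC_VISITS)
no_traffic = sorted(url for url in CRAWL_HITS if ORGANIC_VISITS.get(url, 0) == 0)
print(f"Hits on URLs without traffic: {share:.0%}")
print("Cleanup candidates:", no_traffic)
```

A URL without traffic is not automatically worthless (it may be new or seasonal), so the output is a list of candidates to review rather than a deletion list.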

Analyze your server logs to identify the actual crawl rate versus known URLs
Remove or deindex all low-value pages (thin content, duplications, unnecessary facets)
Optimize your internal linking to elevate strategic pages within 2-3 clicks of the homepage
Enhance server speed and response time to maximize crawl efficiency
Only submit your best pages in the XML sitemap — not the entire site
Monitor the evolution of the crawl budget via Search Console and logs after each optimization
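The sitemap recommendation in the checklist above can be sketched as a filter applied before the file is generated. This example assumes each page carries a quality score that you compute yourself; the score, the 0.5 threshold, the URLs, and the helper `build_sitemap` are all hypothetical, and Google publishes no such score.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages, min_score=0.5):
    """Build a sitemap string containing only pages scoring at least min_score.

    `pages` is an iterable of (absolute_url, score) pairs.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_XMLNS)
    for url, score in pages:
        if score >= min_score:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Illustrative pages with made-up quality scores.
PAGES = [
    ("https://example.com/guide-crawl-budget", 0.9),
    ("https://example.com/product-x", 0.7),
    ("https://example.com/filter?color=red", 0.1),
    ("https://example.com/empty-category", 0.2),
]

sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

Keeping the selection rule in code makes it easy to rerun after each cleanup and to compare the submitted list against what Googlebot actually crawls.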

Crawl budget is not set in stone: it is an indicator of the trust Google places in your site. By cleaning up your low-quality content, optimizing your architecture, and improving your technical signals, you can double or triple the crawled volume in a few months. However, these optimizations require expertise in log analysis, SEO architecture, and content strategy, areas where a specialized SEO agency can help you structure a tailored action plan and avoid costly mistakes.

❓ Frequently Asked Questions

If Google knows 100,000 of my URLs but only crawls 20,000, what happens to the other 80,000?

They remain known (via sitemap or links) but are never visited by Googlebot, so they are never indexed or ranked. They weigh on your crawl budget without adding value: deindex them or remove them.

Can you force Google to crawl more pages by submitting the sitemap more often?

No. The sitemap indicates which URLs exist, but Google alone decides which ones deserve to be crawled, based on the perceived quality of the site. Submitting the same sitemap more often changes nothing.

How do I know if my site has a crawl budget problem?

Compare the number of crawled URLs (server logs) with the number of known URLs (sitemap + GSC). If fewer than 50% are crawled over 30 days, or if your strategic pages are never visited, you have a problem.

Does removing weak pages really improve the crawl of the remaining pages?

Yes, in most cases. By eliminating the noise, you concentrate crawl budget on your high-value content. Some sites have doubled their organic traffic after removing 50% to 70% of their pages.

Is crawl budget only a problem for large sites?

No. Even a 5,000-page site can waste its budget on weak content. Size amplifies the problem, but the logic applies as soon as there is a critical mass of low-value pages.
