
Official statement

If your site has fewer than a million URLs, you generally don't need to worry about crawl budget. This figure serves as a reference baseline.
🎥 Source video

Extracted from a Google Search Central video

⏱ 31:53 💬 EN 📅 09/12/2020 ✂ 16 statements
Watch on YouTube (15:37) →
Other statements from this video (15)
  1. 2:49 Why does Google almost always render your pages before indexing them?
  2. 3:52 Should the two-waves-of-indexing model be abandoned?
  3. 7:35 Does Google use a sandbox or honeymoon period for new sites?
  4. 8:02 Does Google really guess where to rank a new site before it even has data?
  5. 9:07 Why do new sites ride a roller coaster in the SERPs?
  6. 13:59 Should you really worry about crawl budget for your site?
  7. 16:09 Does crawl budget really exist, or is it just an SEO myth?
  8. 17:42 Does Google deliberately throttle its crawl to spare your servers?
  9. 18:51 Can Googlebot really stop crawling your site because of server error codes?
  10. 20:24 How do you detect a real crawl budget problem on your site?
  11. 21:57 Does pruning thin content really improve crawl budget?
  12. 22:28 Should you sacrifice server speed to save crawl budget?
  13. 23:32 Why are your API requests blowing up your crawl budget without you knowing?
  14. 24:36 Crawl budget: do all your URLs really count as much as Google claims?
  15. 25:39 Should you really worry about Googlebot's aggressive caching of your static resources?
TL;DR

Google claims that a site with fewer than a million URLs generally doesn't need to worry about crawl budget. This indicative threshold suggests that most websites should not see their crawling limited by the crawl resources Google allocates. However, this generalization masks real-world situations where even sites with 50,000 pages can run into indexing issues tied to link quality, server speed, or click depth.

What you need to understand

What does Google really mean by 'crawl budget'?

The crawl budget refers to the number of pages a search engine will explore on a site within a given timeframe. Google allocates a quota that depends on several factors: site popularity, server response speed, content freshness, and the overall quality of the pages.

This concept is often misunderstood. The crawl budget is not a fixed limit set in stone — it's a dynamic balance between what Google can crawl without overloading your server and what it deems valuable to explore. A site may theoretically have 500,000 URLs but might only see 10,000 crawled per day if Googlebot detects a lot of duplicate content, low-quality pages, or catastrophic response times.

When Gary Illyes sets the threshold at a million URLs, he is referring to an indicative threshold beyond which structural crawling issues become statistically unavoidable. Below it, most sites face no technical constraints tied to raw volume.

Why this specific figure of 1 million?

Let’s be honest: this number is not a scientific truth set in Googlebot's code. It’s a pragmatic approximation meant to alleviate the worries of smaller sites. One million URLs is roughly the size of a well-stocked national media outlet, a mature e-commerce site with an extensive catalog and filters, or a job portal covering an entire country.

This threshold mainly serves to delineate a comfort zone — below it, if you encounter indexing problems, it's probably not because Google refuses to crawl your pages due to budget constraints. The causes lie elsewhere: shaky architecture, cascading redirects, overly restrictive robots.txt rules or meta robots tags, absent or poorly maintained XML sitemaps, and low-quality content.

In what cases does this threshold not apply?

A site with 200,000 URLs may very well have a crawl budget issue if it generates a lot of unnecessary filtered facets, if its server is painfully slow, or if its internal linking is catastrophic. Conversely, a well-structured site with 1.2 million pages that is fast, with coherent linking and fresh content, may never encounter limitations.

Volume is just one indicator among others. Google also considers publication velocity (how many new pages per day?), the frequency of updates to existing content, the bounce rate on crawled pages (an indirect quality signal), and the external popularity of the site (backlinks, traffic).

  • Architecture and depth: a site with 300,000 pages but an average depth of 8 clicks will be poorly crawled, even below a million.
  • Server speed: if your server takes 2 seconds to respond on average, Googlebot will intentionally slow down the crawling pace to avoid crashing — which mechanically reduces the number of pages crawled.
  • Quality and freshness: a site with 80% dead pages (zero traffic, zero internal links) will see its crawl budget wasted on content that Google ultimately ignores.
  • Spam signals: suspicious patterns (mass duplication, cloaking, dubious redirects) can drastically reduce crawl allocation, regardless of volume.
  • External popularity: a site with a good backlink profile and regular organic traffic naturally receives more crawl — Google sees value in returning often.

SEO Expert opinion

Is this million-rule really reliable in practice?

In my experience, this threshold is generally consistent with what we've observed — as long as it’s not taken literally. Most sites under 500,000 URLs do not have strict crawl budget constraints. When they encounter indexing issues, it’s almost always tied to structural problems: faulty internal linking, excessive depth, duplication, poorly configured robots.txt files.

That said, I’ve seen a 150,000-page site with a clearly throttled crawl budget — hosted on a low-cost service with response times over 1.5 seconds, thousands of cascading 301 redirects, and an XML sitemap listing 80,000 URLs, half of which returned 404 errors. In that situation, Google crawls sluggishly and ultimately ignores a large part of the site.

What nuances need to be added to this statement?

Google uses the term “generally” — that word matters. The reality is that crawl budget is a resultant, not a cause. If your site is technically sound, fast, well-linked, with fresh content and an up-to-date sitemap, you won't face limitations even with 800,000 URLs. If your site is shaky, 50,000 pages may already pose a problem.

Furthermore, the distribution of crawl matters just as much as the total volume. Googlebot may crawl 10,000 pages a day on your site, but if 90% of that crawl focuses on low-value pages (filtered facets, old news), your strategic pages won't be visited regularly. So the issue is not always quantitative — it’s often qualitative.

Another point: this statement dates back to a time when the web was less dynamic. Today, with JavaScript-heavy sites, Single Page Applications, and client-side rendering, crawl budget can be affected by the computational cost of rendering, not just the raw volume of URLs. It remains to be confirmed to what extent this million-URL figure accounts for the CPU cost of modern rendering.

In what cases does this rule absolutely not apply?

If your site generates parameterized URLs indefinitely (e.g., poorly managed e-commerce facets, infinite pagination, user sessions in the URL), you may find yourself with “only” 100,000 actual pages but millions of potential URLs that Googlebot will attempt to crawl. In this case, crawl budget becomes a real problem, even if your product inventory does not exceed 10,000 items.

Sites with a high editorial velocity — news, classifieds, job offers — can also saturate their crawl budget if Google has to return several times an hour to hundreds of thousands of constantly changing pages. Again, volume alone is not sufficient to predict Googlebot's behavior.

Warning: if you notice in Search Console that strategic pages are not being crawled regularly while your site is under a million URLs, don’t be too quick to reassure yourself with this rule. Dig into the architecture, speed, linking, and content quality — the problem is likely there.

Practical impact and recommendations

What should I do concretely if my site is under a million URLs?

Don’t rest on your laurels. Being below the threshold guarantees nothing if your site is structurally unsound. Focus on the technical fundamentals: server response time, click depth, quality of internal linking, redirect management, cleaning up dead pages.

Regularly check the 'Crawl Stats' report in Google Search Console to identify anomalies: unexplained crawl spikes (often a sign of duplication or redirect loops), sharp drops (robots.txt blocking, massive server errors), concentration of crawl on non-strategic areas.
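One simple way to turn those Crawl Stats numbers into alerts is to compare each day's request count against a trailing average. A minimal sketch (the two-times threshold is an arbitrary assumption; tune it to your site's normal variance):

```python
def flag_crawl_anomalies(daily_counts, window=7, factor=2.0):
    """Flag indexes of days whose crawl volume spikes above, or drops below,
    `factor` x the average of the preceding `window` days."""
    anomalies = []
    for i in range(window, len(daily_counts)):
        baseline = sum(daily_counts[i - window:i]) / window
        if baseline and (daily_counts[i] > factor * baseline
                         or daily_counts[i] < baseline / factor):
            anomalies.append(i)
    return anomalies
```

A flagged spike often points at duplication or redirect loops; a flagged drop at robots.txt blocking or server errors, as described above.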

How to detect a crawl budget issue even if under a million?

Compare the number of pages crawled daily (Search Console > Crawl Stats) with the volume of active pages you want to be indexed. If Googlebot only visits 10% of your strategic pages per week, there’s a concern — even if you’re at 300,000 URLs.
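If you have server access logs, you can cross-check Search Console's numbers yourself. This sketch assumes Apache/Nginx combined log format and a naive user-agent match; real Googlebot verification should also include a reverse DNS check, since anyone can spoof the user-agent string:

```python
import re
from collections import defaultdict

# Matches the date, request path, and status of a combined-log-format line.
LOG_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]+\] "GET ([^ ]+) HTTP[^"]*" (\d{3})')

def googlebot_urls_per_day(lines):
    """Count distinct URLs requested by (a client claiming to be) Googlebot, per day."""
    per_day = defaultdict(set)
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = LOG_RE.search(line)
        if m:
            day, url, _status = m.groups()
            per_day[day].add(url)
    return {day: len(urls) for day, urls in per_day.items()}
```

Comparing this daily count with the number of pages you actually want indexed gives you the ratio discussed above.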

Analyze the average depth of your important pages. If your key product sheets are 5-6 clicks from the homepage, it's a sign that your internal linking isn’t doing its job. Googlebot follows links like a user — what’s deep for it is also deep for your visitors.
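Click depth is straightforward to compute with a breadth-first search over your internal link graph, for instance one exported from a crawler. The link map in the usage example is fabricated:

```python
from collections import deque

def click_depths(links, start="/"):
    """BFS from the homepage; returns each reachable URL's minimum click depth.

    `links` maps a page URL to the list of URLs it links to internally.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit = shortest path in a BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

Any strategic URL coming back with a depth above 3 is a candidate for better internal linking; any URL missing from the result is an orphan page.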

Also, check the server response times in Search Console. If the average exceeds 500ms, Google will slow down the crawl to avoid overloading your infrastructure. A server that handles the load at 200ms allows for more aggressive crawling and therefore better coverage.
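To sample response times yourself, a small timing helper works with any HTTP client; the callable you pass in is your own, and Googlebot's own measurement will of course differ from a client-side figure:

```python
import time

def avg_response_ms(fetch, samples=5):
    """Average wall-clock duration of `fetch()` over several calls, in ms.

    `fetch` is any zero-argument callable performing one request, e.g.
    lambda: urllib.request.urlopen("https://example.com").read()
    """
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        fetch()
        total += time.perf_counter() - start
    return total * 1000 / samples
```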

What mistakes should be avoided at all costs?

Do not generate unnecessary URLs — every facet, filter, sort, or pagination variant that doesn’t provide unique SEO value should be blocked (robots.txt, a meta robots noindex tag, or better yet: a canonical tag). Every unnecessarily crawled URL is wasted budget.
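As an illustration, a robots.txt fragment blocking typical low-value parameters might look like this (the parameter names are hypothetical — match them to your own URL patterns; Googlebot supports `*` wildcards in Disallow paths):

```
User-agent: *
Disallow: /*?sort=
Disallow: /*?sessionid=
Disallow: /*?color=
```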

Do not leave thousands of cascading 301 redirects lying around. Googlebot follows redirects, but doing so consumes budget. If page A redirects to B, which redirects to C, Googlebot may decide not to follow the chain to the end — or to slow the crawl of that section.
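If you can export your redirect map (source URL to target URL) from your server config or a crawl, flagging multi-hop chains takes a few lines. A sketch; the hop cap guards against redirect loops:

```python
def redirect_chains(redirects, max_hops=20):
    """Follow each source through the redirect map; report chains of 2+ hops."""
    chains = []
    for source in redirects:
        path = [source]
        current = source
        while current in redirects and len(path) <= max_hops:
            current = redirects[current]
            path.append(current)
        if len(path) > 2:  # source -> A -> B or longer
            chains.append(path)
    return chains
```

Each reported chain should be collapsed into a single redirect from the original source straight to the final target.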

Remember to regularly update your XML sitemap. A sitemap listing 50,000 URLs with 10,000 returning 404 or redirects sends a signal of negligence to Google. A clean, up-to-date sitemap that only lists active, indexable pages effectively guides Googlebot.
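A sitemap audit can start from the raw XML. In this sketch the status lookup is injectable so you can plug in any HTTP client; the sample sitemap in the test is fabricated:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(xml_text, get_status):
    """Return (URL, status) pairs from a sitemap whose status isn't a plain 200.

    `get_status` is any callable mapping a URL to an HTTP status code.
    """
    root = ET.fromstring(xml_text)
    bad = []
    for loc in root.findall(".//sm:loc", NS):
        status = get_status(loc.text)
        if status != 200:
            bad.append((loc.text, status))
    return bad
```

Anything this returns — 404s, redirects, server errors — is a URL to remove from the sitemap or to fix on the site.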

  • Check crawl stats in Search Console every week
  • Audit the click depth of strategic pages (goal: max 3 clicks from the homepage)
  • Measure server response times and optimize if >300ms
  • Clean up unnecessary parameterized URLs (facets, sessions, tracking)
  • Maintain an up-to-date XML sitemap, without 404s or redirects
  • Identify and deindex or remove dead pages (zero traffic, zero internal links)
The million-URL threshold is a useful marker, not a guarantee. Even below it, a poorly architected site can waste its crawl budget. The key is to make exploration smooth, fast, and relevant — Googlebot will thank you by effectively indexing your important pages.

These technical optimizations can quickly become complex to manage, especially on growing sites. If you notice warning signs in Search Console or your indexing stagnates despite your efforts, consider partnering with a specialized SEO agency that can diagnose your architecture and unblock bottlenecks.

❓ Frequently Asked Questions

Can a 50,000-page site have a crawl budget problem?
Yes — if click depth is excessive, the server is slow, or many of the generated URLs are useless (facets, session parameters). Volume alone doesn't guarantee smooth crawling.
How do I know how many pages Google crawls per day on my site?
In Google Search Console, under "Settings" then "Crawl Stats". You'll find the number of requests per day, response times, and server errors.
Should e-commerce facets be blocked to save crawl budget?
If the facets generate no distinct organic traffic and create duplication, yes — use robots.txt, meta robots noindex, or canonicals pointing to the main page. That avoids wasting crawl on non-strategic content.
Can a poorly maintained XML sitemap reduce crawl budget?
Indirectly, yes. A sitemap full of 404s, redirects, or blocked URLs sends a signal of negligence. Google may reduce crawl frequency if it detects too many repeated errors.
Does server speed really impact crawl budget?
Absolutely. If your server responds slowly, Googlebot deliberately slows its pace so as not to overload you. A fast server allows a more aggressive crawl and therefore better coverage.
🏷 Related Topics
Crawl & Indexing · Domain Name

