How does Googlebot actually calculate your crawl budget?

Official statement

Googlebot counts every request made to the server, including images, JavaScript, and CSS files for the crawl budget calculation. However, Google uses aggressive caching to reduce repetitive requests.

Extracted from a Google Search Central video

⏱ 1h13 💬 EN 📅 26/06/2017 ✂ 26 statements

Official statement from June 26, 2017

⚠ A more recent statement exists on this topic: "Does Google Merchant Center crawling count against your SEO crawl budget?" (John Mueller, April 30, 2024).

TL;DR

Googlebot counts every server request in the crawl budget, including images, JavaScript, and CSS. This comprehensive counting mechanism can quickly exhaust your quota on resource-intensive sites. Google's aggressive caching mitigates the problem by reducing repetitive requests, but it does not guarantee control over the consumed budget.

What you need to understand

What resources actually consume crawl budget?

John Mueller's statement cuts through a persistent ambiguity: every HTTP request counts. Not just the HTML of your pages. The images embedded in your articles, the JavaScript files that orchestrate your interfaces, the CSS stylesheets that dress your content — all of this impacts the count.

In practical terms, a typical web page in 2020 weighs around 2 MB and generates around 70 HTTP requests. If Googlebot crawls 1,000 pages from your site, it can trigger up to 70,000 server requests before caching is taken into account. On an e-commerce site with product pages packed with high-definition visuals or a SaaS platform that loads 15 third-party scripts, the ratio skyrockets.

This granular counting changes the game for optimization. A site with 10,000 HTML URLs but 200,000 associated static resources does not consume the budget of a site with 10,000 URLs. It consumes that of a site with 210,000 crawlable resources.
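The arithmetic in this section can be written out as a short sketch. The function names are hypothetical, and the inputs are the illustrative figures quoted above, not measurements.

```python
def requests_for_crawl(pages_crawled: int, requests_per_page: int) -> int:
    """Upper bound on server requests, assuming nothing is served from cache."""
    return pages_crawled * requests_per_page


def total_crawlable_resources(html_urls: int, static_resources: int) -> int:
    """Distinct URLs Googlebot may request if every resource counts."""
    return html_urls + static_resources


print(requests_for_crawl(1_000, 70))               # 70000
print(total_crawlable_resources(10_000, 200_000))  # 210000
```

Both results are worst-case figures: any resource that Google serves from its own cache lowers the real count.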

Does Google's cache really solve the problem?

Mueller mentions an "aggressive cache" that reduces repetitive requests. This is the lifeline: if Googlebot crawled your header logo three days ago and the file hasn't changed, it won't re-download it with every page visit.

The catch? Google does not publish any metrics on the effectiveness of this cache. How long does a CSS file stay cached? What criteria determine if a resource should be re-crawled? These points remain to be verified: no official data allows for estimating the real gain. On a site that pushes daily updates of its assets, caching can become nearly useless.

Server logs show that some images are indeed crawled only once over several weeks. Others, inexplicably, are re-downloaded every 2-3 days. The logic of caching remains opaque and unguaranteed.
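To check this pattern on your own site, one option is to measure the gap between consecutive Googlebot hits on each resource. The sketch below assumes the hits have already been extracted from the logs as (url, timestamp) pairs. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def recrawl_intervals(hits):
    """Median number of days between consecutive Googlebot hits, per URL.

    `hits` is an iterable of (url, datetime) pairs already filtered to
    Googlebot. URLs fetched only once map to None.
    """
    by_url = defaultdict(list)
    for url, ts in hits:
        by_url[url].append(ts)

    result = {}
    for url, stamps in by_url.items():
        stamps.sort()
        gaps = [(b - a).total_seconds() / 86400 for a, b in zip(stamps, stamps[1:])]
        result[url] = median(gaps) if gaps else None
    return result


# Hypothetical sample data, for illustration only.
sample = [
    ("/img/logo.png", datetime(2024, 1, 1)),
    ("/img/hero.jpg", datetime(2024, 1, 1)),
    ("/img/hero.jpg", datetime(2024, 1, 3)),
    ("/img/hero.jpg", datetime(2024, 1, 6)),
]
for url, days in recrawl_intervals(sample).items():
    print(url, days)
```

A resource that maps to None was fetched once in the period. A short median interval flags a resource that Google's cache is apparently not protecting.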

Why is this counting mechanism so important?

The crawl budget is not infinite. Google allocates it based on what your server can handle (the crawl rate limit) and on crawl demand, which depends on how popular your URLs are and how stale Google's copy has become. If you waste this quota on non-priority resources, your strategic new pages won't be crawled in time.

A news site publishing 50 articles a day but generating 5,000 requests for unoptimized images risks having its recent articles indexed only with a 12-24 hour delay. For trending queries, this kills traffic. The same problem arises for e-commerce during sales: if Googlebot uses its quota on old out-of-stock products, the new listings arrive too late in the index.

Every HTTP request counts in the crawl budget, not just HTML pages
A site with many static resources (images, JS, CSS) consumes proportionally more budget than a minimalist site
Google's cache reduces repetitive requests, but its effectiveness remains undocumented and unpredictable
Wasting budget on secondary resources delays indexing of priority content
Server logs remain the most reliable tool for measuring actual budget consumption by resource type

SEO Expert opinion

Is this statement consistent with field observations?

Yes, and this is one of the rare occasions where Google provides an actionable detail. Log analyses confirm that Googlebot generates tens of thousands of requests on average sites, far beyond the number of HTML pages. The patterns clearly show images, scripts, and stylesheets among Googlebot hits.

The problem is that this transparency stops there. Google does not specify the relative weight that each type of request has in the final budget calculation. Does a 50 KB image count the same as a 200 KB HTML page? Does server response time modulate this count? No official answers. Verify this in your own data.

Is the "aggressive" cache really a reliable solution?

Let's be honest: the term "aggressive" sounds reassuring, but it hides a complete lack of guarantee. Tests show huge variability depending on the sites. Some see their CSS crawled once a month, others every week. Validation mechanisms (ETag, Last-Modified, Cache-Control) may influence behavior, but Google documents nothing.

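Worse, on sites that version their asset URLs (like file.abc123.css), every changed URL is a brand-new resource for Googlebot, so nothing cached under the old URL helps. If the version string changes on every deployment, whether or not the file content changed, the "aggressive cache" is useless and Googlebot will re-crawl every asset after each release. Hashes derived from the file content limit the damage, since only files that actually changed get a new URL.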
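To illustrate why the versioning scheme matters, here is a minimal sketch of content-based naming, where the URL changes only when the file's bytes change. The helper name and the 8-character hash length are assumptions made for the example, not a convention of any particular build tool.

```python
import hashlib


def hashed_name(filename: str, content: bytes, length: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:length]
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension: append the hash at the end
        return f"{filename}.{digest}"
    return f"{stem}.{digest}.{ext}"


css_v1 = b"body { color: #222; }"
css_v2 = b"body { color: #333; }"

# Redeploying unchanged content keeps the same URL, so a cached copy stays valid.
print(hashed_name("file.css", css_v1) == hashed_name("file.css", css_v1))
# Changed content produces a new URL, which Googlebot has never seen.
print(hashed_name("file.css", css_v1) == hashed_name("file.css", css_v2))
```

With a build-number suffix instead, every asset would get a new URL on every release, including the files that did not change.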

Warning: Do not rely on Google's cache to optimize your budget. Assume that all your resources can be crawled with each visit and optimize accordingly.

What strategies does this logic invalidate?

Some common SEO practices become counterproductive in light of this statement. Embedding 15 high-resolution images in each article to "enrich content"? This explodes your budget. Loading 8 custom web fonts for a premium visual identity? Same deal. Multiplying tracking scripts and third-party widgets to analyze user behavior? You pay in crawl quota.

Sites that have optimized their Critical Rendering Path for users (aggressive lazy loading, minification, concatenation) also benefit on the crawl side. Fewer HTTP requests = preserved budget for real pages. The Performance Budget becomes an SEO Budget. It’s no coincidence that the best-performing sites in Core Web Vitals often also have the best crawl rates.

Practical impact and recommendations

How to audit your actual budget consumption?

First step: analyze your server logs. Filter Googlebot requests over a period of 30 days and segment them by resource type (HTML, images, CSS, JS, fonts, others). Calculate the ratio of total requests to crawled HTML pages. If you see 10,000 HTML pages but 150,000 total requests, your ratio is 15:1, meaning each HTML page generates an average of 14 additional requests.
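This first audit step can be sketched in a few lines of Python. The example assumes access logs in the common "combined" format and classifies resources by file extension. The regular expression, the extension table, and the function names are illustrative assumptions to adapt to your own log format.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the request, status, size, referrer and user-agent fields of a
# combined-format log line (an assumption about your server configuration).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

TYPES = {
    "image": (".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg"),
    "css": (".css",),
    "js": (".js",),
    "font": (".woff", ".woff2", ".ttf"),
}


def classify(path: str) -> str:
    """Map a request path to a resource type using its extension."""
    clean = urlsplit(path).path.lower()  # drop the query string
    for kind, extensions in TYPES.items():
        if clean.endswith(extensions):
            return kind
    return "html"  # simplification: everything else is treated as a page


def googlebot_breakdown(lines):
    """Count Googlebot requests per resource type.

    The user-agent string can be spoofed; a production audit should also
    verify the requesting IP, which this sketch does not do.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[classify(match.group("path"))] += 1
    return counts


def requests_per_html_page(counts):
    """Total Googlebot requests divided by crawled HTML pages (None if no pages)."""
    html = counts.get("html", 0)
    return sum(counts.values()) / html if html else None
```

Feeding the function an open log file (for example `googlebot_breakdown(open("access.log"))`, with a hypothetical file name) returns the per-type counts, and `requests_per_html_page` gives the ratio discussed above.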

Second level: identify priority waste. Which resources are crawled most often without adding SEO value? Old versions of CSS files still accessible? Thumbnail images resized server-side instead of being pre-generated? Exotic web fonts used on 3 titles per page? All of this eats up your budget.

What technical optimizations should be prioritized?

Start with robots.txt. Explicitly block resources that are not essential for rendering or indexing: tracking files, repetitive decorative images, and web fonts if the page still renders legibly without them. Be careful not to block what is needed for rendering (Google needs to see the content as the user does), but anything purely cosmetic can be dropped.
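Before deploying new rules, it helps to check which URLs they actually block. The sketch below uses Python's standard-library robots.txt parser on a hypothetical rule set. The paths are invented for the example. The standard-library parser does simple prefix matching and may not interpret wildcard patterns the way Googlebot does, so treat the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the directory names are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /assets/fonts/
Disallow: /tracking/
"""


def blocked_paths(robots_txt: str, paths, agent: str = "Googlebot"):
    """Return the paths that the given rules forbid for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


candidates = [
    "/assets/fonts/brand.woff2",
    "/tracking/pixel.gif",
    "/assets/css/main.css",  # needed for rendering: must stay crawlable
    "/products/shoe",
]
print(blocked_paths(ROBOTS_TXT, candidates))
```

Any stylesheet, script, or content image that shows up in the blocked list is a sign that a rule is too broad.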

Next, optimize your HTTP cache headers. Set Cache-Control: max-age=31536000 on versioned assets (which will never change once deployed), and send correct ETag and Last-Modified headers so that conditional requests can be answered with 304 Not Modified. If Googlebot can validate that a file hasn't changed without downloading it, the exchange costs far less bandwidth and server time, even if Google's internal cache has expired. The 304 response is still an HTTP request, though, so it reduces the cost of a hit rather than removing it.
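The conditional-request mechanism can be sketched without any web framework. The functions below are hypothetical helpers that show the decision a server makes: compare the client's If-None-Match value with the current ETag, and skip the body when they match. This simplified version ignores weak validators and the `*` wildcard.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """ETag derived from the response body (quoted, as HTTP requires)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str] = None) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET on a versioned asset."""
    etag = make_etag(body)
    headers = {
        "ETag": etag,
        "Cache-Control": "public, max-age=31536000",
    }
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if etag in candidates:
            return 304, headers, b""  # validated: no body is sent
    return 200, headers, body


asset = b"body { margin: 0; }"
status, headers, payload = respond(asset)
print(status, len(payload))

# A revalidation with the ETag from the first response.
status, _, payload = respond(asset, if_none_match=headers["ETag"])
print(status, len(payload))
```

In practice the web server or CDN handles this logic. The point of the sketch is that the second exchange transfers headers only, which is where the saving comes from.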

Third avenue: implement smart lazy loading. Images at the bottom of the page that are displayed only when the user scrolls can be loaded with client-side JavaScript. Googlebot will still see the content (it executes the JS), but if you structure it well, it won't necessarily trigger all image requests. Test it with the URL Inspection tool in Google Search Console to ensure the rendering remains correct.

What strategy to adopt for large sites?

On sites with 100,000+ pages, prioritization becomes critical. Use segmented XML sitemaps by business priority: one sitemap for strategic pages (flagship products, landing pages), another for the long tail catalog, a third for old editorial content. This does not directly control crawl budget, but it guides Googlebot towards what matters.
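As a rough sketch of this segmentation, the code below groups URLs by a business segment and writes one sitemap per segment plus a sitemap index, using the sitemaps.org XML namespace. The segment names, URLs, and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def segment(pages):
    """Group (url, segment_name) pairs into {segment_name: [urls]}."""
    groups = {}
    for url, name in pages:
        groups.setdefault(name, []).append(url)
    return groups


def build_sitemap(urls) -> str:
    """Serialize a list of page URLs as a <urlset> document."""
    urlset = ET.Element("urlset", xmlns=NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def build_index(sitemap_urls) -> str:
    """Serialize a list of sitemap URLs as a <sitemapindex> document."""
    index = ET.Element("sitemapindex", xmlns=NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = url
    return ET.tostring(index, encoding="unicode")


# Hypothetical catalogue, for illustration only.
pages = [
    ("https://example.com/flagship-product", "strategic"),
    ("https://example.com/landing", "strategic"),
    ("https://example.com/catalog/item-9431", "long-tail"),
    ("https://example.com/blog/2015/old-post", "archive"),
]
groups = segment(pages)
sitemaps = {name: build_sitemap(urls) for name, urls in groups.items()}
index = build_index(f"https://example.com/sitemap-{name}.xml" for name in groups)
print(index)
```

Keeping one file per segment also makes it easier to compare, in the logs, how often Googlebot fetches the URLs of each group.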

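Then, ruthlessly clean up. Old URLs generating 404 errors but still crawled? Serve 410 Gone to signal permanent deletion. Infinite paginated pages that dilute the crawl? Cap the pagination depth or switch to load-more JavaScript; note that Google announced in 2019 that it no longer uses rel=prev/next as an indexing signal, so that markup will not consolidate anything on its own. E-commerce filter facets that explode the number of URLs? Block them with targeted robots.txt rules; noindex tags keep the pages out of the index, but each page must still be crawled for the tag to be seen, so they do not save requests.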

Monitor your server response times. A fast TTFB (Time To First Byte) allows Googlebot to crawl more URLs in the same timeframe. If your server takes 800 ms to respond instead of 200 ms, Googlebot can complete only a quarter as many sequential fetches in the same window, a 75% loss of crawl capacity. The reasoning assumes that Google allocates crawl time rather than an absolute number of requests.
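The 75% figure follows from simple arithmetic, sketched below under a deliberately simplified model: fetches happen one after another within a fixed time window, and response time is the only cost. Real crawling is parallel, so the numbers illustrate the proportion rather than predict actual crawl volume. The function names are hypothetical.

```python
def fetches_in_window(response_ms: float, window_s: float = 60.0) -> float:
    """Sequential fetches that fit in a time window at a given response time."""
    return window_s * 1000 / response_ms


def capacity_loss(fast_ms: float, slow_ms: float) -> float:
    """Fraction of fetches lost when response time rises from fast_ms to slow_ms."""
    return 1 - fast_ms / slow_ms


print(fetches_in_window(200))    # 300.0 fetches per minute
print(fetches_in_window(800))    # 75.0 fetches per minute
print(capacity_loss(200, 800))   # 0.75
```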

Audit your server logs over 30 days to identify the ratio of total requests to crawled HTML pages
Block resources that are not essential for indexing in robots.txt (fonts, tracking, repetitive decorative images)
Configure aggressive Cache-Control HTTP headers (max-age=31536000) for versioned assets
Segment your XML sitemaps by business priority to guide crawling towards strategic content
Clean up dead URLs (410 Gone), consolidate infinite paginations, streamline filter facets
Optimize your server TTFB to maximize the number of crawlable URLs within the allocated time quota

Optimizing the crawl budget quickly becomes a balancing act between technical performance, information architecture, and business prioritization. For complex sites or those facing recurring indexing problems, hiring a specialized SEO agency can provide a thorough audit of server logs and a tailored strategy suited to your specific infrastructure.

❓ Frequently Asked Questions

Do files blocked by robots.txt consume crawl budget?

No. Googlebot checks robots.txt before sending the HTTP request. A blocked resource generates no server request and therefore consumes no budget. This is precisely the recommended method for saving quota on non-priority assets.

Do 304 Not Modified responses count toward the crawl budget?

Yes. A conditional request that returns 304 is still an HTTP request to the server and consumes budget. It is simply much cheaper in bandwidth than a full download. Google's internal cache aims to avoid even these 304 requests.

Should you block images in robots.txt to save budget?

It depends. Google needs to see your images for image indexing and to understand the context of your pages. Block only purely decorative or repetitive images (logos, UI icons, backgrounds). Keep content and product images accessible.

Does JavaScript lazy loading prevent Googlebot from crawling images?

Not necessarily. Googlebot executes JavaScript and can trigger lazy loading. However, well-configured lazy loading can reduce the number of initial requests, which can indirectly optimize the budget. Always test the rendering with Search Console.

How do you know if your crawl budget is saturated?

Check the Crawl Stats report in Search Console. If the number of pages crawled per day plateaus while you regularly publish new content that is slow to be indexed, that is a sign of saturation. Server logs will confirm the diagnosis.
