What page size limit could hinder Google’s indexing?

Official statement

Google Fetch in Search Console has a high limit regarding page size. If a page exceeds this limit, Google may not fully retrieve it. Very large pages may also load more slowly for users, but page size does not directly affect ranking.

0:39

🎥 Source video

Extracted from a Google Search Central video

⏱ 1h12 💬 EN 📅 16/12/2016 ✂ 11 statements


What you need to understand

What exactly is this size limit mentioned by Google?

In this statement Google gives no specific figure, but field observations converge around a limit of about 15 MB of raw HTML retrieved during crawling, a figure that Google's Googlebot documentation has since stated as well. Beyond this threshold, Googlebot stops fetching, and only the content retrieved up to that point is considered for indexing.

This limit only pertains to the initial HTML document, not external resources (CSS, JS, images). If your page consists of 20 MB of pure HTML (which is rare but possible on certain e-commerce or aggregation sites), Google may never see the last sections of your content.
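
To make the truncation risk concrete, here is a minimal sketch that checks whether a given fragment of a page would still fall inside the retrieved portion. The function name `visible_after_truncation` is hypothetical, and the 15 MB constant is an assumption taken from the figure above, not a value read from any Google API.

```python
# Assumed threshold, based on the ~15 MB figure discussed above.
LIMIT_BYTES = 15 * 1024 * 1024


def visible_after_truncation(html: str, marker: str, limit: int = LIMIT_BYTES) -> bool:
    """Return True if `marker` lies entirely within the first `limit` bytes of the HTML."""
    data = html.encode("utf-8")
    needle = marker.encode("utf-8")
    position = data.find(needle)
    if position == -1:
        return False  # the marker is not in the document at all
    return position + len(needle) <= limit


if __name__ == "__main__":
    # Tiny synthetic page and small limits so the example runs instantly.
    page = (
        "<html><body>"
        + "<p>filler</p>" * 100
        + "<div id='reviews'>Reviews</div></body></html>"
    )
    print(visible_after_truncation(page, "id='reviews'", limit=500))   # False
    print(visible_after_truncation(page, "id='reviews'", limit=5000))  # True
```

Run against the raw HTML of a heavy template with the default limit, this kind of check shows which sections (reviews, footer links, structured data) sit past the cut-off.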

Why does this limit exist technically?

Google manages hundreds of billions of pages and needs to allocate its crawl budget and server resources efficiently. Crawling and parsing a 30 MB document costs far more than handling a 100 KB one: roughly 300 times as many bytes to transfer and parse.

The limit also acts as a safeguard against misconfigured, dynamically generated pages that can produce endless content streams. Google prefers truncating such pages to blocking the crawl of an entire domain.

How does this differ from the classic crawl budget?

The crawl budget determines how many URLs Google explores on your site within a certain timeframe. The size limit concerns a single URL: even if Google decides to crawl it, it may not fully retrieve it.

In practical terms, you can have an excellent crawl budget but lose content on certain pages if they exceed the threshold. The two mechanisms are complementary and must be optimized separately.

Retrieval Limit: about 15 MB of raw HTML per page
No Direct Alert: Google does not notify you if a page is truncated
Indirect Ranking Impact: missing content = loss of semantic depth
Distinct from Crawl Budget: concerns the volume of data per URL, not the number of crawled URLs
External Resources Excluded: only the initial HTML is counted in this limit

SEO Expert opinion

Is this statement consistent with what is observed on the ground?

Yes, but with significant nuances. Cases of true truncation remain rare on conventional sites. They are mostly encountered on content aggregation platforms, giant marketplaces, or poorly configured sites that load thousands of lines of JSON inline into the DOM.

The catch: Google does not warn you when a page exceeds the limit. You have to check for yourself whether your heavy pages are fully indexed. Use the URL Inspection tool in Search Console and compare the HTML Google retrieved with your actual source.

What are the true practical consequences?

Mueller states that it does not directly affect ranking. Let's be clear: this is technically true but misleading in its implications. If Google does not see half of your content, it cannot extract entities, secondary keywords, or thematic depth. As a result, you rank lower without explicit penalties.

The real danger concerns rich product pages or long articles with hundreds of user reviews injected in HTML. If these sections are at the bottom of the page and the document exceeds 15 MB, Google will never see them. [To be verified]: Google could theoretically retrieve this content via JavaScript rendering, but there is no guarantee it will do so systematically on all heavy pages.

When does this limit truly become problematic?

Sites that inject large amounts of structured content in JSON-LD or microdata directly into the HTML can quickly reach critical sizes. Some poorly configured CMSs also generate pages with tens of thousands of lines of redundant markup.

Pay special attention to sites that render very long product lists server-side without pagination. If you generate 500 products in pure HTML on a single category page, each with heavy markup, the document can grow large enough to risk truncation. The solution lies in strict server-side pagination and controlled lazy-loading.
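
As an illustration of strict server-side pagination, the sketch below slices a product list so that each category page only renders a bounded number of items. The `paginate` helper and the page size of 48 are hypothetical choices, not values from the source.

```python
from math import ceil


def paginate(items: list, page: int, per_page: int = 48) -> tuple[list, int]:
    """Return the items for a 1-based `page` and the total number of pages."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    total_pages = max(ceil(len(items) / per_page), 1)
    start = (page - 1) * per_page
    return items[start:start + per_page], total_pages


if __name__ == "__main__":
    products = [f"product-{n}" for n in range(500)]
    page_items, total_pages = paginate(products, page=11)
    print(len(page_items), total_pages)  # 20 11
```

Each paginated URL then carries at most `per_page` products in its HTML, which keeps the document weight predictable regardless of catalogue size.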

If your strategic pages exceed 10 MB of HTML, immediately check in Search Console that Google retrieves the entire content. A gap of more than 20% between the source code and the retrieved code indicates a problem.
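
The 20% rule of thumb above can be expressed as a small calculation. This is a sketch only: `size_gap` and `looks_truncated` are hypothetical helper names, and both inputs are assumed to be bytes you have saved yourself (your server's HTML and the HTML copied from Search Console).

```python
def size_gap(source_html: bytes, retrieved_html: bytes) -> float:
    """Fraction of the source's bytes missing from the retrieved copy (0.0 = nothing missing)."""
    if not source_html:
        raise ValueError("source_html is empty")
    # Rendered HTML can be larger than the source, so never report a negative gap.
    missing = max(len(source_html) - len(retrieved_html), 0)
    return missing / len(source_html)


def looks_truncated(source_html: bytes, retrieved_html: bytes, threshold: float = 0.20) -> bool:
    """Apply the article's heuristic: a gap above `threshold` suggests a problem."""
    return size_gap(source_html, retrieved_html) > threshold


if __name__ == "__main__":
    source = b"a" * 100
    retrieved = b"a" * 70
    print(size_gap(source, retrieved))         # 0.3
    print(looks_truncated(source, retrieved))  # True
```

A byte-length comparison is deliberately crude; it flags candidates for a manual look rather than proving truncation.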

Practical impact and recommendations

How can you tell if your pages exceed the critical limit?

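Start with an HTML weight audit on your main templates. In Chrome DevTools, open the Network panel, filter on Doc, and read the size of the initial HTML document in the Size column. Look at the uncompressed (resource) size rather than the transferred size, since the limit applies to uncompressed data and gzip or Brotli can make a heavy document look small on the wire. Focus first on category pages, enriched product pages, and long articles with comments.

Then cross-check with the URL Inspection tool in Search Console. Run a live test, open the tested page to view the HTML as retrieved by Google, and compare its byte length with your source. Bear in mind that this is rendered HTML, so small differences from the raw source are normal. A significant gap indicates possible truncation.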
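
The same measurement can be scripted for a batch of templates. The sketch below uses only Python's standard library; `uncompressed_size` and `fetch_html_weight` are hypothetical helper names, and only gzip is requested to keep the decompression step simple.

```python
import gzip
import urllib.request


def uncompressed_size(body: bytes, content_encoding: str = "") -> int:
    """Return the size in bytes of the HTML once any gzip encoding is removed."""
    if content_encoding.strip().lower() == "gzip":
        return len(gzip.decompress(body))
    return len(body)  # no (or unhandled) encoding: count the bytes as received


def fetch_html_weight(url: str, timeout: float = 10.0) -> tuple[int, int]:
    """Fetch `url` and return (transferred_bytes, uncompressed_bytes) for the HTML document."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
        encoding = response.headers.get("Content-Encoding", "")
    return len(body), uncompressed_size(body, encoding)
```

Calling `fetch_html_weight` on one URL per template (category, product, article) gives a quick list of candidates to inspect in Search Console.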

What optimizations can be implemented to reduce HTML weight?

Move whatever you can out of the HTML document. Large structured data can often be reduced by keeping only the essential properties. Avoid injecting JSON-LD blocks hundreds of lines long if Google can obtain the information another way.

For generated content, prioritize client-side lazy-loading for reviews, comments, or long lists. Load a lightweight HTML skeleton, then enhance it via JavaScript after the first paint. Google can execute the JavaScript when it renders the page, and you keep control over the weight of the initial crawled HTML.
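
As a rough illustration of trimming structured data, the sketch below keeps only a short list of top-level properties from a JSON-LD object. The `ESSENTIAL` tuple is a hypothetical selection for a product page; which properties are actually needed depends on the rich result you are targeting.

```python
import json

# Hypothetical selection of properties to keep; adjust to your own markup.
ESSENTIAL = ("@context", "@type", "name", "offers", "aggregateRating")


def slim_json_ld(data: dict, keep: tuple = ESSENTIAL) -> dict:
    """Return a copy of a JSON-LD object containing only the top-level keys in `keep`."""
    return {key: value for key, value in data.items() if key in keep}


if __name__ == "__main__":
    product = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": "Example product",
        "aggregateRating": {"@type": "AggregateRating", "ratingValue": 4.5, "reviewCount": 200},
        "review": [{"@type": "Review", "reviewBody": "Lorem ipsum " * 50} for _ in range(200)],
    }
    before = len(json.dumps(product).encode("utf-8"))
    after = len(json.dumps(slim_json_ld(product)).encode("utf-8"))
    print(before, after)  # inline JSON-LD size in bytes before and after trimming
```

Here the aggregate rating stays in the markup while the 200 individual reviews are dropped from the inline JSON-LD; they can still be shown to users through lazy-loading.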

Should you panic if a page exceeds 15 MB?

No, but don't remain passive. Most sites will never encounter this threshold. If you reach it, it's often the symptom of poorly designed architecture rather than a legitimate need for volume. Rarely do 15 MB of pure HTML genuinely provide value.

However, some sectors (scientific data aggregation, ultra-rich marketplaces) can legitimately produce heavy pages. In that case, a technical overhaul is necessary to break down the content into separate indexable blocks, with a strict silo architecture.

Audit the HTML weight of strategic templates (categories, products, articles)
Compare the source code with the HTML retrieved by Google via Search Console
Externalize or lighten large JSON-LD and microdata
Implement client-side lazy-loading for secondary content (reviews, long lists)
Strictly paginate server-side product lists or content
Monitor size gaps in crawl logs if available
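
For the log-monitoring item, a minimal sketch could scan access logs for Googlebot requests that returned unusually large responses. It assumes the common "combined" log format, matches Googlebot by user-agent string only (which can be spoofed), and reads the logged byte count, which may reflect the compressed size. `heavy_googlebot_hits` is a hypothetical helper name.

```python
import re

# Matches the request, status, size, referrer, and user-agent fields of a combined-format line.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def heavy_googlebot_hits(lines, min_bytes: int) -> list:
    """Return (path, size) for Googlebot requests whose logged response size is >= min_bytes."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match or "Googlebot" not in match["agent"]:
            continue
        size = 0 if match["size"] == "-" else int(match["size"])
        if size >= min_bytes:
            hits.append((match["path"], size))
    return hits
```

Feeding it the lines of a log file with a threshold such as `10_000_000` lists the URLs worth checking first with the URL Inspection tool.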

The page size limit remains a fringe case for most sites, but it often reveals deeper architectural issues. If your audit uncovers critical pages near the threshold, a technical overhaul can be complex to undertake alone. Consulting a specialized SEO agency can provide a precise diagnosis and an action plan tailored to your technical stack, especially if your site relies on a logic of dynamic content generation or massive aggregation.

❓ Frequently Asked Questions

What is the exact page size limit for Google Fetch?

In this statement Google gives no official figure, but field observations place the limit at around 15 MB of raw HTML. Beyond that, Googlebot may truncate the retrieval of the content.

Does Google alert me if a page is too heavy to be crawled in full?

No, Google does not report this problem directly. You have to manually compare the HTML retrieved via Search Console with your source code to detect possible truncation.

Does this limit include external resources such as CSS and JavaScript?

No, only the initial HTML document is concerned. CSS files, JS files, images, and other external resources are not counted toward this 15 MB limit.

Can I work around this limit with JavaScript lazy-loading?

Yes, loading content via JavaScript after the initial DOM reduces the weight of the crawled HTML. Google executes the JS and will retrieve the enriched content, but this shifts the problem to the rendering budget.

Which types of sites are most at risk of exceeding this limit?

Marketplaces with hundreds of products per page, content aggregation sites, and platforms that inject large volumes of structured data as inline JSON-LD are the most exposed.
