Official statement
Other statements from this video (9)
- 1:45 Why doesn't Google index content it fails to render in JavaScript?
- 5:45 Do Core Updates really change rankings continuously between two updates?
- 9:48 Is internal linking really enough to boost the ranking of all your pages?
- 10:20 Do blogs rank faster than static pages in Google?
- 14:37 Why does Google sometimes display m-dot URLs in desktop results?
- 23:54 Do prolonged 500 errors really make your pages disappear from Google's index?
- 29:06 Does a misconfigured Vary header really affect the indexing of your responsive site?
- 32:09 Should you really use the Change of Address tool to migrate subdomains?
- 53:20 Why can Google merge your JS pages when their meta tags are identical?
Google openly acknowledges that it does not aim to index every page of a large site. According to Mueller, this is normal behavior that should not alarm SEOs. The recommended solution? Improve internal linking to guide the crawl towards strategic pages instead of obsessively trying to get the entire catalog indexed.
What you need to understand
What does 'partial indexing' really mean according to Google?
Google no longer hides the fact that it indexes large sites selectively, and drastically so. We're talking about e-commerce catalogs, news portals, marketplaces — in short, any site exceeding several thousand pages. Partial indexing is not a bug; it's a deliberate strategy to optimize crawl budget allocation and index quality.
In practical terms? Google crawls, evaluates, and decides which pages deserve to be stored. The others are left aside, often permanently if nothing changes. This filtering relies on multiple signals: internal and external links, content freshness, user engagement, perceived duplication, and depth in the site structure.
Why is this behavior considered 'normal'?
Mueller uses the word 'normal' to calm the panic of site owners seeing 30%, 50%, or even 70% of their pages excluded from the index. For Google, massively indexing low-value pages — removed product listings, auto-generated content, minimal variations — would pollute the index and degrade the relevance of results.
The engine, therefore, prioritizes quality over quantity. A site with 100,000 pages doesn't need all of them indexed if only 20,000 generate qualified traffic. Google accepts this compromise and asks SEOs to do the same. Let's be honest: most large sites carry thousands of dead or redundant URLs that no one misses in the SERPs.
How does internal linking influence crawling and indexing?
Internal linking remains the most direct lever to signal to Google which pages really matter. The more a page receives internal links from already well-crawled pages, the more likely it is to be visited regularly and considered a priority. This is an obvious internal PageRank signal: an orphaned page or one buried six clicks deep from the home page stands no chance.
Mueller implicitly refers to the optimization of the crawl budget: if Google has a limited number of requests per day on your site, it might as well dedicate them to strategic pages. A well-crafted internal linking structure — contextual links, clear hierarchy, breadcrumbs, targeted XML sitemaps — directs the bot towards high-value content and reduces waste on satellite pages.
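To make the internal PageRank idea concrete, here is a minimal sketch using Python and the networkx library. The URLs and links are purely illustrative; in practice, you would export the real edge list from a crawler such as Screaming Frog.

```python
# Minimal sketch: approximate internal PageRank over a hypothetical link graph.
# Pages and links below are illustrative placeholders, not a real site.
import networkx as nx

# Each tuple is (source_page, target_page) for one internal link.
edges = [
    ("/", "/category/shoes"),
    ("/", "/category/bags"),
    ("/category/shoes", "/product/sneaker-a"),
    ("/category/shoes", "/product/sneaker-b"),
    ("/category/bags", "/product/tote-a"),
    ("/blog/sneaker-guide", "/product/sneaker-a"),  # contextual editorial link
]

graph = nx.DiGraph(edges)
scores = nx.pagerank(graph, alpha=0.85)

# Pages with the highest internal PageRank are the ones a crawler
# is most likely to revisit and treat as priorities.
for url, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {url}")
```

Pages that accumulate links from well-linked hubs rise to the top of the list, while orphaned or deeply buried URLs score near the damping floor — which mirrors how a page buried six clicks deep gets deprioritized.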
- Partial indexing is a policy embraced by Google, not a technical malfunction.
- Internal linking plays a crucial role in prioritizing crawling and eligibility for indexing.
- Google prioritizes index quality over exhaustive coverage, especially on large sites.
- Isolated, duplicated, or low-value pages are naturally excluded from the index.
- A site with 100,000 pages doesn't need 100,000 indexed pages to perform well in SEO.
SEO Expert opinion
Is this statement consistent with on-the-ground observations?
Absolutely. Crawl audits on medium and large sites systematically show massive gaps between crawled pages, indexable pages, and actually indexed pages. Google Search Console regularly reports thousands of URLs as 'Crawled, currently not indexed' or 'Discovered, currently not indexed'. This is no longer an anomaly; it's a recurring pattern.
E-commerce sites with several tens of thousands of product listings have long observed that Google only deigns to index a fraction of their catalog — often the best-selling, most-linked, or most recent references. The rest sits dormant in crawl logs without ever appearing in the SERPs. And there's the rub: Mueller's statement validates this behavior without providing specific criteria to identify which pages will be excluded. [To be verified]: Google remains vague regarding the precise weight of internal linking versus other signals like external popularity or click-through rate.
What are the blind spots of this statement?
Mueller does not specify the threshold at which a site is considered 'large scale'. 5,000 pages? 50,000? 500,000? This gray area leaves SEOs uncertain. A site with 10,000 pages and 40% of them indexed — is that normal or problematic? Impossible to say without a benchmark.
Another point: improving internal linking is presented as a universal solution, but it doesn't resolve everything. A site with impeccable linking can still see entire sections of its catalog ignored if Google deems the content too similar, of low quality, or not in demand by users. Internal linking is a facilitator, not a guarantee of indexing. [To be verified]: the actual impact of a revamp of linking on indexing remains difficult to isolate from other parallel SEO optimizations.
In what cases does this rule not fully apply?
News sites and media partially escape this logic. Google indexes thousands of articles massively and quickly each day on major news portals, even if their time at the top of results is short. Freshness and immediate demand (trending topics, breaking news) take precedence over internal linking. An article published at 6 AM can be indexed within minutes, even before any internal link points to it.
Conversely, technical or B2B sites with few pages but ultra-specialized content can see all their URLs indexed, regardless of their depth in the structure. Google adjusts its behavior based on site typology and user demand. Mueller's statement mainly targets large transactional or informational volumes with high redundancy — not niche catalogs with low volume.
Practical impact and recommendations
How can I optimize internal linking to promote indexing?
First instinct: map your current crawl budget using server logs or a tool like Botify, OnCrawl, or Screaming Frog Log Analyzer. Identify which pages Google visits most often, which ones it ignores, and how frequently it returns. This diagnosis reveals the dead zones of your architecture and orphaned pages that receive no internal links.
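If you don't have a log analyzer at hand, a first pass can be as simple as the sketch below, which counts Googlebot hits per URL in a standard Apache/Nginx access log. The file name is an assumption, and a rigorous audit should verify Googlebot by reverse DNS rather than trusting the user-agent string.

```python
# Minimal sketch: count Googlebot hits per URL in a common/combined-format
# access log. Verification via user agent only is a simplification; a real
# audit should confirm Googlebot with a reverse DNS lookup.
import re
from collections import Counter

LOG_LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+"')

hits = Counter()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        if "Googlebot" not in line:
            continue
        m = LOG_LINE.search(line)
        if m:
            hits[m.group("path")] += 1

# URLs that never appear here are your crawl dead zones.
for path, count in hits.most_common(20):
    print(f"{count:6d}  {path}")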
Next, reinforce the linking towards strategic pages — those that convert, rank well, or have untapped traffic potential. Use contextual links within editorial content, not just menus or footers, which Google may treat as noise. Each link should have semantic meaning, with a descriptive anchor. And this is where many sites go wrong: they add internal links mechanically, without thematic coherence, diluting the signal rather than strengthening it.
What mistakes should be absolutely avoided?
Don't fall into the trap of overlinking: stuffing every page with 50 internal links to ancillary pages creates noise and dilutes internal PageRank. Google may consider these links irrelevant and ignore them. Always prioritize quality over quantity — 5 well-placed and contextual internal links are worth more than 20 generic links in a sidebar.
Another classic mistake: neglecting to clean up unnecessary URLs. If you want Google to index your important pages, start by removing or noindexing low-value ones — infinite faceted filters, empty tag pages, old campaigns, expired content. The fewer pages you offer for crawling, the better Google allocates its budget to those that matter. A lean, structured site crawls better than a bloated, chaotic one.
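As a rough starting point, a sketch like the following can flag candidate URLs from a plain-text crawl export (one URL per line). The patterns are hypothetical examples; adapt them to your own faceted-navigation and tag-page conventions before noindexing anything.

```python
# Minimal sketch: flag likely low-value URLs in a crawl export.
# The patterns are illustrative assumptions, not universal rules.
import re

LOW_VALUE_PATTERNS = [
    re.compile(r"[?&](?:color|size|sort|price)="),  # faceted filter variants
    re.compile(r"/tag/[^/]+/?$"),                   # thin tag pages
    re.compile(r"/campaign/\d{4}/"),                # expired campaign pages
]

with open("crawled_urls.txt", encoding="utf-8") as f:
    for url in (line.strip() for line in f):
        if any(p.search(url) for p in LOW_VALUE_PATTERNS):
            print(f"candidate for noindex/removal: {url}")
```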
How can I check if my site meets Google's expectations?
Use Google Search Console to analyze coverage and indexing reports. Identify URLs marked 'Crawled, currently not indexed': this is where internal linking can make a difference. If these pages are strategic, link to them from already well-crawled pages and monitor progress over 4 to 6 weeks.
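To track that progress objectively, you can diff two exports of the 'Crawled, currently not indexed' report taken a few weeks apart. The file names and the 'URL' column header below are assumptions; adjust them to match your actual Search Console exports.

```python
# Minimal sketch: compare two Search Console report exports taken weeks apart.
# File names and the "URL" column header are assumptions about the export.
import csv

def load_urls(path):
    with open(path, encoding="utf-8") as f:
        return {row["URL"] for row in csv.DictReader(f)}

before = load_urls("not_indexed_week0.csv")
after = load_urls("not_indexed_week6.csv")

print(f"resolved (left the report): {len(before - after)}")
print(f"still excluded:             {len(before & after)}")
print(f"newly excluded:             {len(after - before)}")
```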
Run a complete crawl with Screaming Frog or Botify to identify orphaned pages — those accessible via the XML sitemap but without any internal links. Google discovers them through the sitemap, but without internal relays, they remain at the bottom of the crawl pile. Integrate them smartly into your architecture via category pages, related blog posts, or 'similar products' blocks.
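Detecting those orphans can be automated: the sketch below compares the URLs declared in sitemap.xml against a crawler export of every internally linked URL. Both input file names are assumptions.

```python
# Minimal sketch: orphaned pages = URLs in the XML sitemap that receive
# no internal links. "linked_urls.txt" is an assumed crawler export.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("sitemap.xml")
sitemap_urls = {loc.text.strip() for loc in tree.findall(".//sm:loc", NS) if loc.text}

with open("linked_urls.txt", encoding="utf-8") as f:
    linked_urls = {line.strip() for line in f if line.strip()}

for url in sorted(sitemap_urls - linked_urls):
    print(f"orphan: {url}")
```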
- Analyze server logs to identify under-crawled or ignored areas by Googlebot.
- Reinforce internal linking towards strategic pages with contextual links and descriptive anchors.
- Remove or noindex unnecessary, duplicated, or low-value pages.
- Eliminate orphaned pages by integrating them into the structure through relevant links.
- Monitor indexing progress in Google Search Console over 4 to 6 weeks post-optimization.
- Avoid overlinking: prioritize 5 relevant internal links over 20 generic links.
❓ Frequently Asked Questions
Is Google indexing fewer pages than before on large sites?
Is internal linking enough to guarantee that a page gets indexed?
Should you be alarmed if 50% of your site is not indexed?
From how many pages is a site considered 'large scale'?
Do XML sitemaps help force the indexing of ignored pages?
🎥 From the same video (9 insights)
Other SEO insights extracted from this same Google Search Central video · duration 56 min · published on 04/09/2019
🎥 Watch the full video on YouTube →