Why does your indexed page count never match your total URLs?

Official statement

It is normal for the number of indexed URLs not to always match the total number of pages on a site. A substantial difference may indicate issues with duplication or mismanaged parameterized URLs.

Source video

Extracted from a Google Search Central video (statement at 38:08)

Duration 1h04 · EN · published 29/07/2016 · 10 statements

Other statements from this video (9)

1:43 How is PageRank really passed through redirects?
4:43 Do redesigns and mass redirects really kill your SEO visibility?
4:50 Should you submit a temporary sitemap with both old and new URLs during a migration?
6:25 Do 3xx redirects really lose PageRank?
7:45 Should you really return a 404 on expired content pages rather than redirecting to the homepage?
13:27 Should you really put nofollow on all affiliate links?
19:43 Should you really use rel=canonical during an A/B test?
53:28 Does footer text really help your SEO, or does Google ignore it?
61:36 Should you really host your SEO blog on a subdomain rather than within the main site?

What you need to understand

Why does Google never index 100% of a site?

Google does not promise to index every URL on a domain. The algorithm makes choices: it evaluates quality, detects duplicates, ignores unnecessary parameters, and filters out what seems irrelevant for the search experience.

A site with 10,000 pages may have only 7,500 indexed without any issue. The gap is structural, not incidental. Google does not seek completeness; it seeks relevance.

What constitutes a 'substantial' difference according to Google?

Mueller provides no specific threshold. A substantial difference should be interpreted on a case-by-case basis: a 10% gap on a site with 500 pages is not the same as 60% on a site with 50,000.

The red flag appears when the site structure does not explain the gap. If you have 2,000 unique editorial pages and only 800 are indexed, that gap is not normal. This is where duplication or parameters come into play.
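To make the figures above concrete, here is a minimal sketch that expresses the gap as a share of known URLs. The function name indexing_gap is hypothetical, and the result is only an indicator: as noted above, Mueller gives no threshold, so the number still has to be read against the site's structure.

```python
def indexing_gap(total_urls: int, indexed_urls: int) -> float:
    """Return the share of known URLs that are not indexed (0.0 to 1.0)."""
    if total_urls <= 0:
        raise ValueError("total_urls must be positive")
    if not 0 <= indexed_urls <= total_urls:
        raise ValueError("indexed_urls must be between 0 and total_urls")
    return (total_urls - indexed_urls) / total_urls


# Figures taken from the examples in the text.
print(f"{indexing_gap(10_000, 7_500):.0%}")  # 25%
print(f"{indexing_gap(2_000, 800):.0%}")     # 60%
```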

How does Search Console reflect this reality?

The 'Pages' report in Search Console splits URLs into two groups: indexed pages and non-indexed pages, each listed with a reason. The reasons reveal Google's logic, for example 'Duplicate without user-selected canonical', 'Alternate page with proper canonical tag', 'Crawled - currently not indexed', and 'Blocked by robots.txt'.

These statuses are not fixed. An excluded page may be indexed later if its content evolves or if the internal linking changes. Indexing is not a one-time verdict; it fluctuates with crawl budget and with the value Google perceives in the page.

The gap between indexed URLs and total URLs is normal and structural, not an anomaly.
A substantial difference signals duplication, mismanaged parameters, or technical issues.
Search Console provides the precise exclusion reasons for each non-indexed URL.
Indexing fluctuates over time based on crawl budget and perceived quality.
Regularly monitoring the gap allows for early detection of issues before they impact traffic.

SEO Expert opinion

Does this statement truly reflect real-world practice?

On this point, Mueller aligns with what we observe. No large e-commerce or media site achieves 100% indexing. Facets, pagination pages, and product variants naturally create duplicates that Google filters.

The problem arises when the gap persists without explanation. A site with 5,000 product listings and only 1,200 indexed does not merely have a 'normal gap.' Either the content is too similar across listings, or URL parameters (sorting, filtering) generate massive duplicates that Google ignores.
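The way sorting and filtering parameters multiply URLs can be illustrated with a short sketch. It strips parameters that only change presentation and groups the URLs that collapse onto the same page. The parameter names in PRESENTATION_PARAMS and the example.com URLs are assumptions for illustration; the right list depends on each site, and this is not how Google itself decides what is a duplicate.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that change presentation only, not content.
PRESENTATION_PARAMS = {"sort", "order", "view", "page_size"}


def strip_presentation_params(url: str) -> str:
    """Drop presentation-only parameters and sort the remaining ones."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in PRESENTATION_PARAMS
    ]
    kept.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_variants(urls):
    """Map each normalized URL to the crawled variants that collapse onto it."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_presentation_params(url)].append(url)
    return dict(groups)


crawled = [
    "https://example.com/shoes",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?sort=price&order=desc",
    "https://example.com/shoes?color=red",
    "https://example.com/shoes?color=red&sort=name",
]
groups = group_variants(crawled)
for base, variants in groups.items():
    print(base, len(variants))
```

Groups with many variants are the first candidates for a canonical tag or a noindex rule.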

What nuances should we consider regarding this position?

Mueller remains vague about what constitutes a 'substantial' gap, and Google provides no industry benchmark or standard ratio. A 30% gap might be acceptable for a media site with lots of tags and filters, but alarming for a showcase site with 50 pages.

Another point: deliberate exclusion is not always a problem. If you intentionally block the indexing of internal search pages or filters via meta robots, the gap is intentional. Search Console will show these pages as excluded, but it is a strategic decision, not an error.

When does this rule not apply?

On a site with fewer than 100 unique editorial pages, a 30% gap becomes suspicious. Google should index nearly all of a small well-structured site unless canonical or noindex tags deliberately block certain URLs.

One-page sites or landing pages optimized for conversion rarely face indexing problems. The structural gap mainly concerns sites with large inventories: e-commerce, classifieds, content aggregators, and media with extensive archives.

Practical impact and recommendations

What concrete steps should you take to monitor this gap?

Set up a weekly alert on the 'Pages' report in Search Console. Monitor the volume of indexed URLs and the distribution of exclusions by reason. A sharp drop of 20% in a week signals a technical issue, such as a robots.txt file modified by mistake, a canonical tag implemented incorrectly after a migration, or a slow server that hinders crawling.
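The weekly check described above can be sketched as a comparison between consecutive counts. The sketch assumes you record the indexed-page count yourself once a week; the sample counts are invented, and the 20% threshold simply mirrors the figure used in the text.

```python
def sharp_drops(weekly_counts, threshold=0.20):
    """Return (week_index, relative_drop) for each week-over-week drop >= threshold."""
    alerts = []
    for i in range(1, len(weekly_counts)):
        previous, current = weekly_counts[i - 1], weekly_counts[i]
        if previous <= 0:
            continue  # no meaningful ratio against an empty baseline
        drop = (previous - current) / previous
        if drop >= threshold:
            alerts.append((i, drop))
    return alerts


# Invented weekly counts of indexed pages, oldest first.
counts = [7500, 7480, 7510, 5900, 5950]
for week, drop in sharp_drops(counts):
    print(f"week {week}: indexed pages fell by {drop:.1%}")
```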

Regularly compare the submitted XML sitemap to the number of indexed pages. If you submit 8,000 URLs and only 3,000 are indexed, investigate the exclusion reasons. Google specifies exactly why it ignores each URL: duplicate, alternative canonical, crawled but not indexed.
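A minimal sketch of that comparison: it reads the URLs from a sitemap and lists those absent from a set of indexed URLs. The indexed set is assumed to come from an export you have prepared yourself (the sample values are invented), and the sketch handles a plain urlset sitemap only, not a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> values found in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def not_indexed(submitted, indexed):
    """Return submitted URLs that are missing from the indexed set."""
    return sorted(submitted - indexed)


SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/guide</loc></url>"
    "<url><loc>https://example.com/products/blue-widget</loc></url>"
    "</urlset>"
)
# Invented stand-in for a list of indexed URLs exported from your own tooling.
indexed = {"https://example.com/", "https://example.com/guide"}

submitted = sitemap_urls(SAMPLE_SITEMAP)
missing = not_indexed(submitted, indexed)
print(f"{len(missing)} of {len(submitted)} submitted URLs are not indexed")
for url in missing:
    print(url)
```

The missing URLs are the ones to look up in the Pages report to find their exclusion reason.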

What mistakes should you absolutely avoid?

Do not submit all your URLs in the sitemap. A sitemap cluttered with duplicate or low-value pages dilutes the crawl budget and muddles Google's priorities. Focus on strategic pages: main product listings, in-depth articles, conversion landing pages.

Avoid canonicalizing unique pages by mistake. A misdirected canonical tag signals that the page has no unique value, which leads Google to exclude it from the index. Always verify that each canonical points to the page itself or to a true master version, never to a generic URL.
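The canonical check can be sketched with Python's standard HTML parser. The sketch assumes you already have the page's HTML as a string; it reports whether the canonical is missing, declared more than once, self-referencing, or pointing elsewhere. The function name canonical_status and the sample URLs are hypothetical, and the comparison only ignores a trailing slash, so treat it as a first-pass filter.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_status(page_url, html):
    """Classify the canonical declaration of a page."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    if not finder.canonicals:
        return "missing"
    if len(finder.canonicals) > 1:
        return "multiple"
    # Resolve relative hrefs against the page URL before comparing.
    target = urljoin(page_url, finder.canonicals[0])
    if target.rstrip("/") == page_url.rstrip("/"):
        return "self"
    return "points to " + target


page = "https://example.com/products/blue-widget"
generic = '<html><head><link rel="canonical" href="/"></head></html>'
print(canonical_status(page, generic))  # points to https://example.com/
```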

How can you fix an abnormally high gap?

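Identify the dominant exclusion reasons in Search Console. If duplicate-related reasons account for 40% of exclusions, audit your facets, filters, and parameterized pages. Keep non-strategic combinations out of the index with a meta robots noindex tag, or stop them from being crawled via robots.txt. Do not combine the two on the same URL: robots.txt blocks crawling, not indexing, so Google cannot see a noindex tag on a page it is not allowed to fetch.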
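Finding the dominant reason amounts to a simple tally. The sketch below assumes you have one exclusion reason per non-indexed URL, for instance from a manual export; the sample list is invented and only reproduces the 40% case mentioned above.

```python
from collections import Counter


def reason_shares(reasons):
    """Return (reason, share) pairs, most frequent reason first."""
    counts = Counter(reasons)
    total = sum(counts.values())
    return [(reason, n / total) for reason, n in counts.most_common()]


# Invented sample: one exclusion reason per non-indexed URL.
sample = (
    ["Duplicate without user-selected canonical"] * 4
    + ["Crawled - currently not indexed"] * 3
    + ["Alternate page with proper canonical tag"] * 2
    + ["Blocked by robots.txt"] * 1
)
shares = reason_shares(sample)
for reason, share in shares:
    print(f"{share:.0%}  {reason}")
```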

For 'Crawled - currently not indexed', improve internal linking and content quality. Google crawls these pages but decides not to index them, a signal that it does not consider the content worth indexing. Enrich them or 301-redirect them to stronger pages.

Set up a weekly alert on the Pages report in Search Console
Compare submitted sitemap vs. indexed pages to detect massive discrepancies
Audit the dominant exclusion reasons (duplicate, canonical, crawl blocked)
Clean the XML sitemap: only submit strategic high-value URLs
Ensure consistency of canonical tags across the site
Keep non-strategic facets and filters out of the index with noindex, or block their crawling via robots.txt

The gap between total pages and indexed pages is structural. It only becomes problematic when it reveals massive duplication or mismanaged URL parameters. Regularly monitor Search Console, clean your sitemaps, and focus the crawl budget on your high-value pages. If the audit reveals complex issues with canonicalization or deep duplication, consulting a specialized SEO agency can expedite diagnosis and ensure a lasting correction tailored to your specific architecture.

❓ Frequently Asked Questions

Is a 30% gap between total pages and indexed pages normal?

It depends entirely on the site's structure. For an e-commerce site with facets and filters, yes. For a showcase site with 50 editorial pages, no. The gap must be assessed case by case, in context.

Search Console shows 'Crawled - currently not indexed' for 40% of my pages. What should I do?

Google crawls these pages but judges their content insufficient for indexing. Improve the editorial quality, strengthen internal linking, or 301-redirect them to stronger pages if they have no value of their own.

Should you submit every URL of a site in the XML sitemap?

No. A cluttered sitemap dilutes crawl budget and muddles Google's priorities. Submit only strategic pages: main product listings, in-depth articles, conversion landing pages.

How can you tell whether an indexing gap is caused by duplication?

Check the Pages report in Search Console. If the 'Duplicate without user-selected canonical' or 'Alternate page with proper canonical tag' reasons make up a large share of exclusions, the problem comes from overly similar content or misconfigured canonical tags.

Does a sudden drop in the number of indexed pages always signal a serious problem?

Not always, but it is a warning sign. Check robots.txt, canonical tags, server speed, and crawl logs. A migration, a configuration change, or a technical incident can explain the drop.
