
Official statement

Google will not index all found pages, especially if a site uses infinite parameter combinations. This can be normal, and it is often wise to restrict crawling with well-defined rules.
🎥 Source video

Extracted from a Google Search Central video

⏱ 57:34 💬 EN 📅 18/10/2018 ✂ 9 statements
Watch on YouTube (10:25) →
Other statements from this video (8)
  1. 8:11 Where should you place your structured data so it actually counts?
  2. 11:48 Is your slow server killing your crawl budget without your knowing it?
  3. 22:16 Are canonicals really treated like noindex tags by Google?
  4. 23:49 Does JavaScript really block Google from indexing your pages?
  5. 31:39 Should you merge your small sites into a single domain to improve your SEO?
  6. 34:39 Is Dynamic Rendering still a viable solution for handling JavaScript in SEO?
  7. 42:00 Should you really optimize all your images for Google Images?
  8. 52:11 Should you really fix every 404 error in Search Console?
TL;DR

Google deliberately refuses to index some pages, even after crawling them. Sites generating infinite parameter combinations often find themselves in this situation. This is normal behavior, not a bug: limiting crawling via robots.txt or Search Console parameters becomes a strategic necessity.

What you need to understand

Why doesn't Google index everything it crawls?

The crawl budget is just the first half of the problem. Google can very well crawl a page, process it, analyze its content, and then decide that it does not deserve a spot in the index. This decision is not arbitrary: it relies on perceived quality signals, duplication, and added value for the user.

E-commerce sites with dynamic filters perfectly illustrate this. Each combination of price, color, and size generates a unique URL. Google can technically crawl thousands of these variants, but indexing them all would dilute the index with nearly identical content. Thus, the engine sorts them out, and this sorting is ongoing.

What exactly is an infinite combination of parameters?

A URL with parameters becomes “infinite” when the possible values multiply without logical limit: endless pagination, stackable sort combinations (price+date+popularity), session IDs, advertising trackers, or worse, parameters that feed back into one another and generate circular URL chains.

Google detects these circular patterns and cuts them off. But the problem is that in the meantime, the engine has already consumed crawl budget on pages without value. The result? Your strategic pages might be crawled less often, or worse, not at all if the site is new or lacks authority.
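To make this concrete, here is a minimal sketch in Python of the kind of check you can run on a crawl export: group URLs by path and count distinct query-parameter combinations. The URLs and the threshold are illustrative assumptions, not a Google rule — a path that accumulates combinations faster than it adds real content is simply a candidate for crawl restriction.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def flag_parameter_bloat(urls, threshold=3):
    """Flag paths whose distinct query-parameter combinations exceed `threshold`.

    Heuristic sketch: heavy combination counts on one path often indicate
    filters, sorts, or trackers worth restricting from crawling.
    """
    combos = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        # Normalise parameter order so a=1&b=2 equals b=2&a=1
        params = tuple(sorted(parse_qsl(parts.query)))
        combos[parts.path].add(params)
    return {path: len(c) for path, c in combos.items() if len(c) > threshold}

# Hypothetical crawl export
urls = [
    "https://example.com/shirts?color=red&size=m",
    "https://example.com/shirts?size=m&color=red",   # same combo, reordered
    "https://example.com/shirts?color=blue",
    "https://example.com/shirts?sort=price",
    "https://example.com/shirts?sort=price&page=2",
    "https://example.com/shirts?sessionid=abc123",
    "https://example.com/about",
]
print(flag_parameter_bloat(urls))  # → {'/shirts': 5}
```

Note that the two reordered URLs collapse into one combination: normalising parameter order is exactly what Google’s duplicate detection does implicitly, and what your audit must do explicitly.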

When is it truly “normal” not to index everything?

Let’s be honest: not all sites need every URL to be indexed. Empty result pages, monthly archives on a blog dormant for three years, exotic filters nobody ever uses—all of it dead weight that benefits neither Google nor the user.

The problem arises when Google arbitrarily decides that a strategic page does not belong in the index. That's when Mueller's “it's normal” no longer holds. If your main categories or flagship product pages are excluded, it is no longer optimization; it's a warning sign. The nuance matters: accepting the non-indexing of accessory pages is rational; suffering the non-indexing of key pages is a structural problem.

  • Crawling ≠ indexing: a crawled page can be rejected by the index if it lacks value or differentiation.
  • Dynamic parameters are the primary cause of unnecessary URL bloat—Google detects and cuts them off.
  • Restricting crawling via robots.txt, canonicals, and noindex is often more effective than leaving Google to sort through it alone.
  • A site with thousands of indexable URLs but few backlinks or authority will see Google severely ration its crawl.
  • Non-indexing is only “normal” if it concerns accessory pages, not your strategic content.

SEO Expert opinion

Does this statement align with real-world conditions?

Yes, but with a huge gray area. On e-commerce sites with several hundred thousand URLs, it is regularly observed that Google indexes less than 30% of the crawled pages. Server logs confirm this: massive crawl, selective indexing. Nothing surprising here.

The problem is that Mueller does not specify the exact criteria that tip a page to the “indexable” or “rejected” side. Is it unique content? Click depth? Actual traffic to the URL? The number of internal links pointing to it? Probably all of these, but it cannot be verified, because Google remains intentionally vague about its thresholds.

When does this logic become counterproductive?

When Google applies this sorting logic to new or niche sites where each page has a specific search intent. I have seen themed blogs with 200 quality articles, well interlinked, of which 40% are never indexed. No infinite parameters, no duplication, just a perceived lack of overall domain authority.

Another problematic case: sites that fully optimize their SEO filters (clean URLs, unique content per combination, solid internal linking) and still get flagged by Google’s infinite-parameter detection. Google does not always distinguish a legitimate filter from parameter spam. The risk is real, and Mueller does not mention it.

Is it really necessary to proactively restrict crawling?

Yes, and it is non-negotiable for large sites. Allowing Google to freely crawl thousands of filter or sorting URLs wastes crawl budget that could have gone to your new product listings or in-depth articles.

But be careful: too aggressive a restriction can also hide strategic pages. I have seen sites block all pagination via robots.txt “for safety,” then wonder why their deep categories never rank. The right approach is a combination of noindex on unnecessary variations, canonicals on duplications, and declared URL parameters in Search Console. There is no one-size-fits-all solution here.

Warning: If Google indexes less than 50% of your crawled pages and this rate has stagnated for months, it is probably not “normal.” Check the content quality, the structure of your internal linking, and duplication signals before concluding that Google is doing its job well.

Practical impact and recommendations

How can you tell if Google is rejecting your strategic pages?

Go to Google Search Console, in the Coverage (now Page indexing) report. Check the “Crawled – currently not indexed” status (and its sibling, “Discovered – currently not indexed”). If you find main categories, best-selling product pages, or pillar articles there, it’s a red flag: Google has seen them but refuses to index them.

Then cross-check with your server logs. If Googlebot is crawling these pages heavily but they remain excluded from the index, the issue is not the crawl budget; it’s perceived quality or detection of duplication. At this stage, inspecting the URL via Search Console and understanding the exact reasons becomes a priority.
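A first-pass log check can be sketched in a few lines of Python. The sample lines assume the common combined log format; a real audit should also verify Googlebot via reverse DNS rather than trusting the user-agent string.

```python
import re
from collections import Counter

# Combined Log Format: ip - - [date] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "([^"]*)"')

def googlebot_hits(log_lines):
    """Count requests per URL whose user-agent claims to be Googlebot."""
    hits = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if m and "Googlebot" in m.group(2):
            hits[m.group(1)] += 1
    return hits

# Hypothetical access-log lines
sample = [
    '66.249.66.1 - - [10/Oct/2018:10:00:00 +0000] "GET /shirts?sort=price HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '66.249.66.1 - - [10/Oct/2018:10:00:05 +0000] "GET /shirts?sort=price HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.9 - - [10/Oct/2018:10:00:07 +0000] "GET /shirts HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(googlebot_hits(sample).most_common())  # → [('/shirts?sort=price', 2)]
```

Cross-reference the top of this list with your “not indexed” export: URLs that appear in both are where crawl budget is being burned for nothing.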

What concrete actions can you take to regain control?

First step: clean up unnecessary parameters. If your site generates URLs with session_id, utm_source, or redundant sort options, block them via robots.txt or declare them as “parameters to ignore” in Search Console (the legacy URL Parameters tool, if it is still available). No mercy for trackers or filters nobody ever uses.
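As a hedged illustration, a robots.txt along these lines blocks the purely technical parameters. The parameter names here are placeholders — match them to your own URL structure, and test the rules in Search Console before deploying, since a wrong wildcard can block strategic pages.

```
User-agent: *
# Session IDs and ad trackers: never worth crawling
Disallow: /*?*session_id=
Disallow: /*?*utm_source=
# Redundant sort variants of listing pages
Disallow: /*?*sort=
```

Google supports the `*` wildcard in robots.txt paths, so each rule matches the parameter wherever it appears in the query string.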

Second step: canonicalize intelligently. Each filter variation should point to a reference URL if the content is essentially the same. But if the filter generates truly different content (e.g., “red women's t-shirts” vs. “black men's t-shirts”), let it be indexable with enhanced unique content. Google will accept the differentiation if it is real.
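In markup, that distinction might look like this (the domain and filter URLs are hypothetical):

```html
<!-- /shirts?sort=price — same products, different order: consolidate to the reference URL -->
<link rel="canonical" href="https://example.com/shirts">

<!-- /shirts?color=red — genuinely distinct inventory: self-referencing canonical, stays indexable -->
<link rel="canonical" href="https://example.com/shirts?color=red">
```

Remember that canonicals are a hint, not a directive: Google will ignore them if the pages are too different, which is exactly why the indexable variant needs its own reinforced content.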

Should you block crawling or just indexing?

Both have their uses, but not in the same contexts. Blocking via robots.txt prevents any crawling, so no link equity (PageRank) flows through those URLs. Useful for completely useless pages (admin, internal search, etc.).

The noindex, on the other hand, allows Google to crawl and follow links, but denies indexing. Perfect for pagination pages or intermediate filters that serve internal linking but do not have standalone value. The choice depends on your architecture: if the page serves as a link hub, keep it crawlable but noindex.
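Concretely, a noindex is set in the page head, or via an HTTP header for non-HTML resources. One caveat: Google must be able to crawl the page to see the directive, so do not also block it in robots.txt.

```html
<!-- Crawlable, links followed, but excluded from the index -->
<meta name="robots" content="noindex, follow">
```

The HTTP-header equivalent is `X-Robots-Tag: noindex`, useful for PDFs and other files where no meta tag exists.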

  • Audit Search Console: export the “Crawled – currently not indexed” list and sort it by strategic importance.
  • Declare unnecessary URL parameters in Search Console or block them via robots.txt if they are purely technical.
  • Implement consistent canonicals on filter variations that generate nearly identical content.
  • Use noindex on intermediate pages (pagination, sorting) that aid internal linking but have no inherent SEO value.
  • Check your server logs to identify heavily crawled but never indexed URLs—often a sign of algorithmic detection.
  • Enhance the unique content on strategic filter pages to clearly differentiate them in Google's eyes.
Restricting crawling is not punishment; it’s optimization. Google gives you a limited budget, and it's up to you to spend it on what matters. If your site generates thousands of URLs through dynamic parameters, don’t let Google decide alone what deserves indexing. Take charge with robots.txt, canonicals, and clear parameter declarations.

These optimizations can quickly become complex to orchestrate alone, especially on e-commerce or editorial architectures with thousands of pages. Hiring a specialized SEO agency for a structural audit and a tailored action plan can help you avoid months of trial and error and ensure that your strategic pages remain visible in the index.

❓ Frequently Asked Questions

Does Google crawl every page it finds on a site?
No. Google follows links and discovers URLs, but then decides which ones deserve to be crawled, based on the crawl budget allocated to the site. Crawling is already selective, even before the indexing phase.
Why are some crawled pages never indexed?
Google may judge that a page lacks unique value, is too similar to others, or is part of an infinite-parameter pattern. Crawling does not guarantee indexing; it is an additional sorting step.
How do you effectively block unnecessary parameters without wasting crawl budget?
Declare the parameters in Search Console (URL Parameters tab, if it is still available) or block them via robots.txt. Canonicals are also an option if the pages must remain crawlable for internal linking.
Should you use noindex or robots.txt for filter pages?
Noindex if the page serves as an internal link hub but has no SEO value of its own. Robots.txt if it is completely useless and you want to save crawl budget without passing link equity.
How many non-indexed pages is considered normal?
There is no universal threshold. On an e-commerce site with filters, 50–70% non-indexation can be acceptable. On an editorial blog of 200 articles, even a 20% rejection rate should raise an alarm. It all depends on the nature of the URLs concerned.

