Why does Google crawl pages it never adds to its index?

Official statement

The statuses 'URL crawled but not indexed' and 'URL discovered but not indexed' should be treated essentially the same way. Just because Google crawled a page doesn't mean it will automatically be indexed. Google doesn't index all content.

🎥 Source video

Extracted from a Google Search Central video

Official statement from February 4, 2022

TL;DR

Google makes no meaningful distinction between 'URL crawled but not indexed' and 'URL discovered but not indexed'. Crawling does not guarantee indexation: Google actively chooses what it indexes. If your pages fall into these categories, the most likely explanation is that Google does not consider them valuable enough to index or, for discovered URLs, to crawl yet.

What you need to understand

What does this 'crawled but not indexed' status actually mean?

Google explores your page, reads its content, understands what it's about… and decides not to add it to its index. Crawling is just a technical step, not a quality validation. Googlebot visits, takes notes, and leaves without making any promises.

What Google's John Mueller is saying here is that there is no hierarchy between 'crawled' and 'discovered'. In both cases, Google knows your page exists and has chosen to ignore it for indexation purposes. The distinction is purely operational: one has been visited, the other was simply spotted through links or a sitemap.

Why does Google refuse to index certain crawled pages?

Because Google doesn't index everything. That's the key sentence in this statement. The index isn't a passive repository where everything that gets crawled ends up stored. It's an active selection based on perceived quality, relevance, duplication, and resource allocation to your site.

A page can be technically accessible, with no server errors, no noindex tag… and still be excluded. Google filters according to its own criteria — some transparent (weak content, duplication), others opaque (crawl budget, algorithmic priority).

What's the difference between 'crawled' and 'discovered' from an indexation standpoint?

In practice? Almost none. A discovered URL has been spotted (external link, sitemap, internal reference) but hasn't been visited by Googlebot yet. A crawled URL has actually been fetched: the content was downloaded and analyzed.

But Mueller insists: both should be treated the same way. If a page stays 'discovered' for months, it's because Google doesn't see enough value in it to allocate crawl budget. If it shifts to 'crawled' without ever being indexed, it means the content analysis didn't change that decision.

Crawling guarantees nothing — it's a technical step, not a validation.
Google indexes selectively based on quality criteria and resource allocation.
Both the 'crawled' and 'discovered' not-indexed statuses signal the same problem: your page doesn't bring enough value in Google's eyes.
No magic action will force indexation — you need to understand why Google is rejecting these pages.

SEO Expert opinion

Is this statement consistent with what we observe in the field?

Absolutely. SEO professionals monitoring Search Console regularly see pages crawled for months without ever being indexed. This isn't a bug, it's a Google choice. The myth 'if Googlebot visits your page, it will eventually be indexed' has been dead for a long time — but Mueller kills it officially once more.

What stands out is that this statement confirms what many suspected: Google doesn't publish all its selection criteria. We know duplication, thin content, and crawl budget play a role, but the exact thresholds remain a gray zone. [To verify]: How much mediocre content will Google tolerate before excluding a page from the index?

What nuances should we add to this claim?

Mueller says 'treat similarly', but the order of operations still matters. A 'discovered' page might simply lack crawl budget — increasing internal linking or update frequency could be enough to move it to 'crawled' status. A page already crawled but not indexed has already failed quality evaluation — there, you need to rework the content itself.

Another point: Google never specifies how long it leaves a page in 'crawled non-indexed' status before considering it permanently out of play. We observe fluctuations: a page might stay in limbo for 6 months, then suddenly be indexed after a content update or a gain in backlinks.

In what cases can these statuses be safely ignored?

If your 'crawled non-indexed' pages are filter pages, old pagination, empty templates, or unnecessary URL variations… that's normal and actually desirable. Google is cleaning up on your behalf. The problem emerges when strategic pages — product sheets, important blog articles — fall into this category.

Let's be honest: everyone has non-indexed pages. The goal isn't reaching 100% indexation, but ensuring the pages that matter pass the threshold. If 80% of your non-indexed pages are noise (filters, old versions, technical pages), you can sleep soundly.

Warning: If pages with strong SEO potential remain 'crawled non-indexed' despite unique content and backlinks, it might signal that Google doesn't value your domain as a whole. An algorithmic trust problem can affect indexation far beyond the content of any individual page.

Practical impact and recommendations

What should you concretely do with these non-indexed pages?

First, audit. Export the list of 'crawled' and 'discovered' non-indexed URLs from Search Console. Sort them by category: strategic pages, secondary pages, technical pages. If 90% is noise, delete them or block them in robots.txt — there's no point wasting crawl budget.
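
The sorting step can be sketched in a few lines of Python. This is a minimal illustration, not a definitive tool: the column name `URL`, the three category labels, and the regular expressions in `RULES` are all assumptions to adapt to your own export and URL structure.

```python
import csv
import io
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical patterns: adapt them to your own URL structure.
RULES = [
    # Filters, sorting, pagination, tag archives and feeds are treated as noise.
    ("technical", re.compile(r"[?&](sort|filter|page)=|/tag/|/feed/?$")),
    # Single product or blog pages are treated as strategic.
    ("strategic", re.compile(r"^/(products?|blog)/[^/?]+/?$")),
]


def categorize(url):
    """Return the first matching category, or 'secondary' if no rule matches."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    for label, pattern in RULES:
        if pattern.search(target):
            return label
    return "secondary"


def summarize(csv_text, url_column="URL"):
    """Count URLs per category in CSV text that has a URL column (assumed name)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(categorize(row[url_column]) for row in reader)
```

Calling `summarize()` on the exported file's text gives the share of each category, which is enough to tell whether the non-indexed pages are mostly noise or include strategic URLs. The technical rule is checked first so that a product URL carrying a filter or pagination parameter is counted as noise rather than as a strategic page.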

For pages that should be indexed: analyze content quality. Is it thin content? Is there duplication with other pages? Does the text provide unique value? Google doesn't index mediocre pages out of charity — if content is weak, rework it or merge it with a stronger page.

What mistakes should you avoid with these Search Console statuses?

Don't submit the same URL over and over through the 'Request Indexing' tool. If Google crawled your page and chose not to index it, resubmitting 10 times won't change anything; you're just wasting time. The tool doesn't force indexation, it only speeds up crawling.

Another classic trap: believing that adding external backlinks will mechanically solve the problem. Yes, backlinks help… if the content deserves it. If a page stays crawled non-indexed despite incoming links, it's because Google thinks it adds nothing, even with popularity signals.

How should you prioritize actions on these pages?

Start with pages already generating organic traffic or conversions through other channels. If a page converts well via Google Ads or social media but isn't indexed in SEO, that's an obvious lever. Rework the content, add differentiating elements, strengthen internal linking.

Next, tackle pages targeting high-intent keywords. A non-indexed product page in a low-competition niche has more value than a generic blog article on an already saturated topic. Prioritize by business impact, not by volume of pages to handle.

Export 'crawled' and 'discovered' non-indexed URLs from Search Console
Categorize these pages: strategic, secondary, technical/useless
Delete or block pages with no value to free up crawl budget
Audit content quality for strategically important non-indexed pages
Eliminate internal duplication and enrich unique content
Strengthen internal linking to these pages from indexed pages with high authority
Avoid spamming the 'Request Indexing' tool — prioritize content improvement
Monitor status evolution over 3-6 months after optimizations
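
For the monitoring step in the checklist above, one simple approach is to compare two exports taken a few months apart. The sketch below assumes you have already loaded each export into a dictionary of URL to status label; the function name and the "not in report" label are hypothetical.

```python
def status_changes(before, after):
    """Map each URL whose status changed to an (old, new) pair.

    `before` and `after` are dicts of URL -> status label built from two
    exports. A URL missing from `after` is reported as "not in report",
    which may mean it was indexed or that it was removed; check manually.
    """
    changes = {}
    for url, old_status in before.items():
        new_status = after.get(url, "not in report")
        if new_status != old_status:
            changes[url] = (old_status, new_status)
    return changes
```

A URL that disappears from the non-indexed exports is only a hint, so confirm its actual state in Search Console before counting it as a win.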

Indexation is not a right; it's a quality validation that Google grants according to its own criteria. If your strategic pages remain blocked despite your efforts, it may be wise to consult a specialized SEO agency for a thorough audit: some indexation blockers reveal structural issues (architecture, technical duplicate content, silent penalties) that an experienced external eye detects faster. Personalized support also helps prioritize actions for your industry and avoid mistakes that cost time and crawl budget.

❓ Frequently Asked Questions

How long does Google leave a page in 'crawled but not indexed' status before giving up on it for good?

Google doesn't communicate a precise time frame. Pages can stay in 'crawled but not indexed' status for months or even years, then get indexed after a content update or a gain in backlinks. Nothing is final, but the longer a page goes without changes, the lower its chances of being indexed.

If I improve the content of a 'crawled but not indexed' page, will Google automatically recrawl it?

Not necessarily. Google recrawls according to its own schedule and crawl budget priorities. To speed things up, update the last-modified date in the XML sitemap, strengthen internal linking to the page, or use the 'Request Indexing' tool once (not repeatedly).
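
As an illustration of the sitemap step, here is a minimal sketch that sets the `lastmod` value for one URL. It assumes a standard sitemaps.org `urlset` file; the function name `set_lastmod` is hypothetical, and the date should only be changed when the page content has genuinely been updated.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Keep the default namespace unprefixed when serializing.
ET.register_namespace("", NS)


def set_lastmod(sitemap_xml, page_url, date_iso):
    """Return sitemap XML with <lastmod> set to date_iso for page_url."""
    root = ET.fromstring(sitemap_xml)
    for url in root.findall(f"{{{NS}}}url"):
        loc = url.find(f"{{{NS}}}loc")
        if loc is not None and (loc.text or "").strip() == page_url:
            lastmod = url.find(f"{{{NS}}}lastmod")
            if lastmod is None:
                # Create the element when the entry has no lastmod yet.
                lastmod = ET.SubElement(url, f"{{{NS}}}lastmod")
            lastmod.text = date_iso
    return ET.tostring(root, encoding="unicode")
```

Entries for other URLs are left untouched, so the function can be run once per reworked page.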

Can a page go from 'discovered' to 'crawled' without ever being indexed?

Yes, this is very common. Google can crawl a page and decide, after analyzing the content, that it doesn't deserve indexation. Moving from 'discovered' to 'crawled' guarantees nothing: it's just one more technical step completed.

Should I delete all 'crawled but not indexed' pages from my site?

No. Many of these pages are normal and have no impact (sort pages, filters, old URLs). Delete only those that waste crawl budget or create confusion. Focus on improving the strategic pages that should be indexed.

Does the 'discovered but not indexed' status mean my pages have a technical problem?

Not necessarily. It may simply mean that Google hasn't yet considered them a priority for crawling, or that it judges their content to be of little relevance before even visiting them. Still, check that they are accessible (no noindex, no robots.txt block) and properly linked in the internal linking structure.
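
The two accessibility checks mentioned here (noindex and robots.txt) can be approximated with Python's standard library. This is a rough sketch under stated assumptions: the function names are hypothetical, it only inspects the robots meta tag (not the `X-Robots-Tag` HTTP header), and `urllib.robotparser` uses simple prefix matching, so it may not reproduce Google's handling of wildcard rules.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMetaParser(HTMLParser):
    """Record whether a robots or googlebot meta tag contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True


def has_noindex(html):
    """Return True if the HTML carries a noindex robots meta tag."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


def blocked_by_robots(robots_txt, url, user_agent="Googlebot"):
    """Return True if the robots.txt text disallows url for user_agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return not rules.can_fetch(user_agent, url)
```

Both functions take text you have already fetched, so they can be run against saved copies of a page and of robots.txt. Treat the result as a first filter and confirm any doubtful URL with the URL Inspection tool in Search Console.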
