Why does Googlebot ignore some of the URLs it discovers?

Quick SEO Quiz

Test your SEO knowledge in 3 questions

Less than 30 seconds. Find out how much you really know about Google search.

🕒 ~30s 🎯 3 questions 📚 SEO Google

Official statement

Googlebot does not crawl every URL it discovers. Some pages may be on sites that do not meet the required quality threshold to be indexed, while other URLs may be forbidden from crawling or inaccessible without login credentials.

🎥 Source video

Extracted from a Google Search Central video

💬 EN 📅 22/02/2024 ✂ 10 statements

Watch on YouTube →

✂ Other statements from this video 9 ▾

📅

Official statement from February 22, 2024 (2 years ago)

⚠ A more recent statement exists on this topic How does Google actually discover all the URLs on your website? Google · June 26, 2025 View statement →

TL;DR

Googlebot does not crawl all the URLs it detects. Some pages are excluded because they belong to sites that do not meet the minimum quality threshold for indexing, while others are blocked technically or require authentication. This confirms that crawling is conditioned by an upstream quality filter, even before indexation.

What you need to understand

This statement confirms what many of us observe in the field: not all discovered URLs deserve to be crawled. Google performs a filtering process upstream, before even allocating crawl budget to a page.

Gary Illyes's message reveals the existence of an upstream quality filter that applies at the site or domain level. If your site does not pass this threshold, Googlebot may decide not to crawl certain URLs it has discovered through internal links, sitemaps, or backlinks.

What criteria determine this quality threshold?

Google does not explicitly specify the criteria, but field experience suggests several indicators: domain authority, overall content quality, user signals (bounce rate, time on page), update freshness, and likely metrics related to site expertise and trustworthiness.

This threshold is not binary — there is a gradation. A low-quality site will see a large proportion of its URLs ignored, while a high-authority site will benefit from quasi-systematic crawling of newly discovered pages.

Which URLs are blocked for technical reasons?

Beyond the quality filter, certain URLs are technically inaccessible to Googlebot: pages protected by robots.txt, URLs behind a login wall, content requiring complex JavaScript that is not interpretable, or pages with response times that are too long.

These technical blocks are often intentional (member areas, backend), but sometimes accidental — a misconfigured robots.txt or a redirect loop can prevent crawling of strategic URLs.

Crawling is conditioned by an upstream quality filter, independent of the technical indexability of a page.
Googlebot performs selective sorting based on perceived authority and quality of the site.
Technical obstacles (robots.txt, authentication) also prevent crawling, but these are two distinct issues.
A discovered URL is not a crawled URL, and even less an indexed URL.

SEO Expert opinion

Is this quality threshold logic consistent with field observations?

Absolutely. On sites with low authority or content issues, we regularly observe discovered but non-crawled URLs in Search Console. Some pages remain for months in "Discovered – currently not indexed" status without ever being visited by Googlebot.

The problem is that Google never clearly communicates where this threshold sits. [To verify] — no official metric allows you to know whether your site is above or below it. We are reduced to interpreting indirect signals: crawl frequency, proportion of indexed URLs, average discovery time.

What nuances should be added to this statement?

Gary Illyes speaks of "quality threshold required for indexing", but let's be precise: this is a threshold for being crawled, not indexed. This is an even earlier stage. A page can be crawled and then rejected from indexing for other reasons (duplicate content, thin content, noindex).

Another important nuance — this filter likely applies with varying degrees of severity depending on sections of a site. A product page on an established e-commerce site will have a better chance of being crawled than a blog post on a new site, even if both belong to different domains.

In which cases does this rule not apply strictly?

Sites with very high authority (national media, institutions, major platforms) appear to benefit from privileged treatment: their new URLs are generally crawled very quickly, sometimes within minutes. The quality threshold matters less — or rather, these sites exceed it by default.

Similarly, a URL widely shared from reliable sources can force crawling even if the hosting site is of average quality. The external signal (quality backlinks) partially compensates for the inherent weakness of the domain.

Warning: If you notice that a large proportion of your URLs remain in "Discovered – not indexed" status for several weeks, it's likely a signal that Google considers your site below the quality threshold. Before optimizing technically, question yourself about the real quality of your content and the perceived authority of your domain.

Practical impact and recommendations

What should you concretely do to improve your URLs' crawl rate?

First priority: strengthen the perceived quality of your site overall. This involves original and in-depth content, regular updates, expertise signals (authorship, author bio, external citations), and improved Core Web Vitals.

Next, rationalize your architecture. If Google believes your site does not deserve exhaustive crawling, you need to prioritize strategic URLs: remove low-value pages, consolidate internal linking toward priority content, and avoid diluting crawl budget across thousands of undifferentiated pages.

What mistakes should you avoid at all costs?

Do not waste time massively submitting URLs via the Indexing API if your site is below the quality threshold — it will not change anything. Google has already decided upstream that these pages are not worth immediate crawling.

Another common mistake: believing that an XML sitemap is enough to guarantee crawling. A sitemap signals URLs, but does not force Googlebot to visit them if the site does not meet the quality threshold. It is a discovery tool, not a guaranteed indexing lever.

How can you verify that your site is above the threshold?

Two indicators in Search Console are revealing. First signal: the average time between URL discovery (via sitemap or internal link) and its first crawl. If it exceeds several weeks for new content, that's a bad sign.

Second signal: the ratio between detected URLs and indexed URLs. A massive gap (e.g., 10,000 detected, 2,000 indexed) indicates either a content quality issue or a site below threshold. Cross-reference with crawl statistics to confirm.

Audit the real quality of your content — be honest about the added value you provide.
Remove or consolidate low-value pages to concentrate crawl budget.
Strengthen internal linking toward strategic URLs.
Monitor the discovery/crawl delay in Search Console to detect potential filtering.
Do not multiply manual submissions if the problem is structural — it will not speed anything up.
Build domain authority through quality backlinks and trust signals (mentions, citations).

Optimizing to pass this quality threshold requires a holistic approach: content, technical aspects, authority, and user signals. These levers are interdependent and require a strategic vision over several months.

For sites facing persistent blocks or large volumes of non-crawled URLs, it may be worthwhile to engage a specialized SEO agency capable of thoroughly auditing the quality signals perceived by Google and piloting structured domain authority growth.

❓ Frequently Asked Questions

Combien de temps faut-il attendre avant qu'une URL découverte soit crawlée ?

Ça dépend entièrement de l'autorité de votre site. Un site à forte autorité verra ses nouvelles URLs crawlées en quelques heures, tandis qu'un site de faible qualité peut attendre plusieurs semaines, voire ne jamais voir certaines URLs crawlées.

Soumettre une URL via l'outil d'inspection force-t-il le crawl ?

L'outil d'inspection peut accélérer le crawl ponctuel d'une URL, mais si votre site est sous le seuil qualité global, ça ne garantit ni un crawl rapide ni une indexation. C'est un levier tactique, pas stratégique.

Un sitemap XML garantit-il que toutes les URLs listées seront crawlées ?

Non. Le sitemap facilite la découverte, mais Googlebot décide ensuite de crawler ou non en fonction du seuil qualité du site et de la priorité perçue de chaque URL.

Comment savoir si mon site est en dessous du seuil qualité ?

Regardez dans la Search Console le ratio entre URLs détectées et URLs crawlées, ainsi que le délai moyen de crawl après découverte. Un écart important et des délais de plusieurs semaines sont des signaux d'alerte.

Les backlinks vers une URL non crawlée peuvent-ils forcer Googlebot à la visiter ?

Oui, dans une certaine mesure. Des backlinks de qualité peuvent compenser un déficit d'autorité du domaine et déclencher un crawl même si le site global est sous le seuil. Mais ça reste un signal parmi d'autres.

🏷 Related Topics

crawl budget indexation Googlebot qualité contenu autorité domaine Search Console découverte URLs

Domain Age & History Crawl & Indexing AI & SEO Domain Name

🎥 From the same video 9

Other SEO insights extracted from this same Google Search Central video · published on 22/02/2024

🎥 Watch the full video on YouTube →

Related statements

« Previous

Googlebot uses algorithms to determine what to cra...

Crawling: Page Discovery and Download Process...

« Back to results