
Official statement

The status 'discovered but not indexed' means that Google is aware of the existence of the URLs but has not yet crawled them, or that after crawling, the content was deemed insufficiently relevant for indexing. This is generally not a JavaScript problem if Google has never retrieved the HTML.
🎥 Source video

Extracted from a Google Search Central video

⏱ 48:50 💬 EN 📅 27/01/2021 ✂ 15 statements
Watch on YouTube (30:47) →
Other statements from this video (14)
  1. 1:01 Does Googlebot crawl and render JavaScript at the same frequency?
  2. 4:17 Does Googlebot really execute JavaScript like a real browser?
  3. 4:50 Does Googlebot really ignore all content loaded after user interaction?
  4. 6:53 Is the rendered HTML really the only reference for Google indexing?
  5. 7:23 Should you still rely on the Google cache to check JavaScript indexing?
  6. 7:54 Does JavaScript really impact your crawl budget?
  7. 9:00 Does Google really index your pages in full, or just strategic fragments?
  8. 12:08 Do CSS classes named 'SEO' hurt rankings?
  9. 16:36 Can Google's cache distort the rendering of your JavaScript pages?
  10. 20:27 Can removing links in JavaScript make your pages invisible to Google?
  11. 23:54 Why do live tests in Search Console give contradictory results?
  12. 26:00 How should URL parameters be handled to avoid indexing problems?
  13. 35:39 Can the XML sitemap really trigger a targeted recrawl of your pages?
  14. 44:44 Why doesn't Googlebot see links revealed after a user click?
📅 Official statement from 27/01/2021
TL;DR

Google clearly distinguishes between discovery and indexing: a URL can be known without ever being crawled, or crawled and then rejected for lack of relevance. Martin Splitt dismisses the JavaScript route for pages that were never crawled — the problem lies elsewhere, often at the level of crawl budget, perceived quality, or site architecture. In concrete terms, this status is not an error in itself, but a signal from Google that must be interpreted.

What you need to understand

What does 'discovered but not indexed' really mean?

This Search Console status covers two distinct scenarios that practitioners often confuse. First case: Google has spotted the URL — via a sitemap, an internal link, or a backlink — but has never sent Googlebot to fetch the HTML. Second case: Googlebot has crawled the page, but after analysis, the engine decided not to include it in the index.

The distinction is crucial. In the first scenario, the problem lies upstream of rendering: insufficient crawl budget, blocking robots.txt, URL deemed non-priority. In the second, it's a matter of perceived quality or duplication — Google has seen the content and has excluded it.

Martin Splitt emphasizes a technical point: if Google has never fetched the HTML, JavaScript is not the issue. JS rendering comes into play after the initial fetch — so if Googlebot doesn't even download the page, looking for a bug on the React hydration side is a waste of time.

When does this status become problematic?

On a site with 10,000 URLs, having 500 discovered but not indexed is not unusual. Google doesn’t index everything by principle — it prioritizes based on its perception of value. Deep pagination pages, redundant product filters, and thin auto-generated content are all natural candidates for this status.

The issue arises when strategic pages — flagship product pages, pillar content, landing pages — remain blocked for weeks. Here, you need to dig deeper: why does Google deem these URLs non-priority or insufficiently relevant? The signal can come from a lack of internal links, excessive click depth, or cannibalization with pages already indexed.

How to identify the root cause?

Search Console doesn’t clearly distinguish between the two sub-cases — it’s up to you to cross-reference the data. Go to the Crawl tab, filter by HTTP status 200, and check the last crawl date. If it’s empty or very old, Google has never or rarely crawled. If it’s recent but indexing remains blocked, it's a quality rejection.

Also, use server logs — they don’t lie. Look for the affected URLs: if Googlebot has never requested them, the issue is architectural or budget-related. If the bot visits regularly but doesn’t index, the problem is editorial or a technical rendering issue.
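
As an illustration, here is a minimal sketch of that log check in Python, assuming a standard access log ("access.log") and a hand-picked list of affected URLs; both the file name and the paths are placeholders:

```python
import re
from collections import Counter

# Minimal sketch, assuming a standard combined access log ("access.log") and a
# hand-made list of URLs flagged as "Discovered - currently not indexed".
# The paths below are placeholders. For a rigorous audit, also verify Googlebot
# hits via reverse DNS, since the user-agent string alone can be spoofed.
SUSPECT_URLS = {"/category/widgets/page/14/", "/product/blue-widget-xl/"}

request_re = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+"')
googlebot_hits = Counter()

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        match = request_re.search(line)
        if not match:
            continue
        path = match.group("path").split("?")[0]
        if path in SUSPECT_URLS:
            googlebot_hits[path] += 1

for url in sorted(SUSPECT_URLS):
    hits = googlebot_hits[url]
    verdict = ("crawled, so likely a quality rejection" if hits
               else "never fetched, so crawl budget or architecture")
    print(f"{url}  {hits} Googlebot hit(s)  ->  {verdict}")
```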

  • Discovery without crawl = crawl budget, depth, robots.txt, absence of strong internal links
  • Crawl without indexing = duplicate content, thin content, mispositioned canonical, insufficient perceived quality
  • JavaScript is not the issue if Google has never fetched the base HTML
  • This status is not a bug — it’s an explicit decision from Google
  • Scale matters: 5% of URLs in this status on a large e-commerce site is normal, 50% on a 100-page blog is a red flag

SEO Expert opinion

Is this statement consistent with real-world observations?

Yes, and it's actually one of the few points where Google is transparent. There truly are two profiles of URLs in this status: those that have never been visited (often deep pages, filters, variants) and those that were crawled and then discarded. Splitt's precision on JavaScript deserves to be highlighted — it cuts off a common excuse from front-end developers.

But let's be honest: Google doesn’t specify how long you should wait before worrying. A discovered page can remain in this status for months without a problem — or reveal a structural issue within the first week. The lack of numerical thresholds makes diagnosis opaque. One point still to be verified at scale: does Google apply implicit quotas by sector or site type?

What nuances should be added regarding this status?

First nuance: just because a page is discovered but not indexed doesn’t mean it's eternally banned from being indexed. Enhancing internal linking, boosting external popularity, or updating content can unlock the situation. Google constantly reevaluates its priorities — but it doesn’t do so in real-time.

Second nuance: not all URLs deserve to be indexed. An e-commerce site with 50,000 references and 200,000 color-size variants must decide what it wants to index. Aggressively canonicalizing, blocking in robots.txt, or noindexing certain combinations is often healthier than pleading with Google to index redundant content.

Third nuance: Splitt dismisses JavaScript for pages that were never crawled, but he doesn’t mention pages that were crawled and then rejected. A faulty JS rendering — timeout, blocking console errors, content loaded after 5 seconds — can very well produce empty or poor HTML on the Googlebot side, thus leading to a quality rejection. The JS issue remains relevant in this second scenario.
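
If you suspect this second scenario, a quick first check is whether the critical content is already present in the raw HTML the server returns, before any JavaScript runs. A minimal sketch in Python, where the URL and the phrase to look for are placeholders:

```python
import urllib.request

# Minimal sketch: fetch the raw (pre-JavaScript) HTML the server returns and check
# whether a phrase that should sit in the main content is already present.
# URL and phrase are placeholders; a real check should cover titles, product names,
# prices, and the main body text.
URL = "https://www.example.com/product/blue-widget-xl/"
CRITICAL_PHRASE = "Blue Widget XL"

req = urllib.request.Request(
    URL, headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"}
)
with urllib.request.urlopen(req, timeout=10) as resp:
    html = resp.read().decode("utf-8", errors="replace")

print(f"HTML size: {len(html)} bytes")
print("Critical content present before JS:", CRITICAL_PHRASE in html)
# If the phrase is missing here but visible in a browser, the content depends on
# client-side rendering: exactly the scenario where a faulty JS render can leave
# Googlebot with a thin page and trigger a quality rejection.
```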

In what cases does this rule not apply?

If you use aggressive lazy loading or conditional rendering based on user-agent, you can technically serve empty HTML to Googlebot even after fetching. In this specific case, Splitt is correct in principle — Google did retrieve some HTML — but the problem remains on the front end. It’s an edge case, but it exists.

Another exception: pages blocked by robots.txt or X-Robots-Tag typically do not appear in "discovered but not indexed" — they switch to "Excluded by robots.txt" or "Excluded by noindex tag." If you see them anyway, it’s often a lag in Search Console updates, or a discovery via sitemap while the crawl is blocked.

Warning: Google can index a URL without ever crawling it if it receives enough authoritative backlinks and the anchor + context are sufficient. These pages appear indexed with generic snippets — rare, but documented. The 'discovered' status does not cover this edge case.

Practical impact and recommendations

What should you do in practice regarding this status?

The first step: prioritize strategic URLs. Export the Search Console list, cross-reference it with your list of high-potential pages (bestselling product pages, pillar content, paid-search landing pages). If these pages are absent from the index, it’s urgent. If they are minor filters or variants, it’s normal — even desirable.
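
As a minimal sketch of that cross-reference, assuming a CSV export of the affected URLs and a plain-text list of your strategic pages (file names and the "URL" column header are placeholders):

```python
import csv

# Minimal sketch: cross-reference the Search Console export of "Discovered -
# currently not indexed" URLs with your own list of strategic pages.
# File names and the "URL" column header are assumptions to adapt.
with open("gsc_discovered_not_indexed.csv", newline="", encoding="utf-8") as f:
    not_indexed = {row["URL"].strip() for row in csv.DictReader(f)}

with open("strategic_pages.txt", encoding="utf-8") as f:
    strategic = {line.strip() for line in f if line.strip()}

urgent = sorted(strategic & not_indexed)
print(f"{len(urgent)} strategic page(s) stuck outside the index:")
for url in urgent:
    print(" -", url)
```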

The second step: determine whether Google has crawled or not. Check the server logs or the last crawl date in Search Console. For pages that have never been crawled, enhance the internal linking, reduce click depth, add them to the sitemap, and boost their internal PageRank through links from strong pages. For crawled pages that were rejected, audit the content: duplication, thin content, incorrect canonical tags, weak editorial quality.

The third step: if the volume of discovered URLs suddenly explodes, you likely have a problem with facets or pagination being poorly managed. Google discovers thousands of unnecessary combinations — clean up via robots.txt, URL parameters in Search Console (an obsolete feature but still conceptually relevant), or strict canonicals.
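
Before shipping new robots.txt rules, it helps to test them against sample facet URLs. A minimal sketch using Python's standard-library robotparser, with illustrative rules and URLs; note that this parser only does prefix matching and ignores Google's '*' and '$' wildcard extensions, so wildcard rules need a dedicated tester:

```python
from urllib.robotparser import RobotFileParser

# Minimal sketch: check that facet/filter URLs you do NOT want crawled are blocked
# by robots.txt, while normal category pages stay crawlable. Rules and URLs below
# are illustrative only. Caveat: RobotFileParser only does prefix matching and does
# not understand Google's '*' and '$' wildcard extensions.
robots_rules = """
User-agent: *
Disallow: /widgets/filter/
Disallow: /search
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_rules)

samples = [
    "https://www.example.com/widgets/filter/color-blue/size-xl/",  # facet: should be blocked
    "https://www.example.com/search?q=blue+widget",                # internal search: blocked
    "https://www.example.com/widgets/",                            # category: must stay crawlable
]
for url in samples:
    status = "crawlable" if rp.can_fetch("Googlebot", url) else "blocked"
    print(f"{status:10s} {url}")
```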

What mistakes should be avoided?

Do not mass submit via "Request indexing" — it’s ineffective, and you burn your quota for nothing. Google has already made its choice; forcing the issue does not change anything if quality or priority signals are absent. Use this function only for strategic pages after corrections.

Do not ignore quality signals. If Google crawls but does not index, it’s rarely a whim — it detects duplicate, thin, or non-value-added content. Enriching, differentiating, or deleting is often more cost-effective than stubbornly trying to get indexed.

Do not confuse discovery with priority. Your sitemap may contain 10,000 URLs, but if your site has a Domain Rating of 20 and zero backlinks, Google will never index everything. It’s a question of crawl budget and trust — work on external popularity before demanding the indexing of every page.

How to check if your site is compliant?

Analyze the ratio of indexed URLs to submitted URLs. A rate below 60% on a well-structured site should raise alarms — either you have too many unnecessary URLs, or you have a trust or quality issue. Segment by page type: product pages, categories, blog, institutional pages. If an entire category is systematically rejected, that’s a pattern to investigate.
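
A minimal sketch of that segmentation, assuming a merged per-URL export with an indexed yes/no flag; the file name, column names, and path rules are placeholders to adapt to your own site:

```python
import csv
from collections import defaultdict

# Minimal sketch: compute the indexed/submitted ratio per page type from a merged
# export (one row per URL with an "indexed" yes/no flag). The file name, column
# names, and path-based segmentation rules are assumptions.
def page_type(url: str) -> str:
    if "/product/" in url:
        return "product pages"
    if "/category/" in url:
        return "categories"
    if "/blog/" in url:
        return "blog"
    return "other"

submitted = defaultdict(int)
indexed = defaultdict(int)

with open("coverage_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        segment = page_type(row["URL"])
        submitted[segment] += 1
        if row["indexed"].strip().lower() == "yes":
            indexed[segment] += 1

for segment in sorted(submitted):
    rate = indexed[segment] / submitted[segment]
    flag = "  <-- investigate" if rate < 0.60 else ""
    print(f"{segment:15s} {indexed[segment]:5d}/{submitted[segment]:5d}  ({rate:.0%}){flag}")
```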

Check the consistency between sitemap, internal linking, and canonicals. A URL canonicalized to another should not appear in the sitemap — it’s a contradictory signal that wastes crawl budget. Use Screaming Frog or Oncrawl to cross-reference this data and identify inconsistencies.
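
A minimal sketch of that cross-check, assuming a standard XML sitemap and a crawler export with address and canonical columns (the Screaming Frog style column names are assumptions):

```python
import csv
import xml.etree.ElementTree as ET

# Minimal sketch: flag URLs that sit in the XML sitemap while their canonical points
# elsewhere (a contradictory signal). "sitemap.xml" and the export column names
# ("Address", "Canonical Link Element 1", as in a Screaming Frog CSV) are assumptions.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
sitemap_urls = {
    loc.text.strip() for loc in ET.parse("sitemap.xml").findall(".//sm:loc", NS)
}

conflicts = []
with open("internal_html.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        url = row["Address"].strip()
        canonical = (row.get("Canonical Link Element 1") or "").strip()
        if url in sitemap_urls and canonical and canonical != url:
            conflicts.append((url, canonical))

print(f"{len(conflicts)} sitemap URL(s) canonicalized to another page:")
for url, canonical in conflicts:
    print(f" - {url}  ->  {canonical}")
```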

  • Export the "Discovered but not indexed" list from Search Console
  • Cross-reference with server logs to distinguish crawl/non-crawl
  • Audit the content of crawled but not indexed pages (duplication, thin content)
  • Enhance the internal linking of strategic pages never crawled
  • Clean up the sitemap: remove canonicalized, noindex, and redirected URLs
  • Monitor the monthly evolution of the indexed/discovered ratio by page type

The status 'discovered but not indexed' is a judgment from Google, not a bug. It reveals either a crawl budget and architecture issue or a quality rejection after analysis. Prioritize strategic pages, distinguish between crawl and non-crawl, correct quality signals, and accept that not all URLs deserve indexing. These diagnostics often require complex cross-referencing of data — server logs, Search Console, internal crawl — and expertise to interpret signals. If you lack the resources or tools for a thorough audit, consulting a specialized SEO agency can accelerate the diagnosis and avoid months of trial and error.

❓ Frequently Asked Questions

How long should you wait before worrying about a page stuck in 'discovered but not indexed'?
Google gives no official threshold. On a low-authority site, several weeks or even months can go by. If a strategic page remains stuck beyond 30 days despite good internal linking, dig deeper: it is probably a quality or crawl budget issue.
Does submitting the URL via 'Request indexing' solve the problem?
Rarely. If Google has already crawled and rejected the page, forcing indexing changes nothing — you have to fix the quality signals. Only use this feature after corrections, and only on strategic pages.
Can JavaScript still be at fault if Google has crawled the page?
Yes. Splitt rules out JS for pages that were never crawled, but a faulty JS render — timeout, blocking errors, empty content on the bot side — can produce poor HTML after the fetch, and therefore a quality rejection. The JS lead remains valid in this second scenario.
Must every URL on a site be indexed?
No. An e-commerce site with thousands of variants or filters does not need to index everything. Canonicalizing, blocking in robots.txt, or noindexing certain combinations often improves the overall health of indexing.
How do you know whether Google has crawled a URL marked 'discovered but not indexed'?
Check the last crawl date in Search Console's Crawl report, or analyze your server logs. If Googlebot has never requested the URL, the problem lies upstream of rendering; otherwise, it is a quality rejection.

