
Official statement

Using 'site:domain.com' is not a reliable method for determining the number of indexed pages. It is better to use a sitemap to verify the URLs that are actually indexed.
🎥 Source video (statement at 48:11)

Extracted from a Google Search Central video

⏱ 1h01 💬 EN 📅 02/08/2017 ✂ 13 statements
Watch on YouTube (48:11) →
Other statements from this video (12)
  1. 4:00 Do non-Unicode fonts really hurt the indexing of your content?
  2. 5:15 Do Google's quality raters really influence your rankings?
  3. 9:39 Does Panda really run continuously, or is Google hiding something from us?
  4. 9:52 Why does Google want your content to be bookmarked rather than found through search?
  5. 11:00 Does duplicate content really ruin your Google rankings?
  6. 12:06 Does noindex really protect your site from quality penalties?
  7. 13:23 Should you duplicate hreflang tags across mobile and desktop?
  8. 15:15 Do you really need to unblock images in robots.txt to improve your SEO?
  9. 19:00 Does a temporary noindex really cost you your rankings for good?
  10. 47:39 Do social signals really influence Google rankings?
  11. 50:14 Are slow pages really indexed by Google?
  12. 57:59 Should you really trust the structured data in Search Console?
📅 Official statement from 02/08/2017 (8 years ago)
TL;DR

Google states that the 'site:domain.com' command only provides a vague estimate of the actual number of indexed pages. For a reliable count, you should cross-reference Search Console data with your sitemap. This distinction changes how you diagnose indexing issues and manage your crawl budget.

What you need to understand

Why does Google consider the site: command unreliable?

The site:domain.com command queries a lightweight index, not the complete database used for ranking. Google maintains several layers of indexing: a main index for ranking, secondary indexes for diagnostic queries, and temporary caches. When you type 'site:', you are accessing an approximation that may include URLs that haven’t been recently crawled or exclude well-indexed pages.

The displayed number fluctuates from day to day without any changes to your site. You may observe discrepancies of 20 to 40% between two queries spaced a few hours apart. These variations do not reflect actual indexing or de-indexing but rather technical artifacts related to the query system used for this command.

What method does Google recommend instead?

Google recommends the Index Coverage report in Search Console instead, which is based on Google's actual crawl data. This report distinguishes between URLs that are discovered, crawled, and indexed, and those excluded for specific reasons (alternate canonical, robots.txt, noindex). It also gives you historical data and graphs showing trends over time.

The sitemap serves as your reference list: it declares which URLs you want indexed. By cross-referencing this file with Search Console data, you can identify URLs that were submitted but not indexed. That gap is what matters for diagnostics, not an absolute figure derived from an approximate command.
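As a minimal sketch of that cross-referencing, assuming a standard XML sitemap (the URL below is a placeholder), the declared list can be extracted with Python's standard library:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url: str) -> set[str]:
    """Return the set of <loc> URLs declared in an XML sitemap."""
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return {loc.text.strip() for loc in tree.getroot().findall("sm:url/sm:loc", NS)}

# Hypothetical location; replace with your own sitemap.
declared = sitemap_urls("https://domain.com/sitemap.xml")
print(f"{len(declared)} URLs declared in the sitemap")
```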

What are the actual limitations of this statement?

Google does not quantify how inaccurate the site: command is. It is only a rough order of magnitude, and for a site with 50,000 pages, a 30% margin of error represents 15,000 phantom or missing URLs. The statement remains vague about the threshold beyond which the discrepancy becomes critical.

Another point: Search Console itself is not infallible. It samples some data and may ignore URLs crawled by Googlebot but not reported in the interface. The idea that the sitemap is the source of truth assumes it is perfectly up-to-date, which is not always the case on dynamic sites where the catalog changes every hour.

  • The site: command queries a secondary index, not the one used for ranking
  • The observed discrepancies can reach 20 to 40% from day to day without real change
  • The Search Console report is based on crawl logs and offers detailed historical data
  • The sitemap serves as the reference list for identifying submitted but unindexed URLs
  • Search Console itself samples some data and may miss crawled URLs

SEO Expert opinion

Is this statement consistent with real-world observations?

In practice, SEOs have observed for years that site: gives erratic numbers. An e-commerce site with 80,000 product listings may show 65,000 results one day and 92,000 the next, without any technical modification. This instability has been documented in forums since at least 2015, well before Google officially confirmed it.

What’s new is that Google publicly acknowledges this limitation. For a long time, the site: command was the only accessible method before Search Console became widespread. Many SEO audits still rely on this figure, which raises issues regarding the reliability of diagnostics. [To be verified]: Google does not clarify whether this inaccuracy is a bug it could fix or an inherent feature of the system.

In what cases does the site: command remain useful nonetheless?

The site: command still holds value for quick qualitative checks. If you launch a new domain and site: returns zero results after three weeks, you know there is a blocking indexing issue (robots.txt, global noindex, penalty). It serves as an alert signal, not a dashboard.

It is also useful for inspecting specific subsections: site:domain.com/blog/ quickly shows whether that part of the site is present in the index at all. You can also combine it with other search operators (inurl:, intitle:) to narrow the results to specific pages, as in the examples below. However, when it comes to counting accurately or tracking changes over time, the method becomes counterproductive.
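A few combinations of this kind (the domain and paths are placeholders, and the result counts remain estimates):

```
site:domain.com/blog/            # is the blog section present in the index at all?
site:domain.com inurl:filter     # indexed URLs containing "filter"
site:domain.com intitle:"guide"  # indexed pages whose title contains "guide"
site:domain.com -inurl:/blog/    # everything indexed outside the blog section
```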

What precautions should be taken with the Search Console report?

Search Console sometimes displays URLs that Google has discovered but never crawled. They appear in the report with the status 'Discovered, currently not indexed,' which artificially inflates the number of known URLs. If you have 100,000 discovered URLs and 60,000 of them have never been explored, your sitemap won’t help you understand why.

The report can also exclude URLs crawled through non-standard paths ("temporary" 302 redirects left in place for months, undeclared dynamic parameters). Cross-referencing Search Console with server logs remains the most reliable method, but few teams have the infrastructure to process tens of millions of log lines every day. [To be verified]: Google provides no guarantee regarding the completeness of the data reported in Search Console.
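For teams that do want to try, here is a hedged sketch of the log side (the log path and combined log format are assumptions, and user agents should really be verified via reverse DNS, as Google documents):

```python
import re

# Request line of a combined-format access log entry, e.g. "GET /blog/ HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')

def googlebot_paths(log_path: str) -> set[str]:
    """Return the distinct paths requested by clients claiming to be Googlebot.

    User agents can be spoofed: a production version should confirm the
    client IP with a reverse DNS lookup before trusting the hit.
    """
    crawled: set[str] = set()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if "Googlebot" not in line:
                continue
            match = REQUEST_RE.search(line)
            if match:
                crawled.add(match.group(1))
    return crawled

# Hypothetical path; adjust to your server setup.
print(len(googlebot_paths("/var/log/nginx/access.log")), "distinct paths crawled by Googlebot")
```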

Practical impact and recommendations

How can you effectively audit your site's indexing?

Implement a weekly reconciliation process between three sources: your XML sitemap (a list of URLs you want indexed), the Search Console report (URLs actually indexed according to Google), and your server logs (URLs genuinely crawled by Googlebot). Export these three datasets and cross-reference them in a spreadsheet or a Python script.

Identify the URLs present in the sitemap but absent from the index. This should be your priority: they should be indexed but aren’t. The Search Console report will provide the exclusion reasons (noindex, canonical, duplicated content, blocked crawling). Stop wasting time comparing approximate figures from site:.
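A minimal reconciliation sketch along those lines, assuming the three sources have already been exported to plain-text files with one URL per line (the file names are placeholders):

```python
def load(path: str) -> set[str]:
    """Load one URL per line, skipping blanks."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

sitemap = load("sitemap_urls.txt")      # what you want indexed
indexed = load("gsc_indexed.txt")       # what Search Console reports as indexed
crawled = load("googlebot_crawled.txt") # what your logs show Googlebot fetched

# The priority list: submitted but not indexed.
submitted_not_indexed = sitemap - indexed
# Crawled yet not indexed: Google saw these pages and declined them (quality signal).
crawled_not_indexed = (crawled & sitemap) - indexed
# Indexed but never declared: often parameters, pagination, or forgotten sections.
indexed_not_declared = indexed - sitemap

print(f"Submitted, not indexed: {len(submitted_not_indexed)}")
print(f"Crawled, not indexed:   {len(crawled_not_indexed)}")
print(f"Indexed, not declared:  {len(indexed_not_declared)}")
```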

What mistakes should be avoided when tracking indexing?

Never rely on a single figure. A client tells you, 'I lost 10,000 indexed pages in a week' based on site:? First, check if those pages are in the sitemap and if Search Console confirms a true de-indexing. Often, it’s just a fluctuation in the site: estimate, not a technical problem.

Avoid submitting URLs en masse through the inspection tool. Google limits submissions to a few dozen per day, and this method won’t solve structural problems (loading times, poor content, faulty canonicalization). If 5,000 pages are not indexed, it’s rarely an oversight by Googlebot; it’s a quality signal.

What tools can be used to automate this tracking?

Connect the Search Console API to a reporting tool (Google Sheets, Data Studio, or a custom solution). Set up automatic alerts if the indexation rate drops below a critical threshold (for example, less than 85% of sitemap URLs indexed). Some tools like Oncrawl, Botify, or Screaming Frog allow you to cross-reference crawl + logs + Search Console in a unified interface.
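As an illustrative sketch of the alert logic only (the counts are hard-coded here; in a real setup they would come from your weekly Search Console and sitemap exports, and the alert would go to email or Slack rather than stdout):

```python
ALERT_THRESHOLD = 0.85  # alert when fewer than 85% of sitemap URLs are indexed

def check_indexation(sitemap_total: int, indexed_total: int) -> bool:
    """Return True when the indexation rate is healthy, printing a status line."""
    rate = indexed_total / sitemap_total if sitemap_total else 0.0
    healthy = rate >= ALERT_THRESHOLD
    status = "OK" if healthy else "ALERT"
    print(f"[{status}] {indexed_total}/{sitemap_total} URLs indexed ({rate:.1%})")
    return healthy

# Example values only.
check_indexation(sitemap_total=50_000, indexed_total=41_200)  # [ALERT] ... (82.4%)
```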

For very large sites (millions of URLs), invest in a log processing stack (ELK, BigQuery). You can identify sections of the site that Googlebot systematically ignores and adjust your internal linking structure accordingly. This approach requires advanced technical expertise, and many businesses choose to delegate this part to a specialized SEO agency for personalized support and tailored recommendations.
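Short of a full ELK or BigQuery stack, the same idea can be sketched at small scale in Python, grouping the reconciled URL sets by site section (the sample data below is purely illustrative):

```python
from collections import Counter
from urllib.parse import urlparse

def section_of(url_or_path: str) -> str:
    """Map a URL or path to its first segment, e.g. '/blog/post-1' -> '/blog/'."""
    segments = [s for s in urlparse(url_or_path).path.split("/") if s]
    return f"/{segments[0]}/" if segments else "/"

def crawl_coverage(declared: set[str], crawled: set[str]) -> None:
    """Compare sitemap sections against Googlebot's observed crawl activity."""
    declared_counts = Counter(section_of(u) for u in declared)
    crawled_counts = Counter(section_of(p) for p in crawled)
    for section, total in declared_counts.most_common():
        hits = crawled_counts.get(section, 0)
        flag = "  <- never crawled?" if hits == 0 else ""
        print(f"{section:<15} declared={total:<6} crawled={hits}{flag}")

# Illustrative sample only; in practice, feed in the sitemap and log sets above.
crawl_coverage(
    declared={"https://domain.com/blog/a", "https://domain.com/shop/b"},
    crawled={"/blog/a"},
)
```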

  • Export the Search Console indexation report weekly and cross-reference it with the sitemap
  • Identify the URLs present in the sitemap but marked 'Excluded' in Search Console
  • Analyze the exclusion reasons and correct technical issues (canonicalization, noindex, redirections)
  • Set up automatic alerts if the indexation rate falls below 85%
  • Cross-reference Search Console data with server logs to detect crawled but unindexed URLs
  • Stop using the site: command for strategic decisions or client reports

The site: command becomes a quick troubleshooting tool, not a reliable data source. Manage indexing through Search Console, the sitemap, and server logs. Automate monitoring to detect anomalies before they impact traffic.

❓ Frequently Asked Questions

Does the site: command at least give a correct order of magnitude?
Yes, but with a margin of error that can reach 30 to 40%. For a 10,000-page site, the gap can be 3,000 to 4,000 URLs, which makes any fine-grained analysis impossible.
If my number of site: results goes up, is that necessarily a good sign?
Not necessarily. The increase may reflect the indexing of low-quality pages (infinite pagination, dynamic parameters) that you do not want indexed. Always check in Search Console which URLs are involved.
Does the sitemap guarantee that the URLs it contains will be indexed?
No. The sitemap is a suggestion, not an instruction. Google may choose not to index sitemap URLs if it judges them low quality or duplicated, or if the crawl budget is insufficient.
How can I tell whether a specific page is indexed?
Use the URL Inspection tool in Search Console. It tells you whether the page is indexed and, if not, why. It is more reliable than searching manually through the search results.
Should I remove non-indexed pages from my sitemap?
It depends. If they are excluded for technical reasons (noindex, robots.txt), yes, clean up the sitemap. If they are marked 'Discovered, currently not indexed', keep them: Google may index them later if the site's quality or popularity improves.