Official statement
Google confirms that URLs containing non-ASCII characters (accents, ideograms, Cyrillic) are accepted in XML sitemaps, as long as the encoding required by the sitemap specification is respected. In practice, you can submit URLs with raw UTF-8 characters without first converting them to percent-encoding. However, compatibility with third-party parsers and some analysis tools remains a gray area and should be tested in advance.
What you need to understand
What does 'non-ASCII characters' mean in this context?
ASCII characters cover only the first 128 symbols of the basic English character set: unaccented letters, numbers, common punctuation. Anything beyond this scope — French accents, German umlauts, Chinese ideograms, Cyrillic alphabet — falls under non-ASCII.
In a URL, these characters are traditionally escaped using percent-encoding, where each byte of the character's UTF-8 representation is written as % followed by two hexadecimal digits: é becomes %C3%A9, 你 becomes %E4%BD%A0. Mueller's statement clarifies that this conversion is not mandatory in an XML sitemap, as long as UTF-8 encoding is declared in the XML header.
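As a minimal illustration of the conversion described above, the standard library's urllib.parse.quote produces the same escapes. The helper name percent_encode_path is hypothetical and chosen only for this sketch.

```python
from urllib.parse import quote, unquote

def percent_encode_path(path: str) -> str:
    """Percent-encode a URL path; non-ASCII characters become their UTF-8 bytes."""
    # safe="/" keeps path separators intact; quote() uses UTF-8 by default.
    return quote(path, safe="/")

print(percent_encode_path("/café"))  # /caf%C3%A9
print(percent_encode_path("/你"))     # /%E4%BD%A0
print(unquote("/caf%C3%A9"))         # /café
```

The round trip through unquote shows that the two forms carry the same information, which is why a consumer can accept either one.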
Why is this clarification necessary?
Historically, the URI specification (RFC 3986) only allows ASCII characters, so any character outside ASCII has to be percent-encoded. Many SEOs have gotten into the habit of pre-encoding their URLs before inserting them into sitemaps, thinking that Googlebot would refuse raw characters.
Mueller cuts this reflex short: Google accepts both formats. You can submit https://example.fr/café or https://example.fr/caf%C3%A9 in your sitemap, and both work. It's a technical simplification that avoids an extra transformation step on the CMS side.
What is the sitemap specification he is talking about?
The XML sitemap protocol is described at sitemaps.org, which asks that URLs follow the RFC 3986 standard and that characters reserved by XML (such as <, > and &) be written as XML entities. The file itself must be UTF-8 encoded.
In short: as long as your XML file declares encoding="UTF-8" in its prologue, Googlebot interprets multi-byte characters correctly. There is no need for double encoding or other contortions, because the XML parser decodes the bytes into characters itself.
- Non-ASCII characters are allowed in XML sitemap URLs without prior conversion to percent-encoding.
- UTF-8 encoding must be explicitly declared in the XML header (usually present by default).
- Both formats (raw characters and percent-encoding) are accepted by Googlebot and treated equivalently.
- Watch out for third-party tools: not all XML parsers handle multi-byte characters well, especially older legacy systems.
- XML entity escaping (&amp;, &lt;, &gt;) remains mandatory for the characters reserved by XML syntax itself (&, <, >).
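The points above can be sketched with Python's standard xml.etree.ElementTree module, which writes the UTF-8 declaration and escapes reserved XML characters on serialization. This is only one possible way to generate a sitemap; the function name build_sitemap and the example URLs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls: Iterable[str]) -> bytes:
    """Serialize URLs into a UTF-8 sitemap, keeping non-ASCII characters raw."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        loc = ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc")
        loc.text = url  # &, < and > are escaped when the tree is serialized
    # xml_declaration=True (Python 3.8+) writes the prolog with the encoding.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)

sitemap = build_sitemap([
    "https://example.fr/café",                 # raw non-ASCII, left as-is
    "https://example.fr/search?q=thé&page=2",  # the & must become &amp;
])
print(sitemap.decode("utf-8"))
```

Letting the XML library do the escaping avoids hand-built string concatenation, which is where unescaped ampersands usually slip in.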
SEO Expert opinion
Is this statement consistent with observed practices?
Field observations confirm it: Google has been crawling and indexing URLs with raw non-ASCII characters in XML sitemaps for years. French, German, or Japanese e-commerce sites that do not percent-encode their URLs do not suffer any crawl disadvantage.
That said, Mueller remains vague on a critical point: URL normalization. Does Google canonicalize the same URL when it is submitted in both raw and encoded form? There is no official answer, but tests suggest that Google treats both as variants of the same resource, unless the server returns different HTTP status codes depending on the form. This should be verified case by case with server logs.
What risks does this flexibility introduce?
The main pitfall concerns third-party analysis tools: Screaming Frog, Botify, OnCrawl, and various log parsers. Some tools do not handle multi-byte UTF-8 characters properly in their URL comparisons, leading to false duplicates or matching errors between sitemaps and logs.
In practice, you could submit a URL with an accent in the sitemap, see Googlebot crawl it correctly, but fail to match this visit in your reporting tools because the tool encodes the string differently. Frustrating but not blocking: it's a dashboard issue, not an SEO issue.
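One way to work around such matching errors in your own reporting scripts is to reduce both the sitemap URL and the logged URL to a single comparable form before joining them. The sketch below is an assumption about how that could be done, not a description of how any of the tools named above work; canonical_url is a hypothetical helper.

```python
import unicodedata
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    """Return one comparable form for raw and percent-encoded variants of a URL."""
    parts = urlsplit(url)
    # Decode existing escapes, normalize Unicode to NFC, then re-encode the path.
    # Limitation: an escaped reserved character such as %2F is decoded to "/"
    # and not restored, so this is for reporting joins, not for rewriting URLs.
    path = unicodedata.normalize("NFC", unquote(parts.path))
    encoded_path = quote(path, safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), encoded_path, parts.query, parts.fragment)
    )

raw = canonical_url("https://example.fr/café")
encoded = canonical_url("https://example.fr/caf%C3%A9")
print(raw == encoded)  # True
```

The NFC step matters because "é" can also be stored as "e" followed by a combining accent, which produces different bytes for what looks like the same URL.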
Should you still encode just in case?
Honestly, no. Systematically encoding URLs complicates maintenance: readable URLs (with accents) are easier to debug, read in logs, and communicate to developers. If Google accepts both formats, it's better to prefer the simpler one.
One exception: multilingual sites with multiple alphabets (Latin + Cyrillic + Chinese). In this case, keeping a homogeneous encoding (everything percent-encoded) can simplify regular expressions and automated server-side processing. But it's an architectural choice, not an SEO constraint.
Practical impact and recommendations
What should you concretely do on an existing site?
Start with an audit of your current sitemap: extract a sample of URLs and check the format (raw or encoded). If everything is already encoded and working, there’s no reason to change — it’s not a ranking factor.
If you dynamically generate sitemaps via a CMS or script, ensure that the XML header declares encoding="UTF-8". Also, check that your web server returns the Content-Type application/xml; charset=UTF-8 to avoid parsing errors by strict clients.
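These two checks can be scripted. The sketch below works on bytes and header strings you have already fetched, so it makes no network calls; the function names check_sitemap_bytes and content_type_ok are hypothetical, and accepting text/xml alongside application/xml is an assumption of this sketch.

```python
import re
from typing import List

def check_sitemap_bytes(data: bytes) -> List[str]:
    """Return a list of encoding problems found in raw sitemap bytes."""
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # ignore a UTF-8 byte order mark
    # Look for encoding="..." inside the XML declaration at the start of the file.
    match = re.match(rb'\s*<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    if match is None:
        problems.append("no encoding declared in the XML prolog")
    elif match.group(1).lower() != b"utf-8":
        problems.append(f"declared encoding is {match.group(1).decode('ascii')}, not UTF-8")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"invalid UTF-8 at byte offset {exc.start}")
    return problems

def content_type_ok(header: str) -> bool:
    """Check a Content-Type header value for an XML media type and a UTF-8 charset."""
    media_type, _, params = header.partition(";")
    normalized = params.replace(" ", "").replace('"', "").lower()
    return media_type.strip().lower() in {"application/xml", "text/xml"} and "charset=utf-8" in normalized

print(check_sitemap_bytes(b'<?xml version="1.0" encoding="UTF-8"?><urlset/>'))  # []
print(content_type_ok("application/xml; charset=UTF-8"))  # True
```

A file can declare UTF-8 and still contain bytes from another encoding, which is why the declaration and the actual decoding are checked separately.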
How to manage multilingual URLs in sitemaps?
For sites with multiple alphabets, the simplest way is to submit one sitemap per language/region and maintain consistent encoding within each file. This facilitates debugging and reduces the risk of collisions between look-alike characters from different scripts, known as homoglyphs (e.g. Latin 'a' vs Cyrillic 'а').
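Look-alike collisions of this kind are invisible to the eye but easy to flag in a script. The following is a rough heuristic, assuming that the first word of a character's Unicode name identifies its script; scripts_in and has_mixed_scripts are hypothetical helper names.

```python
import unicodedata
from typing import Set

def scripts_in(text: str) -> Set[str]:
    """Return the script prefixes (e.g. LATIN, CYRILLIC) of the letters in text."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts

def has_mixed_scripts(segment: str) -> bool:
    """Flag a URL segment whose letters come from more than one script."""
    return len(scripts_in(segment)) > 1

print(has_mixed_scripts("page"))       # False: Latin only
print(has_mixed_scripts("p\u0430ge"))  # True: the second letter is Cyrillic
```

Running this per path segment over each language's sitemap highlights slugs where a single character was typed on the wrong keyboard layout.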
If you use hreflang tags in your sitemaps, make sure that the destination URLs are exactly those that the server returns with a 200. A mismatch between raw URL in the sitemap and encoded URL in the Location header of a redirect can create loops or mixed signals for Googlebot.
What errors to avoid when generating sitemaps?
The classic error: forgetting to escape the characters reserved by XML (&, <, >) as entities (&amp;, &lt;, &gt;). Even if your URLs contain valid non-ASCII characters, a single unescaped & will break the parsing of the entire XML file. Search Console will report a parsing error, and none of the URLs in that sitemap will be read.
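The failure is easy to reproduce locally with a standard XML parser. In this sketch, loc_element and is_well_formed are hypothetical helpers, and xml.sax.saxutils.escape is the standard library function that replaces &, < and > with their entities.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def loc_element(url: str, escape_entities: bool = True) -> str:
    """Build a <loc> element as text, optionally skipping XML escaping."""
    value = escape(url) if escape_entities else url
    return f"<loc>{value}</loc>"

def is_well_formed(xml_text: str) -> bool:
    """Return True if a standard XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

url = "https://example.fr/café?couleur=bleu&taille=m"
print(is_well_formed(loc_element(url, escape_entities=False)))  # False
print(is_well_formed(loc_element(url)))                         # True
```

The accented character plays no part in the failure: only the raw ampersand does.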
Another trap: mixing multiple encodings in the same sitemap (UTF-8 for some URLs, ISO-8859-1 for others). Use UTF-8 everywhere: it is the universal web standard and the one Google favors. If your database or CMS uses a legacy encoding, convert the data upstream, before the sitemap is generated.
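An upstream conversion can be as small as the sketch below. It assumes the legacy data is ISO-8859-1, which you would need to confirm for your own system, and to_utf8 is a hypothetical helper name.

```python
def to_utf8(data: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Return data as UTF-8, re-encoding from legacy_encoding only if needed."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # Heuristic: bytes that are not valid UTF-8 are assumed to be legacy text.
        return data.decode(legacy_encoding).encode("utf-8")

legacy_slug = "café".encode("iso-8859-1")
print(to_utf8(legacy_slug) == "café".encode("utf-8"))  # True
```

The validity check is only a heuristic, since some short legacy byte sequences also happen to be valid UTF-8, so converting at a point where the source encoding is known for certain is more reliable than guessing per value.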
- Make sure the XML header declares encoding="UTF-8" in the prologue of the sitemap file.
- Submit the sitemap in Search Console and check the Sitemaps report for parsing errors.
- Ensure that the server returns the Content-Type application/xml; charset=UTF-8 for sitemap files.
- Systematically escape the characters reserved by XML (&, <, >) as entities, even in URLs with non-ASCII characters.
- Cross-check server logs and sitemap to ensure that crawled URLs match submitted URLs (same form, no unexpected redirects).
- If you change the format (raw → encoded or vice versa), check that the server does not generate duplicates or unwanted redirects.
❓ Frequently Asked Questions
Do I absolutely have to percent-encode my URLs in the XML sitemap?
What happens if I submit the same URL in both raw and encoded form?
Do third-party SEO tools correctly handle non-ASCII URLs in sitemaps?
Should XML entities (&, <, >) be escaped in sitemap URLs?
What is the main risk if I change the format of my URLs in the sitemap?
Other SEO insights extracted from this same Google Search Central video · duration 54 min · published on 15/05/2020