Official statement
Other statements from this video (22)
- 3:03 Do temporary 404 errors during a migration really kill your rankings?
- 4:56 Googlebot crawls from the USA: how do you avoid the geo-IP cloaking trap?
- 8:42 Can you really block Googlebot state by state in the USA without breaking everything?
- 11:31 Why doesn't Google index all your pages despite active crawling?
- 12:17 Are Reddit's nofollow links really useless for SEO?
- 14:14 Should you systematically enable loading='lazy' on all your images to boost SEO?
- 15:25 Should you really reduce the number of language versions for hreflang?
- 18:27 Do you really need to fix every 404 error reported in Search Console?
- 20:47 Are jump links really useless for Google's crawl?
- 21:55 Should you disavow ghost backlinks visible only in Search Console?
- 23:20 Why doesn't the Disavow file hide bad links in Search Console?
- 29:18 Should you really contextualize the alt attribute beyond the visual description?
- 32:47 Should you really worry about multiple 301 redirects and 404 pages?
- 33:02 Does Google algorithmically demote certain sectors during a health crisis?
- 34:06 Should you really use multiple domain names for a multilingual site?
- 36:28 Do you really need to make all recipe images indexable to perform in SEO?
- 38:15 Does hreflang really guarantee correct geographic targeting of your international traffic?
- 41:05 Why does Google index only one version when your country pages are nearly identical?
- 45:51 Do you need different content to get multiple variants of the same service indexed?
- 46:27 Should you create a new page or modify the existing one for a temporary change?
- 49:01 Should you really avoid multiple title tags and meta descriptions on the same page?
- 52:13 Are 500/503 errors lasting a few hours really invisible to your indexing?
Google confirms that URLs containing non-ASCII characters (accents, ideograms, Cyrillic) are accepted in XML sitemaps, as long as the encoding specified by the sitemap spec is respected. In practice, you can submit URLs with UTF-8 characters without converting them to percent-encoding first. However, compatibility with third-party parsers and some analysis tools remains a gray area that should be tested in advance.
What you need to understand
What does 'non-ASCII characters' mean in this context?
ASCII covers only 128 code points, the basic English character set: unaccented letters, digits, common punctuation. Anything beyond that scope (French accents, German umlauts, Chinese ideograms, the Cyrillic alphabet) falls under non-ASCII.
In a URL context, these characters are traditionally escaped using percent-encoding: é becomes %C3%A9, 你 becomes %E4%BD%A0. John Mueller's statement clarifies that this conversion is not mandatory in an XML sitemap, as long as UTF-8 encoding is declared in the XML header.
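For illustration, here is a minimal Python sketch of that conversion using the standard library (the example strings are the ones from the paragraph above):

```python
from urllib.parse import quote, unquote

# Percent-encoding replaces each UTF-8 byte with a %XX escape
print(quote("é"))    # %C3%A9    (UTF-8 bytes: 0xC3 0xA9)
print(quote("你"))   # %E4%BD%A0 (UTF-8 bytes: 0xE4 0xBD 0xA0)

# The conversion is reversible
print(unquote("%C3%A9"))  # é
```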
Why is this clarification necessary?
Historically, the URL specification (RFC 3986) requires that any character outside ASCII be percent-encoded. Many SEOs have gotten into the habit of pre-encoding their URLs before inserting them into sitemaps, assuming that Googlebot would reject raw characters.
Mueller puts this reflex to rest: Google accepts both formats. You can submit https://example.fr/café or https://example.fr/caf%C3%A9 in your sitemap; both work. It's a technical simplification that removes an extra transformation step on the CMS side.
What is the sitemap specification he is talking about?
The XML sitemap protocol is described at sitemaps.org, which states that URLs must follow the RFC 3986 standard and that reserved XML characters (&, <, >) must be escaped as entities. The file itself must be UTF-8 encoded.
In short: as long as your XML file declares encoding="UTF-8" in its prologue, Googlebot interprets multi-byte characters correctly. No need for double encoding or contortions. The XML parser handles the conversion internally.
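As a sketch, a one-URL sitemap generated with Python's standard library illustrates the point: declare UTF-8 in the prologue and write the URL as-is (the café URL is the example used above; xml.etree.ElementTree also escapes reserved characters in text nodes automatically):

```python
import xml.etree.ElementTree as ET

# Build a minimal sitemap; ElementTree escapes &, <, > in text for us
urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
url = ET.SubElement(urlset, "url")
ET.SubElement(url, "loc").text = "https://example.fr/café"  # raw UTF-8, no percent-encoding

# encoding="UTF-8" writes the <?xml ... encoding='UTF-8'?> prologue
ET.ElementTree(urlset).write("sitemap.xml", encoding="UTF-8", xml_declaration=True)
```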
- Non-ASCII characters are allowed in XML sitemap URLs without prior conversion to percent-encoding.
- UTF-8 encoding must be explicitly declared in the XML header (usually present by default).
- Both formats (raw characters and percent-encoding) are accepted by Googlebot and treated equivalently.
- Watch out for third-party tools: not all XML parsers handle multi-byte characters well, especially older legacy systems.
- Escaping reserved XML characters (&, <, >) as entities remains mandatory; that is a rule of XML syntax itself.
SEO Expert opinion
Is this statement consistent with observed practices?
Field observations confirm this: Google has been crawling and indexing URLs with raw non-ASCII characters in XML sitemaps for years. French, German, and Japanese e-commerce sites that do not percent-encode their URLs suffer no crawl disadvantage.
That said, Mueller remains vague on a critical point: URL normalization. Does Google canonicalize identical URLs submitted in both raw and encoded form? The official answer is missing, but tests suggest that Google treats both as variants of the same resource, unless the server returns different HTTP codes depending on the form. This is worth verifying case by case against server logs.
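A rough way to run that check yourself is to scan the access log for both forms of each path and compare the status codes they receive. A sketch, assuming a common-log-format file at access.log (hypothetical path):

```python
import re
from collections import defaultdict
from urllib.parse import unquote

# Map each percent-decoded path to the set of HTTP statuses the server returned
statuses = defaultdict(set)
line_re = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3})')

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = line_re.search(line)
        if match:
            path, status = match.groups()
            statuses[unquote(path)].add(status)  # raw and encoded forms collapse to one key

# Paths whose variants received different answers deserve a closer look
for path, codes in statuses.items():
    if len(codes) > 1:
        print(path, codes)
```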
What risks does this flexibility introduce?
The main pitfall concerns third-party analysis tools: Screaming Frog, Botify, OnCrawl, and other log parsers. Some of these tools do not handle multi-byte UTF-8 characters properly in URL comparisons, producing false duplicates or matching errors between sitemaps and logs.
In practice, you might submit a URL with an accent in the sitemap and see Googlebot crawl it correctly, yet fail to match that visit in your reporting tools because they encode the string differently. Frustrating, but not blocking: it's a dashboard issue, not an SEO issue.
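One workaround on the reporting side is to normalize both lists to a single form before matching. A minimal sketch (percent-decode, then Unicode NFC normalization):

```python
import unicodedata
from urllib.parse import unquote

def normalize_url(url: str) -> str:
    """Decode percent-escapes, then normalize Unicode composition (NFC)."""
    return unicodedata.normalize("NFC", unquote(url))

# Both spellings of the same resource collapse to one key
assert normalize_url("https://example.fr/caf%C3%A9") == normalize_url("https://example.fr/café")
```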
Should you still encode just in case?
Honestly, no. Systematically encoding URLs complicates maintenance: readable URLs (with accents) are easier to debug, easier to read in logs, and easier to communicate to developers. If Google accepts both formats, prefer the simpler one.
One exception: multilingual sites mixing several alphabets (Latin, Cyrillic, Chinese). There, keeping the encoding homogeneous (everything percent-encoded) can simplify regexes and automated server-side processing. But that is an architectural choice, not an SEO constraint.
Practical impact and recommendations
What should you concretely do on an existing site?
Start with an audit of your current sitemap: extract a sample of URLs and check the format (raw or encoded). If everything is already encoded and working, there’s no reason to change — it’s not a ranking factor.
If you dynamically generate sitemaps via a CMS or script, ensure that the XML header declares encoding="UTF-8". Also, check that your web server returns the Content-Type application/xml; charset=UTF-8 to avoid parsing errors by strict clients.
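Both points can be verified from the outside; here is a quick sketch with Python's standard library (the sitemap URL is a placeholder to replace with your own):

```python
import urllib.request

# Placeholder URL: substitute your own sitemap location
with urllib.request.urlopen("https://example.fr/sitemap.xml") as resp:
    # Expect something like: application/xml; charset=UTF-8
    print(resp.headers.get("Content-Type"))
    # The first bytes should contain the prologue declaring encoding="UTF-8"
    print(resp.read(60).decode("utf-8", errors="replace"))
```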
How to manage multilingual URLs in sitemaps?
For sites with multiple alphabets, the simplest approach is to submit one sitemap per language/region and keep the encoding consistent within each file. This eases debugging and reduces the risk of collisions between homographic characters (e.g. Latin 'a' vs. Cyrillic 'а').
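To spot such look-alikes programmatically, one rough heuristic is to read the script from each character's Unicode name; a sketch using the standard unicodedata module:

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    # The first word of a character's Unicode name is a rough script label
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

# The 'а' below is Cyrillic, so both scripts are reported
print(scripts_used("https://example.com/pаge"))  # {'LATIN', 'CYRILLIC'}
```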
If you use hreflang annotations in your sitemaps, make sure the destination URLs are exactly the ones the server answers with a 200. A mismatch between a raw URL in the sitemap and an encoded URL in a redirect's Location header can create loops or send mixed signals to Googlebot.
What errors to avoid when generating sitemaps?
The classic error: forgetting to escape reserved XML characters (&, <, >) as entities. Even if your URLs contain valid non-ASCII characters, a single unescaped & will break parsing of the entire XML file. Google will report an error in Search Console, and none of the URLs will be crawled.
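If you build sitemaps by string templating rather than with an XML library, escape each URL explicitly; a sketch with the standard library helper (the query-string URL is hypothetical):

```python
from xml.sax.saxutils import escape

# escape() converts & to &amp;, < to &lt;, > to &gt;
loc = "https://example.fr/recherche?q=café&page=2"  # hypothetical URL with a query string
print(f"<loc>{escape(loc)}</loc>")
# <loc>https://example.fr/recherche?q=café&amp;page=2</loc>
```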
Another trap: mixing several encodings in the same sitemap (UTF-8 for some URLs, ISO-8859-1 for others). Choose UTF-8 everywhere: it's the universal web standard and the one Google favors. If your database or CMS uses a legacy encoding, convert upstream.
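The conversion itself is a one-liner once you know the legacy charset; a sketch assuming ISO-8859-1 source bytes:

```python
# 'café' as stored by a legacy ISO-8859-1 system
legacy_bytes = b"caf\xe9"

# Decode with the legacy charset, then re-encode as UTF-8
text = legacy_bytes.decode("iso-8859-1")
utf8_bytes = text.encode("utf-8")   # b'caf\xc3\xa9'
print(text, utf8_bytes)
```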
- Make sure the XML header declares encoding="UTF-8" in the prologue of the sitemap file.
- Submit the sitemap in Search Console and check the Sitemaps report to detect parsing errors.
- Ensure that the server returns the Content-Type application/xml; charset=UTF-8 for sitemap files.
- Systematically escape reserved XML characters (&, <, >) as entities, even in URLs containing non-ASCII characters.
- Cross-check server logs and sitemap to ensure that crawled URLs match submitted URLs (same form, no unexpected redirects).
- If you change the format (raw → encoded or vice versa), check that the server does not generate duplicates or unwanted redirects.
❓ Frequently Asked Questions
Do I absolutely have to percent-encode my URLs in the XML sitemap?
What happens if I submit the same URL in both raw and encoded form?
Do third-party SEO tools handle non-ASCII URLs in sitemaps correctly?
Do reserved XML characters (&, <, >) need to be escaped in sitemap URLs?
What is the main risk if I change the format of my URLs in the sitemap?
🎥 From the same video: other SEO insights extracted from this Google Search Central video · duration 54 min · published on 15/05/2020