Should you encode non-ASCII characters in XML sitemap URLs?

Official statement

URLs in an XML sitemap can contain non-ASCII characters. You just need to respect the encoding specified in the sitemap specification.

Source: extracted from a Google Search Central video (duration 54:50, in English, published 15/05/2020); the statement appears at 37:49.

Official statement from May 15, 2020 (5 years ago)

TL;DR

Google confirms that URLs containing non-ASCII characters (accents, ideograms, Cyrillic) are accepted in XML sitemaps, as long as the encoding specified in the sitemap spec is respected. In practice, you can submit URLs with raw UTF-8 characters without first converting them to percent-encoding. However, compatibility with third-party parsers and some analysis tools remains a gray area and should be tested in advance.

What you need to understand

What does 'non-ASCII characters' mean in this context?

ASCII characters cover only the first 128 symbols of the basic English character set: unaccented letters, numbers, common punctuation. Anything beyond this scope — French accents, German umlauts, Chinese ideograms, Cyrillic alphabet — falls under non-ASCII.

In a URL context, these characters are traditionally escaped by percent-encoding their UTF-8 bytes: é becomes %C3%A9, 你 becomes %E4%BD%A0. Mueller's statement clarifies that this conversion is not mandatory in an XML sitemap, as long as UTF-8 encoding is declared in the XML header.
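As a quick illustration of the conversion described above, here is a minimal Python sketch using the standard library's urllib.parse. The helper name percent_encode is hypothetical and only wraps quote() for readability.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str) -> str:
    """Percent-encode every non-unreserved character as UTF-8 bytes."""
    # safe="" means even "/" would be encoded; fine for single characters.
    return quote(text, safe="")


for char in ("é", "你"):
    encoded = percent_encode(char)
    # unquote() reverses the operation and restores the original character.
    print(char, "->", encoded, "->", unquote(encoded))
```

Each non-ASCII character expands to one %XX triplet per UTF-8 byte, which is why é takes two triplets and 你 takes three.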

Why is this clarification necessary?

Historically, the URI specification (RFC 3986) allows only ASCII characters, so anything outside that range has to be percent-encoded. Many SEOs have gotten into the habit of pre-encoding their URLs before inserting them into sitemaps, assuming that Googlebot would reject raw characters.

Mueller cuts this reflex short: Google accepts both formats. You can submit https://example.fr/café or https://example.fr/caf%C3%A9 in your sitemap — both work. It's a technical simplification that avoids an extra transformation step on the CMS side.
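To see that the two spellings name the same address, the sketch below converts either form to the percent-encoded one and compares the results. The function names to_encoded and same_resource are hypothetical, and the approach is a simplification: it only touches the path and assumes the path contains no encoded slash (%2F), which unquoting would alter.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_encoded(url: str) -> str:
    """Return the URL with its path percent-encoded exactly once."""
    parts = urlsplit(url)
    # Unquote first so an already-encoded path is not double-encoded.
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def same_resource(url_a: str, url_b: str) -> bool:
    """True when both URLs share the same encoded form (path only)."""
    return to_encoded(url_a) == to_encoded(url_b)


print(to_encoded("https://example.fr/café"))
print(same_resource("https://example.fr/café", "https://example.fr/caf%C3%A9"))
```

This only compares strings. Whether the server actually answers both forms identically still has to be checked against the live site.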

What is the sitemap specification he is talking about?

The XML sitemap protocol is described at sitemaps.org, which states that URLs must follow the RFC 3986 standard and that characters reserved in XML, such as &, <, and >, must be written as entities (&amp;, &lt;, &gt;). The file itself must be encoded in UTF-8.

In short: as long as your XML file declares encoding="UTF-8" in its prologue, Googlebot interprets multi-byte characters correctly. No need for double encoding or contortions. The XML parser handles the conversion internally.

Non-ASCII characters are allowed in XML sitemap URLs without prior conversion to percent-encoding.
UTF-8 encoding must be explicitly declared in the XML header (usually present by default).
Both formats (raw characters and percent-encoding) are accepted by Googlebot and treated equivalently.
Watch out for third-party tools: not all XML parsers handle multi-byte characters well, especially older legacy systems.
XML entities (&amp;, &lt;, &gt;) remain mandatory for the characters reserved by XML syntax itself (&, <, >).
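The points above can be sketched as a small sitemap generator. This is a minimal example, assuming a plain list of URLs and no optional tags such as lastmod. The function name build_sitemap is hypothetical; escape() comes from Python's standard xml.sax.saxutils module and handles &, < and >.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls) -> bytes:
    """Build a UTF-8 sitemap, keeping non-ASCII characters raw."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<urlset xmlns="{SITEMAP_NS}">',
    ]
    for url in urls:
        # Only the XML-reserved characters are escaped; accents stay as-is.
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    # The bytes written to disk must match the declared encoding.
    return "\n".join(lines).encode("utf-8")


print(build_sitemap(["https://example.fr/café?a=1&b=2"]).decode("utf-8"))
```

In the output, the accent appears unchanged while the ampersand in the query string is written as &amp;.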

SEO Expert opinion

Is this statement consistent with observed practices?

On the ground, observations confirm: Google has been crawling and indexing URLs with raw non-ASCII characters in XML sitemaps for years. French, German, or Japanese e-commerce sites that do not encode their URLs in percentages do not suffer any crawl disadvantage.

That being said, Mueller remains vague on a critical point: URL normalization. Does Google canonicalize identical URLs submitted in raw and encoded versions? The official answer is missing, but tests show that Google treats both as variants of the same resource — unless the server returns different HTTP codes based on the form. [To be verified] on a case-by-case basis with server logs.

What risks does this flexibility introduce?

The main pitfall concerns third-party analysis tools: Screaming Frog, Botify, OnCrawl, and other log parsers. Some vendors do not handle multi-byte UTF-8 characters properly in their URL comparisons, leading to false duplicates or matching errors between sitemaps and logs.

In practice, you could submit a URL with an accent in the sitemap, see Googlebot crawl it correctly, but fail to match this visit in your reporting tools because the tool encodes the string differently. Frustrating but not blocking — it's a dashboard issue, not an SEO issue.
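If you need to reconcile the two sources yourself, one option is to reduce both sides to a common comparison key before matching. The sketch below is an assumption about how such a reconciliation could work, not how any of the tools named above behave. It decodes percent-encoding and applies Unicode NFC normalization so that differently encoded spellings of the same path compare equal. The names match_key and unmatched are hypothetical.

```python
import unicodedata
from urllib.parse import unquote, urlsplit


def match_key(url_or_path: str) -> str:
    """Reduce a URL or a logged path to a decoded, NFC-normalized path."""
    path = urlsplit(url_or_path).path
    # unquote() accepts both %C3%A9 and %c3%a9; NFC merges "e" + combining
    # accent into the single precomposed character.
    return unicodedata.normalize("NFC", unquote(path))


def unmatched(sitemap_urls, logged_paths):
    """Return sitemap URLs with no corresponding path in the logs."""
    seen = {match_key(path) for path in logged_paths}
    return [url for url in sitemap_urls if match_key(url) not in seen]


sitemap = ["https://example.fr/café", "https://example.fr/thé"]
logs = ["/caf%c3%a9?utm=x"]
print(unmatched(sitemap, logs))
```

The key ignores the host and the query string, which suits a single-domain log file; a multi-domain setup would need the host in the key as well.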

Should you still encode just in case?

Honestly, no. Systematically encoding URLs complicates maintenance: readable URLs (with accents) are easier to debug, read in logs, and communicate to developers. If Google accepts both formats, it's better to prefer the simpler one.

An exception: multilingual sites with multiple alphabets (Latin + Cyrillic + Chinese). In this case, keeping a homogeneous encoding (everything percent-encoded) can simplify regular expressions and automated server-side processing. But it's an architectural choice, not an SEO constraint.

Alert: If you are migrating from an encoded format to a non-encoded format (or vice versa), check that your server does not treat the two as distinct URLs. A test with curl or Postman is enough to avoid accidental content duplication.

Practical impact and recommendations

What should you concretely do on an existing site?

Start with an audit of your current sitemap: extract a sample of URLs and check the format (raw or encoded). If everything is already encoded and working, there’s no reason to change — it’s not a ranking factor.

If you dynamically generate sitemaps via a CMS or script, ensure that the XML header declares encoding="UTF-8". Also, check that your web server returns the Content-Type application/xml; charset=UTF-8 to avoid parsing errors by strict clients.
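Both checks can be automated. The following sketch inspects the first bytes of a sitemap file and a Content-Type header value that you have already retrieved by other means. The function names are hypothetical, and the header check is deliberately strict: it accepts only application/xml with an unquoted charset=UTF-8, as recommended above.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"


def declares_utf8(xml_bytes: bytes) -> bool:
    """True when the XML prologue declares encoding="UTF-8"."""
    if xml_bytes.startswith(UTF8_BOM):
        xml_bytes = xml_bytes[len(UTF8_BOM):]
    match = re.match(
        rb"<\?xml[^>]*encoding=[\"']([A-Za-z0-9._-]+)[\"']", xml_bytes
    )
    return bool(match) and match.group(1).lower() == b"utf-8"


def content_type_is_utf8_xml(header_value: str) -> bool:
    """True for 'application/xml; charset=UTF-8' (case-insensitive)."""
    parts = [part.strip().lower() for part in header_value.split(";")]
    return parts[0] == "application/xml" and "charset=utf-8" in parts[1:]


print(declares_utf8(b'<?xml version="1.0" encoding="UTF-8"?><urlset/>'))
print(content_type_is_utf8_xml("application/xml; charset=UTF-8"))
```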

How to manage multilingual URLs in sitemaps?

For sites with multiple alphabets, the simplest way is to submit one sitemap per language/region and maintain consistent encoding within each file. This makes debugging easier and reduces the risk of collisions between look-alike characters, also called homoglyphs (e.g. Latin 'a' vs Cyrillic 'а').
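Look-alike characters are hard to spot by eye, so a rough automated check can help. The sketch below is a heuristic, not a full implementation of Unicode script detection: it infers the script from the first word of each character's Unicode name (for example "LATIN" or "CYRILLIC"). The function names are hypothetical.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Return the scripts of the letters in text, based on Unicode names."""
    scripts = set()
    for char in text:
        if char.isalpha():
            # Names look like "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
            name = unicodedata.name(char, "UNKNOWN")
            scripts.add(name.split(" ")[0])
    return scripts


def has_mixed_scripts(segment: str) -> bool:
    """True when a URL segment mixes letters from more than one script."""
    return len(scripts_used(segment)) > 1


print(has_mixed_scripts("p\u0430ypal"))  # second letter is Cyrillic "а"
print(has_mixed_scripts("café"))
```

Run this on individual path segments rather than whole URLs, since a legitimate URL can combine a Latin domain with a Cyrillic or Chinese slug.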

If you use hreflang tags in your sitemaps, make sure that the destination URLs are exactly those that the server returns with a 200. A mismatch between raw URL in the sitemap and encoded URL in the Location header of a redirect can create loops or mixed signals for Googlebot.

What errors to avoid when generating sitemaps?

The classic error: forgetting to escape the characters reserved in XML (&, <, >) as entities (&amp;, &lt;, &gt;). Even if your URLs contain valid non-ASCII characters, a single unescaped & makes the whole XML file malformed. Search Console will report a parsing error, and none of the URLs in that file will be read from the sitemap.
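The failure is easy to reproduce locally with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a deliberately simplified fragment (no namespace, one URL). The helper is_well_formed is a hypothetical name, and the blanket replace() is only acceptable here because the sample contains no entity that is already escaped.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """True when the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


broken = "<urlset><url><loc>https://example.fr/café?a=1&b=2</loc></url></urlset>"
# Escaping the ampersand is enough; the accent needs no change.
fixed = broken.replace("&", "&amp;")

print(is_well_formed(broken))
print(is_well_formed(fixed))
```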

Another trap: mixing multiple encodings in the same sitemap (UTF-8 for some URLs, ISO-8859-1 for others). Choose UTF-8 everywhere, it’s the universal web standard and the one Google favors. If your database or CMS has a legacy encoding, convert upstream.
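Converting upstream can be as simple as decoding with the legacy charset and re-encoding as UTF-8. The sketch below assumes the legacy data is ISO-8859-1 and uses a heuristic: bytes that already decode as UTF-8 are left untouched. The function name to_utf8 is hypothetical, and the heuristic can misfire on the rare legacy strings that also happen to be valid UTF-8.

```python
def to_utf8(raw: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Return raw as UTF-8 bytes, converting from the legacy encoding if needed."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        # Interpret the bytes with the legacy charset, then re-encode.
        return raw.decode(legacy_encoding).encode("utf-8")


legacy_url = "https://example.fr/café".encode("iso-8859-1")
print(to_utf8(legacy_url).decode("utf-8"))
```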

Make sure the XML header declares encoding="UTF-8" in the prologue of the sitemap file.
Submit the sitemap in the Search Console Sitemaps report to detect parsing errors.
Ensure that the server returns the Content-Type application/xml; charset=UTF-8 for sitemap files.
Systematically escape the reserved XML characters (&, <, >) as entities (&amp;, &lt;, &gt;), even in URLs with non-ASCII characters.
Cross-check server logs and sitemap to ensure that crawled URLs match submitted URLs (same form, no unexpected redirects).
If you change the format (raw → encoded or vice versa), check that the server does not generate duplicates or unwanted redirects.

In summary: Google accepts non-ASCII characters in XML sitemaps without requiring percent-encoding, as long as the file is valid UTF-8. It’s a welcome technical simplification that avoids unnecessary transformations. However, ensuring consistency between the sitemap, server, and analysis tools remains your responsibility, and it can be complex to audit on high-volume multilingual sites. If you're unsure about the technical architecture of your sitemaps or notice inconsistencies between submitted and crawled URLs, it may be wise to consult a specialized SEO agency for an in-depth diagnostic and personalized support on the dynamic generation of sitemaps.

❓ Frequently Asked Questions

Do I absolutely have to percent-encode my URLs in the XML sitemap?

No. Google accepts raw non-ASCII characters (accents, ideograms) in XML sitemaps as long as UTF-8 encoding is declared in the header. Both formats (raw and encoded) work.

What happens if I submit the same URL in both raw and encoded form?

Google generally treats the two as variants of the same resource and applies its internal normalization. Still, check that your server does not return different HTTP status codes depending on the form, to avoid duplicates.

Do third-party SEO tools handle non-ASCII URLs in sitemaps correctly?

Not all of them. Some log parsers or crawlers can misinterpret multi-byte characters, producing false duplicates or matching errors. Test your tools before switching at scale.

Should the reserved XML characters (&, <, >) be escaped in sitemap URLs?

Yes, always. Even though non-ASCII characters are accepted raw, the reserved XML characters must be escaped as &amp;, &lt;, and &gt; for the file to remain valid.

What is the main risk if I change the format of my sitemap URLs?

Creating accidental duplication if the server treats raw and encoded URLs as distinct resources. Check the HTTP status codes returned and add 301 redirects if needed to unify the variants.
