Robots.txt has its own rules for encoding content, and they do not allow any text encoding other than US-ASCII.
According to the URI specifications, only the US-ASCII character set may be used to define URLs. This can create quite a lot of trouble for webmasters trying to set up a robots.txt file that uses a different character set.
ASCII’s 128 characters cover only the English alphabet, digits, and punctuation marks. That makes it impossible to control search engine behaviour when folder names contain “weird” characters such as ñ in Spanish or ç in French, which are left out of ASCII.
Most characters in non-Latin-based alphabets, such as pi (π) in Greek, ya (я) in Cyrillic, and entire alphabets from many other world languages, can’t be accurately written in the limited, English-oriented ASCII.
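As a rough illustration of the point above, the following Python sketch lists the characters in a line of text that fall outside the 128 ASCII code points. The function name `non_ascii_chars` and the sample path are hypothetical, chosen only for this example.

```python
def non_ascii_chars(text: str) -> list[str]:
    """Return the distinct non-ASCII characters in text, in order of first appearance."""
    found = []
    for ch in text:
        # ASCII covers code points 0 to 127; anything above is outside it
        if ord(ch) > 127 and ch not in found:
            found.append(ch)
    return found

# Hypothetical rule containing a Spanish folder name
print(non_ascii_chars("Disallow: /españa/"))
```

Running a check like this over each line of a robots.txt file is one way to spot paths that still need to be encoded.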
The robots.txt file itself can be saved in one of the following encodings:
- ANSI (Windows-1252)
- Unicode
- UTF-8
The file, however, supports the following encodings for its content:
- ANSI (Windows-1252): 8 bit
- ASCII: 7 bit
- ISO-8859-1: 8 bit
- UTF-8: 8 bit
Let’s take the case of a Russian website that uses Cyrillic names for its folders and directories. In this case, characters like я must be correctly encoded into US-ASCII.
This is where percent-encoding comes into play: it makes it possible to encode a non-ASCII string as a sequence of ASCII characters that search engines can read perfectly.
Let’s consider a Russian website with an admin folder we do not want search engines to crawl:
http://www.domain.com/папка/
To keep search engines from crawling the admin folder, the folder’s name should be encoded as follows:
Disallow: /%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/
…while the following line won’t work, since directory paths in robots.txt must always be encoded in US-ASCII:
Disallow: /папка/
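The encoded form does not have to be worked out by hand. Here is a minimal Python sketch, assuming the path is percent-encoded from its UTF-8 bytes (which is what the encoded rule above corresponds to). It uses `urllib.parse.quote` from the standard library; the helper name `disallow_rule` is hypothetical.

```python
from urllib.parse import quote, unquote

def disallow_rule(path: str) -> str:
    """Build a Disallow line whose path is percent-encoded from UTF-8 bytes."""
    # safe="/" leaves the path separators alone; every non-ASCII byte becomes %XX
    return "Disallow: " + quote(path, safe="/")

rule = disallow_rule("/папка/")
print(rule)

# Decoding the encoded part gives back the original folder name
print(unquote(rule.split(" ", 1)[1]))
```

Note that the helper expects a raw, not yet encoded path: feeding it an already percent-encoded string would encode the `%` signs a second time.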
You might also want to read this article from the Bing Community, which explains the issue.