3.1.3 Unicode 字符串 Unicode Strings

Starting with Python 2.0 a new data type for storing text data is available to the programmer: the Unicode object. It can be used to store and manipulate Unicode data (see http://www.unicode.org/) and integrates well with the existing string objects providing auto-conversions where necessary.

从Python2.0开始,程序员们可以使用一种新的数据类型来存储文本数据:Unicode 对象。它可以用于存储多种Unicode数据(请参阅 http://www.unicode.org/ ),并且,通过必要时的自动转换,它可以与现有的字符串对象良好的结合。

Unicode has the advantage of providing one ordinal for every character in every script used in modern and ancient texts. Previously, there were only 256 possible ordinals for script characters and texts were typically bound to a code page which mapped the ordinals to script characters. This lead to very much confusion especially with respect to internationalization (usually written as "i18n" -- "i" + 18 characters + "n") of software. Unicode solves these problems by defining one code page for all scripts.

Unicode 针对现代和旧式的文本中所有的字符提供了一个序列。以前,字符只能使用256个序号,文本通常通过绑定代码页来与字符映射。这很容易导致混乱,特别是软件的国际化( internationalization --通常写做“i18n”--“i”+18 characters +“n”)。 Unicode 通过为所有字符定义一个统一的代码页解决了这个问题。

Creating Unicode strings in Python is just as simple as creating normal strings:

Python 中定义一个 Unicode 字符串和定义一个普通字符串一样简单:

>>> u'Hello World !'
u'Hello World !'

The small "u" in front of the quote indicates that an Unicode string is supposed to be created. If you want to include special characters in the string, you can do so by using the Python Unicode-Escape encoding. The following example shows how:

引号前小写的“u”表示这里创建的是一个 Unicode 字符串。如果你想加入一个特殊字符,可以使用 Python 的 Unicode-Escape 编码。如下例所示:

>>> u'Hello\u0020World !'
u'Hello World !'

The escape sequence \u0020 indicates to insert the Unicode character with the ordinal value 0x0020 (the space character) at the given position.

被替换的 \u0020 标识表示在给定位置插入编码值为 0x0020 的 Unicode 字符(空格符)。

Other characters are interpreted by using their respective ordinal values directly as Unicode ordinals. If you have literal strings in the standard Latin-1 encoding that is used in many Western countries, you will find it convenient that the lower 256 characters of Unicode are the same as the 256 characters of Latin-1.

其它字符也会被直接解释成对应的 Unicode 码。如果你有一个在西方国家常用的 Latin-1 编码字符串,你可以发现 Unicode 字符集的前256个字符与 Latin-1 的对应字符编码完全相同。

For experts, there is also a raw mode just like the one for normal strings. You have to prefix the opening quote with 'ur' to have Python use the Raw-Unicode-Escape encoding. It will only apply the above \uXXXX conversion if there is an uneven number of backslashes in front of the small 'u'.

另外,有一种与普通字符串相同的行模式。想要使用 Python 的Raw-Unicode-Escape 编码,你需要在字符串的引号前加上 ur 前缀。如果在小写“u”前可能有不止一个反斜杠,它只会把那些单独的
uXXXX 转化为Unicode字符。

>>> ur'Hello\u0020World !'
u'Hello World !'
>>> ur'Hello\\u0020World !'
u'Hello\\\\u0020World !'

The raw mode is most useful when you have to enter lots of backslashes, as can be necessary in regular expressions.

行模式在你需要输入很多个反斜杠时很有用,可能会用于正则表达式。

Apart from these standard encodings, Python provides a whole set of other ways of creating Unicode strings on the basis of a known encoding.

作为这些编码标准的一部分, Python 提供了一个完备的方法集用于从已知的编码集创建 Unicode 字符串。

The built-in function unicode() provides access to all registered Unicode codecs (COders and DECoders). Some of the more well known encodings which these codecs can convert are Latin-1, ASCII, UTF-8, and UTF-16. The latter two are variable-length encodings that store each Unicode character in one or more bytes. The default encoding is normally set to ASCII, which passes through characters in the range 0 to 127 and rejects any other characters with an error. When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding.

内置函数 unicode() 提供了访问(编码和解码)所有已注册的 Unicode 编码的方法。它能转换众所周知的 Latin-1, ASCII, UTF-8, 和 UTF-16。后面的两个可变长编码字符集用一个或多个 byte 存储 Unicode 字符。默认的字符集是 ASCII ,它只处理0到127的编码,拒绝其它的字符并返回一个错误。当一个 Unicode 字符串被打印、写入文件或通过 str() 转化时,它们被替换为默认的编码。

>>> u"abc"
u'abc'
>>> str(u"abc")
'abc'
>>> u"漩"
u'\xe4\xf6\xfc'
>>> str(u"漩")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-2: ordinal not in range(128)

To convert a Unicode string into an 8-bit string using a specific encoding, Unicode objects provide an encode() method that takes one argument, the name of the encoding. Lowercase names for encodings are preferred.

要把一个 Unicode 字符串用指定的字符集转化成8位字符串,可以使用 Unicode 对象提供的 encode() 方法,它有一个参数用以指定编码名称。编码名称小写。

>>> u"漩".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'

If you have data in a specific encoding and want to produce a corresponding Unicode string from it, you can use the unicode() function with the encoding name as the second argument.

如果你有一个特定编码的字符串,想要把它转为 Unicode 字符集,,可以使用 encode() 函数,它以编码名做为第二个参数。

>>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8')
u'\xe4\xf6\xfc'