2. Lexical analysis

A Python program is read by a parser. The input to the parser is a stream of tokens generated by the lexical analyzer. This chapter describes how the lexical analyzer breaks a file into tokens.
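To make the idea of a token stream concrete, here is a minimal sketch that uses the standard-library `tokenize` module of a current Python 3 interpreter (an assumption; the text here describes Python 2.3). The helper name `token_pairs` and the sample source line are made up for illustration.

```python
import io
import tokenize

# Hypothetical one-line program used only for illustration.
source = "x = 1 + 2\n"

def token_pairs(text):
    """Return (token type name, token string) pairs for a source string."""
    readline = io.StringIO(text).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]

pairs = token_pairs(source)
for name, string in pairs:
    # Each pair is one token as the parser would receive it.
    print(name, repr(string))
```

The names, operators, and numbers in the line each come out as separate tokens, followed by tokens that mark the end of the logical line and the end of the input.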

Python uses the 7-bit ASCII character set for program text. New in version 2.3: an encoding declaration can be used to indicate that string literals and comments use an encoding other than ASCII. For compatibility with older versions, Python only warns if it finds 8-bit characters; without a declaration, the interpretation of such characters in string literals and comments is platform-dependent. Correct those warnings either by declaring an explicit encoding or, if the bytes are binary data rather than characters, by writing them as octal or hexadecimal escape sequences.
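The two remedies mentioned above (an encoding declaration, or escape sequences for binary data) can be sketched as follows. This is an illustration run under a current Python 3 interpreter, not Python 2.3: it assumes the standard-library `tokenize.detect_encoding` and `ast.literal_eval` functions, and the sample source bytes are invented. Note also that Python 3 defaults to UTF-8 source rather than ASCII.

```python
import ast
import io
import tokenize

# Source bytes with an encoding declaration on the first line.
# The byte 0xE9 is the letter "e with acute accent" in Latin-1.
declared_source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

# detect_encoding reads the declaration and returns a normalized encoding name.
encoding, _ = tokenize.detect_encoding(io.BytesIO(declared_source).readline)
decoded_text = declared_source.decode(encoding)

# Alternative for binary data: keep the source pure ASCII and use an escape.
ascii_literal = r"b'caf\xe9'"
ascii_literal.encode("ascii")  # succeeds because the literal text is all ASCII
binary_value = ast.literal_eval(ascii_literal)

print(encoding)
print(binary_value)
```

With the declaration, the 8-bit byte is interpreted as a character in the named encoding; with the escape sequence, the source file itself never contains an 8-bit byte.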

The run-time character set depends on the I/O devices connected to the program, but it is generally a superset of ASCII.

Future compatibility note: It may be tempting to assume that the character set for 8-bit characters is ISO Latin-1 (an ASCII superset that covers most western languages that use the Latin alphabet), but Unicode text editors may become common in the future. These generally use the UTF-8 encoding, which is also an ASCII superset but makes very different use of the characters with ordinals 128-255. Since there is no consensus on this subject yet, it is unwise to assume either Latin-1 or UTF-8, even though the current implementation appears to favor Latin-1. This applies both to the source character set and to the run-time character set.
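The difference between the two encodings for ordinals 128-255 can be seen in a short sketch. It is run under a current Python 3 interpreter (an assumption; the note above concerns Python 2.3), and the sample character is chosen only for illustration.

```python
# One character with ordinal 233 (0xE9), inside the 128-255 range.
text = "\xe9"

latin1_bytes = text.encode("latin-1")  # a single byte with value 0xE9
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9

# Reading UTF-8 bytes as Latin-1 silently yields two wrong characters.
misread = utf8_bytes.decode("latin-1")

# Reading Latin-1 bytes as UTF-8 fails, because 0xE9 alone is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False

# Pure ASCII text is encoded identically by both.
ascii_same = "abc".encode("latin-1") == "abc".encode("utf-8")
```

Because the same bytes decode to different text (or fail to decode) depending on the assumption, code that guesses either encoding can corrupt data that was written with the other.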