A small pitfall: the Python 2.7 str method isalpha does not support Unicode

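While writing a search component today, I wanted to choose which field to search depending on whether the query consists entirely of letters.
So I wrote the following code: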

```python
if q.isalpha():
    # Query is all letters: search by username
    query = query.filter(User.username.ilike(like_str))
else:
    # Otherwise: search by real name
    query = query.filter(User.realname.ilike(like_str))
```

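However, I found that `isalpha` returns `True` even when the query contains Chinese characters.
Testing showed that the `isalpha` method is unreliable for this check when the string is Unicode.
Flask decodes request parameters as UTF-8 by default, so the value has to be re-encoded with `encode('utf-8')` before `isalpha()` is usable here.
The test: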

```python
In [15]: u"张 x".isalpha()
Out[15]: True

In [16]: "张 x".isalpha()
Out[16]: False

In [17]: "aac".isalpha()
Out[17]: True

In [18]: u"张 x".encode('utf-8').isalpha()
Out[18]: False
```
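If the goal is only to tell "all English letters" apart from everything else, an explicit ASCII check avoids depending on how `isalpha` treats Unicode. A minimal sketch, assuming that "letters" means exactly `A-Z` and `a-z`; the helper name `is_ascii_letters` is hypothetical and not part of the original code:

```python
import re

# Hypothetical helper: matches one or more ASCII letters and nothing else.
# \Z anchors at the very end of the string, so a trailing newline is rejected.
ASCII_LETTERS_RE = re.compile(r'[A-Za-z]+\Z')


def is_ascii_letters(text):
    """Return True only if text is non-empty and made up solely of A-Z / a-z."""
    return ASCII_LETTERS_RE.match(text) is not None


print(is_ascii_letters(u"aac"))       # True
print(is_ascii_letters(u"\u5f20x"))   # False (\u5f20 is the character 张)
print(is_ascii_letters(u"a1"))        # False
```

The `u""` literals are used so the same snippet should run on Python 2.7 and on Python 3.3+.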

banxi1988

2015-10-18 20:52:42 +08:00

@mulog

That is exactly what the documentation says.

> str.isalpha()
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise.

Also, take a look at the output below.

```python
In [23]: "a1".isalpha()
Out[23]: False

In [24]: "ab".isalpha()
Out[24]: True

In [25]: "a?".isalpha()
Out[25]: False
```

hahastudio

2015-10-19 10:16:33 +08:00

I wonder whether we are really using the same Python 2.7.
Python 2.7.10
In [10]: u'张 x'.isalpha()
Out[10]: False

Also:
https://docs.python.org/2/library/stdtypes.html#str.isalpha
For 8-bit strings, this method is locale-dependent.

Clarencep

2015-10-19 10:25:53 +08:00

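Someone asked about this on segmentfault before: http://segmentfault.com/q/1010000000732038/a-1020000000732447
> For a unicode string, string.isalpha decides whether the string consists entirely of letters by checking whether each character belongs to Unicode's LETTER category. So a result of True does not necessarily mean the string contains only the 26 English letters.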

banxi1988

2015-10-19 11:48:48 +08:00

@hahastudio
I'm using: `Python 2.7.10 (default, Aug 22 2015, 20:33:39)`

But as @Clarencep pointed out, the problem does exist.

I also tried it in the shell on the official website, https://www.python.org/shell/, and in Python 3.4 the result of isalpha() is still unreliable.

```ipython
In [1]: "\u5f20".isalpha()
Out[1]: True
In [2]: "\u5f20".encode('utf-8').isalpha()
Out[2]: False
```

aro167

2015-10-19 12:29:36 +08:00

Using translate is reliable:
import string
notrans = string.maketrans('', '')
def containsAll(astr, strset):
    return not strset.translate(notrans, astr)
containsAll(string.letters, '我是 aro167')
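The same "every character must come from an allowed set" idea can be sketched without `string.maketrans`, which lives in the `string` module only on Python 2. This is a set-based alternative rather than the `translate` technique itself, under the assumption that the allowed set is `string.ascii_letters`; the name `only_ascii_letters` is hypothetical. Unlike `containsAll`, it rejects the empty string, mirroring `isalpha`:

```python
import string

# Allowed characters: the 52 ASCII letters.
ASCII_LETTERS = frozenset(string.ascii_letters)


def only_ascii_letters(text):
    """Return True if text is non-empty and every character is an ASCII letter."""
    return len(text) > 0 and all(ch in ASCII_LETTERS for ch in text)


print(only_ascii_letters(u"aro"))                    # True
print(only_ascii_letters(u"\u6211\u662f aro167"))    # False (我是 aro167)
```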

Clarencep

2015-10-19 13:27:10 +08:00

@banxi1988
The set of letters recognized by unicode's isalpha is not limited to [a-zA-Z]. For example:

>>> u'测试'.isalpha()
True

However, full-width digits and punctuation are not treated as letters:

>>> u'０１２３４５６７８９'.isalpha()
False
>>> u'，。；‘'.isalpha()
False

So this is probably not a Python bug.

staticor

2015-10-19 13:46:15 +08:00

str.isalpha()

Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.

Same understanding as above: isalpha() != English letters
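The general categories mentioned in that passage can be inspected directly with the standard `unicodedata` module. A small sketch; the helper name `categories` is hypothetical, and the sample strings are chosen here for illustration:

```python
import unicodedata


def categories(text):
    """Return the Unicode general category of each character in text."""
    return [unicodedata.category(ch) for ch in text]


# 'a' is a lowercase letter (Ll), \u5f20 (张) is "other letter" (Lo),
# and '1' is a decimal digit (Nd).
print(categories(u"a\u5f201"))

# Full-width zero (\uff10) is a digit (Nd) and the full-width comma
# (\uff0c) is punctuation (Po), so neither counts as a letter.
print(categories(u"\uff10\uff0c"))
```

Any category starting with `L` is a letter in this sense, which is why a Han character passes the Unicode `isalpha` check while full-width digits and punctuation do not.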

mulog

2015-10-19 14:12:57 +08:00

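For unicode, if your string consists entirely of "letters", isalpha returns True; there is nothing unreliable about it.
Strictly speaking, of course, Chinese characters are not "letters", so the notion of alphabetical does not really apply to them, but that is a separate matter.
Once you encode, you get a str, and isalpha then examines each byte of the encoding, which is meaningless.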
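To make the "each byte" point concrete, here is a rough sketch that lists the UTF-8 bytes of a short string and checks each one against the ASCII letter ranges. The helper names are hypothetical, and the ASCII-only range test is an assumption standing in for what a byte-level letter check does in the default C locale:

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of integers."""
    return list(bytearray(text.encode('utf-8')))


def is_ascii_letter_byte(b):
    """True if the byte value falls in A-Z (0x41-0x5A) or a-z (0x61-0x7A)."""
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A


encoded = utf8_bytes(u"\u5f20x")  # \u5f20 is the character 张
print([hex(b) for b in encoded])
# ['0xe5', '0xbc', '0xa0', '0x78']
print([is_ascii_letter_byte(b) for b in encoded])
# [False, False, False, True]
```

The three bytes of the Han character carry no meaning individually; only the last byte, `x`, is a letter.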

banxi1988

2015-10-19 17:38:48 +08:00

@mulog
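It is meaningful. After a Chinese character is encoded as UTF-8,
its lead byte is necessarily outside the range of the basic ASCII characters (or letters).
So for a simple check that tells letters apart from Chinese characters, it is good enough.

Reference: http://www.unicode.org/charts/unihangridindex.html
Code point range of common Han characters: U+4E00 through U+9FCC
Code point range of extension Han characters: U+3400 through U+4DB5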

```ipython
In [24]: u"\u3400".encode('utf-8')
Out[24]: '\xe3\x90\x80'

In [22]: u"\u4300".encode('utf-8')
Out[22]: '\xe4\x8c\x80'

In [23]: 0xe4
Out[23]: 228
```
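The claim can be checked exhaustively for the two ranges listed above. A sketch, with a hypothetical helper name, that encodes every code point in a range and verifies that none of its UTF-8 bytes falls in the ASCII range (below 0x80):

```python
try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr already returns a Unicode character


def all_bytes_non_ascii(first, last):
    """True if every UTF-8 byte of every code point in [first, last] is >= 0x80."""
    for cp in range(first, last + 1):
        for b in bytearray(unichr(cp).encode('utf-8')):
            if b < 0x80:
                return False
    return True


print(all_bytes_non_ascii(0x4E00, 0x9FCC))  # True
print(all_bytes_non_ascii(0x3400, 0x4DB5))  # True
```

This only shows that the encoded bytes are never ASCII letters. As the documentation note quoted earlier in the thread says, `isalpha` on 8-bit strings is locale-dependent in Python 2, so whether bytes at or above 0x80 count as alphabetic may depend on the active locale.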


https://www.v2ex.com/t/228992
