@Jackeriss Thanks.
However, snownlp also splits at commas:
zh = snownlp.SnowNLP('非中文分句,nltk 里的 tokenizer 可处理各种语言(其实是用 nltk 里的 pickle 文件),可惜用了 nltk 的 Python 程序生成 exe 包时麻烦多多,因为 nltk 的 data 目录结构不在 python lib 下。另外 github 上有个 segtok 项目,好像还可以用,但性能不如 nltk。')
zh.sentences
['非中文分句',
'nltk 里的 tokenizer 可处理各种语言(其实是用 nltk 里的 pickle 文件)',
'可惜用了 nltk 的 Python 程序生成 exe 包时麻烦多多',
'因为 nltk 的 data 目录结构不在 python lib 下',
'另外 github 上有个 segtok 项目',
'好像还可以用',
'但性能不如 nltk']
The punctuation is dropped from the output as well. My guess is it's simply done with re.split.
snownlp aims to be a Chinese-language textblob, but unfortunately it hasn't put enough work into sentence segmentation.
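If that guess is right, the output above can be reproduced with a one-liner (a minimal sketch of the suspected behavior, not snownlp's actual implementation; the exact punctuation set is an assumption):

```python
import re

def naive_split(text):
    # Split on both sentence-ending and clause punctuation (Chinese and
    # Western), dropping the delimiters -- which would explain why commas
    # cause splits and why the punctuation vanishes from the output.
    return [s for s in re.split(r'[,,。.!?!?;;]', text) if s]

print(naive_split('非中文分句,nltk 里的 tokenizer 可处理各种语言。'))
# → ['非中文分句', 'nltk 里的 tokenizer 可处理各种语言']
```

A proper segmenter would instead keep clause-internal commas together and preserve the trailing punctuation of each sentence.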
textblob's sentence segmentation, by comparison:
blob = textblob.TextBlob('The rider sits on and operates these vehicles like a motorcycle, but the extra wheels give more stability at slower speeds. Although equipped with three or four wheels, six-wheel models exist for specialized applications. Engine sizes of ATVs currently for sale in the United States, (as of 2008 products), range from 49 to 1,000 cc (3 to 61 cu in).')
blob.sentences
[Sentence("The rider sits on and operates these vehicles like a motorcycle, but the extra wheels give more stability at slower speeds."),
Sentence("Although equipped with three or four wheels, six-wheel models exist for specialized applications."),
Sentence("Engine sizes of ATVs currently for sale in the United States, (as of 2008 products), range from 49 to 1,000 cc (3 to 61 cu in).")]