Segmenting English text into multi-word phrases

There is already an earlier question on this: https://www.v2ex.com/t/340752#reply13

I've searched on this for quite a while and still don't really get it, so I'm opening this thread.

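I want to segment English text into phrases (that may not be the right name for it). For example, given "I love New York", I want I / love / New York rather than I / love / New / York. Splitting "New York" changes the original meaning.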
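
To make the target behavior concrete, here is a minimal sketch assuming I already had a hand-made list of phrases. `merge_phrases` and `PHRASES` are names I made up for illustration, not part of any library. It just does a greedy longest-match merge over whitespace tokens, so it only handles phrases that are in the list:

```python
def merge_phrases(tokens, phrases):
    """Greedily merge known multi-word phrases, preferring the longest match."""
    # Hypothetical lexicon: each phrase stored as a tuple of lowercased words.
    lexicon = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in lexicon), default=1)

    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tokens[i:i + n]
            if tuple(t.lower() for t in window) in lexicon:
                out.append(" ".join(window))
                i += n
                break
        else:
            # No phrase starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out


PHRASES = ["New York", "New York City"]  # made-up phrase list
result = merge_phrases("I love New York".split(), PHRASES)
print(" / ".join(result))  # I / love / New York
```

What I am really after is a tool that does this without me having to supply the phrase list myself.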

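There are plenty of tools for Chinese word segmentation, such as jieba ( https://github.com/fxsjy/jieba ), but finding one that segments English into phrases is very hard. I don't even know which term to search for: tokenizer, chunking, or text segmentation? Is there a convenient, ready-to-use tool that does this for English?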

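For example, Stanford's stanza ( https://github.com/stanfordnlp/stanza ) can tokenize text. Its Chinese segmentation looks fine, but for English it essentially just splits on whitespace and punctuation.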

import stanza

text = """英国首相约翰逊 6 日晚因病情恶化，被转入重症监护室治疗。"""

zh_nlp = stanza.Pipeline('zh')
doc = zh_nlp(text)

for sent in doc.sentences:
    print("Sentence：" + sent.text)  # sentence splitting
    print("Tokenize：" + ' '.join(token.text for token in sent.tokens))  # Chinese word segmentation

The output is the segmented text, which is fine:

Tokenize：英国 首相 约翰逊 6 日 晚因 病情 恶化 ， 被 转入 重症 监护 室 治疗 。

But for English tokenization:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_no_ssplit=True)
doc = nlp('This is a sentence.\n\nThis is a second. This is a third.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

The output is:

====== Sentence 1 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: sentence
id: (5,)	text: .
====== Sentence 2 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: second
id: (5,)	text: .
id: (6,)	text: This
id: (7,)	text: is
id: (8,)	text: a
id: (9,)	text: third
id: (10,)	text: .

yucongo

2021-01-05 15:23:34 +08:00

Did the OP ever find a solution? If so, could you share it?

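If you only care about noun phrases, the noun_chunks / noun phrases / NER features in spacy, textacy, and textblob might help. But I'd also like to split English sentences into meaningful phrases the way jieba segments Chinese, for example: A match / is / a tool / for starting / a fire. Typically, / modern matches / are made of / small wooden sticks or stiff paper.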

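I searched around and there doesn't seem to be a ready-made tool. The closest approach is probably spacy's rule-based matching to pick out noun phrases (fairly simple, and already built in) and verb phrases. textacy has a minimal VP pattern constant ('<AUX>* <ADV>* <VERB>').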
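
A rough sketch of what that rule-based approach could look like with spacy's `Matcher`. Assumptions: the POS tags are assigned by hand so the snippet runs without downloading a model (a real pipeline such as `en_core_web_sm` would supply them), the VP pattern mirrors the textacy one, and the NP pattern is my own simplification rather than what `noun_chunks` actually does:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

nlp = spacy.blank("en")  # vocab only, no trained components

# Hand-tagged example (assumption); a trained tagger would normally set these.
words = ["Modern", "matches", "are", "usually", "made", "of", "wood"]
pos = ["ADJ", "NOUN", "AUX", "ADV", "VERB", "ADP", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# Equivalent of '<AUX>* <ADV>* <VERB>'
matcher.add("VP", [[{"POS": "AUX", "OP": "*"},
                    {"POS": "ADV", "OP": "*"},
                    {"POS": "VERB"}]])
# Simplified noun phrase: optional adjectives followed by nouns
matcher.add("NP", [[{"POS": "ADJ", "OP": "*"},
                    {"POS": "NOUN", "OP": "+"}]])

# Keep the longest non-overlapping matches, then merge each into one token.
spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
with doc.retokenize() as retokenizer:
    for span in spans:
        if len(span) > 1:
            retokenizer.merge(span)

chunks = [token.text for token in doc]
print(" / ".join(chunks))
```

This covers only the two patterns given, so anything else (prepositions, particles, and so on) stays as single tokens.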

In short, we're still a long way from true "phrase segmentation" for English.