A language detector in 45 lines of Python

2018-01-11 16:24:29 +08:00
 amazing994
class NGram(object):
    def __init__(self, text, n=3):
        self.length = None
        self.n = n
        self.table = {}
        self.parse_text(text)
        self.calculate_length()

    def parse_text(self, text):
        chars = ' ' * self.n  # initial sequence of spaces with length n

        # Collapse runs of whitespace and add one trailing space before scanning
        for letter in (" ".join(text.split()) + " "):
            chars = chars[1:] + letter  # slide the window: drop the oldest char, append the new one
            self.table[chars] = self.table.get(chars, 0) + 1  # increment count

    def calculate_length(self):
        """ Treat the N-Gram table as a vector and return its scalar magnitude
        to be used for performing a vector-based search.
        """
        self.length = sum(x * x for x in self.table.values()) ** 0.5
        return self.length

    def __sub__(self, other):
        """ Find the difference between two NGram objects by finding the cosine
        of the angle between the two vector representations of the table of
        N-Grams. Return a float value between 0 and 1 where 0 indicates that
        the two NGrams have the same relative N-Gram frequencies.
        """
        if not isinstance(other, NGram):
            raise TypeError("Can't compare NGram with non-NGram object.")

        if self.n != other.n:
            raise TypeError("Can't compare NGram objects of different size.")

        # Dot product over the N-Grams the two tables have in common
        total = 0
        for k in self.table:
            total += self.table[k] * other.table.get(k, 0)

        return 1.0 - float(total) / (float(self.length) * float(other.length))

    def find_match(self, languages):
        """ Out of a list of NGrams that represent individual languages, return
        the best match.
        """
        # min() needs the distance function passed as key=, not as a positional argument
        return min(languages, key=lambda n: self - n)
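
The post stops at the class itself, so here is a rough, self-contained sketch of how the same idea could be wired into an actual detector. It is not the author's code: `trigram_counts`, `cosine_distance`, `detect`, `SAMPLES` and `PROFILES` are names made up for this illustration, and the three sample sentences are tiny invented training texts. The counting and the distance follow what `parse_text` and `__sub__` do above; returning 1.0 for empty input is an extra assumption.

```python
from collections import Counter
import math


def trigram_counts(text, n=3):
    """Count character n-grams the way parse_text does: collapse whitespace,
    start from a window of n spaces, and append one trailing space."""
    normalized = " ".join(text.split()) + " "
    window = " " * n
    counts = Counter()
    for letter in normalized:
        window = window[1:] + letter  # slide the window by one character
        counts[window] += 1
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two n-gram count tables."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an empty table as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical, very small training samples; a usable detector needs far more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "german": "der schnelle braune fuchs springt und der hund liegt im garten und trinkt wasser",
    "french": "le renard brun saute par dessus le chien et le chien dort dans le jardin",
}
PROFILES = {lang: trigram_counts(text) for lang, text in SAMPLES.items()}


def detect(text):
    """Return the sample language whose trigram profile is closest to text."""
    query = trigram_counts(text)
    return min(PROFILES, key=lambda lang: cosine_distance(query, PROFILES[lang]))


print(detect("the dog jumps over the fox"))
```

With samples this small the result is only reliable for text that shares many trigrams with one profile; longer training texts per language should make the profiles much more stable.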


For more code, contact QQ 1132032275