Tried out the 8-million-word Chinese word vectors Tencent open-sourced last year

I've been working on some word-embedding stuff lately and stumbled on the word vector model Tencent open-sourced last year:
https://mp.weixin.qq.com/s/b9NWR0F7GQLYtgGSL50gQw

The model covers 8 million Chinese words (quite a few of the entries are erroneous words), but overall it's still pretty powerful.

I threw together a simple API, haha: https://zhuanlan.zhihu.com/p/94124468
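For anyone curious what a "similar words" lookup does under the hood, here is a minimal sketch. It is not the code behind the API above. It assumes cosine similarity over word vectors, and the 3-dimensional vectors in `TOY_VECTORS` are made up purely for illustration (the function names are hypothetical too). The return value just mimics the shape of the JSON responses shown in this post.

```python
import json
import math

# Toy 3-dimensional vectors, invented for illustration only.
# They are NOT taken from the Tencent embedding file.
TOY_VECTORS = {
    "红烧肉": [0.9, 0.1, 0.0],
    "糖醋排骨": [0.8, 0.2, 0.1],
    "回锅肉": [0.7, 0.3, 0.0],
    "ojbk": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_similar_words(word, vectors, topn=20):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        [other, cosine(query, vec)]
        for other, vec in vectors.items()
        if other != word  # skip the query word itself
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return {"top_similar_words": scored[:topn], "word": word}


if __name__ == "__main__":
    result = top_similar_words("红烧肉", TOY_VECTORS, topn=2)
    print(json.dumps(result, ensure_ascii=False, indent=3))
```

With the real embedding file, a brute-force loop like this would be slow over 8 million words. Something like gensim's `KeyedVectors.load_word2vec_format` followed by `most_similar` would presumably replace the toy dictionary, assuming the file is in word2vec text format.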

Some fun tests:
1. Words similar to 红烧肉 (red-braised pork)
output:

{
   "top_similar_words":[
      [
         "糖醋排骨",
         0.8907967209815979
      ],
      [
         "红烧排骨",
         0.8726683259010315
      ],
      [
         "回锅肉",
         0.858664333820343
      ],
      [
         "红烧鱼",
         0.8542774319648743
      ],
      [
         "梅菜扣肉",
         0.8500987887382507
      ],
      [
         "糖醋小排",
         0.8475514650344849
      ],
      [
         "小炒肉",
         0.8435966968536377
      ],
      [
         "红烧五花肉",
         0.8424086570739746
      ],
      [
         "红烧肘子",
         0.8400496244430542
      ],
      [
         "糖醋里脊",
         0.8381932377815247
      ],
      [
         "红烧猪蹄",
         0.8374584913253784
      ],
      [
         "青椒炒肉",
         0.8344883918762207
      ],
      [
         "粉蒸肉",
         0.8337559700012207
      ],
      [
         "水煮肉片",
         0.8311598300933838
      ],
      [
         "青椒肉丝",
         0.8294434547424316
      ],
      [
         "鱼香茄子",
         0.8291393518447876
      ],
      [
         "烧茄子",
         0.8272593021392822
      ],
      [
         "梅干菜扣肉",
         0.8267726898193359
      ],
      [
         "土豆炖牛肉",
         0.8263725638389587
      ],
      [
         "红烧茄子",
         0.8244959115982056
      ]
   ],
   "word":"红烧肉"
}

2. Words similar to 因吹斯汀 (a phonetic rendering of "interesting")
output:

{
   "top_similar_words":[
      [
         "一颗赛艇",
         0.7618176937103271
      ],
      [
         "因吹斯听",
         0.7523878812789917
      ],
      [
         "城会玩",
         0.6856077909469604
      ],
      [
         "厉害了 word 哥",
         0.6615914702415466
      ],
      [
         "emmmmm",
         0.6590334177017212
      ],
      [
         "扎心了老铁",
         0.6527535915374756
      ],
      [
         "神吐槽",
         0.6382066011428833
      ],
      [
         "可以说是非常爆笑了",
         0.6365567445755005
      ],
      [
         "不明觉厉",
         0.6362186670303345
      ],
      [
         "段子哥",
         0.6293908357620239
      ],
      [
         "厉害了我的哥",
         0.6265187859535217
      ],
      [
         "脑洞大开",
         0.6255093216896057
      ],
      [
         "hhhhhh",
         0.6220428943634033
      ],
      [
         "233333",
         0.6189173460006714
      ],
      [
         "没想到你是这样的",
         0.6184067726135254
      ],
      [
         "屌炸天",
         0.6119771003723145
      ],
      [
         "interesting",
         0.6102393865585327
      ],
      [
         "emmmmmmm",
         0.6097372770309448
      ],
      [
         "开脑洞",
         0.6095746755599976
      ],
      [
         "猴赛雷",
         0.6095525026321411
      ]
   ],
   "word":"因吹斯汀"
}

3. Words similar to ojbk
output:

{
   "top_similar_words":[
      [
         "我觉得 ok",
         0.6393940448760986
      ],
      [
         "emmmmmmm",
         0.6306545734405518
      ],
      [
         "hhhh",
         0.6229800581932068
      ],
      [
         "hhhhh",
         0.6225401163101196
      ],
      [
         "不存在的",
         0.6077110767364502
      ],
      [
         "溜了溜了",
         0.603063702583313
      ],
      [
         "hhhhhhh",
         0.6008774638175964
      ],
      [
         "emmmm",
         0.6002634167671204
      ],
      [
         "emmm",
         0.5958442687988281
      ],
      [
         "emmmmm",
         0.592516303062439
      ],
      [
         "阿喵",
         0.5918310880661011
      ],
      [
         "哈哈哈",
         0.590988039970398
      ],
      [
         "略略略",
         0.590296745300293
      ],
      [
         "hhhhhh",
         0.5870903730392456
      ],
      [
         "微笑脸",
         0.5860881209373474
      ],
      [
         "tan90°",
         0.5825910568237305
      ],
      [
         "没毛病",
         0.5802331566810608
      ],
      [
         "233333",
         0.5794929265975952
      ],
      [
         "我觉得不行",
         0.5762011408805847
      ],
      [
         "就酱",
         0.5751103162765503
      ]
   ],
   "word":"ojbk"
}

leiuu

2019-11-28 19:00:42 +08:00

@nieyujiang
There's more. Words similar to 烤串 (grilled skewers):
```json
{
"top_similar_words":[
[
"我觉得 ok",
0.6393940448760986
],
[
"emmmmmmm",
0.6306545734405518
],
[
"hhhh",
0.6229800581932068
],
[
"hhhhh",
0.6225401163101196
],
[
"不存在的",
0.6077110767364502
],
[
"溜了溜了",
0.603063702583313
],
[
"hhhhhhh",
0.6008774638175964
],
[
"emmmm",
0.6002634167671204
],
[
"emmm",
0.5958442687988281
],
[
"emmmmm",
0.592516303062439
],
[
"阿喵",
0.5918310880661011
],
[
"哈哈哈",
0.590988039970398
],
[
"略略略",
0.590296745300293
],
[
"hhhhhh",
0.5870903730392456
],
[
"微笑脸",
0.5860881209373474
],
[
"tan90°",
0.5825910568237305
],
[
"没毛病",
0.5802331566810608
],
[
"233333",
0.5794929265975952
],
[
"我觉得不行",
0.5762011408805847
],
[
"就酱",
0.5751103162765503
]
],
"word":"ojbk"
}
```

leiuu

2019-11-28 19:01:40 +08:00

@nieyujiang Pasted the wrong output, let me redo it.
{
"top_similar_words":[
[
"烤串儿",
0.927384614944458
],
[
"羊肉串",
0.894095778465271
],
[
"肉串",
0.8555537462234497
],
[
"烤腰子",
0.8516057729721069
],
[
"撸串",
0.8469321727752686
],
[
"涮串",
0.8465385437011719
],
[
"大肉串",
0.8420960903167725
],
[
"烤肉串",
0.838364839553833
],
[
"牛肉串",
0.8371975421905518
],
[
"烤海鲜",
0.8364357948303223
],
[
"烧烤摊",
0.8351374864578247
],
[
"炸串",
0.8339198231697083
],
[
"烧烤",
0.831093430519104
],
[
"烤羊肉串",
0.8277176022529602
],
[
"各种烤串",
0.8274507522583008
],
[
"烤鱿鱼",
0.8235615491867065
],
[
"烤羊腿",
0.8228681683540344
],
[
"烤猪蹄",
0.8225207328796387
],
[
"烤生蚝",
0.8220213055610657
],
[
"吃串",
0.820912778377533
]
],
"word":"烤串"
}

leiuu

2019-11-29 11:20:12 +08:00

@elfive
Here's what the official description says.
Data collection.
Our training data contains large-scale text collected from news, webpages, and novels. Text data from diverse domains enables the coverage of various types of words and phrases. Moreover, the recently collected webpages and news data enable us to learn the semantic representations of fresh words.

Vocabulary building. To enrich our vocabulary, we involve phrases in Wikipedia and Baidu Baike. We also apply the phrase discovery approach in Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches, which enhances the coverage of emerging phrases.

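Roughly, it says the training text came from news, webpages, and novels, with phrases from Wikipedia and Baidu Baike added to the vocabulary.
Chat data isn't mentioned, but news and web pages carry user comments, so those may be one of the data sources too.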