
Elasticsearch: search ranking problems caused by using different analyzers

  •   seth19960929 · 2020-11-30 22:58:09 +08:00 · 437 views
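    • Many of us who build Chinese-language search have picked up the IK Chinese analysis plugin from GitHub.
    • When creating the mapping, it feels natural to use parameters like these (following the example in the official analyzer documentation):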
    {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                }
            }
    }
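
    • Before indexing anything, the two analyzers can be compared directly through the _analyze API. The following is a minimal sketch, assuming a node at 127.0.0.1:9200 (as in the curl commands in this post) with the IK plugin installed. The helper names build_analyze_request, tokens_from_response and analyze are made up for illustration.

```python
import json
import urllib.request

ES_URL = "http://127.0.0.1:9200"  # assumed: same host and port as the curl commands


def build_analyze_request(analyzer: str, text: str) -> urllib.request.Request:
    """Build a POST to the _analyze API for one analyzer and one text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_from_response(payload: dict) -> list:
    """Pull the token strings out of a parsed _analyze response body."""
    return [t["token"] for t in payload["tokens"]]


def analyze(analyzer: str, text: str) -> list:
    """Send the request; this needs a running node with the IK plugin."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text)) as resp:
        return tokens_from_response(json.load(resp))
```

    • Calling analyze("ik_max_word", "打火车") and analyze("ik_smart", "打火车") against a live node shows how the index side and the search side each split the same text, which is the comparison the rest of this post works through with term vectors.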
    

    • Now let's look at all the data (two documents, 打火车 and 火车):
    curl 127.0.0.1:9200/test/_search | jq
    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 2,
          "relation": "eq"
        },
        "max_score": 1,
        "hits": [
          {
            "_index": "test",
            "_type": "_doc",
            "_id": "Video_1",
            "_score": 1,
            "_source": {
              "id": 1,
              "title": "打火车"
            }
          },
          {
            "_index": "test",
            "_type": "_doc",
            "_id": "Video_2",
            "_score": 1,
            "_source": {
              "id": 2,
              "title": "火车"
            }
          }
        ]
      }
    }
    
    • Now we search for 打火车:
    curl -G '127.0.0.1:9200/test/_search' --data-urlencode 'q=打火车' | jq
    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 2,
          "relation": "eq"
        },
        "max_score": 0.21110919,
        "hits": [
          {
            "_index": "test",
            "_type": "_doc",
            "_id": "Video_2",
            "_score": 0.21110919,
            "_source": {
              "id": 2,
              "title": "火车"
            }
          },
          {
            "_index": "test",
            "_type": "_doc",
            "_id": "Video_1",
            "_score": 0.160443,
            "_source": {
              "id": 1,
              "title": "打火车"
            }
          }
        ]
      }
    }
    
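    • To our surprise, 火车 scores 0.21110919, higher than the 0.160443 of 打火车, even though 打火车 is exactly what we searched for.
    • To see why, look at the terms that were indexed for the 打火车 document (Video_1):
    curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq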
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "Video_1",
      "_version": 1,
      "found": true,
      "took": 0,
      "term_vectors": {
        "title": {
          "field_statistics": {
            "sum_doc_freq": 3,
            "doc_count": 2,
            "sum_ttf": 3
          },
          "terms": {
            "打火": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 0,
                  "start_offset": 0,
                  "end_offset": 2
                }
              ]
            },
            "火车": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 1,
                  "start_offset": 1,
                  "end_offset": 3
                }
              ]
            }
          }
        }
      }
    }
    
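    • Surprisingly, 打火车 was indexed as two terms, 打火 and 火车. So something is clearly off here (although from the search engine's point of view everything is working as designed).
    • On the search side, ik_smart splits the query 打火车 into 打 and 火车 (the same split the ik_smart term vectors further down show). 火车 in the 打火车 document earns a score, but 打 matches nothing, and the extra indexed term 打火 makes the field longer, which lowers its length-normalized score under the default BM25 similarity. As a result, the 火车 document ranks first.
    • So I decided to set both analyzers to the same one: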
    {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "ik_smart",
                    "search_analyzer": "ik_smart"
                }
            }
    }
    
    • Then look at the term vectors again (this time the tokens are what we expected):
    curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "Video_1",
      "_version": 1,
      "found": true,
      "took": 0,
      "term_vectors": {
        "title": {
          "field_statistics": {
            "sum_doc_freq": 3,
            "doc_count": 2,
            "sum_ttf": 3
          },
          "terms": {
            "打": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 0,
                  "start_offset": 0,
                  "end_offset": 1
                }
              ]
            },
            "火车": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 1,
                  "start_offset": 1,
                  "end_offset": 3
                }
              ]
            }
          }
        }
      }
    }
    
    • Searching again, the ranking by score is now what we wanted:
    curl -G '127.0.0.1:9200/test/_search' --data-urlencode 'q=打火车' | jq
    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 2,
          "relation": "eq"
        },
        "max_score": 0.77041256,
        "hits": [
          {
            "_index": "test",
            "_type": "_doc",
            "_id": "Video_1",
            "_score": 0.77041256,
            "_source": {
              "id": 1,
              "title": "打火车"
            }
          },
          {
            "_index": "test",
            "_type": "_doc",
            "_id": "Video_2",
            "_score": 0.21110919,
            "_source": {
              "id": 2,
              "title": "火车"
            }
          }
        ]
      }
    }
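
    • Both rankings can be checked by hand. The sketch below is not Elasticsearch's code. It applies the classic BM25 formula under stated assumptions: the default parameters k1 = 1.2 and b = 0.75, a single shard (as the responses above report), and the token lists taken from the term vectors in this post. The function name bm25_score and the variables mixed and same are made up for illustration.

```python
import math

K1, B = 1.2, 0.75  # assumed: Elasticsearch's default BM25 parameters


def bm25_score(query_terms, doc_terms, all_docs):
    """Sum the BM25 contribution of every query term found in doc_terms."""
    doc_count = len(all_docs)
    avg_len = sum(len(d) for d in all_docs) / doc_count
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # a query term missing from the document adds nothing
        doc_freq = sum(1 for d in all_docs if term in d)
        idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
        # Longer fields get a larger length_norm and therefore a lower score.
        length_norm = 1 - B + B * len(doc_terms) / avg_len
        score += idf * tf * (K1 + 1) / (tf + K1 * length_norm)
    return score


QUERY = ["打", "火车"]  # ik_smart tokens of the query 打火车

# Indexed with ik_max_word: Video_1 holds 打火 + 火车, Video_2 holds 火车.
mixed = {"Video_1": ["打火", "火车"], "Video_2": ["火车"]}
# Indexed with ik_smart: Video_1 holds 打 + 火车, Video_2 holds 火车.
same = {"Video_1": ["打", "火车"], "Video_2": ["火车"]}

for label, index in (("ik_max_word index", mixed), ("ik_smart index", same)):
    docs = list(index.values())
    for doc_id, terms in index.items():
        print(label, doc_id, round(bm25_score(QUERY, terms, docs), 4))
```

    • With the ik_max_word index this gives about 0.1604 for Video_1 and 0.2111 for Video_2. With the ik_smart index it gives about 0.7704 and 0.2111. These agree with the _score values in the responses above to roughly four decimal places. In the first case only 火车 matches, so the two-term field of Video_1 loses to the one-term field of Video_2. In the second case 打 also matches Video_1 and, being the rarer term, contributes most of its score.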
    