Scoring Mechanism in Detail

Scoring mechanism: TF/IDF

Algorithm overview

The relevance score algorithm, put simply, computes how closely a piece of text in the index matches the search text.

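Elasticsearch uses the term frequency/inverse document frequency algorithm, TF/IDF for short: TF is term frequency and IDF is inverse document frequency. Since version 5.0 the default similarity is BM25, a refinement of TF/IDF, which is why the explain output later in this article shows formulas with the k1 and b parameters.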

Term frequency: how many times each term of the search text appears in the field text. The more often it appears, the more relevant the document.


Example: search request: hello world

doc1 : hello you and me,and world is very good.

doc2 : hello,how are you
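To make term frequency concrete, here is a minimal Python sketch that counts how often each query term occurs in the two example documents. The tokenizer (lowercase, keep runs of letters) and the function names are assumptions made for illustration; real Elasticsearch analyzers are more elaborate.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified stand-in for an analyzer: lowercase and keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(query, doc):
    # Count occurrences of each query term in the document text.
    counts = Counter(tokenize(doc))
    return {term: counts[term] for term in tokenize(query)}

doc1 = "hello you and me,and world is very good."
doc2 = "hello,how are you"

tf1 = term_frequency("hello world", doc1)
tf2 = term_frequency("hello world", doc2)
print(tf1)
print(tf2)
```

doc1 contains both query terms while doc2 contains only hello, so on term frequency alone doc1 is the better match.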

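Inverse document frequency: how often each term of the search text appears across all documents in the index. The more common the term, the less a match on it contributes to relevance.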


Example: search request: hello world

doc1 : hello ,today is very good

doc2 : hi world ,how are you

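The index holds 100 million documents in total; 1,000 of them contain hello and 100 contain world.

doc2 is more relevant: world is the rarer term, so a match on it counts for more.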
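The numbers in this example can be plugged into an IDF formula to see the effect. The sketch below assumes the formula that appears in the explain output later in this article, log(1 + (N - n + 0.5) / (n + 0.5)); the function name `idf` is only illustrative.

```python
import math

def idf(total_docs, docs_with_term):
    # Formula as printed in the explain output later in this article:
    # log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

N = 100_000_000              # documents in the index
idf_hello = idf(N, 1000)     # hello appears in 1,000 documents
idf_world = idf(N, 100)      # world appears in 100 documents
print(idf_hello, idf_world)
```

The rarer term, world, gets the larger weight, which is why the document matching world ranks higher.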

Field-length norm: the length of the field. The longer the field, the weaker the relevance of a match in it.

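Example: search request: hello world

doc1 : {"title":"hello article","content":"balabalabal ... (10,000 words)"}

doc2 : {"title":"my article","content":"balabalabal ... (10,000 words), world"}

doc1 is more relevant: its match is in the short title field, whereas the match in doc2 is buried in a 10,000-word content field.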

How _score is calculated

 GET /book/_search?explain=true
 {
   "query": {
     "match": {
       "description": "java程序员"
     }
   }
 }

 {
   "took" : 5,
   "timed_out" : false,
   "_shards" : {
     "total" : 1,
     "successful" : 1,
     "skipped" : 0,
     "failed" : 0
   },
   "hits" : {
     "total" : {
       "value" : 2,
       "relation" : "eq"
     },
     "max_score" : 2.137549,
     "hits" : [
       {
         "_shard" : "[book][0]",
         "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
         "_index" : "book",
         "_type" : "_doc",
         "_id" : "3",
         "_score" : 2.137549,
         "_source" : {
           "name" : "spring开发基础",
           "description" : "spring 在java领域非常流行，java程序员都在用。",
           "studymodel" : "201001",
           "price" : 88.6,
           "timestamp" : "2019-08-24 19:11:35",
           "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
           "tags" : [
             "spring",
             "java"
           ]
         },
         "_explanation" : {
           "value" : 2.137549,
           "description" : "sum of:",
           "details" : [
             {
               "value" : 0.7936629,
               "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
               "details" : [
                 {
                   "value" : 0.7936629,
                   "description" : "score(freq=2.0), product of:",
                   "details" : [
                     {
                       "value" : 2.2,
                       "description" : "boost",
                       "details" : [ ]
                     },
                     {
                       "value" : 0.47000363,
                       "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                       "details" : [
                         {
                           "value" : 2,
                           "description" : "n, number of documents containing term",
                           "details" : [ ]
                         },
                         {
                           "value" : 3,
                           "description" : "N, total number of documents with field",
                           "details" : [ ]
                         }
                       ]
                     },
                     {
                       "value" : 0.7675597,
                       "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                       "details" : [
                         {
                           "value" : 2.0,
                           "description" : "freq, occurrences of term within document",
                           "details" : [ ]
                         },
                         {
                           "value" : 1.2,
                           "description" : "k1, term saturation parameter",
                           "details" : [ ]
                         },
                         {
                           "value" : 0.75,
                           "description" : "b, length normalization parameter",
                           "details" : [ ]
                         },
                         {
                           "value" : 12.0,
                           "description" : "dl, length of field",
                           "details" : [ ]
                         },
                         {
                           "value" : 35.333332,
                           "description" : "avgdl, average length of field",
                           "details" : [ ]
                         }
                       ]
                     }
                   ]
                 }
               ]
             },
             {
               "value" : 1.3438859,
               "description" : "weight(description:程序员 in 0) [PerFieldSimilarity], result of:",
               "details" : [
                 {
                   "value" : 1.3438859,
                   "description" : "score(freq=1.0), product of:",
                   "details" : [
                     {
                       "value" : 2.2,
                       "description" : "boost",
                       "details" : [ ]
                     },
                     {
                       "value" : 0.98082924,
                       "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                       "details" : [
                         {
                           "value" : 1,
                           "description" : "n, number of documents containing term",
                           "details" : [ ]
                         },
                         {
                           "value" : 3,
                           "description" : "N, total number of documents with field",
                           "details" : [ ]
                         }
                       ]
                     },
                     {
                       "value" : 0.6227967,
                       "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                       "details" : [
                         {
                           "value" : 1.0,
                           "description" : "freq, occurrences of term within document",
                           "details" : [ ]
                         },
                         {
                           "value" : 1.2,
                           "description" : "k1, term saturation parameter",
                           "details" : [ ]
                         },
                         {
                           "value" : 0.75,
                           "description" : "b, length normalization parameter",
                           "details" : [ ]
                         },
                         {
                           "value" : 12.0,
                           "description" : "dl, length of field",
                           "details" : [ ]
                         },
                         {
                           "value" : 35.333332,
                           "description" : "avgdl, average length of field",
                           "details" : [ ]
                         }
                       ]
                     }
                   ]
                 }
               ]
             }
           ]
         }
       },
       {
         "_shard" : "[book][0]",
         "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
         "_index" : "book",
         "_type" : "_doc",
         "_id" : "2",
         "_score" : 0.57961315,
         "_source" : {
           "name" : "java编程思想",
           "description" : "java语言是世界第一编程语言，在软件开发领域使用人数最多。",
           "studymodel" : "201001",
           "price" : 68.6,
           "timestamp" : "2019-08-25 19:11:35",
           "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
           "tags" : [
             "java",
             "dev"
           ]
         },
         "_explanation" : {
           "value" : 0.57961315,
           "description" : "sum of:",
           "details" : [
             {
               "value" : 0.57961315,
               "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
               "details" : [
                 {
                   "value" : 0.57961315,
                   "description" : "score(freq=1.0), product of:",
                   "details" : [
                     {
                       "value" : 2.2,
                       "description" : "boost",
                       "details" : [ ]
                     },
                     {
                       "value" : 0.47000363,
                       "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                       "details" : [
                         {
                           "value" : 2,
                           "description" : "n, number of documents containing term",
                           "details" : [ ]
                         },
                         {
                           "value" : 3,
                           "description" : "N, total number of documents with field",
                           "details" : [ ]
                         }
                       ]
                     },
                     {
                       "value" : 0.56055,
                       "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                       "details" : [
                         {
                           "value" : 1.0,
                           "description" : "freq, occurrences of term within document",
                           "details" : [ ]
                         },
                         {
                           "value" : 1.2,
                           "description" : "k1, term saturation parameter",
                           "details" : [ ]
                         },
                         {
                           "value" : 0.75,
                           "description" : "b, length normalization parameter",
                           "details" : [ ]
                         },
                         {
                           "value" : 19.0,
                           "description" : "dl, length of field",
                           "details" : [ ]
                         },
                         {
                           "value" : 35.333332,
                           "description" : "avgdl, average length of field",
                           "details" : [ ]
                         }
                       ]
                     }
                   ]
                 }
               ]
             }
           ]
         }
       }
     ]
   }
 }
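The explain output can be checked by hand. The following sketch recomputes both scores from the values shown above, using the idf and tf formulas exactly as they are printed in the description fields; the function name `bm25_term_score` is an assumption for illustration.

```python
import math

def bm25_term_score(freq, n, N, dl, avgdl, boost=2.2, k1=1.2, b=0.75):
    # idf and tf follow the formulas printed in the explain output.
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

AVGDL = 35.333332

# Document 3: "java" occurs twice and "程序员" once; field length 12.
doc3 = (bm25_term_score(freq=2.0, n=2, N=3, dl=12.0, avgdl=AVGDL)
        + bm25_term_score(freq=1.0, n=1, N=3, dl=12.0, avgdl=AVGDL))

# Document 2: only "java" matches, once; field length 19.
doc2 = bm25_term_score(freq=1.0, n=2, N=3, dl=19.0, avgdl=AVGDL)

print(doc3, doc2)
```

Each matching term contributes boost * idf * tf, and the document score is the sum over terms. Up to floating-point rounding, the results agree with the _score values 2.137549 and 0.57961315 in the response.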

Analyzing how a particular document was matched

 GET /book/_explain/3
 {
   "query": {
     "match": {
       "description": "java程序员"
     }
   }
 }

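Doc values

Searching relies on the inverted index. Sorting relies on a forward index, which lets Elasticsearch look up each field of each document and then sort on it. This forward index is exactly what doc values are.

At index time Elasticsearch builds the inverted index for searching and, alongside it, the forward index (doc values) for sorting, aggregations, filtering, and similar operations.

Doc values are stored on disk. If there is enough memory, the OS automatically caches them in memory and performance stays high; if memory is insufficient, the OS leaves them on disk and reads them from there as needed.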

Inverted index

doc1: hello world you and me

doc2: hi, world, how are you

term	doc1	doc2
hello	*
world	*	*
you	*	*
and	*
me	*
hi		*
how		*
are		*

At search time:

hello you --> hello, you

hello --> doc1

you --> doc1,doc2
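The lookup above can be sketched in a few lines of Python. This is a toy model under stated assumptions (a letters-only tokenizer, sets of document ids as postings, and OR semantics between terms), not how Lucene actually stores its index.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "hello world you and me",
    "doc2": "hi, world, how are you",
}

def tokenize(text):
    # Simplified analyzer: lowercase and keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(documents):
    # Map every term to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

inverted = build_inverted_index(docs)

def search(query):
    # Return every document containing at least one query term.
    hits = set()
    for term in tokenize(query):
        hits |= inverted.get(term, set())
    return sorted(hits)

result = search("hello you")
print(result)
```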

doc1: hello world you and me

doc2: hi, world, how are you

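Sorting (sort by) is a problem here: the inverted index maps terms to documents, so it cannot directly give the field value of each document.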

Forward index

doc1: { "name": "jack", "age": 27 }

doc2: { "name": "tom", "age": 30 }

document	name	age
doc1	jack	27
doc2	tom	30
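A forward index answers the opposite question from an inverted index: given a document, what is its value for a field? The sketch below models doc values as one column per field, which is an assumption chosen for illustration, and uses it to sort and aggregate the two example documents.

```python
# Column-oriented forward index: field -> {doc id -> value}.
doc_values = {
    "name": {"doc1": "jack", "doc2": "tom"},
    "age": {"doc1": 27, "doc2": 30},
}

def sort_hits(hit_ids, field, descending=False):
    # Look up each hit's value directly by doc id, then sort on it.
    column = doc_values[field]
    return sorted(hit_ids, key=lambda doc_id: column[doc_id], reverse=descending)

by_age_desc = sort_hits(["doc1", "doc2"], "age", descending=True)
average_age = sum(doc_values["age"].values()) / len(doc_values["age"])
print(by_age_desc, average_age)
```

Both the sort and the average read a single column by document id, without touching the terms at all.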

query phase

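(1) The search request is sent to a coordinating node, which builds a priority queue whose length is from + size, taken from the paging parameters (10 by default).

(2) The coordinating node forwards the request to all shards; each shard searches locally and builds its own local priority queue.

(3) Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.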
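The three steps above can be simulated with Python's heapq module. The shard contents, scores, and function names below are hypothetical; the point is only that every shard returns its top from + size hits and the coordinating node merges them before cutting out the requested page.

```python
import heapq

# Hypothetical per-shard hits as (score, doc id) pairs.
shards = {
    0: [(2.1, "a"), (0.4, "b"), (1.7, "c")],
    1: [(1.9, "d"), (0.2, "e")],
    2: [(2.5, "f"), (1.1, "g"), (0.9, "h")],
}

def shard_query(hits, from_, size):
    # Each shard keeps only its own top from + size hits.
    return heapq.nlargest(from_ + size, hits)

def coordinate(shards, from_=0, size=10):
    # Merge the per-shard queues into one global ranking,
    # then cut out the requested page.
    merged = []
    for shard_id, hits in shards.items():
        for score, doc_id in shard_query(hits, from_, size):
            merged.append((score, doc_id, shard_id))
    top = heapq.nlargest(from_ + size, merged)
    return top[from_:from_ + size]

page = coordinate(shards, from_=0, size=3)
print(page)
```

Only scores and document ids travel in this phase, which keeps the per-shard responses small.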

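How replica shards increase search throughput

A single request has to hit one copy, replica or primary, of every shard. If each shard has several replicas, concurrent search requests can be served by different copies at the same time.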

fetch phase

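Fetch phase workflow

(1) Once the coordinating node has built the priority queue, it sends mget requests to the shards to retrieve the corresponding documents.

(2) Each shard returns its documents to the coordinating node.

(3) The coordinating node returns the merged documents to the client.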
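The fetch workflow can be sketched as follows. The in-memory `shard_store` and the `ranked` list are hypothetical stand-ins for real shards and for the output of the query phase; the sketch groups ids by shard, retrieves the sources, and reassembles them in ranking order.

```python
from collections import defaultdict

# Hypothetical document store per shard: shard id -> {doc id -> source}.
shard_store = {
    0: {"a": {"title": "doc a"}, "c": {"title": "doc c"}},
    1: {"d": {"title": "doc d"}},
    2: {"f": {"title": "doc f"}},
}

# Result of the query phase: (score, doc id, shard id), best first.
ranked = [(2.5, "f", 2), (2.1, "a", 0), (1.9, "d", 1)]

def fetch(ranked_hits):
    # Group ids by shard so each shard receives one multi-get style request.
    ids_by_shard = defaultdict(list)
    for _, doc_id, shard_id in ranked_hits:
        ids_by_shard[shard_id].append(doc_id)

    fetched = {}
    for shard_id, doc_ids in ids_by_shard.items():
        for doc_id in doc_ids:
            fetched[doc_id] = shard_store[shard_id][doc_id]

    # Reassemble in the global ranking order before replying to the client.
    return [
        {"_id": doc_id, "_score": score, "_source": fetched[doc_id]}
        for score, doc_id, _ in ranked_hits
    ]

response = fetch(ranked)
print([hit["_id"] for hit in response])
```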

A plain search without from and size returns the top 10 hits by default, sorted by _score.

Search parameter summary

preference

Determines which shard copies are used to execute the search.

_primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3

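The bouncing results problem: two documents have the same value in the sort field; different shard copies may order them differently; requests are round-robined across replica shards; so the order of results on the page changes from one request to the next. This is what bouncing results means.

At search time, requests are distributed round-robin across the copies of each shard (replicas and the primary), but the same documents may be ordered differently on different copies.

The fix is to set preference to an arbitrary string, such as the user_id, so that every search by a given user is executed on the same shard copies and the bouncing results disappear.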
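Why a custom preference string stops the bouncing can be illustrated with a toy copy selector. This is a sketch of the idea only, under the assumption that the string is hashed to pick a copy; the copy names and the `pick_copy` function are hypothetical and do not reflect Elasticsearch's internal selection logic.

```python
import hashlib
from itertools import count

copies = ["primary", "replica-1", "replica-2"]  # hypothetical copies of one shard
_round_robin = count()

def pick_copy(preference=None):
    if preference is None:
        # No preference: rotate through the copies on every request.
        return copies[next(_round_robin) % len(copies)]
    # Custom string: a stable hash always maps to the same copy.
    digest = hashlib.sha256(preference.encode("utf-8")).digest()
    return copies[int.from_bytes(digest[:4], "big") % len(copies)]

rotating = [pick_copy() for _ in range(3)]
sticky = {pick_copy("user_42") for _ in range(3)}
print(rotating, sticky)
```

Without a preference, three consecutive requests land on three different copies; with the same string, they all land on one copy, so ties are always broken the same way.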

timeout

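The principle was covered earlier: the search is bounded to a given time, and whatever partial results have been collected by then are returned directly, so that a query does not run too long.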

routing

Document routing. By default the _id is used as the routing value; with routing=user_id, all data belonging to the same user lands on the same shard.
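Routing comes down to hashing the routing value and taking the result modulo the number of primary shards. The sketch below is a simplified model: `zlib.crc32` stands in for the hash function Elasticsearch really uses, and the shard count of 3 is a hypothetical setting.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # hypothetical index setting

def route(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    # crc32 stands in for Elasticsearch's real hash function here.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards

# Without a custom routing value the document _id is used.
shard_for_doc = route("3")

# With routing=user_id, every document of one user shares a shard,
# regardless of the individual document ids.
shards_for_user = {route("user_42") for _doc in ("order-1", "order-2", "order-3")}
print(shard_for_doc, shards_for_user)
```

Because the shard depends only on the routing value, a search that passes the same routing value needs to visit just that one shard.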

search_type

default: query_then_fetch

dfs_query_then_fetch can improve the accuracy of relevance sorting.
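The accuracy gain comes from term statistics. With query_then_fetch each shard scores with its own local document counts, whereas dfs_query_then_fetch first gathers the counts from all shards. The sketch below uses hypothetical per-shard counts and the idf formula from the explain output earlier in this article to show how far local values can drift from the global one.

```python
import math

def idf(N, n):
    # Same idf formula as in the explain output earlier in the article.
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Hypothetical statistics for one term:
# shard name -> (documents on the shard, documents containing the term).
shard_stats = {"shard0": (100, 50), "shard1": (100, 2)}

# query_then_fetch: each shard scores with its own local statistics.
local_idf = {name: idf(N, n) for name, (N, n) in shard_stats.items()}

# dfs_query_then_fetch: statistics are summed across shards first.
total_N = sum(N for N, _ in shard_stats.values())
total_n = sum(n for _, n in shard_stats.values())
global_idf = idf(total_N, total_n)

print(local_idf, global_idf)
```

With skewed data, the same term is weighted very differently on the two shards, so scores from different shards are not directly comparable; the global statistics remove that skew at the cost of an extra round trip.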
