
Chinese Search

List the indices

curl 'http://localhost:9200/_cat/indices?v'

The output looks like this:

health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size dataset.size
yellow open   book_idx c1a4fmDyRkyLQoKMcDlvLw   1   1         10            0     21.9kb         21.9kb       21.9kb

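The existing default index book_idx handles English well, but its results for Chinese text are poor. Let's try the smartcn analyzer installed in the previous section.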

Create an index that uses smartcn

curl -X PUT "http://localhost:9200/book_idx_cn?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "smartcn"
        }
      }
    }
  }
}'

Result of the creation request:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "book_idx_cn"
}
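The same request can be issued from Python if you prefer to script the setup. The following is a minimal sketch using only the standard library; `build_create_index_request` is a hypothetical helper name, not part of the tutorial's code, and the function only builds the request so that you decide when to send it.

```python
import json
import urllib.request


def build_create_index_request(base_url: str, index: str) -> urllib.request.Request:
    """Build the PUT request that creates an index whose default analyzer is smartcn."""
    body = {
        "settings": {
            "analysis": {
                "analyzer": {
                    # "default" applies to every text field without its own analyzer
                    "default": {"type": "smartcn"}
                }
            }
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

To send it against a running node, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_create_index_request("http://localhost:9200", "book_idx_cn"))`.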

Update the code

Modify books/infrastructure/config/config.py:

@@ -5,6 +5,7 @@ import yaml
 @dataclass
 class SearchConfig:
     address: str
+    index: str
 
 
 @dataclass

Modify config.yml:

@@ -2,3 +2,4 @@ app:
   page_size: 10
 search:
   address: "http://localhost:9200"
+  index: "book_idx_cn"
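To see how the new `index` value travels from config.yml into `SearchConfig`, here is a minimal sketch. It assumes the YAML file has already been parsed into a plain dict (for example with `yaml.safe_load`); `AppConfig` and `parse_config` are hypothetical names used for illustration, since the rest of config.py is not shown in this section.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SearchConfig:
    address: str
    index: str  # new field: the index the application reads from and writes to


@dataclass
class AppConfig:  # hypothetical name for the "app" section
    page_size: int


@dataclass
class Config:
    app: AppConfig
    search: SearchConfig


def parse_config(data: Dict[str, Any]) -> Config:
    """Build a Config from an already-parsed config.yml mapping."""
    return Config(
        app=AppConfig(**data["app"]),
        search=SearchConfig(**data["search"]),
    )
```

Because the dataclass has no default for `index`, a config.yml that lacks the new key fails at startup with a `TypeError` instead of silently using the old index.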

Modify books/infrastructure/search/es.py:

@@ -7,20 +7,18 @@ from ...domain.gateway import BookManager
 from ...domain.model import Book
 
 
-INDEX_BOOK = "book_idx"
-
-
 class ElasticSearchEngine(BookManager):
-    def __init__(self, address: str, page_size: int):
+    def __init__(self, address: str, index: str, page_size: int):
         self.page_size = page_size
+        self.index = index
         self.client = Elasticsearch(address)
 
     def index_book(self, b: Book) -> str:
-        result = self.client.index(index=INDEX_BOOK, document=asdict(b))
+        result = self.client.index(index=self.index, document=asdict(b))
         return result['_id']
 
     def search_books(self, query: str) -> List[Book]:
-        response = self.client.search(index=INDEX_BOOK, query={
+        response = self.client.search(index=self.index, query={
             "multi_match": {
                 "query": query,
                 "fields": ["title", "author", "content"]

Modify books/application/wire_helper.py:

@@ -9,7 +9,8 @@ class WireHelper:
 
     @classmethod
     def new(cls, c: Config):
-        es = ElasticSearchEngine(c.search.address, c.app.page_size)
+        es = ElasticSearchEngine(
+            c.search.address, c.search.index, c.app.page_size)
         return cls(es)
 
     def book_manager(self) -> BookManager:

After making these changes, restart the server.

Add Chinese test data

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"title":"哈利·波特与魔法石","author":"J.K. 罗琳","published_at":"1997-06-26","content":"一个年轻男孩发现自己是一名巫师,并开始在霍格沃茨魔法学校接受教育,在那里他揭开了“魔法石”的秘密。"}' \
  http://localhost:8000/books

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"title":"杀死一只知更鸟","author":"哈珀·李","published_at":"1960-07-11","content":"小说设置在大萧条期间的美国南部,通过年轻的斯考特·芬奇的眼睛探索了种族不义和道德成长的主题。"}' \
  http://localhost:8000/books

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"title":"魔戒","author":"J.R.R. 托尔金","published_at":"1954-07-29","content":"一个名叫弗罗多·巴金斯的霍比特人踏上了一场艰险的旅程,摧毁一枚强大的戒指,并拯救中土世界,免受黑暗领主索伦的侵害。"}' \
  http://localhost:8000/books

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"title":"麦田里的守望者","author":"J.D. 赛林格","published_at":"1951-07-16","content":"霍尔顿·考菲尔德在被驱逐出预备学校后,讲述了他在纽约市的经历,探讨了疏远、身份和纯真等主题。"}' \
  http://localhost:8000/books

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"title":"牧羊少年奇幻之旅","author":"保罗·柯艾略","published_at":"1988-01-01","content":"牧羊人圣地亚哥从西班牙前往埃及,寻找埋在金字塔附近的宝藏。沿途,他学到了追随梦想的重要性。"}' \
  http://localhost:8000/books

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"title":"饥饿游戏","author":"苏珊·柯林斯","published_at":"2008-09-14","content":"在一个反乌托邦的未来,青少年被迫参加名为饥饿游戏的电视死斗。凯特尼斯·艾弗丁自愿代替妹妹参加,并成为叛乱的象征。"}' \
  http://localhost:8000/books

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"title":"1984","author":"乔治·奥威尔","published_at":"1949-06-08","content":"温斯顿·史密斯生活在一个由党领导的极权社会中。他反抗压迫性的政权,但最终屈服于其控制。"}' \
  http://localhost:8000/books

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"title":"龙纹身的女孩","author":"斯蒂格·拉尔森","published_at":"2005-08-01","content":"记者米卡埃尔·布隆奎斯特和黑客丽斯贝特·萨兰德调查了一个富裕家庭的年轻女子失踪案,揭示了黑暗的秘密和腐败。"}' \
  http://localhost:8000/books

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"title":"消失的女孩","author":"吉莲·弗林","published_at":"2012-06-05","content":"在他们的第五个结婚纪念日上,尼克·邓恩的妻子艾米失踪了。随着媒体的狂热报道和怀疑的增加,尼克发现自己陷入了一场欺骗和背叛的旋涡中。"}' \
  http://localhost:8000/books

Verify the results

Search for “记者牧羊人” to verify.

Tokenization of the query

curl -X POST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{
  "analyzer": "smartcn",
  "text": "记者牧羊人"
}'
{"tokens":[{"token":"记者","start_offset":0,"end_offset":2,"type":"word","position":0},{"token":"牧羊人","start_offset":2,"end_offset":5,"type":"word","position":1}]}
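When comparing several queries, it can be easier to reduce the `_analyze` response to a plain list of tokens. A small sketch follows; `tokens_from_analyze` is a hypothetical helper, and the sample string is a shortened form of the response shown above.

```python
import json
from typing import List


def tokens_from_analyze(response_text: str) -> List[str]:
    """Extract the token strings, in order, from an _analyze JSON response."""
    return [entry["token"] for entry in json.loads(response_text)["tokens"]]


# Shortened copy of the response above (offsets and positions omitted).
sample = '{"tokens":[{"token":"记者","type":"word"},{"token":"牧羊人","type":"word"}]}'
print(tokens_from_analyze(sample))  # → ['记者', '牧羊人']
```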

Search results for the query

curl 'http://localhost:8000/books?q=%E8%AE%B0%E8%80%85%E7%89%A7%E7%BE%8A%E4%BA%BA'

%E8%AE%B0%E8%80%85%E7%89%A7%E7%BE%8A%E4%BA%BA is the URL-encoded form of “记者牧羊人”.
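If you would rather not encode the query by hand, Python's standard library can build the URL. This is a minimal sketch; `search_url` is a hypothetical helper name, and the base URL is the one used in the curl command above.

```python
from urllib.parse import urlencode


def search_url(base_url: str, query: str) -> str:
    """Return the search URL with the query percent-encoded as UTF-8."""
    return f"{base_url}/books?{urlencode({'q': query})}"


print(search_url("http://localhost:8000", "记者牧羊人"))
# → http://localhost:8000/books?q=%E8%AE%B0%E8%80%85%E7%89%A7%E7%BE%8A%E4%BA%BA
```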

[
	{
		"title": "牧羊少年奇幻之旅",
		"author": "保罗·柯艾略",
		"published_at": "1988-01-01",
		"content": "牧羊人圣地亚哥从西班牙前往埃及,寻找埋在金字塔附近的宝藏。沿途,他学到了追随梦想的重要性。"
	},
	{
		"title": "龙纹身的女孩",
		"author": "斯蒂格·拉尔森",
		"published_at": "2005-08-01",
		"content": "记者米卡埃尔·布隆奎斯特和黑客丽斯贝特·萨兰德调查了一个富裕家庭的年轻女子失踪案,揭示了黑暗的秘密和腐败。"
	}
]
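The books endpoint returns only the documents, not their scores. One way to see why each document matched is to ask Elasticsearch for an explanation alongside each hit. The sketch below assumes the official Python client (an `Elasticsearch` instance, like `self.client` in es.py) is passed in as `client`; `explain_hits` is a hypothetical helper and not part of the tutorial's code.

```python
from typing import Any, Dict, List


def explain_hits(client: Any, index: str, text: str) -> List[Dict[str, Any]]:
    """Run the tutorial's multi_match query and return one explanation per hit."""
    response = client.search(
        index=index,
        query={
            "multi_match": {
                "query": text,
                "fields": ["title", "author", "content"],
            }
        },
        explain=True,  # ask Elasticsearch to attach _explanation to every hit
    )
    return [hit["_explanation"] for hit in response["hits"]["hits"]]
```

Each returned item is a nested dict with `value`, `description`, and `details` keys.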

The matching scores of the two results, with the explanation for each:

{'value': 2.2704897, 'description': 'max of:', 'details': [{'value': 2.2704897, 'description': 'sum of:', 'details': [{'value': 2.2704897, 'description': 'weight(content:牧羊人 in 4) [PerFieldSimilarity], result of:', 'details': [{'value': 2.2704897, 'description': 'score(freq=1.0), computed as boost * idf * tf from:', 'details': [{'value': 2.2, 'description': 'boost', 'details': []}, {'value': 1.9924302, 'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:', 'details': [{'value': 1, 'description': 'n, number of documents containing term', 'details': []}, {'value': 10, 'description': 'N, total number of documents with field', 'details': []}]}, {'value': 0.5179809, 'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:', 'details': [{'value': 1.0, 'description': 'freq, occurrences of term within document', 'details': []}, {'value': 1.2, 'description': 'k1, term saturation parameter', 'details': []}, {'value': 0.75, 'description': 'b, length normalization parameter', 'details': []}, {'value': 22.0, 'description': 'dl, length of field', 'details': []}, {'value': 31.4, 'description': 'avgdl, average length of field', 'details': []}]}]}]}]}]}
{'value': 1.8797743, 'description': 'max of:', 'details': [{'value': 1.8797743, 'description': 'sum of:', 'details': [{'value': 1.8797743, 'description': 'weight(content:记者 in 7) [PerFieldSimilarity], result of:', 'details': [{'value': 1.8797743, 'description': 'score(freq=1.0), computed as boost * idf * tf from:', 'details': [{'value': 2.2, 'description': 'boost', 'details': []}, {'value': 1.9924302, 'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:', 'details': [{'value': 1, 'description': 'n, number of documents containing term', 'details': []}, {'value': 10, 'description': 'N, total number of documents with field', 'details': []}]}, {'value': 0.42884457, 'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:', 'details': [{'value': 1.0, 'description': 'freq, occurrences of term within document', 'details': []}, {'value': 1.2, 'description': 'k1, term saturation parameter', 'details': []}, {'value': 0.75, 'description': 'b, length normalization parameter', 'details': []}, {'value': 36.0, 'description': 'dl, length of field', 'details': []}, {'value': 31.4, 'description': 'avgdl, average length of field', 'details': []}]}]}]}]}]}
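The two scores above can be reproduced by hand from the formulas printed in the explanation. The sketch below simply plugs in the reported values; `bm25_term_score` is a hypothetical helper name, and the defaults for `boost`, `k1`, and `b` are copied from the output rather than from any configuration.

```python
import math


def bm25_term_score(freq: float, n: int, N: int, dl: float, avgdl: float,
                    boost: float = 2.2, k1: float = 1.2, b: float = 0.75) -> float:
    """Score one term in one field as boost * idf * tf, per the explanation output."""
    # idf: the fewer documents contain the term, the higher the value
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    # tf: term frequency, dampened by k1 and normalized by field length
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf


# "牧羊人" in a content field of length 22, and "记者" in one of length 36
print(round(bm25_term_score(1.0, 1, 10, 22.0, 31.4), 4))  # → 2.2705
print(round(bm25_term_score(1.0, 1, 10, 36.0, 31.4), 4))  # → 1.8798
```

Both terms occur in exactly one of the ten documents, so their idf is identical. The first result ranks higher only because its content field is shorter than average, which raises tf.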

Next, search for “麦田龙女孩” to verify.

Tokenization of the query

curl -X POST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{
  "analyzer": "smartcn",
  "text": "麦田龙女孩"
}'
{"tokens":[{"token":"麦田","start_offset":0,"end_offset":2,"type":"word","position":0},{"token":"龙","start_offset":2,"end_offset":3,"type":"word","position":1},{"token":"女孩","start_offset":3,"end_offset":5,"type":"word","position":2}]}

Search results for the query

curl 'http://localhost:8000/books?q=%E9%BA%A6%E7%94%B0%E9%BE%99%E5%A5%B3%E5%AD%A9'
[
	{
		"title": "龙纹身的女孩",
		"author": "斯蒂格·拉尔森",
		"published_at": "2005-08-01",
		"content": "记者米卡埃尔·布隆奎斯特和黑客丽斯贝特·萨兰德调查了一个富裕家庭的年轻女子失踪案,揭示了黑暗的秘密和腐败。"
	},
	{
		"title": "麦田里的守望者",
		"author": "J.D. 赛林格",
		"published_at": "1951-07-16",
		"content": "霍尔顿·考菲尔德在被驱逐出预备学校后,讲述了他在纽约市的经历,探讨了疏远、身份和纯真等主题。"
	},
	{
		"title": "消失的女孩",
		"author": "吉莲·弗林",
		"published_at": "2012-06-05",
		"content": "在他们的第五个结婚纪念日上,尼克·邓恩的妻子艾米失踪了。随着媒体的狂热报道和怀疑的增加,尼克发现自己陷入了一场欺骗和背叛的旋涡中。"
	}
]

The matching scores of the three results, with the explanation for each:

{'value': 3.1877713, 'description': 'max of:', 'details': [{'value': 3.1877713, 'description': 'sum of:', 'details': [{'value': 1.8282523, 'description': 'weight(title:龙 in 7) [PerFieldSimilarity], result of:', 'details': [{'value': 1.8282523, 'description': 'score(freq=1.0), computed as boost * idf * tf from:', 'details': [{'value': 2.2, 'description': 'boost', 'details': []}, {'value': 1.9924302, 'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:', 'details': [{'value': 1, 'description': 'n, number of documents containing term', 'details': []}, {'value': 10, 'description': 'N, total number of documents with field', 'details': []}]}, {'value': 0.41709054, 'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:', 'details': [{'value': 1.0, 'description': 'freq, occurrences of term within document', 'details': []}, {'value': 1.2, 'description': 'k1, term saturation parameter', 'details': []}, {'value': 0.75, 'description': 'b, length normalization parameter', 'details': []}, {'value': 5.0, 'description': 'dl, length of field', 'details': []}, {'value': 4.1, 'description': 'avgdl, average length of field', 'details': []}]}]}]}, {'value': 1.3595191, 'description': 'weight(title:女孩 in 7) [PerFieldSimilarity], result of:', 'details': [{'value': 1.3595191, 'description': 'score(freq=1.0), computed as boost * idf * tf from:', 'details': [{'value': 2.2, 'description': 'boost', 'details': []}, {'value': 1.4816046, 'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:', 'details': [{'value': 2, 'description': 'n, number of documents containing term', 'details': []}, {'value': 10, 'description': 'N, total number of documents with field', 'details': []}]}, {'value': 0.41709054, 'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:', 'details': [{'value': 1.0, 'description': 'freq, occurrences of term within document', 'details': []}, {'value': 1.2, 'description': 'k1, term saturation parameter', 'details': []}, {'value': 0.75, 'description': 'b, length normalization parameter', 'details': []}, {'value': 5.0, 'description': 'dl, length of field', 'details': []}, {'value': 4.1, 'description': 'avgdl, average length of field', 'details': []}]}]}]}]}]}
{'value': 1.8282523, 'description': 'max of:', 'details': [{'value': 1.8282523, 'description': 'sum of:', 'details': [{'value': 1.8282523, 'description': 'weight(title:麦田 in 3) [PerFieldSimilarity], result of:', 'details': [{'value': 1.8282523, 'description': 'score(freq=1.0), computed as boost * idf * tf from:', 'details': [{'value': 2.2, 'description': 'boost', 'details': []}, {'value': 1.9924302, 'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:', 'details': [{'value': 1, 'description': 'n, number of documents containing term', 'details': []}, {'value': 10, 'description': 'N, total number of documents with field', 'details': []}]}, {'value': 0.41709054, 'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:', 'details': [{'value': 1.0, 'description': 'freq, occurrences of term within document', 'details': []}, {'value': 1.2, 'description': 'k1, term saturation parameter', 'details': []}, {'value': 0.75, 'description': 'b, length normalization parameter', 'details': []}, {'value': 5.0, 'description': 'dl, length of field', 'details': []}, {'value': 4.1, 'description': 'avgdl, average length of field', 'details': []}]}]}]}]}]}
{'value': 1.6642681, 'description': 'max of:', 'details': [{'value': 1.6642681, 'description': 'sum of:', 'details': [{'value': 1.6642681, 'description': 'weight(title:女孩 in 8) [PerFieldSimilarity], result of:', 'details': [{'value': 1.6642681, 'description': 'score(freq=1.0), computed as boost * idf * tf from:', 'details': [{'value': 2.2, 'description': 'boost', 'details': []}, {'value': 1.4816046, 'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:', 'details': [{'value': 2, 'description': 'n, number of documents containing term', 'details': []}, {'value': 10, 'description': 'N, total number of documents with field', 'details': []}]}, {'value': 0.5105853, 'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:', 'details': [{'value': 1.0, 'description': 'freq, occurrences of term within document', 'details': []}, {'value': 1.2, 'description': 'k1, term saturation parameter', 'details': []}, {'value': 0.75, 'description': 'b, length normalization parameter', 'details': []}, {'value': 3.0, 'description': 'dl, length of field', 'details': []}, {'value': 4.1, 'description': 'avgdl, average length of field', 'details': []}]}]}]}]}]}
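Each explanation above starts with "max of:" wrapping a "sum of:". That is consistent with the default `best_fields` behaviour of `multi_match`: term scores are summed within a field, and the best field's total becomes the document score. The sketch below illustrates that combination step with the per-term scores from the first result; `best_fields_score` is a hypothetical helper, and it assumes no tie-breaker is configured.

```python
from typing import Dict, List


def best_fields_score(per_field_term_scores: Dict[str, List[float]]) -> float:
    """Sum the term scores inside each field, then keep the highest field total."""
    return max(
        (sum(scores) for scores in per_field_term_scores.values()),
        default=0.0,  # no fields at all means no match
    )


# "龙纹身的女孩": both "龙" and "女孩" match in the title, nothing matches elsewhere
first_hit = {"title": [1.8282523, 1.3595191], "author": [], "content": []}
print(round(best_fields_score(first_hit), 4))  # → 3.1878
```

This is why "龙纹身的女孩" ranks first: it is the only title that contains two of the three query tokens, while the other two results each match a single token.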