elasticsearch pinyin analyzer and autocomplete


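Contents

- elasticsearch pinyin analyzer and autocomplete
  - 2. Autocomplete
    - 2.1. Pinyin analyzer
    - 2.2. Custom analyzers
    - 2.3. Autocomplete queries
    - 2.4. Implementing autocomplete for the hotel search box
      - 2.4.1. Modifying the hotel mapping
      - 2.4.2. Modifying the HotelDoc entity
      - 2.4.3. Re-importing the data
      - 2.4.4. Java API for autocomplete queries
      - 2.4.5. Implementing search box autocomplete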

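2. Autocomplete

When a user types characters into the search box, we should suggest search terms related to that input.

This feature, which proposes complete terms based on the letters the user has typed, is autocomplete.

Because the suggestions have to be inferred from pinyin letters, we need pinyin analysis.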

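2.1. Pinyin analyzer

To complete terms from letters, documents must be analyzed by pinyin. There is a pinyin analysis plugin for elasticsearch on GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin

The course materials also include the installation package for the pinyin analyzer.

Installation works the same way as for the IK analyzer, in four steps.

① Unzip the package.

② Upload it to the plugins directory of elasticsearch on the virtual machine.

③ Restart elasticsearch.

④ Test.

For the detailed steps, refer to the installation of the IK analyzer.

Test it as follows.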

POST /_analyze
{
  "text": ["如家酒店还不错"],
  "analyzer": "pinyin"
}

Result:

{
  "tokens": [
    {"token": "ru", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0},
    {"token": "rjjdhbc", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0},
    {"token": "jia", "start_offset": 0, "end_offset": 0, "type": "word", "position": 1},
    {"token": "jiu", "start_offset": 0, "end_offset": 0, "type": "word", "position": 2},
    {"token": "dian", "start_offset": 0, "end_offset": 0, "type": "word", "position": 3},
    {"token": "hai", "start_offset": 0, "end_offset": 0, "type": "word", "position": 4},
    {"token": "bu", "start_offset": 0, "end_offset": 0, "type": "word", "position": 5},
    {"token": "cuo", "start_offset": 0, "end_offset": 0, "type": "word", "position": 6}
  ]
}

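2.2. Custom analyzers

By default, the pinyin analyzer converts each Chinese character to pinyin on its own, whereas we want one group of pinyin per term. The pinyin analyzer therefore has to be customized into a custom analyzer.

An analyzer in elasticsearch consists of three parts.

character filters: process the text before the tokenizer, for example by deleting or replacing characters.
tokenizer: splits the text into terms according to some rule. Examples are keyword, which does not split at all, and ik_smart.
tokenizer filter (called token filter in the elasticsearch documentation): further processes the terms emitted by the tokenizer, for example case conversion, synonym handling, or pinyin conversion.

When a document is analyzed, these three parts process it in that order.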
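The three stages can be pictured as a chain of functions. The following is a minimal Java sketch of that idea, not elasticsearch code: the class name AnalyzerPipelineSketch and the concrete stages (replace "&", split on whitespace, lowercase) are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical illustration of the analyzer pipeline; not the elasticsearch API.
final class AnalyzerPipelineSketch {

    // Runs the three stages in the order described above.
    static List<String> analyze(String text,
                                UnaryOperator<String> charFilter,
                                Function<String, List<String>> tokenizer,
                                UnaryOperator<String> tokenFilter) {
        String filtered = charFilter.apply(text);        // 1. character filter
        List<String> terms = tokenizer.apply(filtered);  // 2. tokenizer
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            result.add(tokenFilter.apply(term));         // 3. token filter
        }
        return result;
    }

    // Assumed sample configuration: replace '&' with a space,
    // split on whitespace, lowercase every term.
    static List<String> analyzeSample(String text) {
        return analyze(
                text,
                s -> s.replace("&", " "),
                s -> Arrays.asList(s.trim().split("\\s+")),
                s -> s.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(analyzeSample("Sony & Nintendo Switch"));
    }
}
```

In this simplified model a token filter maps one term to one term. A real filter such as pinyin can emit several tokens for a single input term, which is what the custom analyzer in this section relies on.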

The syntax for declaring a custom analyzer is as follows.

A custom analyzer is configured through settings when the index is created.

PUT /test
{
  "settings": {
    "analysis": {
      // Custom analyzers.
      "analyzer": {
        // Analyzer name.
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "pinyin"
        }
      }
    }
  }
}

The pinyin filter can also be declared under its own name (py below) with explicit options, and the analyzer can be bound to a field in the mappings.

PUT /test
{
  "settings": {
    "analysis": {
      // Custom analyzers.
      "analyzer": {
        // Analyzer name.
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      // Custom token filters.
      "filter": {
        // Filter name.
        "py": {
          // Filter type, pinyin here.
          "type": "pinyin",
          "limit_first_letter_length": 16,
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "none_chinese_pinyin_tokenize": false,
          "keep_original": true,
          "remove_duplicated_term": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_analyzer"
      }
    }
  }
}

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "test"
}

Test.

POST /test/_analyze
{
  "text": ["如家酒店还不错"],
  "analyzer": "my_analyzer"
}

{
  "tokens": [
    {"token": "如家", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
    {"token": "rujia", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
    {"token": "rj", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
    {"token": "酒店", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1},
    {"token": "jiudian", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1},
    {"token": "jd", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1},
    {"token": "还不", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 2},
    {"token": "haibu", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 2},
    {"token": "hb", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 2},
    {"token": "不错", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 3},
    {"token": "bucuo", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 3},
    {"token": "bc", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 3}
  ]
}
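Each word in the response appears three times: as the original word, as its joined full pinyin, and as its first letters. The Java sketch below imitates that expansion for a single word. It is only a model of the observed output, not the plugin's implementation: the class PinyinFilterSketch and its four-character dictionary (the characters of 如家 and 酒店, written as Unicode escapes) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the "py" filter output; not the plugin's implementation.
final class PinyinFilterSketch {

    // Assumed mini dictionary covering only the words "rujia" and "jiudian".
    static final Map<Character, String> PINYIN = Map.of(
            '\u5982', "ru",
            '\u5bb6', "jia",
            '\u9152', "jiu",
            '\u5e97', "dian");

    // Emits the original word, its joined full pinyin and its first letters,
    // dropping duplicates while keeping that order.
    static List<String> expand(String word) {
        StringBuilder joined = new StringBuilder();
        StringBuilder initials = new StringBuilder();
        for (char c : word.toCharArray()) {
            // Characters missing from the dictionary pass through unchanged.
            String syllable = PINYIN.getOrDefault(c, String.valueOf(c));
            joined.append(syllable);
            initials.append(syllable.charAt(0));
        }
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(word);                 // corresponds to keep_original
        tokens.add(joined.toString());    // corresponds to keep_joined_full_pinyin
        tokens.add(initials.toString());  // first-letter abbreviation
        return new ArrayList<>(tokens);   // duplicates removed, as with remove_duplicated_term
    }

    public static void main(String[] args) {
        System.out.println(expand("\u5982\u5bb6")); // the word "rujia"
    }
}
```

Because the tokenizer (ik_max_word) has already grouped characters into words, the pinyin is produced per word rather than per character, which is the behaviour this section set out to achieve.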

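Summary

How do you use the pinyin analyzer?

① Download the pinyin analyzer.
② Unzip it into the plugins directory of elasticsearch.
③ Restart elasticsearch.

How do you define a custom analyzer?

When creating the index, configure it in settings. It can contain three parts.
① character filter
② tokenizer
③ filter

What should you watch out for with the pinyin analyzer?

To avoid matching homophones, do not use the pinyin analyzer at search time.

When the inverted index is built, the field should use the my_analyzer analyzer.
At search time, the field should use the ik_smart analyzer.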

PUT /test
{
  "settings": {
    "analysis": {
      // Custom analyzers.
      "analyzer": {
        // Analyzer name.
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      // Custom token filters.
      "filter": {
        // Filter name.
        "py": {
          // Filter type, pinyin here.
          "type": "pinyin",
          "limit_first_letter_length": 16,
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "none_chinese_pinyin_tokenize": false,
          "keep_original": true,
          "remove_duplicated_term": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
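The homophone problem is easier to see with a toy inverted index. In the Java sketch below, two documents hold different words that share the pinyin "shizi" (狮子, lion, and 虱子, louse, written as Unicode escapes). The class SearchAnalyzerSketch, its methods and the sample data are hypothetical, and the query terms are passed in by hand instead of being produced by a real analyzer.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy inverted index; not how elasticsearch stores terms.
final class SearchAnalyzerSketch {

    // term -> ids of the documents containing it
    private final Map<String, Set<Integer>> invertedIndex = new HashMap<>();

    // Index time: store the word plus its pinyin forms,
    // which is roughly what my_analyzer emits for one word.
    void index(int docId, String word, String fullPinyin, String initials) {
        for (String term : List.of(word, fullPinyin, initials)) {
            invertedIndex.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Search time: look up every query term and union the matches.
    Set<Integer> search(List<String> queryTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            hits.addAll(invertedIndex.getOrDefault(term, Set.of()));
        }
        return hits;
    }

    // Assumed sample data: two different words with identical pinyin.
    static SearchAnalyzerSketch sample() {
        SearchAnalyzerSketch idx = new SearchAnalyzerSketch();
        idx.index(1, "\u72ee\u5b50", "shizi", "sz"); // "lion"
        idx.index(2, "\u8671\u5b50", "shizi", "sz"); // "louse"
        return idx;
    }

    public static void main(String[] args) {
        SearchAnalyzerSketch idx = sample();
        String lion = "\u72ee\u5b50";
        // Query expanded to pinyin: the pinyin terms also hit document 2.
        System.out.println(idx.search(List.of(lion, "shizi", "sz")));
        // Query left as the plain word: only document 1 matches.
        System.out.println(idx.search(List.of(lion)));
    }
}
```

A user who types pinyin letters still finds both documents, because the pinyin terms are in the index. Only a query typed in Chinese characters is kept from spilling over into homophones, which is the effect of setting search_analyzer to ik_smart.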

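2.3. Autocomplete queries

elasticsearch provides the Completion Suggester query for autocomplete. It matches terms that start with the user's input and returns them. To keep these queries efficient, there are some constraints on the type of the field involved.

Fields that take part in a completion query must be of type completion.
The field content is usually an array of the terms to be completed.

For example, take an index like this.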

// Create the index.
PUT test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

Then insert the following data.

// Sample data.
POST test/_doc
{
  "title": ["Sony", "WH-1000XM5"]
}
POST test/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test/_doc
{
  "title": ["Nintendo", "switch"]
}

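The DSL for the query is as follows.

// Autocomplete query.
GET /test/_search
{
  "suggest": {
    "titleSuggest": {
      // The text to complete.
      "text": "s",
      "completion": {
        // The field to run the completion query on.
        "field": "title",
        // Skip duplicates.
        "skip_duplicates": true,
        // Return the first 10 results.
        "size": 10
      }
    }
  }
}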

{
  "took": 305,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": {"value": 0, "relation": "eq"},
    "max_score": null,
    "hits": []
  },
  "suggest": {
    "titleSuggest": [
      {
        "text": "s",
        "offset": 0,
        "length": 1,
        "options": [
          {
            "text": "SK-II",
            "_index": "test",
            "_type": "_doc",
            "_id": "xceQcIcBAo7LWD6k-sCY",
            "_score": 1.0,
            "_source": {"title": ["SK-II", "PITERA"]}
          },
          {
            "text": "Sony",
            "_index": "test",
            "_type": "_doc",
            "_id": "xMeQcIcBAo7LWD6k9MBJ",
            "_score": 1.0,
            "_source": {"title": ["Sony", "WH-1000XM5"]}
          },
          {
            "text": "switch",
            "_index": "test",
            "_type": "_doc",
            "_id": "xseQcIcBAo7LWD6k_8DL",
            "_score": 1.0,
            "_source": {"title": ["Nintendo", "switch"]}
          }
        ]
      }
    ]
  }
}
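The response above can be reproduced with a few lines of plain Java, which may help to make the "starts with the input" rule concrete. This is a simplified model under stated assumptions: the class CompletionSketch is hypothetical, matching is taken to be case-insensitive (as the sample response suggests), and results are sorted alphabetically only to make the output deterministic. It makes no attempt to model how elasticsearch stores or scores completions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical prefix-matching model of a completion query.
final class CompletionSketch {

    // Each inner list is the "title" array of one document.
    static List<List<String>> sampleDocuments() {
        return List.of(
                List.of("Sony", "WH-1000XM5"),
                List.of("SK-II", "PITERA"),
                List.of("Nintendo", "switch"));
    }

    // Returns up to `size` inputs that start with `prefix`, ignoring case.
    static List<String> suggest(List<List<String>> documents, String prefix, int size) {
        String needle = prefix.toLowerCase(Locale.ROOT);
        // Sorted set: drops duplicates and fixes the order (an assumption of this sketch).
        Set<String> matches = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        for (List<String> inputs : documents) {
            for (String input : inputs) {
                if (input.toLowerCase(Locale.ROOT).startsWith(needle)) {
                    matches.add(input);
                }
            }
        }
        List<String> options = new ArrayList<>(matches);
        return options.subList(0, Math.min(size, options.size()));
    }

    public static void main(String[] args) {
        System.out.println(suggest(sampleDocuments(), "s", 10));
    }
}
```

With the three sample documents and the prefix "s", the sketch returns SK-II, Sony and switch, the same three terms as the options in the response. PITERA, Nintendo and WH-1000XM5 are left out because they do not start with "s".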

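2.4. Implementing autocomplete for the hotel search box

Our hotel index does not have a pinyin analyzer configured yet, so its configuration has to change. As we know, however, the existing field mappings of an index cannot be changed in place; the index has to be deleted and recreated.

We also need a new field for autocomplete that holds brand, business, city and similar values as the suggestions to offer.

To sum up, these are the things we need to do.

Modify the structure of the hotel index and set up custom pinyin analyzers.
Change the name and all fields of the index to use a custom analyzer.
Add a new field suggestion of type completion to the index, using a custom analyzer.
Add a suggestion field to the HotelDoc class, containing brand and business.
Re-import the data into the hotel index.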

2.4.1. Modifying the hotel mapping

The code is as follows.

DELETE /hotel

// Hotel data index.
PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "limit_first_letter_length": 16,
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "none_chinese_pinyin_tokenize": false,
          "keep_original": true,
          "remove_duplicated_term": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "all": {
        "type": "text",
        "analyzer": "text_analyzer",
        "search_analyzer": "ik_smart"
      },
      "id": {"type": "keyword"},
      "address": {"type": "keyword", "index": false},
      "brand": {"type": "keyword", "copy_to": "all"},
      "business": {"type": "keyword", "copy_to": "all"},
      "city": {"type": "keyword"},
      "location": {"type": "geo_point"},
      "name": {
        "type": "text",
        "analyzer": "text_analyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "pic": {"type": "keyword", "index": false},
      "price": {"type": "integer"},
      "score": {"type": "integer"},
      "starName": {"type": "keyword"},
      "suggestion": {
        "type": "completion",
        "analyzer": "completion_analyzer"
      }
    }
  }
}

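2.4.2. Modifying the HotelDoc entity

HotelDoc needs an extra field for autocomplete. Its content can be the hotel's brand, city, business district and similar information. Given the requirements on completion fields, an array of these values is the best fit.

We therefore add a suggestion field of type List<String> to HotelDoc and put values such as brand, city and business into it.

The code is as follows.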

package com.geek.elasticsearchgeek.hotel.pojo;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/**
 * @author geek
 */
@Data
@AllArgsConstructor
@NoArgsConstructor
public class HotelDoc implements Serializable {

    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    /** Distance value used when sorting. */
    private Object distance;
    private Boolean bAdvertise;
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // Assemble suggestion.
        if (this.business.contains("/")) {
            // business holds several values and has to be split.
            String[] split = this.business.split("/");
            // Add the elements.
            this.suggestion = new ArrayList<>();
            this.suggestion.add(this.brand);
            Collections.addAll(this.suggestion, split);
        } else {
            this.suggestion = Arrays.asList(this.brand, this.business);
        }
    }
}
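The constructor above assumes that business is never null; a hotel without a business district would make this.business.contains("/") throw a NullPointerException. If the data can contain such rows, the assembly could be moved into a small null-tolerant helper. The following is only a sketch of that idea: the class SuggestionFieldSketch and the method buildSuggestion are hypothetical names, not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; a null-tolerant variant of the suggestion assembly.
final class SuggestionFieldSketch {

    // Collects the brand and every "/"-separated business value,
    // skipping nulls and empty strings.
    static List<String> buildSuggestion(String brand, String business) {
        List<String> suggestion = new ArrayList<>();
        if (brand != null && !brand.isEmpty()) {
            suggestion.add(brand);
        }
        if (business != null) {
            // split also covers the single-value case: no "/" yields one part.
            for (String part : business.split("/")) {
                if (!part.isEmpty()) {
                    suggestion.add(part);
                }
            }
        }
        return suggestion;
    }

    public static void main(String[] args) {
        System.out.println(buildSuggestion("7Days", "Jiangwan/Wujiaochang"));
        System.out.println(buildSuggestion("7Days", null));
    }
}
```

Because split returns the whole string when no "/" is present, the separate else branch of the original is no longer needed in this variant.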

2.4.3. Re-importing the data

Run the data import written earlier again. The new hotel documents now contain suggestion.

Test.

GET /hotel/_search
{
  "suggest": {
    "suggestions": {
      "text": "h",
      "completion": {
        "field": "suggestion",
        "skip_duplicates": true,
        "size": 10
      }
    }
  }
}

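2.4.4. Java API for autocomplete queries

Earlier we covered the DSL for autocomplete queries but not the corresponding Java API. The request is assembled with SuggestBuilder and SuggestBuilders.completionSuggestion, mirroring the structure of the DSL.

The result of an autocomplete query is also unusual: it is read from the Suggest object of the response by suggestion name, and each option carries one completed term. Section 2.4.5 contains the full request and parsing code.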

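2.4.5. Implementing search box autocomplete

Looking at the front-end page, we can see that typing in the input box makes the front end send an ajax request.

The return value is the collection of completed terms, of type List<String>.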

1) Add a new endpoint to HotelController in the com.geek.elasticsearchgeek.hotel.controller package to receive the new request.

@RequestMapping("/suggestion")
public List<String> getSuggestions(@RequestParam("key") String prefix) {
    return this.hotelService.getSuggestions(prefix);
}

2) Add the method to IhotelService in the com.geek.elasticsearchgeek.hotel.service package.

List<String> getSuggestions(String prefix);

3) Implement the method in com.geek.elasticsearchgeek.hotel.service.impl.HotelService.

@Override
public List<String> getSuggestions(String prefix) {
    // Prepare the request.
    SearchRequest searchRequest = new SearchRequest("hotel");
    // Prepare the DSL.
    searchRequest.source().suggest(new SuggestBuilder().addSuggestion(
            "suggestions",
            SuggestBuilders.completionSuggestion("suggestion")
                    .prefix(prefix)
                    .skipDuplicates(true)
                    .size(10)));
    // Send the request.
    SearchResponse searchResponse;
    try {
        searchResponse = this.restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    // Parse the result.
    Suggest suggest = searchResponse.getSuggest();
    // Look up the completion result by name. The name must match the one
    // passed to addSuggestion above ("suggestions").
    CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
    // Get the options.
    List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
    // Collect the completed term of each option.
    List<String> list = new ArrayList<>(options.size());
    for (CompletionSuggestion.Entry.Option option : options) {
        String text = option.getText().toString();
        list.add(text);
    }
    return list;
}
