Background

The project is developed in Node.js.

Data is currently tokenized with nodejieba + pinyin.

This logic now needs to move into Elasticsearch (ES).

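Goal

The search box searches across three fields: brand (品牌), manufacturer (主机厂), and model (车型). The search_key field used for searching stores its data in the format ${品牌} ${主机厂} ${车型}.

  1. All-Chinese keyword: 一汽奥迪 -> 奥迪 一汽大众(奥迪) 100
  2. All English (pinyin), or English (pinyin) plus digits: cg k1 -> 成功 成功汽车 K1
  3. Chinese mixed with English (pinyin) or digits: 福特进口 bir -> 福特 福特(进口) Thunderbird [雷鸟]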
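
As a minimal sketch of how the search_key value could be assembled before indexing: the helper name buildSearchKey is hypothetical, and skipping empty fields is an assumption that the original format does not state.

```javascript
// Hypothetical helper: joins brand, manufacturer, and model into the
// `${品牌} ${主机厂} ${车型}` format described above.
// Assumption: missing or empty fields are skipped rather than left as gaps.
function buildSearchKey(doc) {
  return [doc['品牌'], doc['主机厂'], doc['车型']]
    .map((part) => String(part ?? '').trim())
    .filter((part) => part.length > 0)
    .join(' ');
}

console.log(buildSearchKey({ 品牌: '奥迪', 主机厂: '一汽大众(奥迪)', 车型: '100' }));
// 奥迪 一汽大众(奥迪) 100
```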

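Install the analyzers (ik, pinyin)

See the post on this site about the various analyzers in ES (es中的各种分词器).

Configure the dictionary and hot reloading

See the post on this site about hot-reloading custom ik dictionaries (ik分词器自定义词库热更新).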

keyword.dic

恒润
荣放
...

Index configuration

  • PUT index
{
  "settings": {
    "index.max_ngram_diff": 10,
    "analysis": {
      "filter": {
        # alternative to wildcard queries
        "ngram_filter": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "10"
        },
        "pinyin_simple_filter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_none_chinese": false,
          "keep_none_chinese_together": true,
          "keep_none_chinese_in_first_letter": false,
          "none_chinese_pinyin_tokenize": false,
          "keep_separate_first_letter": false,
          "keep_full_pinyin": false,
          "limit_first_letter_length": 50,
          "lowercase": true
        },
        "pinyin_full_filter": {
          "type": "pinyin",
          "keep_first_letter": false,
          "keep_separate_first_letter": false,
          "keep_joined_full_pinyin": true,
          "keep_full_pinyin": true,
          "keep_original": false,
          "limit_first_letter_length": 50,
          "lowercase": true,
          "keep_none_chinese": false,
          "keep_none_chinese_together": true,
          "none_chinese_pinyin_tokenize": false,
          "keep_none_chinese_in_first_letter": false
        },
        # e.g. 奥迪 -> 奥迪,aodi,ad
        "pinyin": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_first_letter": true,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "lowercase": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": false,
          "keep_separate_first_letter": false,
          "none_chinese_pinyin_tokenize": false
        }
      },
      "analyzer": {
        # e.g. 奥迪 -> 奥迪,aodi,ad
        "separate_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": [
            "pinyin"
          ]
        },
        "ikIndexAnalyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word"
        },
        "ikSearchAnalyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word"
        },
        # first letters only; the ngram filter is added at index time so that substring (contains) matches work
        "pinyiSimpleIndexAnalyzer": {
          "tokenizer": "ik_smart",
          "filter": [
            "pinyin_simple_filter",
            "ngram_filter",
            "lowercase"
          ]
        },
        "pinyiSimpleSearchAnalyzer": {
          "tokenizer": "ik_smart",
          "filter": [
            "pinyin_simple_filter",
            "lowercase"
          ]
        },
        # full pinyin only; the ngram filter is added at index time so that substring (contains) matches work
        "pinyiFullIndexAnalyzer": {
          "tokenizer": "ik_smart",
          "filter": [
            "pinyin_full_filter",
            "ngram_filter",
            "lowercase"
          ]
        },
        "pinyiFullSearchAnalyzer": {
          "tokenizer": "ik_smart",
          "filter": [
            "pinyin_full_filter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "品牌": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "analyzer": "separate_analyzer",
        "search_analyzer": "ik_smart"
      },
      "主机厂": {
        "type": "keyword"
      },
      "车型": {
        "type": "keyword"
      },
      # used for searching; this field stores data in the format `${品牌} ${主机厂} ${车型}`
      "search_key": {
        "type": "keyword",
        "fields": {
          "SPY": {
            "type": "text",
            "analyzer": "pinyiSimpleIndexAnalyzer",
            "search_analyzer": "pinyiSimpleSearchAnalyzer"
          },
          "FPY": {
            "type": "text",
            "analyzer": "pinyiFullIndexAnalyzer",
            "search_analyzer": "pinyiFullSearchAnalyzer"
          },
          "IKS": {
            "type": "text",
            "analyzer": "ikIndexAnalyzer",
            "search_analyzer": "ikSearchAnalyzer"
          }
        }
      }
    }
  }
}
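
The SPY and FPY index analyzers apply an ngram filter while the matching search analyzers do not, so a short query term can match any indexed substring of a longer token. The sketch below only illustrates that idea in plain JavaScript; it is not how Elasticsearch implements the filter, and the function names are hypothetical. The defaults mirror min_gram 1 and max_gram 10 from the settings above.

```javascript
// Illustration only: lists every substring of `token` whose length lies in
// [minGram, maxGram]. These are the kinds of terms an ngram token filter
// adds to the index for a single token.
function ngrams(token, minGram = 1, maxGram = 10) {
  const grams = [];
  for (let start = 0; start < token.length; start++) {
    for (let len = minGram; len <= maxGram && start + len <= token.length; len++) {
      grams.push(token.slice(start, start + len));
    }
  }
  return grams;
}

// A search-side term (no ngram applied) matches when it equals one indexed gram.
function indexedTokenContains(indexedToken, queryTerm) {
  return ngrams(indexedToken).includes(queryTerm);
}

console.log(ngrams('cg')); // [ 'c', 'cg', 'g' ]
console.log(indexedTokenContains('chenggong', 'gong')); // true
```

Because max_gram is 10, query terms longer than 10 characters would never equal an indexed gram under this model.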

Query summary

Brand lookup, with brand data placed first: 奥迪/aodi/ad

When the keyword matches a brand, the brand data needs to be placed first, on its own.

{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "cm_brand": {
              "query": "aodi",
              "analyzer": "keyword"
            }
          }
        }
      ]
    }
  }
}
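
A minimal sketch of building this brand query from user input. The helper name buildBrandQuery is hypothetical. Lowercasing the input is an assumption: the keyword analyzer does not lowercase the query text, while the pinyin filter in the settings above has lowercase enabled.

```javascript
// Hypothetical helper: builds the brand lookup body shown above.
// Assumption: the input is trimmed and lowercased so it can equal the
// lowercased pinyin terms (e.g. "aodi", "ad") produced at index time.
function buildBrandQuery(keyword, field = 'cm_brand') {
  return {
    query: {
      bool: {
        should: [
          {
            match_phrase: {
              [field]: { query: keyword.trim().toLowerCase(), analyzer: 'keyword' },
            },
          },
        ],
      },
    },
  };
}

console.log(JSON.stringify(buildBrandQuery(' AoDi '), null, 2));
```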

Pure Chinese: 一汽奥迪

  • If the keyword matches the regex /^[\u4e00-\u9fa5,\-\s.]+$/
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "search_key.IKS": {
              "query": "一汽奥迪"
            }
          }
        }
      ]
    }
  }
}
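
The check and the query above can be sketched together as follows. The function names are hypothetical. The regex is written without the g flag and with the hyphen escaped, so that repeated test() calls carry no state and the hyphen is unambiguously a literal.

```javascript
// Returns true when the keyword holds only Chinese characters plus the
// separators comma, hyphen, whitespace, and period.
function isPureChinese(keyword) {
  return /^[\u4e00-\u9fa5,\-\s.]+$/.test(keyword);
}

// Hypothetical helper: builds the match query on search_key.IKS,
// or returns null when the keyword is not pure Chinese.
function buildChineseQuery(keyword) {
  if (!isPureChinese(keyword)) return null;
  return {
    query: {
      bool: {
        should: [{ match: { 'search_key.IKS': { query: keyword } } }],
      },
    },
  };
}

console.log(isPureChinese('一汽奥迪')); // true
console.log(isPureChinese('cg k1')); // false
```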

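All English (pinyin), or English (pinyin) plus digits: cg k1

  • If the keyword matches the regex /^[a-zA-Z0-9,\-\s.]+$/
  • First split the keyword on whitespace, commas, periods, and hyphens: keyword.split(/[,\-\s.]/). Each non-empty part then becomes the keyword of a match_phrase query, run against both SPY and FPY.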
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "search_key.FPY": {
              "query": "cg k1"
            }
          }
        },
        {
          "match": {
            "search_key.SPY": {
              "query": "cg k1"
            }
          }
        },
        {
          "match": {
            "search_key.IKS": {
              "query": "cg k1"
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "search_key.FPY": {
                    "query": "cg",
                    "boost": 5
                  }
                }
              },
              {
                "match_phrase": {
                  "search_key.SPY": {
                    "query": "cg",
                    "boost": 5
                  }
                }
              },
              {
                "match_phrase": {
                  "search_key.FPY": {
                    "query": "k1",
                    "boost": 5
                  }
                }
              },
              {
                "match_phrase": {
                  "search_key.SPY": {
                    "query": "k1",
                    "boost": 5
                  }
                }
              }
            ],
            "boost": 1
          }
        }
      ]
    }
  }
}
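
The split-and-build steps for this case can be sketched as below. The function names splitAlnumParts and buildAlnumQuery are hypothetical, and the sketch assumes the keyword has already passed the alphanumeric regex check.

```javascript
// Splits on comma, hyphen, whitespace, and period, dropping empty parts.
function splitAlnumParts(keyword) {
  return keyword.split(/[,\-\s.]/).filter((part) => part !== '');
}

// Hypothetical helper: one match clause per sub-field on the whole keyword,
// plus a nested bool with boosted match_phrase clauses for each part
// against FPY and SPY, mirroring the request body shown above.
function buildAlnumQuery(keyword) {
  const phraseClauses = [];
  for (const part of splitAlnumParts(keyword)) {
    for (const field of ['search_key.FPY', 'search_key.SPY']) {
      phraseClauses.push({ match_phrase: { [field]: { query: part, boost: 5 } } });
    }
  }
  return {
    query: {
      bool: {
        should: [
          { match: { 'search_key.FPY': { query: keyword } } },
          { match: { 'search_key.SPY': { query: keyword } } },
          { match: { 'search_key.IKS': { query: keyword } } },
          { bool: { should: phraseClauses, boost: 1 } },
        ],
      },
    },
  };
}

console.log(splitAlnumParts('cg k1')); // [ 'cg', 'k1' ]
```

Calling buildAlnumQuery('cg k1') yields four match_phrase clauses in the same order as the example request: FPY/cg, SPY/cg, FPY/k1, SPY/k1.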

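Chinese mixed with English (pinyin) or digits: 福特进口 bir

  • If the keyword matches the regex /^[\u4e00-\u9fa50-9a-zA-Z,\-\s.]+$/
  • First replace every Chinese character in the keyword with a comma, then extract the English/pinyin/digit part by regex (separators are still included here and are split out in the next step): keyword.replace(/[\u4e00-\u9fa5]/g, ',').match(/[a-zA-Z0-9,\-\s.]/g).join("")
  • Next split that English/pinyin/digit part on whitespace, commas, periods, and hyphens: noChineseKeyword.split(/[,\-\s.]/). Each non-empty part then becomes the keyword of a match_phrase query, run against both SPY and FPY.
  • Finally extract the Chinese part of the keyword: keyword.match(/[\u4e00-\u9fa5]/g).join(""), and use it as the keyword of the match queries.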
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "search_key.FPY": {
              "query": "福特进口"
            }
          }
        },
        {
          "match": {
            "search_key.SPY": {
              "query": "福特进口"
            }
          }
        },
        {
          "match": {
            "search_key.IKS": {
              "query": "福特进口"
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "search_key.FPY": {
                    "query": "bir",
                    "boost": 5
                  }
                }
              },
              {
                "match_phrase": {
                  "search_key.SPY": {
                    "query": "bir",
                    "boost": 5
                  }
                }
              }
            ],
            "boost": 1
          }
        }
      ]
    }
  }
}
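
The extraction steps for the mixed case can be sketched as below. The function names are hypothetical, and the keyword is assumed to have already passed the mixed Chinese/alphanumeric regex check. Under that assumption every remaining character is already in the allowed class, so the intermediate match(...).join("") step is skipped here.

```javascript
// Separates the Chinese characters from the English/pinyin/digit parts.
function splitMixedKeyword(keyword) {
  const chineseChar = /[\u4e00-\u9fa5]/g;
  // Chinese part: all Chinese characters joined together.
  const chinese = (keyword.match(chineseChar) || []).join('');
  // Non-Chinese parts: Chinese replaced by commas, then split on separators.
  const parts = keyword
    .replace(chineseChar, ',')
    .split(/[,\-\s.]/)
    .filter((part) => part !== '');
  return { chinese, parts };
}

// Hypothetical helper: match on the Chinese part, boosted match_phrase
// on each non-Chinese part, mirroring the request body shown above.
function buildMixedQuery(keyword) {
  const { chinese, parts } = splitMixedKeyword(keyword);
  const phraseClauses = parts.flatMap((part) => [
    { match_phrase: { 'search_key.FPY': { query: part, boost: 5 } } },
    { match_phrase: { 'search_key.SPY': { query: part, boost: 5 } } },
  ]);
  return {
    query: {
      bool: {
        should: [
          { match: { 'search_key.FPY': { query: chinese } } },
          { match: { 'search_key.SPY': { query: chinese } } },
          { match: { 'search_key.IKS': { query: chinese } } },
          { bool: { should: phraseClauses, boost: 1 } },
        ],
      },
    },
  };
}

console.log(splitMixedKeyword('福特进口 bir'));
// { chinese: '福特进口', parts: [ 'bir' ] }
```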