Elasticsearch Tutorial

Elasticsearch Analysis

When a query is processed during a search operation, the analysis module analyzes the content of the index being searched. The module consists of analyzers, tokenizers, token filters, and character filters. If no analyzer is defined, the built-in analyzers, tokenizers, token filters, and character filters are registered with the analysis module by default.
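
As a rough illustration of how these pieces fit together, the sketch below chains a character filter, a tokenizer, and a token filter in plain Python. It does not call Elasticsearch: the function names (char_filter, tokenizer, token_filter, analyze) and the "&" to "and" rule are assumptions made up for this example. It only shows the order in which the stages run: character filters first, then the tokenizer, then token filters.

```python
def char_filter(text):
    # Character filter: rewrites the raw text before tokenization.
    # The "&" -> "and" mapping is a made-up example rule.
    return text.replace("&", "and")


def tokenizer(text):
    # Tokenizer: splits the filtered text into tokens (here, on whitespace).
    return text.split()


def token_filter(tokens):
    # Token filter: transforms the token stream (here, lowercasing).
    return [token.lower() for token in tokens]


def analyze(text):
    # An analyzer applies the three stages in this fixed order.
    return token_filter(tokenizer(char_filter(text)))


print(analyze("Tom & Jerry"))
```

An analyzer in Elasticsearch is configured by naming the components for each stage; the composition itself works along the same lines.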

In the following example, we use the standard analyzer, which is used by default when no other analyzer is specified. It analyzes the sentence based on grammar and produces the words used in the sentence.

POST _analyze
{
  "analyzer": "standard",
  "text": "Today's weather is beautiful"
}

After running the code above, we get the following response:

{
  "tokens": [
    {
      "token": "today's",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "weather",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 16,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "beautiful",
      "start_offset": 19,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

Configuring the standard analyzer

We can configure the standard analyzer with various parameters to suit our own requirements.

In the following example, we configure the standard analyzer with max_token_length set to 5.

To do this, we first create an index whose analyzer uses the max_token_length parameter.

PUT index_4_analysis
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

Next, we apply the analyzer as shown below. Note that the token "is" does not appear in the output: the analyzer is configured with the _english_ stop word list, and "is" is one of those stop words. Its position is still counted, which is why the positions in the response jump from 3 to 5. Because max_token_length is 5, longer words are split into pieces of at most five characters, so "weather" becomes "weath" and "er".

POST index_4_analysis/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "Today's weather is beautiful"
}

After running the code above, we get the following response:

{
  "tokens": [
    {
      "token": "today",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "s",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "weath",
      "start_offset": 8,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "er",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "beaut",
      "start_offset": 19,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "iful",
      "start_offset": 24,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}
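
The tokens, offsets, and positions in this response can be reproduced with a rough approximation in plain Python. This is only a sketch and not how Lucene implements the standard tokenizer: it treats every run of ASCII letters and digits as a word (so the apostrophe always splits, which the real tokenizer does not do when the length limit allows), and the STOP_WORDS set is a small illustrative subset rather than the full _english_ list. The function name analyze and its parameters are hypothetical.

```python
import re

# Small illustrative subset of English stop words (an assumption for this
# sketch, not the full _english_ list).
STOP_WORDS = {"a", "an", "and", "is", "the"}


def analyze(text, max_token_length=5, stop_words=STOP_WORDS):
    tokens = []
    position = 0
    # Simplification: every run of ASCII letters/digits counts as a word.
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        word = match.group().lower()
        # Split words longer than max_token_length into fixed-size chunks.
        for i in range(0, len(word), max_token_length):
            chunk = word[i:i + max_token_length]
            start = match.start() + i
            if chunk not in stop_words:
                tokens.append({
                    "token": chunk,
                    "start_offset": start,
                    "end_offset": start + len(chunk),
                    "position": position,
                })
            # A removed stop word still consumes a position.
            position += 1
    return tokens


for token in analyze("Today's weather is beautiful"):
    print(token)
```

The position counter advances even when a chunk is dropped as a stop word, which is what produces the gap between positions 3 and 5.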

The following table lists the various analyzers and their descriptions:

No.	Analyzer and description
1	Standard analyzer (standard): The stopwords and max_token_length settings can be configured for this analyzer. By default, the stop word list is empty and max_token_length is 255.
2	Simple analyzer (simple): This analyzer is composed of the lowercase tokenizer.
3	Whitespace analyzer (whitespace): This analyzer is composed of the whitespace tokenizer.
4	Stop analyzer (stop): stopwords and stopwords_path can be configured. By default, stopwords is initialized to the English stop words, and stopwords_path contains the path to a text file with stop words.
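
To make the difference between the last three analyzers concrete, here is a minimal side-by-side approximation in plain Python. It does not call Elasticsearch; the function names are hypothetical, and STOP_WORDS is a small illustrative subset of the English stop words rather than the full default list.

```python
import re

# Illustrative subset of English stop words (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "is", "it", "the", "was"}


def whitespace_analyzer(text):
    # Splits on whitespace only; case and punctuation are kept.
    return text.split()


def simple_analyzer(text):
    # Splits on anything that is not a letter, then lowercases.
    return [word.lower() for word in re.findall(r"[^\W\d_]+", text)]


def stop_analyzer(text, stop_words=STOP_WORDS):
    # Same as the simple analyzer, with stop words removed afterwards.
    return [word for word in simple_analyzer(text) if word not in stop_words]


text = "The Weather is Beautiful, 5 days running."
print(whitespace_analyzer(text))
print(simple_analyzer(text))
print(stop_analyzer(text))
```

The whitespace variant keeps "Beautiful," and "5" intact, the simple variant drops the digit and the punctuation, and the stop variant additionally removes "the" and "is".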

Tokenizers

Tokenizers are used to generate tokens from text in Elasticsearch. Text can be broken into tokens by taking whitespace or punctuation into account. Elasticsearch has many built-in tokenizers, which can be used in custom analyzers.

Below is an example of a tokenizer that splits the text into terms whenever it encounters a character that is not a letter, and that also lowercases every term:

POST _analyze
{
  "tokenizer": "lowercase",
  "text": "It Was a Beautiful Weather 5 Days ago."
}

After running the code above, we get the following response:

{
  "tokens": [
    {
      "token": "it",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "was",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "beautiful",
      "start_offset": 9,
      "end_offset": 18,
      "type": "word",
      "position": 3
    },
    {
      "token": "weather",
      "start_offset": 19,
      "end_offset": 26,
      "type": "word",
      "position": 4
    },
    {
      "token": "days",
      "start_offset": 29,
      "end_offset": 33,
      "type": "word",
      "position": 5
    },
    {
      "token": "ago",
      "start_offset": 34,
      "end_offset": 37,
      "type": "word",
      "position": 6
    }
  ]
}
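
The behavior seen in this response (the digit 5 and the final period are dropped, and every term is lowercased) can be approximated in a few lines of Python. This is a sketch under the assumption that "letter" means any Unicode letter; lowercase_tokenizer is a hypothetical helper, not an Elasticsearch API.

```python
import re


def lowercase_tokenizer(text):
    tokens = []
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "word",
            "position": position,
        })
    return tokens


for token in lowercase_tokenizer("It Was a Beautiful Weather 5 Days ago."):
    print(token)
```

The offsets come straight from the match positions in the original string, which is why "days" starts at 29 even though the "5" before it produces no token.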

The tokenizers and their descriptions are listed below:

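No.	Tokenizer and description
1	Standard tokenizer (standard): This is built on a grammar-based tokenizer; max_token_length can be configured for it.
2	Edge NGram tokenizer (edgeNGram): Settings such as min_gram, max_gram, and token_chars can be set for this tokenizer.
3	Keyword tokenizer (keyword): This emits the entire input as a single token; buffer_size can be set for it.
4	Letter tokenizer (letter): This captures a whole word until a non-letter is encountered.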
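
The following plain-Python sketch approximates three of the tokenizers in the table so their output can be compared. It is an illustration only: the function names are hypothetical, buffer_size and token_chars are ignored, and the edge n-gram version treats the whole input as a single word.

```python
import re


def keyword_tokenizer(text):
    # The entire input becomes one token.
    return [text]


def letter_tokenizer(text):
    # A token ends as soon as a non-letter is encountered; case is kept.
    return re.findall(r"[^\W\d_]+", text)


def edge_ngram_tokenizer(text, min_gram=1, max_gram=2):
    # Emits prefixes of the input, from min_gram up to max_gram characters.
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


print(keyword_tokenizer("New York"))
print(letter_tokenizer("Wi-Fi 6 rocks"))
print(edge_ngram_tokenizer("quick", min_gram=1, max_gram=3))
```

Edge n-grams are commonly used for search-as-you-type matching, since every prefix of a term becomes a searchable token.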
