Understanding Elasticsearch Analyzers
Nico Orfanos
Posted on June 12, 2023
If you want to truly understand how Elasticsearch processes text, you need to get familiar with analyzers. Full-text analysis is a big part of what sets Elasticsearch apart from general-purpose NoSQL databases like MongoDB.
An analyzer in Elasticsearch is a pipeline: you feed it text, and it gives you back a stream of tokens.
The analyzer pipeline consists of three steps:
Analyzer
├── 1. Char filters
├── 2. Tokenizer
└── 3. Token filters
Think of them as different stages through which your text flows.
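To make this concrete, here is a minimal sketch that runs a sentence through the built-in standard analyzer via the _analyze API. It uses Python's requests library and assumes a local, unsecured cluster at http://localhost:9200; adjust the URL and authentication for your own setup.

```python
import requests

# Sketch: send text through the built-in "standard" analyzer.
# The cluster URL is an assumption (local, unsecured node).
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"analyzer": "standard", "text": "The QUICK brown fox!"},
)

# Each token carries the term plus its position and character offsets.
for token in resp.json()["tokens"]:
    print(token["position"], token["token"])
# 0 the, 1 quick, 2 brown, 3 fox  (lowercased, punctuation stripped)
```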
Char filters
First up, we have character filters. These filters preprocess the text before it gets split into tokens by the tokenizer.
For example, a character filter can replace the emoticon ":)" with the word "_happy".
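Here is a hedged sketch of that idea using a mapping character filter with the _analyze API; the cluster URL and the exact mapping rule are assumptions for illustration.

```python
import requests

# Sketch: a mapping char filter rewrites ":)" before tokenization.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "char_filter": [{"type": "mapping", "mappings": [":) => _happy"]}],
        "tokenizer": "whitespace",
        "text": "I am :)",
    },
)
print([t["token"] for t in resp.json()["tokens"]])
# ['I', 'am', '_happy']
```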
Tokenizer
Next, we have the Tokenizer. The Tokenizer splits the text into smaller units called tokens.
For instance, if we use the whitespace tokenizer on the phrase "Hello World", it splits on the space and produces two tokens: "Hello" and "World".
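You can try a tokenizer on its own through the _analyze API. This sketch again assumes a local cluster at http://localhost:9200.

```python
import requests

# Sketch: run just the whitespace tokenizer, no char or token filters.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "whitespace", "text": "Hello World"},
)
print([t["token"] for t in resp.json()["tokens"]])
# ['Hello', 'World']
```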
Token filters
Now that our text is split into tokens, the token filters come into play. They are responsible for applying changes to the generated tokens.
One popular use case is stemming, where a token like "went" is stemmed to "go".
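The sketch below chains the lowercase and stemmer token filters after the whitespace tokenizer. Note that the default English stemmer is algorithmic, so it handles regular forms such as "running" (stemmed to "run"); an irregular form like "went" becoming "go" typically needs a dictionary-based stemmer or a stemmer_override rule. The cluster URL remains an assumption.

```python
import requests

# Sketch: tokenize on whitespace, then lowercase and stem each token.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "tokenizer": "whitespace",
        "filter": ["lowercase", "stemmer"],
        "text": "Running and jumping foxes",
    },
)
print([t["token"] for t in resp.json()["tokens"]])
# ['run', 'and', 'jump', 'fox']
```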
Analysis
Bringing it all together, this entire process is referred to as analysis.
Note that analyzers can be customized by configuring different combinations of character filters, tokenizers, and token filters based on your requirements.
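As a sketch of what that could look like, the example below wires all three building blocks into a custom analyzer in the settings of a new index. The index name my-blog and the cluster URL are placeholders.

```python
import requests

# Sketch: define a custom analyzer from a char filter, a tokenizer,
# and a token filter, then exercise it via the index-scoped _analyze.
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "emoticons": {"type": "mapping", "mappings": [":) => _happy"]}
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": ["emoticons"],
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}
requests.put("http://localhost:9200/my-blog", json=settings)

resp = requests.post(
    "http://localhost:9200/my-blog/_analyze",
    json={"analyzer": "my_analyzer", "text": "Running tests :)"},
)
print([t["token"] for t in resp.json()["tokens"]])
# ['running', 'tests', '_happy']
```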
Understanding analyzers is the key to Elasticsearch's relevance capabilities. It lets you fine-tune how your text is indexed and searched, and ultimately enhance the overall user experience.