Elasticsearch
vikash-agrawal
Posted on August 3, 2021
What is Elasticsearch
Elasticsearch is one of the most powerful search engines. It provides the following functionality:
- Full-text search
- Analysis of metric and performance data (APM: Application Performance Management)
- Receiving events from any application
How does Elasticsearch work
Elasticsearch works with documents in JSON form, each containing a list of fields and nested fields along with their corresponding values.
Elasticsearch exposes its functionality over a REST API, in a request/response format.
Elasticsearch is written in Java and built on top of Apache Lucene.
Elastic Stack
- Elasticsearch
- Kibana:
- An analytics and visualization platform for monitoring application performance and the Elasticsearch cluster.
- Also provides a platform for machine learning.
- Logstash
- A data processing pipeline.
- The incoming data can be customer data or application logs from any kind of source, such as files, Kafka, or a database.
- The data is processed and shipped to a destination such as Kafka or Elasticsearch.
- A Logstash pipeline goes through the following stages:
- Input
- Filter
- Output
- Each of these stages is implemented through plugins.
- Beats: lightweight data shippers installed on the application server.
- Filebeat: ships log files.
- Metricbeat: collects and ships application/service-level metrics.
- X-Pack: adds functionality on top of Elasticsearch and Kibana, such as:
- Authentication, for example by integrating with LDAP.
- Authorization.
- Alerting, e.g. defining alerts and triggering emails based on performance monitoring.
- Machine learning, e.g. anomaly detection and forecasting future values.
- Elasticsearch SQL, which lets you send SQL queries to Elasticsearch over HTTP and JDBC.
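To make the Logstash input, filter, and output stages described above more concrete, here is a minimal Python sketch of the same three-stage idea. It is not Logstash code or configuration: the function names, the "LEVEL text" log format, and the rule that drops DEBUG events are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, str]


def input_stage(lines: Iterable[str]) -> Iterator[Event]:
    """Input: wrap every raw log line in an event (a plain dict)."""
    for line in lines:
        yield {"message": line.rstrip("\n")}


def filter_stage(events: Iterable[Event]) -> Iterator[Event]:
    """Filter: parse hypothetical 'LEVEL text' lines and drop DEBUG events."""
    for event in events:
        level, _, text = event["message"].partition(" ")
        if level == "DEBUG":
            continue
        yield {**event, "level": level, "text": text}


def output_stage(events: Iterable[Event], sink: Callable[[str], None]) -> int:
    """Output: serialize each event as JSON and hand it to the destination."""
    shipped = 0
    for event in events:
        sink(json.dumps(event, sort_keys=True))
        shipped += 1
    return shipped


def run_pipeline(lines: Iterable[str], sink: Callable[[str], None]) -> int:
    """Chain the three stages; returns the number of events shipped."""
    return output_stage(filter_stage(input_stage(lines)), sink)


if __name__ == "__main__":
    destination = []  # stands in for Kafka or Elasticsearch
    run_pipeline(["INFO service started", "DEBUG cache warm"], destination.append)
    print(destination)
```

In real Logstash each stage would be a plugin chosen in the pipeline configuration; the sketch only shows how events flow from one stage to the next.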
Installation
Mac
- Download the latest Elasticsearch from the elasticsearch-download page
- Download Kibana in one of the following ways:
- kibana-download
- Brew
brew tap elastic/tap
brew install elastic/tap/kibana-full
If you see an error, follow the instructions given in the error message.
e.g. if the error is due to Xcode, run:
xcode-select --install
Post Installation
GET /_cat/nodes?v
GET /_cat/nodes?v=true&h=id,ip,port,v,m&pretty
# return just process
GET /_nodes/process?pretty
# same as above
GET /_nodes/_all/process?pretty
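The requests above are written in Kibana console syntax. If you want to send the same checks from a script, a sketch like the following could work. It assumes an unsecured local node listening on http://localhost:9200; the helper names es_url and fetch are made up for this example, and a secured cluster would additionally need HTTPS and credentials.

```python
from urllib.parse import urlencode, urlunsplit
from urllib.request import urlopen


def es_url(path, params=None, host="localhost:9200"):
    """Build the URL for a console-style request such as GET /_cat/nodes?v."""
    # Keep commas unescaped so h=id,ip,port reads the same as in the console.
    query = urlencode(params or {}, safe=",")
    return urlunsplit(("http", host, path, query, ""))


def fetch(path, params=None, host="localhost:9200"):
    """Send the GET request and return the response body as text."""
    with urlopen(es_url(path, params, host), timeout=5) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URLs; call fetch(...) once a node is actually running.
    print(es_url("/_cat/nodes", {"v": "true", "h": "id,ip,port,v,m"}))
    print(es_url("/_nodes/process", {"pretty": "true"}))
```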
Sharding and Scalability
- Elasticsearch stores data in indices.
- The data of a given index is stored in shards.
- Sharding lets an index grow beyond the disk of a single node: for example, 1 TB of data can be stored in a cluster of several nodes that each have only 500 GB of disk capacity.
- Each shard is an independent index in its own right, and is actually an Apache Lucene index.
- The size of a shard grows with the size of the index.
- A shard can store up to about 2 billion documents.
- Sharding also improves performance, because queries can be executed on different shards in parallel. For this reason, more than one shard of a given index can reside on the same node.
- An index is created with 1 shard by default.
- Elasticsearch versions before 7.0.0 created an index with 5 shards by default.
- For every indexing or GET request, the target shard is determined by:
shard = hash(_routing) % number_of_primary_shards
- Because of this formula, the number of shards of an existing index cannot be changed in place: with a different shard count, the same _routing value would resolve to a different shard and existing documents would no longer be found.
- To change the number of shards, use the Elasticsearch Split API to increase it and the Shrink API to decrease it. Both write the result to a new index instead of modifying the existing one.
- By default, the routing value is the document id, which makes shard placement unpredictable when the document id is randomly generated.
- Depending on the use case, routing on one of the existing fields gives more predictable behaviour.
- But custom routing comes with more responsibility, e.g.
- If the value of the field used for routing changes, the affected document has to be reindexed.
- A GET by document id must also pass the routing value; otherwise the request falls back to the default routing and might not return the document.
- The following factors should be considered when deciding the number of shards:
- Number of nodes.
- Size of the index.
- Disk size of each node.
- Number of indices.
- Number of queries.
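The routing rule above is easy to try out in a few lines of Python. This is only a sketch: it uses CRC32 from the standard library as a stand-in hash, which is an assumption made for illustration (Elasticsearch uses its own hash function), and the function names are hypothetical. The point it shows is why the shard count is fixed: the same routing value maps to a different shard once the divisor changes.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard as hash(routing) % number_of_primary_shards."""
    # Stand-in hash for illustration only; not the hash Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def moved_after_resize(routing_values, old_shards: int, new_shards: int):
    """Return the routing values that would resolve to a different shard."""
    return [
        value
        for value in routing_values
        if shard_for(value, old_shards) != shard_for(value, new_shards)
    ]


if __name__ == "__main__":
    doc_ids = [str(i) for i in range(1000)]
    moved = moved_after_resize(doc_ids, old_shards=5, new_shards=6)
    print(f"{len(moved)} of {len(doc_ids)} documents would be looked up on the wrong shard")
```

With custom routing, every document that shares a routing value (for example a hypothetical customer id) lands on the same shard, which is what makes the placement predictable.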
Replication
- Replication is configured at the index level.
- Replication works by copying shards; the copies are called replica shards.
- The original shard is called the primary shard.
- A primary shard together with its replica shards is called a replication group.
- An index is created with 1 replica by default.
- A replica shard is never placed on the same node as its primary shard.
- In a cluster of one node, the status of any user index with replicas is therefore yellow, because the replica cannot be assigned.
- As soon as a new node is added, the status turns green.
- There are the following ways to add a new node to a local cluster:
- Go to the Elasticsearch directory and execute:
bin/elasticsearch -Enode.name=node-2 -Epath.data=./data/node-2 -Epath.logs=./logs/node-2
- Extract the Elasticsearch archive into another directory, change the value of node.name in the config/elasticsearch.yml file, and start the process.
- The number of replicas of the Kibana index is adjusted automatically based on the number of nodes.
- The number of replicas should depend on several factors:
- As a rule of thumb, use 2 for critical APIs and 1 otherwise.
- Is the data also stored in other databases?
A query that targets a given shard can be served by its replicas as well. Hence replication increases both availability and query throughput.
Replicas provide recovery in real time, while a snapshot provides recovery to a given point in time.
GET /_cluster/health
GET /_cat/indices?v
PUT /product-index
GET /_cluster/health
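The yellow and green states mentioned above follow from one rule: two copies of the same shard never share a node. The sketch below models just that rule for a single index. It is a simplification written for this post, not how Elasticsearch computes health: the function name is hypothetical, it ignores the red state (an unassigned primary), and it ignores disk-based and other allocation rules.

```python
def cluster_health(num_nodes: int, primaries: int, replicas: int):
    """Return (status, unassigned_shards) for one index in a simplified model."""
    if num_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # Each shard wants 1 primary + `replicas` copies, all on distinct nodes.
    copies_wanted = 1 + replicas
    copies_placeable = min(copies_wanted, num_nodes)
    unassigned = primaries * (copies_wanted - copies_placeable)
    status = "green" if unassigned == 0 else "yellow"
    return status, unassigned


if __name__ == "__main__":
    print(cluster_health(num_nodes=1, primaries=1, replicas=1))  # ('yellow', 1)
    print(cluster_health(num_nodes=2, primaries=1, replicas=1))  # ('green', 0)
```

This matches what the health requests above report before and after the second node joins.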
Node Roles
- Master:
- node.master = true|false
- Responsible for creating and deleting indices and for allocating shards/replicas to nodes.
- The active master is elected from among the master-eligible nodes.
- In a big cluster, it makes sense to have dedicated master nodes.
- Data Node
- node.data = true|false
- This is enabled for all nodes by default.
- If you have a dedicated master node, don't make it a data node as well.
- Ingest
- node.ingest = true|false
- Runs ingest pipelines, i.e. a series of processing steps applied to documents while they are ingested, similar in purpose to simple Logstash processing.
- Machine Learning
- node.ml = true|false
- Lets the node run machine learning jobs.
- xpack.ml.enabled = true|false
- Enables the machine learning APIs.
- Coordinating
- Enabled by disabling all other node roles.
- Handles coordination work such as selecting the relevant shards for a request and merging the per-shard results, including aggregations.
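The coordinating role described above is essentially a scatter-gather step: ask every shard for its local best hits, then merge them into one global answer. Here is a small Python sketch of that idea. The data layout (lists of dicts standing in for shards), the function names, and ranking by reviewsCount are assumptions chosen for illustration, not Elasticsearch internals.

```python
import heapq


def shard_top(docs, size):
    """Per-shard step: return this shard's local top `size` documents."""
    return heapq.nlargest(size, docs, key=lambda doc: doc["reviewsCount"])


def coordinate(shards, size):
    """Coordinating step: fan out to every shard, then merge the partial results."""
    partial_results = [shard_top(docs, size) for docs in shards]
    all_hits = (doc for hits in partial_results for doc in hits)
    merged = heapq.nlargest(size, all_hits, key=lambda doc: doc["reviewsCount"])
    return [doc["product"] for doc in merged]


if __name__ == "__main__":
    shards = [
        [{"product": "Dove Soap", "reviewsCount": 100},
         {"product": "Dove Cream", "reviewsCount": 110}],
        [{"product": "Dove Shampoo", "reviewsCount": 120}],
    ]
    print(coordinate(shards, size=2))
```

Each shard only needs to return `size` hits, because no document outside a shard's local top `size` can appear in the global top `size`.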
Index
- Create the Index
PUT /product-index
The request body can carry settings such as the number of shards and the number of replicas.
- Add the document to index
POST /product-index/_doc
{
"product":"Dove Soap",
"reviewsCount": 100,
"price": 120
}
POST /product-index/_doc/100
{
"product":"Dove Cream",
"reviewsCount": 110,
"price": 150
}
PUT /product-index/_doc/110
{
"product":"Dove Cream",
"reviewsCount": 120,
"price": 150
}
In the response, check the value of _shards.total: it should be equal to 1 + the number of replicas. The 1 stands for the primary shard, because a document is written to exactly one primary shard and is then replicated to all replica shards in its replication group.
A document can be added with either POST or PUT. The only difference is that PUT stores the document at the given URI, so the id is mandatory for PUT.
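The two rules above (the expected _shards.total and the choice between POST and PUT) can be written down as small helpers. This is a sketch with hypothetical function names; number_of_shards and number_of_replicas are the index settings referred to in the create-index step, and the values used here are only examples.

```python
def index_settings(shards: int = 1, replicas: int = 1) -> dict:
    """Body for PUT /<index> that sets the shard and replica counts."""
    return {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}


def expected_shards_total(replicas: int) -> int:
    """A document lives on 1 primary shard plus one copy per replica."""
    return 1 + replicas


def index_request(index: str, doc_id=None):
    """Return (method, path): PUT needs an id, POST lets the id be generated."""
    if doc_id is None:
        return "POST", f"/{index}/_doc"
    return "PUT", f"/{index}/_doc/{doc_id}"


if __name__ == "__main__":
    print(index_settings(shards=2, replicas=2))
    print(expected_shards_total(replicas=2))
    print(index_request("product-index"))
    print(index_request("product-index", 110))
```

Note that POST with an explicit id also works, as the second request above shows; the helper simply picks PUT whenever an id is known.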
- Retrieve the documents by id
GET /product-index/_doc/100
- Update the document
- Documents in Elasticsearch are immutable, meaning they cannot be changed in place.
- The update API retrieves the document, applies the changes to its fields, and reindexes the result under the same id. So it replaces the existing document rather than modifying it.
POST /product-index/_update/100
{
"doc": {"reviewsCount": 99, "inStock":120}
}
- Scripted updates
- Setting ctx.op = 'noop' in the script skips the update, so there is no change in the primary term and sequence number.
- Otherwise the response reports "updated" and the version and sequence number change, regardless of whether the script's condition was met.
POST /product-index/_update/100
{
"script" : {
"source": "ctx._source.price++"
}
}
POST /product-index/_update/100
{
"script" : {
"source": "ctx._source.price = 500"
}
}
POST /product-index/_update/100
{
"script" : {
"source": """
if(ctx._source.price > 100){
ctx.op = 'noop';
}
ctx._source.price = 150
"""
}
}
POST /product-index/_update/100
{
"script" : {
"source": """
if(ctx._source.price < 100){
ctx._source.price = 150;
}
"""
}
}
POST /product-index/_update/100
{
"script" : {
"source": """
if(ctx._source.price > 100){
ctx.op = 'delete';
}
ctx._source.price = 150
"""
}
}
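To see why a scripted update without ctx.op = 'noop' always counts as an update, it helps to model the update as "copy the document, run the script, reindex the copy". The toy class below does exactly that in memory. It is a simplified model written for this post, not Elasticsearch code: the class and method names are hypothetical, and the version and sequence number bookkeeping is reduced to simple counters.

```python
import copy


class ToyIndex:
    """In-memory model of one index with _version and _seq_no bookkeeping."""

    def __init__(self):
        self.docs = {}
        self.next_seq_no = 0

    def index(self, doc_id, source):
        """Store a full document, replacing any existing one with the same id."""
        previous = self.docs.get(doc_id)
        version = previous["_version"] + 1 if previous else 1
        self.docs[doc_id] = {
            "_source": copy.deepcopy(source),
            "_version": version,
            "_seq_no": self.next_seq_no,
        }
        self.next_seq_no += 1
        return "created" if version == 1 else "updated"

    def scripted_update(self, doc_id, script):
        """Run the script on a copy of the document, then act on ctx['op']."""
        ctx = {"_source": copy.deepcopy(self.docs[doc_id]["_source"]), "op": "index"}
        script(ctx)
        if ctx["op"] == "noop":
            return "noop"  # nothing is written, counters stay as they are
        if ctx["op"] == "delete":
            del self.docs[doc_id]
            self.next_seq_no += 1
            return "deleted"
        return self.index(doc_id, ctx["_source"])  # reindex under the same id


def skip_if_expensive(ctx):
    """Python version of the script that sets ctx.op = 'noop' above."""
    if ctx["_source"]["price"] > 100:
        ctx["op"] = "noop"
    ctx["_source"]["price"] = 150


def raise_if_cheap(ctx):
    """Python version of the script that only changes cheap products."""
    if ctx["_source"]["price"] < 100:
        ctx["_source"]["price"] = 150
```

Running raise_if_cheap on a document priced at 150 changes nothing in the source, yet the model (like the behaviour described above) still reindexes it and bumps the counters.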
- Upserts
- If the document doesn't exist, it would index the new document.
- If the document exists, it would execute the script (make the updates)
#upserts
GET /product-index/_doc/101
POST /product-index/_update/101
{
"script": {"source": "ctx._source.reviewsCount++"},
"upsert": {
"product":"Dove Soap",
"reviewsCount": 125,
"price": 123
}
}
GET /product-index/_doc/101
POST /product-index/_update/101
{
"script": {"source": "ctx._source.reviewsCount++"},
"upsert": {
"product":"Dove Soap",
"reviewsCount": 125,
"price": 123
}
}
GET /product-index/_doc/101
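The two identical requests above behave differently because the document only exists the second time. A minimal Python sketch of that decision, using a plain dict as a stand-in for the index and a hypothetical function name:

```python
def update_with_upsert(index: dict, doc_id: str, script, upsert: dict) -> str:
    """Index the upsert document if the id is missing, else run the script."""
    if doc_id not in index:
        index[doc_id] = dict(upsert)  # the script is not run in this case
        return "created"
    script(index[doc_id])
    return "updated"


def increment_reviews(source: dict) -> None:
    """Python version of ctx._source.reviewsCount++."""
    source["reviewsCount"] += 1


if __name__ == "__main__":
    index = {}
    upsert_doc = {"product": "Dove Soap", "reviewsCount": 125, "price": 123}
    print(update_with_upsert(index, "101", increment_reviews, upsert_doc))
    print(update_with_upsert(index, "101", increment_reviews, upsert_doc))
    print(index["101"]["reviewsCount"])
```

The first call stores the upsert document as is (reviewsCount 125); the second call finds it and applies the script, giving 126.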
- Replace the Document
# Index a document under an existing id to replace the stored document
PUT /product-index/_doc/101
{
"product":"Dove Soap",
"reviewsCount": 100,
"price": 120
}
- Delete the document
DELETE /product-index/_doc/101