NLP HandsOn
Priscilla Parodi
Posted on June 2, 2022
Note: This HandsOn assumes that you have already followed the step-by-step Setup of your Elastic Cloud Trial account, and also that you have read the blog NLP and Elastic: Getting started.
Config: To prepare for the NLP HandsOn, we will need an Elasticsearch cluster running at least version 8.0 with an ML node.
To start using NLP in your Stack, the first thing we need to do is upload your model into a cluster.
In our eland library, a Python Elasticsearch client for exploring and analyzing data in Elasticsearch, we have some simple methods and scripts that allow you to upload models from local disk, or to pull models down from the Hugging Face model hub.
Once models are uploaded into the cluster, you’ll be able to allocate those models to specific ML nodes. Once model allocation is complete, we’re ready for inference.
Eland can be installed from PyPI via pip.
Before you go any further, make sure you have Python installed.
You can check this by running:
Unix/macOS
python3 --version
You should get some output like:
Python 3.8.8
Additionally, you’ll need to make sure you have pip available.
You can check this by running:
Unix/macOS
python3 -m pip --version
You should get some output like:
pip 21.0.1 from …
If you installed Python from source, with an installer from python.org, or via Homebrew you should already have pip.
If you don't have Python and pip installed, install them first.
With that, Eland can be installed from PyPI via pip:
$ python3 -m pip install eland
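If you want to confirm the installation from Python itself, here is a minimal sketch that uses only the standard library. The helper name installed_version is just an illustration and is not part of Eland.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("eland")
    print(f"eland {version}" if version else "eland is not installed")
```

If the script reports that eland is not installed, run the pip command above again and check for error messages.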
Getting started
To interact with your cluster through the API, we will need to use your Elasticsearch cluster endpoint information.
The endpoint looks like:
https://<user>:<password>@<hostname>:<port>
To find your endpoint information, click on the gear icon to open your deployment settings.
Copy your Elasticsearch endpoint as in the image below.
Note: If you want to try out examples with your own cluster, remember to include your endpoint URLs and authentication details.
Now add the username and password so your request can be authenticated. Your endpoint will look like this:
https://elastic:123456789@00c1f8.es.uscentral1.gcp.cloud.es.io:9243
username: elastic is a built-in superuser. It grants full access to cluster management and data indices.
password: If you don't have your password, you will need to reset it and generate a new password.
Copy your endpoint, you'll need it later.
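If your password contains characters such as @ or /, pasting it directly into the URL can break it. As a small sketch, assuming you want to assemble the endpoint in Python, the hypothetical helper below percent-encodes the credentials. The function name and the default port (taken from the example above) are assumptions, so adjust them to your deployment.

```python
from urllib.parse import quote


def build_endpoint(user: str, password: str, hostname: str, port: int = 9243) -> str:
    """Assemble https://<user>:<password>@<hostname>:<port>, percent-encoding the credentials."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"https://{safe_user}:{safe_password}@{hostname}:{port}"


if __name__ == "__main__":
    # Same values as the example endpoint shown above.
    print(build_endpoint("elastic", "123456789", "00c1f8.es.uscentral1.gcp.cloud.es.io"))
```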
Meanwhile, let's locate the first model to be imported.
We will import the model from Hugging Face, an AI community to build, train and deploy open source machine learning models.
In this demo we will use a random sentiment analysis model but feel free to import the model you want to use. You can read more details about this model on the Hugging Face webpage.
Copy the model name as in the image below.
Now that we have all the necessary information (the Elasticsearch cluster endpoint and the name of the model we want to import), let's proceed by importing the model:
Open your terminal and update the following command with your endpoint and model name:
eland_import_hub_model --url https://<user>:<password>@<hostname>:<port> \
--hub-model-id <model_name> \
--task-type <task_type>
In this case we are importing the bhadresh-savani/distilbert-base-uncased-emotion model to run the text_classification task.
In the Hugging Face filters you will be able to see the task of each model. Supported values are fill_mask, ner, question_answering, text_classification, text_embedding, and zero_shot_classification.
eland_import_hub_model --url https://elastic:<password>@<hostname>:<port> \
--hub-model-id bhadresh-savani/distilbert-base-uncased-emotion \
--task-type text_classification
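If you prefer to launch the import from a Python script, a minimal sketch could assemble the same command and check the task type against the supported values listed above before running anything. The helper name build_import_command is hypothetical; only the command and the three flags come from this post.

```python
SUPPORTED_TASK_TYPES = {
    "fill_mask",
    "ner",
    "question_answering",
    "text_classification",
    "text_embedding",
    "zero_shot_classification",
}


def build_import_command(url: str, hub_model_id: str, task_type: str) -> list:
    """Return the eland_import_hub_model command as an argument list."""
    if task_type not in SUPPORTED_TASK_TYPES:
        raise ValueError(f"Unsupported task type: {task_type!r}")
    return [
        "eland_import_hub_model",
        "--url", url,
        "--hub-model-id", hub_model_id,
        "--task-type", task_type,
    ]


if __name__ == "__main__":
    command = build_import_command(
        "https://elastic:<password>@<hostname>:<port>",  # replace with your endpoint
        "bhadresh-savani/distilbert-base-uncased-emotion",
        "text_classification",
    )
    print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the placeholders are replaced with your own endpoint.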
You will see that the Hugging Face model will be loaded directly from the model hub and then your model will be imported into Elasticsearch.
Wait for the process to end.
Let's check if the model was imported.
Click Machine Learning in your Kibana menu.
Under Model Management click Trained Models:
Your model needs to be on this list, as shown in the image below. If it is not, check whether the previous process reported an error message.
If your model is on this list, it was imported, but you still need to start the deployment. To do this, in the last column, under Actions, click Start deployment.
After deploying, the State column will have the value started and, under Actions, the Start deployment option will be disabled, which means that the deployment is complete.
Let's test our model!
Copy your model ID:
In Kibana's menu, click Dev Tools.
In this UI you will have a console to interact with the REST API of Elasticsearch.
We will use the inference API of the deployed model to evaluate it.
POST _ml/trained_models/<model_id>/deployment/_infer
{
  "docs": [{ "text_field": "<input>" }]
}
This POST method contains a docs array with a field matching your configured trained model input. Typically the field name is text_field. The text_field value is the input you want to infer.
In our case it will be:
POST _ml/trained_models/bhadresh-savani__distilbert-base-uncased-emotion/deployment/_infer
{
  "docs": [{ "text_field": "Elastic is the perfect platform for knowledgebase NLP applications" }]
}
Where the model_id is bhadresh-savani__distilbert-base-uncased-emotion and the value that I am using as a test is Elastic is the perfect platform for knowledgebase NLP applications.
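To test several sentences without retyping the request, here is a small sketch that generates the console request for any model ID and input text. The helper names are hypothetical, and the sketch assumes that docs is sent as a one-element array and that the model input field is called text_field, as described above.

```python
import json


def build_infer_request(model_id: str, text: str, input_field: str = "text_field"):
    """Return the (path, body) pair for an inference request against a deployed model."""
    path = f"_ml/trained_models/{model_id}/deployment/_infer"
    body = {"docs": [{input_field: text}]}
    return path, body


def to_console(path: str, body: dict) -> str:
    """Format a request so it can be pasted into the Dev Tools console."""
    return f"POST {path}\n{json.dumps(body, indent=2)}"


if __name__ == "__main__":
    path, body = build_infer_request(
        "bhadresh-savani__distilbert-base-uncased-emotion",
        "Elastic is the perfect platform for knowledgebase NLP applications",
    )
    print(to_console(path, body))
```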
Click the play button to send the request:
In this case the predicted sentiment is "joy".
That's it, the model is working. 🚀
Note: You can run more tests to determine if this model works for what you need.
To get all the statistics of your model you can use the _stats request:
GET _ml/trained_models/<model_id>/_stats
Let's continue with part 2: how to run this model on data being ingested.
To do this, let's start by importing a .csv file into Elasticsearch, so we can run the model while importing data.
I think it's interesting to run an analysis on random texts, and tweets are a good use case.
Recently Elon Musk announced his interest in buying Twitter, but before that he was famously active on the platform. As we have a sentiment analysis model, let's proceed with analyzing a sample of Elon's tweets.
I found this dataset on Kaggle, which is a good website for locating datasets.
Note: We don't have a huge amount of data, just 172 KB covering November 16, 2012 to September 29, 2017. But as this is not a research paper, this is not a problem.
Feel free to use whatever data you prefer, or even the Twitter API.
Let's download this file and import it into Elasticsearch.
There are different ways to do this, but since this is a small .csv file, we can use the Upload a file integration.
In the Kibana menu, click Integrations. You will see a list of integrations we have for collecting data.
Search for Upload a file as in the image below:
Then click Select or drag and drop a file and choose your csv file, in our case the data_elonmusk.csv that you downloaded earlier.
You will see something similar to the image below:
Click Override settings to rename the Tweet column to text_field. As explained before, there needs to be a field that matches your configured trained model input, which is typically called text_field. With this, the model will be able to identify the field to be analyzed.
Rename the Tweet column/field to text_field. Click Apply.
After the page loads, click Import.
Then click Advanced to edit the import process settings.
The import process has several steps:
Processing file - Turning the data into NDJSON documents so they can be ingested using the bulk API
Creating index - Creating the index using the settings and mappings objects
Creating ingest pipeline - Creating the ingest pipeline using the ingest pipeline object
Uploading data - Loading data into the new Elasticsearch index
Creating a data view (Index pattern) - Create a Kibana index pattern (if the user has opted to)
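To make the first step more concrete, here is a rough sketch of what "turning the data into NDJSON documents" for the bulk API can look like. This is not how Kibana implements it (there the parsing is done by the CSV processor in the ingest pipeline); it only illustrates the NDJSON bulk format. The function name, the index name and the sample row are hypothetical.

```python
import csv
import io
import json


def csv_to_bulk_ndjson(csv_text: str, index: str, rename: dict = None) -> str:
    """Convert CSV text into a bulk API body: one action line plus one document line per row."""
    rename = rename or {}
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Rename columns where requested, e.g. Tweet -> text_field.
        doc = {rename.get(column, column): value for column, value in row.items()}
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API expects every line, including the last one, to end with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample = "Tweet,Time\nhello world,2017-09-29 17:39:19\n"  # made-up row for illustration
    print(csv_to_bulk_ndjson(sample, "tweets", {"Tweet": "text_field"}), end="")
```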
As you can see the CSV processor is being used in the ingest pipeline to import your document.
Feel free to edit the mapping or ingest pipeline.
In our case we need to edit the ingest pipeline to add our previously imported model.
Add an inference processor that references the model, so that it runs on the data being ingested, as in the image below:
{
  "inference": {
    "model_id": "bhadresh-savani__distilbert-base-uncased-emotion"
  }
}
After that add your index name and click Import. If for some reason it doesn't work, repeat the process and check if you typed something incorrectly.
Note: What we are doing is adding your model for inference to the ingest pipeline, so the data doesn't need to come from a .csv file. Read more about it here.
When it finishes loading, your screen will look like mine. Click View index in Discover.
If you didn't disable Create data view when you were importing data, you should be able to locate your index by the name you used. Now you can explore your index data.
Next to the word Documents, click Field statistics. So far this is a beta feature, but it is excellent for exploring your data. As we can see, according to this sentiment analysis model, Elon was feeling joy in 70% of the analyzed tweets. The second most common sentiment in Elon's tweets was anger, followed by fear.
Let's click on the lens button on the right side of the screen to open Kibana Lens and explore this data.
When the screen loads, click and drag the Time field to explore this data by the date of each tweet.
Some time-based suggestions will appear. I liked one of them, but instead of every 30 days I changed the interval to one year. Also try filtering to predictions with a probability between 0.90 and 1, so that only the most confident predictions are counted. Here you can have fun with whatever analysis you want to run.
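The same probability filter can also be expressed as a search request outside of Lens. The sketch below builds a query body that keeps only predictions with a probability between 0.90 and 1 and counts the predicted sentiments. The function name and the field names used in the example are assumptions; check in Discover which fields the inference processor actually wrote to your index.

```python
def build_confident_sentiment_query(label_field: str, probability_field: str,
                                    min_probability: float = 0.90) -> dict:
    """Build a search body that counts predicted labels among high-confidence predictions."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "range": {
                probability_field: {"gte": min_probability, "lte": 1}
            }
        },
        "aggs": {
            "sentiments": {"terms": {"field": label_field}}
        },
    }


if __name__ == "__main__":
    # Hypothetical field names; adjust them to the ones in your index.
    body = build_confident_sentiment_query(
        "ml.inference.predicted_value.keyword",
        "ml.inference.prediction_probability",
    )
    print(body)
```

The body can be sent from Dev Tools with GET <your-index-name>/_search.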
Apparently anger has increased over time, but joy remains the most common in Elon's Tweets. Fear increased until the beginning of 2016 but decreased in 2017.
Well, there are several possible interpretations of the data. We always need to take into account the model used, its accuracy, the quality of our data, the information we seek, the type of analysis, and our own interpretation, context and knowledge. Still, I believe it is now possible to see how useful it is to analyze language.
For example, try running a classification model with the inference data (which is now a new field) to predict sentiment in addition to checking for influencers. Also try importing other models and using other datasets.
I also imported a NER model to identify entities in the same dataset so we can start to correlate text topics (keywords) with sentiment. The year Elon talked about Tesla the most in this dataset was 2015, which coincides with the year with the greatest increase in joy.
This news is from 2015 and Elon was really positive about Tesla even with the company reporting losses.
Again, these are not necessarily facts. But my goal is to show a little bit of what we can do with NLP analysis and correlation (which does not imply causation 😅).
Let's proceed with the last part: how to run this model on an existing index.
If your data is already indexed and you want to run inference on it without changing the content of the index, this is possible. If this is your case, let's proceed with this test.
In the Kibana menu click Ingest Pipelines, then Create pipeline and New pipeline.
Give your pipeline a name and click Add a processor.
The first step is to rename the field that will be inferred to text_field.
For that, add the Rename processor: in the Field input, add the field to be renamed, and in the Target field input, add text_field. Then click Add.
Now we will add the Inference processor. Click Add a processor again and then, under Model ID, add your model ID, in our case: bhadresh-savani__distilbert-base-uncased-emotion
Click Add.
Click Create pipeline and copy the name of your pipeline, you will need it later.
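For reference, the pipeline built in the UI corresponds roughly to the definition below. This is a sketch of the equivalent request body, not an export of the pipeline: the function name and the description text are hypothetical, and message is only an example of a source field name.

```python
import json


def build_inference_pipeline(source_field: str, model_id: str) -> dict:
    """Build an ingest pipeline body: rename the source field to text_field, then run inference."""
    return {
        "description": "Rename the text field and run NLP inference",
        "processors": [
            {"rename": {"field": source_field, "target_field": "text_field"}},
            {"inference": {"model_id": model_id}},
        ],
    }


if __name__ == "__main__":
    pipeline = build_inference_pipeline(
        "message",  # example name of the field that holds your text
        "bhadresh-savani__distilbert-base-uncased-emotion",
    )
    print(json.dumps(pipeline, indent=2))
```

A body like this can be sent from Dev Tools with PUT _ingest/pipeline/<your-pipeline-name> instead of clicking through the UI.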
Now open Dev Tools and run the following request (adding your source index, dest index and pipeline name):
POST _reindex
{
  "source": {
    "index": "<your-source-index-name>"
  },
  "dest": {
    "index": "<your-ml-dest-index-name>",
    "pipeline": "<your-pipeline-name>"
  }
}
This copies documents from a source to a destination. You can copy all documents to the destination index or reindex a subset of the documents. You can also use source filtering to reindex a subset of the fields in the original documents.
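As a sketch of those options, the hypothetical helper below builds the reindex body and optionally adds a query (to reindex a subset of the documents) and a list of source fields (to reindex a subset of the fields). The index, pipeline and field names in the example are placeholders.

```python
import json


def build_reindex_body(source_index: str, dest_index: str, pipeline: str,
                       query: dict = None, source_fields: list = None) -> dict:
    """Build a _reindex body that routes documents through an ingest pipeline."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query  # reindex only the documents matching this query
    if source_fields is not None:
        source["_source"] = list(source_fields)  # copy only these fields
    return {
        "source": source,
        "dest": {"index": dest_index, "pipeline": pipeline},
    }


if __name__ == "__main__":
    body = build_reindex_body(
        "my-source-index",
        "my-ml-dest-index",
        "my-pipeline",
        query={"match": {"message": "Tesla"}},  # placeholder query
        source_fields=["message", "Time"],      # placeholder field names
    )
    print("POST _reindex")
    print(json.dumps(body, indent=2))
```

Keep in mind that the field renamed by the pipeline has to be included in the source fields, otherwise the inference processor will have nothing to analyze.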
This will take some time, wait for the successful response as in the image below:
This new index doesn't have a data view yet, and you need one to access the Elasticsearch data that you want to explore. To create it, click Stack Management in the Kibana menu and then click Data Views.
Click Create new data view and, in the Name field, add the name of your new index, in my case elon-output-ml. Click Create data view.
Now open Discover and select the new index.
That's it, without making changes to your current index you have a new index with the result of this model.
I hope you enjoy using NLP with the Elastic Stack! Feedback is always welcome.
This post is part of a series that covers Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.