How I built a real-time Machine Learning system with Kafka, Elasticsearch, Kibana, and Docker
Dipankar Medhi
Posted on December 28, 2022
We will design and build a real-time sentiment analysis and hate detection system.
I built this project for the "Turn Language into Action" Natural Language Hackathon by Expert.ai.
I have always been interested in real-time systems and have always wondered how things work under the hood.
HOW? 🤔
So, I found this hackathon to be a perfect opportunity for me to learn and build something new.
Well then, let's ROLL!!!
Project Architecture
This is what the complete pipeline looks like. Don't worry, I will cover everything in detail.
But before we move on with the tools and architecture, let me talk about our data sources.
I used the Twitter API for real-time tweets, specifically Python's tweepy library for streaming them. In addition, I used NewsAPI for daily news articles.
I used Docker to run all the necessary tools for this project as containers.
Now let's talk about each component.
Apache Kafka
For ingesting the real-time data, I have used Apache Kafka.
Now, what is Apache Kafka? In IBM's words:
"Apache Kafka (Kafka) is an open source, distributed streaming platform that enables (among other things) the development of real-time, event-driven applications." (IBM)
Since I used Python, I went with the kafka-python client, which makes working with Kafka relatively easy.
Using the KafkaProducer, I sent the messages (from Twitter and NewsAPI) to the KafkaConsumer via two Kafka topics: one for the tweets and one for the news articles.
The KafkaConsumer then calls the Machine Learning service to classify the sentiment of the news articles and detect hate speech in the tweets.
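To make the producer and consumer flow above more concrete, here is a minimal sketch with kafka-python. It is not the project's actual code: the topic names, the broker address localhost:9092, the message shape, and the helper names (publish, handle, run_pipeline) are all assumptions made for illustration, and the two classifier callables are placeholders for whatever ML service you plug in.

```python
import json

# Assumed topic names; the post only says there is one topic per source.
TWEETS_TOPIC = "tweets"
NEWS_TOPIC = "news"


def serialize(message: dict) -> bytes:
    """Encode a message as UTF-8 JSON bytes before it is sent to Kafka."""
    return json.dumps(message).encode("utf-8")


def deserialize(raw: bytes) -> dict:
    """Decode the JSON bytes read back from Kafka."""
    return json.loads(raw.decode("utf-8"))


def publish(producer, source: str, message: dict) -> str:
    """Send a message to the topic for its source and return the topic name."""
    topic = TWEETS_TOPIC if source == "twitter" else NEWS_TOPIC
    producer.send(topic, value=message)
    return topic


def handle(topic: str, message: dict, detect_hate, classify_sentiment) -> dict:
    """Route a consumed message to the matching ML call and attach the result."""
    if topic == TWEETS_TOPIC:
        return {**message, "hate": detect_hate(message["text"])}
    return {**message, "sentiment": classify_sentiment(message["text"])}


def run_pipeline(detect_hate, classify_sentiment, servers="localhost:9092"):
    """Needs a running broker; the address above is an assumption."""
    # Imported here so the helpers above work without kafka-python installed.
    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(bootstrap_servers=servers, value_serializer=serialize)
    publish(producer, "twitter", {"text": "an example tweet"})
    producer.flush()

    consumer = KafkaConsumer(
        TWEETS_TOPIC,
        NEWS_TOPIC,
        bootstrap_servers=servers,
        value_deserializer=deserialize,
        auto_offset_reset="earliest",
    )
    for record in consumer:  # blocks and yields messages as they arrive
        print(handle(record.topic, record.value, detect_hate, classify_sentiment))
```

Keeping the routing in plain functions means you can try it out with a fake producer before any Kafka container is running.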
Machine Learning service
Expert.ai turns language into data so teams can make better decisions.
Since I built this project as a part of the Expert.ai hackathon, I have used their API for sentiment analysis/classification and hate detection.
However, you can always use your own TensorFlow or PyTorch model. Hugging Face also has some very relevant models for sentiment classification, and they are straightforward to set up. You should check them out!
I am using the sentiment analysis and hate speech detection capabilities of the Expert.ai NL API.
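If you want to try the Hugging Face route instead of the Expert.ai API, a rough sketch could look like the one below. This is an alternative, not what the project uses: it assumes the transformers package is installed, relies on whatever default model the "sentiment-analysis" pipeline downloads, and the helper names top_label and build_sentiment_classifier are made up for this example.

```python
def top_label(predictions):
    """Return the label with the highest score from a list of
    {"label": ..., "score": ...} dicts, the shape the pipeline returns."""
    return max(predictions, key=lambda p: p["score"])["label"]


def build_sentiment_classifier():
    """Create a text -> label function backed by a Hugging Face pipeline.

    The import is local so the rest of the module loads without transformers.
    The first call downloads a default model, which needs network access.
    """
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")
    return lambda text: top_label(classifier(text))
```

The function returned by build_sentiment_classifier takes a string and returns a label, so it can stand in for the sentiment call in the consumer.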
Elasticsearch
Okay, we have the classified data. Now what?
We have to store that data somewhere to use it for further analytics. I used Elasticsearch to store the data and Kibana to visualize it.
You might ask, why Kibana?
Let me introduce you to the ELK stack.
"ELK is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a stash like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch." (Elastic.co)
Elasticsearch, Logstash, and Kibana go hand in hand in most data engineering or data ingestion use cases. But I have omitted Logstash to keep the pipeline simple and focused on its goal.
But you can always add Logstash and scale the pipeline further as needed.
That is enough about the ELK stack. Let's jump into the Elasticsearch design.
Elasticsearch: The Official Distributed Search & Analytics Engine
Elasticsearch has "indexes", which play a role similar to databases in a relational system. Each index stores documents whose fields are defined by a mapping, which is much like a schema in other databases.
The mapping describes the fields in the JSON documents along with their data types, as well as how each field should be indexed.
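As a rough illustration of what a mapping looks like, here is a sketch for a tweets index. The index name and every field in it are hypothetical, since this post does not list the real mappings. The two helper functions assume a client from the official elasticsearch Python package (8.x style keyword arguments); older client versions may expect a body argument instead.

```python
# Hypothetical index name and fields, chosen only for illustration.
TWEETS_INDEX = "tweets"

TWEETS_MAPPING = {
    "properties": {
        "text": {"type": "text"},        # analyzed for full-text search
        "author": {"type": "keyword"},   # exact value, useful for aggregations
        "hate": {"type": "boolean"},     # result of the hate speech check
        "created_at": {"type": "date"},  # enables time-based charts in Kibana
    }
}


def create_index(client, name=TWEETS_INDEX, mapping=TWEETS_MAPPING):
    """Create the index with an explicit mapping (assumes an 8.x style client)."""
    client.indices.create(index=name, mappings=mapping)


def store(client, doc, name=TWEETS_INDEX):
    """Index one classified message as a JSON document."""
    client.index(index=name, document=doc)
```

With a real cluster you would pass something like Elasticsearch("http://localhost:9200") as the client; that URL is also an assumption.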
Databases ~ Indexes
The above image will give you a better idea of how Elasticsearch indexes compare to databases in MySQL or PostgreSQL.
Kibana
Done with storing the messages in the Elasticsearch indexes? Okay, great! We can finally visualize that data and get more insights from it.
We use Kibana for that.
Kibana: Explore, Visualize, Discover Data | Elastic
Kibana is a free and open user interface that lets you visualize your Elasticsearch data and navigate the Elastic Stack.
Kibana Dashboard
This is what my final Kibana dashboard looks like. You can check out the code at my GitHub repo.
Feel free to leave a star if you like the project.
This part covers only the idea and overview of the project, along with its architecture. I'll soon add the coding section in a separate part, so stay tuned for that.
That's all, folks. See you soon 👋
Happy coding.