Creating a Subtitle Search Engine using the Stanford Parts of Speech Tagger

Paul Preibisch

Posted on June 2, 2022

In this post, I will talk about how I integrated the Stanford NLP part-of-speech tagger into a subtitle search engine I built for Globify.com. Normally, search engines only allow the user to specify the search term they are interested in. In our use case, however, we wanted our editors to be able to specify the part of speech as well, so that they could find appropriate educational content more quickly.

The picture below is a screenshot of the parts-of-speech search engine I created. The top bar contains a query field, then a combo box that allows the user to specify Noun, Verb, Adverb, or Adjective, and then a search button. Below that are the search results. On the right is a list of videos where the search term was found. On the left is the currently selected video, along with a list of its subtitles. For each subtitle, I list the Stanford Tagger's part-of-speech conversion. The search query is highlighted in yellow, and the user can mouse over any term to see the converted part of speech.

How I did it:

Converting Each Word into Its Constituent Parts of Speech

In order to convert each word of our subtitle database into its constituent parts of speech, I needed to dive into the world of Natural Language Processing. For this, Google sent me over to the Stanford Natural Language Processing Group. The great folks at Stanford NLP have graciously created a very cool open-source POS tagger called the Stanford Log-linear Part-of-Speech Tagger. This software, however, is written in Java, so in order to integrate it with our Laravel backend I would need a PHP wrapper class. Fortunately, I was able to utilize an already existing PHP wrapper written by Patrick Schur. Once it was integrated into Laravel, I created a quick test, and wham-mo! I saw beautiful verb, adjective, and noun classifications scrolling across my screen. With a functioning prototype in place, I then wrote code to start converting our entire subtitle database… but that's when I hit a speed roadblock. In order to process several million words, I would need to dive into the world of Big Data.
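To make this concrete, here is a minimal sketch of what that first test looked like. The namespace, class, and method names below are assumptions for illustration (the sketch mimics the common model-plus-jar pattern for Stanford tagger wrappers); consult Patrick Schur's package for the real API. The tagger itself also needs a local Java runtime, plus the tagger .jar and an English model file downloaded from Stanford.

```php
<?php
// Minimal sketch: tag one subtitle line via a PHP wrapper around the
// Stanford POS tagger. Class and method names are assumptions for
// illustration; the real wrapper API may differ.
require 'vendor/autoload.php';

use StanfordTagger\POSTagger; // assumed namespace

$tagger = new POSTagger(
    '/opt/stanford-postagger/models/english-left3words-distsim.tagger', // model
    '/opt/stanford-postagger/stanford-postagger.jar'                    // jar
);

$words  = explode(' ', 'The quick brown fox jumps over the lazy dog');
$tagged = $tagger->tag($words);

// Expected shape: one [word, tag] pair per token, e.g. ['fox', 'NN']
print_r($tagged);
```

The tags that come back are Penn Treebank codes (NN for noun, VB for verb, JJ for adjective, and so on), which is what the search combo box ultimately maps onto.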

Big Data and Parallel Processing on Heroku with Laravel / Amazon SQS

Fortunately, the hosting platform Heroku allows developers to easily spin up armies of worker servers with a few key presses (and some configuration, of course 😉).

But wait a second, it's not that easy! Heroku doesn't provide the software to coordinate process delegation. For that, I utilized Laravel's built-in queueing system and connected it to Amazon's Simple Queue Service (SQS). With these tools wired up and ready, I then unleashed an army of servers on our data to churn out the parts of speech, word by word.
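As an illustration, here is a minimal sketch of what one of those queued jobs could look like. The job, model, and service names (TagSubtitleWords, Subtitle, PosTaggerService) are hypothetical; only the queue plumbing is standard Laravel.

```php
<?php

namespace App\Jobs;

use App\Models\Subtitle;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class TagSubtitleWords implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public Subtitle $subtitle;

    public function __construct(Subtitle $subtitle)
    {
        $this->subtitle = $subtitle;
    }

    public function handle(): void
    {
        // Run the POS tagger over this subtitle's text and persist the tags.
        // (PosTaggerService is a hypothetical wrapper around the Stanford tagger.)
        $tags = app(\App\Services\PosTaggerService::class)->tag($this->subtitle->text);
        $this->subtitle->update(['pos_tags' => json_encode($tags)]);
    }
}
```

Dispatching is then just `TagSubtitleWords::dispatch($subtitle);` per record. With `QUEUE_CONNECTION=sqs` in the environment (plus AWS credentials and the queue URL in config/queue.php), a Procfile entry like `worker: php artisan queue:work sqs` turns each Heroku worker dyno into a queue consumer, and `heroku ps:scale worker=20` raises the army.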

Delivering Fast Search Results with AWS Elasticsearch and Kibana

In order to search through millions of database records for the search query, a good indexing solution was needed. For this, I turned to Amazon Elasticsearch Service. I had used AWS Elasticsearch before, so it was just a matter of getting the search query right. To aid the development of the query, I used Amazon's built-in query visualizer, Kibana. This too took a bit of configuration, but once it was set up, I crafted the correct query and created an indexing job in Laravel, which I again threw at my army of Heroku servers for processing… and voilà, a subtitle search engine filtered by parts of speech!
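For a sense of what such a query can look like, here is a sketch using the official elasticsearch-php client. The index layout (one document per tagged word, with `word` and `pos` fields) and the index name `subtitle_words` are assumptions for illustration, not the actual Globify schema.

```php
<?php
// Sketch: find subtitles containing "run" used as a base-form verb (VB).
// Index and field names are illustrative assumptions.
require 'vendor/autoload.php';

$client = Elasticsearch\ClientBuilder::create()->build();

$params = [
    'index' => 'subtitle_words',
    'body'  => [
        'query' => [
            'bool' => [
                // Full-text match on the term the editor typed...
                'must'   => [['match' => ['word' => 'run']]],
                // ...filtered by the Penn Treebank tag chosen in the combo box.
                'filter' => [['term' => ['pos' => 'VB']]],
            ],
        ],
    ],
];

$results = $client->search($params);
```

Putting the part-of-speech condition in the `filter` clause keeps it out of relevance scoring and lets Elasticsearch cache it, which matters when the same tag is queried over and over.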

This project was quite fun to put together. I am amazed at the amount of power a developer can harness when utilizing Laravel Queues, Amazon SQS and Heroku!

For those interested, here is a list of some of the software used for this project:

Technologies Used

- Stanford Log-linear Part-of-Speech Tagger
- Patrick Schur's PHP wrapper for the Stanford tagger
- Laravel (with its built-in queueing system)
- Amazon Simple Queue Service (SQS)
- Heroku worker dynos
- Amazon Elasticsearch Service
- Kibana
