🏖️ Generates a TL;DR of news using Natural Language Processing 🏖️
TL;DR
Uses an algorithm to determine the most important sentences in a news article and displays them at the top of the article. It only works for BBC news articles, and it isn't a proper Firefox extension yet (although you're free to create one).
Everything inside the red box has been selected by the algorithm as the most important sentences, ordered from most important to least important.
What algorithm?
Term Frequency * Inverse Document Frequency
I explain all the code and the algorithm in the blog post below. Enjoy!
We’re going to create a summary of BBC News articles and place it at the top of each article using a Firefox extension, powered by the gnarly-sounding algorithm Term Frequency-Inverse Document Frequency (TF-IDF). I know what you’re thinking. “TF-IDF? Yawn 😴” but bear with me, it’s quite interesting!
When we’re done, it’ll look like this:
I promise you it's not as hard or boring as the algorithm's name makes it sound!
Term Frequency * Inverse Document Frequency
Don’t worry, the name of the algorithm makes me fall asleep every time I hear it said out loud too. This algorithm is really 2 algorithms multiplied together. Let’s see how both of them work:
Term Frequency
Term frequency (TF) is how often a word appears in a document, divided by how many words there are.
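As a formula (this is the standard definition):

```latex
\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of words in } d}
```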
Let’s say you’re reading a news article on Brexit. The word “Brexit” will appear a lot, so the term frequency of the word “Brexit” is high.
Quite often, we want to build a dictionary (hashmap) mapping each term to its frequency, like {word: term frequency of that word}, and then iterate through it to find out which word appears the most often.
Now, what if I told you that the term frequency dictionary would look a little like this:
{"and": 0.87, "the": 0.73}
You can see how these common English words aren’t useful to us. Of course, most English texts contain them; we call words like these stopwords. Stopwords usually refer to the most common words in a language, although there isn’t one definitive list, so you have to choose your stopwords per use case. Before processing text, you’ll usually want to remove stopwords so the remaining words carry more meaning.
Words with capital letters differ from words without them. In programming, “Africa” and “africa” are two different strings. Because of this, we want to convert everything to lowercase or uppercase to better process our text. We’re going to turn all words into lowercase.
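A quick illustration in the browser console:

```javascript
// string comparison is case-sensitive, so normalising case matters
console.log("Africa" === "africa");               // false
console.log("Africa".toLowerCase() === "africa"); // true
```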
Given a string, we want to remove stopwords and turn it into lowercase. Our extension will give us a string of all the text in a BBC news article. Don’t worry about where we get the text from yet; that’s covered later in the Firefox extension section. For now, assume we have text that looks like this:
... These are external links and will open in a new windowA neat feature of many modern laptops is the ability to power them up through the USB port. Unlike the rectangular USB ports of old, the newer type - USB-C - can carry enough power to charge your machine.That’s great news: it means ...
The above text is shortened to prevent the reader from falling asleep.
```javascript
function prettify(document) {
    // Turns a document into an array of lowercase words with stopwords removed
    const stopwords = ["a", "share", "linkthese", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "", "are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can't", "cannot", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"];

    // delete commas and full stops, then split into lowercase words
    document = document.replace(/[.,]/g, '');
    let document_in_lowercase = document.split(" ").map(function(x) { return x.toLowerCase(); });

    // remove all stopwords
    return document_in_lowercase.filter(x => !stopwords.includes(x));
}
```
This is the function which will “prettify” our documents. The stopwords array is one I found on StackOverflow. I added “share” and “linkthese” since these are common words in BBC news articles that we don’t want.
The replace() call uses a regex. The square brackets mean or: [.,] means “match a comma or a full stop”. The /g flag means global: once you find one ‘,’ or ‘.’, don’t stop, continue searching the string. The empty string is what we’re replacing matches with, so every full stop or comma we find is replaced with nothing — deleted. Without this, the words “Africa.” and “Africa” would be classified as two different words.
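For example, on a small test string:

```javascript
// /[.,]/g matches every comma and full stop; replacing with '' deletes them
console.log("Africa, Africa.".replace(/[.,]/g, '')); // "Africa Africa"
```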
The split(" ") call splits the document up into separate words. The map function applies a function to every element in an array; once the string is split up into an array of words, we apply the toLowerCase() method to every element, making every word lowercase.
We then return the lowercase words once we’ve filtered out the stopwords. Filter() creates a new array containing only the elements for which the function inside returns true.

If a word is a stopword, stopwords.includes(x) returns true, which would give us a new array of only the stopwords in the document. We use the negation operator “!” to get the opposite, which is what we want: a list of words with no stopwords in it.
Now we want to count how many times each word appears in the document. This will be useful for both Term Frequency and Inverse Document Frequency. First, we want to get all the unique words from an array of words.
We convert the array into a set because sets have no repetitions. This lets us get only the unique words in the array. Sets also don’t have an order, so we can’t use array indices to access elements. We need to turn it straight back into an array. For more about Set Theory, check out this article I wrote.
Okay, now time to count how many times a word appears in the words array.
```javascript
function countWords(words) {
    // returns a dictionary of {WORD: COUNT} where count is
    // how many times that word appears in "words"
    const unique_words = uniqueWords(words);
    let dict = {};
    // for every single unique word
    for (let i = 0; i <= unique_words.length - 1; i++) {
        dict[unique_words[i]] = 0;
        // see how many times this unique word appears in all words
        for (let x = 0; x <= words.length - 1; x++) {
            if (unique_words[i] == words[x]) {
                dict[unique_words[i]] = dict[unique_words[i]] + 1;
            }
        }
    }
    return dict;
}
```
This function goes through every single unique word and counts how many times that word appears in the array of words. The term frequency function is quite long, so I’m going to break it down.
```javascript
function termFrequency(document) {
    // calculates term frequency of each sentence
    let words_without_stopwords = prettify(document);

    // gets rid of stray whitespace around each sentence
    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);

    const TFVals = countWords(words_without_stopwords);
    const unique_words = uniqueWords(words_without_stopwords);

    // actually makes it TF values according to the formula
    for (const [key, value] of Object.entries(TFVals)) {
        TFVals[key] = TFVals[key] / words_without_stopwords.length;
    }
    // (function continues below)
```
The split(".") call divides the document up into sentences. Sometimes sentences have whitespace around them: “Brandon. Dogs.” has whitespace before “Dogs.” We apply the trim() method to each item to get rid of this stray whitespace.
As for the substring(146) call: the first 146 characters of the first “sentence” are social media links, and the rest of it is the article’s title or sub-title. Here, look:
Share this withEmailFacebookMessengerMessengerTwitterPinterestWhatsAppLinkedInCopy this linkThese are external links and will open in a new window **Ryanair is tightening the rules on what passengers pay to take luggage onto the plane, in order to "speed up boarding".**
This is annoying, as the title is an essential part of the story and needs to be taken into account, while the social links don’t. So we remove the first 146 characters of the first sentence to get:
Ryanair is tightening the rules on what passengers pay to take luggage onto the plane, in order to "speed up boarding"
Remember the term frequency formula? Each word’s TF value is how many times it appears, divided by the total number of words.
The loop over TFVals applies this formula. If we run the text “Hello, my name is Brandon. Brandon Brandon. The elephant jumps over the moon” through the term frequency function, we’ll get something that looks like this:
We have the term frequencies of words, but we want to calculate the most important sentences, not words. To do that, we go through every single sentence and see which words in that sentence appear in TFVals.

We add them all up and divide by how many words we have. Since we’re only adding up the TF values of non-stopwords, it’s only fair to divide by the number of non-stopwords in the sentence, rather than the total number of words. And if we didn’t divide at all, long sentences would have an advantage over shorter ones.
This is what the second half of the function below does. We go through every single sentence and calculate the TF value of each sentence, just as described above.
```javascript
function termFrequency(document) {
    // calculates term frequency of each sentence
    let words_without_stopwords = prettify(document);

    // gets rid of stray whitespace around each sentence
    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);

    const TFVals = countWords(words_without_stopwords);
    const unique_words = uniqueWords(words_without_stopwords);

    // actually makes it TF values according to the formula
    for (const [key, value] of Object.entries(TFVals)) {
        TFVals[key] = TFVals[key] / words_without_stopwords.length;
    }

    // splits it up into sentences now
    var TFSentences = {};
    // for every sentence
    for (let i = 0; i <= sentences.length - 1; i++) {
        // for every word in that sentence
        let sentence_split_words = sentences[i].split(" ");
        // temp_add is the "TF" value of a sentence, we need to divide it at the end
        let temp_add = 0.0;
        let words_no_stop_words_length = prettify(sentences[i]).length;
        for (let x = 0; x <= sentence_split_words.length - 1; x++) {
            // get the associated TF value and add it to temp_add
            if (sentence_split_words[x].toLowerCase() in TFVals) {
                // adds all the TF values up
                temp_add = temp_add + TFVals[sentence_split_words[x].toLowerCase()];
            }
            // else: nothing, since it's a stopword
        }
        // divide by the number of non-stopwords in the sentence
        TFSentences[sentences[i]] = temp_add / words_no_stop_words_length;
    }
    return TFSentences;
}
```
And that’s it. But we have a problem with using only Term Frequency. As you may have seen earlier, “Brandon Brandon” was the highest scoring TF out of all 3 sentences we looked at.
Popularity isn’t enough. We don’t want sentences just because they have the most keywords in them; they may not make sense, or they may be repetitions of each other, as in the “Brandon Brandon” sentence. It has a high TF value but doesn’t hold much content.

It doesn’t contain much information and isn’t useful. We want sentences that are both rare and unique and contain keywords common in the article. This is where inverse document frequency comes in.
Inverse Document Frequency
Term frequency is how common a word is, inverse document frequency (IDF) is how unique or rare a word is. The formula for IDF is:
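The formula image hasn’t survived here, but the standard base-10 form (matching the Math.log10 used in the code) is:

```latex
\mathrm{IDF}(t) = \log_{10}\!\left(\frac{N}{n_t}\right)
```

where N is the total number of documents and nₜ is the number of documents containing the term t.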
IDF is used over many documents, whereas TF is calculated for one document. You can decide what a document is; in this article, each sentence is its own document.
The first few steps of IDF are the same as for TF. We prettify the document, count the words in the document and get all the unique words.
```javascript
function inverseDocumentFrequency(document) {
    // calculates the inverse document frequency of every sentence
    const words_without_stopwords = prettify(document);
    const unique_words_set = uniqueWords(words_without_stopwords);

    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);
    const lengthOfDocuments = sentences.length;

    const wordCountAll = countWords(words_without_stopwords);

    // counts the words of each sentence, as each sentence is a document;
    // prettify each sentence first so it has no stopwords
    let wordCountSentences = [];
    for (let i = 0; i <= lengthOfDocuments - 1; i++) {
        wordCountSentences.push(countWords(prettify(sentences[i])));
    }

    // calculate IDF values of all documents
    let IDFVals = {};
    // (function continues below)
```
The first half of this is nothing new. The for loop then goes through every sentence in the document. Since each sentence is a new ‘document’, we need to count the words of each sentence individually, so we prettify each sentence to get rid of the stopwords and turn it into an array of words. We push the word-count object of each sentence into wordCountSentences.
We’re now going to go through every single unique word, count how many sentences (documents) that word appears in, and calculate its IDF score: the logarithm of the total number of documents divided by the number of documents containing the word.

We do this for every non-stopword.
And the code for this is:
```javascript
function inverseDocumentFrequency(document) {
    // calculates the inverse document frequency of every sentence
    const words_without_stopwords = prettify(document);
    const unique_words_set = uniqueWords(words_without_stopwords);

    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);
    const lengthOfDocuments = sentences.length;

    const wordCountAll = countWords(words_without_stopwords);

    // counts the words of each sentence, as each sentence is a document;
    // prettify each sentence first so it has no stopwords
    let wordCountSentences = [];
    for (let i = 0; i <= lengthOfDocuments - 1; i++) {
        wordCountSentences.push(countWords(prettify(sentences[i])));
    }

    // calculate IDF values of all documents
    let IDFVals = {};

    const wordCountSentencesLength = wordCountSentences.length;
    // for every unique word
    for (let i = 0; i <= unique_words_set.length - 1; i++) {
        let temp_add = 0;
        // count how many sentences (documents) the unique word appears in
        for (let x = 0; x <= wordCountSentencesLength - 1; x++) {
            if (unique_words_set[i] in wordCountSentences[x]) {
                temp_add += 1;
            }
        }
        // IDF = log10(total number of documents / documents containing the word)
        IDFVals[unique_words_set[i]] = Math.log10(lengthOfDocuments / temp_add);
    }
    // (function continues below)
```
Now we want to get the IDF values of all the sentences. We use the same code from TF here, swapping a few names to make it work.
If I’m being truthful with you, I did a simple find-and-replace on the variables: “TF” in the comments became “IDF”, and “TFVals” became “IDFVals”. Nothing important happens here, so feel free to skip this part.
```javascript
function inverseDocumentFrequency(document) {
    // calculates the inverse document frequency of every sentence
    const words_without_stopwords = prettify(document);
    const unique_words_set = uniqueWords(words_without_stopwords);

    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);
    const lengthOfDocuments = sentences.length;

    const wordCountAll = countWords(words_without_stopwords);

    // counts the words of each sentence, as each sentence is a document;
    // prettify each sentence first so it has no stopwords
    let wordCountSentences = [];
    for (let i = 0; i <= lengthOfDocuments - 1; i++) {
        wordCountSentences.push(countWords(prettify(sentences[i])));
    }

    // calculate IDF values of all documents
    let IDFVals = {};

    const wordCountSentencesLength = wordCountSentences.length;
    // for every unique word
    for (let i = 0; i <= unique_words_set.length - 1; i++) {
        let temp_add = 0;
        // count how many sentences (documents) the unique word appears in
        for (let x = 0; x <= wordCountSentencesLength - 1; x++) {
            if (unique_words_set[i] in wordCountSentences[x]) {
                temp_add += 1;
            }
        }
        // IDF = log10(total number of documents / documents containing the word)
        IDFVals[unique_words_set[i]] = Math.log10(lengthOfDocuments / temp_add);
    }

    let IDFSentences = {};
    // for every sentence
    for (let i = 0; i <= lengthOfDocuments - 1; i++) {
        // for every word in that sentence
        let sentence_split_words = sentences[i].split(" ");
        // temp_add is the "IDF" value of a sentence, we need to divide it at the end
        let temp_add = 0.0;
        let words_no_stop_words_length = prettify(sentences[i]).length;
        for (let x = 0; x <= sentence_split_words.length - 1; x++) {
            // if the word is not a stopword, get the associated IDF value and add it to temp_add
            if (sentence_split_words[x].toLowerCase() in IDFVals) {
                // adds all the IDF values up
                temp_add = temp_add + IDFVals[sentence_split_words[x].toLowerCase()];
            }
            // else: nothing, since it's a stopword
        }
        IDFSentences[sentences[i]] = temp_add / words_no_stop_words_length;
    }
    return IDFSentences;
}
```
We now know how unique or rare a sentence is. This isn’t that useful since we want the sentence to be information-rich too. We want some way to combine the popularity of TF with the uniqueness of IDF. This leads us to our next section…
TF-IDF revisited
We now have TF and IDF functions implemented. The only thing left to do is to multiply them together.
The objects TFVals and IDFVals both stem from the same data, so TFVals isn’t going to contain a key that isn’t in IDFVals. Because of this, we can iterate through one object and use the same key in both, multiplying the value in TFVals by the value in IDFVals.
Our next step is to find the 3 most important sentences in our TF-IDF object. Iterating over the [key, value] pairs of the object with a couple of if statements works perfectly.
```javascript
function TFIDF(documents) {
    // calculates TF * IDF for every sentence
    const TFVals = termFrequency(documents);
    const IDFVals = inverseDocumentFrequency(documents);
    let TFidfDict = {};
    for (const [key, value] of Object.entries(TFVals)) {
        if (key in IDFVals) {
            TFidfDict[key] = TFVals[key] * IDFVals[key];
        }
    }

    let max = 0.0, max2 = 0.0, max3 = 0.0;
    let max_sentence = "", max2Sent = "", max3Sent = "";
    // finds the top 3 sentences in TFidfDict
    for (const [key, value] of Object.entries(TFidfDict)) {
        if (value > max) {
            // demote the previous top two before taking first place
            max3 = max2; max3Sent = max2Sent;
            max2 = max;  max2Sent = max_sentence;
            max = value; max_sentence = key;
        } else if (value > max2) {
            max3 = max2; max3Sent = max2Sent;
            max2 = value; max2Sent = key;
        } else if (value > max3) {
            max3 = value; max3Sent = key;
        }
    }
    return ("<br>" + "•" + max_sentence + "<br><br>" + "•" + max2Sent + "<br><br>" + "•" + max3Sent);
}
```
You’ll see at the bottom that we return a formatted string, so it looks nice when we insert it into the webpage. Each <br> is a line break, and each “•” is a bullet point. We’re now going to implement this algorithm in a Firefox extension. 🔥🦊
Getting & changing text in a BBC news article
Go to any BBC news article, right-click and press “inspect element”. You’ll see a nice box at the bottom of the screen. Use the element selector tool in the top left-hand corner and hover over the article. We can see that the whole article is encompassed within a CSS class of ‘story-body’.
If we go further in, we can see that all the actual text in the article is encompassed by paragraph tags, inside this CSS class.
```javascript
// get all text from .story-body within p tags on a BBC news web article
let $article = $('.story-body').find('p').text();
```
This line selects all the <p> tags within the story-body class. To get the text itself, we apply the .text() method.
We want to add our summary to the top of the article. jQuery has a method called prepend() which lets us insert content at the top of an element.
```javascript
// get all text from .story-body within p tags on a BBC news web article
let $article = $('.story-body').find('p').text();

// insert the summary at the top of the article body
let insert = $('.story-body').prepend(TFIDF($article));
```
And we’re done! We can now identify the most important sentences in a BBC News article and display them right at the top. All that’s left is to turn it into an extension.
Firefox extension basics
Firefox extensions have 2 main parts: the JavaScript you write, and the manifest.json file which tells Firefox what your extension does. Let’s go over manifest.json now.
```json
{
    "manifest_version": 2,
    "name": "TL;DR - Summary of BBC news articles",
    "version": "1.0",
    "description": "This extension creates a summary of BBC news articles using TF*IDF",
    "content_scripts": [
        {
            "matches": ["*://www.bbc.co.uk/news/*"],
            "js": ["jquery.js", "tldr.js"]
        }
    ]
}
```
manifest_version tells Firefox which version of the manifest format you’re using. name tells Firefox the name of your extension, and version its version number. These 3 are mandatory.
description tells Firefox what your extension does.
content_scripts tells Firefox which scripts to load when the URL matches one of the patterns you specify. In order for the scripts to run, the current URL must match at least one of those patterns. You can use 2 special characters here:
“ * ” matches zero or more characters. In this instance, I don’t know whether the user will load HTTP or HTTPS, so I have it set to match both. I also don’t know which exact article the user will look at, so I have it set to activate on any article.
“ ? ” matches exactly one character. For example: "*na?i" would match "illuminati" and "annunaki", but not "sagnarelli".
Since we’re going to be using jQuery, we’ll import the jQuery file into the website before our script executes. You can grab the jQuery file from here. Copy and paste it into a file named “jquery.js”.
Enter “about:debugging” into your Firefox URL to load this page:
From here, click “Load Temporary Add-on…” and then select any file in the extension’s directory. Once you do, you should see this:
Mozilla have a nice article on the basics of Firefox extensions, here.
You’ve now seen the awesome power of TF-IDF and a real-world application for it. This idea came to me because I have email anxiety: I get so nervous about reading emails that I wanted a quick summary of them to calm my thoughts. But since this was my first time ever writing JavaScript, I started out on something easier: BBC news articles.
Here are some ways you can improve upon this code, if you so wish:
Dynamically select how many sentences you want in the summary. You could find the average TF*IDF value across the whole article and include in the summary any sentence that scores above it. This way, long articles are treated as fairly as shorter ones.
Extending this to work on other websites you wish.
Want to become a better developer? Sign up to my email list. Find out more here. You'll receive 7 one-a-day articles of my best content. No spam. No commitments. Unsubscribe anytime.