Building abstractive text summaries

davidmezzetti

David Mezzetti

Posted on March 27, 2021

Text summarization falls into two primary categories: extractive and abstractive.

Extractive summarization takes subsections of the text and joins them together to form a summary. This is commonly backed by graph algorithms like TextRank, which find the sections or sentences that have the most in common with the rest of the text. These summaries can be highly effective, but they cannot transform the text and have no contextual understanding of it.
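To make the extractive idea concrete, here is a minimal sketch in plain Python. It is not TextRank: it swaps in a much simpler word-frequency score, and the function name `extractive_summary`, the sentence-splitting rule and the four-letter word cutoff are all assumptions made for illustration. What it does show is the defining property of extractive methods, namely that every sentence in the output is copied unchanged from the input.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase words of four or more letters (a crude stand-in for stopword removal)."""
    return re.findall(r"[a-z]{4,}", text.lower())


def extractive_summary(text, max_sentences=2):
    """Hypothetical helper: return the highest-scoring sentences in document order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word frequencies across the whole document
    frequencies = Counter(tokenize(text))

    def score(sentence):
        # Average document frequency of the words in this sentence
        words = tokenize(sentence)
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Because the output is stitched together from existing sentences, it can never rephrase or merge ideas, which is the limitation abstractive summarization addresses.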

Abstractive summarization uses Natural Language Processing (NLP) models to build transformative summaries of text. This is similar to having a human read an article and then asking what it was about. A human wouldn't just give a verbatim reading of the text. This article shows how blocks of text can be summarized using an abstractive summarization pipeline.

Install dependencies

Install txtai and all dependencies. Since this article is using optional pipelines, we need to install the pipeline extras package.

pip install txtai[pipeline]

Create a Summary instance

The Summary instance is the main entrypoint for text summarization. This is a light-weight wrapper around the summarization pipeline in Hugging Face Transformers.

Beyond the default model, other summarization models can be found on the Hugging Face model hub.

from txtai.pipeline import Summary

# Create summary model
summary = Summary()

Summarize text

The example below shows how a large block of text can be distilled down into a smaller summary.

text = ("Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation "
       "of the internet and an ever-growing challenge that is never solved or done. The field of Natural Language Processing (NLP) is "
       "rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability "
       "allowing us to add amazing functionality quickly with limited compute and people. Innovation continues with new models "
       "and advancements coming in at what seems a weekly basis. This article introduces txtai, an AI-powered search engine "
       "that enables Natural Language Understanding (NLU) based search in any application."
)

summary(text, maxlength=10)
Search is the foundation of the internet

Notice how the summarizer built a new sentence using parts of the document above. It takes a basic understanding of language to connect the opening sentences and combine them into a single transformative sentence.

Summarize a document

The next section retrieves an article, extracts text from it (more to come on this topic) and summarizes that text.

!wget "https://medium.com/neuml/time-lapse-video-for-the-web-a7d8874ff397"

from txtai.pipeline import Textractor

textractor = Textractor()
text = textractor("time-lapse-video-for-the-web-a7d8874ff397")

summary(text)
Time-lapse video is a popular way to show an area or event over a long period of time. The same concept can be applied to a dynamic real-time website with frequently updated data. webelapse is an open source project developed to provide this functionality. It can be used as is or modified for different use cases.

Click through the link to see the full article. This summary does a pretty good job of covering what the article is about!
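Transformer summarization models only accept a limited amount of input, so very long documents may need to be split before they are summarized. The sketch below is one possible approach and not part of txtai: `chunk_words` and `summarize_long` are hypothetical helpers, and the 400-word default is an arbitrary assumption, since the real limit is measured in tokens and depends on the model. The `summarize` argument can be any callable that takes a string and returns a string.

```python
def chunk_words(text, max_words=400):
    """Hypothetical helper: split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(text, summarize, max_words=400):
    """Hypothetical helper: summarize each chunk, then join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))
```

With the `summary` instance created earlier, this could be called as `summarize_long(text, summary)`. Splitting on word counts can cut a sentence in half, so splitting on sentence or paragraph boundaries would likely give cleaner partial summaries.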
