Find all texts of a webpage with Python

In this article, we will write a small script that extracts all the text from a web page with Python.

To implement this, we will use two Python libraries: Beautiful Soup and requests.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

So let's get started...

First of all, we create a project folder and a virtual environment:

mkdir TextExtractor && cd TextExtractor
python3 -m venv .venv

Then we activate this environment:

source .venv/bin/activate

We install the Beautiful Soup library:

pip install beautifulsoup4==4.11.1

We install the requests library:

pip install requests

Then we create a file called main.py.

In this file, we first import the requests and Beautiful Soup libraries:

import requests 
from bs4 import BeautifulSoup

As an example, we will use Medium.com.

First, we create a function that fetches the content of a page with the requests library:

def get_page_content(page_url):
    response = requests.get(page_url)
    if response.status_code == 200:
        return response.content
    return None

We can get the content like this:

content = get_page_content('https://medium.com/')
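In practice a request can hang or fail before any status code comes back. The sketch below is one possible hardened variant, not part of the original script: the function name fetch_html and the 10-second default timeout are assumptions chosen for illustration. Unlike get_page_content, it accepts any non-error status rather than only 200.

```python
import requests


def fetch_html(page_url, timeout=10.0):
    """Return the raw page bytes, or None if the request fails.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    try:
        # Without a timeout, requests can wait indefinitely on a stalled server.
        response = requests.get(page_url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx responses.
        response.raise_for_status()
    except requests.RequestException as error:
        # Covers connection errors, timeouts, and HTTP error statuses.
        print(f"Could not fetch {page_url}: {error}")
        return None
    return response.content
```

It can be called the same way as get_page_content, for example fetch_html('https://medium.com/'), and the caller checks for None before parsing.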

To parse the text, we need to create the soup object from the downloaded content like this:

soup = BeautifulSoup(content, "html.parser")
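Naming the parser explicitly avoids Beautiful Soup's warning about guessing one and keeps results consistent across machines. As a small offline sketch, the markup below is made up for illustration and parsed with Python's built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; a real page would come from requests.
html = b"<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>"

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # Demo page
print(soup.p.get_text())  # Hi
```

The same constructor accepts either bytes (such as response.content) or a str.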

We find all the text nodes of the page this way:

tags_with_text = soup.find_all(string=True)

Then we can get the text list:

texts = [tag.text for tag in tags_with_text]

This will return a list of texts like this:

['', 'Medium – Where good ... find you.', '{"@context":"http:\\u...ght":168}}',...]

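Here we see a lot of script content and other text that we don't want.

We need to ignore the text inside some tags:

TAGS_TO_IGNORE = ['script','style', 'meta']

Because find_all returns text nodes rather than tags, we check the name of each node's parent tag. We also skip nodes that are empty after stripping whitespace. We get all the texts with this one-liner:

texts = [tag.text.strip() for tag in tags_with_text if (tag.text.strip() and tag.parent.name not in TAGS_TO_IGNORE)]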
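Each item returned by a string search is a text node, and the tag that encloses it is reachable through its .parent attribute. The following offline sketch shows a filter based on the parent tag; the sample markup and the names SAMPLE_HTML, text_nodes, and visible_texts are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo</title>
    <style>p { color: red; }</style>
    <script>console.log("hidden");</script>
  </head>
  <body>
    <h1>Hello</h1>
    <p>Some <b>bold</b> text.</p>
  </body>
</html>
"""

TAGS_TO_IGNORE = ['script', 'style', 'meta']

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
text_nodes = soup.find_all(string=True)

# A text node is a str subclass, so strip() works on it directly.
# Keep a node only if it is non-empty and its enclosing tag is not ignored.
visible_texts = [
    node.strip()
    for node in text_nodes
    if node.strip() and node.parent.name not in TAGS_TO_IGNORE
]

print(visible_texts)
# ['Demo', 'Hello', 'Some', 'bold', 'text.']
```

The style and script contents are dropped, while the text of nested tags such as the bold word is kept as its own entry.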

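We can wrap this in a function that returns all the texts from a page:

def get_texts_from_page(page_url):
    content = get_page_content(page_url)
    if content is None:
        # The request did not return 200, so there is nothing to parse
        return []
    soup = BeautifulSoup(content, "html.parser")
    # find_all(string=True) returns text nodes, not tags
    tags_with_text = soup.find_all(string=True)
    TAGS_TO_IGNORE = ['script','style', 'meta']
    texts = [tag.text.strip() for tag in tags_with_text if (tag.text.strip() and tag.parent.name not in TAGS_TO_IGNORE)]
    # set() removes duplicates but does not keep the original order
    return list(set(texts))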
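Returning list(set(texts)) removes duplicates, but a set does not keep the order in which the texts appeared on the page. If page order matters, one option is to deduplicate with dict.fromkeys instead. The helper name unique_in_order and the sample list are hypothetical, used only for illustration.

```python
def unique_in_order(texts):
    """Drop duplicates while keeping the first occurrence of each text."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(texts))


print(unique_in_order(['Sign in', 'Home', 'Sign in', 'About', 'Home']))
# ['Sign in', 'Home', 'About']
```

Swapping this in for list(set(texts)) would keep the function's result in reading order, at the cost of one extra helper.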

Here is the complete main.py file:

import requests
from bs4 import BeautifulSoup


def get_page_content(page_url):
    response = requests.get(page_url)
    if response.status_code == 200:
        return response.content
    return None


def get_texts_from_page(page_url):
    content = get_page_content(page_url)
    if content is None:
        # The request did not return 200, so there is nothing to parse
        return []
    soup = BeautifulSoup(content, "html.parser")
    # find_all(string=True) returns text nodes, not tags
    tags_with_text = soup.find_all(string=True)
    TAGS_TO_IGNORE = ['script','style', 'meta']
    texts = [tag.text.strip() for tag in tags_with_text if (tag.text.strip() and tag.parent.name not in TAGS_TO_IGNORE)]
    # set() removes duplicates but does not keep the original order
    return list(set(texts))

# USAGE
texts = get_texts_from_page('https://medium.com/')

ENJOY!

Reference:
pybuddy.com

Alex Tread
