Find all texts of a webpage with Python

treadalex

Alex Tread

Posted on October 15, 2022

In this article, we will write a small script that extracts all the text from a web page with Python.

To do this, we will use two Python libraries: Beautiful Soup and requests.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

So let's get started...

First of all, we create a project directory and a virtual environment:

mkdir TextExtractor && cd TextExtractor
python3 -m venv .venv

Then we activate this environment

source .venv/bin/activate

We install the Beautiful Soup library:

pip install beautifulsoup4==4.11.1

We install the requests library:

pip install requests

Then we create a file called main.py

In this file, we first import the requests and Beautiful Soup libraries:

import requests
from bs4 import BeautifulSoup

We will use Medium.com as an example website.

First, we create a function that fetches the content of a page with the requests library:

def get_page_content(page_url):
    response = requests.get(page_url)
    # Only return the body when the request succeeded
    if response.status_code == 200:
        return response.content
    return None
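The function above waits indefinitely if the server never answers, and a network error raises an exception. As a rough sketch of a more defensive variant, we could add a timeout and catch request errors. The name `fetch_html` and the 10-second default are assumptions made for this example, not part of the original script.

```python
import requests


def fetch_html(page_url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    try:
        response = requests.get(page_url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException:
        # Covers connection errors, timeouts and HTTP error statuses
        return None
    return response.content
```

Unlike `get_page_content`, this version also returns `None` when the connection itself fails, and it accepts any successful status code rather than only 200.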

We can get the content like this

content = get_page_content('https://medium.com/')

To parse the HTML, we create the soup object from the content we just fetched, naming the parser explicitly:

soup = BeautifulSoup(content, "html.parser")

We find all the text nodes in the document this way (each result is a string object, not a tag):

tags_with_text = soup.find_all(text=True)

Then we can get the text list:

texts = [tag.text for tag in tags_with_text]

This will return a list of texts like this:

['', 'Medium – Where good ... find you.', '{"@context":"http:\\u...ght":168}}',...]

Here we see that the list contains a lot of script content and other text that we don't want.

We need to ignore the text inside some tags:

TAGS_TO_IGNORE = ['script', 'style', 'meta']

Because find_all(text=True) returns text nodes rather than tags, we check the name of each node's parent tag. This one-liner also skips strings that are empty after stripping:

texts = [tag.text.strip() for tag in tags_with_text if (tag.text.strip() and tag.parent.name not in TAGS_TO_IGNORE)]
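To see the parent-tag filter at work without fetching a real page, here is a minimal sketch on an inline HTML string. The sample markup is made up for illustration, and the example uses `string=True`, which is the newer name for the `text=True` argument in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Made-up sample page: one style block, one script block, two visible texts
html = """
<html>
  <head><title>Demo</title><style>p { color: red; }</style></head>
  <body>
    <p>Hello <b>world</b></p>
    <script>console.log("hi");</script>
  </body>
</html>
"""

TAGS_TO_IGNORE = {"script", "style", "meta"}

soup = BeautifulSoup(html, "html.parser")
# Each result is a text node; its enclosing tag is available as .parent
nodes = soup.find_all(string=True)

texts = [
    node.strip()
    for node in nodes
    if node.strip() and node.parent.name not in TAGS_TO_IGNORE
]
print(texts)  # ['Demo', 'Hello', 'world']
```

The CSS rule and the JavaScript line are dropped because their parent tags are `style` and `script`, and the whitespace-only nodes between tags are dropped by the `node.strip()` check.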

We can wrap all of this in a function that returns the texts from a page:

def get_texts_from_page(page_url):
    content = get_page_content(page_url)
    if content is None:
        # The request failed, so there is nothing to parse
        return []
    soup = BeautifulSoup(content, "html.parser")
    # find_all(text=True) returns text nodes, so we filter on the parent tag
    tags_with_text = soup.find_all(text=True)
    TAGS_TO_IGNORE = ['script', 'style', 'meta']
    texts = [tag.text.strip() for tag in tags_with_text if (tag.text.strip() and tag.parent.name not in TAGS_TO_IGNORE)]

    return list(set(texts))
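Note that `list(set(texts))` removes duplicates but returns the texts in an arbitrary order. If the order in which texts appear on the page matters, one option is sketched below. The helper name `unique_in_order` and the sample list are assumptions for illustration only.

```python
def unique_in_order(texts):
    """Drop duplicates while keeping the first occurrence of each text."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(texts))


# Made-up sample data with repeated entries
sample = ["Home", "About", "Home", "Sign in", "About"]
print(unique_in_order(sample))  # ['Home', 'About', 'Sign in']
```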

And here is the complete main.py file:

import requests
from bs4 import BeautifulSoup


def get_page_content(page_url):
    response = requests.get(page_url)
    # Only return the body when the request succeeded
    if response.status_code == 200:
        return response.content
    return None


def get_texts_from_page(page_url):
    content = get_page_content(page_url)
    if content is None:
        # The request failed, so there is nothing to parse
        return []
    soup = BeautifulSoup(content, "html.parser")
    # find_all(text=True) returns text nodes, so we filter on the parent tag
    tags_with_text = soup.find_all(text=True)
    TAGS_TO_IGNORE = ['script', 'style', 'meta']
    texts = [tag.text.strip() for tag in tags_with_text if (tag.text.strip() and tag.parent.name not in TAGS_TO_IGNORE)]
    return list(set(texts))

# USAGE
texts = get_texts_from_page('https://medium.com/')
print(texts)
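Beautiful Soup also offers a built-in way to collect stripped text, which may be simpler in some cases. The sketch below removes the unwanted tags from the tree first and then reads `stripped_strings`. The function name `visible_texts` and the sample HTML are assumptions for this example, and this is an alternative to the approach above rather than a replacement for it.

```python
from bs4 import BeautifulSoup


def visible_texts(html):
    """Return the stripped texts of a page, skipping script and style content."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the unwanted tags, together with their contents, from the tree
    for tag in soup(["script", "style"]):
        tag.decompose()
    # stripped_strings yields each text with surrounding whitespace removed
    # and skips strings that are only whitespace
    return list(soup.stripped_strings)


# Made-up sample markup for illustration
sample_html = (
    "<html><head><style>p{}</style></head>"
    "<body><p> Hello <b>world</b> </p><script>var x = 1;</script></body></html>"
)
print(visible_texts(sample_html))  # ['Hello', 'world']
```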

ENJOY!

Reference:
pybuddy.com
