Scrape but Validate: Data scraping with Pydantic Validation


Ajit Kumar

Posted on November 16, 2024


Note: Not an output of ChatGPT/LLM.

Data scraping is the process of collecting data from public web sources, and it is mostly done with scripts in an automated way. Due to the automation, the collected data often has errors and needs to be filtered and cleaned before use. It is better, however, if the scraped data can be validated during scraping itself.

Considering this validation requirement, most scraping frameworks like Scrapy have built-in patterns that can be used for data validation. However, during the data scraping process we often just use general-purpose modules like requests and BeautifulSoup. In such cases it is hard to validate the collected data, so this blog post explains a simple approach to data scraping with validation using Pydantic (https://docs.pydantic.dev/latest/).

Pydantic is a data validation module for Python. It is also the backbone of the popular API framework FastAPI. Besides Pydantic, there are other Python modules that can be used for validation during data scraping; this blog explores Pydantic, and as a learning exercise you can try replacing Pydantic with any other validation module.
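As a quick illustration of what Pydantic does, here is a minimal sketch (the Book model and its fields are made up purely for this example):

from pydantic import BaseModel, ValidationError

class Book(BaseModel):
    title: str
    pages: int

# a valid record: the string '464' is coerced to the integer 464
print(Book(title='Clean Code', pages='464'))

# an invalid record: raises ValidationError because pages cannot be parsed as an int
try:
    Book(title='Broken', pages='not-a-number')
except ValidationError as e:
    print(e)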

Plan of scraping :

In this blog, we will scrape quotes from the Quotes to Scrape site:

- Use requests and BeautifulSoup to get the data.
- Create a Pydantic data class to validate each scraped quote.
- Save the filtered and validated data in a JSON file.

For better arrangement and understanding, each step is implemented as a Python method, and the methods are then wired together under the main section.

Basic imports

import requests  # for web requests
from bs4 import BeautifulSoup  # for parsing HTML content

# pydantic for validation
from pydantic import BaseModel, field_validator, ValidationError

import json  # for writing the output file


1. Target site and getting quotes

We are using http://quotes.toscrape.com/ to scrape the quotes. Each quote has three fields: quote_text, author, and tags. For example:

[Image: an example quote showing the quote text, author, and tags]
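Scraped into a Python dictionary, the first quote on the page currently looks roughly like this (taken from the live site, so it may change):

{
    'quote_text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
    'author': 'Albert Einstein',
    'tags': ['change', 'deep-thoughts', 'thinking', 'world']
}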

The method below is a general-purpose script to get the HTML content for a given URL.


def get_html_content(page_url: str) -> str:
    page_content = ""
    # Send a GET request to the website
    response = requests.get(page_url)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        page_content = response.text
    else:
        page_content = f'Failed to retrieve the webpage. Status code: {response.status_code}'
    return page_content

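A quick usage check (this assumes the site is reachable):

page_html = get_html_content('http://quotes.toscrape.com/')
print(page_html[:60])  # the first few characters of the page's HTML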

2. Get the quote data from scraping

We will use requests and BeautifulSoup to scrape the data from the given URL. The process is broken into three parts: 1) get the HTML content from the web, 2) extract the desired HTML tags for each targeted field, and 3) get the values from each tag.


def get_tags(tags):
    # collect the text of each <a> (tag link) inside the tags div
    return [tag.get_text() for tag in tags.find_all('a')]

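To see what get_tags does in isolation, here is a small check against a hand-written snippet modelled on the site's markup (the HTML below is a simplified sketch, not copied from the site):

sample_html = '''
<div class="tags">
    Tags:
    <a class="tag" href="/tag/change/">change</a>
    <a class="tag" href="/tag/deep-thoughts/">deep-thoughts</a>
    <a class="tag" href="/tag/thinking/">thinking</a>
</div>
'''
tags_div = BeautifulSoup(sample_html, 'html.parser').find('div', class_='tags')
print(get_tags(tags_div))  # ['change', 'deep-thoughts', 'thinking']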

def get_quotes_div(html_content: str) -> list:
    # Parse the page content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all the quote divs on the page
    quotes = soup.find_all('div', class_='quote')

    return quotes
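Using the page_html fetched in the earlier snippet, the quote divs can be extracted like this (the first page of the site lists 10 quotes at the time of writing):

quotes_div = get_quotes_div(page_html)
print(len(quotes_div))  # expected: 10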

The script below (the core loop of get_quotes_data, shown in full in the next section) extracts the data points from each quote's div.

    # Loop through each quote and extract the text and author
    for quote in quotes_div:
        quote_text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = get_tags(quote.find('div', class_='tags'))

        # collect the data points into a dictionary
        quote_temp = {'quote_text': quote_text,
                'author': author,
                'tags': tags
        }

3. Create a Pydantic model and validate the data for each quote

Matching the fields of each quote, create a Pydantic class and use the same class for data validation during scraping.

The Pydantic model Quote

Below is the Quote class, extended from BaseModel, with three fields: quote_text, author, and tags. Of these, quote_text and author are of type string (str), and tags is a list.

We have two validator methods (with decorators):

1) tags_more_than_two(): checks that the quote has more than two tags. (This is just an example rule; you can apply any rule here.)

2) check_quote_text(): removes the curly quotation marks (“ and ”) that surround each quote's text.


class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator('tags')
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        if len(tags_list) <= 2:
            raise ValueError("There should be more than two tags.")
        return tags_list

    @field_validator('quote_text')
    @classmethod
    def check_quote_text(cls, quote_text: str) -> str:
        # strip the curly quotation marks that wrap each quote on the site
        return quote_text.removeprefix('“').removesuffix('”')
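A quick check of the validators with hand-made data (the values below are invented for the test):

# fewer than three tags: the tags validator should reject this
try:
    Quote(quote_text='“A test quote.”', author='Nobody', tags=['one', 'two'])
except ValidationError as e:
    print(e)  # ... There should be more than two tags.

# three tags: passes, and the curly quotes are stripped
quote = Quote(quote_text='“A test quote.”', author='Nobody', tags=['a', 'b', 'c'])
print(quote.quote_text)  # A test quote.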

Getting and validating data

Data validation is very easy with Pydantic. For example, the code below passes the scraped data to the Pydantic class Quote. If any field breaks a validation rule, Pydantic raises a ValidationError; the full function below catches it so that invalid quotes are skipped, and model_dump() converts each validated model back into a plain dictionary for JSON serialization.

quote_data = Quote(**quote_temp)

def get_quotes_data(quotes_div: list) -> list:
    quotes_data = []

    # Loop through each quote and extract the text and author
    for quote in quotes_div:
        quote_text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = get_tags(quote.find('div', class_='tags'))

        # collect the data points into a dictionary
        quote_temp = {'quote_text': quote_text,
                'author': author,
                'tags': tags
        }

        # validate data with the Pydantic model
        try:
            quote_data = Quote(**quote_temp)
            quotes_data.append(quote_data.model_dump())
        except ValidationError as e:
            print(e.json())
    return quotes_data
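When a quote fails validation, the except branch prints the error details via e.json(). With Pydantic v2 the printed output looks roughly like this (a sketch; the exact fields vary by Pydantic version):

[
    {
        "type": "value_error",
        "loc": ["tags"],
        "msg": "Value error, There should be more than two tags.",
        "input": ["one", "two"]
    }
]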

4. Store the data

Once the data is validated, it is saved to a JSON file. (A general-purpose method is written below that converts a Python list of dictionaries to a JSON file.)


def list_of_dict_to_json(list_data_dict: list, filename: str) -> None:
    """Utility method to write a Python list of dictionaries
    to a JSON file.

    Args:
        list_data_dict (list): Python list of dicts holding the data
        filename (str): Filename for the JSON file

    Example:
        example_data = [
            {'name': 'ajit', 'age': 30},
            {'name': 'Raushan', 'age': 30}
        ]

        list_of_dict_to_json(example_data, 'names')  # output: names.json
    """
    # check whether the given filename ends with .json; append it if not
    if not filename.endswith('.json'):
        filename = filename + '.json'

    # Write data to a JSON file
    with open(filename, 'w') as json_file:
        json.dump(list_data_dict, json_file, indent=4)

Putting it all together

After understanding each piece of the scraper, you can now put everything together and run the scraping for data collection.


if __name__ == '__main__':        
    # URL of the Quotes to Scrape website    
    url = 'http://quotes.toscrape.com/'
    # 1. get the content
    page_html = get_html_content(url)
    # 2. get the div for each quote
    quotes_div = get_quotes_div(page_html)
    # 3. validate and get the data point
    quotes_data = get_quotes_data(quotes_div)
    # 4. Store the data to json
    list_of_dict_to_json(quotes_data, 'quotes')

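If the run succeeds, quotes.json should contain entries shaped like the one below (a sketch based on the live site's first quote, with the curly quotes already stripped by the validator; actual content may differ):

[
    {
        "quote_text": "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.",
        "author": "Albert Einstein",
        "tags": [
            "change",
            "deep-thoughts",
            "thinking",
            "world"
        ]
    }
]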

Note: A revision is planned; let me know your ideas or suggestions to include in the revised version.

Links and resources:

- Pydantic documentation: https://docs.pydantic.dev/latest/
- Quotes to Scrape: http://quotes.toscrape.com/
