How to Scrape YouTube Video Data from a Playlist Using Python and BeautifulSoup
Nchanji Faithful
Posted on November 2, 2024
Are you looking to scrape essential data from YouTube video pages, like the title, channel name, publish date, view count, and video URL? In this tutorial, I’ll walk you through creating a Python script to do just that. By the end of this, you’ll be able to scrape data from a list of YouTube URLs and save it to a cleanly formatted CSV file.
Step 1: Setting Up the Environment
To get started, you’ll need to have Python installed on your computer. If you don’t have it already, you can download it from python.org.
Install Required Libraries
We’ll use requests for making HTTP requests and BeautifulSoup from bs4 for parsing HTML content. Open your terminal and run:
pip install requests beautifulsoup4
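Before moving on, it can help to confirm that both packages are importable from the interpreter you plan to run the script with. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of the tutorial's script.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `bs4` is the import name of the beautifulsoup4 package.
missing = missing_packages(["requests", "bs4"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported as missing, re-run the pip command above with the same Python installation you will use to run the script.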
Step 2: Writing the Code
We’ll create a script that reads video URLs from a CSV file, extracts video details using BeautifulSoup, and writes the collected data into a new CSV file.
Full Code
Here's the complete Python script:
import requests
from bs4 import BeautifulSoup
import csv


def extract_youtube_data(url):
    """Extracts relevant data from a YouTube video page.

    Args:
        url: The URL of the YouTube video.

    Returns:
        A dictionary containing the extracted data.
    """
    # A timeout keeps one unresponsive request from hanging the whole run
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract title
    title_element = soup.find('meta', itemprop='name')
    title = title_element['content'] if title_element else "N/A"

    # Extract channel name
    channel_name_element = soup.find('link', itemprop='name')
    channel_name = channel_name_element['content'] if channel_name_element else "N/A"

    # Extract publish date
    publish_date_element = soup.find('meta', itemprop='datePublished')
    publish_date = publish_date_element['content'] if publish_date_element else "N/A"

    # Extract view count
    view_count_element = soup.find('meta', itemprop='interactionCount')
    view_count = view_count_element['content'] if view_count_element else "N/A"

    return {
        'Title': title,
        'Channel Name': channel_name,
        'Publish Date': publish_date,
        'View Count': view_count,
        'URL': url
    }


def main():
    input_file = 'my_data.csv'  # CSV file with URLs
    output_file = 'youtube_data_output.csv'

    with open(input_file, 'r', newline='', encoding='utf-8') as file:
        reader = csv.reader(file)
        # Skip blank lines, which csv.reader yields as empty rows
        urls = [row[0] for row in reader if row]

    with open(output_file, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['Number', 'Title', 'Channel Name', 'Publish Date', 'View Count', 'URL'])
        writer.writeheader()

        for i, url in enumerate(urls, start=1):
            data = extract_youtube_data(url)
            # Add the 'Number' field to the data dictionary
            data['Number'] = i
            writer.writerow(data)
            print(f"Processed: {i}")


if __name__ == '__main__':
    main()
Step 3: Understanding the Code
Let’s break down what each part of the code does.
1. Import Libraries
import requests
from bs4 import BeautifulSoup
import csv
- requests: Used to send HTTP requests to fetch YouTube video pages.
- BeautifulSoup: Parses and extracts data from HTML content.
- csv: Handles reading and writing CSV files.
2. Function to Extract YouTube Data
def extract_youtube_data(url):
    # Fetch the page content (the timeout avoids hanging on a stalled request)
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the title
    title_element = soup.find('meta', itemprop='name')
    title = title_element['content'] if title_element else "N/A"

    # Extract the channel name
    channel_name_element = soup.find('link', itemprop='name')
    channel_name = channel_name_element['content'] if channel_name_element else "N/A"

    # Extract the publish date
    publish_date_element = soup.find('meta', itemprop='datePublished')
    publish_date = publish_date_element['content'] if publish_date_element else "N/A"

    # Extract the view count
    view_count_element = soup.find('meta', itemprop='interactionCount')
    view_count = view_count_element['content'] if view_count_element else "N/A"

    # Return all extracted data
    return {
        'Title': title,
        'Channel Name': channel_name,
        'Publish Date': publish_date,
        'View Count': view_count,
        'URL': url
    }
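To see how these itemprop lookups behave without making a network request, here is a small sketch that runs the same kind of soup.find calls against a hand-written HTML snippet. The sample markup and the helper name content_of are assumptions made for illustration; a real YouTube page is far larger and its markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption for illustration, not a real YouTube response).
SAMPLE_HTML = """
<html><body>
<div itemscope itemtype="http://schema.org/VideoObject">
  <meta itemprop="name" content="Sample Video Title">
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <link itemprop="name" content="Sample Channel">
  </span>
  <meta itemprop="datePublished" content="2024-11-02">
  <meta itemprop="interactionCount" content="12345">
</div>
</body></html>
"""


def content_of(soup, tag, itemprop, default="N/A"):
    """Return the `content` attribute of the first matching tag, or `default`."""
    element = soup.find(tag, itemprop=itemprop)
    if element is None:
        return default
    return element.get("content", default)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title = content_of(soup, "meta", "name")
channel = content_of(soup, "link", "name")
published = content_of(soup, "meta", "datePublished")
views = content_of(soup, "meta", "interactionCount")
missing = content_of(soup, "meta", "genre")  # not present in the sample

print(title, channel, published, views, missing)
```

A helper like this removes the repeated "find, then check for None" pattern, and using .get also covers the case where a tag exists but has no content attribute.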
3. The Main Function
def main():
    input_file = 'my_data.csv'  # CSV file with URLs
    output_file = 'youtube_data_output.csv'

    # Read URLs from the input CSV, skipping blank lines
    with open(input_file, 'r', newline='', encoding='utf-8') as file:
        reader = csv.reader(file)
        urls = [row[0] for row in reader if row]

    # Open the output file for writing
    with open(output_file, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['Number', 'Title', 'Channel Name', 'Publish Date', 'View Count', 'URL'])
        writer.writeheader()

        # Loop through each URL, extract data, and write to the CSV
        for i, url in enumerate(urls, start=1):
            data = extract_youtube_data(url)
            data['Number'] = i
            writer.writerow(data)
            print(f"Processed: {i}")
Explanation
- Input File: my_data.csv is expected to contain the list of YouTube video URLs.
- Output File: youtube_data_output.csv will store the extracted data.
- Progress Indicator: The script prints the number of videos processed to keep track.
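One detail worth seeing in isolation is that csv.DictWriter orders columns by its fieldnames list, not by the order in which keys were added to each dictionary. That is why 'Number' can be added last and still appear first in the output. The sketch below writes to an in-memory buffer instead of a file, and the rows are made-up stand-ins for what extract_youtube_data would return.

```python
import csv
import io

FIELDNAMES = ["Number", "Title", "Channel Name", "Publish Date", "View Count", "URL"]

# Made-up rows standing in for what extract_youtube_data() would return.
rows = [
    {"Title": "First video", "Channel Name": "Channel A", "Publish Date": "2024-01-01",
     "View Count": "100", "URL": "https://www.youtube.com/watch?v=VIDEO_ID1"},
    {"Title": "Second, with a comma", "Channel Name": "Channel B", "Publish Date": "N/A",
     "View Count": "N/A", "URL": "https://www.youtube.com/watch?v=VIDEO_ID2"},
]

buffer = io.StringIO(newline="")  # in-memory stand-in for the output file
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
for i, data in enumerate(rows, start=1):
    data["Number"] = i  # added last, but written first because of FIELDNAMES
    writer.writerow(data)

csv_text = buffer.getvalue()
print(csv_text)
```

The second row also shows that the csv module quotes a title containing a comma, so titles with punctuation do not break the column layout.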
Step 4: Preparing Your Input CSV
Create a file named my_data.csv in the same directory as the script. This file should contain one YouTube video URL per line, like so:
https://www.youtube.com/watch?v=VIDEO_ID1
https://www.youtube.com/watch?v=VIDEO_ID2
...
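Hand-edited input files often pick up blank lines or stray spaces. As a rough sketch of how the list could be cleaned before scraping, the code below reads the first column, skips empty rows, and pulls the video ID out of each watch URL. The function names read_video_urls and video_id are hypothetical, and the ID lookup only covers the watch?v= form shown above.

```python
import csv
import io
from urllib.parse import parse_qs, urlparse


def read_video_urls(lines):
    """Return cleaned URLs from the first column, skipping blank rows."""
    urls = []
    for row in csv.reader(lines):
        if not row or not row[0].strip():
            continue  # blank line or empty first cell
        urls.append(row[0].strip())
    return urls


def video_id(url):
    """Return the `v` query parameter of a watch URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


# In-memory stand-in for my_data.csv, including a blank line and extra spaces.
sample = io.StringIO(
    "https://www.youtube.com/watch?v=VIDEO_ID1\n"
    "\n"
    "  https://www.youtube.com/watch?v=VIDEO_ID2  \n"
    "https://example.com/not-a-video\n"
)
urls = read_video_urls(sample)
ids = [video_id(u) for u in urls]
print(list(zip(urls, ids)))
```

A URL whose ID comes back as None is a candidate for skipping or logging rather than fetching.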
Step 5: Running the Script
To run the script, open your terminal and navigate to the directory where the script is saved. Then, execute:
python your_script_name.py
The script will fetch data from each URL, extract the relevant details, and write them to youtube_data_output.csv.
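Once the run finishes, a quick sanity check of the output can catch rows where extraction fell back to "N/A". The sketch below is one possible way to do that with csv.DictReader; the helper names parse_view_count and summarize are hypothetical, and the sample text stands in for the real output file.

```python
import csv
import io


def parse_view_count(value):
    """Convert a view-count string to int, or None when missing or non-numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def summarize(csv_file):
    """Return (number_of_rows, total_views) for an open output CSV."""
    rows = list(csv.DictReader(csv_file))
    counts = [parse_view_count(row.get("View Count")) for row in rows]
    total = sum(c for c in counts if c is not None)
    return len(rows), total


# In-memory stand-in for youtube_data_output.csv with made-up values.
sample_output = io.StringIO(
    "Number,Title,Channel Name,Publish Date,View Count,URL\n"
    "1,First video,Channel A,2024-01-01,100,https://www.youtube.com/watch?v=VIDEO_ID1\n"
    "2,Second video,Channel B,N/A,N/A,https://www.youtube.com/watch?v=VIDEO_ID2\n"
)
row_count, total_views = summarize(sample_output)
print(row_count, total_views)
```

To check a real run, pass a file opened with open('youtube_data_output.csv', newline='', encoding='utf-8') to summarize instead of the sample buffer.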
Conclusion
You now have a fully functional script that can scrape data from YouTube videos and save it to a CSV file. This is especially useful for analyzing video details for research, content management, or SEO purposes.
Feel free to extend the script further by adding more data fields or refining the extraction logic. Happy scraping!
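One refinement to consider is making the loop tolerant of failures, so that a single bad URL does not stop the whole batch. The sketch below is an assumption about how that could look rather than part of the original script: the function name collect is hypothetical, and fake_extract is a stand-in so the example runs without network access. In practice you would pass extract_youtube_data instead.

```python
import time


def collect(urls, extract, delay_seconds=0.0):
    """Call `extract(url)` for each URL, recording failures instead of stopping.

    `extract` is any callable that returns a dict for a URL, such as
    extract_youtube_data from the script above.
    """
    results, failures = [], []
    for i, url in enumerate(urls, start=1):
        try:
            data = extract(url)
        except Exception as exc:  # broad on purpose: keep the batch going
            failures.append((url, str(exc)))
            continue
        data["Number"] = i
        results.append(data)
        if delay_seconds:
            time.sleep(delay_seconds)  # small pause between requests
    return results, failures


def fake_extract(url):
    """Stand-in extractor for demonstration; fails for one URL."""
    if "BROKEN" in url:
        raise ValueError("could not fetch page")
    return {"Title": "Sample", "URL": url}


results, failures = collect(
    ["https://www.youtube.com/watch?v=VIDEO_ID1",
     "https://www.youtube.com/watch?v=BROKEN",
     "https://www.youtube.com/watch?v=VIDEO_ID2"],
    fake_extract,
)
print(len(results), "succeeded;", len(failures), "failed")
```

Keeping the failures in a separate list makes it easy to retry them later or write them to their own file.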
Have questions or suggestions? Drop a comment below!