David MM👨🏻‍💻
Posted on September 6, 2019
Original post: Python scrapy tutorial for beginners – 01 – Creating your first spider
Learn how to fetch the data of any website with Python and the Scrapy framework in just minutes. In the first lesson of 'Python scrapy tutorial for beginners', we will scrape the data from a book store, extracting all the information and storing it in a file.
In this post you will learn:
- How to prepare your environment and install everything
- How to create a Scrapy project and spider
- How to fetch the data from the HTML
- How to manipulate the data and extract only the data you want
- How to store the data into a .json, .csv and .xml file
Preparing your environment and installing everything
Before anything, we need to prepare our environment and install everything.
In Python, we create virtual environments to have a separate environment with different dependencies.
For example, Project1 has Python 3.4 and Scrapy 1.2, and Project2 Python 3.7.4 and Scrapy 1.7.3.
As we keep separate environments, one for each project, we will never have conflicts caused by different package versions.
You can use Conda, virtualenv or Pipenv to create a virtual environment. In this course, I will use Pipenv. You only need to install it with pip install pipenv and create a new virtual environment with pipenv shell.
Once you are set, install Scrapy with pip install scrapy. That's all you need.
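To recap, the whole setup boils down to three commands (a quick sketch; run them from the folder where you want to work):
pip install pipenv
pipenv shell
pip install scrapy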
Time to create the project and your spider.
Creating a project and a spider – And what they are
Before anything, we need to create a Scrapy project. In your current folder, enter:
scrapy startproject books
This will create a project named 'books'. Inside you'll find a few files. I'll explain them in a more detailed post but here's a brief explanation:
books/
    scrapy.cfg            <-- Configuration file (DO NOT TOUCH!)
    books/
        __init__.py       <-- Empty file that marks this as a Python package
        items.py          <-- Model of the item to scrape
        middlewares.py    <-- Scrapy processing hooks (DO NOT TOUCH)
        pipelines.py      <-- What to do with the scraped item
        settings.py       <-- Project settings file
        spiders/          <-- Directory of our spiders (empty by now)
            __init__.py
After creating the project, navigate into it (cd books) and, once inside the folder, create a spider by passing it a name and the root URL without 'www':
scrapy genspider spider books.toscrape.com
Now we have our spider inside the spiders folder! You will have something like this:
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
First, we import scrapy. Then, a class is created that inherits 'Spider' from Scrapy. That class has three variables and one method.
The variables are the spider's name, the allowed_domains and the start_urls. Pretty self-explanatory: the name is what we will use in a second to run the spider, allowed_domains limits the scope of the scraping process (the spider can't visit any URL outside the domains specified here) and start_urls is the list of starting points of the Scrapy spider. In this case, just one.
The parse method is called internally when we start the Scrapy spider. Right now it only contains 'pass': it does nothing. Let's solve that.
How to fetch data from the HTML
We are going to query the HTML and to do so we need Xpath, a query language. Don't you worry, even if it seems weird at first, it is easy to learn as all you need are a few functions.
Parse method
But first, let's see what we have in the 'parse' method.
Parse is called automatically when the Scrapy spider starts. As arguments, we have self (the instance of the class) and a response. The response is what the server returns when we request the HTML. In this class, we are requesting http://books.toscrape.com and in response we have an object with all the HTML, a status message and more.
Replace "pass" with 'print(response.status)' and run the spider:
scrapy crawl spider
This is what we got:
Among a lot of information, we see that we have crawled the start_url, got a 200 HTTP status (success) and then the spider stopped.
Besides 'status', the response object has a lot of other attributes and methods. The one we are going to use right now is 'xpath'.
Our first steps with Xpath
Open the starting URL, and right-click -> inspect any book. A side menu will open with the HTML structure of the website (if not, make sure you have selected the 'Elements' tab). You'll have something like this:
We can see that each 'article' tag contains all the information we want.
The plan is to grab all articles, then, one by one, get all the information from each book.
First, let's see how we select all articles.
If we click on the HTML in the side menu and press Control + F, the search menu opens:
At the bottom-right, you can read "Find by string, selector or Xpath". Scrapy uses Xpath, so let's use it.
To start a query with Xpath, write '//' then what you want to find. We want to grab all the articles, so type '//article'. We want to be more accurate, so let's grab all the articles with the attribute 'class = product_pod'. To specify an attribute, type it between brackets, like this: '//article[@class="product_pod"]'.
You can see now that we have selected 20 elements: The 20 initial books.
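If you prefer testing Xpath expressions outside the browser, Scrapy also ships with an interactive shell. Here is a quick sketch of how you could check the same selection there (the final '20' is just the count we expect on this page):
scrapy shell 'http://books.toscrape.com'
>>> len(response.xpath('//article[@class="product_pod"]'))
20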
Seems like we got it! Let's copy that Xpath instruction and use it to select the articles in our spider. Then, we store all the books.
def parse(self, response):
    all_books = response.xpath('//article[@class="product_pod"]')
Once we have all the books, we want to look inside each book for the information we want. Let's start with the title. Go to your URL and search where the full title is located. Right-click any title and then select 'Inspect'.
Inside the h3 tag, there is an 'a' tag with the book title as 'title' attribute. Let's loop over the books and extract it.
def parse(self, response):
    all_books = response.xpath('//article[@class="product_pod"]')
    for book in all_books:
        title = book.xpath('.//h3/a/@title').extract_first()
We get all the books, and for each one of them, we search for the 'h3' tag, then the 'a' tag, and we select the @title attribute. We want that text, so we use 'extract_first' (we can also use 'extract' to extract all of them).
As we are scraping not the whole HTML but a small subset (the one in 'book'), we need to put a dot at the start of the Xpath expression. Remember: '//' for the whole HTML response, './/' for a subset of that HTML we already extracted.
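To make the difference concrete, here are both forms side by side (the same lines we are writing, with comments added):
# Absolute query: searches the whole HTML response
all_books = response.xpath('//article[@class="product_pod"]')
# Relative query (note the leading dot): searches only inside the selected 'book'
title = book.xpath('.//h3/a/@title').extract_first()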
We have the title; now let's go for the price. Right-click the price and inspect it.
The text we want is inside a 'p' tag with the 'price_color' class inside a 'div' tag. Add this after the title:
price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first()
We go to any 'div' with a 'p' child that has a 'price_color' class, then we use the 'text()' function to get the text. And then, we extract_first() our selection.
Let's see what we have. Print both the price and the title and run the spider.
def parse(self, response):
    all_books = response.xpath('//article[@class="product_pod"]')
    for book in all_books:
        title = book.xpath('.//h3/a/@title').extract_first()
        price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first()
        print(title)
        print(price)
scrapy crawl spider
Everything is working as planned. Let's take the image URL too. Right-click the image, inspect it:
We don't have a full URL here but a partial one.
The 'src' attribute has the relative URL, not the whole one: the 'books.toscrape.com' part is missing. Well, we just need to add it. Add this at the bottom of your method:
image_url = self.start_urls[0] + book.xpath('.//img[@class="thumbnail"]/@src').extract_first()
print(image_url)
We get the 'img' tag with the class 'thumbnail', we take the relative URL from 'src', then we prepend the first (and only) start_url. Again, let's print the result. Run the spider again.
Looking good! Open any of the URLs and you'll see the cover's thumbnail.
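As an aside, Scrapy responses also provide a urljoin method that resolves a relative URL against the page's own URL, so the same line could be written without touching start_urls (a small, equivalent sketch):
image_url = response.urljoin(book.xpath('.//img[@class="thumbnail"]/@src').extract_first())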
Now let's extract the URL so we can buy any book if we are interested.
The book URL is stored in the href of both the title and the thumbnail. Either one will do.
book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()
print(book_url)
Run the spider again:
Click on any URL and you'll go to that book's page.
Now we are selecting all the fields we want, but we are not doing anything with them, right? We need to 'yield' (or 'return') them. For each book, we are going to return its title, price, image URL and book URL.
Remove all the prints and yield the items as a dictionary:
def parse(self, response):
    all_books = response.xpath('//article[@class="product_pod"]')
    for book in all_books:
        title = book.xpath('.//h3/a/@title').extract_first()
        price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first()
        image_url = self.start_urls[0] + book.xpath('.//img[@class="thumbnail"]/@src').extract_first()
        book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()
        yield {
            'title': title,
            'price': price,
            'Image URL': image_url,
            'Book URL': book_url,
        }
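For reference, this is the whole spider file with everything in place (just the snippets above assembled into the class we generated at the start):
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        all_books = response.xpath('//article[@class="product_pod"]')
        for book in all_books:
            title = book.xpath('.//h3/a/@title').extract_first()
            price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first()
            image_url = self.start_urls[0] + book.xpath('.//img[@class="thumbnail"]/@src').extract_first()
            book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()
            yield {
                'title': title,
                'price': price,
                'Image URL': image_url,
                'Book URL': book_url,
            }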
Run the spider and look at the terminal:
Saving the data into a file
While it looks cool on the terminal, it is of little use there. Why don't we store it in a file we can use later?
When we run our spider, we have optional arguments. One of them, -o, sets the file where we want to store the output. Run this:
scrapy crawl spider -o books.json
Wait until it's done… a new file has appeared! Double-click it to open it.
All the information we saw on the terminal is now stored in 'books.json'. Isn't that cool? We can do the same with .csv and .xml files:
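For example, with the same -o flag:
scrapy crawl spider -o books.csv
scrapy crawl spider -o books.xml
Scrapy infers the output format from the file extension, so these two commands store the same data as CSV and XML.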
Conclusion
I know the first time is tricky, but you have learnt the basics of Scrapy. You now know:
- How to create a Scrapy spider to navigate a URL
- How a Scrapy project is structured
- How to use Xpath to extract the data
- How to store the data in .json, .csv and .xml files
I suggest you keep training. Look for a URL you want to scrape and try extracting a few fields, as you did in the Beautiful Soup tutorial. The trick of Scrapy is learning how Xpath works.
But… do you remember that each book has a URL like this one?
Inside each item we scraped, there's more information we can take. And we'll do it in the second lesson of this series.