Goodreads Scraping using Python and Selenium
Rahul Kumar
Posted on April 12, 2022
Goodreads is one of the largest online book catalogs. It lets you track your reading progress and browse a myriad of options for your next read.
Github - https://github.com/rahul-kumar-bly/Goodreads-Scrapper-Alpha
Note - This tutorial is for educational purposes only. It is not intended for large-scale scraping of Goodreads data. The logos used in the cover image are the intellectual property of Selenium and Goodreads.
In this tutorial, we use Selenium to scrape a list of Must Read books available in the Goodreads library. Along the way, we also create a module that lets us search the website for a particular book.
So, let's get started.
Prerequisites
- If you are new to Selenium, first read my guide on how to get started with Selenium in Python, link here. It covers the basics you need before following this tutorial.
- A Goodreads account is needed for this tutorial, along with its email address and password.
- You need a basic idea of XPath (link).
Now, let's first import all the required modules.
Now for the basic part: create instances of the Service and Options classes so that your browser is ready to open the Goodreads login page.
We will use the Options instance later. First we need to check that everything is working.
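Here is a minimal sketch of this setup, assuming Selenium 4 with Chrome and a locally downloaded chromedriver. The helper names build_driver and open_login_page, the driver_path argument, and the sign-in URL are assumptions made for illustration; they are not taken from the original script.

```python
# Assumed sign-in URL; check it against the page you actually see in your browser.
LOGIN_URL = "https://www.goodreads.com/user/sign_in"


def build_driver(driver_path):
    """Create a Chrome driver from a Service and an Options instance."""
    # Imported inside the function so the other helper can be tried out
    # even on a machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    service = Service(executable_path=driver_path)  # path to chromedriver
    options = Options()  # left at its defaults for now
    return webdriver.Chrome(service=service, options=options)


def open_login_page(driver, url=LOGIN_URL):
    """Load the sign-in page and return the URL the browser reports."""
    driver.get(url)
    return driver.current_url
```

With a real browser, this would be used as driver = build_driver("/path/to/chromedriver") followed by open_login_page(driver), where the path is a placeholder for wherever your chromedriver lives.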
Run this script and it opens the Goodreads sign-in page. Next, we supply the login email and password to this page, because scraping data without logging in leads to complications.
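The login step could be sketched roughly as follows. The element ids user_email and user_password and the environment variable names are assumptions, so inspect the sign-in form and adjust them to whatever the page uses when you run this. The string "id" is the value of Selenium's By.ID constant, written out so the helper can be tried with a stand-in driver.

```python
import os

# Locator strategy string; identical to Selenium's By.ID constant.
BY_ID = "id"


def read_credentials(env=os.environ):
    """Read the login email and password from environment variables."""
    # GOODREADS_EMAIL and GOODREADS_PASSWORD are hypothetical variable names.
    return env["GOODREADS_EMAIL"], env["GOODREADS_PASSWORD"]


def login(driver, email, password,
          email_field_id="user_email", password_field_id="user_password"):
    """Fill in the sign-in form and submit it."""
    # The default element ids are assumptions, not confirmed against the site.
    email_box = driver.find_element(BY_ID, email_field_id)
    email_box.send_keys(email)
    password_box = driver.find_element(BY_ID, password_field_id)
    password_box.send_keys(password)
    password_box.submit()  # submits the form that contains this field
```

Reading the credentials from the environment keeps them out of the script, which matters if you push your code to a public repository.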
Once you run it, it opens the same login page, but this time it logs you in and redirects you to your profile page.
Let's move on to the next step, where we start scraping.
Our first objective is to open the dedicated search page, send it a book name, and retrieve a list of books.
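One way to sketch the search step is to build the search URL yourself and load it, rather than typing into the search box. This is a swapped-in shortcut, and the base URL and the q parameter are assumptions about how the Goodreads search page is addressed; the function names are made up for this example.

```python
from urllib.parse import urlencode

# Assumed base URL of the Goodreads search page.
SEARCH_URL = "https://www.goodreads.com/search"


def build_search_url(book_name):
    """Return the search URL for a book name, with the query safely encoded."""
    query = urlencode({"q": book_name})
    return f"{SEARCH_URL}?{query}"


def open_search(driver, book_name):
    """Point the already logged-in browser at the search results page."""
    url = build_search_url(book_name)
    driver.get(url)
    return url
```

Encoding the query with urlencode means titles with spaces or punctuation still produce a valid URL.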
This returns 20 books. Apart from the first two results, most of them are summaries, translations, or other related books with the same title, so there is no reason to loop through all the available pages and display every single result.
Open this link in a browser tab. Press Ctrl+Shift+C to open the element inspector, then press Ctrl+F inside the inspector window. Paste the expression //table/tbody/tr[contains(@itemtype, 'http://schema.org/Book')] and hover over the highlighted area. It highlights a block that includes the cover image, title, rating, and other details, as in the screenshot below.
XPath stands for XML Path Language.
I am not going into much detail about XPath, but for starters, it is a flexible way to address any part of an HTML document (well, XML originally), which means you can reach anything in HTML or XML using XPath.
In our snippet we want to access <tr itemscope="" itemtype="http://schema.org/Book">, so we start with the relative path //table/tbody/tr.
The tr element carries itemtype="http://schema.org/Book", so to match it we add the predicate [contains(@itemtype, 'http://schema.org/Book')].
The complete expression is therefore //table/tbody/tr[contains(@itemtype, 'http://schema.org/Book')]. There will be other XPath expressions later in the tutorial that will help you understand it better.
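To see what such an expression selects without opening a browser, here is a small sketch that uses Python's standard library ElementTree on a made-up table. ElementTree supports only a subset of XPath and has no contains(), so this version uses an exact attribute match instead; Selenium evaluates the full expression in the browser.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the search results table (made-up content).
SAMPLE_PAGE = """
<html><body><table><tbody>
  <tr itemscope="" itemtype="http://schema.org/Book"><td>First book</td></tr>
  <tr><td>A row that is not a book</td></tr>
  <tr itemscope="" itemtype="http://schema.org/Book"><td>Second book</td></tr>
</tbody></table></body></html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Exact attribute match, because ElementTree does not implement contains().
book_rows = root.findall(".//table/tbody/tr[@itemtype='http://schema.org/Book']")

# Only the two rows marked as books are selected; the middle row is skipped.
titles = [row.find("td").text for row in book_rows]
print(titles)
```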
We also need to extract the book cover image URL, since the code above does not provide it. The image tag has a class with the value bookcover, which we can access using the .find_elements() method.
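A minimal sketch of both lookups, assuming driver is a Selenium WebDriver that is already on the search results page. The function names are made up for this example, and the strings "xpath" and "class name" are the values of Selenium's By.XPATH and By.CLASS_NAME constants, written out so the helpers can be tried with a stand-in driver.

```python
# Locator strategy strings; identical to Selenium's By.XPATH and By.CLASS_NAME.
BY_XPATH = "xpath"
BY_CLASS_NAME = "class name"

BOOK_ROW_XPATH = "//table/tbody/tr[contains(@itemtype, 'http://schema.org/Book')]"


def find_book_rows(driver):
    """Return one element per search result row."""
    return driver.find_elements(BY_XPATH, BOOK_ROW_XPATH)


def find_cover_urls(driver):
    """Return the src attribute of every image with the class bookcover."""
    covers = driver.find_elements(BY_CLASS_NAME, "bookcover")
    return [image.get_attribute("src") for image in covers]
```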
The next step is to create a list and store the title, author info, rating, and image of each book as a tuple, which we can do with a loop.
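The loop might look something like the sketch below. It assumes that the text of each result row has the title on its first line, "by Author" on the second, and the rating on the third. That layout is an assumption about the page, and the function names and sample values are made up, so print one row's text first and adjust the parsing to match.

```python
def parse_row_text(text):
    """Split one row's text into (title, author, rating)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    lines += [""] * (3 - len(lines))  # pad short rows so indexing is safe
    title = lines[0]
    # Drop the leading "by " if it is there (assumed author format).
    author = lines[1][3:] if lines[1].startswith("by ") else lines[1]
    rating = lines[2]
    return title, author, rating


def build_book_list(row_texts, cover_urls):
    """Combine row texts and cover URLs into a list of tuples."""
    books = []
    for text, cover in zip(row_texts, cover_urls):
        title, author, rating = parse_row_text(text)
        books.append((title, author, rating, cover))
    return books
```

With Selenium, row_texts would typically be [row.text for row in rows], where rows are the elements matched by the XPath expression.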
It's time to display the data in our terminal using pandas.
Add import pandas to the imports, and then add these three lines of code.
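A sketch of what those three lines could look like, wrapped in a function: build a DataFrame, print it, and write it to a spreadsheet. The column names, the function name, and the default file name are assumptions. Writing .xlsx files also requires an Excel writer engine such as openpyxl to be installed alongside pandas.

```python
# Assumed column names, in the same order as the tuples in the book list.
COLUMNS = ["Title", "Author", "Rating", "Image"]


def show_and_save(books, path="goodreads_books.xlsx"):
    """Print the scraped books as a table and save them to an Excel file."""
    # Imported here so the rest of the script still loads without pandas.
    import pandas

    frame = pandas.DataFrame(books, columns=COLUMNS)
    print(frame)
    frame.to_excel(path, index=False)  # index=False leaves out the row numbers
    return frame
```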
Run the script and it will generate an Excel file like this.
So far we have:
- Opened Goodreads and logged in with credentials
- Searched for a particular book title
- Saved the data to an Excel spreadsheet