Scraping Millions of Google SERPs The Easy Way (Python Scrapy Spider)
Ian Kerins
Posted on November 17, 2020
Google is the undisputed king of search engines in just about every aspect, making it the ultimate source of data for a whole host of use cases.
If you want access to this data, you either need to extract it manually, pay a third party for an expensive data feed, or build your own scraper to extract it for you.
In this article I will show you the easiest way to build a Google scraper that can extract millions of pages of data each day with just a few lines of code.
By combining Scrapy with Scraper API's proxy and autoparsing functionality, we will build a Google scraper that can scrape the search engine results for any Google query and return the following for each result (a rough sketch of a parsed result follows the list):
- Title
- Link
- Related links
- Description
- Snippet
- Images
- Thumbnails
- Sources, and more
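To give a sense of the data, here is a rough, illustrative sketch of a single parsed result based on the fields we'll use later in this tutorial; the exact JSON layout returned by Scraper API may differ, and field names beyond title, snippet, and link are assumptions:

# Illustrative only -- not the exact Scraper API schema.
{
    'title': 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework',
    'link': 'https://scrapy.org/',
    'snippet': 'An open source and collaborative framework for extracting the data you need from websites.',
}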
You can also refine your search queries with parameters: a keyword, the geographic region, the language, the number of results, a specific domain to search, or even a safe-search filter. The possibilities are nearly limitless.
The code for this project is available on GitHub here.
For this guide, we're going to use:
- Scraper API as our proxy solution, as Google has pretty aggressive anti-scraping measures in place. You can sign up for a free account here, which will give you 5,000 free requests.
- ScrapeOps to monitor our scrapers for free and alert us if they run into trouble. Live demo here: ScrapeOps Demo
How to Query Google Using Scraper API’s Autoparse Functionality
We will use Scraper API for two reasons:
- Proxies, so we won't get blocked.
- Parsing, so we don't have to worry about writing our own parsers.
Scraper API is a proxy management API that handles everything to do with rotating and managing proxies so our requests don't get banned, which is great for a hard-to-scrape site like Google.
However, what makes Scraper API extra useful for sites like Google and Amazon is that they provide auto parsing functionality free of charge so you don't need to write and maintain your own parsers.
By using Scraper API’s autoparse functionality for Google Search or Google Shopping, all the HTML will be automatically parsed into JSON format for you. Greatly simplifying the scraping process.
All we need to do to make use of this handy capability is to add the following parameter to our request:
"&autoparse=true"
We’ll send the HTTP request with this parameter via Scrapy, which will scrape Google results based on the specified keywords. The results will be returned in JSON format, which we will then parse using Python.
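If you want to see the autoparsed JSON before wiring up Scrapy, a minimal sketch like the following works too. It assumes the requests library is installed and uses the same api_key, url, and autoparse parameters we'll use later; API_KEY is a placeholder for your own key:

import requests
from urllib.parse import urlencode

API_KEY = 'YOUR_KEY'  # placeholder -- replace with your own Scraper API key
params = {
    'api_key': API_KEY,
    'url': 'http://www.google.com/search?q=tshirt',
    'autoparse': 'true',
}
# Send the Google query through Scraper API and inspect the parsed JSON's top-level keys
response = requests.get('http://api.scraperapi.com/?' + urlencode(params))
print(response.json().keys())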
Scrapy Installation and Setup
First things first: the requirements for this tutorial are very straightforward:
• Python version 3 or later
• pip, to install the necessary software packages
So, assuming you have both of those things, you only need to run the following command in your terminal to install Scrapy:
pip install scrapy
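If you want to confirm the install worked, you can print the installed version:

scrapy version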
Scrapy will create a project folder containing all the default files and packages it needs. Navigate to wherever you want the project to live, and then run the following commands:
scrapy startproject google_scraper
cd google_scraper
scrapy genspider google api.scraperapi.com
First, Scrapy will create a new project folder called “google_scraper,” which is also the project name. We then navigate into this folder and run the “genspider” command, which generates a web scraper for us named “google.”
You should now see a bunch of configuration files, a “spiders” folder with your scraper(s), and a Python modules folder with some package files.
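For reference, the generated project should look roughly like this (exact files can vary slightly between Scrapy versions):

google_scraper/
├── scrapy.cfg              # project configuration file
└── google_scraper/         # the project's Python module
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── google.py       # the spider generated by "genspider"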
Building URLs to Query Google
As you might expect, Google uses a very standard and easy-to-query URL structure. To build a URL to query Google, you only need to know the URL parameters for the data you need. In this tutorial, I’ll use some of the parameters that are most useful for the majority of web scraping projects.
Every Google Search query will start with the following base URL:
http://www.google.com/search
You can then build out your query simply by adding one or more of the following parameters:
- The search keyword parameter denoted as q. For example, http://www.google.com/search?q=tshirt will search for results containing the “tshirt” keyword.
- The language parameter hl. For example, http://www.google.com/search?q=tshirt&hl=en
- The as_sitesearch parameter, which specifies a domain (or website) to search. For example, http://www.google.com/search?q=tshirt&as_sitesearch=amazon.com
- The num parameter that specifies the number of results per page (maximum is 100). For example, http://www.google.com/search?q=tshirt&num=50
- The start parameter which specifies the offset point. For example, http://www.google.com/search?q=tshirt&start=100
- The safe parameter which will only output “safe” results. For example, http://www.google.com/search?q=tshirt&safe=active
There are many more parameters to use for querying Google, such as date, encoding, or even operators such as ‘or’ or ‘and’ to implement some basic logic.
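For example, combining several of these parameters into a single query could look like this:

http://www.google.com/search?q=tshirt&num=50&hl=en&as_sitesearch=amazon.com&safe=active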
Building the Google Search Query URL
Below is the function I’ll be using to build the Google Search query URL. It creates a dictionary with key-value pairs for the q, num, and as_sitesearch parameters. If you want to add more parameters, this is where you could do it.
If no site is specified, the function returns a URL without the as_sitesearch parameter. If one is specified, it first extracts the network location using netloc (e.g. amazon.com), adds this key-value pair to google_dict, and finally encodes it into the returned URL with the other parameters:
from urllib.parse import urlparse
from urllib.parse import urlencode

def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100, }
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)
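As a quick, illustrative sanity check, calling the function should produce URLs along these lines:

create_google_url('tshirt')
# 'http://www.google.com/search?q=tshirt&num=100'

create_google_url('tshirt', site='https://www.amazon.com')
# 'http://www.google.com/search?q=tshirt&num=100&as_sitesearch=www.amazon.com'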
Connecting to a Proxy via the Scraper API
When scraping an internet service like Google, you will need to use a proxy if you want to scrape at any reasonable scale. If you don’t, you could get flagged by its anti-bot countermeasures and have your IP banned. Thankfully, you can use Scraper API’s proxy solution for free for up to 5,000 API calls, using up to 10 concurrent threads. You can also use some of Scraper API’s more advanced features, such as geotargeting, JS rendering, and residential proxies.
To use the proxy, just head here to sign up for free. Once you have, find your API key in the dashboard as you’ll need it to set up a proxy connection.
The proxy is incredibly easy to implement in your web spider. In the get_url function below, we’ll create a payload with our Scraper API key and the URL we built in the create_google_url function. We’ll also enable the autoparse feature here, as well as set the proxy location to the U.S.:
def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
To send our request via one of Scraper API’s proxy pools, we only need to append our query URL to Scraper API’s proxy URL. This will return the information that we requested from Google and that we’ll parse later on.
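To make that concrete, with a placeholder API key the final URL that Scrapy requests would look something like this (note that urlencode percent-encodes the Google URL):

http://api.scraperapi.com/?api_key=YOUR_KEY&url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dtshirt%26num%3D100&autoparse=true&country_code=us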
Querying Google Search
The start_requests function is where we will set everything into motion. It will iterate through a list of queries that will be sent through to the create_google_url function as keywords for our query URL.
def start_requests(self):
    queries = ['scrapy', 'beautifulsoup']
    for query in queries:
        url = create_google_url(query)
        yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})
The query URL we built will then be sent as a request to Google Search using Scrapy’s yield, via the proxy connection we set up in the get_url function. The result (which should be in JSON format) will then be sent to the parse function to be processed. We also add the {'pos': 0} key-value pair to the meta parameter, which is used to track the position of each result as we scrape.
Scraping the Google Search Results
Because we used Scraper API’s autoparse functionality to return data in JSON format, parsing is very straightforward. We just need to select the data we want from the response dictionary.
First of all, we’ll load the entire JSON response and then iterate through each result, extracting some information and then putting it together into a new item we can use later on.
This process also checks whether there is another page of results. If there is, it invokes yield scrapy.Request again and sends that page to the parse function. In the meantime, pos is used to keep track of each result’s position across the pages we have scraped:
def parse(self, response):
    di = json.loads(response.text)
    pos = response.meta['pos']
    dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    for result in di['organic_results']:
        title = result['title']
        snippet = result['snippet']
        link = result['link']
        item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
        pos += 1
        yield item
    next_page = di['pagination']['nextPageUrl']
    if next_page:
        yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
Putting it All Together and Running the Spider
You should now have a solid understanding of how the spider works and how it flows. The spider we created, google.py, should now have the following contents:
import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime

API_KEY = 'YOUR_KEY'

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100, }
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)

class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
                       'RETRY_TIMES': 5}

    def start_requests(self):
        queries = ['scrapy', 'beautifulsoup']
        for query in queries:
            url = create_google_url(query)
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        di = json.loads(response.text)
        pos = response.meta['pos']
        dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['title']
            snippet = result['snippet']
            link = result['link']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        next_page = di['pagination']['nextPageUrl']
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
Before testing the scraper, we need to configure the settings so the spider integrates with the Scraper API free plan, which allows 10 concurrent threads.
To do this, we defined the following custom settings in our spider class:
custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                   'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
                   'RETRY_TIMES': 5}
We set the concurrency to 10 threads to match the Scraper API free plan and set RETRY_TIMES to tell Scrapy to retry any failed requests 5 times. In the settings.py file, we also need to make sure that DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren’t enabled, as these will lower your concurrency and are not needed with Scraper API.
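In practice, the relevant part of your settings.py might look something like this sketch (the generated file usually ships with these lines commented out already):

## settings.py (relevant excerpt)
## Keep these disabled -- Scraper API manages request pacing, and enabling them
## would only reduce your effective concurrency.
# DOWNLOAD_DELAY = 3
# RANDOMIZE_DOWNLOAD_DELAY = True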
To test or run the spider, just make sure you are in the project directory and then run the following crawl command, which will also output the results to a .csv file:
scrapy crawl google -o test.csv
If all goes according to plan, the spider will scrape Google Search for all the keywords you provide. By using a proxy, you’ll also avoid getting banned for using a bot.
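Scrapy's feed exporter also supports other output formats, so you could just as easily write the results to JSON instead, for example:

scrapy crawl google -o test.json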
Setting Up Monitoring
To monitor our scraper we're going to use ScrapeOps, a free monitoring and alerting tool dedicated to web scraping.
With a simple 30-second install, ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.
Live demo here: ScrapeOps Demo
Getting setup with ScrapeOps is simple. Just install the Python package:
pip install scrapeops-scrapy
And add 3 lines to your settings.py file:
## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
From there, our scraping stats will be automatically logged and shipped to our dashboard.
If you would like to run the spider for yourself or modify it for your particular Google project, feel free to do so. The code is on GitHub here. Just remember that you need to get your own Scraper API API_KEY by signing up for a free account here.