Web Scraping Google With Ruby

Introduction

Ruby is a high-level, interpreted, general-purpose, object-oriented programming language. Yukihiro “Matz” Matsumoto began developing it in Japan in 1993, and the first public release followed in 1995.

Ruby has a simple syntax that is easy to understand. It is a powerful programming language with a large ecosystem of libraries (gems) that can be used for web scraping, such as HTTParty, Mechanize, and Nokogiri. This makes it a great choice for data extraction.

Web scraping is the process of extracting data from websites or other sources. It is used for tasks such as data mining, price monitoring, lead generation, and SEO.

In this blog, we will learn to scrape Google Search Results using Ruby and its libraries.

Why Ruby for Scraping Google?

Ruby is quite popular for web scraping. It can handle complex scraping tasks by launching multiple threads to fetch data from different parts of a website, which makes it a good fit for this kind of work.

With libraries such as Nokogiri, it can parse both HTML and XML documents, and its rich set of libraries helps developers automate the process of web scraping.

Overall, Ruby is a productive language with good community support. Whether you scrape Google or any other website, Ruby provides plenty of features and libraries to help you get started with web scraping.

Scraping Google Search Results With Ruby

In this post, we will code a scraper to extract the first 10 Google Search Results. The returned data will consist of the title, link, and description of each organic result. You can use the data for a variety of purposes like SERP monitoring, rank tracking, and keyword tracking.

Google Search Results Scraping can be divided into two processes:

Extracting HTML from the target URL.
Parsing the extracted raw HTML to get the required data.

Requirements:

To scrape Google Search Results, we will be working with these two libraries:

HTTParty — Used to make HTTP requests and fetch the required data.
Nokogiri — Used to parse HTML and XML documents.

Set-Up:

If you have not installed Ruby yet, I recommend you watch these videos before starting the tutorial. Once Ruby is set up, install the two libraries by running gem install httparty nokogiri in your terminal.

Process:

So, now we can get started with our project. We will use the following URL to scrape the search results. The q parameter holds the search query, and gl sets the country for the search.

    https://www.google.com/search?q=ruby&gl=us
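
If you want to search for something other than a single word, the query has to be URL-encoded. The sketch below shows one way to build the URL above with Ruby's standard library. The helper name build_search_url is hypothetical and not part of the original scraper; only the q and gl parameters from the URL above are used.

```ruby
require "uri"

# Hypothetical helper (an assumption for illustration): builds a Google
# search URL from a query string and a two-letter country code.
def build_search_url(query, country: "us")
  # encode_www_form escapes spaces and special characters in the values.
  params = { q: query, gl: country }
  "https://www.google.com/search?" + URI.encode_www_form(params)
end

puts build_search_url("ruby")
puts build_search_url("ruby web scraping")
```

The first call prints the same URL used in this tutorial. In the second, the spaces in the query are encoded as + signs.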

Let us first require the dependencies we have installed and are going to use in the tutorial.

    require "nokogiri"
    require "httparty"

Now, we will define a function, scraper, to extract the information from Google.

    def scraper
        url = "https://www.google.com/search?q=ruby&gl=us"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
        }
        unparsed_page = HTTParty.get(url, headers: headers)
        parsed_page = Nokogiri::HTML(unparsed_page.body)
        results = []

Step-by-step explanation:

First, we set the URL we want to scrape.
Then, we set the User-Agent header, which identifies the type of device or browser making the request.
After that, we make an HTTP GET request to the URL with HTTParty, passing the User-Agent as a header.
In the next line, we use Nokogiri to parse the fetched HTML, and then we initialize a results array to store the scraped data.

Now, we will look for the tags in the HTML that hold the relevant data.

If you inspect the Google webpage, you will see that all the organic results are inside div containers with the class name g.
So, we will select all the divs with g as the class name.

    parsed_page.css("div.g")

Then, we will loop through each of the selected divs.

        parsed_page.css("div.g").each do |result|
            link = result.css(".yuRUbf > a").first
            link_href = link.nil? ? "" : link["href"]
            result_hash = {
                title: result.css("h3").text,
                link: link_href,
                snippet: result.css(".VwiC3b").text
            }
            results << result_hash
        end
        puts results
    end

    scraper

In the above code, the second line selects the first anchor tag that is a direct child of the element with the class yuRUbf.

After that, we check if the link is nil. If it is, link_href is set to an empty string. If not, it is set to the URL in the href attribute of the link.

If you inspect the Google webpage again, you will find that the title is within the h3 tag, the link is under the selector .yuRUbf > a, and the snippet is under the class VwiC3b.

Running the scraper without any errors should print results like this:

    {
        :title=>"Ruby Programming Language",
        :link=>"https://www.ruby-lang.org/en/",
        :snippet=>"A dynamic, open source programming language with a focus on simplicity and productivity. It has an elegant syntax that is natural to read and easy to write."
    }
    {
        :title=>"Ruby - Wikipedia",
        :link=>"https://en.wikipedia.org/wiki/Ruby",
        :snippet=>"A ruby is a pinkish red to blood-red colored gemstone, a variety of the mineral corundum (aluminium oxide). Ruby is one of the most popular traditional ..."
    }
    {
        :title=>"Ruby: #1 Virtual Receptionist & Live Chat Solution for Small ...",
        :link=>"https://www.ruby.com/",
        :snippet=>"14000+ small businesses trust the virtual receptionists at Ruby to create meaningful connections over the phone and through live chat, 24/7."
    }

However, if you send many requests this way, Google may quickly block your IP. Using a random User-Agent for each request helps avoid blocking to some extent. Let me show you how to do this:

Initialize an array of User Agents:

    headers = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
    ]

Then, select a random user-agent each time you make a request.

    random_header = headers.sample

The sample method returns a random element from the array.

Then pass it as the User-Agent header when you request the HTML.

    unparsed_page = HTTParty.get(url, headers: {
        "User-Agent": random_header
    })
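
The three steps above can also be wrapped in one small helper, so that every request picks a fresh User-Agent. The following is a minimal sketch using only core Ruby. The names USER_AGENTS and random_headers are hypothetical, and the list is shortened for readability.

```ruby
# A shortened list for illustration; the full list above works the same way.
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
].freeze

# Hypothetical helper: returns a new headers hash with a randomly chosen
# User-Agent. The optional random: argument makes the choice repeatable,
# which is handy in tests.
def random_headers(user_agents = USER_AGENTS, random: Random.new)
  { "User-Agent" => user_agents.sample(random: random) }
end

# Each call may pick a different User-Agent.
3.times { puts random_headers["User-Agent"] }
```

The returned hash can be passed straight to the request, for example HTTParty.get(url, headers: random_headers).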

So, this is how you can scrape Google Search Results with Ruby.

If you are looking for a more sophisticated and maintenance-free solution, you can try this Google SERP API for scraping Google Search Results.

Advantages of scraping Google Search Results

There are various advantages of scraping Google Search Results:

Rank Tracking: It can be used to track your website’s position on the search engine, which helps you stay informed and make decisions accordingly.

Scalable: Scraping Google Search Results lets you gather a significant amount of data, which can be used for a variety of purposes such as keyword tracking and rank tracking.

Price Tracking: Scraping Google Search Results can help you stay well informed about the pricing of the products sold by your competitors.

Lead Generation: If you want to gather contact information about your potential clients, scraping Google Search Results can be a good way to do it.

Real-time data: Scraping Google Search Results gives you access to real-time data, so you can stay up to date with current information.

Inexpensive: Many businesses can’t afford the official Google Search API on an already tight budget, and scraping Google Search Results avoids that cost.
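
As an illustration of the rank tracking use case, here is a minimal sketch that looks up the position of a domain in the array of result hashes produced by the scraper. The helper names host_of and rank_for are hypothetical, and the sample data reuses the titles and links from the output shown earlier.

```ruby
require "uri"

# Hypothetical helper: extracts the host from a link, or nil if the link
# is empty or cannot be parsed.
def host_of(link)
  URI.parse(link).host
rescue URI::InvalidURIError
  nil
end

# Hypothetical helper: returns the 1-based position of the first result
# whose host is the given domain or one of its subdomains, or nil when the
# domain does not appear in the results.
def rank_for(results, domain)
  index = results.index do |result|
    host = host_of(result[:link])
    host == domain || host.to_s.end_with?(".#{domain}")
  end
  index && index + 1
end

results = [
  { title: "Ruby Programming Language", link: "https://www.ruby-lang.org/en/" },
  { title: "Ruby - Wikipedia", link: "https://en.wikipedia.org/wiki/Ruby" },
  { title: "Ruby: #1 Virtual Receptionist & Live Chat Solution for Small ...", link: "https://www.ruby.com/" }
]

puts rank_for(results, "wikipedia.org") # position of the Wikipedia result
p rank_for(results, "example.com")      # nil, because the domain is absent
```

Running the lookup on a schedule and storing the returned position would give a simple rank history for a keyword.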

Problems with the Official Google Search API

There are a few reasons why businesses don’t use the official Google Search API:

Expensive: The official Google Search API costs $5 per 1,000 requests, which makes it one of the more expensive options on the market.

Limited Access: The API exposes only a limited amount of data, so businesses turn to the web scrapers available in the market, which give them complete control over the results.

Complex Setup: Users with no technical knowledge can find it very difficult to set up the API.

Conclusion

In this tutorial, we learned to scrape Google Search Results using Ruby. Please do not hesitate to message me if I missed something. If you think we can complete your custom scraping projects, feel free to contact us.

Follow me on Twitter. Thanks for reading!