Scraping with Ruby p.1

Mattias Velamsson

Posted on January 17, 2023

Hi,

This time I will be walking through how to scrape a website using Ruby. There are plenty of guides out there for this, and I used a lot of them to freshen up my memory and try new things.

Goal: To get a better understanding of scraping, how to extract the right data, and how to use Nokogiri + Watir together.

I will be splitting this up into parts as I go. Future parts will include fetching multiple pages, parsing to CSV, and more.


Why use Nokogiri and Watir?

  • Nokogiri is a gem that makes it easy to parse and search HTML and XML documents.
  • Watir is a gem that drives a real browser (it is built on Selenium), so we can load a page, let it render, and grab the resulting HTML.

Together, they are a good match and make scraping very straightforward.


Getting started

First, you'll need to install the Nokogiri and Watir gems. You can do this by running the following command in your terminal:

gem install nokogiri watir
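If your project uses Bundler, you can add the gems to your Gemfile instead and run bundle install. A minimal sketch:

# Gemfile
source 'https://rubygems.org'

gem 'nokogiri'
gem 'watir'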

Then require them in your script and open a browser:

require 'nokogiri'
require 'watir'

# Open a browser (Chrome by default), visit the page, and hand its HTML to Nokogiri
browser = Watir::Browser.new
browser.goto('https://www.example.com')
doc = Nokogiri::HTML(browser.html)

In the above example, we use Watir to open a browser and navigate to "https://www.example.com". We then take the HTML of the loaded page from "browser.html", parse it with Nokogiri, and store the result in the "doc" variable.
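Two practical extras worth knowing (just a sketch, assuming you are driving Chrome; the exact option names can vary between Watir versions): you can run the browser headless so no window pops up, and you should close it when you are done.

require 'nokogiri'
require 'watir'

# Assumption: Chrome + chromedriver are installed; '--headless' keeps the window from appearing
browser = Watir::Browser.new(:chrome, options: { args: ['--headless'] })
browser.goto('https://www.example.com')
doc = Nokogiri::HTML(browser.html)

browser.close # free the browser process when you're finished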

Searching and Extracting Data

Now that we have the HTML from the website in the "doc" variable, we can use Nokogiri to search and extract information from it. Here's an example of how to search for all the h1 tags on the website:

h1_tags = doc.search('h1')

h1_tags.each do |h1|
  puts h1.text
end

Above, we use Nokogiri's search method to find all the h1 tags, and then iterate over each h1 tag and print its text. Easy peasy, right?
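As a quick aside (a small sketch, nothing specific to any particular site): search also accepts XPath expressions, and at_css returns only the first match instead of a list.

# at_css returns the first matching node, or nil if there is none
first_h1 = doc.at_css('h1')
puts first_h1.text if first_h1

# search understands XPath as well as CSS
links = doc.search('//a[@href]')
puts links.length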

But search is only one of many methods you can use. Another one that is really convenient is css, which takes CSS selectors.

content_p = doc.css('.content p')

puts content_p.text

This searches the HTML in doc for every paragraph inside an element with the class "content", and then prints the text of those paragraphs.
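Selectors get you to the elements, but often the useful part is an attribute. Nokogiri nodes support [] for attribute access; here is a small sketch (links are just an example) that prints every link URL on the page:

# [] reads an attribute from a node — here, the href of every link
doc.css('a').each do |link|
  puts link['href']
end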

And that's pretty much it! ✌🏼


Now, this is just the tip of the iceberg of what you can do with this.

In the next part, I will go more into things such as:

  • Fetching all elements and adding logic
  • Writing the fetched data to a file such as a CSV
  • Error handling
  • Adding headers to your request
  • and more.