Mbonu Blessing
Posted on October 2, 2020
Hello everyone,
This week, we will build a CLI (command-line interface) that scrapes data from dev.to. We will filter by hashtag and return the title and author of each article related to that hashtag. We are going to do this with just Ruby. Here is a summary of how the CLI will work:
- You run it and it shows the welcome message
- It asks you to enter a hashtag
- Then the result is displayed to you
Time to get our hands dirty..
First, create a Ruby file. I called mine dev_to_web_scraper.rb. Open it with your code editor.
We are going to create a simple skeleton just to get our CLI going before scraping. We will create a module for the instructions and a class to handle the scraping.
# instruction module
module Instructions
  def introductions
    puts 'Welcome to dev.to webscraper. This CLI tool gathers articles based on the hashtag provided'
    puts 'If you want to quit, simply type (q) the next time you are prompted to enter a value'
    puts 'Please provide a hashtag to continue..'
    puts ''
  end
  def quit_message
    puts 'You have quit the scraper'
  end
  def invalid_entry
    puts 'Invalid entry, try again'
  end
end
# scraper class
class Scraper
  extend Instructions
  def self.get_input
    user_input = gets.chomp
    get_hashtag(user_input)
  end
  def self.get_hashtag(user_input)
    if user_input == 'q'
      quit_message
    elsif user_input.empty?
      invalid_entry
      get_input
    else
      scrape_data(user_input.to_s)
    end
  end
  def self.scrape_data(hashtag)
    puts "Scraped data for #{hashtag}"
    get_input
  end
end
Let me explain a little. I think the instruction module is pretty straightforward. We just created three methods that display messages to the user.
For the class, we extend it with the Instructions module, which makes the module's methods available as class methods. The class also has three methods. The first gets the input from the user and passes it to the next method, get_hashtag, which takes the input and decides what to do with it.
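The Scraper class uses extend rather than include, which is why the instruction methods can be called directly on Scraper. Here is a minimal sketch of the difference. The module and class names are hypothetical and only exist for this illustration:

```ruby
# Hypothetical module and classes, used only to contrast extend and include.
module Greeter
  def greet
    'hello'
  end
end

class ExtendedTool
  extend Greeter   # greet becomes a class method
end

class IncludedTool
  include Greeter  # greet becomes an instance method
end

puts ExtendedTool.greet                 # called on the class itself
puts IncludedTool.new.greet             # called on an instance
puts IncludedTool.respond_to?(:greet)   # the class itself does not get greet
```

Since every method in Scraper is a class method (def self. ...), extend is the one that fits here.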
Following the instructions, when the user enters q, they quit the CLI and a message is displayed. If the user enters an empty string, the invalid_entry message is displayed and they are prompted for another input. Otherwise, we call to_s on the input (gets already returns a string, so this is only a safeguard) and pass it to the scrape_data method.
This is where the action will happen, but for now it simply prints a string containing the user's input and prompts them for another one.
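The three-way decision made in get_hashtag can also be pulled out into a small function that only classifies the input, which makes it easy to check without typing into a terminal. This is just a sketch: classify_input and the symbols it returns are hypothetical names, and unlike the skeleton it strips surrounding whitespace, so an input made only of spaces counts as invalid.

```ruby
# Hypothetical helper: maps raw input to the action the CLI should take.
# Assumption: whitespace-only input is treated as empty (the skeleton does not do this).
def classify_input(raw)
  input = raw.to_s.strip
  return :quit if input == 'q'
  return :invalid if input.empty?

  :scrape
end

p classify_input('q')
p classify_input('')
p classify_input('ruby')
```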
To get it working, we need to call the introductions and get_input methods on Scraper.
All this is in one file. So the full file will resemble this:
# dev_to_web_scraper.rb
module Instructions
  def introductions
    puts 'Welcome to dev.to webscraper. This CLI tool gathers articles based on the hashtag provided'
    puts 'If you want to quit, simply type (q) the next time you are prompted to enter a value'
    puts 'Please provide a hashtag to continue..'
    puts ''
  end
  def quit_message
    puts 'You have quit the scraper'
  end
  def invalid_entry
    puts 'Invalid entry, try again'
  end
end
class Scraper
  extend Instructions
  def self.get_input
    user_input = gets.chomp
    get_hashtag(user_input)
  end
  def self.get_hashtag(user_input)
    if user_input == 'q'
      quit_message
    elsif user_input.empty?
      invalid_entry
      get_input
    else
      scrape_data(user_input.to_s)
    end
  end
  def self.scrape_data(hashtag)
    puts "Scraped data for #{hashtag}"
    get_input
  end
end
Scraper.introductions
Scraper.get_input
Time to run it in the console. I saved my file in the Desktop folder, so I need to cd into that folder to run my code.
$ cd Desktop
$ ruby dev_to_web_scraper.rb
And you should see the welcome message, followed by a prompt waiting for a hashtag.
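One thing to note about the skeleton: it re-prompts by having get_input, get_hashtag and scrape_data call each other, so every query nests one call deeper. It also calls chomp on whatever gets returns, which fails when the input ends (gets returns nil at end of input). A loop-based variant avoids both. This is only a sketch of that idea, not a replacement for the code in this post: run_prompt and its input and output parameters are hypothetical, added so the loop can be driven with canned input.

```ruby
require 'stringio'

# Hypothetical loop-based prompt. input and output are injectable so the
# loop can be exercised without a real terminal.
def run_prompt(input: $stdin, output: $stdout)
  loop do
    line = input.gets
    # gets returns nil at end of input, so treat that the same as quitting
    if line.nil? || line.chomp == 'q'
      output.puts 'You have quit the scraper'
      break
    end

    hashtag = line.chomp
    if hashtag.empty?
      output.puts 'Invalid entry, try again'
    else
      output.puts "Scraped data for #{hashtag}"
    end
  end
end

# Drive the loop with canned input instead of a keyboard.
fake_output = StringIO.new
run_prompt(input: StringIO.new("ruby\n\nq\n"), output: fake_output)
puts fake_output.string
```

Calling run_prompt with no arguments reads from the keyboard and prints to the console as usual.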
Onto the not-so-hard part. We need to install some gems:
$ gem install httparty #HTTP request gem
$ gem install nokogiri #parsing gem
We also need the URL from dev.to that lets us list articles by hashtag. The URL is https://dev.to/t/career, where we can replace career with whatever we get from the user.
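Since the last part of that URL comes straight from the user, it can help to tidy it up before building the URL. Here is a small sketch of how that might look. The tag_url helper and the DEV_TO_TAG_BASE constant are hypothetical names, and lowercasing the tag and dropping a leading # are assumptions about what users will type, not something dev.to documents here:

```ruby
require 'erb'

DEV_TO_TAG_BASE = 'https://dev.to/t/' # base taken from the URL in this post

# Hypothetical helper: normalises user input into a tag URL.
# Assumptions: tags are lowercase and users may type a leading '#'.
def tag_url(hashtag)
  tag = hashtag.to_s.strip.delete_prefix('#').downcase
  # percent-encode anything that is not safe inside a URL path segment
  "#{DEV_TO_TAG_BASE}#{ERB::Util.url_encode(tag)}"
end

puts tag_url('career')
puts tag_url(' #Ruby ')
```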
Next, we update our file to require httparty and update our scrape_data method:
require 'httparty' # lowercase: the file name is case-sensitive on most systems
...
def self.scrape_data(hashtag)
  url = "https://dev.to/t/#{hashtag}"
  html = HTTParty.get(url)
  puts "Scraped data for #{hashtag}"
  puts html
  get_input
end
...
Running the above will display a bunch of HTML. Time to turn it into something meaningful.
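Before parsing, it can be worth checking that the request actually succeeded, because as far as I know HTTParty.get does not raise an error for a 404 by default. HTTParty responses expose code and body, so a check could look like the sketch below. The html_or_nil helper is hypothetical, and FakeResponse is only a stand-in so the example runs without a network call:

```ruby
# Stand-in for an HTTP response; HTTParty responses also expose code and body.
FakeResponse = Struct.new(:code, :body)

# Hypothetical helper: returns the HTML body on a 200, or nil otherwise.
def html_or_nil(response)
  return nil unless response.code == 200

  response.body
end

p html_or_nil(FakeResponse.new(200, '<html></html>'))
p html_or_nil(FakeResponse.new(404, 'Not found'))
```

In the scraper, the same helper could be given the result of HTTParty.get(url).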
We are going to gather the title and author of each article into an array and print it. This is where nokogiri comes into play. To identify the title and author of an article, we need to use the browser's dev tools to find the element and its class or id, or anything else we can use to identify that information.
For the article title, the CSS selector I found is h2.crayons-story__title a, and for the author it is div.crayons-story__top p. Each article is wrapped in a parent div whose CSS class is .crayons-story__body.
Next, we require nokogiri and use it to parse our HTML. Our updated code for scrape_data should be:
require 'nokogiri'
...
def self.scrape_data(hashtag)
  url = "https://dev.to/t/#{hashtag}"
  puts 'getting data ....'
  html = HTTParty.get(url)
  # parse the response body (a string of HTML) into a searchable document
  response = Nokogiri::HTML(html.body)
  info = []
  # each article on the page is wrapped in a .crayons-story__body element
  response.css('.crayons-story__body').each do |section|
    title_and_author = section.search('h2.crayons-story__title a', 'div.crayons-story__top p')
    info.push({
      title: title_and_author[0].text.gsub(/\n/, '').strip.gsub(/\s+/, ' '),
      author: title_and_author[1].text.gsub(/\n/, '').strip.gsub(/\s+/, ' ')
    })
  end
  puts info
  get_input
end
First, we parse the HTTP response with nokogiri and save the result in the response variable. Then we create an empty array to hold our results.
Next, we use the css method to find all the elements whose class matches .crayons-story__body. We loop through them and, within each one, search for the h2.crayons-story__title a and div.crayons-story__top p elements. The search returns an array-like node set. We call the text method on each of the two search results, clean up the newlines and the extra spaces around and within the string, push the result into the array, and finally print the array to the console.
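The clean-up step is repeated for the title and the author, so it could live in one small helper. This is a sketch with a hypothetical name, clean_text, and it is slightly different from the chain in scrape_data: it turns newlines into a single space instead of deleting them, so two words separated only by a newline are not glued together.

```ruby
# Hypothetical helper: collapses every run of whitespace (newlines included)
# into one space, then trims the ends.
def clean_text(raw)
  raw.to_s.gsub(/\s+/, ' ').strip
end

puts clean_text("\n      Learning Ruby\n   in   2020\n   ")
# prints "Learning Ruby in 2020"
```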
Go ahead and run the code. We should have an array of hashes displayed in the console.
We could stop here, but I think we still have one small issue to resolve: the case where there are no articles for a hashtag. We handle it by checking whether the list of matched articles is empty. If it is, we tell the user; otherwise, we loop through and print the results. Our updated code should be:
...
def self.scrape_data(hashtag)
  url = "https://dev.to/t/#{hashtag}"
  puts 'getting data ....'
  html = HTTParty.get(url)
  # parse the response body (a string of HTML) into a searchable document
  response = Nokogiri::HTML(html.body)
  info = []
  articles = response.css('.crayons-story__body')
  if articles.empty?
    puts "No articles for hashtag: #{hashtag}"
  else
    articles.each do |section|
      title_and_author = section.search('h2.crayons-story__title a', 'div.crayons-story__top p')
      info.push({
        title: title_and_author[0].text.gsub(/\n/, '').strip.gsub(/\s+/, ' '),
        author: title_and_author[1].text.gsub(/\n/, '').strip.gsub(/\s+/, ' ')
      })
    end
  end
  puts info
  get_input
end
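The output of puts info is a raw dump of hashes. If you would like something easier to read, the printing can be separated from the scraping. Below is a sketch under that assumption: format_results is a hypothetical helper, and the sample data is made up rather than taken from dev.to.

```ruby
# Hypothetical helper: turns the scraped hashes into printable lines.
def format_results(info, hashtag)
  return ["No articles for hashtag: #{hashtag}"] if info.empty?

  info.each_with_index.map do |article, index|
    "#{index + 1}. #{article[:title]} (by #{article[:author]})"
  end
end

sample = [
  { title: 'Sample title', author: 'Sample author' } # made-up data
]
puts format_results(sample, 'ruby')
puts format_results([], 'height')
```

Inside scrape_data, puts format_results(info, hashtag) could then take the place of both the empty check message and puts info.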
When I try the hashtags java and javascript, I get a list of article titles and authors for each.
When we use a tag that doesn't exist on dev.to yet, like height, we should get our little message instead.
Here is the link to a GitHub gist that contains the code:
GitHub gist
Until next week..
Resources
nokogiri gem
httparty gem
Web Scraping with Ruby
HOWTO parse HTML with Ruby & Nokogiri