Adam Smolenski
Posted on August 16, 2020
My last post explained why web scraping is a good project for learning about many aspects of web development. Let's go over some things before we start.
I did learn Python first (Beautiful Soup is awesome) but personally I have taken a liking to some of the libraries available in Ruby.
What tools are you going to need?
If you work with Rails you have probably noticed this thing called Nokogiri... it takes fooooooooorever to load. Fun fact: nokogiri is the Japanese word for a woodsaw... I'm not sure why they named it that (maybe because saws scrape things...), but I didn't research it so I could settle on the joke that it's named after the snoring I did while napping during installation. Nokogiri is an HTML parser that lets you use CSS selectors to find information.
Now we have to get that parser some information. That's where the aptly named HTTParty comes in handy.
It will handle your GET requests to retrieve the raw HTML. Another tool that isn't necessary, but something I love about Ruby, is Pry. Just drop a binding.pry in and you can manipulate variables on the spot. It's especially useful in OOP code rather than procedural: put it inside a method to make sure your scope is what you expect.
Those are the tools we will start with. In part 2 or 3 we will add Selenium, which is a webdriver.
It's best to break a project down into parts... Or maybe Shakespeare was trying to communicate that there are only so many bad jokes people will sit through in one play... or post.
Let's Begin!
So here is what you should have at the top of your ruby scraper file:
require 'nokogiri'
require 'pry'
require 'httparty'
In your Gemfile you should have:
source "https://rubygems.org"
gem 'pry'
gem 'nokogiri'
gem 'httparty'
Don't forget to bundle install!
To start, we're not trying to make this interactive, just a quick scraper we can build on.
We're going to start with a single episode and try to get the information we want from it. If you didn't come from the first post: when finding the information, you'll want to think about what datatypes you'll use to store it. We'll work through that based on what we want to scrape.
What's useful to know about an episode... title, characters/actors, airdate, plot, rating, writers, and directors. Oh boy, that's a lot.
Let's think about it though.
To be able to expand later, why don't we make the episode the key of a hash, with another hash inside holding the scraped details?
episode: {
  title: _____,
  airdate: _____,
  rating: _____,
  plot: _____,
Wait a minute... there can be many writers, many directors, and characters paired with actors. The best way to handle multiples is an array you can drill into further. To save you the suspense, I chose a hash mapping each actor to an array of characters (I started with The Simpsons, where one person sometimes voices 8 different characters). The rest of the hash looks like this:
  writers: [_____],
  directors: [_____],
  cast: { actor: [_____] }
}
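Filled in with made-up sample values (the plot, writer, and director names below are purely illustrative, not scraped), the target structure looks like this in Ruby:

```ruby
# Hypothetical sample of the target data shape; all values are made up
# for illustration, not scraped from IMDB.
episode = {
  title: "Brain of the Future",
  airdate: "8 February 1997",
  rating: "7.9",
  plot: "Pinky and the Brain travel to the future.",
  writers: ["Example Writer"],
  directors: ["Example Director"],
  # cast maps each actor to the array of characters they voiced,
  # which handles one actor playing several parts
  cast: { "Maurice LaMarche" => ["The Brain"] }
}

episode[:cast].each do |actor, characters|
  puts "#{actor} voices #{characters.join(', ')}"
end
# => Maurice LaMarche voices The Brain
```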
For this walk through let's pick a new target... How about Pinky and the Brain, because who doesn't want to take over the world?
From previous wandering around IMDB, I noticed it separates the multiple characters played by a single actor with a " / ", so I went looking for an episode where someone has several parts and landed on:
https://www.imdb.com/title/tt0954793/?ref_=ttep_ep10
Brain of the Future.
Ok, now that we have where we want to start, let's head back to Ruby to see if we can get the page.
Again, because Ruby shines at OOP, let's make this a class that takes a URL as an argument when initializing, so we can reuse it with a different episode.
There will be a link to the full project, which has a class that contains everything. Build how you want to structure your data, this is by no means a practice set in stone. You may want other info or store it in a different way.
Just have fun with it!
Let's go all in on initialize and retrieve the page on initialization. Then let's create a function that parses the page and sets the result to an instance variable as well. I'm doing this because I want to keep everything modular.
You could skip storing @url as a variable, but I'm keeping it for when we might scrape many pages in one run and want a reference back to this location.
Why are we having air_date, title and other information as separate functions? If IMDB changes, only a small thing breaks and you will have an easier time picking over the data for what you want.
So here's what we have right now:
class Episode
  attr_reader :parsed_episode, :url, :page

  def initialize(url)
    @url = url
    @page = HTTParty.get(self.url)
    @parsed_episode = parse_page
  end

  # Parse the body of the HTTParty response, not the URL string itself
  def parse_page
    Nokogiri::HTML(page.body)
  end
end
With what we have now, if you call:
narf = Episode.new('https://www.imdb.com/title/tt0954793/?ref_=ttep_ep10')
You will get a return of the parsed page, it should look something like:
#(Document:0x3ff12fd1ba54 {
name = "document",
children = [
#(DTD:0x3ff12fd16ff4 { name = "html" }),
#(Element:0x3ff12e191824 {
name = "html",
attributes = [
#(Attr:0x3ff12e19085c { name = "xmlns:og", value = "http://ogp.me/ns#" }),
#(Attr:0x3ff12e190848 {
name = "xmlns:fb",
That goes on for a while; that's the information Nokogiri read from the GET request.
I set my variable narf to the new instance since I'm in pry and have some digging to do. Nokogiri has a lot of neat finder methods, but I'm partial to #css; it really teaches you how to fine-tune your selectors for when you have to do styling. Other finders are #at_css (returns the first result for that CSS selector) and #xpath (lets you use an XPath expression instead of CSS).
Anyway, we have the page and a slight idea of how to select. How are we going to proceed?
Let's start with the title. Hop back onto that website (I recommend Google Chrome because its dev tools make the most sense to me, but most browsers have an inspect function).
If you inspect the element and click on the title you will see:
<h1 class="">Brain of the Future </h1>
Think... the class is empty, so how am I going to pick this out? Wait... for search engine optimization, a website should only have a single h1 tag (not a rule, but a guideline most websites stick to). Let's try it out.
[21] pry(main)> narf.css('h1')
NoMethodError: undefined method `css' for #<Episode:0x00007fe25c2db3d8>
from (pry):19:in `__pry__'
Whoops... we need to get at that CSS, which lives in the parsed page; narf is only our created instance. Let's make it easy and set the parsed level to another variable so you don't have to chain so many methods:
css_level = narf.parsed_episode
Try it again
[22] pry(main)> css_level.css('h1')
=> [#<Nokogiri::XML::Element:0x3ff12fd4bc18 name="h1" attributes=[#<Nokogiri::XML::Attr:0x3ff12fd4bba0 name="class">] children=[#<Nokogiri::XML::Text:0x3ff12fd4b6dc "Brain of the Future ">]>]
SUCCESS!!! But what is that exactly? It looks like an array of information. This will be useful for other tasks, but we can see the title right there! The important thing is that it behaves like an array (a Nokogiri node set, which has some neat methods of its own). We only got one result here, but later on we may get several. Let's map Nokogiri's #text method over it and see what happens.
[24] pry(main)> css_level.css('h1').map { |ele| ele.text }
=> ["Brain of the Future "]
[25] pry(main)>
Closer still. Ruby comes with the handy .strip method (and .chomp for trailing newlines) to get rid of that pesky whitespace, so throwing a .strip at the end of that ele.text will give you:
=> ["Brain of the Future"]
Perfect we have our title.
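For the curious, .strip and .chomp are not quite the same. A quick comparison (plain Ruby, nothing IMDB-specific):

```ruby
title = "Brain of the Future \n"

# .chomp only removes a trailing newline, so the trailing space survives
title.chomp  # => "Brain of the Future "

# .strip removes all leading and trailing whitespace, which is what we want
title.strip  # => "Brain of the Future"
```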
Another thing you can map is .to_html instead of .text. If you're trying to output the information into a newly written website, that would be the way to go, but here we're just going to write it to a JSON file in the end.
I'd like the title to be a function in itself; again, you never know. So let's add the code:
def title
  self.parsed_episode.css('h1').map { |ele| ele.text.strip }[0]
end
Awesome, we have a title function. In case you were wondering why the [0] is there: we get back an array. Here we only expect one result, but you'll see what to do with more items in part 2 of this walkthrough.
A sidenote on how I'm entering pry: I have a run file that requires pry and my Ruby file, sets narf = Episode.new('https://www.imdb.com/title/tt0954793/?ref_=ttep_ep10') and css_level = narf.parsed_episode, and then hits a binding.pry.
Awesome. We're on a roll, so let's look at the airdate:
<a href="/title/tt0954793/releaseinfo?ref_=tt_ov_inf" title="See more release dates">Episode aired 8 February 1997
</a>
Hmm, that's not very descriptive, but check another episode page to see what's unique... I don't see any other tags with title="See more release dates". Let's see if we can CSS-select that. Ok, time to throw something regex-like in there since we're lazy: css_level.css('[title^="See more"]')
What does that even mean?!?!?!
Many websites have their own attributes beyond class and id; a way around that is the bracket notation. Be careful about your ' and " though, you don't want to escape too soon.
And what's with that ^? It means "starts with" (did you really want to type that whole thing out?). Another regex-style character CSS selectors borrow is $, which means "ends with".
[5] pry(main)> css_level.css("[title^='See more']")
=> [#<Nokogiri::XML::Element:0x3fdfeef433ec name="a" attributes=[#<Nokogiri::XML::Attr:0x3fdfeef43338 name="href" value="/title/tt0954793/releaseinfo">, #<Nokogiri::XML::Attr:0x3fdfeef43324 name="title" value="See more release dates">] children=[#<Nokogiri::XML::Text:0x3fdfeef42988 "Episode aired 8 February 1997\n">]>]
OOO, success... now let's get the text... oh wait, we've already done that once before. Let's just make it a function, shall we?
def get_text(css)
  css.map { |ele| ele.text.strip }
end
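The helper only needs elements that respond to #text, so you can sanity-check it without hitting a real page. Here a Struct stands in for Nokogiri elements (a made-up example, not part of the scraper itself):

```ruby
def get_text(css)
  css.map { |ele| ele.text.strip }
end

# A stand-in for Nokogiri elements: anything responding to #text works.
FakeNode = Struct.new(:text)
nodes = [FakeNode.new("Brain of the Future "), FakeNode.new(" 8 February 1997")]

get_text(nodes)  # => ["Brain of the Future", "8 February 1997"]
```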
So let's fix our title and add airdate
def title
  get_text(parsed_episode.css('h1'))[0]
end

def airdate
  get_text(parsed_episode.css("[title^='See more']"))[0]
end
Now that you hopefully understand this a little, we'll quickly do one more and leave the more complicated data for next time.
For general information, since we have been very specific so far: the most commonly used CSS selectors start with "." (which selects a class) and "#" (which selects an id), while 'p' selects a paragraph and 'div' selects a div. But what's the fun in being that general? Learning specificity will help you when building your own website. You can also cheat in dev tools: find what you want and copy its CSS selector path... no fun.
Let's try the rating next: <span itemprop="ratingValue">7.9</span>
Oh great, ratingValue seems specific, and we just learned the trick for custom attributes, so you guessed it:
def rating
  get_text(parsed_episode.css('[itemprop="ratingValue"]'))[0]
end
The rest involves a little more sorting (even some logic with .next_element, real fancy stuff), and you have sat through enough of my bad jokes already. I'll save that for the next blog.
If you want to check out the full code for this walk-through you can head on over to:
https://github.com/AmSmo/webscraper_narf
Next episode we will build on this with writers, directors, and cast. With all that information we will then write it to a JSON file.
If you would like to see where I will eventually go with this, I created an Active Record database seeded with the JSON file I built from this web scraper; you can find that at https://github.com/AmSmo/tvdb