Web scraping in Ruby, great practice for aspiring Web Developers

amsmo

Adam Smolenski

Posted on August 16, 2020

When I was looking into learning to code, I was trying to figure out if there was a project I could do that would touch on multiple areas of coding. Really, the ideal was to find something that would give me a basic knowledge of a broad variety of topics, so I could see what really interested me and what I wanted to research further. I came across web scraping and thought it was the perfect start.


Here are a few things you have to research:

  1. HTML
  2. CSS
  3. A programming language, you need something to interact with the websites
    • Going through different datatypes, making decisions about them
    • Finding new and different libraries (what's up doc?)
    • How to keep and use that info (make your own API.... FUN!)

What I really wanted to scrape was recipes, because I am often baking and my hands are covered in dough... the page refreshes or an ad pops up, I wash my hands, and then it happens all over again. It really is a scene befitting Scooby and Shaggy.

Scooby Doo farce running through random doors

(gif from The Nerd Daily)

You can scrape any website for information. A walkthrough of scraping IMDB for episode information will be up next.

Let's talk about why I think Ruby is great for this. It has several tools at its disposal: Nokogiri, HTTParty and, what I find separates Ruby from my scraping experience in Python, the Pry library.

Ruby forces you to make decisions that really are foundational in Object Oriented Programming. In Python I found myself able to use procedural programming as a crutch when starting out. In Ruby, a local variable has to be passed into a method, otherwise it is out of scope inside it, so it really makes you want to turn everything into a class and use instances. You start to think about how to work with a variable rather than just on a variable. Why is that important? You start to think in terms of modularity and reusability of code. Future you will thank you for that.

So how does Pry play into that? Well, making a run file that declares the instance, with a binding.pry at the end, lets you explore that variable. In web scraping, your basic information is the entirety of the HTML document. So when you are in Pry, it's loaded into memory and through trial and error you can find what you want. If you strike gold, you add that to your class as a method and continue on.

Ruby also has regular expressions built in, so if you need to parse out specific parts or search for keywords, that functionality is already there. It can be paired with CSS selectors to add an extra layer of adaptability to your program.
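As a small illustration of the built-in regex support, here is a sketch that pulls label and value pairs out of a line of text. The text itself is invented; imagine it came out of a recipe page after a CSS selector narrowed things down.

```ruby
# Hypothetical text, as it might look after being pulled out of a recipe page.
text = "Prep: 15 mins | Cook: 1 hr 5 mins | Servings: 12"

# Named captures grab a label and its value from each "Label: value" pair.
pair_pattern = /(?<label>\w+):\s*(?<value>[^|]+)/

# scan returns one [label, value] array per match; to_h turns them into a hash.
details = text.scan(pair_pattern).to_h { |label, value| [label.downcase, value.strip] }

# A keyword search is a one-liner with match?
mentions_servings = text.match?(/servings/i)

puts details["cook"]
puts mentions_servings
```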

Through learning to web scrape I came across several other useful libraries, Selenium being one of them. This library allows you to run and control a browser from your code or the command line. In my project I ended up using it just to double check that I was indeed navigating the scraper to the correct location. It can also send information to the browser, automating clicks and filling in forms for you. Real handy for automation.

Web scraping also makes you think about data types. How do you want to store the information you are finding? Many APIs return responses in JSON format, so that is a natural choice to learn. Deciding how to layer a hash with arrays, strings or integers will really get you thinking about how to manipulate this data and engineer ways to retrieve the information. Ruby has a great JSON gem that will let you write to a JSON file and later read it back, if you want to return to your scraped information in a persisting form (a possible future post: turning a JSON file into a database and how to think that through, still one step at a time though). More on this when we go through an example.
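To make the hash, array and JSON idea concrete, here is a minimal sketch using Ruby's standard json library. The episode titles, the file name, and the shape of the hash are placeholders, not data from a real scrape.

```ruby
require "json"
require "tmpdir"

# Placeholder data: a hash layering strings, integers, and arrays.
show = {
  "title" => "Pinky and the Brain",
  "seasons" => [
    { "number" => 1, "episodes" => ["Placeholder Episode One", "Placeholder Episode Two"] }
  ]
}

# Write into a temporary directory so the sketch leaves nothing behind in your project.
path = File.join(Dir.mktmpdir, "show.json")

# Write the structure out in a human-readable form...
File.write(path, JSON.pretty_generate(show))

# ...and read it back later as plain Ruby hashes and arrays.
loaded = JSON.parse(File.read(path))

# Retrieval means walking the layers: hash -> array -> hash -> array.
first_episode = loaded["seasons"][0]["episodes"][0]
puts first_episode
```

Using string keys throughout keeps the hash identical before and after the round trip, since JSON.parse returns string keys by default.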


Enough about why it's great for learning a programming language (Ruby being my favorite). What do you learn about HTML and CSS while scraping?

Well first thing I learned (again I started with trying to do recipes), was that http://www.allrecipes.com has a unicorn hidden on most pages. Sadly, they have not responded to my email asking why. Maybe the lesson is never question a unicorn?

<!doctype html>
<!--
        /((((((\\\\
=======((((((((((\\\\\
     ((           \\\\\\\
     ( (*    _/      \\\\\\\
       \    /  \      \\\\\\________________
        |  |   |       </                  ((\\\\
        o_|   /        /                      \ \\\\    \\\\\\\
             |  ._    (                        \ \\\\\\\\\\\\\\\\
             | /                       /       /    \\\\\\\     \\
     .______/\/     /                 /       /         \\\
    / __.____/    _/         ________(       /\
   / / / ________/`_________'         \     /  \_
  / /  \ \                             \   \ \_  \
 ( <    \ \                             >  /    \ \
  \/     \\_                           / /       > )
          \_|                         / /       / /
                                    _//       _//
                                   /_|       /_|
-->
<html lang="EN">
  <head> 
Allrecipes apparently has a great sense of humor with these hidden unicorns.

Anyway. Scraping makes you look at a lot of examples of HTML and occasionally see why some decisions were made. HTML is an extremely forgiving language to write, but someone who is looking at the HTML you write probably will be less so. Ugly HTML is awful to sort through, so it makes you aware of why some sites are easier to automate information gathering from. So what, you may ask? This really matters for screen readers and for making the web accessible for all. The more explicit the HTML, the more easily it is navigated. This also plays into CSS, with adding titles and everything.


Web scraping will force you to narrow down CSS selectors if you are trying to get a specific answer. Not everything will be accessible through an HTML tag alone. There may be many a tags with href attributes, but you will then see how class and id are used to tell them apart. What if there's a class of container and the information you want is in a div inside it or right after it? You learn how to use .class > div (a div that is a direct child) or .class + div (a div that immediately follows) and begin to differentiate through those selectors.

CSS specificity is also important when designing your own website, since it decides which rule wins when several selectors match the same element.


All in all you can learn the basics of all the tools you will be needing in the future in one fun project. What we will do tomorrow night:

Pinky and the Brain Try to take over the world

Or whenever the post is, we will get some episode information for Pinky and the Brain.


You can take a look at the code for Walkthrough Part 1 at https://github.com/AmSmo/webscraper_narf
If you would like to see where I will eventually go with this, I created an Active Record database seeded with the JSON file I built from this web scraper. You can find that at https://github.com/AmSmo/tvdb