Getting Page Titles for URLs with Ruby

Brian P. Hogan

Posted on April 22, 2022

If you have a URL and you'd like to get the title for the page, you'll need to fetch the URL, parse the source, and grab the text from the <title> tag. If you have a large number of URLs, you'll want to automate this process.

You can do this with a few lines of Ruby and the Nokogiri library. In this tutorial you'll create a small Ruby program to fetch the title for a web page. Then you'll modify the program to work with an external file of URLs. Finally, you'll make it more performant by using threads.

Fetching the Title from the URL

The Nokogiri library lets you use CSS selector syntax to grab nodes from HTML, and the open-uri library lets you read a URL as if it were a local file, much like the cURL command does.

To do this, install the nokogiri gem. The open-uri library ships with Ruby's standard library, so there's nothing extra to install for it:

gem install nokogiri

Create a Ruby script called get_titles.rb and add the following code to load the libraries, open a URL as a file, send its contents to Nokogiri, and extract the value of the <title> tag:

require 'nokogiri'
require 'open-uri'

url = "https://google.com" 
URI.open(url) do |f|
  doc = Nokogiri::HTML(f)
  title = doc.at_css('title').text
  puts title
end

Save the file and run the program:

ruby get_titles.rb

The result shows the page title for Google:

Google

To do this for multiple URLs, put the URLs in an array manually, or get them from a file.
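If you only have a handful of URLs, a hard-coded array works fine. Here's a minimal sketch of that approach, using a couple of placeholder URLs:

require 'nokogiri'
require 'open-uri'

# Placeholder URLs; swap in your own.
urls = ["https://google.com", "https://example.com"]

urls.each do |url|
  doc = Nokogiri::HTML(URI.open(url))
  puts doc.at_css('title').text
end

For anything longer than a few links, though, a file is easier to maintain.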

Reading URLs from a File

You may already have the list of URLs in a file, perhaps from a data export. Using Ruby's File.readlines, you can quickly convert that file into an array.

Create a new file called links.txt and add a couple of links. Make one of them a bad URL so you can verify that your error handling works.

https://google.com
https://devto

Save the file.

Now return to your get_titles.rb file and modify the code so it reads the file in line by line and uses each line as a URL:

# get_titles.rb
require 'nokogiri'
require 'open-uri'

lines = File.readlines('links.txt')
lines.each do |line|
  url = line.chomp
  URI.open(url) do |f|
    doc = Nokogiri::HTML(f)
    title = doc.at_css('title').text
    puts title
  end
rescue SocketError
  puts "#{url}: can't connect. Bad URL?"
end

Each line from the file will have a line break at the end, which you remove with the .chomp method before storing the value in the url variable.

The URI.open method raises a SocketError if it can't connect, so you rescue that error and print a sensible message instead.
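Note that SocketError only covers connection failures. If a URL resolves but the server responds with a 404 or another error status, open-uri raises OpenURI::HTTPError instead. The tutorial sticks with SocketError, but if you want to handle that case too, a sketch like this would work (the example URL is hypothetical):

require 'nokogiri'
require 'open-uri'

def print_title(url)
  doc = Nokogiri::HTML(URI.open(url))
  puts doc.at_css('title').text
rescue SocketError
  puts "#{url}: can't connect. Bad URL?"
rescue OpenURI::HTTPError => e
  # e.message includes the status line, such as "404 Not Found".
  puts "#{url}: request failed (#{e.message})"
end

# Hypothetical URL that returns an error status.
print_title('https://example.com/no-such-page')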

Save the file and run the program again:

ruby get_titles.rb

This time you see Google's page title for the first URL, and the error message for the second:

Google
https://devto: can't connect. Bad URL?

This version isn't the fastest when your list gets large. On a file with 200 URLs, the process took about two minutes, and most of that time was network latency: each request has to resolve the host, connect, and wait for a response before the next one can start.

Let's make it faster.

Processing URLs Concurrently

To make this process more efficient, and much faster, you'll need to use threads. And if you use threads, you'll need to think about thread pooling, because spawning too many threads at once will exhaust system resources.

The concurrent-ruby gem makes this much less complex by giving you promises in Ruby, which have their own pooling mechanism.

Install the concurrent-ruby gem:

gem install concurrent-ruby

To use it, you'll create a "job" for each line in the file. Each job is a promise which takes a block containing the code you want to execute. Then, following the loop, you combine all of the promises with zip and call the value method on the result, which blocks until every promise has finished. The pattern looks like this:

# Create a job for each line
jobs = lines.map do |line|
  Concurrent::Promises.future do
    # do the work
  end
end

# Combine the jobs and block until they all finish.
Concurrent::Promises.zip(*jobs).value

Modify the program to include the concurrent library and create a promise for each URL read. Then get the results:

# get_titles.rb
require 'nokogiri'
require 'open-uri'
require 'concurrent'

lines = File.readlines('links.txt')
jobs = lines.map do |line|
  Concurrent::Promises.future do
    url = line.chomp

    URI.open(url) do |f|
      doc = Nokogiri::HTML(f)
      title = doc.at_css('title').text
      puts title
    end
  rescue SocketError
    puts "#{url}: can't connect. Bad URL?"
  end
end

Concurrent::Promises.zip(*jobs).value

In this version of the program, you're printing the results from inside each promise. But you could return a value from each promise instead and print the collected results afterward.
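Here's a minimal sketch of that variation, assuming the same links.txt file as before. Each future returns a string instead of printing it, and zip gathers all of the values into an array once they're done:

# get_titles.rb
require 'nokogiri'
require 'open-uri'
require 'concurrent'

lines = File.readlines('links.txt')
jobs = lines.map do |line|
  Concurrent::Promises.future do
    url = line.chomp
    doc = Nokogiri::HTML(URI.open(url))
    # The last expression becomes the future's value.
    "#{url}: #{doc.at_css('title').text}"
  rescue SocketError
    "#{url}: can't connect. Bad URL?"
  end
end

# value blocks until every future has finished, then returns their values.
results = Concurrent::Promises.zip(*jobs).value
puts results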

This time, a file with 200 URLs took around 3 seconds to process. That's a significant speed improvement and demonstrates why concurrent processing is important for these kinds of tasks.
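By default, these futures run on concurrent-ruby's global thread pool. If you'd rather cap how many requests run at once, for example to avoid hammering a single host, you can pass your own executor with future_on. Here's a sketch of that idea; the pool size of 10 is an arbitrary choice:

require 'nokogiri'
require 'open-uri'
require 'concurrent'

# A fixed pool of 10 threads, so at most 10 requests run at the same time.
pool = Concurrent::FixedThreadPool.new(10)

lines = File.readlines('links.txt')
jobs = lines.map do |line|
  Concurrent::Promises.future_on(pool) do
    url = line.chomp
    doc = Nokogiri::HTML(URI.open(url))
    puts doc.at_css('title').text
  rescue SocketError
    puts "#{url}: can't connect. Bad URL?"
  end
end

Concurrent::Promises.zip(*jobs).value
pool.shutdown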

Conclusion

In this tutorial you used Ruby to get page titles from URLs, and you then optimized it using the concurrent-ruby library to take advantage of threads and thread pooling.

To keep exploring, read in the data from a CSV file and use the program to generate a new CSV file with the URL and the title in separate columns.

Then see if you can pull additional information out of the URLs, such as the <meta> descriptions.
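If you'd like a starting point for those exercises, here's a rough sketch using Ruby's built-in csv library. It assumes a hypothetical urls.csv with one URL per row and writes titles.csv with url, title, and description columns:

require 'csv'
require 'nokogiri'
require 'open-uri'
require 'concurrent'

# Assumes urls.csv has one URL per row in the first column.
rows = CSV.read('urls.csv')

jobs = rows.map do |row|
  Concurrent::Promises.future do
    url = row[0].to_s.strip
    doc = Nokogiri::HTML(URI.open(url))
    title = doc.at_css('title')&.text
    # Not every page has a meta description, so guard against nil.
    description = doc.at_css('meta[name="description"]')&.[]('content')
    [url, title, description]
  rescue SocketError
    [url, nil, nil]
  end
end

results = Concurrent::Promises.zip(*jobs).value

CSV.open('titles.csv', 'w') do |csv|
  csv << %w[url title description]
  results.each { |row| csv << row }
end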

Like this post? Support my writing by purchasing one of my books about software development.
