Brian P. Hogan
Posted on April 22, 2022
If you have a URL and you'd like to get the title for the page, you'll need to fetch the URL, parse the source, and grab the text from the <title> tag. If you have a large number of URLs, you'll want to automate this process.
You can do this with a few lines of Ruby and the Nokogiri library. In this tutorial you'll create a small Ruby program to fetch the title for a web page. Then you'll modify the program to work with an external file of URLs. Finally, you'll make it more performant by using threads.
Fetching the Title from the URL
The Nokogiri library lets you use CSS selector syntax to grab nodes from HTML, and the open-uri library lets you read the contents of a URL as if it were a file, much like the cURL command does.
The open-uri library ships with Ruby, so the only gem you need to install is nokogiri:
gem install nokogiri
Create a Ruby script called get_titles.rb and add the following code to load the libraries, open a URL as a file, send its contents to Nokogiri, and extract the value of the <title> tag:
require 'nokogiri'
require 'open-uri'

url = "https://google.com"
URI.open(url) do |f|
  doc = Nokogiri::HTML(f)
  title = doc.at_css('title').text
  puts title
end
Save the file and run the program:
ruby get_titles.rb
The result shows the page title for Google:
Google
To do this for multiple URLs, put the URLs in an array manually, or get them from a file.
Reading URLs from a File
You may already have the list of URLs in a file, which may have come from a data export. Using Ruby's File.readlines, you can quickly convert the file into an array.
Create a new file called links.txt and add a couple of links. Make sure one of them is a bad URL so you can test the error handling.
https://google.com
https://devto
Save the file.
Now return to your get_titles.rb file and modify the code so it reads the file in line-by-line, and uses each line as a URL:
# get_titles.rb
require 'nokogiri'
require 'open-uri'

lines = File.readlines('links.txt')
lines.each do |line|
  url = line.chomp
  URI.open(url) do |f|
    doc = Nokogiri::HTML(f)
    title = doc.at_css('title').text
    puts title
  end
rescue SocketError
  puts "#{url}: can't connect. Bad URL?"
end
Each line from the file has a line break at the end, which you remove with the .chomp method before storing the value in the url variable. Unlike .chop, which always removes the last character, .chomp removes only a trailing line break, so the last URL stays intact even if the file doesn't end with a newline.
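If your file comes from a data export, it may also contain blank lines, stray spaces, or Windows-style line endings. Here's a minimal sketch of a more forgiving reader. The read_urls helper is a hypothetical name introduced for illustration, and it uses only Ruby's standard library:

```ruby
# Hypothetical helper: reads one URL per line and ignores blank lines.
def read_urls(path)
  # chomp: true drops the line break from each line as the file is read.
  lines = File.readlines(path, chomp: true)
  # strip removes leftover whitespace; reject drops lines that end up empty.
  lines.map(&:strip).reject(&:empty?)
end

# Print the cleaned list if the file from this tutorial is present.
puts read_urls('links.txt') if File.exist?('links.txt')
```

You could then loop over the returned array instead of cleaning each line inside the loop.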
The URI.open method raises a SocketError if it can't connect, so you rescue that error and print a sensible message.
Save the file and run the program again:
ruby get_titles.rb
This time you see Google's page title for the first URL, and the error message for the second:
Google
https://devto: can't connect. Bad URL?
This version isn't the fastest when your list gets large. On a file with 200 URLs, the process took 2 minutes. Most of that time is network latency: the program waits for each request to finish before it starts the next one.
Let's make it faster.
Processing URLs Concurrently
To make this process more efficient, and much faster, you'll need to use threads. And if you use threads, you'll need to think about thread pooling because if you use too many threads you'll run out of resources.
The concurrent-ruby gem makes this much less complex by giving you promises in Ruby, which have their own pooling mechanism.
Install the concurrent-ruby gem:
gem install concurrent-ruby
To use it, you'll create a "job" for each line in the file. Each job is a promise which takes a block containing the code you want to execute. Then, following the loop, you collect all of the promises and call the value! method, which blocks until every promise is complete and raises an error if any of them failed. The pattern looks like this:
# Create a job for each line
jobs = lines.map do |line|
  Concurrent::Promises.future do
    # do the work
  end
end

# get all the jobs, blocking until they all finish.
Concurrent::Promises.zip(*jobs).value!
Modify the program to include the concurrent library and create a promise for each URL read. Then get the results:
# get_titles.rb
require 'nokogiri'
require 'open-uri'
require 'concurrent'

lines = File.readlines('links.txt')
jobs = lines.map do |line|
  Concurrent::Promises.future do
    # chomp removes only the trailing line break
    url = line.chomp
    URI.open(url) do |f|
      doc = Nokogiri::HTML(f)
      title = doc.at_css('title').text
      puts title
    end
  rescue SocketError
    puts "#{url}: can't connect. Bad URL?"
  end
end

# value! blocks until every job finishes and raises if a job failed
# with an error the block didn't rescue.
Concurrent::Promises.zip(*jobs).value!
In this version of the program, each job prints its result to the screen as it finishes. You could instead return a value from each job and print the values after all of the jobs complete.
This time, a file with 200 URLs took around 3 seconds to process. That's a significant speed improvement and demonstrates why concurrent processing is important for these kinds of tasks.
Conclusion
In this tutorial you used Ruby to get page titles from URLs, and you then optimized it using the concurrent-ruby library to take advantage of threads and thread pooling.
To keep exploring, read in the data from a CSV file and use the program to generate a new CSV file with the URL and the title in separate columns.
Then see if you can pull additional information out of the URLs, such as the <meta> descriptions.
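As a starting point for the CSV exercise, here's a minimal sketch of the output side using Ruby's csv library. The titles_to_csv helper and the column names are assumptions made for illustration, and the input is a hard-coded pair rather than real results:

```ruby
require 'csv'

# Hypothetical helper: turns [url, title] pairs into CSV text
# with a header row.
def titles_to_csv(rows)
  CSV.generate do |csv|
    csv << %w[url title]
    rows.each { |url, title| csv << [url, title] }
  end
end

puts titles_to_csv([['https://google.com', 'Google']])
```

To write a file instead of a string, you could pass the same rows to CSV.open with a filename and the "w" mode.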