Build a web scraper with Go
Claud
Posted on August 12, 2023
Introduction ๐ฃ
In this tutorial, we will walk through the process of building a web scraper using the Colly library in the Go programming language. This scraper will extract recipe data from a specific website. We'll cover the code step by step to understand each component.
What is a web-scraper
A web-scraper is, as the name suggests, a tool you can use to โscrapeโ or extract data from websites.
For this project, I decided to scrape the Italian recipe website Giallo Zafferano.
Getting started
To get started
๐ป your IDE of choice - Iโm using VSCode.
๐ช some snacks because all this recipe scraping will make you hungry.
Let's get coding ๐
Install Go
Firstly, let's set up our Go project. As this is the first article of the Go Learning Series, I'll spend a few seconds talking you through installing Go.
If you have installed it already, then scroll through โญ๏ธ
Depending on your operating system, you can find your installation guide on Go documentation page. If you're a macOS user and you use brew
, you can run it in your terminal:
brew install go
Set up the project
Create a new directory for your project, move to the directory then run the following command, where you can replace the word webscraper
with anything you want your module to be named.
go mod init webscraper
๐ก
The go mod init
command initializes a new Go module in the directory where it is executed. It creates a new go.mod
file, which is used to define dependencies and manage versions of third-party packages used in the project (๐ kinda like a package.json if you're using node).
Now let's install colly
and its dependencies
go get github.com/gocolly/colly
This command will also update the go.mod
file with all the required dependencies as well as create a go.sum
file.
๐ก
Colly is a Go package that allows you to write both web scrapers and crawlers. It is built on top of Go's net/HTTP package for network communication, and goquery
, which provides a "jQuery-like" syntax for targeting HTML elements.
Scraper logic
We are ready to start. Create a scraper.go
file in your directory, and set up your main
function.
package main
import (
// import Colly
"github.com/gocolly/colly"
)
func main(){
}
๐ง
If you haven't used Go before, this might look a bit weird. If you're asking yourself what is main? where does that package come from? then check out my other blog post that goes into detail on this.
Using Colly
The bulk of this program will be handled by Colly and the Collector object which manages the network communication and is responsible for the execution of the attached callbacks.
To use Let's initialise a Collector inside your main function:
collector := colly.NewCollector()
You can customise the parameters of your Collector to suit different needs. We will be using the default setup, but if you're curious you can check out the Colly documentation for more.
Firstly, let's start with the most simple case to see if our set-up is working: we want to visit a URL and print in our console the URL we are visiting. This is the URL I want to use: https://ricette.giallozafferano.it/Schiacciata-fiorentina.html
๐ก
The code in this program is catered to work on that URL (and other recipes from that website) - it won't work on any other URL so use the same URL if you are following along.
Using command-line arguments
In every tutorial I checked out before building this scraper, they copy-pasted the url locally in the file and then passed it to the Collector calling collector.Visit("their_url_string_here).
I didn't find that solution very elegant, and in the name of learning Go and experimenting, I looked up how to pass command line arguments to a Go app, so that I could pass the url as a command line argument, and scrape different urls from the same website without having to manually change the code.
To achieve this, we are using the os
package, so we're going to add this to the import, and fill in our main function with what we need so far:
A variable called
args
which will store the command line arguments we pass at run time.A variable called
url
to which we assign the value at index 1 from theargs
object. โ๏ธThis is where we're storing the URL that we pass in the command line.A Colly collector object.
package main
import (
"os"
// import Colly
"github.com/gocolly/colly"
)
func main(){
args := os.Args
url := args[1]
collector := colly.NewCollector()
}
Now that we have the building blocks of our program, let's try it out and see if it works by setting up callbacks for our collector object. Just so you know, everything we are going to write from now on is to be placed inside the main
function. For our print statements, we are using the package fmt
- it stands for "formatting" and you can import it in the import function as shown in the code block below.
package main
import (
"fmt"
"os"
// import Colly
"github.com/gocolly/colly"
)
func main() {
args := os.Args
url := args[1]
collector := colly.NewCollector()
// whenever the collector is about to make a new request
collector.OnRequest(func(r *colly.Request) {
// print the url of that request
fmt.Println("Visiting", r.URL)
})
collector.OnResponse(func(r *colly.Response) {
fmt.Println("Got a response from", r.Request.URL)
})
collector.OnError(func(r *colly.Response, e error) {
fmt.Println("Blimey, an error occurred!:", e)
})
collector.Visit(url)
}
๐
If you haven't used callbacks before, here's a brief explanation of what is happening there. I'll use the first callback as an example (OnRequest
), but the explanation applies to each of them.
The
OnRequest
is a method of the Collector object, and it registers a function to be called whenever a new request is about to be made by the collector.We are passing it an anonymous function that takes a
*colly.Request
object as an argument (representing the request that is about to be made). Inside this function, we are using thefmt.Println
function to print out the message "Visiting" along with the URL of the request that is currently being processed in the command line.
Essentially, whenever the collector is about to make a request, this function will be executed, and it will print a message indicating that the collector is visiting a particular URL. This can be useful for debugging and monitoring the progress of the web scraping process, allowing you to see which URLs are being visited by the collector. Similarly, the OnResponse
and OnError
methods will trigger when the collector is about to receive a response or raise an error respectively.
In your terminal, run
go run scraper.go https://ricette.giallozafferano.it/Schiacciata-fiorentina.html
and you should see the following being printed back:
Visiting https://ricette.giallozafferano.it/Schiacciata-fiorentina.html
Got a response from https://ricette.giallozafferano.it/Schiacciata-fiorentina.html
๐จ Now comes the fun
To write our main logic, we are going to use the OnHTML
callback. To fully understand this, I recommend you spend some quality time with the Colly documentation. The function that we define in this callback will be run when the collector encounters an HTML element that you can specify as a css-selector. In the example below, the function will be run when the collector finds an element with class name "div-main".
collector.OnHTML(".div-main", func(e *colly.HTMLElement) {
// Your code to handle the HTML element matching the selector
})
To know which css-selector to target, we'll need to spend some time inspecting the webpage we are going to scrape. To do this, use your browser web inspector and look at the elements of the content you want to extract.
While you can multiple OnHtml callbacks in your program, each handling a different css-selector for the multiple elements you want to scrape, I felt it was cleaner to have one callback that triggers for the main
element of the page, and to handle the child elements scraping cascading within the same function.
Let's do a little recap before we move on:
We have our collector
We know how to scrape the url and we know how to pass the url
But what are we going to do with the data scraped?
Since we are scraping a recipe, it only makes sense to create a struct
to store the recipe features we are interested in extracting.
Defining Data Structures
I spent some time looking at the recipe page and deciding which elements I wanted to extract: the recipe name, the recipe specs (such as difficulty, preparation time, cooking time, serving size and price tier), and the ingredients.
Firstly, I defined a struct
for the recipe, with url
and name
as strings. Then I had to think of which data structure to use for the recipe specs and the ingredients. Different recipes will have different combinations of ingredients, but will always have the same types of specs, just with different values, so it made sense to use different data structures for the two.
As the recipe specs are fixed fields across different recipes, such as the name and URL fields, I decided to create a separate struct to hold the
RecipeSpecs
For the ingredients, however, I needed a more flexible data structure to which I could append key-value pairs, as I didn't know how many or which keys I would encounter in the ingredient list. To achieve this, I created a Dictionary object of type
map
which defines a mapping of key-value pairs from string to string type.
type Dictionary map[string]string
type RecipeSpecs struct {
difficulty, prepTime, cookingTime, servingSize, priceTier string
}
type Recipe struct {
url, name string
ingredients []Dictionary
specifications RecipeSpecs
}
Phew, with that out of the way, I was ready to save the scraped data to my struct.
Have a look at the code below and in-line comments for explanations of how I put it together. Note that the recipe is in Italian, but thanks to the Latin influence onto the English language, a lot of the words are very similar to English:
Difficoltร : difficulty
Preparazione: preparation
Cottura: cooking (aka cooking time)
Dosi per: doses for (aka serving size)
Costo: cost
// initialise a slice of type Recipe (like a list of recipes)
// this way we'll be able to append each recipe to it,
// and access the recipes outside the scope of the callback function.
var recipes []Recipe
c.OnHTML("main", func(main *colly.HTMLElement) {
// initialise a new recipe struct every time we visit a page
recipe := Recipe{}
// initialise a new Dictionary object to stoer the ingredients mappings
ingredients_dictionary := Dictionary{}
// assign the value of URL (the url we are visiting) to the recipe field
recipe.url = url
// find the recipe title, assign it to the struct, and print it in the command line
recipe.name = main.ChildText(".gz-title-recipe")
println("Scraping recipe for:", recipe.name)
// iterate over each instance of a recipe spec
// and assign its value to the recipe spec struct and the recipe
main.ForEach(".gz-name-featured-data", func(i int, specListElement *colly.HTMLElement) {
if strings.Contains(specListElement.Text, "Difficoltร : ") {
recipe.specifications.difficulty = specListElement.ChildText("strong")
}
if strings.Contains(specListElement.Text, "Preparazione: ") {
recipe.specifications.prepTime = specListElement.ChildText("strong")
}
if strings.Contains(specListElement.Text, "Cottura: ") {
recipe.specifications.cookingTime = specListElement.ChildText("strong")
}
if strings.Contains(specListElement.Text, "Dosi per: ") {
recipe.specifications.servingSize = specListElement.ChildText("strong")
}
if strings.Contains(specListElement.Text, "Costo: ") {
recipe.specifications.priceTier = specListElement.ChildText("strong")
}
})
// find the recipe introduction and ingredients and assign to the struct
main.ForEach(".gz-ingredient", func(i int, ingredient *colly.HTMLElement) {
ingredients_dictionary[ingredient.ChildText("a")] = ingredient.ChildText("span")
})
recipe.ingredients = append(recipe.ingredients, ingredients_dictionary)
recipes = append(recipes, recipe)
})
// finally, run the scraper
collector.Visit(url)
Conclusion ๐
And this is it! Congratulations! You've successfully built a web scraper in Go using the Colly library, and learned some Italian words ๐ฎ๐น
You can now apply this knowledge to scrape data from other websites or enhance the scraper's capabilities.
Remember that web scraping should be done responsibly and in accordance with the website's terms of use. Always be respectful of the website's resources and consider implementing rate limiting and error handling mechanisms to ensure a smooth scraping process if you are building a more complex application.
I will revisit this project in the future and look to implement more functions, such as:
Add a module to save the scraped data locally (such as a csv file).
Cache pages so subsequent runs don't have to download the same page again.
What else should I implement? And did you enjoy this blog post? Leave a comment with your thoughts, and see you next time ๐
๐
You can find the full codebase on my Github in the Webscraper repository.
Posted on August 12, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.