Daniel McMahon
Posted on December 6, 2018
Lucas
Lucas is a webscraper built using Go and the Colly library. He's also an adorable spider on YouTube!
Why?
I wanted to experiment a little with Go, as it's a programming language used by certain teams at my office and not something I often get hands-on with. I figured it would be fun to try to deconstruct a random website and see if I could come up with a programmatic way to extract the HTML data in a way that could be perceived as useful.
The project started out as a fun experiment, but after taking it as far as I have, I decided to stop working on it for a few reasons, which I'll outline below.
The Morality of Web Scraping
As will be apparent to many out there, web scraping is a bit of a 'grey' area, morally speaking. From all the research and reading I've done on the subject, I usually boil it down to this:
'if you are not violating fair terms of use, are abiding by a company's robots.txt file and are notifying them that you are doing so, then you should be in the clear'
This is a highly generalised statement for a complex issue, but I find it's a nice guideline to try to follow.
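To make the robots.txt part of that guideline a little more concrete, here is a minimal sketch of checking a path against the Disallow rules that apply to all crawlers. The helper name disallowed and the sample rules are made up for illustration, and the parser is deliberately simplified: it only understands plain prefix rules under "User-agent: *" and ignores Allow lines and wildcards.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowed reports whether path is blocked for all crawlers ("User-agent: *")
// by the given robots.txt content. Hypothetical helper: it only handles plain
// prefix rules and ignores Allow lines and wildcards.
func disallowed(robots, path string) bool {
	applies := false // true while inside a group that covers "User-agent: *"
	prevUA := false  // true when the previous rule line was a User-agent line
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip comments
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue // blank or malformed line
		}
		key := strings.ToLower(strings.TrimSpace(parts[0]))
		value := strings.TrimSpace(parts[1])
		if key == "user-agent" {
			if !prevUA {
				applies = false // a new group starts here
			}
			if value == "*" {
				applies = true
			}
			prevUA = true
			continue
		}
		prevUA = false
		if key == "disallow" && applies && value != "" && strings.HasPrefix(path, value) {
			return true
		}
	}
	return false
}

func main() {
	// sample rules, invented for this example
	robots := "User-agent: *\nDisallow: /cart\nDisallow: /account\n"
	fmt.Println(disallowed(robots, "/cart.php"))       // true
	fmt.Println(disallowed(robots, "/Dresses-r9872/")) // false
}
```

A check like this only covers the robots.txt part; the terms of use and the notification still need a human.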
When it comes to e-commerce, websites can hold some highly profitable data: data which they may not necessarily surface on the frontend of their stores, but which is still floating around as JSON in their static page content.
Some types of information e-commerce stores may leave floating around in their HTML include:
- Stock levels (things like low in stock flags or even exact stock numbers!)
- Size availability
- Pricing/Sales Rates
For general users this data can go unnoticed, but it can be a simple case of right-clicking on a storefront and selecting 'view page source' to access it.
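As a rough illustration of how little is needed to read that kind of embedded data, the sketch below pulls a JSON object out of a script tag in a page's source. The tag type, the field names and the sample page are all assumptions made up for the example; a real store will use its own markup.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// scriptJSON matches the body of <script type="application/json"> tags.
// The tag type is an assumption for this example.
var scriptJSON = regexp.MustCompile(`(?s)<script[^>]*type="application/json"[^>]*>(.*?)</script>`)

// embeddedProduct is a hypothetical shape for the hidden product data.
type embeddedProduct struct {
	Name  string  `json:"name"`
	Stock int     `json:"stock"`
	Price float64 `json:"price"`
}

// extractProducts decodes every matching script body that is valid JSON
// and silently skips the ones that are not.
func extractProducts(html string) []embeddedProduct {
	var out []embeddedProduct
	for _, m := range scriptJSON.FindAllStringSubmatch(html, -1) {
		var p embeddedProduct
		if err := json.Unmarshal([]byte(m[1]), &p); err != nil {
			continue // not the JSON we are looking for
		}
		out = append(out, p)
	}
	return out
}

func main() {
	// invented page source standing in for 'view page source'
	page := `<html><body><h1>Dress</h1>
<script type="application/json">{"name":"Red Dress","stock":3,"price":29.99}</script>
</body></html>`
	for _, p := range extractProducts(page) {
		fmt.Printf("%s: %d in stock at %.2f\n", p.Name, p.Stock, p.Price)
	}
}
```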
The 'Drawbacks' of Structured data
When it comes to designing a web scraper you want to try to analyse the structure of your 'target' site's pages. Ask yourself these kinds of questions:
- Is there an easy pattern to follow to access product-specific pages?
- Is there a replicated structure across multiple pages to allow easy HTML parsing?
On the more technical side of things consider the following:
- Does the website update its HTML often?
- Can you account for missing data during your web scraping?
- How will you store the data? Will a DB be fast enough?
- Do you plan to handle multi-threading/distributed scraping?
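The missing-data question is the easiest one to show in code. Below is a small sketch of parsing helpers that report whether a usable value was found instead of assuming the page always has one. The names parsePrice and parseCode, the "€" prefix handling and the "Item Code #..." text format are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrice turns text such as "€ 29.99" into a float64. The second return
// value is false when the text is empty or not a number, so the caller can
// skip the record rather than store a misleading zero.
func parsePrice(text string) (float64, bool) {
	cleaned := strings.TrimSpace(strings.TrimPrefix(strings.TrimSpace(text), "€"))
	if cleaned == "" {
		return 0, false
	}
	price, err := strconv.ParseFloat(cleaned, 64)
	if err != nil {
		return 0, false
	}
	return price, true
}

// parseCode returns whatever follows the first "#" in the text,
// e.g. "ABC123" for "Item Code #ABC123" (an assumed format).
func parseCode(text string) (string, bool) {
	parts := strings.SplitN(text, "#", 2)
	if len(parts) < 2 || strings.TrimSpace(parts[1]) == "" {
		return "", false
	}
	return strings.TrimSpace(parts[1]), true
}

func main() {
	for _, raw := range []string{"€ 29.99", "", "Sold out"} {
		if price, ok := parsePrice(raw); ok {
			fmt.Printf("%q -> %.2f\n", raw, price)
		} else {
			fmt.Printf("%q -> skipped (no usable price)\n", raw)
		}
	}
	code, ok := parseCode("Item Code #ABC123")
	fmt.Println(code, ok)
}
```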
How the scraper works
So onto the fun part - the code! I will talk in generalised terms here so the practices used can be applied to any form of e-commerce store.
Data Storage
To store the scraped website's data I decided to roll with a Postgres DB, for no other reason than I was familiar with it and it was easy to set up via a docker-compose file.
lucas:
  container_name: lucas
  image: postgres:9.6-alpine
  ports:
    - '5432:5432'
  environment:
    POSTGRES_DB: 'lucas_db'
    POSTGRES_USER: 'user'
With this basic Postgres DB running, I was able to set up a basic table by running the following command against the input file dbsetup.sql:
psql -h localhost -U user lucas_db -f dbsetup.sql
dbsetup.sql:
\c lucas_db
CREATE TABLE floryday(
  index serial,
  product text,
  code text,
  description text,
  price decimal(53, 4)
);
As you can see from the table there were a few basic details I decided to scrape from the web pages in question:
- index: this was just used as a unique id
- product: name of the product item
- code: code of the product in question
- description: the description text of the product
- price: the price of the product
As listed above, there are additional fields you might be interested in extracting in your own experiments, like size and availability.
Go Dependencies
I was a little sloppy in my service setup in that I did not rely on a Go dependency management tool like dep; instead I just installed the dependencies manually (as there were only 3, it wasn't so bad). These were the three main external libraries I used; they can be installed with the command go get <repo-name>:
- github.com/gocolly/colly - webscraping package
- github.com/fatih/color - command line colors package
- github.com/lib/pq - postgres driver package
To make this a little easier I set up a Dockerfile to keep track of the installation:
FROM golang:1.11
MAINTAINER Daniel McMahon <daniel40392@gmail.com>
WORKDIR /opt/lucas
ADD . /opt/lucas
ENV PORT blergh
# installing our golang dependencies
RUN go get -u github.com/gocolly/colly && \
    go get -u github.com/fatih/color && \
    go get -u github.com/lib/pq
EXPOSE 8000
CMD go run lucas.go
The main logic
In short the main code does the following:
- Starts at a seed url
- Scans all the links on the page
- Looks for pages whose URL matches a certain pattern (e.g. -Dress-); we know from some basic checks that these pages all share a similar structure and are usually the product pages we are interested in
- Defines a Struct to represent our clothing values of interest
- Writes the Struct to the postgres DB
- Continues up to a size of 200 writes
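The link-matching step in the list above could be sketched with a small filter like the one below. The pattern -Dress- is the one used for this store; the function name filterProductLinks, the de-duplication and the sample product URL are assumptions added for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// productURL is the pattern that marks a link as a product page.
var productURL = regexp.MustCompile(`-Dress-`)

// filterProductLinks keeps only links that look like product pages and
// drops duplicates, preserving the order in which links were first seen.
func filterProductLinks(links []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, l := range links {
		if !productURL.MatchString(l) || seen[l] {
			continue
		}
		seen[l] = true
		out = append(out, l)
	}
	return out
}

func main() {
	// hypothetical links as they might be collected from a listing page
	links := []string{
		"https://www.ourclothingwebstore.com/Floral-Print-Dress-m1234",
		"https://www.ourclothingwebstore.com/cart.php",
		"https://www.ourclothingwebstore.com/Floral-Print-Dress-m1234",
	}
	for _, l := range filterProductLinks(links) {
		fmt.Println(l)
	}
}
```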
Here is the main bulk of the code with comments explaining the logic - it is not optimized and still quite rough around the edges but its key functionality is in place:
lucas.go
// as our scraper will only use one file this will be our main package
package main

// importing dependencies
import (
	"database/sql"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"

	"github.com/fatih/color"
	"github.com/gocolly/colly"
	_ "github.com/lib/pq"
)

// setting up a data structure to represent a form of Clothing
type Clothing struct {
	Name        string
	Code        string
	Description string
	Price       float64
}

// setting up a function to write to our db
// (it opens a fresh connection for every product, which is simple but slow)
func dbWrite(product Clothing) {
	const (
		host = "localhost"
		port = 5432
		user = "user"
		// password = ""
		dbname = "lucas_db"
	)

	psqlInfo := fmt.Sprintf("host=%s port=%d user=%s "+
		"dbname=%s sslmode=disable",
		host, port, user, dbname)

	db, err := sql.Open("postgres", psqlInfo)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	err = db.Ping()
	if err != nil {
		panic(err)
	}

	// some debug print logs
	log.Print("Successfully connected!")
	fmt.Printf("%s, %s, %s, %f\n", product.Name, product.Code, product.Description, product.Price)

	sqlStatement := `
	INSERT INTO floryday (product, code, description, price)
	VALUES ($1, $2, $3, $4)`
	_, err = db.Exec(sqlStatement, product.Name, product.Code, product.Description, product.Price)
	if err != nil {
		panic(err)
	}
}

// our main function - using a colly collector
func main() {
	// creating our new colly collector with a localised cache
	c := colly.NewCollector(
		// colly.AllowedDomains("www.clotheswebsite.com"), // expects hostnames, not full URLs
		colly.CacheDir(".floryday_cache"),
		// colly.MaxDepth(5), // keeping crawling limited for our initial experiments
	)

	// clothing detail scraping collector
	detailCollector := c.Clone()

	// pre-allocating room for 200 items of clothing (a capacity, not a hard limit)
	clothes := make([]Clothing, 0, 200)

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// hardcoded urls to skip -> these aren't relevant for products
		// note: because of the leading negation, only links that start with
		// /?country_code get past this check and are followed by this collector
		if !strings.HasPrefix(link, "/?country_code") || strings.Contains(link, "/cart.php") ||
			strings.Contains(link, "/login.php") ||
			strings.Contains(link, "/account") || strings.Contains(link, "/privacy-policy.html") {
			return
		}
		// scrape the page
		e.Request.Visit(link)
	})

	// printing visiting message for debug purposes
	c.OnRequest(func(r *colly.Request) {
		log.Println("Visiting", r.URL.String(), "\n")
	})

	// visiting any href links -> this can be optimised later
	c.OnHTML(`a[href]`, func(e *colly.HTMLElement) {
		clothingURL := e.Request.AbsoluteURL(e.Attr("href"))
		// this was a way to determine the page was definitely a product
		// if it contained -Dress- we were good to scrape
		if strings.Contains(clothingURL, "-Dress-") {
			// Activate detailCollector
			color.Green("Crawling Link Validated -> Commencing Crawl for %s", clothingURL)
			detailCollector.Visit(clothingURL)
		} else {
			color.Red("Validation Failed -> Cancelling Crawl for %s", clothingURL)
			return
		}
	})

	// Extract details of the clothing
	detailCollector.OnHTML(`div[class=prod-right-in]`, func(e *colly.HTMLElement) {
		// some html parsing to get the exact values we want
		title := e.ChildText(".prod-name")
		// guard against a missing "#" so the split cannot panic
		codeParts := strings.Split(e.ChildText(".prod-item-code"), "#")
		if len(codeParts) < 2 {
			color.Red("no product code found on %s", e.Request.URL)
			return
		}
		code := codeParts[1]
		stringPrice := strings.TrimPrefix(e.ChildText(".prod-price"), "€ ")
		price, err := strconv.ParseFloat(stringPrice, 64) // conversion to float64
		if err != nil {
			// only report real failures; price stays 0 in that case
			color.Red("err in parsing price -> %s", err)
		}
		description := e.ChildText(".grid-uniform")
		clothing := Clothing{
			Name:        title,
			Code:        code,
			Description: description,
			Price:       price,
		}
		// writing as we go to DB
		// TODO optimize to handle bulk array uploads instead of one at a time
		dbWrite(clothing)
		// appending to our output array...
		clothes = append(clothes, clothing)
	})

	// start scraping at our seed address
	c.Visit("https://www.ourclothingwebstore.com/Dresses-r9872/")

	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", " ")

	// Dump json to the standard output
	enc.Encode(clothes)
}
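The TODO in the code mentions bulk uploads instead of one write per product. One possible approach, sketched below under stated assumptions, is to build a single multi-row INSERT with numbered placeholders and pass all the values in one call. The helper buildBulkInsert is hypothetical and was not part of the original scraper; the table and column names match dbsetup.sql.

```go
package main

import (
	"fmt"
	"strings"
)

// Clothing mirrors the struct used by the scraper.
type Clothing struct {
	Name, Code, Description string
	Price                   float64
}

// buildBulkInsert returns one INSERT statement with a ($n, ...) group per
// item, plus the flattened argument list in matching order. It returns an
// empty query for an empty slice so the caller can skip the write.
func buildBulkInsert(items []Clothing) (string, []interface{}) {
	if len(items) == 0 {
		return "", nil
	}
	groups := make([]string, 0, len(items))
	args := make([]interface{}, 0, len(items)*4)
	for i, c := range items {
		n := i * 4 // four placeholders per row
		groups = append(groups, fmt.Sprintf("($%d, $%d, $%d, $%d)", n+1, n+2, n+3, n+4))
		args = append(args, c.Name, c.Code, c.Description, c.Price)
	}
	query := "INSERT INTO floryday (product, code, description, price) VALUES " +
		strings.Join(groups, ", ")
	return query, args
}

func main() {
	// sample rows, invented for this example
	query, args := buildBulkInsert([]Clothing{
		{"Red Dress", "A1", "A red dress", 29.99},
		{"Blue Dress", "B2", "A blue dress", 34.5},
	})
	fmt.Println(query)
	fmt.Println(len(args), "arguments")
	// with an open *sql.DB this would be run as: db.Exec(query, args...)
}
```

Opening the connection once and reusing it would likely help too, since dbWrite currently connects for every product.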
Dangerzone
After getting the basic functionality set up and working, I decided to experiment with multiple product seed pages. I discovered that this particular store laid out its main products on a page called https://www.ourclothingwebstore.com/Dresses-r9872/, and that this could be paginated with a simple /p2 at the end of the url, or /p3, /p4, all the way up to p80 something! On average there were around 40 products on each of these pages. I had implemented a simple for loop that iterated over this seed url, updating it each time. With a little more logic I could, in essence, set up the crawler to hit all the Dress products this store had on sale (and I'm sure that with a small regex tweak the same logic could have been applied to the other fashion categories the retailer had on offer).
It was at this point that I decided to stop work on the project, as I had achieved the basic goals I set out to do and had learned, perhaps a little too easily, how malicious this innocent project could turn. It was turning into a one-way trip to essentially having a DB that contained the entire storefront of this ecommerce site.
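For reference, the pagination loop described above might look something like this sketch. It assumes the page suffix is appended directly after the seed path (so .../Dresses-r9872/p2), which is a guess at the exact URL shape; the function name pageURLs is also made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// pageURLs returns the seed URL followed by its /p2 ... /pN variants.
// Assumption: the suffix replaces the trailing slash of the seed.
func pageURLs(seed string, lastPage int) []string {
	base := strings.TrimSuffix(seed, "/")
	urls := []string{seed}
	for p := 2; p <= lastPage; p++ {
		urls = append(urls, fmt.Sprintf("%s/p%d", base, p))
	}
	return urls
}

func main() {
	for _, u := range pageURLs("https://www.ourclothingwebstore.com/Dresses-r9872/", 3) {
		fmt.Println(u)
	}
}
```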
There are definitely optimisations that would be required to achieve this goal but I imagine with the use of goroutines you could get some parallelisation of this scraper happening to speed up the process to potentially scrape the entire website in a short timespan.
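As a rough idea of what that parallelisation could look like, here is a sketch using only the standard library: a bounded pool of goroutines that runs a fetch function over a list of URLs. The names scrapeAll and fetch are hypothetical, and the fetch used in main is a stand-in that does no network access. Colly has its own asynchronous mode as well; this sketch deliberately does not use it.

```go
package main

import (
	"fmt"
	"sync"
)

// scrapeAll calls fetch for every URL with at most `workers` calls running
// at the same time, and returns the results keyed by URL.
func scrapeAll(urls []string, workers int, fetch func(string) string) map[string]string {
	if workers < 1 {
		workers = 1 // avoid blocking forever on an unbuffered semaphore
	}
	results := make(map[string]string, len(urls))
	var mu sync.Mutex // guards results
	var wg sync.WaitGroup
	sem := make(chan struct{}, workers) // counting semaphore

	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` fetches are in flight
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when done
			body := fetch(u)
			mu.Lock()
			results[u] = body
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results
}

func main() {
	urls := []string{"/p1", "/p2", "/p3", "/p4"}
	// stand-in for a real HTTP fetch
	fake := func(u string) string { return "page for " + u }
	res := scrapeAll(urls, 2, fake)
	fmt.Println(len(res), "pages fetched")
}
```

Keeping the worker count low matters here: the limit is what stops a scraper like this from hammering the target site.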
Closing Thoughts
I had some fun trying to reverse engineer the website's HTML structure, figuring out how they were displaying product pages, which product hrefs to crawl, and which data to extract and write to a DB. It was enjoyable, but it all felt a little too close to the sun for my legal liking.
I have deliberately not referenced the 'real' website's name in this example for the sake of their anonymity, but the underlying principles should be applicable to most major online ecommerce retailers.
I was amazed with how easy it was to get up and running with the Colly library. I definitely suggest testing it out, but be careful with what data you decide to scrape/store, and investigate your target's robots.txt file to ensure you have permission to hit their website.
If you have any thoughts, opinions or comments, feel free to leave them below.
See you next time!