Scan.co.uk sales scraper

I was working today on a scraper that grabs the sale items from Scan.co.uk and collects the data in a .csv file. Nothing fancy - its sole value is educational. And fittingly, the simple bs4 script threw up two issues that seem worth mentioning.

HTTP Error 403 - the server refused to authorise the request, so the html could not be grabbed. How frustrating!
x.findAll() does not return all results - I was trying to grab 6 'li' containers, but only 4 were ever found by the function. What do?

HTTP Error 403: Forbidden

This is related to urllib headers - the website does not want to be bogged down dealing with requests from countless scrapers, so requests that carry urllib's default User-Agent header are blocked.
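To see what urllib announces itself as when you set no headers, you can inspect a freshly built opener. This is only a small sketch; the version suffix it prints depends on your Python install.

```python
import urllib.request

# A default opener carries the headers urllib sends when you set none yourself.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# Typically something like 'Python-urllib/3.x', which is easy for a server to spot.
print(default_headers['User-agent'])
```

No request is made here, so it is safe to run without touching the site.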

To get around this, you must obscure the fact you are running a scraping bot. The simplest way to do this is by using headers, as follows:

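from urllib.request import Request, urlopen

# Send a browser-like User-Agent instead of urllib's default
req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
page_html = page.read()
page.close()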

At first, this did not work for me (it's User-Agent, not User_Agent, bwap bwap).
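The typo is easy to miss because urllib accepts any header name without complaint. A misspelled User_Agent is simply sent as an extra, unknown header, and urllib still adds its own default User-Agent on top. A quick offline check, using a placeholder URL that is not the real target:

```python
from urllib.request import Request

my_url = 'https://www.example.com/'  # placeholder URL, assumed for illustration

good = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
typo = Request(my_url, headers={'User_Agent': 'Mozilla/5.0'})

# Request normalises header names with str.capitalize(), hence 'User-agent'.
print(good.has_header('User-agent'))  # True
print(typo.has_header('User-agent'))  # False: urlopen() would add urllib's default
```

Building a Request does not open a connection, so nothing is fetched here.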

So here is another, apparently older solution, from user Zeta over on Stack Overflow:

import urllib.request

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

uClient_opener = AppURLopener()
uClient = uClient_opener.open(my_url)

This appears to be a legacy solution, and not preferred: FancyURLopener has been deprecated since Python 3.3. In the end, both solutions worked for me, typos aside.
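If you like the opener-style approach but would rather avoid FancyURLopener, the standard library's build_opener() can do a similar job. This is my own sketch rather than part of the Stack Overflow answer, and I have not run it against Scan.co.uk:

```python
import urllib.request

# Build an opener and replace its default headers with a browser-like User-Agent.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

# opener.open(my_url) would then send this header with every request;
# urllib.request.install_opener(opener) makes plain urlopen() use it as well.
```

The upside over passing headers to each Request is that the header is set once and reused for every page you fetch.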

x.findAll() does not return all results

product_list = product_categories[x].findAll('li')

The above code should have returned 6 results, but I could never get it to go above 4.
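Before blaming your own code, it can help to confirm how many 'li' tags are really in the markup. The sketch below counts start tags with the standard library's HTMLParser, which only tokenises the text and never builds a tree, so nesting mistakes cannot hide anything. TagCounter, count_tag and the sample string are names I made up for illustration.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags by name without building a tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, closed properly or not.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag):
    """Return how many times <tag> opens in the given html string."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts.get(tag, 0)


# Made-up sample with sloppy markup: two of the three 'li' tags are never closed.
sample = '<ul><li>a</li><li>b<li>c</ul>'
print(count_tag(sample, 'li'))  # 3
```

Note that feed() expects a str, so the bytes from page.read() need decoding first, for example with page_html.decode('utf-8', errors='replace'). If this count is higher than what findAll() gives you, the tags are in the page and the tree-building step is the likely suspect.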

Some googling suggested that this was a problem with the html.parser parser. Suggested solution - use html5lib.

This is what parsing the html with BeautifulSoup looked like before:

from bs4 import BeautifulSoup as soup

page_soup = soup(page_html, 'html.parser')
product_categories = page_soup.findAll('div', {'class':'category'})

The changes to the code are minimal - just replace the parser name passed to soup() with 'html5lib':

# bs4 finds the parser by name; html5lib only needs to be installed (pip install html5lib)
import html5lib

page_soup = soup(page_html, 'html5lib')
product_categories = page_soup.findAll('div', {'class':'category'})

And it works! len(product_list) returns the correct 6 I was looking for.

Hope someone finds this helpful.

Bartosz Raubo
