Scrape a web shop into a database

Web scraping is useful in many areas of IT. One good use case is a drop-shipping website whose database you populate with items from an existing shop. It works even better with digital items, because you don't have to deal with delivery. You only need to add your commission and give users a reason to use your site: create quality content, add a forum or chat, scrape news about your products and show it in a news feed, and so on.

I will write a Ruby script that scrapes data from Delije Butik as a sample shop. You can run it as a rake task on Heroku to scrape items every hour and update the database when something changes.
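
The hourly job mentioned above can be sketched as a small rake task. This is only a sketch: the task name scrape is my own choice, the path lib/scrap.rb assumes the script location used at the end of this post, and on Heroku something like the Scheduler add-on would still have to be configured to call it.

```ruby
require 'rake'   # not needed inside a Rakefile, harmless elsewhere

# Hypothetical task name; run it with `rake scrape`.
desc 'Scrape the shop and update the products table'
task :scrape do
  # Assumes the script lives in lib/scrap.rb.
  sh 'ruby lib/scrap.rb'
end
```

In a Rails app this would typically go into a file under lib/tasks, for example lib/tasks/scrape.rake.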

First, install the gems we will use (Sequel also needs the sqlite3 gem as its SQLite driver):

gem install nokogiri rest-client sequel sqlite3 colorize

Now create a new file and call it scrap.rb. As always, start with a shebang and require the gems:

#!/usr/bin/env ruby

require 'nokogiri'
require 'rest-client'
require 'sequel'
require 'colorize'

Now we can start. I will use Sequel to connect to the database and store the scraped data. rest-client makes the HTTP requests to the website, and Nokogiri parses the HTML so we can extract our data. Finally, Colorize makes the terminal output look nice.


  DB  = Sequel.sqlite('db/development.sqlite3')
  URL = 'https://delijebutik.com/shop'

  # Download a page, parse it, and keep the product nodes in @data.
  def open_page(url)
    html  = RestClient.get(url)
    @page = Nokogiri::HTML(html.body)
    @data = @page.search('div.thunk-product')
  end

Now open the web page you want to scrape and find the elements you need. Open the inspector (right click + Inspect Element) and find the container element that holds all the others. We don't need the header or the whole body, only the item name, price, photo, description and so on. In this case, the container is <div class="thunk-product">.

Now create a new method that scrapes this data and populates the database:



  def scrap!
    table = DB[:products]

    @data.each do |x|
      # Each lookup falls back to nil when the element is missing.
      title   = x.search('div.thunk-product-content > h2').text          rescue nil
      price   = x.search('div.thunk-product-content > span > span').text rescue nil
      photo   = x.search('div.thunk-product-image > img')[0]['src']      rescue nil
      photoH  = x.search('div.thunk-product-image > img')[1]['src']      rescue nil
      link    = x.search('> a')[0]['href']                               rescue nil

      # Insert only products that are not in the table yet.
      if table.where(title: title, price: price, link: link).empty?
        table.insert(
          title:      title,
          price:      price,
          photo:      photo,
          photoH:     photoH,
          link:       link,
          created_at: Time.now,
          updated_at: Time.now )

        # to_s keeps the output from failing when a field is nil.
        puts 'Title:     ' + title.to_s.yellow.bold
        puts 'Price:     ' + price.to_s.green.bold
        puts 'Photo:     ' + photo.to_s.red
        puts 'Link:      ' + link.to_s.yellow

        60.times { print '='.white }; puts ''
      else
        puts "Product: #{title} has been skipped!"
      end
    end
  end
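
Note that the duplicate check above matches on title, price and link together, so a product whose price changed would be inserted again as a new row. If you would rather update the existing row, the decision can be sketched in plain Ruby like this. It assumes the link identifies a product, which the shop may not guarantee, and the names sync_action and existing_by_link as well as the sample rows are made up for the example.

```ruby
# Decide what to do with one scraped product.
# existing_by_link: hash of link => stored row (in-memory stand-in for the table)
# scraped:          hash with :title, :price and :link
def sync_action(existing_by_link, scraped)
  stored = existing_by_link[scraped[:link]]
  return :insert if stored.nil?

  changed = stored[:title] != scraped[:title] || stored[:price] != scraped[:price]
  changed ? :update : :skip
end

# Made-up sample data.
stored_rows = {
  '/product/scarf' => { title: 'Scarf', price: '990', link: '/product/scarf' }
}

puts sync_action(stored_rows, { title: 'Scarf', price: '990',  link: '/product/scarf' })  # skip
puts sync_action(stored_rows, { title: 'Scarf', price: '1090', link: '/product/scarf' })  # update
puts sync_action(stored_rows, { title: 'Cap',   price: '500',  link: '/product/cap' })    # insert
```

With Sequel, the :update branch could then map to something like table.where(link: link).update(price: price, updated_at: Time.now).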

So how did we find the other fields? Once we have the container div.thunk-product, we locate every other field relative to it. Take a look at the selectors in the code, or open the website and inspect the page source.

That was easy, but what about pagination? We find the total number of pages and set a counter to zero. Then we scrape one page, increase the counter by one, and scrape again until we reach the last page:


  @c = 0
  open_page(URL)   # also fills @data with the product nodes

  # The sixth pagination link is assumed to hold the last page number.
  last_number = @page.search("a.page-numbers")[5].text

  @last_page_number = last_number.to_i
  @last_page_number.times do
    @c += 1
    puts "\n Scraping page [" + "#{@c}".red.bold + "]\n" + "\n"
    open_page("#{URL}/page/#{@c}") and scrap!
  end
  puts "\nFinished scraping [".white + "#{@last_page_number}".red.bold + "] pages!\n\n"
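
The hard-coded index [5] above only works while the pagination bar has exactly that layout. One possible alternative, sketched here in plain Ruby, is to take the largest number among the link labels. The labels array is invented sample data, and last_page_number and page_urls are hypothetical helper names; in the script the labels would come from something like @page.search('a.page-numbers').map(&:text).

```ruby
# Return the largest page number found in the pagination labels,
# ignoring non-numeric labels such as "Next". Falls back to 1.
def last_page_number(labels)
  numbers = labels.map(&:strip).grep(/\A\d+\z/).map(&:to_i)
  numbers.max || 1
end

# Build the URL of every page, following the "#{URL}/page/N" pattern from the post.
def page_urls(base_url, last_page)
  (1..last_page).map { |n| "#{base_url}/page/#{n}" }
end

labels = ['1', '2', '3', '...', '12', 'Next']   # made-up sample
last   = last_page_number(labels)

puts last
puts page_urls('https://delijebutik.com/shop', 3)
```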

And that's it! You have a web scraper for your new e-commerce app. I used SQLite, but you can use whatever database fits your needs. In my case the script lived in the lib folder of a Rails app and was run manually:
ruby lib/scrap.rb

Edit:
I added a check that the response code is 200, to verify that the web page is available. The full code looks like this:

#!/usr/bin/env ruby

require 'nokogiri'
require 'sequel'
require 'rest-client'
require 'colorize'


class DelijeShop

  DB  = Sequel.sqlite('db/development.sqlite3')
  URL = "https://delijebutik.com/shop"

  def initialize

    @c = 0
    open_page(URL)

    if @html.code == 200

      # The sixth pagination link is assumed to hold the last page number.
      last_number = @page.search("a.page-numbers")[5].text

      @last_page_number = last_number.to_i
      @last_page_number.times do
        @c += 1
        puts "\n Scraping page [" + "#{@c}".red.bold + "]\n" + "\n"
        open_page("#{URL}/page/#{@c}")
        scrap!
        sleep(3)   # pause between pages to go easy on the server
      end
      puts "\nFinished scraping [".white + "#{@last_page_number}".red.bold + "] pages!\n\n"

    else
      raise "Connection error, code #{@html.code} returned"
    end
  end

  # Download a page, parse it, and keep the product nodes in @data.
  def open_page(url)
    @html = RestClient.get(url)
    @page = Nokogiri::HTML(@html.body)
    @data = @page.search('div.thunk-product')
  end

  def scrap!

    table = DB[:products]

    @data.each do |x|
      # Each lookup falls back to nil when the element is missing.
      title   = x.search('div.thunk-product-content > h2').text          rescue nil
      price   = x.search('div.thunk-product-content > span > span').text rescue nil
      photo   = x.search('div.thunk-product-image > img')[0]['src']      rescue nil
      photoH  = x.search('div.thunk-product-image > img')[1]['src']      rescue nil
      link    = x.search('> a')[0]['href']                               rescue nil

      # Insert only products that are not in the table yet.
      if table.where(title: title, price: price, link: link).empty?
        table.insert(
          title:      title,
          price:      price,
          photo:      photo,
          photoH:     photoH,
          link:       link,
          created_at: Time.now,
          updated_at: Time.now )

        # to_s keeps the output from failing when a field is nil.
        puts 'Title:     ' + title.to_s.yellow.bold
        puts 'Price:     ' + price.to_s.green.bold
        puts 'Photo:     ' + photo.to_s.red
        puts 'Link:      ' + link.to_s.yellow

        60.times { print '='.white }; puts ''
      else
        puts "Product: #{title} has been skipped!"
      end
    end
  end

end    # end_of_class


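puts "\n" + 'Scraping Delije Shop Products List ...'.yellow

50.times { print '-'.yellow }; puts ''

# Start the scraper on its own line: `puts '' and DelijeShop.new` never
# reaches DelijeShop.new, because puts returns nil.
DelijeShop.new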
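
One caveat about the status check: as far as I know, RestClient.get raises an exception (a subclass of RestClient::ExceptionWithResponse) for 4xx and 5xx responses instead of returning them, so the else branch is rarely reached. Rescuing and retrying is one way to cope with a page that is temporarily unavailable. The helper below is a generic sketch in plain Ruby; with_retries is a made-up name, and the failing block only simulates a flaky download.

```ruby
# Run the block, retrying when it raises the given error class.
# attempts: total number of tries; wait: seconds to sleep between tries.
def with_retries(attempts: 3, wait: 0, on: StandardError)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue on
    raise if tries >= attempts   # give up and re-raise the last error
    sleep(wait)
    retry
  end
end

# Simulated flaky download: fails twice, then succeeds.
result = with_retries(attempts: 3) do |try|
  raise IOError, 'temporary failure' if try < 3
  "page fetched on try #{try}"
end

puts result
```

In the scraper this could look like with_retries(attempts: 3, wait: 5, on: RestClient::Exception) { RestClient.get(url) }.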


Linuxander
