Price scraping

jagedn

Jorge

Posted on November 24, 2019

This is a simple price-scraping application that checks a list of product URLs and notifies you via Telegram when any of them goes above or below a given price

This post is an example of how we can mix different technologies, without spending money, to get a daily alert when some prices change. Be aware that this is not a "professional" way to do it; consider it a "toy" or a "lab" to learn from

Requirements

In order to have a fully functional example you’ll need:

  • a Telegram bot

  • a Google Sheet, plus

  • credentials from a Google Cloud project

  • a GitLab account (you can use GitHub or similar, but this example uses a GitLab pipeline)

Telegram

We’ll use Telegram as the channel that notifies us about price changes. You’ll need two things:

  • Telegram installed on your mobile phone (you can also access it via a web browser)

  • a Telegram bot

The first step is easy: it is just like installing any other messaging application.

To create our bot, we’ll use the Telegram application to talk to 'BotFather', a Telegram bot that creates other bots (https://core.telegram.org/bots)

[Screenshot: BotFather]

Basically, we tell it to create a new bot by writing "/newbot" and following its instructions (a name, a description and so on) to obtain a token similar to 12312312:AAAAAAAAAAAAAAAAAAAaaY . DON’T SHARE THIS TOKEN AND DON’T STORE IT IN YOUR REPO

To allow the bot to talk to you, you need to start the conversation yourself, so find your bot with Telegram’s search button and greet it with the '/start' command

We’ll also need your Telegram client id. You can get it from the existing bot '@userinfobot', which replies to every message with your client id and other information about your account. YOU CAN SHARE THIS ID, IT’S NOT SO IMPORTANT, BUT AS WITH THE TOKEN WE’LL KEEP IT SECRET
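To see how the token and the client id fit together, here is a minimal Python sketch that builds a call to the Bot API `sendMessage` method. It is only an illustration, not the code of the application described in this post; the function names `build_send_message_request` and `send_message` are made up for the example, and the token and id are expected to come from somewhere outside the repo.

```python
import json
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def build_send_message_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Bot API sendMessage method."""
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{TELEGRAM_API}/bot{token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_message(token: str, chat_id: str, text: str) -> int:
    """Send the message and return the HTTP status code of the response."""
    request = build_send_message_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_message(token, chat_id, "hello")` only works after you have sent '/start' to the bot, because a bot cannot open a conversation on its own.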

Google Service

This is probably the most obscure part of the process. If you have a Google account, you can create projects and use App Engine, Kubernetes and many other Google services.

So open https://console.cloud.google.com/ and follow the instructions to create your first project (for this tutorial you don’t need to deploy anything, only create the project)

Once the project is created, we’ll need to enable the Google Sheets API.

We’ll also need to create a service account and generate a credentials file for it:

[Screenshot: create credentials]

After you create the service account, a JSON file will be downloaded automatically. KEEP IT SECRET AND DON’T STORE IT IN YOUR REPO

We’ll need the email of the service account (something like your-awesome-service@your-awesome-project.iam.gserviceaccount.com) in the next step
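If you lose track of that email, it is also stored in the downloaded JSON file under the `client_email` key. A small Python sketch to read it back (the helper name `service_account_email` is hypothetical, and the path to the file is whatever you chose when saving it):

```python
import json


def service_account_email(credentials_path: str) -> str:
    """Return the client_email field of a service account key file."""
    with open(credentials_path, encoding="utf-8") as handle:
        key = json.load(handle)
    return key["client_email"]


# Example (the path is an assumption, use your own):
# print(service_account_email("/safe/place/credentials.json"))
```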

Google Sheet

The application will read a Google Sheet with a simple structure like this:

[Screenshot: Google Sheet]

While you are editing the sheet, you can find its ID in the URL, between /d/ and /edit.

You’ll need this ID, and IT’S BETTER NOT TO KEEP IT IN YOUR REPO
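As an illustration, this Python sketch pulls the ID out of a sheet URL and builds the Sheets API v4 address used to read a range such as "A1:E99". The helper names `extract_sheet_id` and `values_url` are invented for the example, and the real application may read the sheet through a client library instead of a raw URL.

```python
import re
from urllib.parse import quote

# The ID is the path segment that follows /spreadsheets/d/
_SHEET_URL = re.compile(r"/spreadsheets/d/([a-zA-Z0-9_-]+)")


def extract_sheet_id(url: str) -> str:
    """Return the spreadsheet ID embedded in a Google Sheets URL."""
    match = _SHEET_URL.search(url)
    if match is None:
        raise ValueError(f"no spreadsheet ID found in {url!r}")
    return match.group(1)


def values_url(sheet_id: str, tab: str, cells: str) -> str:
    """Build the spreadsheets.values.get URL for a tab and a cell range."""
    # Tab names containing spaces must be wrapped in single quotes (A1 notation).
    a1_range = f"'{tab}'!{cells}"
    encoded = quote(a1_range, safe="")
    return f"https://sheets.googleapis.com/v4/spreadsheets/{sheet_id}/values/{encoded}"
```

A request to that URL still has to be authorized with the service account credentials, which this sketch leaves out.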

For the application to read this sheet, we need to share it with the service account email created previously, so click on 'Share' and add your service account as a collaborator (don’t send the notification email: nobody will be listening and you will receive an error notification email)

Application

You can download the application from https://gitlab.com/jorge-aguilera/scrapping-prices

Basically, it is a single-class application that:

  • reads the range "A1:E99" of a Google Sheet (yes, this example only works with a maximum of 99 articles),

  • opens a Geb Browser per row,

  • uses a custom CSS selector to find the price element,

  • adds the item to a list if its value is lower or higher than the associated rule allows,

  • at the end, sends an HTTP POST to the channel with the summary
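The rule check in the steps above can be sketched in Python as follows. This is not the application's code (which uses Geb): the `Row` fields, the "below"/"above" rule names and the helpers are assumptions made for the example, since the real column layout of the sheet is not described here. The browser step is replaced by a dictionary of already scraped texts.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Row:
    """One sheet row (hypothetical layout: five columns, A to E)."""
    name: str
    url: str
    selector: str     # CSS selector of the price element
    rule: str         # "below" or "above" (assumed encoding)
    threshold: float


def parse_price(text: str) -> Optional[float]:
    """Extract the first number from text like '19,99 €'.

    Thousands separators are not handled in this sketch.
    """
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", "."))


def triggers(row: Row, price: float) -> bool:
    """Tell whether the price breaks the rule of the row."""
    if row.rule == "below":
        return price < row.threshold
    if row.rule == "above":
        return price > row.threshold
    raise ValueError(f"unknown rule {row.rule!r}")


def summarize(rows: List[Row], scraped: Dict[str, str]) -> str:
    """Build the summary; `scraped` maps each URL to the text found by the selector."""
    lines = []
    for row in rows:
        price = parse_price(scraped.get(row.url, ""))
        if price is not None and triggers(row, price):
            lines.append(f"{row.name}: {price:.2f} ({row.rule} {row.threshold:.2f})")
    return "\n".join(lines)
```

The resulting text is what would be sent in the final POST; an empty string means no rule was triggered.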

Gitlab

The repo is hosted on GitLab and uses its pipeline feature to execute the run task every day

Basically we need to configure some environment variables:

  • SHEET_ID (the id of the sheet)

  • TABS ("Sheet 1" or whatever you use)

  • TELEGRAM_TOKEN (the token obtained via BotFather)

  • TELEGRAM_CHANNEL (the telegram userId)

  • GOOGLE_APPLICATION_CREDENTIALS (set its type to File instead of Variable and paste the contents of the credentials.json)
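A job that depends on these variables should fail early when one of them is missing. A minimal Python sketch of such a check (the `load_config` helper is hypothetical and is not part of the application):

```python
import os
from typing import Dict, Mapping

REQUIRED = (
    "SHEET_ID",
    "TABS",
    "TELEGRAM_TOKEN",
    "TELEGRAM_CHANNEL",
    "GOOGLE_APPLICATION_CREDENTIALS",  # a File variable: the value is a path
)


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings or raise if any is missing or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}
```

With a File variable, GitLab writes the pasted content to a temporary file and exposes the path of that file in the variable, so the secret never has to live in the repo.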

We also need to set the schedule we want to use, for example:

0 9 * * * (every day at 9:00)
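The five fields of that expression are minute, hour, day of month, month and day of week. The Python sketch below names them and checks whether a given moment matches; it is a deliberately reduced illustration that only understands `*` and plain numbers, and it does not model the special way real cron combines the two day fields when both are restricted.

```python
from datetime import datetime
from typing import Dict

FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression: str) -> Dict[str, str]:
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError("a cron expression needs exactly five fields")
    return dict(zip(FIELDS, parts))


def matches(expression: str, moment: datetime) -> bool:
    """Check a moment against an expression made of '*' and plain integers."""
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        "day_of_week": moment.isoweekday() % 7,  # cron counts Sunday as 0
    }
    fields = parse_cron(expression)
    return all(value == "*" or int(value) == actual[name] for name, value in fields.items())
```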

Conclusion

To recap:

  • you have a Telegram bot with a TOKEN

  • you have your Telegram id and you’ve started a conversation with your bot

  • you have a Google Sheet, you have the Sheet Id and you’ve added the service account as a collaborator

  • you have a credentials file in a safe place

  • you have a repo on GitLab with several environment variables configured

Basically we have someone (GitLab Pipeline) running our application every day, reading a Google Sheet (via a service account) and sending us a message (via our Telegram bot)
