Scraping and Storing Crypto-currency Prices with Scala and PostgreSQL

nazliander

Nazli Ander

Posted on August 2, 2020

Scraping and Storing Crypto-currency Prices with Scala and PostgreSQL

Web scraping mostly involves text-intensive tasks such as product review scraping, gathering real-estate listings, or even tracking online reputation and presence. When one application scrapes only String data types for qualitative analysis, it may not need type safety. However, in case the end goal of the web scraping is to do quantitative analysis with prices or weather forecasts, using a type-safe language might be quite handy.

In this article, we aim to give a small and interesting example of price scraping for crypto-currencies by using Scala and storing those into a PostgreSQL database. To scrape the prices we selected to use CoinMarketCap homepage. It is a crypto-currency knowledge website, which gives information also on market capitalizations (relative market sizes), circulating supply, and trading volumes. Even though it is fascinating to see all those information together, to keep it simple we will be only scraping the prices.

This article might be considered as a tutorial, and it requires a basic level of knowledge of docker-compose and Scala.

A summary of the pipeline that this tutorial usesA summary of the pipeline that this tutorial uses

Tools and Steps

While web scraping in Scala, we will be using an HTML parsing library called scala-scraper with JSoup. Following that, we will be inserting the scraped prices to the PostgreSQL database by using a functional JDBC tool called doobie.

Although we mentioned some fancy library and tool names, the real magic happens in case classes. For each call to the CoinMarketCap homepage, we aim to retrieve the long crypto-currency table with type safety. To do that we created CoinCreate and CoinInsert case classes and companion objects.

We will start explaining first the case classes together with their companion objects, as we aimed to model the data while creating those. Then we will explain the simple functions for retrieving the updated price table from the homepage. Lastly, we will explain how we inserted the table records into the PostgreSQL database running locally. We can power the database with this simple docker-compose file. In the docker-compose file, we initialized a PostgreSQL database with a name dev, username admin, and a password as admin.

The steps that are explained in this tutorial are displayed in the pipeline above.

Case Classes and Companion Objects

Although there might be different approaches to model the data, we can start by creating two case classes as CoinCreate and CoinInsert. Those will help us to keep the data types safe while scraping the price table and inserting into a database.

A view from the CoinMarketCap homepage prices tableA view from the CoinMarketCap homepage prices table

CoinCreate aims to safely type a pair of crypto-currency code and its current price. Thus, it has two parameters code(referring to the currency code) and price(current price in USD). However, while thinking about its companion object we need to consider the shape of the price records in each row. For instance, if we consider only to use coin names and prices in our case class, in an array of records their indices will be 1 and 3. This is quite similar to column indices for tables.

By observing the price table (above you can find a screenshot from the homepage), we decide to use a companion object to have an apply method for functionally transforming an input of String List to CoinCreate. Although this transformation is not that straightforward, we can use helper functions to get only the coin code (getCoinCode) and transform the dollar price string into a double (numberStringToDouble).

case class CoinCreate(code: String, price: Double)

object CoinCreate {
  def apply(listOfElements: List[String]): CoinCreate = {
    CoinCreate(
      code = getCoinCode(listOfElements(1)),
      price = numberStringToDouble(listOfElements(3))
    )
  }

  def dollarToNumber(dlr: String): Option[String] = {
    val p = "[0-9.]+".r
    p.findFirstIn(dlr)
  }

  def numberStringToDouble(strDlr: String): Double = {
    val numberStr = dollarToNumber(strDlr)
    numberStr.getOrElse("0").toDouble
  }

  def getCoinCode(strCoin: String): String = {
    strCoin.split(" ")(0)
  }
}
Enter fullscreen mode Exit fullscreen mode

CoinInsert aims to safely type a pair of crypto-currency code, its current price, and a log timestamp for insertion time logging. We can use this case class while inserting a vector of CoinCreate into PostgreSQL. As its parameters are so similar to CoinCreate, we can create a simple companion object to transform a CoinCreate to CoinInsert. This object’s apply method can naturally add the current timestamp to a CoinCreate to obtain a CoinInsert.

Hence the only difference between a CoinCreate case class and CoinInsert case class will be the current Timestamp, notated as a logTimestamp parameter.

case class CoinInsert(code: String, price: Double, logTimestamp: Timestamp)

object CoinInsert {
  def apply(coin: CoinCreate, logTimestamp: Long): CoinInsert = {
    CoinInsert(
      code = coin.code,
      price = coin.price,
      logTimestamp = new Timestamp(logTimestamp)
    )
  }
}
Enter fullscreen mode Exit fullscreen mode

Scraping Functions

Scraping with scala-scraper and JSoup is quite easy. First, we need to GET request to the homepage by creating a new JSoup browser. A new JSoup browser enables us to fetch HTML from the web. Since we need only HTML parsing JSoup was enough in this case, for Javascript using pages other browser options could be used.

def siteConnect(html: String, browser: JsoupBrowser): browser.DocumentType = {
    browser.get(html)
}
Enter fullscreen mode Exit fullscreen mode

By using the GET request, we need to find the main table and store it as a Vector of Strings. Luckily when we specify that we are looking for a table element, scala-scraper’s table method does all the job for us.

def getCoinUpdatedTable(webPage: String,
                        tableNameInHTML: String): Vector[CoinCreate] = {
  val site = siteConnect(webPage, new JsoupBrowser()) // Connects to the webpage.

  val tab = site >> table(s"#${tableNameInHTML}") // Gets the table with the given name.

  val body = tab.slice(1, tab.length) // First index belongs to the header of the table.

  body.map(x => CoinCreate(x.map(_.text).toList)) // Table rows are mapped to CoinCreate case class.
}
Enter fullscreen mode Exit fullscreen mode

Lastly, we need to slice the Vector, starting from the second index till the last lines, as the first line contains column names (headers). The resulting sliced Vector would still have the table rows with their HTML elements as String. So we can benefit from functional programming to map all the table rows (Vector elements) while extracting text in the elements then transform to CoinCreate (the comfort of having a tailor-made apply function).

Insertion Functions

doobie is an amazing functional JDBC layer for Scala. It provides a functional way to write any JDBC program. In this tutorial, we will keep it quite simple by writing only a connection Transactor to connect to the local PostgreSQL database and an insertion function to make the transactions with type-safety.

To connect to the database, we need to use a Transactor stating the type of the driver (in our case it is a PostgreSQL driver), URL for connection, user name, password, and an Execution Context (EC). The transactor needs an implicit val to define the EC. For non-blocking operations doobie’s Transactor uses contextShift. For testing our code doobie documentation recommends using synchronous EC.

implicit val cs = IO.contextShift(ExecutionContexts.synchronous)

val xa = Transactor.fromDriverManager[IO](
  "org.postgresql.Driver", // driver classname
  "jdbc:postgresql://localhost:5432/dev", // connect URL (driver-specific)
  "admin", // user
  "admin", // password
  ExecutionContexts.synchronous
)
Enter fullscreen mode Exit fullscreen mode

For writing a row by row insertion function we can use SQL interpolation. The function has an input of CoinInsert and an output of Update0 (representing a parameterized statement where the arguments are known).

def insertCoin(coinInsert: CoinInsert): Update0 =
  sql"""
     INSERT INTO coins (code, price, logTimestamp)
     VALUES (${coinInsert.code}, ${coinInsert.price}, ${coinInsert.logTimestamp})
     """.update
Enter fullscreen mode Exit fullscreen mode

Lastly, we can GET request the homepage by using the getCoinUpdatedTable function. This will return a Vector of Strings.

Consequently, we can use this Vector (coinTable) to transform CoinCreate to CoinInsert case class and execute the insert statement we prepared in the previous step.

val coinTable =
    getCoinUpdatedTable("https://coinmarketcap.com", "currencies")

coinTable
    .foreach { coinCreate =>
        val logTimestamp = Calendar.getInstance.getTimeInMillis
        val coinInsert = CoinInsert(coinCreate, logTimestamp)
        insertCoin(coinInsert).run.transact(xa).unsafeRunSync
    }

Enter fullscreen mode Exit fullscreen mode

Last Words

Thanks to doobie, with only a few lines we were able to scrape the crypto-currency prices from CoinMarketCap and insert those into a local PostgreSQL database. Although the code does its job, for now, the source code can be extended with exception handling and monitoring. You can find the whole project in this Github repository.

This article was originally published in the following link.

💖 💪 🙅 🚩
nazliander
Nazli Ander

Posted on August 2, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related