RESTful Web scraping in Scala, using Play Framework and Jsoup

bartoszgajda55

Bartosz Gajda

Posted on September 19, 2020

Recently I came across a very cool site with cooking recipes that had an extremely poor UI, especially on mobile. There was no official API, so I decided to build a web service that scrapes the content and publishes it through a RESTful API. In this post I will show you how to use Scala, Play Framework and Jsoup to build such a service. Enjoy!

You will need

  • Scala - I use the version 2.13.1
  • sbt - install it from the official website
  • IDE of choice - IntelliJ or VS Code, I use the latter
  • Around 15 minutes of your time

Creating empty Play project

We will start off by creating a Play project. You can of course start from scratch and add the code iteratively, but for the scope of this tutorial I suggest using one of Play's example projects as a starter.

The Scala REST API example project has almost everything we need. It serves as a nice backbone for our use case, which we will be able to extend with ease.

Let's start by cloning the Play Framework samples repository, and opening the Scala Rest API example like this:

git clone https://github.com/playframework/play-samples.git
cd play-samples/play-scala-rest-api-example

After this, run the command below. The Play application should start up, and you should be able to access it at http://localhost:9000.

sbt run

If the application builds and runs correctly, we are done with initializing the project. Now you can open this folder with your IDE of choice. I will use VS Code.

Adding Jsoup dependency

Next, we will add Jsoup, an open source HTML parser for Java and the JVM. To do this, open the build.sbt file and add the following to the libraryDependencies block:

"org.jsoup" % "jsoup" % "1.13.1"

Of course, make sure to use the latest version possible - at the time of authoring this article, 1.13.1 was the latest one.
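For reference, here is a minimal sketch of how the dependency might sit in build.sbt. The surrounding Seq is an assumption, since the sample project's real build file contains more settings and dependencies than shown here. Note the single % between the group and the artifact: Jsoup is a plain Java library, so sbt should not append a Scala version suffix to its name.

```scala
// build.sbt (fragment) - other settings and existing dependencies omitted
libraryDependencies ++= Seq(
  // ...dependencies already declared by the sample project...
  "org.jsoup" % "jsoup" % "1.13.1" // Java library, hence a single % after the group id
)
```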

Now, wait for sbt to refresh the dependencies, and we are ready to write some code!

Defining an endpoint

The first coding step is to define an endpoint in Play Framework. This is done by adding route definitions to the conf/routes file. In our case, I want a v1/recipes endpoint that is handled entirely by the RecipeRouter class. We can do this by adding the following:

->         /v1/recipes             v1.recipe.RecipeRouter

Now, any request that hits that URL will be forwarded to our router.

Next, I am creating a package for my recipe-related classes, app/v1/recipe. In here, let's create an entry point that will handle the requests, RecipeRouter:

package v1.recipe

import javax.inject.Inject

import play.api.routing.Router.Routes
import play.api.routing.SimpleRouter
import play.api.routing.sird._

class RecipeRouter @Inject()(controller: RecipeController) extends SimpleRouter {
  val prefix = "/v1/recipes"

  def link(id: String): String = {
    import io.lemonlabs.uri.dsl._
    val url = prefix / id.toString
    url.toString()
  }

  override def routes: Routes = {
    case GET(p"/") =>
      controller.index

    case GET(p"/all/$id") =>
      controller.showAll(id)
  }

}

Connecting to website

Going further, it's time to finally use Jsoup to get some content out of the website. In the previous section, the router referred to a showAll method on the controller, and that's the one I will implement here. We will make use of the id parameter, which here selects the page of results to fetch.

So, let's move to RecipeController class, and define the first element: sourceUrl. I like to have it defined as a class field, so that it is accessible by all methods:

val sourceUrl: String = "https://kwestiasmaku.com"

Then, inside our method, let's get the content of this website. It can be done like this:

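def showAll(pageId: String): Action[AnyContent] = Action { implicit request =>
    // Fetch and parse the listing page; pageId selects the page of results
    val htmlDocument = Jsoup.connect(s"${sourceUrl}/home-przepisy?page=${pageId}").get()

    // Play has no Writeable for a Jsoup Document, so send its HTML as a string
    val r: Result = Ok(htmlDocument.outerHtml()).as(HTML)
    r
}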

In the snippet above, we connect to the website using the Jsoup.connect method, passing the URL as a string. What we get back is an HTML document, which we just return for now. We will do some finer extraction in the next section.
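Since the page id comes straight from the request path and the remote site may be slow or unavailable, it can be worth guarding the fetch a little. Below is a minimal sketch of one way to do that. The PageFetcher object, the user agent string, the 10 second timeout and the rule that a page id must be a non-negative integer are all assumptions made for illustration, not part of the original project.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

import scala.util.Try

// Hypothetical helper, not part of the sample project
object PageFetcher {
  val sourceUrl: String = "https://kwestiasmaku.com"

  // Build the listing URL only when the id is a non-negative integer (assumed rule)
  def pageUrl(pageId: String): Option[String] =
    pageId.toIntOption
      .filter(_ >= 0)
      .map(page => s"$sourceUrl/home-przepisy?page=$page")

  // Wrap the blocking Jsoup call so failures (bad URL, timeout, HTTP error)
  // come back as a Failure value instead of an exception
  def fetch(url: String): Try[Document] = Try {
    Jsoup
      .connect(url)
      .userAgent("recipe-scraper-example/0.1") // assumed value
      .timeout(10000)                          // milliseconds, assumed value
      .get()
  }
}
```

In the controller, pageUrl could decide between fetching the page and answering with BadRequest, and a Failure from fetch could be mapped to an error response. Keep in mind that get() is a blocking call.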

To test the endpoint, call the following URL in your browser or any HTTP client, like curl:

http://localhost:9000/v1/recipes/all/1

You should see a response containing the raw HTML document. Now, let's parse it and extract some meaningful data.

Extracting the content

Now, let's extract some specific elements from the raw HTML. For this, we can use Jsoup's select method. It takes a CSS-like selector, so if you have used CSS before, it is as easy as specifying the correct string. In our case, to extract the HTML elements of the individual recipes, use the following:

import scala.jdk.CollectionConverters._

val recipesDomElements = htmlDocument.select("section#block-system-main .col").asScala

This query returns a collection-like structure, which we can easily iterate over. To extract specific elements from a single recipe, use the following:

final case class Recipe(title: String, href: String, img: String)

val recipeData = for(recipeElement <- recipesDomElements)
      yield Recipe(
        recipeElement.select(".views-field-title a").html(),
        sourceUrl + recipeElement.select(".views-field-title a").attr("href"),
        recipeElement.select("img").attr("src")
      )

In here, we use a simple for comprehension, which yields the concrete data by wrapping it in the Recipe case class. Now we have nicely cleaned data, and we can finally return it as JSON.
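When tuning selectors, it can help to try them against a small HTML string first, without hitting the real site. The sketch below does that with Jsoup.parse. The markup is made up to match the selectors used above and is only an assumption about how the real page is structured; the SelectorDemo object and the recipe names are likewise hypothetical. It uses text() rather than html() for the title, which returns the link text without any markup.

```scala
import org.jsoup.Jsoup

import scala.jdk.CollectionConverters._

// Hypothetical, offline demo of the selectors; no network access needed
object SelectorDemo {
  final case class Recipe(title: String, href: String, img: String)

  val sourceUrl: String = "https://kwestiasmaku.com"

  // Made-up markup shaped to match the selectors from the article
  val html: String =
    """<section id="block-system-main">
      |  <div class="col">
      |    <div class="views-field-title"><a href="/przepis/pierogi">Pierogi</a></div>
      |    <img src="/img/pierogi.jpg">
      |  </div>
      |  <div class="col">
      |    <div class="views-field-title"><a href="/przepis/bigos">Bigos</a></div>
      |    <img src="/img/bigos.jpg">
      |  </div>
      |</section>""".stripMargin

  // Parse the string and map every matched element to a Recipe
  val recipes: List[Recipe] =
    Jsoup.parse(html).select("section#block-system-main .col").asScala.toList.map { element =>
      val link = element.select(".views-field-title a")
      Recipe(link.text(), sourceUrl + link.attr("href"), element.select("img").attr("src"))
    }

  def main(args: Array[String]): Unit = recipes.foreach(println)
}
```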

Returning a JSON

The last step is to return our collection of Recipe case classes as JSON. We can do this using Play's built-in Json object and its methods:

implicit val recipeWrites: OWrites[Recipe] = Json.writes[Recipe]

val r: Result = Ok(Json.toJson(recipeData))
r

An important thing to keep in mind is that before converting a case class into its JSON representation, we must declare an implicit writer, so that Json.toJson knows how to serialize this specific type.
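To see what the writer produces, here is a small standalone sketch that serializes one hand-written recipe outside of a controller. The JsonDemo object and the sample values are hypothetical; only Json.writes, Json.toJson and Json.prettyPrint come from Play JSON.

```scala
import play.api.libs.json.{JsValue, Json, OWrites}

// Hypothetical demo object, not part of the sample project
object JsonDemo {
  final case class Recipe(title: String, href: String, img: String)

  // The macro derives a writer that maps each field to a JSON key of the same name
  implicit val recipeWrites: OWrites[Recipe] = Json.writes[Recipe]

  // Made-up sample data
  val sample: Seq[Recipe] = Seq(
    Recipe("Pierogi", "https://kwestiasmaku.com/przepis/pierogi", "/img/pierogi.jpg")
  )

  // A collection of Recipe values becomes a JSON array of objects
  val json: JsValue = Json.toJson(sample)

  def main(args: Array[String]): Unit = println(Json.prettyPrint(json))
}
```

Without the implicit writer in scope, the Json.toJson call would not compile.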

And that's it! The complete RecipeController class implementation looks like the following:

package v1.recipe

import javax.inject.Inject
import org.jsoup.Jsoup
import play.api.mvc._
import play.api.libs.json.{Json, OWrites}

import scala.jdk.CollectionConverters._

class RecipeController @Inject()(val controllerComponents: ControllerComponents) extends BaseController {
  implicit val recipeWrites: OWrites[Recipe] = Json.writes[Recipe]
  val sourceUrl: String = "https://kwestiasmaku.com"

  def index: Action[AnyContent] = Action { implicit request =>
    val r: Result = Ok("hello world")
    r
  }

  def showAll(pageId: String): Action[AnyContent] = Action { implicit request =>
    val htmlDocument = Jsoup.connect(s"${sourceUrl}/home-przepisy?page=${pageId}").get()
    val recipesDomElements = htmlDocument.select("section#block-system-main .col").asScala

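    // Map every recipe element to a Recipe (same extraction as shown earlier)
    val recipeData = for(recipeElement <- recipesDomElements)
      yield Recipe(
        recipeElement.select(".views-field-title a").html(),
        sourceUrl + recipeElement.select(".views-field-title a").attr("href"),
        recipeElement.select("img").attr("src")
      )

    val r: Result = Ok(Json.toJson(recipeData))
    r
  }
}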

Summary

I hope you have found this post useful. If so, don't hesitate to like or share it. Additionally, you can follow me on social media if you like 🙂
