Bartosz Gajda
Posted on September 19, 2020
Recently I came across a very cool site with cooking recipes that had an extremely poor UI, especially on mobile. There was no official API, so I decided to build a web service that scrapes the content and publishes it through a RESTful API. In this post I will show you how to use Scala, Play Framework and Jsoup to build such a service. Enjoy!
You will need
- Scala - I use version 2.13.1
- Sbt - install from the official website
- IDE of choice - IntelliJ or VS Code, I use the latter
- Around 15 minutes of your time
Creating empty Play project
We will start off by creating a Play project. You can of course start from scratch and add the code iteratively, but for the scope of this tutorial, I suggest using one of Play's example projects as a starter.
The Scala Rest API project has almost everything we need. It serves as a nice backbone for our use case, which we will be able to extend with ease.
Let's start by cloning the Play Framework samples repository and opening the Scala Rest API example like this:
git clone https://github.com/playframework/play-samples.git
cd play-samples/play-scala-rest-api-example
After this, try running the following command. The Play application should start up, and you should be able to access it at http://localhost:9000
sbt run
If the application builds and runs correctly, we are done with initializing the project. Now, you can open this folder with your IDE of choice - I will use VS Code.
Adding Jsoup dependency
Next, we will add Jsoup - an open source HTML parser for Java and the JVM. To do this, open the build.sbt file and add the following to the libraryDependencies block:
"org.jsoup" % "jsoup" % "1.13.1"
Of course, make sure to use the latest version available - at the time of writing this article, 1.13.1 was the latest one.
Now, wait for sbt to refresh the dependencies, and we are ready to write some code!
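One quick way to confirm that the dependency resolved is to parse a small in-memory HTML string, for example from `sbt console`. This is only a minimal sketch: the `JsoupSmokeTest` name and the sample HTML are made up for illustration, and no network access is involved.

```scala
import org.jsoup.Jsoup

object JsoupSmokeTest {
  // Hypothetical HTML, used only to check that Jsoup is on the classpath.
  val sampleHtml: String =
    "<html><head><title>Recipes</title></head><body><p>Hello</p></body></html>"

  // Parses the in-memory string and reads the content of the <title> element.
  def parsedTitle: String = Jsoup.parse(sampleHtml).title()
}
```

If `JsoupSmokeTest.parsedTitle` compiles and returns the title text, the library is available to the project.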
Defining an endpoint
The first coding step is to define an endpoint in the Play framework. This is done by adding route definitions to the conf/routes file. In our case, I want to have a v1/recipes endpoint that is handled solely by the RecipeRouter class. We can do this by adding the following:
-> /v1/recipes v1.recipe.RecipeRouter
Now, any request whose path starts with that prefix will be delegated to our router.
Next, I am creating a package for my recipe-related classes: app/v1/recipe. In here, let's create an entry point that will handle the requests, RecipeRouter:
package v1.recipe

import javax.inject.Inject

import play.api.routing.Router.Routes
import play.api.routing.SimpleRouter
import play.api.routing.sird._

class RecipeRouter @Inject()(controller: RecipeController) extends SimpleRouter {
  val prefix = "/v1/recipes"

  def link(id: String): String = {
    import io.lemonlabs.uri.dsl._
    val url = prefix / id.toString
    url.toString()
  }

  override def routes: Routes = {
    case GET(p"/") =>
      controller.index

    case GET(p"/all/$id") =>
      controller.showAll(id)
  }
}
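To make the matching in `routes` easier to picture, here is a rough plain-Scala analogue. It is not Play's implementation: it uses a regular expression in place of the `p"..."` extractors, and `RouteSketch`, `Target`, `Index` and `ShowAll` are hypothetical names standing in for the controller actions. The `path` argument is assumed to be what remains after the `/v1/recipes` prefix has been stripped.

```scala
object RouteSketch {
  // Hypothetical stand-ins for the two controller actions.
  sealed trait Target
  case object Index extends Target
  final case class ShowAll(id: String) extends Target

  // Plays the role of p"/all/$id": exactly one path segment is captured as `id`.
  private val AllPage = "/all/([^/]+)".r

  // Returns the matching target, or None when no case applies.
  def resolve(method: String, path: String): Option[Target] =
    (method, path) match {
      case ("GET", "/")         => Some(Index)
      case ("GET", AllPage(id)) => Some(ShowAll(id))
      case _                    => None
    }
}
```

Note that `id` arrives as a `String`, exactly as in the real router, so any conversion to a number is left to the controller.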
Connecting to website
Going further, it's time to finally use Jsoup to get some content out of the website. In the previous section, we referenced a showAll function, and that's the one I will be implementing here. We will use the id parameter to select which page of results to fetch.
So, let's move to the RecipeController class and define the first element: sourceUrl. I like to have it defined as a class field, so that it is accessible by all methods:
val sourceUrl: String = "https://kwestiasmaku.com"
Then, inside our method, let's get the content of this website. It can be done like this:
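def showAll(pageId: String): Action[AnyContent] = Action { implicit request =>
  // Fetch and parse the requested listing page.
  val htmlDocument = Jsoup.connect(s"${sourceUrl}/home-przepisy?page=${pageId}").get()
  // A Jsoup Document cannot be written to the response directly, so return its HTML as a string.
  val r: Result = Ok(htmlDocument.outerHtml()).as(HTML)
  r
}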
In the snippet above, we connect to the website by using the Jsoup.connect method, passing the correct URL as a string. What we get is an HTML document, which we just return for now. We will do some finer extraction in the next section.
To test the endpoint, call the following URL in your browser or in any HTTP client, like curl:
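Since `pageId` is interpolated straight into the URL and `Jsoup.connect(...).get()` throws on network errors, a slightly more defensive variant may be worth considering. The following is a sketch under assumptions, not part of the original controller: `RecipePageFetcher`, `pageUrl` and `fetch` are hypothetical names, and the user agent string and the 10 second timeout are arbitrary example values.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import scala.util.Try

object RecipePageFetcher {
  val sourceUrl: String = "https://kwestiasmaku.com"

  // Builds the listing URL only when `pageId` is a non-negative integer,
  // so arbitrary text never ends up in the query string.
  def pageUrl(pageId: String): Option[String] =
    pageId.toIntOption
      .filter(_ >= 0)
      .map(n => s"$sourceUrl/home-przepisy?page=$n")

  // Wraps the blocking Jsoup call in Try, so that network errors and
  // malformed URLs become a Failure instead of an uncaught exception.
  // The user agent and timeout below are assumed values.
  def fetch(url: String): Try[Document] =
    Try {
      Jsoup.connect(url)
        .userAgent("recipe-scraper-example")
        .timeout(10000) // milliseconds
        .get()
    }
}
```

In the controller, a `None` from `pageUrl` could be mapped to `BadRequest`, and a `Failure` from `fetch` to an error status of your choice.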
http://localhost:9000/v1/recipes/all/1
You should see a response containing the raw HTML document. Now, let's parse it and extract some meaningful data.
Extracting the content
Now, let's extract some specific elements from the raw HTML. For this, we can use the select method of Jsoup. This method takes a CSS-like selector - if you have used CSS before, then it is as easy as specifying the correct string. In our case, to extract the HTML elements of the individual recipes, use the following:
import scala.jdk.CollectionConverters._

val recipesDomElements = htmlDocument.select("section#block-system-main .col").asScala
This query returns a collection-like structure that we can easily iterate over. To extract specific elements from a single recipe, use the following:
final case class Recipe(title: String, href: String, img: String)

val recipeData = for (recipeElement <- recipesDomElements) yield Recipe(
  recipeElement.select(".views-field-title a").html(),
  sourceUrl + recipeElement.select(".views-field-title a").attr("href"),
  recipeElement.select("img").attr("src")
)
In here, we use a simple for comprehension, which yields the concrete data by wrapping it into the Recipe case class. Now we have nicely cleaned data, and we can finally return it as JSON.
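The selection and extraction steps can be tried without touching the network by parsing an inline HTML string. The sketch below rests on assumptions: the markup, the `example.com` URLs and the `ExtractionSketch` name are made up to match the selectors used above, and the real page may be structured differently. It also shows two alternatives you might prefer: `text()` instead of `html()` for the title, and Jsoup's `abs:` attribute prefix instead of string concatenation for the link.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object ExtractionSketch {
  final case class Recipe(title: String, href: String, img: String)

  // Hypothetical markup shaped to match the selectors used in the article.
  val sampleHtml: String =
    """<section id="block-system-main">
      |  <div class="col">
      |    <div class="views-field-title"><a href="/recipe/pierogi">Pierogi</a></div>
      |    <img src="https://example.com/img/pierogi.jpg">
      |  </div>
      |  <div class="col">
      |    <div class="views-field-title"><a href="/recipe/bigos">Bigos</a></div>
      |    <img src="https://example.com/img/bigos.jpg">
      |  </div>
      |</section>""".stripMargin

  // The second argument to Jsoup.parse is the base URI that Jsoup uses to
  // resolve relative links when an attribute is read with the "abs:" prefix.
  def recipes(baseUri: String): List[Recipe] = {
    val document = Jsoup.parse(sampleHtml, baseUri)
    document.select("section#block-system-main .col").asScala.toList.map { el =>
      val link = el.select(".views-field-title a")
      Recipe(
        title = link.text(),           // text content, without nested markup
        href  = link.attr("abs:href"), // resolved against baseUri
        img   = el.select("img").attr("src")
      )
    }
  }
}
```

Calling `ExtractionSketch.recipes("https://example.com")` yields one `Recipe` per `.col` element, with the relative `href` values turned into absolute URLs.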
Returning a JSON
The last step is to return our collection of Recipe case classes as JSON. We can do this using Play's built-in Json class and its methods:
implicit val recipeWrites: OWrites[Recipe] = Json.writes[Recipe]

val r: Result = Ok(Json.toJson(recipeData))
r
An important thing to keep in mind is that before converting a case class into its JSON representation, we must declare an implicit writer, so that the Json class knows how to write this specific data.
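To see the writer in isolation, the conversion can be run outside the controller. This is a minimal sketch assuming play-json is on the classpath (it is in a Play project); `JsonSketch` and the sample recipe are made up for illustration.

```scala
import play.api.libs.json.{JsValue, Json, OWrites}

object JsonSketch {
  final case class Recipe(title: String, href: String, img: String)

  // Without this implicit in scope, Json.toJson(recipes) does not compile.
  implicit val recipeWrites: OWrites[Recipe] = Json.writes[Recipe]

  // Made-up sample data for illustration.
  val recipes: Seq[Recipe] = Seq(
    Recipe("Pierogi", "https://example.com/recipe/pierogi", "https://example.com/img/pierogi.jpg")
  )

  // A Seq[Recipe] is serialized as a JSON array with one object per recipe.
  val json: JsValue = Json.toJson(recipes)
}
```

Each object in the resulting array has the fields `title`, `href` and `img`, named after the case class parameters; `Json.prettyPrint(JsonSketch.json)` is a convenient way to inspect it.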
And that's it! The complete RecipeController class implementation looks like the following:
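package v1.recipe

import javax.inject.Inject

import org.jsoup.Jsoup
import play.api.mvc._
import play.api.libs.json.{Json, OWrites}

import scala.jdk.CollectionConverters._

final case class Recipe(title: String, href: String, img: String)

class RecipeController @Inject()(val controllerComponents: ControllerComponents) extends BaseController {

  implicit val recipeWrites: OWrites[Recipe] = Json.writes[Recipe]

  val sourceUrl: String = "https://kwestiasmaku.com"

  def index: Action[AnyContent] = Action { implicit request =>
    val r: Result = Ok("hello world")
    r
  }

  def showAll(pageId: String): Action[AnyContent] = Action { implicit request =>
    val htmlDocument = Jsoup.connect(s"${sourceUrl}/home-przepisy?page=${pageId}").get()
    val recipesDomElements = htmlDocument.select("section#block-system-main .col").asScala
    // The rest of this method is assembled from the snippets shown earlier in the post.
    val recipeData = for (recipeElement <- recipesDomElements) yield Recipe(
      recipeElement.select(".views-field-title a").html(),
      sourceUrl + recipeElement.select(".views-field-title a").attr("href"),
      recipeElement.select("img").attr("src")
    )
    val r: Result = Ok(Json.toJson(recipeData))
    r
  }
}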
Summary
I hope you have found this post useful. If so, don't hesitate to like or share this post. Additionally, you can follow me on my social media if you fancy.