Creating and running Spark Jobs in Scala on Cloud Dataproc !!!


Josue Luzardo Gebrim

Posted on December 22, 2021


Deploying and running Spark jobs written in Scala on GCP is easy!!

This is a simple tutorial with examples of how to use Google Cloud to easily run Spark jobs written in Scala! :)

Install the Java 8 JDK or Java 11 JDK

To check if Java is installed on your operating system, use the command below:

java -version

Depending on the installed Java version, the output of this command will vary … :)
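For example, with an OpenJDK 11 installation the output looks something like this (the exact version and build strings will differ on your machine):

$ java -version
openjdk version "11.0.9.1" 2020-11-04
OpenJDK Runtime Environment (build 11.0.9.1+1)
OpenJDK 64-Bit Server VM (build 11.0.9.1+1, mixed mode)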

If Java has not been installed yet, install it from one of these links: Oracle Java 8, Oracle Java 11, or AdoptOpenJDK 8/11, always checking the compatibility between the JDK and Scala versions by following the guidelines at this link: JDK Compatibility.
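On Ubuntu, for example, OpenJDK can also be installed from the package manager (a minimal sketch; package names vary by distribution):

sudo apt install openjdk-11-jdk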

Install Scala

  • Ubuntu:

sudo apt install scala

  • macOS:
brew update
brew install scala

Set the SCALA_HOME environment variable and add it to the path, as shown in the Scala installation instructions. Example:

export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin

If you use Windows:

set SCALA_HOME=c:\Progra~1\Scala
set PATH=%PATH%;%SCALA_HOME%\bin

Start Scala.

$ scala 
Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.9.1).
Type in expressions for evaluation. Or try :help.

Depending on your installation, the Scala and Java versions may differ …
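Inside the REPL you can evaluate expressions directly, for example:

scala> 1 + 1
res0: Int = 2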

Copy and paste the HelloWorld code into the Scala REPL (the interactive mode in the terminal):

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

Save HelloWorld.scala and exit the REPL.

scala> :save HelloWorld.scala 
scala> :q

Compile it with scalac:

$ scalac HelloWorld.scala

List the compiled .class files.

$ ls -al HelloWorld*.class
HelloWorld$.class
HelloWorld.class
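You can also run the compiled class directly with the scala runner (from the directory containing the .class files):

$ scala HelloWorld
Hello, world!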

Copying the Scala code

Let’s put the house in order!

$ mkdir hello
$ cd hello
$ echo \
'object HelloWorld {def main(args: Array[String]) = println("Hello, world!")}' > \
HelloWorld.scala

Create a build.sbt configuration file to define artifactName (the name of the jar file you will generate below) as “HelloWorld.jar”:

echo \
'artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
"HelloWorld.jar" }' > \
build.sbt
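Optionally, you can also pin the Scala version in build.sbt (the version below is just an example; pick one compatible with your JDK and, later, with your cluster’s Spark build):

echo 'scalaVersion := "2.12.15"' >> build.sbt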

Start sbt (the Scala build and dependency management tool) and run the code:

$ sbt

If SBT has not been installed yet, it will be downloaded and installed; it may take some time….

sbt:hello> run
[info] running HelloWorld 
Hello, world!
[success] Total time: 0 s, completed 4 de fev de 2021 00:20:26

Package your project to create a .jar file

sbt:hello> package
[success] Total time: 0 s, completed 4 de fev de 2021 00:20:35
sbt:hello> exit
[info] shutting down sbt server

The packaged file is at “./target/scala-&lt;version&gt;/HelloWorld.jar”.

Google Cloud Platform (GCP)

To use the Google Cloud services, it is necessary to enable billing; in this example, we will use Cloud Storage to store the jar file with our Scala code and Cloud Dataproc to run it as a Spark job.

Both services have a free usage quota, and once that quota is exhausted, charges will be billed to the credit card linked to the account!

Copy the jar file to a bucket on Cloud Storage:

[Screenshot: uploading the jar file to a Cloud Storage bucket]
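If you prefer the terminal, the same copy can be done with gsutil (the bucket name below is a placeholder, and the scala-2.12 directory depends on your Scala version):

$ gsutil cp target/scala-2.12/HelloWorld.jar gs://your-bucket-name/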

With the link to the bucket where the jar file is stored, submit a new job in Cloud Dataproc:

[Screenshot: the job submission form in Cloud Dataproc]

Field references:

  • Cluster: Select the specific cluster from the list.

  • Job type: Select “Spark”.

  • Main class or jar: Paste the gs:// link to the jar file in the bucket.

Before submitting, check that the Spark cluster has been created :)
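If you have not created one yet, a cluster can also be created with the Google Cloud CLI; a minimal sketch (the cluster name and region are placeholders):

$ gcloud dataproc clusters create my-cluster --region=us-central1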

After filling out the form, select “Submit”; the job will be executed and will appear in the list of jobs:

[Screenshot: the submitted job in the Dataproc job list]

By selecting the Job ID, you can see its output:

[Screenshot: the job output in Cloud Dataproc]

This was a brief example of how to deploy a Spark routine written in Scala in the Google environment. It is also possible to interact with the Spark cluster via spark-shell by accessing it over SSH, and instead of submitting via the form, we could use the Google Cloud CLI.
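For example, a submission equivalent to the form above looks something like this with the gcloud CLI (the cluster name, region, and bucket are placeholders):

$ gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 \
    --class=HelloWorld --jars=gs://your-bucket-name/HelloWorld.jar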

References:

https://cloud.google.com/dataproc/docs/tutorials/spark-scala


Follow me on Medium :)
