Spark is lit once again

minutis

Mindaugas

Posted on October 29, 2021

@pdambrauskas and I are marking Hacktoberfest by releasing our little in-house project...

Lighter - Running Spark applications on Kubernetes

Here at Exacaster, Spark applications have been used extensively for years. We started out running them on our Hadoop clusters with YARN as the resource manager. However, with our most recent product we started moving towards a cloud-based solution and decided to use Kubernetes for our infrastructure needs.

Livy

When running Spark applications on YARN, you can submit jobs using:

  • Spark client
  • Apache Livy - an open-source REST API for interacting with Apache Spark from anywhere.

The latter was our go-to solution back when we were only using Spark on YARN. Sadly, Apache Livy is not maintained anymore: it has no K8s support, and its Spark client gets more outdated with every passing day. For some time we used @jahstreet's fork, which added K8s support. But when we saw that the Livy project still wasn't receiving any updates, we decided to implement our own solution - Exacaster Lighter.

Lighter

Exacaster Lighter is heavily inspired by Apache Livy. The idea is the same: hide the Spark application client behind a REST API. However, we are focusing on running those applications on a K8s cluster. YARN mode is also supported. We designed our application to be extensible with different execution backends.
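To make the "client behind a REST API" idea concrete, here is a minimal sketch of what a batch submission could look like from the caller's side. It is modeled on the Livy-style `POST /batches` call; the base URL, the job name, and the file path are assumptions made up for illustration, and the exact set of fields Lighter accepts should be checked against its documentation.

```python
import json
import urllib.request

# Hypothetical server address; replace with wherever the REST API is exposed.
BASE_URL = "http://lighter.example.com:8080/lighter/api"


def build_batch_request(name, file, args=(), conf=None):
    """Build (but do not send) a Livy-style POST /batches request."""
    payload = {
        "name": name,
        "file": file,
        "args": list(args),
        "conf": conf or {},
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical job definition, for illustration only.
request = build_batch_request(
    name="daily-report",
    file="s3a://my-bucket/jobs/report.py",
    args=["2021-10-29"],
    conf={"spark.executor.instances": "2"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually submit the job: urllib.request.urlopen(request)
```

The request is only built, not sent, so the sketch runs without a server. The caller never touches `spark-submit` or cluster credentials; it only needs HTTP access to the service.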

Lighter has a lightweight React-based UI written in TypeScript and a back end written in Java, with minor Python integration points.

Simplified illustration of the architecture:

                                              ┌────────────────────────────────────────────────────────────────────────────┐
                                              │ Lighter                                                                    │
                                              │     ┌────────────────────────────────────────────────────────────────┐     │
                                              │     │                                                                │     │
                                              │     │                         Internal storage                       │     │
                                              │     │                                                                │     │
                                              │     │                                                                │     │
                                              │     └▲────────▲────────────────────┬─────────────────────────┬───────┘     │
                                              │      │        │                    │                         │             │
                                              │  store app    │                 get│new apps            sync status        │
                                              │      │     check status            │                         │             │
┌────────────────────┐                    ┌───┴──────┴──────────┐           ┌──────▼─────────┐      ┌────────▼────────┐    │
│                    │                    │                     │           │                │      │                 │    │
│                    │  Submit            │                     │           │                │      │                 │    │
│                    ├────────────────────►                     │           │                │      │                 │    │
│      Client        │                    │       REST api      │           │  App executor  │      │ Status tracker  │    │
│                    │  Check status      │                     │           │                │      │                 │    │
│                    ◄────────────────────┤                     │           │                │      │                 │    │
│                    │                    │                     │           │                │      │                 │    │
│                    │                    │                     │           │                │      │                 │    │
└────────────────────┘                    └───┬─────────────────┘           └────────┬───────┘      └────────┬────────┘    │
                                              │                                      │                       │             │
                                              │                                   execute               get status         │
                                              │                                      │                       │             │
                                              │                              ┌───────▼───────────────────────▼──────┐      │
                                              │                              │                                      │      │
                                              │                              │                                      │      │
                                              │                              │                Backend               │      │
                                              │                              │               (YARN/K8s)             │      │
                                              │                              │                                      │      │
                                              │                              │                                      │      │
                                              │                              └──────────────────────────────────────┘      │
                                              │                                                                            │
                                              └────────────────────────────────────────────────────────────────────────────┘

More information can be found on our documentation page.
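The flow in the diagram can be mimicked with a few lines of Python. This is only a toy model of the roles shown there (REST API, internal storage, app executor, status tracker, backend), not Lighter's actual code, which is written in Java. All class names, function names, and state names below are hypothetical.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical execution backend (for example YARN or K8s)."""

    @abstractmethod
    def execute(self, app_id):
        ...

    @abstractmethod
    def get_status(self, app_id):
        ...


class FakeBackend(Backend):
    """In-memory stand-in so the sketch runs without a cluster."""

    def __init__(self):
        self.launched = []

    def execute(self, app_id):
        self.launched.append(app_id)

    def get_status(self, app_id):
        return "RUNNING" if app_id in self.launched else "UNKNOWN"


# Internal storage: application id -> state.
storage = {}


def submit(app_id):
    """REST API role: store the application for later pickup."""
    storage[app_id] = "NEW"


def run_executor(backend):
    """App executor role: launch every application that is still new."""
    for app_id, state in list(storage.items()):
        if state == "NEW":
            backend.execute(app_id)
            storage[app_id] = "SUBMITTED"


def run_status_tracker(backend):
    """Status tracker role: sync backend status into storage."""
    for app_id, state in list(storage.items()):
        if state != "NEW":
            storage[app_id] = backend.get_status(app_id)


backend = FakeBackend()
submit("app-1")
run_executor(backend)
run_status_tracker(backend)
print(storage)
```

Because the executor and the tracker only talk to the `Backend` interface, adding another execution backend means adding another subclass, while the rest of the flow stays the same.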

UI

This is the job list view:
[Screenshot: job list]

You can see the configuration of a submitted job by opening it:
[Screenshot: job configuration]

Driver logs are also available for each job:
[Screenshot: job logs]

How does it work?

Glad you asked. It is quite simple. Lighter uses Spark Launcher to launch Spark applications on a Kubernetes cluster. The launcher takes care of creating all the Pods needed for the Spark application to run. When launching applications, we tag them with a unique identifier by setting the config property spark.kubernetes.driver.label.spark-app-tag. Then we use that identifier to check the application status and retrieve application logs by calling the Kubernetes pods API with the labelSelector parameter.
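The tagging trick described above boils down to two small pieces: a Spark config entry that labels the driver pod, and a pods API query filtered by that label. Here is a minimal sketch of both. The helper names and the `spark` namespace are assumptions for illustration, while the config property and the `labelSelector` query parameter are the ones named in the text.

```python
import uuid
from urllib.parse import urlencode

TAG_LABEL = "spark-app-tag"


def tag_conf(app_tag):
    """Spark conf entry that labels the driver pod with the tag."""
    return {f"spark.kubernetes.driver.label.{TAG_LABEL}": app_tag}


def pods_query_path(namespace, app_tag):
    """Kubernetes pods API path filtered by the tag label."""
    query = urlencode({"labelSelector": f"{TAG_LABEL}={app_tag}"})
    return f"/api/v1/namespaces/{namespace}/pods?{query}"


# A random tag; any identifier unique per application would do.
app_tag = uuid.uuid4().hex
print(tag_conf(app_tag))
print(pods_query_path("spark", app_tag))  # "spark" is a hypothetical namespace
```

The same tag goes in at launch time and comes back out at query time, so the service does not need to remember pod names; the label alone is enough to find the driver pod again.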

Things get a bit more complicated with interactive sessions. We've created a Sparkmagic-compatible REST API so that the Sparkmagic kernel can communicate with Lighter the same way it does with Apache Livy. When a user creates an interactive session, the Lighter server submits a custom PySpark application that runs an infinite loop, constantly checking for new commands to execute. Each Sparkmagic command is saved in a Java collection, retrieved by the PySpark application through the Py4J gateway, and executed.
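The command loop can be illustrated with plain Python. In this sketch a `queue.Queue` stands in for the Java collection that the real application reads through Py4J, and a `STOP` sentinel is added only so the example terminates. Both are assumptions for illustration, as are the function name and the shape of the result records.

```python
import io
import queue
from contextlib import redirect_stdout

# Stand-in for the Java collection that the real application reads via Py4J.
commands = queue.Queue()
STOP = object()  # hypothetical sentinel so this sketch can terminate


def run_session(pending, results):
    """Poll for commands and execute each one in a shared namespace."""
    namespace = {}
    while True:
        try:
            code = pending.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing to run yet, keep polling
        if code is STOP:
            break
        output = io.StringIO()
        try:
            with redirect_stdout(output):
                exec(code, namespace)
            results.append({"status": "ok", "output": output.getvalue()})
        except Exception as error:
            results.append({"status": "error", "output": repr(error)})


results = []
commands.put("x = 21")
commands.put("print(x * 2)")
commands.put("1 / 0")
commands.put(STOP)
run_session(commands, results)
print(results)
```

Keeping one namespace for the whole session is what makes it feel interactive: a variable defined by one command is still there for the next one, and a failing command is reported without killing the loop.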

Use cases

Spark on K8s

Since Apache Spark 2.3, applications can be executed natively on a K8s cluster. When you submit your Spark application, driver and executor pods are created for it and removed after the application completes. But if you want to track the application status and report it to end-users in a nice manner, it gets complicated. Haha.
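As a small illustration of the status-tracking problem, the sketch below turns a driver pod object into a user-facing state. The pod phases on the left are the standard Kubernetes ones. The state names on the right, the `not_found` case, and the function name are hypothetical and not taken from Lighter.

```python
# Kubernetes pod phases (left) mapped to hypothetical user-facing states (right).
PHASE_TO_STATE = {
    "Pending": "starting",
    "Running": "running",
    "Succeeded": "success",
    "Failed": "error",
    "Unknown": "unknown",
}


def app_state(driver_pod):
    """Derive an application state from a driver pod object (as a dict)."""
    if driver_pod is None:
        return "not_found"  # pod already removed or never created
    phase = driver_pod.get("status", {}).get("phase", "Unknown")
    return PHASE_TO_STATE.get(phase, "unknown")


print(app_state({"status": {"phase": "Running"}}))
print(app_state(None))
```

The `None` branch hints at why this gets complicated: once pods are cleaned up, the cluster no longer knows how the application ended, so the last observed state has to be stored somewhere else.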

Spark on YARN

In the early days of the Big Data era, when K8s hadn't even been born yet, the common open-source go-to solution was the Hadoop stack. We wrote several old-fashioned MapReduce jobs and Pig scripts before we came across Spark. Since then Spark has become one of the most popular data processing engines. It is very easy to start using Lighter on YARN deployments. Just run the Docker container with the proper configuration and mount the necessary configuration files in all the default paths.

JupyterLab

For ad-hoc data analysis, JupyterLab on top of Spark is an elegant solution. However, these two great tools cannot communicate with each other directly, so Lighter together with Sparkmagic acts as a bridge. You only need to provide the correct configuration to Sparkmagic to have it working.
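For illustration, this is roughly what that configuration could look like, generated from Python. The `kernel_python_credentials` block is Sparkmagic's usual way of pointing a kernel at a Livy-compatible endpoint. The URL below is a made-up placeholder, and the exact path your Lighter deployment exposes should be taken from its documentation.

```python
import json

# Hypothetical address; point it at your own Lighter deployment.
LIGHTER_URL = "http://lighter.example.com:8080/lighter/api"


def sparkmagic_config(url):
    """Build a minimal Sparkmagic config dict for a Livy-compatible endpoint."""
    credentials = {"username": "", "password": "", "url": url, "auth": "None"}
    return {"kernel_python_credentials": credentials}


config_text = json.dumps(sparkmagic_config(LIGHTER_URL), indent=2)
print(config_text)  # Sparkmagic reads its config from ~/.sparkmagic/config.json
```

The printed JSON would go into `~/.sparkmagic/config.json`, after which the PySpark kernel in JupyterLab sends its session requests to that URL instead of a Livy server.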

Closing remarks

Lighter is a freshly baked tool, open-sourced for everyone to use. Since we developed it for the use cases that are familiar to us, feel free to contribute if you see any opportunities to make it better.
