JayReddy
Posted on January 31, 2022
Apache Cassandra is an open-source, distributed data storage system that is durable, scalable, and tunably consistent. Paired with an analytical engine like Spark, it is also a strong fit for OLAP workloads.
Cassandra first started as an Apache Incubator project in 2009. Shortly thereafter it gained a lot of traction and grew into what we see today. Cassandra has an active community of enthusiastic developers and is used in production by many big companies on the web, to name a few: Facebook, Twitter, and Netflix.
Apache Cassandra operates blazingly fast compared to an RDBMS for database writes and can store hundreds of terabytes of data in a decentralized, symmetrical way with no single point of failure.
Quill is a Scala library that provides a Quoted Domain Specific Language to express queries in Scala and execute them in a target language.
Quill and Cassandra are a perfect match for querying, running, and optimizing unstructured data in Apache Spark, offering exceptional capabilities for working with NoSQL distributed databases.
Cassandra can be integrated with Apache Spark using a Docker image or a jar file. Let's go with the Docker image approach, since we may want to experiment with the Cassandra shell (cqlsh) and the Cassandra Query Language (CQL).
Setup:
Let's spin up an Apache Cassandra node on Docker:
version: "3"
services:
  cassandra:
    image: cassandra:latest
    volumes: ["./cassandra/cassandra.yaml:/etc/cassandra/cassandra.yaml"]
    ports: ["9042:9042"]
    networks: ["sandbox"]
networks:
  sandbox:
    driver: bridge
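Save this as docker-compose.yml and bring the node up in the background:
sudo docker-compose up -d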
We will use Apache Spark's recent release, 3.2.0, as a shell, pulling in the Cassandra connector and Cassandra's client library 3.11 via --packages, and pass our Scala assembly jar to the shell:
./spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0-beta,com.datastax.cassandra:cassandra-driver-core:3.11.0 --jars spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar
Apache Cassandra has a dedicated shell, cqlsh, much like Scala's REPL, for running CQL scripts.
To start cqlsh, we exec into the running container:
sudo docker exec -it container_name cqlsh
First, we have to create a keyspace, which is like a container where our tables will reside.
CREATE KEYSPACE spark_playground WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
Once cqlsh and the keyspace are in place, we can run any Cassandra query and write the results out using Spark's write method.
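For example, here is a minimal sketch (not a full pipeline) that reads a hypothetical users table from the spark_playground keyspace through the connector and writes it out as Parquet, assuming the Spark session created in the Usage section below:
// Read a Cassandra table into a DataFrame via the connector's data source
val usersDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "spark_playground", "table" -> "users"))
  .load()
// Write the result back out with Spark's write method
usersDF.write.parquet("/tmp/users_export")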
Let's get to the main focus of the article now.
Spark Integration with Quill
Quill shines in offering a fully type-safe library to use Spark’s highly-optimized SQL engine.
Let's see how Quill works:
1. Schema mapping: simple case classes are used to map the database schema.
2. Quoted DSL: Quill leverages a quote block in which queries are defined. Quill parses each quoted block at compile time with the help of Scala's powerful compiler and translates the queries into an internal Abstract Syntax Tree (AST), similar to Spark's DAGs.
3. Compile-time query generation: ctx.run reads the AST generated in step 2 and translates it into the target language at compile time.
4. Compile-time query validation: if configured, the query is verified against the database schema at compile time, and the compilation status depends on the result. The query validation step does not alter the database state.
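As a minimal sketch of these steps (assuming the Quill Spark context from the Usage section below is already imported and a Dataset[Person] named people is in scope):
case class Person(name: String, age: Int)
// The filter is parsed into Quill's AST and translated at compile time;
// run executes it on Spark and returns a Dataset[Person]
val adults = run { liftQuery(people).filter(_.age > 18) }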
Let's import the quill-spark dependency into our build.sbt and build the project:
libraryDependencies ++= Seq(
  "io.getquill" %% "quill-spark" % "3.16.1"
)
Usage:
Let's create a Spark session and import the required packages.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark =
  SparkSession.builder()
    .master("local")
    .appName("spark-quill-test")
    .getOrCreate()
// The Spark SQL Context must be provided by the user through an implicit value:
implicit val sqlContext = spark.sqlContext
import spark.implicits._
// Import the Quill Spark Context
import io.getquill.QuillSparkContext._
Note: unlike the other Quill modules, the Spark context is a companion object; it does not depend on a Spark session.
Using Quill-Spark:
The run method returns a Dataset transformed by the Quill query using the SQL engine.
// Typically you start with a typed Dataset; a simple case class maps the schema.
case class Data(fact: String, dim: Int)
val data: Dataset[Data] = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/home/lambdaverse/spark_and_cassandra/test_data.csv").as[Data]
// The liftQuery method converts Datasets to Quill queries:
val dataQuery = quote { liftQuery(data) }
// Lifted queries compose like any other Quill query, e.g. a join against a
// second lifted Dataset (addresses is assumed to be in scope):
val joined = quote {
  dataQuery.join(liftQuery(addresses)).on((d, a) => d.fact == a.dim)
}
Here is an example of a Dataset being converted into a Quill query, filtered, and mapped back out:
import org.apache.spark.sql.Dataset
def filter(myDataset: Dataset[Data], fact: String): Dataset[Int] =
  run {
    liftQuery(myDataset).filter(_.fact == lift(fact)).map(_.dim)
  }
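The result is a plain Dataset, so the usual Spark operations apply. For instance, with the data Dataset from above and a hypothetical "someFact" value:
val dims = filter(data, "someFact")
dims.show()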
Conclusion:
Apache Spark is a popular big data analytics engine used by many Fortune 500 companies.
Apache Spark is known for its ability to process structured and unstructured data, and almost 70% of the data generated today is unstructured.
Cassandra is written in Java and offers exceptional features as part of the modern data stack; Quill is a Scala library that supports Cassandra integration.
This post is purely meant for educational purposes; opinions are my own, and some content is adapted from the official Quill page for better elaboration.
If you like the post, please clap to show your support.
Subscribe to my newsletter to stay up to date on data engineering content: https://lambdaverse.substack.com/