Google Colab, PySpark, and a remote Cassandra cluster: combine them all together
Shovon Basak
Posted on September 13, 2021
If you don't know what these things are, please navigate to the introductory links mentioned below to get a rough idea of the tech stack used here and what you can achieve with it.
Connecting Apache Spark with Cassandra, or with any other data source, is really easy, but I couldn't find any source that clearly explains how to connect PySpark and Cassandra. To spare anyone else the same struggle, just follow this tutorial and you'll be all set to start working with Spark RDDs and DataFrames, using a Cassandra database as the data source.
First of all, please create a new notebook in Google Colab. Then you need to install the JDK, an Apache Spark build bundled with Hadoop, and the findspark library in the Python environment. Copy the commands below, paste them into the Colab notebook, and run the code block to install all of them.
# Install Java 8, download and extract a Spark build pre-packaged for Hadoop 2.7, and install findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
!tar xf spark-2.4.8-bin-hadoop2.7.tgz
!pip install -q findspark
Here I am using spark-2.4.8, but you can install a more recent version as per your requirements.
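If you go with another release, the same pattern applies: keep the version in the download URL, the extracted folder name, and SPARK_HOME (set in the next step) consistent. Also note that the Cassandra connector package used further down is built for Spark 2.4 / Scala 2.11, so a newer Spark needs a matching connector version. A rough sketch with a [spark_version] placeholder (not from the original post, just to show the pattern):
!wget -q https://archive.apache.org/dist/spark/spark-[spark_version]/spark-[spark_version]-bin-hadoop2.7.tgz
!tar xf spark-[spark_version]-bin-hadoop2.7.tgz
# SPARK_HOME (set below) must then point to /content/spark-[spark_version]-bin-hadoop2.7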
Then you need to set some environment variables:
import os

# Point the environment at the Java and Spark installs from the previous step
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.8-bin-hadoop2.7"
Initialize findspark and check where it is located:
import findspark
findspark.init()
findspark.find()
Now this is the step where you are going to connect PySpark with Cassandra:
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

# Build a local SparkSession, pull in the Cassandra connector package and
# point it at the remote cluster with its credentials.
spark = (SparkSession.builder.master("local[*]")
    .config('spark.cassandra.connection.host', "[host_ip_address]")
    .config('spark.jars.packages', "datastax:spark-cassandra-connector:2.4.0-s_2.11")
    .config('spark.cassandra.auth.username', "[db_username]")
    .config('spark.cassandra.auth.password', "[db_password]")
    .getOrCreate())

SQL_LOCAL_CONTEXT = SQLContext(spark)

# Helper that loads a Cassandra table into a PySpark DataFrame.
def read_table(context, table):
    return context.read.format("org.apache.spark.sql.cassandra").options(table=table, keyspace="[key_space]").load()
Test your connection:
groups = read_table(SQL_LOCAL_CONTEXT, "[db_table]")
groups.show()
Replace the placeholders inside the brackets, such as host_ip_address, db_username, db_password, key_space and db_table, with the values for your own database.
groups will hold the data as a PySpark DataFrame.
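From here you can use the regular PySpark DataFrame API on it. A minimal sketch (the column name group_name and the value admins below are hypothetical placeholders, use columns from your own table):
groups.printSchema()
groups.filter(groups["group_name"] == "admins").show()
groups.createOrReplaceTempView("groups")
spark.sql("SELECT count(*) FROM groups").show()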