A Quick Start to Databricks on AWS
Temiloluwa Adeoti
Posted on April 24, 2022
Big data technologies have evolved rapidly over the last two decades, making it difficult to define clear-cut skill sets for data roles. For example, Data Scientists need to pre-process data before building models, but to what extent does this differ from Data Engineering? It is common to find role overlaps between Data Analysts and Data Scientists, Data Scientists and Machine Learning Engineers, and Machine Learning Engineers and DevOps Engineers.
This challenge extends to development infrastructure. Organizations often run isolated infrastructure stacks, each with tools dedicated to a single purpose such as Web Development, Data Engineering, or Data Science. This siloed setup limits collaboration across teams, creates data security challenges, and undermines effective data governance strategies.
What if a centralized data platform existed where authorized users, irrespective of their technical background and intended use case, could access data efficiently and with consistency guarantees?
What is Databricks?
Databricks is a SaaS platform for developing cloud-agnostic AI and data analytics solutions. Its creators are responsible for successful open-source projects such as Apache Spark and MLflow. Over 5,000 companies currently use Databricks, and it integrates with more than 450 partner technologies such as Tableau, Qlik, SageMaker, and MathWorks.
On Databricks, development teams can set up Git repositories and run notebooks for Apache Spark applications in Python, Scala, R, and SQL. All-purpose clusters can be provisioned as the development environment, while job clusters, managed by the Databricks job scheduler, can be used to run automated jobs.
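As a rough illustration, a Python notebook cell running on an all-purpose cluster might look like the sketch below; the CSV path is a placeholder, and `spark` is the SparkSession that Databricks notebooks expose automatically.

```python
# `spark` is pre-configured in Databricks notebooks; no SparkSession setup is needed.
# The file path below is a placeholder; point it at data your cluster can access.
df = spark.read.csv("/databricks-datasets/path/to/sample.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can also be queried with SQL.
df.createOrReplaceTempView("sample_data")

# The same data can now be explored from Python or a %sql cell.
spark.sql("SELECT COUNT(*) AS row_count FROM sample_data").show()
```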
Why use Databricks?
Databricks currently has no streaming data ingestion offering like Kinesis. The competing services on AWS are EMR and Glue for running Spark jobs, and SageMaker for Spark machine learning. Given these services, is there any reason to consider Databricks?
At the core of Databricks is the Data Lakehouse platform, which is founded on Delta Lake. Delta Lake provides high-performance ACID table storage over cloud object stores. Since S3 is a cost-effective solution for storing structured and unstructured data, a Delta Lake can be built on top of it to provide the following benefits:
A unified environment that powers diverse teams such as Data Analytics, Machine Learning, and Data Engineering.
Consistency guarantees when performing multi-object updates on a table that spans multiple files in an object store.
Rollbacks for unsuccessful transactions and queries over point-in-time snapshots, as sketched below.
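To make the ACID and time-travel benefits concrete, here is a minimal Delta Lake sketch in PySpark, assuming a Databricks cluster with access to a hypothetical bucket named my-bucket:

```python
from pyspark.sql import Row

# Write a small DataFrame to a Delta table backed by an S3 path.
# The bucket name is a placeholder; the cluster must have access to it.
events = spark.createDataFrame([Row(id=1, status="new"), Row(id=2, status="new")])
events.write.format("delta").mode("overwrite").save("s3a://my-bucket/delta/events")

# Appends and updates are atomic: readers never see a partially written table.
updates = spark.createDataFrame([Row(id=3, status="late")])
updates.write.format("delta").mode("append").save("s3a://my-bucket/delta/events")

# Query a point-in-time snapshot of the table with Delta time travel.
first_version = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3a://my-bucket/delta/events")
)
first_version.show()
```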
Databricks on AWS
Unlike Azure, where Databricks is available as a first-party service, a search for Databricks in the AWS console yields no result. To get started quickly with Databricks on AWS, there are two options:
Full Data Platform
Go to Databricks and click the Try Databricks button. Fill in the form, then select AWS as your desired platform.
Select a Databricks subscription plan: Standard, Premium, or Enterprise.
Set up a workspace using your AWS account. A workspace is simply a collaboration environment for your Databricks resources. You will be redirected to log in to your AWS account.
Authorize a CloudFormation stack to create the Databricks resources. By default, three i3.xlarge EC2 instances are provisioned for the Spark cluster.
A URL will be sent to your email when your workspace is ready for development.
Do not forget to terminate the default cluster when you sign out to prevent incurring unwanted costs.
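One way to guard against forgotten clusters is to create them with an auto-termination timeout. The sketch below uses the Databricks Clusters REST API from Python; the workspace URL, access token, and runtime version are placeholders you would replace with your own values.

```python
import requests

# Placeholders: your workspace URL and a personal access token.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Cluster spec for the Clusters API. autotermination_minutes shuts the cluster
# down after a period of inactivity so it stops accruing EC2 costs.
cluster_spec = {
    "cluster_name": "dev-cluster",
    "spark_version": "10.4.x-scala2.12",  # example runtime version
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "autotermination_minutes": 30,
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(response.json())
```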
Databricks Community Edition and S3
If you don't want to link your AWS account to Databricks, or simply want to try it out, you can use the Databricks Community Edition. You can run IPython notebooks for free on a 15 GB cluster that is fully managed by Databricks.
You can mount your S3 buckets in your Databricks notebooks through the Databricks File System (DBFS). A guide to implementing this can be found in the Databricks documentation.
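For illustration, a mount might look something like the sketch below. The bucket name and the secret scope are placeholders, and the AWS keys are read from a Databricks secret scope rather than hard-coded in the notebook.

```python
# Placeholders: replace the scope, keys, and bucket name with your own values.
ACCESS_KEY = dbutils.secrets.get(scope="aws", key="access_key")
SECRET_KEY = dbutils.secrets.get(scope="aws", key="secret_key")
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")

AWS_BUCKET_NAME = "my-bucket"
MOUNT_NAME = "my-bucket"

# Mount the bucket under /mnt so every notebook in the workspace can read it
# through DBFS paths instead of raw S3 URIs.
dbutils.fs.mount(
    source=f"s3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@{AWS_BUCKET_NAME}",
    mount_point=f"/mnt/{MOUNT_NAME}",
)

# List the mounted files to confirm the mount worked.
display(dbutils.fs.ls(f"/mnt/{MOUNT_NAME}"))
```

Once mounted, the bucket contents are available to every cluster in the workspace under /mnt/my-bucket.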
Summary
Databricks provides a unified platform for data teams to build AI and analytics applications. It is founded on the Data Lakehouse platform, which is powered by Delta Lake, a technology that brings high-performance ACID guarantees to cloud object stores. You can set up Databricks on your AWS account, or mount S3 buckets in notebooks on the Databricks Community Edition.
References
Armbrust, M., Das, T., Sun, L., Yavuz, B., Zhu, S., Murthy, M., ... & Zaharia, M. (2020). Delta lake: high-performance ACID table storage over cloud object stores. Proceedings of the VLDB Endowment, 13(12), 3411-3424.
Lee, D., Das, T., & Jaiswal, V. (2022). Delta Lake The Definitive Guide [E-book]. O’Reilly Media.