Apache Spark and Databricks 101 pt. II - Some DataFrames

hugoestradas

Hugo Estrada S.

Posted on June 4, 2020

Apache Spark and Databricks 101 pt. II - Some DataFrames

Alt Text

One of the core API's of Apache Spark is the DataFrame.
It represents a table of data with rows and columns.

The list of columns and the respective datatypes are what's called the schema.
Like an Excel spreadsheet with named columns.

DataFrames are powerful and can be partitioned across thousands of computers at the same time.

They make parallel processing of big data possible:

Alt Text

Alt Text

1 Creating a Sample Spark DataFrame:

The 'inferSchema' option means Spark will automatically detect the schema for us.

2 Take a Glimpse into the DataFrame:

With the 'display()' function we can see our DataFrame in a nice table format:

Alt Text

If just so happens that you only want to take a look to the first x rows use the '.take(x)' function:

Alt Text

Remember, this is Spark and not Python.

3 Final Thoughts and Conclusions:

Spark DataFrames are completely separate from Pandas DataFrames. That's something for another lecture, and I will show you how to convert one into the other and the differences and benefits of each.

💖 💪 🙅 🚩
hugoestradas
Hugo Estrada S.

Posted on June 4, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related