Spark AI - Bringing ChatGPT to Data Engineering


Lucas Miranda

Posted on July 8, 2023


ChatGPT has brought a sea of possibilities with its huge capacity to understand human language. Since OpenAI opened its GPT models to developers through a REST API, many of those possibilities have started to become reality, like Bing integrating GPT as an extension of Microsoft's search tool, or Auto-GPT, "an experimental open-source attempt to make GPT-4 fully autonomous".
And now, more precisely on June 29, 2023, a post on Databricks' blog has introduced pyspark-ai, "the English SDK for Apache Spark". It brings a nice API on top of our well-known PySpark DataFrames, allowing us to load data from the web (like web scraping) into a DataFrame, perform transformations, run assertions about the data, and describe and plot different views of the dataset, all in natural language. Let's see some examples (from the original article).
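The snippets below assume pyspark-ai has been initialized along these lines (a minimal sketch; the explicit gpt-3.5-turbo model and temperature here are just one possible choice, and an OpenAI API key is assumed to be configured in the environment):

```python
# Minimal pyspark-ai setup sketch (assumes an OpenAI API key is configured).
from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI

# Passing an LLM explicitly is optional; by default SparkAI picks an OpenAI chat model.
spark_ai = SparkAI(llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))
spark_ai.activate()  # enables the df.ai.* methods used below
```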

```python
# ingest data
auto_df = spark_ai.create_df("https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand")
auto_df.show(n=5)
```
| rank | brand     | us_sales_2022 | sales_change_vs_2021 |
|------|-----------|---------------|----------------------|
| 1    | Toyota    | 1849751       | -9                   |
| 2    | Ford      | 1767439       | -2                   |
| 3    | Chevrolet | 1502389       | 6                    |
| 4    | Honda     | 881201        | -33                  |
| 5    | Hyundai   | 724265        | -2                   |
```python
# plot
auto_df.ai.plot()
# with instructions
auto_df.ai.plot("pie chart for US sales market shares, show the top 5 brands and the sum of others")
```

*Pie chart showing cars' US sales market shares, generated by AI*

```python
# transformations
auto_top_growth_df = auto_df.ai.transform("brand with the highest growth")
auto_top_growth_df.show()
```
| brand    | us_sales_2022 | sales_change_vs_2021 |
|----------|---------------|----------------------|
| Cadillac | 134726        | 14                   |
```python
# validation
auto_top_growth_df.ai.verify("expect sales change percentage to be between -100 to 100")
# outputs True
```

SparkAI also provides a cool API for UDFs:

```python
@spark_ai.udf
def previous_years_sales(brand: str, current_year_sale: int, sales_change_percentage: float) -> int:
    """Calculate previous years sales from sales change percentage"""
    ...
```
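The function body is generated by the LLM from the docstring, and the result can then be used like any regular Python UDF. A hypothetical usage sketch (the `spark` session and the `auto_sales` view name are assumptions, not part of the original example):

```python
# Hypothetical usage: register the generated function as a plain Spark UDF.
spark.udf.register("previous_years_sales", previous_years_sales)
auto_df.createOrReplaceTempView("auto_sales")
spark.sql("""
    SELECT brand,
           previous_years_sales(brand, us_sales_2022, sales_change_vs_2021) AS us_sales_2021
    FROM auto_sales
""").show()
```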

It looks amazing, right? If you want to give it a shot, I have built a CLI on top of pyspark-ai that you can run interactively. Check it out: https://github.com/lucas-lm/spark-ai-cli

PySparkAI CLI

Let's suppose we want to find the top 3 most-starred repositories in the google topic on GitHub (https://github.com/topics/google). Using the PySpark AI CLI, we could run the following command in a shell to get this view:

```bash
pyspark-ai https://github.com/topics/google --transform "top 3 python repos with more stars"
```

The results below were produced by the command above, using OpenAI's gpt-3.5-turbo as our LLM.

CLI Output

As we can see, it achieves satisfactory results, but there are some mistakes, like the wrong table name and a lowercase value in the filter even though the values in the DataFrame are Title Case.
As of today, pyspark-ai is still in early-stage development, and this kind of gap is expected.

Nevertheless, it has great potential to become a tool for studying and exploring datasets.

Note about pyspark-ai-cli:
The plot feature is not supported because pyspark-ai enforces Plotly as its visualization library (in the spark_ai.plot function), and Plotly does not display any figure when running from a terminal (https://github.com/plotly/plotly_express/issues/47).

If you want to get started with the PySpark AI CLI, check the instructions in the public repository. If you are more interested in pyspark-ai's features, check out its GitHub repository.

PySpark-AI under the hood

If you take a quick look at the pyspark-ai source code, you will notice that it follows a pattern (sketched in code after the list):

  1. Your input to methods like transform, plot, and verify is captured
  2. The input is used to compose a prompt from a template
  3. The prompt is processed by an LLM (commonly via OpenAI's REST API)
  4. The LLM's output is parsed to extract the code blocks
  5. The code blocks are executed with Python's exec function
  6. The results that matter from the execution are returned
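In rough pseudocode, the pattern looks something like this (a simplified sketch, not the library's actual implementation; all helper names here are made up):

```python
import re

def run_english_instruction(instruction: str, df_schema: str, llm) -> dict:
    # 1-2. Capture the user's input and compose a prompt from a template.
    prompt = (
        f"Given a Spark DataFrame with schema: {df_schema}\n"
        f"Write PySpark code to: {instruction}\n"
        "Wrap the code between <code> and </code> tags."
    )
    # 3. Send the prompt to an LLM (e.g. the OpenAI chat completions API).
    response = llm(prompt)
    # 4. Parse the output to extract the code block.
    code = re.search(r"<code>(.*?)</code>", response, re.DOTALL).group(1)
    # 5. Execute the extracted code with Python's exec function.
    namespace = {}
    exec(code, namespace)
    # 6. Return the results that matter from the execution.
    return namespace
```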

This high-level overview echoes the illustration in the Databricks blog post (https://www.databricks.com/blog/introducing-english-new-programming-language-apache-spark):

PySpark AI diagram: English language is processed by an LLM and results in PySpark code. Source: Databricks Blog

The Downside

Before talking about the cons of the "English SDK", we have to point out that the library lives under the databrickslabs organization on GitHub, which is a strong indication that it is experimental, is not meant to be treated as a reliable product, and of course is not ready for production environments.

What scares me the most about the approach embraced in pyspark-ai is that we have no control over the code that runs. Even though we can read the logs to understand the generated code, we do not get a chance to review that code before it executes.
Even before we had generative AIs as advanced as today's, exec and eval were functions to avoid because of the inherent security risks they carry.
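As a contrived illustration (not pyspark-ai's actual code), exec runs whatever string it receives, with no review step in between:

```python
# Contrived example: exec executes the string immediately, with no review step.
untrusted = "import os; print(os.listdir('/'))  # could just as easily delete files"
exec(untrusted)  # whatever the LLM generated runs right away
```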

Another problem that comes as a consequence of this dynamic execution is side effects. We cannot trust that the generated code will always be the same given the same input. Relying on a third-party service for the output can also be problematic, because we may face instability, increased latency, and other undesirable situations.

Execution failure
*Code generated by GPT (model gpt-3.5-turbo) failing at runtime.*

Conclusion

PySpark-AI, or the "English SDK" as it is being introduced, brings an innovative design with a nice API for working with PySpark, covering a good variety of operations.

It is easy to get started with, it can be useful for beginners, and non-technical users may feel more comfortable trying it as well.

It is not so reliable, though. Even with future enhancements, I cannot see this kind of solution becoming safe and stable enough to be applied at scale and/or in a real-world production environment.
