Lucas Miranda
Posted on July 8, 2023
ChatGPT has brought a sea of possibilities with its impressive capacity to understand human language. Since OpenAI opened its GPT models to developers through a REST API, many of those possibilities have started to become reality, like Bing, Microsoft's search tool, integrating GPT, or Auto-GPT - "An experimental open-source attempt to make GPT-4 fully autonomous".
And now, more precisely on June 29, 2023, a post on the Databricks blog has introduced pyspark-ai, "The English SDK for Apache Spark". It layers a nice API over the PySpark DataFrames we already know, allowing us to load data from the web (like web scraping) into a DataFrame, perform transformations, run assertions about the data, and describe and plot different views of the dataset, all in natural language. Let's see some examples (from the original article):
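The snippets below assume a SparkAI instance has already been created and activated. A minimal setup, assuming the default OpenAI-backed LLM (which expects the OPENAI_API_KEY environment variable to be set):
from pyspark_ai import SparkAI
# create the SparkAI entry point; by default it talks to OpenAI's API
spark_ai = SparkAI()
# patch PySpark DataFrames so the .ai namespace used below becomes available
spark_ai.activate()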
# ingest data
auto_df = spark_ai.create_df("https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand")
auto_df.show(n=5)
| rank | brand | us_sales_2022 | sales_change_vs_2021 |
|------|-----------|---------------|----------------------|
| 1 | Toyota | 1849751 | -9 |
| 2 | Ford | 1767439 | -2 |
| 3 | Chevrolet | 1502389 | 6 |
| 4 | Honda | 881201 | -33 |
| 5 | Hyundai | 724265 | -2 |
# plot
auto_df.ai.plot()
# with instructions
auto_df.ai.plot("pie chart for US sales market shares, show the top 5 brands and the sum of others")
# transformations
auto_top_growth_df = auto_df.ai.transform("brand with the highest growth")
auto_top_growth_df.show()
| brand | us_sales_2022 | sales_change_vs_2021 |
|----------|---------------|----------------------|
| Cadillac | 134726 | 14 |
# validation
auto_top_growth_df.ai.verify("expect sales change percentage to be between -100 to 100")
# outputs True
SparkAI also provides a cool API for UDFs:
@spark_ai.udf
def previous_years_sales(brand: str, current_year_sale: int, sales_change_percentage: float) -> int:
"""Calculate previous years sales from sales change percentage"""
    ...  # body intentionally left empty: the LLM generates the implementation from the signature and docstring
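Once decorated, the generated function can be registered and used like any handwritten UDF. A sketch of possible usage; the `spark` session variable, the view name, and the registration step are my own additions, not part of the original example:
# register the generated function as an ordinary Spark UDF (returning an int)
spark.udf.register("previous_years_sales", previous_years_sales, "int")
auto_df.createOrReplaceTempView("auto_sales")  # hypothetical view name
spark.sql(
    "SELECT brand, previous_years_sales(brand, us_sales_2022, sales_change_vs_2021) "
    "AS us_sales_2021 FROM auto_sales"
).show()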
It looks amazing, right? If you want to give it a shot, I have built a CLI on top of pyspark-ai so you can run it interactively. Check it out: https://github.com/lucas-lm/spark-ai-cli
PySparkAI CLI
Let's suppose we want to find the top 3 most-starred Python repositories under the google topic on GitHub (https://github.com/topics/google). Using the PySpark AI CLI, we could run this command in the shell to get that view:
pyspark-ai https://github.com/topics/google --transform "top 3 python repos with more stars"
The results below were produced by the command above, using OpenAI's gpt-3.5-turbo as our LLM.
As we can see, it achieves satisfactory results, but there are some mistakes, like the wrong table name and a lowercase value in the filter even though the values in the DataFrame are Title Case.
As of today, pyspark-ai is still in early-stage development, and this kind of gap is expected.
Nevertheless, it has great potential to become a tool for studying and exploring datasets.
Note about pyspark-ai-cli:
The plot feature is not supported because pyspark-ai enforces Plotly as its visualization library (in the spark_ai.plot function), and Plotly does not display any figure when running from a terminal (https://github.com/plotly/plotly_express/issues/47).
If you want to get started with the PySparkAI CLI, check the instructions in its public repository. If you are more interested in pyspark-ai's own features, check out its GitHub repository.
PySpark-AI under the hood
If you take a quick look at the pyspark-ai source code, you will notice that it follows a pattern:
- Your input to the methods (transform, plot, verify, etc.) is captured
- The input is used to fill in a prompt template
- The prompt is processed by an LLM (commonly through the GPT REST API)
- The LLM's output is parsed to extract the code blocks
- The code blocks are executed with Python's exec function
- The results that matter from the execution are returned
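To make this pattern concrete, here is a minimal sketch of that loop. It is illustrative only, not pyspark-ai's actual internals: the prompt wording, the llm callable, and the result_df convention are all assumptions of mine.
import re

def run_english_instruction(llm, df, instruction):
    # steps 1-2: capture the input and compose a prompt from a template
    prompt = (
        f"Given a PySpark DataFrame `df` with columns {df.columns}, "
        f"write PySpark code that stores the answer in `result_df` for: {instruction}. "
        "Reply with one Python code block."
    )
    # step 3: process the prompt with some LLM; `llm` is any callable returning text
    answer = llm(prompt)
    # step 4: parse the output to extract the code block
    match = re.search(r"```(?:python)?\s*(.*?)```", answer, re.DOTALL)
    code = match.group(1) if match else answer
    # step 5: run the generated code with Python's exec, exposing `df` to it
    scope = {"df": df}
    exec(code, scope)
    # step 6: return the result that matters
    return scope.get("result_df")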
This high-level overview recalls the illustration given in the Databricks blog post (https://www.databricks.com/blog/introducing-english-new-programming-language-apache-spark):
The Downside
Before talking about the cons of the "English SDK", we have to point out that the library lives under the databrickslabs organization on GitHub, a strong indication that it is experimental, is not meant to be treated as a supported product, and, of course, is not ready for production environments.
What scares me the most about the approach embraced by PySpark-AI is that we have no control over the code that runs. Even though we can read the logs to understand the generated code, we get no chance to review that code before it executes.
Even before we had generative AIs as advanced as today's, exec and eval were functions to avoid due to the inherent security risks they carry.
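A contrived example of the risk: whatever string reaches exec runs with the full privileges of the host process, so a malicious or simply buggy generated snippet can do real damage.
# nothing in exec itself stops destructive code;
# the harmless echo below could just as easily be os.remove(...) or worse
untrusted = "import os; os.system('echo pretend this wiped your data')"
exec(untrusted)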
Another problem that comes as a consequence of this dynamic execution is side effects: we cannot trust that the generated code will always be the same for the same input. Relying on a third-party service for the output can also be problematic, because we may face instability, increased latency, and other undesirable situations.
Code generated by GPT (model gpt-3.5-turbo) failing with a runtime error.
Conclusion
PySpark-AI, or the "English SDK" as it is being introduced, brings an innovative design with a nice API for working with PySpark, covering a good variety of operations.
It is easy to get started with, it can be useful for beginners, and non-technical users may feel more comfortable trying it as well.
It is not so reliable, though. Even with future enhancements, I myself cannot see this kind of solution becoming safe and stable enough to be applied at scale or in a real-world production environment.