Why use Python for data analysis (when you have Excel or Google Sheets)
Marc Dupuis
Posted on November 14, 2024
TL;DR: While spreadsheets are perfect for many data tasks, Python becomes essential when you need to handle large datasets, create advanced visualizations, automate workflows, or use machine learning models. The key is knowing when to leverage each tool's strengths for your specific data analysis needs.
While Python is often considered essential for data work, spreadsheets remain the most practical tool for many analysts' daily needs – and that's perfectly fine. But knowing when to graduate beyond them is crucial for advancing your data capabilities.
If you look at any data analyst or data scientist curriculum, you'll find the same core tools: spreadsheets, SQL, Python, and various Business Intelligence (BI) solutions. Yet when I talk with data practitioners and leaders, a common question emerges: "Why switch to Python when spreadsheets handle most of my needs?"
As someone who co-founded a company built on SQL, Python, and AI, my stance might surprise you: if a spreadsheet can do the job, use it. These tools have endured since the 1970s for good reason – they're intuitive, flexible, and excellent for explaining your work to others.
But they have their limits.
When you start conducting more ad hoc analysis or exploratory data analysis or dealing with more data in the enterprise, you’ll quickly run into a few issues:
- They struggle with large datasets
- They offer limited visualization and dashboarding capabilities
- They make it difficult to build automated data pipelines
- They lack advanced statistical and machine learning capabilities
- They don't support version control, making it hard to follow engineering best practices Below, I’ll break down why spreadsheets remain invaluable for many tasks, and when Python becomes the necessary next step in your data journey.
Why use Excel or Google Sheets?
At their core, spreadsheets are powerful because they put you in complete control of your data workspace. Like having your own custom-built dashboard, they let you instantly manipulate, visualize, and analyze data exactly how you want.
There are two main reasons that folks gravitate toward spreadsheets:
1. Spreadsheets are flexible and personalized
Let’s start with the most obvious reasons why data practitioners, regardless of skill level, love spreadsheets: They’re incredibly flexible and customizable.
In a spreadsheet, you’re working in your own environments, and you have full control over it. You want to highlight specific rows and create a quick chart? Easy. You want to add some conditional formatting to highlight a specific pattern? No problem. You even want to add a row or column to add some inputs? Go right ahead.
As a user, you’re in full control, even in shared workspace environments like Google Sheets. This is really powerful, especially in contrast with traditional BI solutions where you can’t edit the data directly in line the same way, nor can you call out specific pieces of data without having to slice the data into smaller subsets, which can quickly get out of hand. As a matter of fact, some new BI solutions such as Sigma are capitalizing on this idea with a spreadsheet-like interface being their main pitch.
All in all, there’s something deeply intuitive about the user experience of a spreadsheet. We learn math from a young age, and spreadsheets offer a nicely structured way of looking at data and understanding how all the numbers add up.
2. Spreadsheets are reactive & explainable
Reactivity in spreadsheets means that when you change one number, everything connected to it updates automatically. This instant feedback makes them perfect for understanding how different pieces of data affect each other.
For example, let’s say you have cells that are connected like:
C1 = A1 + B2
Reactivity means that when you update A1 or B2, C1 is automatically updated. There’s effectively a DAG which tracks the dependencies, or lineage, between all cells. This is an incredibly powerful concept, because, unlike with code, you don’t have to “run” the spreadsheet. You can simply create a model of the world and adjust inputs and see how the results react to that change.
This reactivity is also in very large part what contributes to the ease of understanding of a spreadsheet. I can view an easy-understood formula, click on it to highlight the dependent cells, and I adjust the dependent cells to understand how the number I’m looking at reacts and relates to it.
As you can see in the image above, if you want to know what numbers contribute most to Net Income Before Tax, you can simply click on the cell, view the dependent cells, and immediately understand what variables Net Income Before Taxes.
For these reasons, if you’re able to do your work in a spreadsheet, it’s probably a good idea.
Why use Python
While spreadsheets excel at many tasks, Python opens up a whole new world of possibilities for data work. From handling massive datasets to creating complex visualizations and automating repetitive tasks, there are five reasons why Python is a powerful tool for your data workflows.
1. Python easily tackles large amounts of data
The first and most obvious reason to use Python is illustrated when dealing with large datasets. Excel can support approximately 1M rows by 17k columns and Google Sheets can support approximately 10M cells. This may sound like a lot, and in many cases this is plenty, but chances are, you’ll quickly run up against this limit. In contrast, Python on a powerful machine can support many orders of magnitude more data. This is especially true if you leverage new technologies like polars and DuckDB.
We may see an increase in limits with spreadsheets over time, but Python (especially in tandem with SQL) will always be able to handle more.
2. Python supports advanced & customized visualizations
Spreadsheets can offer some pretty powerful visuals, but it’s only a small fraction of what you can do with Python. I’m a big believer that bar charts, line charts, and maps cover the vast majority of cases, but telling a story with data often requires breaking from the mundane and creating an engaging canvas.
For example, I love a good Sankey diagram to tell the story of how data flows from point A to point B. Or perhaps you want to create a radar plot to compare attributes from different categories.
These can be incredibly easy to build in Python with libraries like plotly, seaborn or bokeh.
To give you an example, let’s go back to our Superdope example from previous posts and say you want to compare product performance on a sunburst plot like the one below:
Generating this chart with code using a library such as plotly is rather straightforward:
import plotly.express as px
# Create the sunburst plot
fig = px.sunburst(
df_sunburst,
path=['Region', 'Category', 'Product'],
values='Sales',
color='Region',
title='Sales Distribution by Region, Category, and Product',
width=800,
height=450
)
# Update layout
fig.update_layout(
margin=dict(t=50, l=0, r=0, b=0)
)
# Show the plot
fig.show()
And this code can be generated by AI in about 3 seconds. Building something similar in a spreadsheet would require a lot more time and effort.
3. Python helps you automate data pipelines & cleaning
When working with data, you’ll oftentimes find yourself doing repetitive data transformation tasks. Say, for example, you work in an industry where your clients regularly send you CSV or Excel files and you have to clean up and format the data, and turn it into a report or prepare it for another step. This is a perfect task for Python. If you’re managing your own server and are resourceful, you can write a script and schedule it to run using a Cron job, or if you would like to go with managed solutions that work out of the box and handle orchestration and more complex jobs, you can use a solution like Dagster or Airflow.
As a general rule, these days it’s usually best to avoid home-grown Cron jobs unless you know exactly what you’re doing. Ensuring that these remain up and running, have proper logging and monitoring and are orchestrated properly can quickly turn into a lot of work.
Note: If you’re simply looking for a lightweight and quick way to build data pipelines, Fabi.ai may be a good option for you. We can help you easily set up a data wrangling and cleaning pipeline from and to any source, including CSV files and Excel files, in a matter of minutes.
4. Python supports for complex data analysis & machine learning
You can do a lot in a spreadsheet, but building and using more advanced statistical and machine learning models is not generally one of them. If you’re simply doing a univariate data analysis and some simple calculations like distributions, averages etc. a spreadsheet should be able to get the job done. But if you want to venture into more advanced multivariate analysis, or perhaps even clustering, forecasting and churn prediction, Python is equipped with a rich suite of tools that work out of the box.
Here are a few examples of the types of analysis you may want to do along with the corresponding Python package:
- Buyer or customer grouping using clustering: sklean.cluster (ex. Kmeans)
- Sales or marketing pipeline time series forecasting: Prophet or statsmodels (ex. ARIMA)
- Customer churn prediction: scikit-survival These are all advanced machine learning and statistical models implemented by some of the best engineers and researchers in the world, available for free and immediately ready to use in Python.
5. Python helps you follow code versioning & engineering best practices
Finally, in a lot of cases, it’s good practice to ensure that your work is traceable and reproducible.
In practice, what this means is that when someone else (or perhaps yourself at a later date), looks at your analysis, this individual should be able to understand:
- Where the data came from
- How the data was manipulated and how you got to your results
- Be able to reproduce the same results independently So, if working in a spreadsheet means exporting data and manipulating it somewhere that is disconnected from the original source, it can make the results very hard to reproduce. This also means the steps you take during your analysis aren’t version controlled. As you conduct your analysis and make adjustments, the exact steps may not get recorded. This can set you up for a tough situation that we’ve all been in at least once: You’ve built a beautiful analysis in a spreadsheet, shared it with some co-workers, went back at a later date and noticed that the data was different. You may go through the change history to understand what happened, to no avail.
Using a version control system like Github or Gitlab and committing changes to the underlying code as you conduct your analysis can help you avoid this type of situation.
Verdict: For large data sets; advanced analysis and visualization; and automation, Python wins🏅
If you’re looking to do complex ad hoc or exploratory data analysis, use advanced machine learning techniques, or build complex visualizations, Python is one of the best and most powerful tools for the job.
Yes, spreadsheets are incredibly popular for good reason. If you’re dealing with relatively small datasets, in a one-off analysis that doesn’t need to be automated, Excel or Google Sheets are great tools.
However, Python performs exceptionally well when dealing with large datasets which would be an issue for Excel or Google Sheets. Python is also commonly used to automate data pipelines, especially if it requires some form of data transformation and cleaning.
Like most things, there’s a time and place to use certain tools to make the most of their strengths. We built Fabi.ai to act as the bridge between all the tools, so you can have the best of both worlds.
We make it incredibly easy to connect to any data source, including spreadsheets and files and build lightweight data pipelines. Our built-in SQL and Python interface, augmented with AI, makes it incredibly easy to leverage advanced machine learning and statistical models, regardless of prior experience. If you’re interested in checking us out, you can get started for free today in less than 2 minutes.
Posted on November 14, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 14, 2024