wrighter
Posted on February 4, 2021
If you’ve spent any time in pandas at all, you’ve seen SettingWithCopyWarning. If not, you will soon!
Just like any warning, it’s wise to not ignore it since you get it for a reason: it’s a sign that you’re probably doing something wrong. In my case, I usually get this warning when I’m knee deep in some analysis and don’t want to spend too much time figuring out how to fix it.
I’m going to cover a few typical examples of when this warning shows up, why it shows up, and how to quickly fix the underlying issue.
First, let’s make an example DataFrame. I’m using a handy Python package called Faker to create some test data. You may need to install it first, with pip.
%pip install Faker # notebook
pip install Faker # command line
As a quick aside, Faker is a great way to build test data for unit tests, test databases, or examples. It generates real-looking data that is not personally identifiable, since it’s all fake, but it’s based on rules that generate data combinations you’ll likely encounter in real life.
>>> import datetime
>>> import pandas as pd
>>> import numpy as np
>>> from faker import Faker
>>> fake = Faker()
>>> df = pd.DataFrame([
...     [fake.first_name(),
...      fake.last_name(),
...      fake.date_of_birth(),
...      fake.date_this_year(),
...      fake.city(),
...      fake.state_abbr(),
...      fake.postalcode()]
...     for _ in range(20)],
...     columns=['first_name', 'last_name', 'dob', 'lastupdate', 'city', 'state', 'zip'])
>>> df.head(3)
first_name last_name dob lastupdate city state zip
0 Evan Daniels 1943-05-27 2021-01-11 North Erin AZ 27597
1 Christine Herrera 2019-04-11 2021-01-29 Ellenview AL 28989
2 Michelle Warren 2015-05-29 2021-01-11 Mcknighttown VA 55551
How do we set data again?
First, let’s just review the ways we can set data in a DataFrame, using the loc or iloc indexers. These are for label-based or integer-offset-based indexing, respectively. (See this article for more detail on the two methods.)
The first argument in the indexer is for the row, the second is for the column (or columns), and if we assign to this expression, we will update the underlying DataFrame.
Note that the index here is just a RangeIndex, so the labels are numbers. Because of that, even though I’m passing int values to loc, this is looking up by label, not by position.
>>> df.head(1)['zip']
0 27597
Name: zip, dtype: object
>>> df.loc[0, 'zip'] = '60601'
>>> df.head(1)['zip']
0 60601
Name: zip, dtype: object
>>> df.loc[0, ['city', 'state']] = ['Chicago', 'IL']
>>> df.head(1)
first_name last_name dob lastupdate city state zip
0 Evan Daniels 1943-05-27 2021-01-11 Chicago IL 60601
>>> # Here's an example of an iloc update.
>>> df.iloc[0, 0] = 'Josh'
>>> df.head(1)
first_name last_name dob lastupdate city state zip
0 Josh Daniels 1943-05-27 2021-01-11 Chicago IL 60601
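With a RangeIndex, labels and positions coincide, so loc and iloc can look interchangeable. Here is a minimal sketch with a non-default index that makes the difference visible. The frame and its labels are made up for illustration and are not part of the Faker data above.

```python
import pandas as pd

# Hypothetical frame whose row labels (10, 20, 30) differ from positions (0, 1, 2).
demo = pd.DataFrame({"zip": ["27597", "28989", "55551"]}, index=[10, 20, 30])

by_label = demo.loc[20, "zip"]    # label lookup: the row labelled 20
by_position = demo.iloc[0, 0]     # positional lookup: first row, first column

demo.loc[10, "zip"] = "60601"     # set by label
demo.iloc[2, 0] = "00313"         # set by position (third row, first column)

print(by_label, by_position)
print(demo["zip"].tolist())
```

Here loc[20] finds the second row while iloc[0] finds the first, which is easy to miss when the labels happen to be 0, 1, 2.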
Now, you can also do updates with the array indexing operator, but this can look very confusing because on a DataFrame, you are selecting columns first. I’d recommend not doing this for this reason alone, but as you’ll soon see, there are other issues that can arise.
>>> df["first_name"][0] = 'Joshy'
>>> df.head(1)
first_name last_name dob lastupdate city state zip
0 Joshy Daniels 1943-05-27 2021-01-11 Chicago IL 60601
When do we see this warning?
OK, now that we have updated our DataFrame successfully, it’s time to see an example of where things can go wrong. For me, it’s very typical to select a subset of the original data to work with. For example, let’s say that we decide to only work with data where the person was born before 2000.
>>> dob_limit = datetime.date(2000, 1, 1)
>>> sub = df[df['dob'] < dob_limit]
>>> sub.shape
(16, 7)
>>> idx = sub.head(1).index[0] # save the location for update attempts below
>>> sub.head(1)
first_name last_name dob lastupdate city state zip
0 Joshy Daniels 1943-05-27 2021-01-11 Chicago IL 60601
Let’s try to update the lastupdate column.
>>> sub.loc[idx, 'lastupdate'] = datetime.date.today()
/Users/mcw/.pyenv/versions/3.8.6/envs/pandas/lib/python3.8/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
iloc._setitem_with_indexer(indexer, value)
<ipython-input-14-5f1769c87aaf>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
sub.loc[idx, 'lastupdate'] = datetime.date.today()
Boom! There it is: we are told we are trying to set values on a copy of a slice from a DataFrame. What ended up happening here? Well, sub was updated, but df wasn’t, even though we had the warning.
>>> sub.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
>>> df.loc[idx, 'lastupdate']
datetime.date(2021, 1, 11)
Pandas is warning you that you might not have done what you expected. When you created sub, you ended up with a copy of the data in df. When you updated the value, you were warned that you only updated the copy, not the original.
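The same behavior can be reproduced without Faker. This is a small sketch on a made-up three-row frame (the names df_demo and sub_demo are illustrative only). It silences warnings so the snippet runs quietly, then checks which object actually changed.

```python
import warnings

import pandas as pd

# Hypothetical data standing in for the Faker-generated frame.
df_demo = pd.DataFrame({"age": [70, 5, 30], "status": ["old", "old", "old"]})
sub_demo = df_demo[df_demo["age"] > 18]   # boolean filtering produces new data

with warnings.catch_warnings():
    warnings.simplefilter("ignore")       # hide any warning for this demo only
    sub_demo.loc[0, "status"] = "new"

print(sub_demo.loc[0, "status"])   # the subset was updated
print(df_demo.loc[0, "status"])    # the original was not
```

Silencing the warning is only done here to keep the output clean. In real code, the warning is the useful part.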
So how should you fix it?
There are two primary ways to address this, and which one you choose depends on what you are trying to accomplish in your code. The warning is telling you that you chose a path that could cause confusion or error down the road, and is pointing you toward using the best practices for updating data.
Update the original
If your intention is to update your original data, you just need to update it directly. So instead of doing your update on sub, do it on df instead.
>>> df.loc[idx, 'lastupdate'] = datetime.date.today()
>>> df.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
Now note that when you do this, sub isn’t updated, since it is a copy. If you want both sub and df to match, you need to either update both or recreate sub after the update. Because of this, it’s important to pause and think any time you update a DataFrame. Have you created subsets of this data that now need to be refreshed?
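A short sketch of that stale-subset situation, again on made-up data with illustrative names (df_demo, adults): the original is updated directly, the earlier subset keeps its old value, and recreating the subset picks up the change.

```python
import pandas as pd

# Hypothetical data; "adults" plays the role of sub in the text above.
df_demo = pd.DataFrame({"age": [70, 5, 30], "status": ["old", "old", "old"]})
adults = df_demo[df_demo["age"] > 18]

df_demo.loc[0, "status"] = "new"         # update the original directly

stale = adults.loc[0, "status"]          # the earlier subset still has the old value

adults = df_demo[df_demo["age"] > 18]    # recreate the subset after the update
fresh = adults.loc[0, "status"]

print(stale, fresh)
```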
Update the copy
If your goal is to update only the copy, eliminate the warning by telling pandas explicitly that you want a copy, using copy().
>>> sub2 = df[df['dob'] < dob_limit].copy()
>>> sub2.loc[idx, 'lastupdate'] = datetime.date.today()
>>> sub2.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
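To check that an explicit copy really is quiet, one option is to record warnings around the update and look for any whose category name mentions SettingWithCopy. This is a sketch on made-up data. Matching on the category name is just a convenience chosen here so the snippet does not depend on where the warning class lives in a given pandas version.

```python
import warnings

import pandas as pd

# Hypothetical data standing in for the Faker-generated frame.
df_demo = pd.DataFrame({"age": [70, 5, 30], "status": ["old", "old", "old"]})
sub_copy = df_demo[df_demo["age"] > 18].copy()   # explicit, independent copy

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")              # record every warning raised
    sub_copy.loc[0, "status"] = "new"

# Keep only warnings whose class name mentions SettingWithCopy.
copy_warnings = [w for w in caught if "SettingWithCopy" in w.category.__name__]

print(len(copy_warnings))
print(sub_copy.loc[0, "status"], df_demo.loc[0, "status"])
```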
In between
One common situation is that an initial full-sized DataFrame is narrowed down to a much smaller one by filtering the data. Maybe new columns are added as part of some calculations, and then as a final result, the original DataFrame should be updated. One way to do that is to use the index to help you out.
>>> sub3 = df[df['dob'] < dob_limit].copy() # we'll be updating this DataFrame
>>> sub3['manualupdate'] = datetime.date.today() - datetime.timedelta(days=10) # you can modify this DataFrame
>>> sub3 = sub3.head(3) # or even make it smaller
>>> sub3['manualupdate']
0 2021-01-25
1 2021-01-25
3 2021-01-25
Name: manualupdate, dtype: object
Now, we’ll use the fact that sub3 shares an index with the original df to update the data. We can update all matching rows of the lastupdate column, for example.
>>> df.loc[sub3.index, 'lastupdate'] = sub3['manualupdate']
>>> df.loc[sub3.index]
first_name last_name dob lastupdate city state zip
0 Joshy Daniels 1943-05-27 2021-01-25 Chicago IL 60601
3 Vernon Hernandez 1989-04-10 2021-01-25 South Mark NE 05048
4 Mary Munoz 1933-03-16 2021-01-25 Ewingborough OK 31127
Now, you can see that those rows were updated from our smaller subset of data.
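The write-back works because assigning a Series through loc aligns on index labels. Here is a compact sketch of the same pattern on made-up numeric data (the column names score and new_score are illustrative), which makes it easy to see that only the rows present in the subset change.

```python
import pandas as pd

# Hypothetical data; "score" is the column we eventually write back to.
df_demo = pd.DataFrame({"age": [70, 5, 30, 45], "score": [1, 2, 3, 4]})

work = df_demo[df_demo["age"] > 18].copy()   # rows labelled 0, 2, 3
work["new_score"] = work["score"] * 10       # compute on the copy
work = work.head(2)                          # narrow further: labels 0 and 2

# Write back only the rows whose labels survive in the subset.
df_demo.loc[work.index, "score"] = work["new_score"]

print(df_demo["score"].tolist())
```

This relies on the subset keeping the original index labels. If you call reset_index on the subset, the labels no longer line up with df_demo and the write-back would hit the wrong rows.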
Subsets of columns
You may also encounter this warning when working with subsets of columns in a DataFrame.
>>> df_d = df[['zip']]
>>> df_d.loc[idx, 'zip'] = "00313" # SettingWithCopyWarning
A great way to suppress the warning here is to do a full slice with loc in your initial selection. You can also use copy.
>>> df_d = df.loc[:, ['zip']]
>>> df_d.loc[idx, 'zip'] = "00313"
For completeness, some more details
Now you can read about this warning in many other places, and if you’ve come here through a search engine, maybe you’ve already found them either confusing or not directly applicable to your situation. I took a slightly different approach above to show the situation where I usually see this warning. However, a more common reason new pandas users encounter it is when trying to update their DataFrame using the array index operator ([]).
>>> df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()
file.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()
The fix here is pretty straightforward: use loc. Let’s give that a try.
>>> df.loc[df['dob'] < dob_limit, 'lastupdate'] = datetime.date.today() - datetime.timedelta(days=1)
>>> df.loc[df['dob'] < dob_limit].head(1)
first_name last_name dob lastupdate city state zip
0 Joshy Daniels 1943-05-27 2021-02-03 Chicago IL 60601
That works. The warning here was telling us that our first update is (potentially) operating on a copy of our original data. I don’t think this is quite as obvious as our opening case because pandas has some complicated reasons for choosing to sometimes return a copy and sometimes return a view into the original data, and this may not seem obvious when the update is on one line. When it can detect that this is happening, it raises this warning.
This is called chained assignment. The assignment above with the warning is really doing this:
df.__getitem__(df.__getitem__('dob') < dob_limit).__setitem__('lastupdate', datetime.date.today())
When you use the array index operator, the __getitem__ and __setitem__ methods are invoked for getting and setting, respectively. The outer call to __getitem__ (the one that takes the boolean mask) returns a copy of the data, and __setitem__ then sets the value on that copy, which triggers the warning.
If we use loc, though, it will be doing this instead: a single __setitem__ call on the original DataFrame, with no temporary object in between.
df.loc.__setitem__((df.__getitem__('dob') < dob_limit, 'lastupdate'), datetime.date.today())
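The practical difference between the two call chains can be checked on a small, made-up frame (df_demo and mask are illustrative names). The chained form sets the value on a temporary object and leaves the original alone, while the loc form updates the original. Warnings are silenced around the chained line only to keep the demo output clean.

```python
import warnings

import pandas as pd

# Hypothetical data standing in for the Faker-generated frame.
df_demo = pd.DataFrame({"age": [70, 5, 30], "status": ["old", "old", "old"]})
mask = df_demo["age"] > 18

with warnings.catch_warnings():
    warnings.simplefilter("ignore")       # hide the chained-assignment warning
    # __getitem__ builds a temporary frame, then __setitem__ modifies that temporary.
    df_demo[mask]["status"] = "chained"
after_chained = df_demo["status"].tolist()

# A single __setitem__ on df_demo.loc, so the original is modified.
df_demo.loc[mask, "status"] = "loc"
after_loc = df_demo["status"].tolist()

print(after_chained)
print(after_loc)
```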
So whenever you see this warning, just look at your code and check two things. Did you try to update the data using []? If so, switch to loc (or iloc). If you’re doing that and it’s still complaining, it’s because your DataFrame was created from another DataFrame. Either make a full copy if you plan to update it, or update your original DataFrame instead.
The post Views, Copies, and that annoying SettingWithCopyWarning appeared first on wrighters.io.