Views, Copies, and that annoying SettingWithCopyWarning

wrighter

Posted on February 4, 2021

If you’ve spent any time in pandas at all, you’ve seen SettingWithCopyWarning. If not, you will soon!

Just like any warning, it’s wise not to ignore it, since you get it for a reason: it’s a sign that you’re probably doing something wrong. In my case, I usually get this warning when I’m knee-deep in some analysis and don’t want to spend too much time figuring out how to fix it.

I’m going to cover a few typical examples of when this warning shows up, why it shows up, and how to quickly fix the underlying issue.

First, let’s make an example DataFrame. I’m using a handy Python package called Faker to create some test data. You may need to install it first, with pip.

%pip install Faker # notebook
pip install Faker # command line

As a quick aside, Faker is a great way to build test data for unit tests, test databases, or examples. It generates real-looking data that is not personally identifiable, since it’s all fake, but it’s based on rules that generate data combinations you’ll likely encounter in real life.

>>> import datetime
>>> import pandas as pd
>>> import numpy as np
>>> from faker import Faker
>>> fake = Faker()
>>> df = pd.DataFrame(
...     [[fake.first_name(),
...       fake.last_name(),
...       fake.date_of_birth(),
...       fake.date_this_year(),
...       fake.city(),
...       fake.state_abbr(),
...       fake.postalcode()]
...      for _ in range(20)],
...     columns=['first_name', 'last_name', 'dob', 'lastupdate', 'city', 'state', 'zip'])

>>> df.head(3)
  first_name last_name dob        lastupdate city         state zip
0 Evan       Daniels   1943-05-27 2021-01-11 North Erin   AZ    27597
1 Christine  Herrera   2019-04-11 2021-01-29 Ellenview    AL    28989
2 Michelle   Warren    2015-05-29 2021-01-11 Mcknighttown VA    55551
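If you’d rather not install Faker, a small hand-built DataFrame with the same column names is enough to follow along. This is only a sketch: the three rows simply copy the sample output shown above, and your own Faker run will produce different values.

```python
import datetime

import pandas as pd

# Stand-in for the Faker-generated data: same columns, three fixed rows
# copied from the sample output (not what Faker will generate for you).
rows = [
    ["Evan", "Daniels", datetime.date(1943, 5, 27),
     datetime.date(2021, 1, 11), "North Erin", "AZ", "27597"],
    ["Christine", "Herrera", datetime.date(2019, 4, 11),
     datetime.date(2021, 1, 29), "Ellenview", "AL", "28989"],
    ["Michelle", "Warren", datetime.date(2015, 5, 29),
     datetime.date(2021, 1, 11), "Mcknighttown", "VA", "55551"],
]
columns = ["first_name", "last_name", "dob", "lastupdate", "city", "state", "zip"]

df = pd.DataFrame(rows, columns=columns)
print(df.head(3))
```

Zip codes are kept as strings so that leading zeros survive.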

How do we set data again?

First, let’s review the ways we can set data in a DataFrame using the loc or iloc indexers. These are for label-based and integer-offset-based indexing, respectively. (See this article for more detail on the two methods.)

The first argument in the indexer is for the row, the second is for the column (or columns), and if we assign to this expression, we will update the underlying DataFrame.

Note that the index here is just a RangeIndex, so the labels are numbers. Because of that, even though I’m passing in int values to loc, this is looking up by label, not relative index.

>>> df.head(1)['zip']
0 27597
Name: zip, dtype: object
>>> df.loc[0, 'zip'] = '60601'
>>> df.head(1)['zip']
0 60601
Name: zip, dtype: object
>>> df.loc[0, ['city', 'state']] = ['Chicago', 'IL']
>>> df.head(1)
  first_name last_name dob        lastupdate city    state zip
0 Evan       Daniels   1943-05-27 2021-01-11 Chicago IL    60601
>>> # Here's an example of an iloc update.
>>> df.iloc[0, 0] = 'Josh'
>>> df.head(1)
  first_name last_name dob        lastupdate city    state zip
0 Josh       Daniels   1943-05-27 2021-01-11 Chicago IL    60601
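The difference between label and position is easier to see when the index labels don’t happen to match the row offsets. Here is a minimal sketch using a made-up three-row frame (the name people and the labels 10, 20, 30 are just for illustration):

```python
import pandas as pd

# Hypothetical data: integer labels that do NOT match the row positions.
people = pd.DataFrame(
    {"first_name": ["Evan", "Christine", "Michelle"]},
    index=[10, 20, 30],
)

# loc looks up by label: the row whose index label is 20.
by_label = people.loc[20, "first_name"]

# iloc looks up by position: second row (offset 1), first column (offset 0).
by_position = people.iloc[1, 0]

# There is no row labeled 1 in this index, so loc raises KeyError.
try:
    people.loc[1, "first_name"]
    label_one_found = True
except KeyError:
    label_one_found = False

print(by_label, by_position, label_one_found)  # Christine Christine False
```

With the default RangeIndex the two happen to agree, which is why the distinction is easy to miss.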

You can also do updates with the array indexing operator ([]), but this can look confusing: on a DataFrame, [] selects columns first, so the column comes before the row. I’d recommend avoiding it for this reason alone, but as you’ll soon see, there are other issues that can arise.

>>> df["first_name"][0] = 'Joshy'
>>> df.head(1)
  first_name last_name dob        lastupdate city    state zip
0 Joshy      Daniels   1943-05-27 2021-01-11 Chicago IL    60601

When do we see this warning?

OK, now that we have updated our DataFrame successfully, it’s time to see an example of where things can go wrong. For me, it’s very typical to select a subset of the original data to work with. For example, let’s say that we decide to only work with data where the person was born before 2000.

>>> dob_limit = datetime.date(2000, 1, 1)
>>> sub = df[df['dob'] < dob_limit]
>>> sub.shape
(16, 7)
>>> idx = sub.head(1).index[0] # save the location for update attempts below
>>> sub.head(1)
  first_name last_name dob        lastupdate city    state zip
0 Joshy      Daniels   1943-05-27 2021-01-11 Chicago IL    60601

Let’s try to update the lastupdate column.

>>> sub.loc[idx, 'lastupdate'] = datetime.date.today()
/Users/mcw/.pyenv/versions/3.8.6/envs/pandas/lib/python3.8/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
<ipython-input-14-5f1769c87aaf>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub.loc[idx, 'lastupdate'] = datetime.date.today()

Boom! There it is: we are told we are trying to set values on a copy of a slice from a DataFrame. What ended up happening here? Well, sub was updated despite the warning, but df wasn’t.

>>> sub.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
>>> df.loc[idx, 'lastupdate']
datetime.date(2021, 1, 11)

Pandas is warning you that you might not have done what you expected. When you created sub, you ended up with a copy of the data in df. When you updated the value, pandas warned you that you only updated the copy, not the original.
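Here is a self-contained sketch of the same situation on a tiny, made-up frame, so you can check the behavior without Faker. It records whatever warnings pandas emits rather than assuming a particular one, since the exact warning (if any) depends on your pandas version.

```python
import datetime
import warnings

import pandas as pd

# Hypothetical three-row stand-in for df.
df = pd.DataFrame({
    "first_name": ["Evan", "Christine", "Vernon"],
    "dob": [datetime.date(1943, 5, 27),
            datetime.date(2019, 4, 11),
            datetime.date(1989, 4, 10)],
    "zip": ["27597", "28989", "05048"],
})

# A boolean-mask selection hands back a new object holding rows 0 and 2.
sub = df[df["dob"] < datetime.date(2000, 1, 1)]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sub.loc[0, "zip"] = "60601"  # update the subset only

# Which warnings show up here depends on the installed pandas version.
print([w.category.__name__ for w in caught])

# The subset changed; the original did not.
print(sub.loc[0, "zip"], df.loc[0, "zip"])  # 60601 27597
```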

So how should you fix it?

There are two primary ways to address this, and which one you choose depends on what you are trying to accomplish in your code. The warning is telling you that you chose a path that could cause confusion or error down the road, and is pointing you toward using the best practices for updating data.

Update the original

If your intention is to update your original data, you just need to update it directly. So instead of doing your update on sub, do it on df instead.

>>> df.loc[idx, 'lastupdate'] = datetime.date.today()
>>> df.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)

Note that when you do this, sub isn’t updated, since it is a copy. If you want both sub and df to match, you need to either update both or recreate sub after the update. Because of this, it’s important to pause and think any time you update a DataFrame. Have you created subsets of this data that now need to be refreshed?

Update the copy

If your goal is to update only the copy, you can eliminate the warning by making the copy explicit with copy(). That tells pandas the subset is meant to be an independent DataFrame.

>>> sub2 = df[df['dob'] < dob_limit].copy()
>>> sub2.loc[idx, 'lastupdate'] = datetime.date.today()
>>> sub2.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
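To convince yourself that the explicit copy really is quiet, you can record warnings around the update. This is a rough sketch on made-up data; the birth_year column is a simplification of dob, and filtering warnings by name is just a convenient way to avoid depending on where your pandas version defines the warning class.

```python
import warnings

import pandas as pd

# Hypothetical data; birth_year stands in for the dob column.
df = pd.DataFrame({
    "birth_year": [1943, 2019, 1989],
    "zip": ["27597", "28989", "05048"],
})

# Explicit copy: sub2 is an independent DataFrame from the start.
sub2 = df[df["birth_year"] < 2000].copy()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sub2.loc[0, "zip"] = "60601"

# Keep only warnings whose class name mentions SettingWithCopy.
copy_warnings = [w for w in caught if "SettingWithCopy" in w.category.__name__]
print(len(copy_warnings))  # 0

# Only the copy was changed.
print(sub2.loc[0, "zip"], df.loc[0, "zip"])  # 60601 27597
```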

In between

A common situation is that a full-sized DataFrame is narrowed down to a much smaller one by filtering the data. Maybe new columns are added as part of some calculations, and then, as a final step, the original DataFrame should be updated with the results. One way to do that is to use the index to help you out.

>>> sub3 = df[df['dob'] < dob_limit].copy() # we'll be updating this DataFrame
>>> sub3['manualupdate'] = datetime.date.today() - datetime.timedelta(days=10) # you can modify this DataFrame
>>> sub3 = sub3.head(3) # or even make it smaller
>>> sub3['manualupdate']
0 2021-01-25
3 2021-01-25
4 2021-01-25
Name: manualupdate, dtype: object

Now, we’ll use the fact that sub3 shares its index labels with the original df to update the data. For example, we can update the lastupdate column for all matching rows.

>>> df.loc[sub3.index, 'lastupdate'] = sub3['manualupdate']
>>> df.loc[sub3.index]
  first_name last_name dob        lastupdate city         state zip
0 Joshy      Daniels   1943-05-27 2021-01-25 Chicago      IL    60601
3 Vernon     Hernandez 1989-04-10 2021-01-25 South Mark   NE    05048
4 Mary       Munoz     1933-03-16 2021-01-25 Ewingborough OK    31127

Now, you can see that those rows were updated from our smaller subset of data.
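The same write-back pattern, boiled down to a runnable sketch on made-up data. The column names status and new_status are invented for the example; the point is that the Series on the right-hand side is matched to df by index label, so only the rows present in the subset are touched.

```python
import pandas as pd

# Hypothetical data; birth_year stands in for the dob column.
df = pd.DataFrame({
    "first_name": ["Evan", "Christine", "Michelle", "Vernon"],
    "birth_year": [1943, 2019, 2015, 1989],
    "status": ["old", "old", "old", "old"],
})

# Work on an explicit copy of the filtered rows (index labels 0 and 3).
sub3 = df[df["birth_year"] < 2000].copy()
sub3["new_status"] = "reviewed"

# Write the result back: rows are matched by index label, not by position.
df.loc[sub3.index, "status"] = sub3["new_status"]

print(df["status"].tolist())  # ['reviewed', 'old', 'old', 'reviewed']
```

This only works if the index labels are unique and you haven’t reset the index on the subset along the way.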

Subsets of columns

You also may encounter this warning when working with subsets of columns in a DataFrame.

>>> df_d = df[['zip']]
>>> df_d.loc[idx, 'zip'] = "00313" # SettingWithCopyWarning

A great way to suppress the warning here is to do a full slice with loc in your initial selection. You can also use copy.

>>> df_d = df.loc[:, ['zip']]
>>> df_d.loc[idx, 'zip'] = "00313"
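Either way, keep in mind that the column subset is its own DataFrame, so the update does not flow back to df. A small sketch on made-up data, showing both the loc selection and the copy() variant:

```python
import pandas as pd

# Hypothetical two-row frame.
df = pd.DataFrame({
    "city": ["North Erin", "Ellenview"],
    "zip": ["27597", "28989"],
})

# Variant 1: full row slice with loc, selecting a list of columns.
df_loc = df.loc[:, ["zip"]]
df_loc.loc[0, "zip"] = "00313"

# Variant 2: explicit copy of the column subset.
df_copy = df[["zip"]].copy()
df_copy.loc[0, "zip"] = "00313"

# Both subsets changed; the original is untouched.
print(df_loc.loc[0, "zip"], df_copy.loc[0, "zip"], df.loc[0, "zip"])
# 00313 00313 27597
```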

For completeness, some more details

You can read about this warning in many other places, and if you’ve come here through a search engine, maybe you’ve already found them either confusing or not directly applicable to your situation. I took a slightly different approach above to show the situation where I usually see this warning. However, a more common reason new pandas users encounter it is trying to update their DataFrame using the array index operator ([]).

>>> df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()
file.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()

The fix here is pretty straightforward: use loc. Let’s give that a try.

>>> df.loc[df['dob'] < dob_limit, 'lastupdate'] = datetime.date.today() - datetime.timedelta(days=1)
>>> df.loc[df['dob'] < dob_limit].head(1)
  first_name last_name dob        lastupdate city    state zip
0 Joshy      Daniels   1943-05-27 2021-02-03 Chicago IL    60601

That works. The warning here was telling us that our first update is (potentially) operating on a copy of our original data. I don’t think this is quite as obvious as our opening case because pandas has some complicated reasons for choosing to sometimes return a copy and sometimes return a view into the original data, and this may not seem obvious when the update is on one line. When it can detect that this is happening, it raises this warning.

This is called chained assignment. The assignment above with the warning is really doing this:

df.__getitem__(df.__getitem__('dob') < dob_limit).__setitem__('lastupdate', datetime.date.today())

When you use the array index operator, the __getitem__ and __setitem__ methods are invoked for getting and setting, respectively. The outer __getitem__ call, the one that takes the boolean mask, returns a copy of the data. __setitem__ then sets the value on that temporary copy, which triggers the warning.

If we use loc, though, it does this instead, setting the values on df directly without creating a temporary object in between.

df.loc.__setitem__((df.__getitem__('dob') < dob_limit, 'lastupdate'), datetime.date.today())
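You can check that expansion yourself. The sketch below, on made-up data with invented column names, runs the loc update twice: once with the usual indexer syntax and once by calling __setitem__ explicitly with a (row mask, column) tuple. Both go through a single call on df.loc, so both change df itself.

```python
import pandas as pd

# Hypothetical data; birth_year stands in for the dob column.
df = pd.DataFrame({
    "birth_year": [1943, 2019, 1989],
    "status": ["old", "old", "old"],
})
mask = df["birth_year"] < 2000

# The usual indexer syntax...
df.loc[mask, "status"] = "via indexer"
after_indexer = df["status"].tolist()

# ...and the explicit method call it corresponds to.
df.loc.__setitem__((mask, "status"), "via dunder")
after_dunder = df["status"].tolist()

print(after_indexer)  # ['via indexer', 'old', 'via indexer']
print(after_dunder)   # ['via dunder', 'old', 'via dunder']
```

You would never write the explicit call in real code; it is only here to make the single-step nature of the loc assignment visible.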

So whenever you see this warning, just look at your code and check two things. Did you try to update the data using []? If so, switch to loc (or iloc). If you’re already using loc and it’s still complaining, it’s because your DataFrame was created from another DataFrame. Either make a full copy if you plan to update it, or update your original DataFrame instead.

The post Views, Copies, and that annoying SettingWithCopyWarning appeared first on wrighters.io.
