defaultdict
Paul Apivat
Posted on October 29, 2020
defaultdict is a subclass of dictionaries (dict, see previous post), so it inherits most of its behavior from dict with additional features. To understand how those features make it different, and more convenient in some cases, we'll need to run into some errors.
If we try to count words in a document, the general approach is to create a dictionary where the dictionary keys are words and the dictionary values are counts of those words.
Let's try to do this with a regular dictionary.
First, to set up, we'll take a paragraph and split() it into individual words. I took this paragraph from another project I'm working on and artificially added some extra words to ensure that certain words appeared more than once (it'll be apparent why soon).
# paragraph
lines = ["This table highlights 538's new NBA statistic, RAPTOR, in addition to the more established Wins Above Replacement (WAR). An extra column, Playoff (P/O) War, is provided to highlight stars performers in the post-season, when the stakes are higher. The table is limited to the top-100 players who have played at least 1,000 minutes minutes the table Wins NBA NBA RAPTOR more players"]
# split paragraph into individual words
lines = " ".join(lines).split()
type(lines) # list
Now that we have our lines list, we'll create an empty dict called word_counts and have each word be the key and each value be the count of that word.
# empty dict
word_counts = {}
# loop through lines to count each word
for word in lines:
    word_counts[word] += 1
# KeyError: 'This'
We received a KeyError for the very first word in lines (i.e. 'This') because we tried to increment the count for a key that didn't exist yet. We've learned to handle exceptions, so we can use try and except.
Here, we're looping through lines, and when we try to increment a key that doesn't exist, as we did previously, we anticipate the KeyError and set the initial count to 1. The loop then continues, and the next time that word appears its key exists, so the count can be incremented.
# empty dict
word_counts = {}
# exception handling
for word in lines:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1
# call word_counts
# abbreviated for space
word_counts
{'This': 1,
'table': 3,
'highlights': 1,
"538's": 1,
'new': 1,
'NBA': 3,
'statistic,': 1,
'RAPTOR,': 1,
'in': 2,
'addition': 1,
'to': 3,
'the': 5,
'more': 2,
...
'top-100': 1,
'players': 2,
'who': 1,
'have': 1,
'played': 1,
'at': 1,
'least': 1,
'1,000': 1,
'minutes': 2,
'RAPTOR': 1}
Now, there are other ways to achieve the above:
# use conditional flow
word_counts = {}
for word in lines:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
# use get (start from an empty dict again so the counts aren't doubled)
word_counts = {}
for word in lines:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1
Here's where the author makes the case for defaultdict, arguing that the two aforementioned approaches are unwieldy. We'll come back full circle to try our first approach, using defaultdict instead of the traditional dict.
defaultdict is a subclass of dict and must be imported from collections:
from collections import defaultdict
word_counts = defaultdict(int)
for word in lines:
    word_counts[word] += 1
# we no longer get a KeyError
# abbreviated for space
defaultdict(int,
{'This': 1,
'table': 3,
'highlights': 1,
"538's": 1,
'new': 1,
'NBA': 3,
'statistic,': 1,
'RAPTOR,': 1,
'in': 2,
'addition': 1,
'to': 3,
'the': 5,
'more': 2,
...
'top-100': 1,
'players': 2,
'who': 1,
'have': 1,
'played': 1,
'at': 1,
'least': 1,
'1,000': 1,
'minutes': 2,
'RAPTOR': 1})
Unlike a regular dictionary, when defaultdict tries to look up a key it doesn't contain, it automatically adds a value for it by calling the argument we provided when we first created the defaultdict. Above, we passed int as the argument; calling int() with no arguments returns 0, so every new word starts at a count of 0 before being incremented.
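To make the mechanism concrete, here is a minimal sketch (the key names are made up for illustration). It shows that int() returns 0, that merely reading a missing key inserts it, and that .get() and the in operator do not.

```python
from collections import defaultdict

# int() called with no arguments returns 0; this is the default defaultdict(int) uses
zero = int()

counts = defaultdict(int)

# reading a missing key calls int() and stores the result under that key
first_lookup = counts["raptor"]   # hypothetical key, for illustration only
keys_after_lookup = list(counts)

# .get() and `in` do not trigger the default, so nothing is added for "war"
via_get = counts.get("war")
has_war = "war" in counts

print(zero, first_lookup)    # 0 0
print(keys_after_lookup)     # ['raptor']
print(via_get, has_war)      # None False
```

The side effect on lookup is worth keeping in mind: checking a key with square brackets will quietly grow the dictionary, so use in or .get() when you only want to test for membership.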
If you want the values of your defaultdict to be lists, you can pass list as the argument. Then, when you append a value to a missing key, an empty list is created for that key first and the value is appended to it.
dd_list = defaultdict(list) # defaultdict(list, {})
dd_list[2].append(1) # defaultdict(list, {2: [1]})
dd_list[4].append('string') # defaultdict(list, {2: [1], 4: ['string']})
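You can also pass dict into defaultdict, so that any missing key starts out as an empty dict. Note that assigning directly to a key stores that value as-is; the default only applies when a missing key is looked up: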
dd_dict = defaultdict(dict) # defaultdict(dict, {})
# match key-with-value
dd_dict['first_name'] = 'lebron' # defaultdict(dict, {'first_name': 'lebron'})
dd_dict['last_name'] = 'james'
# match key with dictionary containing another key-value pair
dd_dict['team']['city'] = 'Los Angeles'
# defaultdict(dict,
# {'first_name': 'lebron',
# 'last_name': 'james',
# 'team': {'city': 'Los Angeles'}})
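A small sketch of that distinction, using a hypothetical player record (the 'state' key is added purely for illustration): direct assignment bypasses the factory, while a lookup on a missing key creates an empty dict that can then be filled in.

```python
from collections import defaultdict

player = defaultdict(dict)

# direct assignment stores the string as-is; the dict factory is not involved
player["first_name"] = "lebron"

# looking up the missing key 'team' creates an empty dict, which we then fill in
player["team"]["city"] = "Los Angeles"
player["team"]["state"] = "California"   # illustrative extra key

print(type(player["first_name"]).__name__)   # str
print(player["team"])                         # {'city': 'Los Angeles', 'state': 'California'}
```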
Application: Grouping with defaultdict
The following example is from Real Python, a fantastic resource for all things Python.
It is common to use defaultdict to group items in a sequence or collection by setting the initial argument (aka .default_factory) to list.
dep = [('Sales', 'John Doe'),
('Sales', 'Martin Smith'),
('Accounting', 'Jane Doe'),
('Marketing', 'Elizabeth Smith'),
('Marketing', 'Adam Doe')]
from collections import defaultdict
dep_dd = defaultdict(list)
for department, employee in dep:
    dep_dd[department].append(employee)
dep_dd
#defaultdict(list,
# {'Sales': ['John Doe', 'Martin Smith'],
# 'Accounting': ['Jane Doe'],
# 'Marketing': ['Elizabeth Smith', 'Adam Doe']})
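The .default_factory mentioned earlier is an ordinary attribute that you can inspect. As a short sketch (it reuses one entry from the department data; the 'Legal' key is hypothetical), it also shows that a defaultdict created without a factory behaves like a plain dict and raises KeyError on a missing key:

```python
from collections import defaultdict

groups = defaultdict(list)
factory = groups.default_factory
print(factory)                      # <class 'list'>

groups["Sales"].append("John Doe")
print(dict(groups))                 # {'Sales': ['John Doe']}

# without a factory, default_factory is None and missing keys raise KeyError
no_factory = defaultdict()
try:
    no_factory["Legal"]             # hypothetical key, for illustration only
    raised = False
except KeyError:
    raised = True
print(no_factory.default_factory, raised)   # None True
```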
What happens when you have duplicate entries? We're jumping ahead slightly to use set to handle duplicates and only group unique entries:
# departments with duplicate entries
dep = [('Sales', 'John Doe'),
('Sales', 'Martin Smith'),
('Accounting', 'Jane Doe'),
('Marketing', 'Elizabeth Smith'),
('Marketing', 'Elizabeth Smith'),
('Marketing', 'Adam Doe'),
('Marketing', 'Adam Doe'),
('Marketing', 'Adam Doe')]
# use defaultdict with set
dep_dd = defaultdict(set)
# a set object has no attribute 'append',
# so use 'add' to achieve the same effect
for department, employee in dep:
    dep_dd[department].add(employee)
dep_dd
#defaultdict(set,
# {'Sales': {'John Doe', 'Martin Smith'},
# 'Accounting': {'Jane Doe'},
# 'Marketing': {'Adam Doe', 'Elizabeth Smith'}})
Application: Accumulating with defaultdict
Finally, we'll use defaultdict to accumulate values:
incomes = [('Books', 1250.00),
('Books', 1300.00),
('Books', 1420.00),
('Tutorials', 560.00),
('Tutorials', 630.00),
('Tutorials', 750.00),
('Courses', 2500.00),
('Courses', 2430.00),
('Courses', 2750.00),]
# pass float as the argument; float() returns 0.0
dd = defaultdict(float)
for product, income in incomes:
    dd[product] += income
# defaultdict(float, {'Books': 3970.0, 'Tutorials': 1940.0, 'Courses': 7680.0})
for product, income in dd.items():
    print(f"Total income for {product}: ${income:,.2f}")
# Total income for Books: $3,970.00
# Total income for Tutorials: $1,940.00
# Total income for Courses: $7,680.00
I can see that defaultdict and dictionaries can be handy for grouping, counting and accumulating values in a column. We'll come back to revisit these foundational concepts once the data science applications are clearer.
In summary, dictionaries and defaultdict can be used to group items, accumulate items and count items. Both can be used even when the key doesn't (yet) exist, but defaultdict handles this more succinctly. For now, we'll stop here and proceed to the next topic: counters.
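To illustrate that last point, here is a rough side-by-side sketch of grouping with a plain dict versus defaultdict(list). The plain-dict version uses setdefault(), a standard dict method that this post has not covered, and the sample pairs are a shortened copy of the department data.

```python
from collections import defaultdict

pairs = [("Sales", "John Doe"),
         ("Sales", "Martin Smith"),
         ("Accounting", "Jane Doe")]

# plain dict: setdefault() supplies the empty list the first time a key is seen
plain = {}
for department, employee in pairs:
    plain.setdefault(department, []).append(employee)

# defaultdict: the empty list is supplied automatically on lookup
grouped = defaultdict(list)
for department, employee in pairs:
    grouped[department].append(employee)

# a defaultdict compares equal to a dict holding the same items
print(plain == grouped)   # True
```

Both loops produce the same grouping; the defaultdict version simply moves the "what if the key is missing" decision to the point where the dictionary is created.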
For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.