Counters & Sets
Paul Apivat
Posted on October 29, 2020
Counters
Counter is a dict subclass for counting hashable objects (see doc).
Returning to the example in the previous section, we can use Counter instead of dict, specifically for counting:
from collections import Counter
# we can count the letters in this paragraph
count_letters = Counter("This table highlights 538's new NBA statistic, RAPTOR, in addition to the more established Wins Above Replacement (WAR). An extra column, Playoff (P/O) War, is provided to highlight stars performers in the post-season, when the stakes are higher. The table is limited to the top-100 players who have played at least 1,000 minutes minutes the table Wins NBA NBA RAPTOR more players")
# call count_letters
count_letters
# returns
Counter({'T': 4,
'h': 19,
'i': 22,
's': 24,
' ': 61,
't': 29,
'a': 20,
'b': 5,
'l': 14,
'e': 35,
'g': 5,
'5': 1,
'3': 1,
'8': 1,
"'": 1,
'n': 13,
'w': 3,
'N': 3,
'B': 3,
'A': 8,
'c': 3,
',': 6,
'R': 6,
'P': 4,
'O': 3,
'd': 7,
'o': 15,
'm': 8,
'r': 13,
'W': 4,
'v': 3,
'p': 8,
'(': 2,
')': 2,
'.': 2,
'x': 1,
'u': 3,
'y': 4,
'f': 3,
'/': 1,
'-': 2,
'k': 1,
'1': 2,
'0': 5})
Counter very easily did what defaultdict(int) did previously. We can even call the most_common method to get the most common letters:
# get the thirteen most common letters
for letter, count in count_letters.most_common(13):
    print(letter, count)
# returns - 13 items (the first one is the space character)
  61
e 35
t 29
s 24
i 22
a 20
h 19
o 15
l 14
n 13
r 13
A 8
m 8
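To make the comparison with defaultdict(int) concrete, here is a minimal sketch that counts the same letters both ways. The text variable is a shortened, hypothetical stand-in for the paragraph above, and manual_counts, counter_counts and same_counts are names made up for this illustration.

```python
from collections import Counter, defaultdict

# hypothetical sample text, shorter than the paragraph above
text = "minutes minutes the table Wins NBA NBA RAPTOR more players"

# manual counting with defaultdict(int)
manual_counts = defaultdict(int)
for letter in text:
    manual_counts[letter] += 1

# the same counting in one call with Counter
counter_counts = Counter(text)

# both approaches produce the same letter -> count mapping
same_counts = dict(manual_counts) == dict(counter_counts)
print(same_counts)            # True

# a missing key returns 0 in a Counter and does not add the key
print(counter_counts["z"])    # 0
print("z" in counter_counts)  # False
```

One small difference worth noting: looking up a missing key in a defaultdict(int) inserts that key with a value of 0, while Counter returns 0 and leaves the mapping unchanged.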
Sets
We had a glimpse of set previously. The author emphasizes two things about set. First, sets are faster than lists for checking membership:
lines_list = ["This table highlights 538's new NBA statistic, RAPTOR, in addition to the more established Wins Above Replacement (WAR). An extra column, Playoff (P/O) War, is provided to highlight stars performers in the post-season, when the stakes are higher. The table is limited to the top-100 players who have played at least 1,000 minutes minutes the table Wins NBA NBA RAPTOR more players"]
"zip" in lines_list # False, but have to check every element
lines_set = set(lines_list)
type(lines_set) # set
"zip" in lines_set # Very fast to check
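Because this was an arbitrary example (lines_list holds a single string), it's not obvious that checking membership in a set is faster than in a list, so we'll take the author's word for it. The underlying reason is that a set is hash-based, so a lookup takes roughly constant time on average, while a list has to be scanned element by element.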
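If we'd rather not take it on faith, a rough timing sketch with the standard library's timeit module can show the gap. The collection of 100,000 integers and the names numbers_list, numbers_set and missing are assumptions for this illustration, not part of the original example, and the exact timings will vary by machine.

```python
import timeit

# hypothetical data: 100,000 integers, much larger than lines_list
numbers_list = list(range(100_000))
numbers_set = set(numbers_list)

# -1 is in neither collection, so the list has to scan all of its elements
missing = -1

# time 100 membership checks against each collection
list_seconds = timeit.timeit(lambda: missing in numbers_list, number=100)
set_seconds = timeit.timeit(lambda: missing in numbers_set, number=100)

print(f"list: {list_seconds:.6f} seconds")
print(f"set:  {set_seconds:.6f} seconds")
```

Searching for a value that isn't there is the worst case for the list, which is why the difference should be easy to see here.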
The second highlight for set is finding the distinct items in a collection:
number_list = [1,2,3,1,2,3] # list with six items
item_set = set(number_list) # turn it into a set
item_set # now has three items {1, 2, 3}
turn_into_list = list(item_set) # turn into distinct item list
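One caveat when turning a set back into a list: a set makes no promise about order. If the order of first appearance matters, one possible alternative (not from the book, just a sketch) is dict.fromkeys, since dictionaries keep insertion order in Python 3.7+. The names list below is hypothetical.

```python
# hypothetical list with duplicates, where first-seen order matters
names = ["RAPTOR", "WAR", "RAPTOR", "NBA", "WAR"]

# set() removes duplicates but makes no promise about order
distinct_unordered = set(names)

# dict.fromkeys() keeps the first occurrence of each item, in order
distinct_ordered = list(dict.fromkeys(names))

print(len(distinct_unordered))  # 3
print(distinct_ordered)         # ['RAPTOR', 'WAR', 'NBA']
```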
Here's a more applied example of using set to handle duplicate entries. We'll import defaultdict and pass set as the default_factory. This example is inspired by Real Python:
from collections import defaultdict
# departments with duplicate entries
dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('HR', 'Elizabeth Smith'),
       ('HR', 'Elizabeth Smith'),
       ('HR', 'Adam Doe'),
       ('HR', 'Adam Doe'),
       ('HR', 'Adam Doe')]
# use defaultdict with set
dep_dd = defaultdict(set)
# set object has no attribute 'append'
# so use 'add' to achieve the same effect
for department, employee in dep:
    dep_dd[department].add(employee)
dep_dd
#defaultdict(set,
# {'Sales': {'John Doe', 'Martin Smith'},
# 'Accounting': {'Jane Doe'},
# 'HR': {'Adam Doe', 'Elizabeth Smith'}})
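To see what set buys us as the default_factory, here is a small side-by-side sketch with list as the factory instead. The dep_sample list is a shortened, hypothetical version of dep, and dep_list and dep_set are names made up for this comparison.

```python
from collections import defaultdict

# shortened, hypothetical version of the dep list above
dep_sample = [('HR', 'Elizabeth Smith'),
              ('HR', 'Elizabeth Smith'),
              ('HR', 'Adam Doe')]

# list as the default_factory keeps every entry, duplicates included
dep_list = defaultdict(list)
# set as the default_factory keeps each employee once
dep_set = defaultdict(set)

for department, employee in dep_sample:
    dep_list[department].append(employee)
    dep_set[department].add(employee)

print(dep_list['HR'])         # ['Elizabeth Smith', 'Elizabeth Smith', 'Adam Doe']
print(len(dep_set['HR']))     # 2
print(sorted(dep_set['HR']))  # ['Adam Doe', 'Elizabeth Smith']
```

The sorted call is only there to get a predictable display order, since the set itself is unordered.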
For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.