Counters & Sets

Counters

Counter is a dict subclass for counting hashable objects (see the Python docs).
Going back to the example in the previous section, we can use Counter instead of a plain dict when all we want to do is count:

from collections import Counter

# we can count the letters in this paragraph
count_letters = Counter("This table highlights 538's new NBA statistic, RAPTOR, in addition to the more established Wins Above Replacement (WAR). An extra column, Playoff (P/O) War, is provided to highlight stars performers in the post-season, when the stakes are higher. The table is limited to the top-100 players who have played at least 1,000 minutes minutes the table Wins NBA NBA RAPTOR more players")

# call count_letters
count_letters

# returns
Counter({'T': 4,
         'h': 19,
         'i': 22,
         's': 24,
         ' ': 61,
         't': 29,
         'a': 20,
         'b': 5,
         'l': 14,
         'e': 35,
         'g': 5,
         '5': 1,
         '3': 1,
         '8': 1,
         "'": 1,
         'n': 13,
         'w': 3,
         'N': 3,
         'B': 3,
         'A': 8,
         'c': 3,
         ',': 6,
         'R': 6,
         'P': 4,
         'O': 3,
         'd': 7,
         'o': 15,
         'm': 8,
         'r': 13,
         'W': 4,
         'v': 3,
         'p': 8,
         '(': 2,
         ')': 2,
         '.': 2,
         'x': 1,
         'u': 3,
         'y': 4,
         'f': 3,
         '/': 1,
         '-': 2,
         'k': 1,
         '1': 2,
         '0': 5})

Counter did in a single call what took a loop with defaultdict(int) previously. We can also call the most_common method to get the most common letters:


# get the thirteen most common letters
for letter, count in count_letters.most_common(13):
    print(letter, count)

# returns - 13 items
  61
e 35
t 29
s 24
i 22
a 20
h 19
o 15
l 14
n 13
r 13
A 8
m 8
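
To make the comparison with defaultdict(int) concrete, here is a minimal sketch that counts the same characters both ways. The short sample string and the variable names are made up for illustration. It also shows that looking up a key a Counter has never seen returns 0 without adding that key.

```python
from collections import Counter, defaultdict

# hypothetical short sample, not the paragraph used above
text = "minutes minutes the table"

# counting by hand with defaultdict(int)
manual_counts = defaultdict(int)
for ch in text:
    manual_counts[ch] += 1

# Counter does the same counting in one call
auto_counts = Counter(text)

# both approaches end up with identical counts
same_counts = dict(manual_counts) == dict(auto_counts)

# a missing key returns 0 and is not inserted into the Counter
missing = auto_counts["z"]
has_z = "z" in auto_counts

print(same_counts, missing, has_z)
```

One difference worth keeping in mind: looking up a missing key in a defaultdict(int) inserts that key with a value of 0, while Counter leaves itself unchanged.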

Sets

We had a glimpse of set previously. The author emphasizes two things about sets. First, they're faster than lists for checking membership:


lines_list = ["This table highlights 538's new NBA statistic, RAPTOR, in addition to the more established Wins Above Replacement (WAR). An extra column, Playoff (P/O) War, is provided to highlight stars performers in the post-season, when the stakes are higher. The table is limited to the top-100 players who have played at least 1,000 minutes minutes the table Wins NBA NBA RAPTOR more players"]

"zip" in lines_list # False, but have to check every element

lines_set = set(lines_list)
type(lines_set) # set

"zip" in lines_set # Very fast to check

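Because this is an arbitrary example with a tiny collection, it doesn't show that checking membership in a set is faster than in a list. The reason it holds in general is that a list has to be scanned element by element, while a set uses hashing to go (on average) straight to the item.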
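
The speed claim can be checked with a rough timing sketch using timeit from the standard library. The collection size, the value being searched for, and the number of repetitions are arbitrary choices for illustration. Exact timings will vary by machine, so no expected numbers are shown.

```python
import timeit

# hypothetical data: 100,000 integers, once as a list and once as a set
numbers_list = list(range(100_000))
numbers_set = set(numbers_list)

# a value that is not present forces the list to scan every element
missing_value = -1

in_list = missing_value in numbers_list
in_set = missing_value in numbers_set

# time 100 membership checks against each collection
list_time = timeit.timeit(lambda: missing_value in numbers_list, number=100)
set_time = timeit.timeit(lambda: missing_value in numbers_set, number=100)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

The gap grows with the size of the collection, since the list check is linear in the number of elements while the set check stays roughly constant on average.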

The second highlight for set is to find distinct items in a collection:

number_list = [1,2,3,1,2,3] # list with six items
item_set = set(number_list) # turn it into a set

item_set # now has three items {1, 2, 3}
turn_into_list = list(item_set) # turn into distinct item list
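
One caveat when going from a list to a set and back: a set does not keep track of the original order. If the order of first appearance matters, a common alternative (not from the book, just a sketch) is dict.fromkeys, which relies on dicts preserving insertion order in Python 3.7+. The sample list here is made up for illustration.

```python
# hypothetical list with duplicates, deliberately not in sorted order
number_list = [3, 1, 3, 2, 1]

# a set gives the distinct items, with no guarantee about order
distinct_unordered = set(number_list)

# dict keys are unique and keep insertion order,
# so this keeps the first occurrence of each item
distinct_ordered = list(dict.fromkeys(number_list))

print(distinct_ordered)  # [3, 1, 2]
```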

Here's a more applied example of using set to handle duplicate entries. We'll import defaultdict and pass set as the default_factory. This example is inspired by Real Python:

from collections import defaultdict

# departments with duplicate entries
dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('HR', 'Elizabeth Smith'),
       ('HR', 'Elizabeth Smith'),
       ('HR', 'Adam Doe'),
       ('HR', 'Adam Doe'),
       ('HR', 'Adam Doe')]

# use defaultdict with set
dep_dd = defaultdict(set)

# set object has no attribute 'append'
# so use 'add' to achieve the same effect

for department, employee in dep:
    dep_dd[department].add(employee)

dep_dd
#defaultdict(set,
#            {'Sales': {'John Doe', 'Martin Smith'},
#             'Accounting': {'Jane Doe'},
#             'HR': {'Adam Doe', 'Elizabeth Smith'}})
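
Once the duplicates are gone, the grouped sets are easy to summarize. The sketch below uses a shortened, made-up version of the department list and two hypothetical names, headcount and roster, to count unique employees per department and to turn each set into a sorted list.

```python
from collections import defaultdict

# shortened, hypothetical version of the department list above
dep = [("Sales", "John Doe"),
       ("HR", "Elizabeth Smith"),
       ("HR", "Adam Doe"),
       ("HR", "Adam Doe")]

# group employees by department, dropping duplicates via set
dep_dd = defaultdict(set)
for department, employee in dep:
    dep_dd[department].add(employee)

# number of unique employees per department
headcount = {department: len(names) for department, names in dep_dd.items()}

# sets are unordered, so sort when a stable listing is needed
roster = {department: sorted(names) for department, names in dep_dd.items()}

print(headcount)  # {'Sales': 1, 'HR': 2}
print(roster)     # {'Sales': ['John Doe'], 'HR': ['Adam Doe', 'Elizabeth Smith']}
```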


Paul Apivat
