Collections and Comprehensions

paulapivat

Paul Apivat

Posted on November 3, 2020

Collections and Comprehensions

This is a continuation of the Data Science from Scratch series.

The book opens with a narrative motivating example where you, dear reader, are newly hired to lead data science at DataSciencester, a social network exclusively for data scientists.

Joel Grus, the author, explains:

Throughout the book, we'll be learning about data science concepts by solving problems that you encounter at work. Sometimes we'll look at data explicitly supplied by users, sometimes we'll look at data generated through their interactions with the site, and sometimes we'll even look at data from experiments that we'll design...we'll be building our tools from scratch.

You may be wondering why chapter2 precedes chapter1. This chapter is meant as a teaser for the rest of the book and its code not meant for implementation, but I wanted to revisit this first chapter with the python crash course fresh on our minds to highlight some frequently used concepts we can expect to see for the rest of the book.

You are just hired as "VP of Networking" and are tasked with finding out which data scientist is the most well connected in the DataSciencster network, you're giving a data dump πŸ‘‡. It's a list of users, each with a unique id.

users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
    {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"},
    {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"},
    {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"},
    {"id": 9, "name": "Klein"}
]
Enter fullscreen mode Exit fullscreen mode

Of note here is that the users variable is a list of dict (dictionaries).

Moving along, we also receive "friendship" data. Of note here that this is a list of tuples:

friendship_pairs = [(0,1), (0,2), (1,2), (1,3), (2,3), (3,4),
                    (4,5), (5,6), (5,7), (6,8), (7,8), (8,9)]
Enter fullscreen mode Exit fullscreen mode

I had initially (and erroneously) thought of list, dict and tuple as data types (like int64, float64, string).

They're rather collections, and somewhat unique to Python and more importantly, informs the way Pythonistas approach and solve problems.

You may feel that having "friendship" data in a list of tuple is not the easiest way to work with data (nor may it be the best way to represent data, but we'll suspend those thoughts for now). Our first task is to convert this list of tuple into a form that's more workable; the author proposes we turn it into a dict where the keys are user_ids and the values are list of friends.

The argument is that its faster to look things up in a dict rather than a list of tuple (where we'd have to iterate over every tuple). Here's how we'd do that:

# Initialize the dict with an empty list for each user id
friendships = { user["id"]: [] for user in users }

# Loop over friendship pairs 
# This operation grabs the first, then second integer in each tuple
# It then appends each integer to the newly initialized friendships dict
for i, j in friendship_pairs:
    friendships[i].append(j)
    friendships[j].append(i)
Enter fullscreen mode Exit fullscreen mode

We're initializing a dict (called friendships), then looping over friendship_pairs to populate friendships. This is the outcome:

friendships

{
 0: [1, 2],
 1: [0, 2, 3],
 2: [0, 1, 3],
 3: [1, 2, 4],
 4: [3, 5],
 5: [4, 6, 7],
 6: [5, 8],
 7: [5, 8],
 8: [6, 7, 9],
 9: [8]
}
Enter fullscreen mode Exit fullscreen mode

Each key in friendships is matched with a value that is initially an empty list, which then gets populated as we loop over friendship_pairs and systematically append the user_id that is paired together.

To understand how the looping happends and, specifically how each pair of user_ids are connected to each other, I created my own mini-toy example. Let's say we're just going to focus on looping through friendship_pairs for the user Hero whose id is 0.

# we'll set hero to an empty list
hero = []

# for every friendship_pair, if the first integer is 0, which is Hero's id,
# then append the second integer
for x, y in friendship_pairs:
    if x == 0:
        hero.append(y)

# outcome: we can confirm that Hero is connected to  Dunn and Sue
hero # [1,2]
Enter fullscreen mode Exit fullscreen mode

The above gave me better intuition for how this works:

for i, j in friendship_pairs:
    friendships[i].append(j)  # Add j as a friend of user i
    friendships[j].append(i)  # Add i as a friend of user j
Enter fullscreen mode Exit fullscreen mode

Here are some other questions we may be interested in:

What is the total number of connections?

Look at how the problem is solved. What's notable to me is how we first define a function number_of_friends(user) that returns the number of friends for a particular user.

Then, total_connections is calculated using a comprehension (tuple):

def number_of_friends(user):
    """How many friends does _user_ have?"""
    user_id = user["id"]
    friend_ids = friendships[user_id]
    return len(friend_ids)

total_connections = sum(number_of_friends(user) for user in users)
Enter fullscreen mode Exit fullscreen mode

To be clear, the (tuple) comprehension is a pattern where a function is applied over a for-loop, in one line:

# (2, 3, 3, 3, 2, 3, 2, 2, 3, 1)
tuple((number_of_friends(user) for user in users))

# you can double check by calling friendships dict and counting the number of friends each user has
friendships

{
 0: [1, 2],
 1: [0, 2, 3],
 2: [0, 1, 3],
 3: [1, 2, 4],
 4: [3, 5],
 5: [4, 6, 7],
 6: [5, 8],
 7: [5, 8],
 8: [6, 7, 9],
 9: [8]
}
Enter fullscreen mode Exit fullscreen mode

This pattern of using a one-line for-loop (aka comprehension) will come up often. If we add up all the connections, we get 24 and to find the average, we simply divide by the number of users (10) for 2.4, this part is straight-forward.

Can we sort who has most-to-least friends to find the most connected individuals?

To answer this question, again, a list comprehension is used. The cool thing is that we re-use functions we had previously created (number_of_friends(user)).

# Create a list that loops over users dict, applying a previously defined function
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]

# Then sort
num_friends_by_id.sort(                                 # Sort the list
    key=lambda id_and_friends: id_and_friends[1],       # by number friends
    reverse=True)                                       # descending order
Enter fullscreen mode Exit fullscreen mode

We have just identified how central an individual is to the network, and we can expect to explore degree centrality and networks more in future chapters, but for the purposes of this post, we have identified the central role that collections (lists, dictionaries, tuples) as well as comprehensions play in Python operations.

In the next post, we'll examing how friendship connections may or may not overlap with interests.


If you'd like a Python crash course, with an eye towards data science, you might check out these posts:

πŸ’– πŸ’ͺ πŸ™… 🚩
paulapivat
Paul Apivat

Posted on November 3, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related