Filter/Map and manipulating csv data, in Python
Liu Yongliang
Posted on June 27, 2020
While writing this article, I came across the following statement:
Programming is the art of solving problems in a few lines of code.
Well, when I began writing programs, I often realized that they were unnecessarily long. Part of the reason was laziness and not following good coding practices, which is an interesting topic for another day. The other part is that functions like map and filter are not as intuitive as a typical for-loop when used to process list or dictionary items. However, they are incredibly short!
Here's an example:
# example data
data = [ [0, ["bronze",] ] , [1,["gold","bronze"] ] , ... ]
# long version
filtered_data = []
for lst in data:
    if "gold" in lst[1]:
        filtered_data.append(lst)
# short version, checks if the word "gold" is in a particular list
filtered_data = list(filter(lambda x: "gold" in x[1], data))
# output
# [ [1,["gold","bronze"]] , ...]
We must balance readability and conciseness when we think about reducing the length of the code we write. In the above example, both versions are valid solutions, and the first one is closer to how I conceive of the solution. However, learning how to use filter and lambda functions is worthwhile: they are powerful and can save you a lot of space.
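For a version that runs as-is, here is a small sketch of the same comparison. The three sample rows are made up for illustration (the original data is elided), and the list comprehension at the end is an additional, commonly used alternative rather than something the example above relies on.

```python
# Hypothetical sample data in the same shape: [id, [medals]]
data = [[0, ["bronze"]], [1, ["gold", "bronze"]], [2, ["silver"]]]

# Loop version
loop_result = []
for row in data:
    if "gold" in row[1]:
        loop_result.append(row)

# filter + lambda version
filter_result = list(filter(lambda row: "gold" in row[1], data))

# List comprehension version (an alternative not used in this article)
comprehension_result = [row for row in data if "gold" in row[1]]

print(filter_result)  # [[1, ['gold', 'bronze']]]
```

All three produce the same list, so the choice mostly comes down to which one you and your readers find easiest to follow.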
There are a few places that we can easily apply filter/map:
- when we deal with a list/tuple of lists/dictionaries/tuples, basically a collection of collections
- when we need to iterate through every item in the collection and determine whether to keep or discard (filter)
- when we need to iterate through every item in the collection and make direct modifications to the item (map)
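The snippets in this article lean mostly on filter, so here is a minimal sketch of the map case from the last bullet. The scores data and the curve helper are hypothetical names used only for illustration.

```python
# Hypothetical data: (name, score) tuples
scores = [("ann", 71), ("bob", 50), ("cy", 98)]

def curve(entry):
    """Return a new tuple with 5 points added, capped at 100."""
    name, score = entry
    return (name, min(score + 5, 100))

# map applies curve to every item; filter keeps the items that pass the test
curved = list(map(curve, scores))
passed = list(filter(lambda entry: entry[1] >= 60, curved))

print(curved)  # [('ann', 76), ('bob', 55), ('cy', 100)]
print(passed)  # [('ann', 76), ('cy', 100)]
```

Note that map does not change the original list here; it produces new items, and scores is left as it was.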
Remember to apply a type conversion, typically list or tuple, to the filter/map object to turn it back into a collection.
>>> l = [2,3,4]
>>> a = filter(lambda x:x==2,l)
>>> a
<filter object at 0x000001D8057D4780>
>>> list(a)
[2]
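One detail worth knowing about the filter object shown above: in Python 3 it is a lazy iterator, so it can only be consumed once. A small sketch, reusing the same list with a different test:

```python
numbers = [2, 3, 4]
evens = filter(lambda x: x % 2 == 0, numbers)

first_pass = list(evens)   # consumes the iterator
second_pass = list(evens)  # nothing is left to yield

print(first_pass)   # [2, 4]
print(second_pass)  # []
```

This is one reason to convert to a list right away if you plan to use the result more than once.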
Example walkthrough
Suppose you have a CSV (comma-separated values) file of movie data, IMDB.csv, where each entry contains the following information:
- Director Name
- Genre: a list of genres separated by |
- Actor Name
- Movie Name
- IMDB Rating
- Release Year
Like so:
director_name,genre,actor_name,movie_name,rating,year
Andrew Berends,Documentary|War,,The Blood of My Brother,3.5,2005
U. Roberto Romano,Documentary,,The Harvest/La Cosecha,4.6,2011
Amal Al-Agroobi,Documentary|Family,,The Brain That Sings,3.5,2013
Jem Cohen,Documentary,,Counting,3.1,2015
Now, if you look carefully, there is nothing between some of the consecutive commas, signaling missing data for certain columns in the above sample data.
Here are some data cleaning tasks that we might need to perform:
- Removing entries that do not conform to specifications. In this example, IMDB rates a movie on a scale of 0-10 (inclusive), so we shall not trust any rating outside these bounds.
- The data should be complete. None of the fields, i.e. (i) director, (ii) genre, (iii) actor, (iv) movie, (v) rating, and (vi) release year, should be blank.
- The data comprises movies released between the years 1990 and 2019. If a movie was released outside this time span, its entry is considered unreliable.
Let's say we want to clean the data and return a list of the names of movies that have a rating of 9 or higher. Here's how we will approach the question and use filter/map along the way.
This is a snippet of a CSV-processing helper function in Python:
import csv

def read_csv(filename):
    # newline='' is what the csv module docs recommend when opening files
    with open(filename, 'r', newline='') as f:
        # reads csv into a list of lists
        lines = csv.reader(f, delimiter=',')
        return list(lines)[1:]  # removes the header row
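To see what this kind of parsing hands back without needing the actual IMDB.csv file, the sketch below parses two of the sample rows from an in-memory string. Using io.StringIO in place of a file is an assumption made for illustration; csv.reader accepts any iterable of lines.

```python
import csv
import io

sample = (
    "director_name,genre,actor_name,movie_name,rating,year\n"
    "Andrew Berends,Documentary|War,,The Blood of My Brother,3.5,2005\n"
    "Jem Cohen,Documentary,,Counting,3.1,2015\n"
)

rows = list(csv.reader(io.StringIO(sample)))[1:]  # skip the header row

print(rows[0])
# ['Andrew Berends', 'Documentary|War', '', 'The Blood of My Brother', '3.5', '2005']

# Every field is a string, and a missing value shows up as ''.
# map can split the genre column on "|" for each row.
genres = list(map(lambda row: row[1].split("|"), rows))
print(genres)  # [['Documentary', 'War'], ['Documentary']]
```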
Both filter and map take two parameters: a function (either a normal function or a lambda function) and the collection (list, tuple, etc.) that you want to work on. Each returns a filter/map object. See the comments for an explanation of usage.
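As a small sketch of the two options mentioned above, the same test can be passed as a named function or as a lambda. The row layout follows the sample CSV; has_rating is a hypothetical helper name, and the second row is made up for illustration.

```python
rows = [
    ["Jem Cohen", "Documentary", "", "Counting", "3.1", "2015"],
    ["Director X", "Drama", "Actor X", "Movie X", "", "2001"],  # made-up row
]

def has_rating(row):
    """True when the rating column (index 4) is not blank."""
    return row[4] != ""

with_named = list(filter(has_rating, rows))
with_lambda = list(filter(lambda row: row[4] != "", rows))

print(with_named == with_lambda)  # True
```

A named function costs a few more lines but gives the test a name and a docstring, which can help once the condition grows.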
def clean_data(filename):
    # Process csv into python readable format
    data = read_csv(filename)
    # Check that none of the columns is empty
    data = list(filter(lambda x: x[0]!="" and x[1]!="" and x[2]!="" and x[3]!="" and x[4]!="" and x[5]!="", data))
    # another way using a second filter
    # data = list(filter(lambda x: len(list(filter(lambda y:y!="",x)))==6,data))
    # Check that movie is within timespan
    data = list(filter(lambda x: 1990<= int(x[5])<=2019, data))
    # Check that movie rating is within range
    data = list(filter(lambda x: 0<=float(x[4])<=10, data))
    # Check that movie rating is at least 9
    data = list(filter(lambda x: float(x[4])>=9, data))
    # Returns only a list of movie names
    return list(map(lambda x:x[3], data))
- Note that every value read from a CSV file is a string, so type conversion is required before comparing numbers.
- Nothing stops you from applying multiple filter and/or map functions together, as long as they are not confusing to you (both now and the future you).
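If the chain of lambdas starts to feel crowded, one option is to give each rule a name and combine them in a single filter pass. This is a sketch of an alternative layout, not the approach used above; the function names and the sample rows (other than the Jem Cohen row) are made up for illustration.

```python
def is_complete(row):
    return len(row) == 6 and all(field != "" for field in row)

def in_year_range(row):
    return 1990 <= int(row[5]) <= 2019

def has_valid_rating(row):
    return 0 <= float(row[4]) <= 10

def is_top_rated(row):
    return float(row[4]) >= 9

def top_movie_names(rows):
    # all() stops at the first failing check, and is_complete comes first,
    # so the int/float conversions never see a blank field
    checks = (is_complete, in_year_range, has_valid_rating, is_top_rated)
    kept = filter(lambda row: all(check(row) for check in checks), rows)
    return list(map(lambda row: row[3], kept))

rows = [
    ["Jem Cohen", "Documentary", "", "Counting", "3.1", "2015"],  # blank actor
    ["Director A", "Drama", "Actor A", "Movie A", "9.2", "2001"],  # kept
    ["Director B", "Drama", "Actor B", "Movie B", "9.5", "1985"],  # too old
    ["Director C", "Drama", "Actor C", "Movie C", "11", "2010"],   # bad rating
    ["Director D", "Drama", "Actor D", "Movie D", "8.9", "2010"],  # below 9
]

print(top_movie_names(rows))  # ['Movie A']
```

The trade-off is the same one discussed earlier: more lines, but each rule can be read and tested on its own.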
That's it. Hopefully, this simple example has shown you how to use the built-in filter and map functions in Python.
Happy Learning.