Tour of Python Itertools
Martin Heinz
Posted on February 15, 2020
Note: This was originally posted at martinheinz.dev
There are lots of great Python libraries, but most of them don't come close to what built-in itertools
and also more-itertools
provide. These two libraries are really the whole kitchen sink when it comes to processing/iterating over some data in Python. At first glance however, functions in those libraries might not seem that useful, so let's make little tour of (in my opinion) the most interesting ones, including examples how to get the most out of them!
Compress
Before we get to functions from more_itertools
library, let's first look into few of the more obscure ones from built-in itertools
module - first of them being itertools.compress
:
dates = [
"2020-01-01",
"2020-02-04",
"2020-02-01",
"2020-01-24",
"2020-01-08",
"2020-02-10",
"2020-02-15",
"2020-02-11",
]
counts = [1, 4, 3, 8, 0, 7, 9, 2]
from itertools import compress
bools = [n > 3 for n in counts]
print(list(compress(dates, bools))) # Compress returns iterator!
# ['2020-02-04', '2020-01-24', '2020-02-10', '2020-02-15']
You have quite a few option when it comes to filtering sequences, one of them is also compress
, which takes iterable and boolean selector and outputs items of the iterable where the corresponding element in the selector is True
.
We can use this to apply result of filtering of one sequence to another like in above example, where we create list of dates where the corresponding count is greater than 3.
Accumulate
As name suggests - we will use this function to accumulate results of some (binary) function. Example of this can be running maximum or factorial:
from itertools import accumulate
import operator
data = [3, 4, 1, 3, 5, 6, 9, 0, 1]
list(accumulate(data, max)) # running maximum
# [3, 4, 4, 4, 5, 6, 9, 9, 9]
list(accumulate(range(1, 11), operator.mul)) # Factorial
# [1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800]
If you don't care about intermediate results, you could use functools.reduce
(called fold
in other languages), which keeps only final value and is also more memory efficient.
Cycle
This function takes iterable and creates infinite cycle from it. This can be useful for example in a game, where players take turns. Another cool thing you can do with cycle
is to create simple infinite spinner:
# Cycling through players
from itertools import cycle
players = ["John", "Ben", "Martin", "Peter"]
next_player = cycle(players).__next__
player = next_player()
# "John"
player = next_player()
# "Ben"
# ...
# Infinite Spinner
import time
for c in cycle('/-\|'):
print(c, end = '\r')
time.sleep(0.2)
One thing you might need to do when using cycle
is skipping few elements of iterable (in other words - starting at different element). You can do that with itertools.islice
, so to start at third player from example above you could do: islice(cycle(players), 2, None)
.
Tee
Final one from itertools
module is tee
, this function creates multiple iterators from one, which allows us to remember what happened. Example of that is pairwise
function from itertools recipes (and also more_itertools
), which returns pairs of values from input iterable (current value and previous one):
from itertools import tee
def pairwise(iterable):
"""
s -> (s0, s1), (s1, s2), (s2, s3), ...
"""
a, b = tee(iterable, 2)
next(b, None)
return zip(a, b)
This function is handy every time you need multiple separate pointers to same stream of data. Be careful when using it though, as it can be pretty costly when it comes to memory. Also important to note is that you should not use original iterable after you use tee
on it as it will mess up (unintentionally advance) those new tee
objects.
more_itertools
Now, let's have a closer look at what more_itertools
library offers, as there are many interesting functions, that you might not have heard about.
divide
First up from more_itertools
is divide
. As the name suggests, it divides iterable into number of sub-iterables. As you can see in example below, the length of the sub-iterables might not be the same, as it depends on number of elements being divided and number of sub-iterables.
from more_itertools import divide
data = ["first", "second", "third", "fourth", "fifth", "sixth", "seventh"]
[list(l) for l in divide(3, data)]
# [['first', 'second', 'third'], ['fourth', 'fifth'], ['sixth', 'seventh']]
There is one more similar function in more_itertools
called distribute
, it however doesn't maintain order. If you don't care about order you should use distribute
as it needs less memory.
partition
With this function, we will be also dividing our iterable, this time however, using a predicate:
# Split based on age
from datetime import datetime, timedelta
from more_itertools import partition
dates = [
datetime(2015, 1, 15),
datetime(2020, 1, 16),
datetime(2020, 1, 17),
datetime(2019, 2, 1),
datetime(2020, 2, 2),
datetime(2018, 2, 4)
]
is_old = lambda x: datetime.now() - x < timedelta(days=30)
old, recent = partition(is_old, dates)
list(old)
# [datetime.datetime(2015, 1, 15, 0, 0), datetime.datetime(2019, 2, 1, 0, 0), datetime.datetime(2018, 2, 4, 0, 0)]
list(recent)
# [datetime.datetime(2020, 1, 16, 0, 0), datetime.datetime(2020, 1, 17, 0, 0), datetime.datetime(2020, 2, 2, 0, 0)]
# Split based on file extension
files = [
"foo.jpg",
"bar.exe",
"baz.gif",
"text.txt",
"data.bin",
]
ALLOWED_EXTENSIONS = ('jpg','jpeg','gif','bmp','png')
is_allowed = lambda x: x.split(".")[1] in ALLOWED_EXTENSIONS
allowed, forbidden = partition(is_allowed, files)
list(allowed)
# ['bar.exe', 'text.txt', 'data.bin']
list(forbidden)
# ['foo.jpg', 'baz.gif']
In the first example above, we are splitting list of dates into recent ones and old ones, using simple lambda function. For the second example we are partitioning files based on their extension, again using lambda function which splits file name into name and extension and checks whether the extension is in list of allowed ones.
consecutive_groups
If you need to find runs of consecutive numbers, dates, letters, booleans or any other orderable objects, then you might find consecutive_groups
handy:
# Consecutive Groups of dates
import datetime
import more_itertools
dates = [
datetime.datetime(2020, 1, 15),
datetime.datetime(2020, 1, 16),
datetime.datetime(2020, 1, 17),
datetime.datetime(2020, 2, 1),
datetime.datetime(2020, 2, 2),
datetime.datetime(2020, 2, 4)
]
ordinal_dates = []
for d in dates:
ordinal_dates.append(d.toordinal())
groups = [list(map(datetime.datetime.fromordinal, group)) for group in more_itertools.consecutive_groups(ordinal_dates)]
In this example, we have list of dates, where some of them are consecutive. To be able to pass these dates to consecutive_groups
function, we first have to convert them to ordinal numbers. Then using list comprehension we iterate over groups of consecutive ordinal dates created by consecutive_groups
and convert them back to datetime.datetime
using map
and fromordinal
functions.
side_effect
Let's say you need to cause side-effect when iterating over list of items. This side-effect could be e.g. writing logs, writing to file or like in example below counting number of events that occurred:
import more_itertools
num_events = 0
def _increment_num_events(_):
nonlocal num_events
num_events += 1
# Iterator that will be consumed
event_iterator = more_itertools.side_effect(_increment_num_events, events)
more_itertools.consume(event_iterator)
print(num_events)
We declare simple function that will increment counter every time it's invoked. This function is then passed to side_effect
along with non-specific iterable called events
. Later when the event iterator is consumed, it will call _increment_num_events
for each item, giving us final events count.
collapse
This is a more powerful version of another more_itertools
function called flatten
. collapse
allows you to flatten multiple levels of nesting. It also allows you to specify base type, so that you can stop flattening with one layer of lists/tuples remaining. One use-case for this function would be flattenning of Pandas DataFrame
. Here are little more general purpose examples:
import more_itertools
import os
# Get flat list of all files and directories
list(more_itertools.collapse(list(os.walk("/home/martin/Downloads"))))
# Get all nodes of tree into flat list
tree = [40, [25, [10, 3, 17], [32, 30, 38]], [78, 50, 93]] # [Root, SUB_TREE_1, SUB_TREE_2, ..., SUB_TREE_n]
list(more_itertools.collapse(tree))
First one generates list of files and directory paths by collapsing iterables returned by os.walk
. In the second one we take tree data structure in a form of nested lists and collapse it to get flat list of all nodes of said tree.
split_at
Back to splitting data. split_at
function splits iterable into lists based on predicate. This works like basic split
for strings, but here we have iterable instead of string and predicate function instead of delimiter:
import more_itertools
lines = [
"erhgedrgh",
"erhgedrghed",
"esdrhesdresr",
"ktguygkyuk",
"-------------",
"srdthsrdt",
"waefawef",
"ryjrtyfj",
"-------------",
"edthedt",
"awefawe",
]
list(more_itertools.split_at(lines, lambda x: '-------------' in x))
# [['erhgedrgh', 'erhgedrghed', 'esdrhesdresr', 'ktguygkyuk'], ['srdthsrdt', 'waefawef', 'ryjrtyfj'], ['edthedt', 'awefawe']]
Above, we simulate text file using list of lines. This "text file" contains lines with -------------
, which is used as delimiter. So, that's what we use as our predicate for splitting these lines into separate lists.
bucket
If you need to split your iterable into multiple buckets based on some condition, then bucket
is what you are looking for. It creates child iterables by splitting input iterable using key function:
# Split based on Object Type
import more_itertools
class Cube:
pass
class Circle:
pass
class Triangle:
pass
shapes = [Circle(), Cube(), Circle(), Circle(), Cube(), Triangle(), Triangle()]
s = more_itertools.bucket(shapes, key=lambda x: type(x))
# s -> <more_itertools.more.bucket object at 0x7fa65323f210>
list(s[Cube])
# [<__main__.Cube object at 0x7f394a0633c8>, <__main__.Cube object at 0x7f394a063278>]
list(s[Circle])
# [<__main__.Circle object at 0x7f394a063160>, <__main__.Circle object at 0x7f394a063198>, <__main__.Circle object at 0x7f394a063320>]
Here we show how to bucket iterable based on items type. We first declare a few types of shapes and create a list of them. When we call bucket
on this list with above key function, we create bucket object. This object supports lookup like built-in Python dict
. Also, as you can see, each item in the whole bucket object is a generator, therefore we need to call list
on it to actually get the values out of it.
map_reduce
Probably the most interesting function in this library for all the data science people out there - the map_reduce
. I'm not going to go into detail on how MapReduce works as that is not purpose of this article and there's lots of articles about that already. What I'm gonna show you though, is how to use it:
from more_itertools import map_reduce
data = 'This sentence has words of various lengths in it, both short ones and long ones'.split()
keyfunc = lambda x: len(x)
result = map_reduce(data, keyfunc)
# defaultdict(None, {
# 4: ['This', 'both', 'ones', 'long', 'ones'],
# 8: ['sentence'],
# 3: ['has', 'it,', 'and'],
# 5: ['words', 'short'],
# 2: ['of', 'in'],
# 7: ['various', 'lengths']})
valuefunc = lambda x: 1
result = map_reduce(data, keyfunc, valuefunc)
# defaultdict(None, {
# 4: [1, 1, 1, 1, 1],
# 8: [1],
# 3: [1, 1, 1],
# 5: [1, 1],
# 2: [1, 1],
# 7: [1, 1]})
reducefunc = sum
result = map_reduce(data, keyfunc, valuefunc, reducefunc)
# defaultdict(None, {
# 4: 5,
# 8: 1,
# 3: 3,
# 5: 2,
# 2: 2,
# 7: 2})
This MapReduce implementation allows us to specify 3 functions: key function (for categorizing), value function (for transforming) and finally reduce function (for reducing). Some of these function can be omitted to produce intermediate steps in MapReduce process, as shown above.
sort_together
If you work with spreadsheets of data, chances are, that you need to sort it by some column. This is simple task for sort_together
. It allows us to specify by which column(s) to sort the data:
# Table
"""
Name | Address | Date of Birth | Updated At
----------------------------------------------------------------
John | | 1994-02-06 | 2020-01-06
Ben | | 1985-04-01 | 2019-03-07
Andy | | 2000-06-25 | 2020-01-08
Mary | | 1998-03-14 | 2018-08-15
"""
from more_itertools import sort_together
cols = [
("John", "Ben", "Andy", "Mary"),
("1994-02-06", "1985-04-01", "2000-06-25", "1998-03-14"),
("2020-01-06", "2019-03-07", "2020-01-08", "2018-08-15")
]
sort_together(cols, key_list=(1, 2))
# [('Ben', 'John', 'Mary', 'Andy'), ('1985-04-01', '1994-02-06', '1998-03-14', '2000-06-25'), ('2019-03-07', '2020-01-06', '2018-08-15', '2020-01-08')]
Input to the function is list of iterables (columns) and key_list
which is tells sort_together
which of the iterables to use for sorting and with what priority. In case of the above example with first sort the "table" by Date of Birth and then by Updated At column.
seekable
We all love iterators, but you should always be careful with them in Python as one of their features is that they consume the supplied iterable. They don't have to though, thanks to seekable
:
from more_itertools import seekable
data = "This is example sentence for seeking back and forth".split()
it = seekable(data)
for word in it:
...
next(it)
# StopIteration
it.seek(3)
next(it)
# "sentence"
seekable
is function that wraps iterable in object that makes it possible to go back and forth through iterator, even after some elements were consumed. In the example you can see that we've got StopIteration
exception after going through whole iterator, but we can seek back and keep working with it.
filter_except
Let's look at following scenario: You received mixed data, that contains both text and numbers and all of it is in string form. You however, want to work only with numbers (floats/ints):
from more_itertools import filter_except
data = ['1.5', '6', 'not-important', '11', '1.23E-7', 'remove-me', '25', 'trash']
list(map(float, filter_except(float, data, TypeError, ValueError)))
# [1.5, 6.0, 11.0, 1.23e-07, 25.0]
filter_except
filters items of input iterable, by passing elements of iterable to provided function (float
) and checking whether it throws error (TypeError, ValueError
) or not, keeping only elements that passed the check.
unique_to_each
unique_to_each
is one of the more obscure functions in more_itertools
library. It takes bunch of iterables and returns elements from each of them, that aren't in the other ones. It's better to look at example:
from more_itertools import unique_to_each
# Graph (adjacency list)
graph = {'A': {'B', 'E'}, 'B': {'A', 'C'}, 'C': {'B'}, 'D': {'E'}, 'E': {'A', 'D'}}
unique_to_each({'B', 'E'}, {'A', 'C'}, {'B'}, {'E'}, {'A', 'D'})
# [[], ['C'], [], [], ['D']]
# If we discard B node, then C gets isolated and if we discard E node, then D gets isolated
Here, we define graph data structure using adjacency list (actually dict
). We then pass neighbours of each node as a set to unique_to_each
. What it outputs is a list of nodes that would get isolated if respective node was to be removed.
numeric_range
This one has a lot of use cases, as it's quite common that you would need to iterate over range of some non-integer values:
from more_itertools import numeric_range
import datetime
from decimal import Decimal
list(numeric_range(Decimal('1.7'), Decimal('3.5'), Decimal('0.3')))
# [Decimal('1.7'), Decimal('2.0'), Decimal('2.3'), Decimal('2.6'), Decimal('2.9'), Decimal('3.2')]
start = datetime.datetime(2020, 2, 10)
stop = datetime.datetime(2020, 2, 15)
step = datetime.timedelta(days=2)
list(numeric_range(start, stop, step))
# [datetime.datetime(2020, 2, 10, 0, 0), datetime.datetime(2020, 2, 12, 0, 0), datetime.datetime(2020, 2, 14, 0, 0)]
What is nice about numeric_range
is that it behaves the same way as basic range
. You can specify start
,stop
and step
arguments as in examples above, where we first use decimals between 1.7
and 3.5
with step of 0.3
and then dates between 2020/2/10
and 2020/2/15
with step of 2 days.
make_decorator
Last but not least, make_decorator
enables us to use other itertools as decorators and therefore modify outputs of other functions, producing iterators:
from more_itertools import make_decorator
from more_itertools import map_except
mapper_except = make_decorator(map_except, result_index=1)
@mapper_except(float, ValueError, TypeError)
def read_file(f):
... # Read mix of text and numbers from file
return ['1.5', '6', 'not-important', '11', '1.23E-7', 'remove-me', '25', 'trash']
list(read_file("file.txt"))
# [1.5, 6.0, 11.0, 1.23e-07, 25.0]
This example takes map_except
function and creates decorator out of it. This decorator will consume result of decorated function as its second argument (result_index=1
). In our case, the decorated function is read_file
, which simulates reading data of some file and outputs list of strings that might or might not be floats. The output however, is first passed to decorator, which maps and filters all the undesirable items, leaving us with only floats.
Conclusion
I hope that you learned something new in this article, as itertools
and more_itertools
can make your life a whole lot easier if you are processing lots of data frequently. Using these libraries and functions however, requires some practice to be efficient with. So, if you think that you can make use of some of the things shown in this article, then go ahead and checkout itertools recipes or just force yourself to use these as much as possible to get comfortable with it. 😉
Posted on February 15, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
March 12, 2024