Fridolín Pokorný
Posted on February 20, 2020
Let’s take a look at a solution that randomly picks an item from a list. Instead of assigning equal probability to each item in the list, let’s create an algorithm that assigns the highest probability to the item at index 0 and lower probabilities to all remaining items as the index in the list grows. Can we implement such a function?😓
You can use already available routines present in the Python standard library to pick a random item from a list. Let’s say we want to randomly pick a number from a list of integers. The only thing you need to do is to run the following snippet:
import random
my_list = [42, 33, 30, int("deadbeef", base=16)]
random.choice(my_list)
# results in 42 with a probability of 1 / len(my_list)
Now let’s say we want to prioritize number 42 over 33, prioritize number 33 over 30, and at the same time, prioritize number 30 over “0xdeadbeef”. We have 4 numbers in total in our list, let’s assign weights to these numbers in the following manner:
+--------------+-------------+
| number | weight |
+--------------+-------------+
| 42 | 4 |
| 33 | 3 |
| 30 | 2 |
| 0xdeadbeef | 1 |
+--------------+-------------+
The higher the weight is, the higher the probability we pick the given number.
You can see weighs as a number of “buckets” we assign to each number from the list. Subsequently, we randomly (random uniform) try to hit one bucket. After hitting the bucket, we check to which number this bucket corresponds to.
The total number of buckets we can hit is equal to the sum of weights:
4 + 3 + 2 + 1 = 10
The probability of hitting a bucket based on the number from the list is shown below:
+--------------+-----------------------+
| number | probability |
+--------------+-----------------------+
| 42 | 4/10 = 0.4 |
| 33 | 3/10 = 0.3 |
| 30 | 2/10 = 0.2 |
| 0xdeadbeef | 1/10 = 0.1 |
+--------------+-----------------------+
To generalize this for n numbers, we can come up with the following formula:
In other mathematical words:
Does this formula look familiar to you? It’s called termial of a positive integer n; from Wikipedia:
The termial was coined by Donald E. Knuth in his The Art of Computer Programming. It is the additive analog of the factorial function, which is the product of integers from 1 to n. He used it to illustrate the extension of the domain from positive integers to the real numbers.
Now, let’s make our hands dirty with some code. To compute the termial of n:
termial_of_n = sum(range(1, len(my_list) + 1)) # O(N)
Another, more effective way, to compute the termial of n is to use Binomial coefficient principle and compute (len(my_list) + 1)
over 2
:
l = len(my_list)
# (l + 1) over 2 = l! / (2!*(l-2)!) = l * (l - 1) / 2
termial_of_n = ((l**2) + l) >> 1 # O(1)
Finally, we can pick a random (random uniform) bucket from our buckets:
import random
choice = random.randrange(termial_of_n)
The result stored in the variable choice holds an integer starting from 0 to 9 (inclusively) and represents an index to the list of the buckets we created earlier:
+--------------+---------------+---------------+
| choice | bucket | number |
+--------------+---------------+---------------+
| 0 | 1 | 42 |
+--------------+---------------+---------------+
| 1 | 2 | 42 |
+--------------+---------------+---------------+
| 2 | 3 | 42 |
+--------------+---------------+---------------+
| 3 | 4 | 42 |
+--------------+---------------+---------------+
| 4 | 5 | 33 |
+--------------+---------------+---------------+
| 5 | 6 | 33 |
+--------------+---------------+---------------+
| 6 | 7 | 33 |
+--------------+---------------+---------------+
| 7 | 8 | 30 |
+--------------+---------------+---------------+
| 8 | 9 | 30 |
+--------------+---------------+---------------+
| 9 | 10 | 0xdeadbeef |
+--------------+---------------+---------------+
How do we find which number we hit through a randomly picked bucket for any n?
Let’s revisit how we computed the termial number of n using the Binomial coefficient based formula:
l = len(my_list)
termial_of_n = ((l**2) + l) >> 1
Based on termial function definition, we know that regardless of n, we always assign 1 bucket to number at index 0, 2 buckets to number at index 1, 3 buckets to number at index 2 and so on. Using this knowledge, we can transform the Binomial coefficient formula to the following equation:
choice = ((i**2) + i) >> 1
The next step is to find i that satisfies the given equation. The equation is a quadratic function described as:
a*i**2 + b*i + c = 0
where:
a = 1/2
b = 1/2
c = -choice
As the choice is expected to be always a non-negative integer (index to the list of buckets), we can search for a solution that always results in a non-negative integer (reducing one discriminant term that always results in negative i).
import math
# D = b**2 - 4*a*c
# x1 = (-b + math.sqrt(D)) / (2*a)
# x2 = (-b - math.sqrt(D)) / (2*a)
# Given:
# a = 1/2
# b = 1/2
# c = -choice
# D = (1/2)**2 + 4*0.5*choice = 0.25 + 2*choice
i = math.floor(-0.5 + math.sqrt(0.25 + (choice << 1)))
The solution has to be rounded using math.floor as it corresponds to the inverted index with respect to n. As i is inverted, the final solution (index to the original list) is:
my_list[n - 1 - i]
Asymptotic complexity analysis
Let’s assume:
- the len function can return the length of the list in O(1) time
- random.randrange* operates in O(1) time
- we use Binomial coefficient based equation for computing the termial of n
The whole computation is done in O(1) time with O(1) space.
If we would use the sum based computation of the termial of n, the algorithm would become O(n) time with O(1) space.
Disclaimer & original usage
I designed this algorithm for Thoth’s recommendation engine. Its main purpose is to prefer more recent versions of packages in the resolver during the resolution of Python software stacks.
I would be happy for any feedback or any similar approaches recommended.
A complete solution can be found on my GitHub gist:
Posted on February 20, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
October 1, 2024