Math for Devs - Birthday Paradox for random data generation

josethz00

José Thomaz

Posted on September 18, 2023

Math for Devs - Birthday Paradox for random data generation

The Birthday Paradox is a fascinating mathematical phenomenon, and it has significant implications for computer science and software development, especially in cryptography, hashing, and random data generation. It's a great application of probability in our day to day.

 

What is the Birthday Paradox?

The Birthday Paradox refers to the counterintuitive probability that in a set of just 23 people, there's a better-than-even chance that two of them share the same birthday. The paradox lies in how unexpectedly low the number 23 is, given there are 365 possible days for a birthday.

How is this possible?

Upon first hearing about the Birthday Paradox, many people find it baffling. With 365 possible days for birthdays, how can it be that with just 23 individuals, there's a 50% chance that at least two share the same birthday?


They don't know meme


A common mistake is to think about the problem linearly, as one might do with a "rule of three" approach. To illustrate, you might think:

If 1 person covers 1 day of the year, then 23 people should cover (23365)( \frac{23}{365} ) or about 6.3% of the year. Hence, the chance of a shared birthday should be very low.

But this reasoning isn't accurate because it doesn't consider all possible pairs of people in the group.

Let's break it down mathematically:

Consider the first person. When the second person enters the room, there's a (364365)( \frac{364}{365} ) chance they have a different birthday than the first. As the third person enters, they now have two people to potentially share a birthday with, so the probability they have a distinct birthday from the first two is (364365×363365)( \frac{364}{365} \times \frac{363}{365} ) . And so on.

The probability (P)( P ) that at least two people share a birthday in a group of 23 can be found by subtracting the probability that all of them have distinct birthdays from 1:

[P=1(365365×364365×363365×343365)][ P = 1 - \left( \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \ldots \times \frac{343}{365} \right) ]

This results in a number slightly above 50%.

It happens, because in statistics when we want to sum probabilities, we have to multiply them one for the other.

 

Birthday Paradox Approximation

Calculating the probability the way we have done above, as you may have noticed is very difficult and computationally intensive. So, we have formulas that act as approximations for the Birthday Paradox, the main approximation formula is this:

 P(collision)1en22N\ P (collision) ≈ 1 - e ^ {- \frac{n ^ 2}{2N}}

Where:

nn is the size of your sample (23 people)
ee is the Euler number
NN is the total number of possibilities (365 days)

 

Real word problems

Theory is good, but we need practice and applications to learn. So in this section, I will share some challenges that I had at work, which I solved by using the Birthday Paradox.

Random Codes Generation

These days I had to generate around 400k unique random codes at once and insert them into the database. I had a table in my SQL database with a UNIQUE CONSTRAINT in the code field, to ensure that the database wouldn't accept duplicates. My SQL table was like this:

CREATE TABLE IF NOT EXISTS codes (
    id SERIAL PRIMARY KEY,
    code TEXT UNIQUE,
    used INTEGER DEFAULT 0
);
Enter fullscreen mode Exit fullscreen mode

After some codes were generated, I started to get this annoying error from the database: "UNIQUE Constraint Violation". It shouted that, amongst my fresh 400k codes, there were some duplicates. My first reaction was pure disbelief. I mean, how is this possible? Then I thought that maybe I wasn't using truly random data to generate the codes.

To give some context, the code format was:

  • Four numbers from 0 to 9 at the beginning
  • Three letters from A to Z at the end

Very similar to a US License Plate

License Plate


If we do the math, for this shape of code, there are:

N=104 × 263 N = 10^4 \space \times \space 26^3

N=175,760,00N = 175,760,00 different combinations

For me, it seemed very strange and wrong, because 400,000175,760,000=0.00227583067 = 0.23 %\frac{400,000}{175,760,000} = 0.00227583067 \space = \space 0.23 \space \% . How could I have been so unlucky to land within this tiny 0.23% window?

Then it hit me – the issue wasn't with the randomness of my code generation. I was witnessing the Birthday Paradox in action! After researching a little bit on the Birthday Paradox, I revisited my calculations and found out something pretty surprising: with 400k codes generated all at once, the chance of collision is almost of 100% (99.997%)

Using the simplified approximation of the Birthday Paradox:

p(n)1en22N p(n) \approx 1 - e^{-\frac{n^2}{2N}}

Where:

  • p(n) is the probability of a collision.
  • n is the number of generated codes.
  • N is the total number of possible combinations.
  • e is the base of the natural logarithm (approximated to 2.71828).

Plugging in n = 400,000 and N = 175,760,000:

p(400,000)1e400,00022×175,760,000p(400,000) \approx 1 - e^{-\frac{400,000^2}{2 \times 175,760,000}}

This equation gives you a value close to 1, indicating that a collision is almost certain

Solution 1 - Generate codes sequentially and programmatically label them

The first solution for this problem is to generate the codes sequentially, instead of generating them in a completely random way. By doing this, we ensure that there are no collisions, as we are following the combination of codes order. To not generate repeated codes every round, we insert them into the database and label them as "used".

Let's see some code. Starting with the function to generate the codes:

Create a file named generate-codes.ts , and add the following code:

export function generateSequentialCodes(nOfCodes: number = 100): string[] {
    const codes: string[] = [];
    let nextNumber = 0;
    const nextLetters = ['A'.charCodeAt(0), 'A'.charCodeAt(0), 'A'.charCodeAt(0)]
    let nextNumberPart = '0000';
    let nextLettersPart = 'AAA';

    for (let i = 0; i < nOfCodes; i++) {
        codes.push(`${nextNumberPart}${nextLettersPart}`);
        nextNumberPart = (nextNumber + 1).toString().padStart(4, '0');

        nextLetters[2]++;  // increment the last letter

        // check if the last letter is bigger than 'Z' in the ASCII table
        if (nextLetters[2] > 90) {
            nextLetters[2] = 65;  // reset to 'A'
            nextLetters[1]++;  // increment the 2nd letter

            // check if the second letter is bigger than 'Z' in the ASCII table
            if (nextLetters[1] > 90) {
                nextLetters[1] = 65;  // reset to 'A'
                nextLetters[0]++;    // increment the 1st letter

                // if all letters are 'Z', then reset them all to 'A'
                if (nextLetters[0] > 90) {
                    nextLetters[0] = 65;
                    nextLetters[1] = 65;
                    nextLetters[2] = 65;
                }
            }
        }

        nextLettersPart = String.fromCharCode(...nextLetters);
    }

    return codes;
}
Enter fullscreen mode Exit fullscreen mode

This function will be used to generate the codes sequentially, using permutation.

The next step is to generate the 400k codes and insert them into the database. This won't be covered in this article because it's not our main focus. However, I'll provide a link at the end with the complete implementation.

Solution 2 - Generating random codes in small batches

Generating random codes in small batches can be a pragmatic approach. Instead of generating 400k random codes all at once, which will almost certainly result in a collision, we can generate codes in small batches to mitigate this risk. For instance, if we generate 5k codes at a time, the collision probability stands around 6.8%.

For an even safer margin, let's generate 2.5k random codes in one go. This gives us a collision probability of about 1.8%, which is manageable and considerably low.

p(2,500)1e2,50022×175,760,000p(2,500)0.0178p(2,500) \approx 1 - e^{-\frac{2,500^2}{2 \times 175,760,000}} \newline p(2,500) \approx 0.0178

Let's proceed with our coding:

In our file generate-codes.ts, we'll append the following code:

export function generateRandomCode(): string {
    const numbersPart = getRndInteger(0, 9999).toString().padStart(4, '0');
    let lettersPart = '';
    const letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

    for (let i = 0; i < 2; i++) {
        lettersPart += letters.charAt(getRndInteger(0, letters.length));
    }

    return numbersPart + lettersPart;
}
Enter fullscreen mode Exit fullscreen mode

The function above is responsible for generating random codes.

Now we generate the 2.5k codes and check if they are repeated:

export function generate2_5kCodes(): string[] {
    const codes: string[] = [];

    while(codes.length < 2500) {
        const newCode = generateRandomCode();

        // Only add if the new code isn't already in the array
        if (!codes.includes(newCode)) {
            codes.push(newCode);
        }
    }

    return codes;
}
Enter fullscreen mode Exit fullscreen mode

Solution 3 - Improving the results with Sets

The 3rd solution is an extension/improvement of the 2nd one. In the 2nd solution, we used arrays. Searching for an element in an array takes O(n)O(n) time. O(n)O(n) is alright, but with big data, it can slow things down a bit. To get around this, we can switch out Arrays for Sets.

Sets do the lookup job in O(1)O(1) time, which is much quicker. And another cool thing about Sets? They hold only unique values. That means they automatically take out any duplicate values. Here's how the code looks when using Sets:

export function generate2_5kCodes(): string[] {
    const codes: Set<string> = new Set();

    while(codes.size < 2500) {
        codes.add(generateRandomCode());
    }

    return Array.from(codes);
}
Enter fullscreen mode Exit fullscreen mode

By integrating Sets into our methodology, we've not only streamlined our process but also considerably boosted its efficiency.

 

HINT 💡 : To insert the codes into the database, use the COPY command instead of INSERT. For bulk inserts of simple data, COPY is way more efficient. See more about COPY: https://www.postgresql.org/docs/current/sql-copy.html

 

Conclusion

Throughout our exploration, we've encountered the unexpected twists brought by the Birthday Paradox while trying to generate unique codes. It served as a clear reminder that even seemingly simple tech challenges can have underlying complexities.

In this article, we didn't just talk about code. We meshed together math, probability, and data structures like arrays and sets. We examined time complexity and used it to improve our code's efficiency. As we continue our journey in tech, articles like this help in appreciating the blend of theory and practice in everyday coding challenges.

💖 💪 🙅 🚩
josethz00
José Thomaz

Posted on September 18, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Math for computer science roadmap
computerscience Math for computer science roadmap

October 30, 2023