Reproducible randomness and its applications

krofdrakula

Klemen Slavič

Posted on November 13, 2022

Reproducible randomness and its applications

Cover page by cottonbro studios

XKCD: #221 Random Number

Dilbert: 25 Oct 2001

While these comic strips might seem absurd, they actually carry deep insight into random numbers and how they're generated and used. Don't get me wrong; the joke is funny, but there's more to it than you may think.

Throwing dice in code

Randomness in programs is tricky — computers are ruled by determinism which means that a computer with the same inputs can only generate the same output every time. For true randomness, it needs an outside source to derive sequences of true random numbers. The physical processes involved in generating these sources of entropy are often too slow for typical usage, and not all hardware allows for such true random sources.

There is a faster compromise, though — algorithms called Pseudorandom Number Generators, or PRNG for short.

To generate a sequence of pseudorandom numbers, you must provide a seed value to the algorithm. The simplest way to think about what this value represents is to imagine a large book filled with random digits. Picking a seed value means flipping to a particular page of that book, finding the correct line and column on the page and start reading the sequence left-to-right, top-to-bottom, reading as many digits at a time as required. When you reach the end of the book, you flip to the first page and continue from there.

It might surprise you that a printed book was the source scientists and engineers used to obtain large amounts of random numbers for experimental purposes before the advent of accessible computing machines. The book contained sequences that were guaranteed to satisfy all of the important randomness criteria known at the time. You can even buy the book they used and try it for yourself! It sure beats rolling dice or flipping coins.

A million base-10 digits isn't really that much when you consider modern computers, however — the entire sequence would fit on a single floppy disk. The problem with such a relatively short sequence is that it loops back onto itself after using up enough digits. To generate a 64-bit floating number you'd need to use 20 base-10 digits at a time ( log1026420log_{10} 2^{64} \approx 20 ), which means that this random number sequence can only generate 50,000 numbers before the cycle starts again.

PRNGs works slightly differently. A seed value sets up the internal state which is used to shuffle bits around every time a new number in the sequence is generated. The idea is that a good PRNG algorithm will produce an impossible-to-guess sequence while satisfying as much of the criteria tests for randomness as possible. That way, the PRNG effectively writes the book of numbers while you're reading it, but the book doesn't need to be stored anywhere, and every seed value generates a different book.

Note: Some PRNGs may appear random, but given enough of its output, researchers have managed to deduce the PRNGs internal state, therefore making the sequence predictable, which is why it is important to use cryptographically secure PRNGs seeded with true random numbers.

Depending on the chosen algorithm, the book the PRNG writes could be very long — 21993712^{19937}-1 bits in the case of Mersenne Twister used in the examples below. But there are as many different sequences as there are seed values, so you can be sure not to run out of fresh unpredictable sequences of numbers any time soon. Crucially, the same seed value will always produce the same book.

Unpredictable yet reproducible

To see this in action, let's take a look at a side-by-side comparison between a seeded PRNG and the built-in Math.random() function in JavaScript:

In this example, the left column of names is picked using values from a Mersenne Twister generator set up with the initial seed value from the input field. In the right column, picking is done using Math.random() which is seeded by the browser for you (and no way for you to configure).

You'll notice that every time you rerun the example the left column always produces the same list of names given the same seed value. The right column generates a new list of names each time the page loads or you press the button.

The problem with the solution on the right is that after you press the button, the original pseudorandom sequence is gone and there is no way to reproduce it. If you happen to notice an interesting sample and you accidentally miss the opportunity to capture that state, then you're out of luck. With a seed value, all you have to do is capture that number and use it to rebuild the state.

This property also makes things like game world sharing possible by simply sharing a single short string or number and running the same algorithms to produce the same results everywhere.

At the end of the day, if you wanted randomness every time you boot up, you could simply choose a new seed every time by using an external value like Date.now(), Math.random() * Number.MAX_SAFE_INTEGER or some other means.

A practical example

Sometimes, the sheer number of different combinations of inputs might make it infeasible to do a thorough review of the runtime behaviour of a program. In those cases, having generators on hand to generate lots of unpredictable input could help spot problems that you hadn't anticipiated.

In order to make things easier to manage and compose, let's explore a construct in JavaScript called generators, which will let us control and orchestrate generating large amounts of data easily and without the need for writing hundreds of loops.

JavaScript generators and generator functions

To create a generator, we can leverage JavaScript generator functions. The nice thing about generator functions is that they evaluate lazily, meaning that users pull values from them rather than populating a collection like an array ahead of time. A generator function can use yield to emit a value, which pauses its execution until the next value is requested.

Note: The difference between generator functions and generators is that invoking a generator function will return a generator. A generator is an object with at least the next() method that returns an object with the properties { done, value }. If done is true, then the generator has finished generating its sequence and will no longer emit values, which also terminates iteration. This API is called the Iterator protocol and any object mathching this API can be used in a for...of loop.

Let's start with a simple example that takes a starting number and an increment step to generate an infinite sequence of integers:

function* int(start = 0, step = 1) {
  let i = start;
  while(true) {
    yield i;
    i += step;
  }
}

const generator = int();

for (const n of generator) {
  console.log(n);
}
Enter fullscreen mode Exit fullscreen mode

In this example, calling int() returns a generator. When the for...of loop calls the next() method, the generator runs the function body until it hits a yield keyword which pauses the generator and emits the next value. If the generator function returns, no more values are emitted.

Note: There is more to generator functions than is described here. yield statements return values passed to the next() method to provide two-way communication between the generator and consumer. A generator function can also delegate to other generators using yield*. They can yield Promises and take advantage of await statements and enable asynchronours iteration.

take

Most of the time, we'd like to limit the amount of items in the generated sequence without having to write for loops, so we'll define a take generator function that terminates the sequence when the limit is reached or the underlying generator is exhausted:

const take = function* (count, generator) {
  for (let i = 0; i < count; i++) {
    const { value, done } = generator.next();
    if (done) return;
    yield value;
  }
  generator.return?.();
};

const first10Numbers = take(10, int(1));

// will print [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
console.log([...first10Numbers]); 
Enter fullscreen mode Exit fullscreen mode

Note: The return?.() call signals to the wrapped generator that we are done consuming its values. This enables generators to clean up any side effects they may have created, like closing network connections, or cleaning up any open resources like I/O handles or caches. The nullish coalescing operator is there to protect in cases where the generator does not provide that function.

zip

For convenience, let's also define a zip generator. Zipping in this context means taking multiple generators, requesting a value from each of them and emitting a single value as an array of all of the values. This makes it easy to group results from multiple generators into a single value:

const zip = function* (...generators) {
  while (true) {
    const results = [];
    for (const g of generators) {
      const v = g.next();
      if (v.done) {
        generators.forEach(g => g.return?.());
        return;
      }
      results.push(v.value);
    }
    yield results;
  }
}

const sumsTo10 = zip(int(0, 1), int(10, -1));

// will print [[0, 10], [1, 9], [2, 8], [3, 7], [4, 6]]
console.log([...take(5, sumsTo10)]);
Enter fullscreen mode Exit fullscreen mode

map

Next we'll create a mapgenerator that returns a generator that translates values from the underlying generator using the provided mapping function:

const map = function*(mapper, generator) {
  for (const value of generator)
    yield mapper(value);
}

/**
 * Returns an ordinal representation of an integer. Only works
 * with positive integers.
 */
const intToOrdinal = n => {
  if (n % 10 == 1 && n % 100 != 11) {
    return `${n}st`;
  } else if (n % 10 == 2 && n % 100 != 12) {
    return `${n}nd`;
  } else if (n % 10 == 3 && n % 100 != 13) {
    return `${n}rd`;
  }
  return `${n}th`
};

const ordinals = map(intToOrdinal, int(1));

// will print ['1st', '2nd', '3rd', '4th', '5th']
console.log([...take(5, ordinals)]);
Enter fullscreen mode Exit fullscreen mode

Creating random number-driven generators

So far, we've only dealt with regular sequences, but that doesn't mean we can't use our PRNG to generate an (almost) never-ending stream of random numbers:

import mt from 'mersennetwister';

const rnd = function* (seed) {
  const prng = new mt(seed);
  while(true) yield prng.real();
}

// will always print [1125387415, 2407456957, 681542492]
console.log([...take(3, rnd(1337))]);
Enter fullscreen mode Exit fullscreen mode

To make this more reusable, let's create a generator that takes an array and seed value and returns a generator that will randomly pick values from that array:

import mt from 'mersennetwister';

const fromArray = function* (arr, seed) {
  const prng = new mt(seed);
  while(true) yield arr[prng.int() % arr.length];
}

const sample = ['apple', 'pear', 'banana', 'kiwi'];

// will always print ["kiwi", "pear", "apple"]
console.log([...take(3, fromArray(sample, 1337))]);
Enter fullscreen mode Exit fullscreen mode

Putting it all together

With these utilities, we can compile a sample data set as arrays to generate fake personal information to use in our app. We start by creating the generators for first and last names separately:

import * as data from './data/people';
import { fromArray } from './utils';

export const firstNames = (seed) => fromArray(data.firstNames, seed);

export const lastNames = (seed) => fromArray(data.lastNames, seed);

export const fullNames = (seed) => map(
  n => n.join(' '),
  zip(firstNames(seed), lastNames(seed + 10))
);
Enter fullscreen mode Exit fullscreen mode

So what is going on here? firstNames and lastNames return generators for first and last names respectively, and fullNames composes them and returns a generator that emits values with both joined with a space.

We can use these generators to quickly build up a mock application that shows the generated personal info as a list of avatars:

This example is immediately usable as a review tool, a performance benchmark and as a smoke testing tool. You can immediately see a myriad of first and last name combinations and generate a ton of them to measure how fast they render. For example, there is a bug in the mapping function that extracts initials from the generated names that is immediately apparent:

A screenshot highlighting the bug in one of the avatar components

Hint: JavaScript strings are encoded using UTF-16, index access in strings always returns a 16-bit value (a code unit) along that boundary. If you think you can crack it, fork the example and post your solution in the comments!

Being able to reproduce this state easily by simply sharing a few parameters means the developer can now exactly trace where the mistake is without having to guess where the problem lies.

Adopting seeded PRNG in your own project

While this generator-based approach can work well when building up your own data generation pipeline or niche data set (it has for me on several occasions on large commercial projects), it might sometimes be easier to adopt other generation libraries that provide sensible and realistic data already included in the example set. Here are a few libraries that I've used in the past to leverage seeded randomness:

  • faker — the controversial grandaddy of fake data generation
  • fiona — a leaner library with a similar feature set

Beyond sequences

While PRNG sequences are useful for discrete data generation, they can use them as a tool to combine with other techniques like interpolation which produce contiguous multi-dimensional fields. These can be used to produce textures that can mimic natural materials or model organic-looking structures based off of a few simple mathematical expressions. But that is a topic we can address in a future article.

💖 💪 🙅 🚩
krofdrakula
Klemen Slavič

Posted on November 13, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Reproducible randomness and its applications
programming Reproducible randomness and its applications

November 13, 2022