Towards generic high performance sorting algorithms


Costin Manda

Posted on July 27, 2020


Original post at: https://siderite.dev/blog/towards-generic-high-performance-sorting-algorithm/

Intro

I want to examine, together with you, various types of sorting algorithms and the tricks they use to lower the magic O number. I reach the conclusion that high-performance algorithms that are labeled as specific to a certain type of data can be made generic, and that the generic algorithms aren't really that generic either. I end up proposing a new form of function that can be fed to a sorting function in order to reach better performance than the classic O(n*log(n)).

Sorting

But first, what is sorting? Given a list of items that can be compared to one another as lower or higher, return the list in order from lowest to highest. Since an item can be any type of data record, to define a generic sorting algorithm we need to feed it the rules that make one item lower than another, and that is called the comparison function. Let's try an example in JavaScript:

  // random integer from start to end inclusive
  function rand(start,end) {
    return start + Math.floor(Math.random()*(end-start+1));
  }

  // measure time taken by an action and output it in console
  let perfKey=0;
  function calcPerf(action) {
    const key = perfKey++;
    performance.mark('start_'+key);
    action();
    performance.mark('end_'+key);
    const measure = performance.measure('measure_'+key,'start_'+key,'end_'+key);
    console.log('Action took '+measure.duration);
  }

  // change this based on how powerful the computer is
  const size=10000000;
  // the input is a list of size 'size' containing random values from 1 to 50000
  const input=[];
  for (let i=0; i<size; i++)
    input.push(rand(1,50000));

  // a comparison function between two items a and b
  function comparisonFunction(a,b) {
    if (a>b) return 1;
    if (a<b) return -1;
    return 0;  
  }

  const output=[];
  // copy input into output, then sort it using the comparison function
  // same copying method will be used for future code
  calcPerf(()=> {
    for (let i=0; i<size; i++)
      output.push(input[i]);
    output.sort(comparisonFunction);
  });

It's not the crispest code in the world, but it's simple to understand:

  • calcPerf computes the time it takes for an action to run and logs it to the console
  • we start by creating a big array of random numbers as input
  • we copy the input into a result array and sort it with the default sort function, to which we give the comparison function
  • we display the time it took for the operation.

This takes about 4500 milliseconds on my computer.

Focus on the comparison function. It takes two items and returns a number that is -1, 0 or 1 depending on whether the first item is smaller, equal or larger than the second. Now let's consider the sorting algorithm itself. How does it work?

A naive way to do it would be to find the smallest item in the list, move it to the first position in the array, then continue the process with the rest of the array. This would have a complexity of O(n²). If you don't know what the O complexity is, don't worry, it just provides an easy-to-grasp approximation of how the amount of work increases with the number of items in the input. In this case, 10 million records, squared, would lead to 100 trillion operations! That's not good.
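Just to make the naive approach concrete, here is a rough sketch of it (essentially a selection sort); naiveSort is a name I am using purely for illustration:

  // sketch of the naive approach: repeatedly find the smallest
  // remaining item and swap it into place - O(n²) comparisons
  function naiveSort(arr, comparisonFunction) {
    for (let i = 0; i < arr.length - 1; i++) {
      let minIndex = i;
      for (let j = i + 1; j < arr.length; j++) {
        if (comparisonFunction(arr[j], arr[minIndex]) < 0)
          minIndex = j;
      }
      // move the smallest remaining item to position i
      [arr[i], arr[minIndex]] = [arr[minIndex], arr[i]];
    }
    return arr;
  }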

Other algorithms are much better, bringing the complexity to O(n*log(n)), so assuming base 10, around 70 million operations. But how do they improve on this? Surely in order to sort all items you must compare them to each other. The explanation is that if a<b and b<c you do not need to compare a to c. And each algorithm tries to get to this in a different way.

However, the basic logic of sorting remains the same: compare all items with a subset of the other items.

Partitioning

A very common and recommended sorting algorithm is QuickSort. I am not going to go through the entire history of sorting algorithms and what they do, you can check that out yourself, but I can focus on the important innovation that QuickSort added: partitioning. The first step in the algorithm is to choose a value out of the list of items, which the algorithm hopes is as close as possible to the median value and which is called the pivot, then arrange the items in two partitions: the ones smaller than the pivot and the ones larger than the pivot. Then it proceeds to do the same to each partition until the partitions are small enough to be sorted by some other sort algorithm, like insertion sort (used by Chrome by default).

Let's try to do this manually in our code, just the very first partitioning step, to see if it improves the execution time. Lucky for us, we know that the median is around 25000, as the input we generated contains random numbers from 1 to 50000. So let's copy the values from input into two output arrays, then sort each of them. The sorted result is then obtained by reading from the first array, then from the second!

  // two output arrays, one for numbers below 25000, the other for the rest
  const output1 = [];
  const output2 = [];
  const pivot = 25000;

  calcPerf(() => {
    for (let i = 0; i < size; i++) {
      const val = input[i];
      if (comparisonFunction(val, pivot) < 0)
        output1.push(val);
      else
        output2.push(val);
    }
    // sorting smaller arrays is cheaper
    output1.sort(comparisonFunction);
    output2.sort(comparisonFunction);
  });


Now, the performance is slightly better. If we do this several times, the time taken would get even lower. The partitioning of the array by an operation that is essentially O(n) (we just go once through the entire input array) reduces the comparisons that will be made in each partition. If we were using the naive sorting, partitioning would reduce n² to n+(n/2)²+(n/2)² (once for each partitioned half), thus n+n²/2. Each partitioning almost halves the number of operations!

So, how many times can we halve the number of operations? Imagine that we do this with an array of distinct values, from 1 to 10 million. In the end, we would get to partitions of just one element and that means we did a log2(n) number of partitioning steps and for each we added one n (the partitioning operation). That means that the total number of operations is... n*log(n). Each algorithm gets to this in a different way, but at the core of it there is some sort of partitioning, that b value that makes comparing a and c unnecessary.
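To make the idea concrete, here is a rough sketch of a recursive partition-based sort. The pivot choice (the middle element) and the threshold of 32 are arbitrary, purely for illustration, and partitionSort is just a name I made up:

  // sketch: recursively partition around a pivot, then let the
  // built-in sort finish the small pieces - not production code
  function partitionSort(arr, comparisonFunction) {
    if (arr.length <= 32)
      return arr.sort(comparisonFunction);
    const pivot = arr[Math.floor(arr.length / 2)];
    const smaller = [];
    const equal = [];
    const larger = [];
    for (const val of arr) {
      const cmp = comparisonFunction(val, pivot);
      if (cmp < 0) smaller.push(val);
      else if (cmp > 0) larger.push(val);
      else equal.push(val);
    }
    return partitionSort(smaller, comparisonFunction)
      .concat(equal, partitionSort(larger, comparisonFunction));
  }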

Note that we treated the sort algorithm as "generic", meaning we fed it a comparison function between any two items, as if we didn't know how to compare numbers. That means we could have used any type of data as long as we knew the rule for comparison between items.

There are other types of sorting algorithms that only work on specific types of data, though. Some of them claim a complexity of O(n)! But before we get to them, let's make a short detour.

Distinct values

Another useful operation with lists of items is finding the list of distinct items. From [1,2,2,3] we want to get [1,2,3]. To do this, we often use something called a trie, a tree-like data structure that is used for quickly finding whether a value exists or not in a list. It's the thing used for autocomplete or finding a word in a dictionary. It has an O(log n) complexity in checking if an item exists. So in a list of 10 million items, it would take maybe 20 operations to find whether an item exists or not. That's amazing! You can see that what it does is partition the list down to the item level.
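As a rough idea, a trie for strings could be sketched like this (a simplified illustration, not the exact structure a browser or a database would use):

  // minimal trie sketch for strings: each node maps a character
  // to a child node; a flag marks the end of a stored word
  class Trie {
    constructor() {
      this.root = {};
    }
    add(word) {
      let node = this.root;
      for (const ch of word) {
        if (!node[ch]) node[ch] = {};
        node = node[ch];
      }
      node.isWord = true;
    }
    has(word) {
      let node = this.root;
      for (const ch of word) {
        if (!node[ch]) return false;
        node = node[ch];
      }
      return !!node.isWord;
    }
  }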

Unfortunately, this only works for numbers, strings and other such primitive values. If we want to make it generic, we need to use a function that determines when two items are equal and then use it to compare each item to all the other items we have found as distinct so far. That makes using a trie impossible.

Let me give you an example: we take [1,1,2,3,3,4,5] and we use an externally provided equality function:

  • create an empty output of distinct items
  • take first item (1) and compare with existing distinct items (none)
  • item is not found, so we add it to output
  • take next item (1) and compare with existing distinct items (1)
  • item is found, so we do nothing
  • ...
  • we take the last item (5) and compare with existing items (1,2,3,4)
  • item is not found, so we add it to the output

The number of operations that must be taken is the number of total items multiplied by the average number of distinct items. That means that for a list of already distinct values, the complexity is O(n²). Not good! It increases quadratically with the number of items. And we cannot use a trie unless we have some function that would provide us with a distinctive primitive value for an item. So instead of an equality function, a hashing function that would return a number or maybe a string.
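The equality-based walkthrough above could be sketched in code like this (distinctWithEquality is just a name I am using for illustration):

  // sketch: find distinct items using only an equality function;
  // every item is compared against all distinct items found so far,
  // so for mostly distinct inputs this tends towards O(n²)
  function distinctWithEquality(arr, equalityFunction) {
    const result = [];
    for (const item of arr) {
      if (!result.some(existing => equalityFunction(existing, item)))
        result.push(item);
    }
    return result;
  }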

However, given the knowledge we have so far, we can reduce the complexity of finding distinct items to O(n*log(n))! It's as simple as sorting the items, then going through the list and sending an item to the output when it is different from the one before it. One little problem here: we need a comparison function for sorting, not an equality one.
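Sketched in code, assuming we do have a comparison function, it could look something like this (again, just an illustration):

  // sketch: distinct items by sorting first, then keeping each item
  // that differs from the one before it - O(n*log(n)) overall
  function distinctWithComparison(arr, comparisonFunction) {
    const sorted = [...arr].sort(comparisonFunction);
    const result = [];
    for (let i = 0; i < sorted.length; i++) {
      if (i === 0 || comparisonFunction(sorted[i], sorted[i - 1]) !== 0)
        result.push(sorted[i]);
    }
    return result;
  }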

So far

We looked into the basic operations of sorting and finding distinct values. To be generic, one has to be provided with a comparison function, the other with an equality function. However, if we had a comparison function available, finding distinct generic items would become significantly less complex by using sorting. Sorting is better than quadratic comparison because it uses partitioning as an optimization trick.

Breaking the n*log(n) barrier

As I said above, there are algorithms that claim a much better performance than n*log(n). One of them is called RadixSort. BurstSort is an optimized version of it. CountSort is a similar algorithm, as well. The only problem with Radix type algorithms is that they only work on numbers. How do they do that? Well, since we know we have numbers to sort, we can use math to partition the lot of them, thus reducing the cost of the partitioning phase.

Let's look at our starting code. We know that we have numbers from 1 to 50000. We can find that out easily by going once through all of them and computing the minimum and maximum value. O(n). We can then partition the numbers by their value. BurstSort starts with a number of "buckets" or lists, then assigns numbers to the buckets based on their value (dividing the value by the number of buckets). If a bucket becomes too large, it is "burst" into another number of smaller buckets. In our case, we can use CountSort, which simply counts each occurrence of a value in an ordered array. Let's see some code:

  const output = [];
  const buckets = [];
  calcPerf(() => {
    // for each possible value add a counter
    for (let i = 1; i <= 50000; i++)
      buckets.push(0);
    // count all values
    for (let i = 0; i < size; i++) {
      const val = input[i];
      buckets[val - 1]++;
    }
    // create the output array of sorted values
    for (let i = 1; i <= 50000; i++) {
      const counter = buckets[i - 1];
      for (let j = 0; j < counter; j++)
        output.push(i);
    }
  });

This does the following:

  • create an array of 50000 zeros, one bucket for each possible value
  • for each value in the input, increment the bucket for that value
  • at the end, just go through all of the buckets and output the value as many times as its bucket counted

This algorithm generated a sorted output array in 300 milliseconds!

And of course, it is too good to be true. We used a lot of a priori knowledge:

  • min/max values were already known
  • the values were conveniently close-together integers, so we could use them as array indexes

I can already hear you sigh "Awwh, so I can't use it!". Do not despair yet!

The Radix algorithm, supposedly usable only for numbers, is also used on strings. How? Well, a string is reducible to a list of numbers (characters), so one can recursively assign each string to a bucket based on the character value at a certain index. Note that we don't have to go through the entire string, the first few letters are enough to partition the list into small enough lists that can be cheaply sorted.
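For example, a partitioning function for strings could simply return the character code at a given position, or 0 when the string is shorter. A minimal sketch, assuming mostly ASCII strings (character codes above 255 would need more care):

  // sketch: partition strings by the character code at position 'level'
  function stringPartitioningFunction(item, level) {
    if (level >= item.length) return 0;
    return item.charCodeAt(level) & 255;
  }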

Do you see it yet?

A generic partition function

What if we used neither an equality function, nor a comparison function, nor a hashing function as a parameter for our generic sort/distinct algorithm? What if we used a partition function? This partition function would act like a multilevel hashing function returning values that can also be compared to each other. In other words, the generic partition function could look like this:

function partitionFunction(item, level) returning a byte

For strings it returns the numeric value of the character at position level, or 0. For numbers it returns the bytes of the number, from high to low. For object instances with multiple properties, it would return a byte for each level in each of the properties that we want to order by. Radix-style buckets would use the known values from 0 to 255. The fact that the multilevel partitioning function is provided by the user means we can pack into it all the a priori knowledge we have, while keeping the sorting/distinct algorithm unchanged and thus generic! The sorting will be called by providing two parameters: the partitioning function and the maximum level to which it should be called:

sort(input, partitioningFunction, maxLevel)
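To illustrate the object case, a partitioning function for ordering people by age and then by name could be sketched like this. The shape of the object and the choice of levels are my own assumptions, purely for illustration:

  // sketch: partition objects by age first (assumed to fit in a byte),
  // then by the characters of the name on the following levels
  function personPartitioningFunction(person, level) {
    if (level === 0) return person.age & 255;
    const index = level - 1;
    if (index >= person.name.length) return 0;
    return person.name.charCodeAt(index) & 255;
  }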

A final example

Here is an implementation of a radix sorting algorithm that receives a multilevel partitioning function using our original input. Note that it is written so that it is easily read and not for performance:

  // will return a sorted array from the input array
  // using the partitioning function up to maxLevel
  function radixSort(input, partitioningFunction, maxLevel) {
    let buckets = Array.from({length: 256}, () => []);
    buckets[0] = input;
    // reverse order, because level 0 should be the most significant
    for (let level = maxLevel-1; level >=0; level--) {
      let tempBuckets = Array.from({length: 256}, () => []);
      for (let bucketIndex = 0; bucketIndex < buckets.length; bucketIndex++) {
        const bucket = buckets[bucketIndex];
        const bucketLength = bucket.length;
        for (let bucketOffset = 0; bucketOffset < bucketLength; bucketOffset++) {
          const val = bucket[bucketOffset];
          const partByte = partitioningFunction(val, level);
          tempBuckets[partByte].push(val);
        }
      }
      buckets = tempBuckets;
    }
    const output = [].concat(...buckets);
    return output;
  }

  // return value bytes, from the most significant to the least
  // since the values are < 50000, they always fit in 2 bytes
  function partitioningFunction(item, level) {
    if (level === 0) return item >> 8;
    if (level === 1) return item & 255;
    return 0;
  }

  let output3 = [];
  calcPerf(() => {
    output3 = radixSort(input, partitioningFunction, 2);
  });

Wanna know how long it took? 1300 milliseconds.

You can see how the same kind of logic can be used to find distinct values, without actually sorting, just by going through each byte from the partitioning function and using them as values in a trie, right?
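A rough sketch of that idea: use the bytes returned by the partitioning function as a path in a trie and only output items whose path has not been seen before (again, illustration only, radixDistinct is a name I made up):

  // sketch: distinct items using the partitioning function as a trie path
  function radixDistinct(input, partitioningFunction, maxLevel) {
    const root = {};
    const output = [];
    for (const val of input) {
      let node = root;
      for (let level = 0; level < maxLevel; level++) {
        const partByte = partitioningFunction(val, level);
        if (!node[partByte]) node[partByte] = {};
        node = node[partByte];
      }
      // only the first item that reaches this leaf is kept
      if (!node.seen) {
        node.seen = true;
        output.push(val);
      }
    }
    return output;
  }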

Conclusion

This is how a generic multilevel partitioning function replaces comparison, equality and hashing functions with a single concept that can then be used to get high performance from common data operations such as sorting and finding distinct values.

I will want to work on formalizing this and publishing it as a library or something like that, but until then, what do you think?

Wait, there is more!

There is a framework in which something similar is already being used: SQL. It's the most common place where ORDER BY and DISTINCT are used. In SQL's case, the optimization method uses indexes, which are tree-like data structures storing the keys that we want to order or filter by. Gathering the data to fill a database index also has its complexity. In this case, we pre-partition once and sort many times. It's another way of reducing the cost of the partitioning.

However, this is just a sub-type of the partition function that I am talking about, one that uses a precomputed data structure to reach its goal. The multilevel partition function concept I am describing here may be pure code or some other encoding of information we know beforehand, before doing the operation.

Finally, the complexity. What is it? Well, instead of O(n*log(n)) we get O(n*k), where k is the maximum level used in the partition function. This depends on the data, so it's not a constant, but it's closer to the theoretical limit of O(n) for sorting than the classic log version. I am not the best algorithm and data structure person, so if you have ideas about it and want to help me out, I would be grateful.
