Arrays, the slow parts — we can do better

krofdrakula

Klemen Slavič

Posted on March 27, 2019

Arrays, the slow parts — we can do better

Cover photo by Dan Deaner via Unsplash

There are many advantages to arrays as a data structure that make them ideal for certain scenarios, but make them quite unflatteringly slow when using their built-in methods in others. In this article, we'll have a look at some alternative data structures that makes the job much more efficient.

The right tool for the right job

In the previous article we explored Big-O notation so that we can make better decisions about how to analyze our algorithms to achieve better performance. We know that arrays are great when you're accessing an element by index (O(1)), and are great to use when mutations occur at the end of the array (O(1)), so if we can avoid mutations at the beginning of long arrays, our code will perform best. We can even improve the performance of shift() and unshift() by slicing up the array into multiple parts, with the overhead of having to keep track of indices of elements in each slice separately. Laborious, but depending on the choice of slicing, it might be quite fast.

There's one operation that seems to be unavoidably slow for arrays, though: indexOf(), and its related functions, find(), findIndex() and includes(). The latter three functions are just convenience functions that use indexOf() internally, so their performance is identical, if we ignore the cost of the function passed as the parameter.

The O(n) performance means that an array twice as big will take twice as long to search. We can do better. Much better.

Values, variables and references

You may be aware that JavaScript has two types of values: primitives and objects. Variables can refer to these primitives and objects by assigning those values to a name. When a variable references a value, we say that it contains a reference to the value.

const a = 3;     // variable `a` points to the primitive value `3`
const b = {};    // variable `b` points to an object instance
Enter fullscreen mode Exit fullscreen mode

The difference between primitives (like null, undefined, booleans, strings and numbers) and all the other objects is that primitives are immutable -- only one copy ever exists at any given time within the same environment, and they cannot be changed after they're created. No matter how many times you create the same string or number, the result will be the same:

const a = 3;     // we create the integer `3` and assign it to `a`
const b = 3;     // `3` already exists, so `b` points to the same number

const c = 'hello';   // we create the string 'hello' and assign to `c`
const d = 'hello';   // 'hello' exists, so `d` points to the same string
Enter fullscreen mode Exit fullscreen mode

When we say that we compare references, we mean using strict equality (===), which compares two values to see if they are pointing to (referencing) the same thing. Given the above, we should expect the following to all be true:

const a = 'hello'; const b = 'hello'; console.assert(a === b); console.assert(a === 'hello'); console.assert('hello' === b); console.assert('hello' === 'hello'); console.log('All good!')

Still with me? Here's where it gets interesting: whenever you create an object (i.e. not a primitive value), JavaScript allocates new memory for the object, regardless of what it contains, and returns a reference to it. A reference is a kind of unique address for that value, a way for the runtime to know where to look for a value when needed.

And yes, arrays are objects too, so the same rules apply. Let's put it to the test:

const check = (a, b, msg) => console.log(msg + (a === b ? ': yes' : ': no')); const a = {}; const b = {}; const c = b; // check that comparing the value to itself works check(a, a, 'a and a'); check(b, b, 'b and b'); // what about combinations? check(a, b, 'a and b'); check(a, {}, 'a and new'); check({}, b, 'new and b'); // what about newly created objects? check({}, {}, 'new and new'); // what about variables with the same reference assigned? check(c, b, 'c and b');

Even if the objects contain the same primitive values with the same keys, they will have unique references.

There are two data structures that take advantage of this property to great effect: Set and Map.

Keeping track of references using Set

Conceptually, references are numbers that JavaScript uses to find the values in memory for a particular value. Those numbers are hidden inside the internals of the JavaScript engine, but some built-in objects have access to them and this enabled them to provide some unique capabilities.

With arrays, checking that a value is present in it requires searching the elements one by one and seeing if any of the references match the one we're searching for. Set, on the other hand, uses references as numbers to search for a number using binary search trees.

Imagine you have a huge pile of manuscript pages on your desk. You know the pile is ordered, but some of the pages are missing, so you don't have a good idea of where exactly a particular page is, if it is in the pile at all.

You can peek at the top and bottom pages and see that they range between 1 and 1000. Someone asks you to check if page 314 is in the pile. How would you search?

Going from top to bottom would mean it would take you up to 314 steps, so that's not quite efficient. But what if we just pick the middle of the pile to see how close we are?

Let's split the pile roughly in the middle and look at the top page of the bottom half. We discover it's page 562:

|1.....................562.....................1000|
                        ^
Enter fullscreen mode Exit fullscreen mode

Hm, that means it must be in the top part. Let's split the top part again:

|1.........193.........562|
            ^
Enter fullscreen mode Exit fullscreen mode

OK, too far, it's in the lower half now:

          |193...397...562|
                  ^
Enter fullscreen mode Exit fullscreen mode

Close! At this point, would you just flip through the pages to try to find the elusive 314 or continue to split the pile? How do you know when to stop? Which approach would be faster, assuming that splitting the pile takes as much time as flipping a single page? How many steps would you need to finish the task by only splitting the pile?

Let's test this out in code and see how well it performs against a page-by-page search:

// this function creates an array of n numbers with random gaps; // the array is sorted in ascending order and contains unique numbers const createPile = n => { let start = 0; const pile = [start]; while (pile.length < n) { start += 1 + Math.floor(Math.random() * 3); pile.push(start); } return pile; }; // create an array of 1000 numbers const pile = createPile(1000); // uses the list splitting technique described above // returns [steps, index] const smartSearch = (needle, haystack) => { let steps = 0; let min = 0; let max = haystack.length - 1; while (max - min > 1) { steps++; if (haystack[min] === needle) return [steps, min]; else if (haystack[max] === needle) return [steps, max]; const halfway = Math.floor((min + max) / 2); if (haystack[halfway] > needle) max = halfway; else min = halfway; } return [steps, null]; }; // uses a classic for loop from start to finish // returns [steps, index] const naiveSearch = (needle, haystack) => { for (let i = 0; i < haystack.length; i++) { if (haystack[i] === needle) return [i + 1, i]; } return [haystack.length, null]; }; console.log('Smart search [steps, index]', smartSearch(314, pile)); console.log('Naive search [steps, index]', naiveSearch(314, pile));

Depending on the random number list, the list might or might not contain the number 314. You'll notice, however, that there's a stark difference in the amount of steps needed to find (or not find) the value in the random number array.

This approach is called the binary search algorithm. It belongs to a whole family of related algorithms that have different speed and memory tradeoffs that can be applied to specific cases for maximum effect. The expected complexity of the binary search algorithm is O(log2 n). In contrast, includes() uses a linear search algorithm, which has a complexity of O(n).

The Set is a data structure that uses those internal IDs within the JavaScript engine to be able to quickly search through the pile for a given reference and determine if it is in the pile or not.

So how does that compare to Array::includes? Here's a benchmark result on my laptop that compares the runtime performance of using either method on an array of 100k integers:

The higher the op/s (operations per second), the better. In this example on Chrome 73, using a Set to determine whether the chosen number is in the list of numbers is more than 1000 times faster! Here's a link to the benchmark so you can test it out yourself.

Of course, this won't always mean that one method is 1000 times faster; it just means that on the scale of 100k elements, Set ends up being 1000 times faster in this specific example. It will depend on the number of elements you have, and the smaller the set, the less noticeable the difference will be. In most cases involving more than, say, a hundred elements, you should see an improvement of orders of magnitude.

When to use Set

If the problem you're solving requires testing whether a given value is part of a set of values, then this is the data structure for you. Here are a couple of examples:

const bunchOfNumbers = [1,1,2,3,5,5,7,9,11,15,17,17,17,3,2,2,5,5]; // create the set const set = new Set(bunchOfNumbers); console.log('does the set contain 5?', set.has(5)); console.log('does the set contain 16?', set.has(16)); // create an array from the set const unique = Array.from(set); // the array created from the set contains only the unique values console.log('unique values', unique);

Creating associations between values with Map

If Set lets you easily look up references in a set, Map lets you associate that reference with another, essentially mapping one value to another. Before we get into it, let's try to model this behaviour using an array.

To do this we'll start with an array containing a pair of values, or a tuple. A tuple is an ordered list of values, and in our case, our tuples will contain a key and a value.

Note: for this example to work correctly, the keys in the tuples must be unique across the list, otherwise duplicates will terminate the search on the first occurence.

// we can use any type of reference as the key, so let's create an object
const three = { value: 3 };

// construct the list as an array of arrays
const list = [
  ['one', 'eins'],
  [2, 'zwei'],
  [three, 'drei']
];
Enter fullscreen mode Exit fullscreen mode

Next, we need a lookup function. This will take a list and a key, and return the associated value, or undefined if not found.

const get = (list, key) => {
  const pair = list.find(
    (pair) => pair[0] === key
  );
  return pair !== undefined ? pair[1] : undefined;
};
Enter fullscreen mode Exit fullscreen mode

Let's test it out:

const three = { value: 3 }; const list = [ ['one', 'eins'], [2, 'zwei'], [three, 'drei'], [null, NaN] ]; const get = (list, key) => { const pair = list.find( (pair) => pair[0] === key ); return pair !== undefined ? pair[1] : undefined; }; console.log(get(list, 'one')); // 'eins' console.log(get(list, 2)); // 'zwei' console.log(get(list, three)); // 'drei' console.log(get(list, '2')); // undefined console.log(get(list, { value: 3 })); // undefined console.log(get(list, null)); // NaN

Since find() is a linear search, its complexity is O(n), which is far from ideal. And this is where Map can really bring in the big guns.

Just as with Set, it contains a has(key) method which returns a true or false based on reference equality. It also has a get(key) method, which allows us to get the associated value by key.

Now you might be thinking, wait, couldn't we just use objects for this? The answer is yes, so long as all of your keys are strings, otherwise you're setting yourself up for failure. If you wanted to have a lookup by string, a plain old object would do just fine:

const germanNumbers = {
  one: 'eins',
  two: 'zwei',
  three: 'drei'
};

const key = 'one';

germanNumbers[key]; // 'eins'
Enter fullscreen mode Exit fullscreen mode

But this strategy falls flat if you try to assign a key that isn't a string, since all object property lookups get converted to a string first. You wouldn't be able to look up a value given an object reference, since objects are cast to strings, resulting in "[Object object]" by default. And you can't differentiate between 2 (a number) and "2" (a string).

This is the reason we had to implement the list as an array of key, value pairs and use === to compare the values. Map works by letting you assign any reference as the key, not just strings.

Additionally, it enjoys the same speed benefits as Set does, so looking up values in the map also has a complexity of O(log2 n). How about a quick race to see just how fast?

const get = (list, key) => { const pair = list.find( (pair) => pair[0] === key ); return pair !== undefined ? pair[1] : undefined; }; // create a list of 100k numbers, and create values that represent the number // to 3 significant digits const list = Array(100000).fill(0).map((_, n) => [n, n.toPrecision(3)]); // let's repeat the search this many times const numberOfLoops = 5000; const target = 31415; // time how long it takes to find 3141 using linear search const linearStartTime = Date.now(); for (let i = 0; i < numberOfLoops; i++) get(list, target); console.log( 'it took ' + (Date.now() - linearStartTime) + 'ms to find the value for array' ); // what about a map? const map = new Map(list); const mapStartTime = Date.now(); for (let i = 0; i < numberOfLoops; i++) map.get(target); console.log( 'it took ' + (Date.now() - mapStartTime) + 'ms to find the value for map' );

When to use Map

Map can be used to preserve references in cases where you cannot convert a key to a string, or want to avoid casting other primitive values to strings. Its performance is a little worse than object property or array index access (O(log2 n) instead of O(1)).

The most common use case is when you want to create associations between objects. There are generally two ways you can do this:

  • you can assign the associated value to a property on the object; or
  • you can generate unique IDs and use those to look up the values.

The first method can create cyclic references, which makes it impossible to convert those objects to JSON strings. The second requires a lot of bookkeeping for each value being referenced, and can often be impractical and slow to implement.

This is where a Map offers a way out:

// let's create some frozen object so we can't cheat and just assign spouses // as object properties const Jill = Object.freeze({ name: 'Jill' }); const Jane = Object.freeze({ name: 'Jane' }); const John = Object.freeze({ name: 'John' }); const noone = Object.freeze({}); const married = new Map([ [Jill, Jane], // we create an association for Jill -> Jane [Jane, Jill], // we also create a reverse map for Jane -> Jill [John, noone] // John is not married, so John -> noone ]); // who's married to Jill? console.log(married.get(Jill)); // is John taken? console.log(married.get(John));

We can create many different associations by just creating more maps, and we never have to modify the objects.

Caveats to consider when dealing with JSON data

While this means that the values being mapped can still be converted to JSON strings, the Maps themselves cannot, since there is no way to serialize references. In this case, generating unique keys is a necessity, but keeping track of which objects need to have their IDs generated can be handled by another Map instance and used in the replacer function of JSON.stringify(). Similarly, a reviver function can recreate the maps. I wrote an article on this that you might find useful:

Conclusion

If your data requires you to iterate over a collection in order to check the presence of a key or to look up a value, you might consider using Set and Map to use as a data structure instead of arrays. They offer a fast and safe way to look up values, and you can iterate over them or convert them back into strings if needed.

Next time, we'll have a look at their weakly-referenced siblings, WeakSet and WeakMap!

💖 💪 🙅 🚩
krofdrakula
Klemen Slavič

Posted on March 27, 2019

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Arrays, the slow parts — we can do better
computerscience Arrays, the slow parts — we can do better

March 27, 2019