[V8 Deep Dives] Understanding Map Internals
Andrei Pechkurov
Posted on October 6, 2020
With this blog post, I am starting V8 Deep Dives series dedicated to my experiments and findings in V8, which is, no doubt, a well-engineered and sophisticated software. Hopefully, you will find this blog post valuable and share your ideas for the next topic.
Intro
ECMAScript 2015, also known as ES6, introduced many built-in collections, such as Map, Set, WeakMap, and WeakSet. They appeared to be an excellent addition to the standard JS library and got widely adopted in libraries, applications, and Node.js core. Today we are going to focus on Map collection and try to understand V8 implementation details, as well as make some practical conclusions.
The spec does not dictate a precise algorithm used to implement Map support, but instead gives some hints for possible implementations and expected performance characteristics:
Map object must be implemented using either hash tables or other mechanisms that, on average, provide access times that are sublinear on the number of elements in the collection. The data structures used in this Map objects specification is only intended to describe the required observable semantics of Map objects. It is not intended to be a viable implementation model.
As we see here, the spec leaves a lot of room for each implementer, i.e., JS engine, but does not give a lot of certainty on the exact algorithm, its performance, or memory footprint of the implementation. If your application deals with Maps on its hot path or you store a lot of data in a Map, such details may be certainly of great help.
As a developer with a Java background, I got used to Java collections, where one can choose between multiple implementations of Map interface and even fine-tune it if the selected class supports that. Moreover, in Java it is always possible to the open source code of any class from the standard library and get familiar with the implementation (which, of course, may change across versions, but only in a more efficient direction). So, that is why I could not stand not to learn how Maps work in V8.
Now, let’s start the dive.
Disclaimer. What’s written below is implementation details specific to V8 8.4 bundled with a recent dev version of Node.js (commit 238104c to be more precise). You should not expect any behavior beyond the spec.
Underlying Algorithm
First of all, Maps in V8 are built on top of hash tables. The subsequent text assumes that you understand how hash tables work. If you are not familiar with the concept, you should learn it first (e.g., by reading this wiki page) and then return here.
If you have substantial experience with Maps, you might already notice a contradiction here. Hash tables do not provide any order guarantees for iteration, while ES6 spec requires implementations to keep the insertion order while iterating over a Map. So, the “classical” algorithm is not suitable for Maps. But it appears that it is still possible to use it with a slight variation.
V8 uses the so-called deterministic hash tables algorithm proposed by Tyler Close. The following TypeScript-based pseudo-code shows main data structures used by this algorithm:
Here CloseTable interface stands for the hash table. It contains hashTable array, which size is equal to the number of buckets. The Nth element of the array stands for the Nth bucket and holds an index of the bucket’s head element in the dataTable array. In its turn, dataTable array contains entries in the insertion order. Finally, each Entry has chain property, which points to the next entry in the bucket’s chain (or singly linked list, to be more precise).
Each time when a new entry is inserted into the table, it is stored in the dataTable array under the nextSlot index. This process also requires an update in the chain of the corresponding bucket, so the inserted entry becomes the new tail.
When an entry is deleted from the hash table, it is removed from the dataTable (e.g., with = undefined). As you might notice, this means that all deleted entries still occupy space in the dataTable.
As the last piece of the puzzle, when a table gets full of entries (both present and deleted), it needs to be rehashed (rebuilt) with a bigger (or smaller) size.
With this approach, iteration over a Map is just a matter of looping through the dataTable. That guarantees the insertion order requirement for iteration. Considering this, I expect most JS engines (if not all of them) to use deterministic hash tables as the building block behind Maps.
Algorithm in Practice
Let’s go through more examples to see how the algorithm works. Say, we have a CloseTable with 2 buckets (hashTable.length) and total capacity of 4 (dataTable.length) and the hash table is populated with the following contents:
In this example, the internal table representation can be expressed like the following:
If we delete an entry by calling table.delete(1), the table turns into this one:
If we insert two more entries, the hash table will require rehashing. We will discuss this process in more detail a bit later.
The same algorithm can be applied to Sets. The only difference is that Set entries do not need value property.
Now, when we have an understanding of the algorithm behind Maps in V8, we are ready to take a deeper dive.
Implementation Details
The Map implementation in V8 is written in C++ and then exposed to JS code. The main part of it is defined in OrderedHashTable and OrderedHashMap classes. We already learned how these classes work, but if you want to read the code yourself, you may find it here, here, and, finally, here.
As we are focused on the practical details of V8’s Map implementation, we need to understand how table capacity is selected.
Capacity
In V8, hash table (Map) capacity is always equal to a power of two. As for the load factor, it is a constant equal to 2, which means that max capacity of a table is 2 * number_of_buckets. When you create an empty Map, its internal hash table has 2 buckets. Thus the capacity of such a Map is 4 entries.
There is also a limit for the max capacity. On a 64-bit system that number would be 2²⁷, which means that you can not store more than around 16.7M entries in a Map. This restriction comes from the on-heap representation used for Maps, but we will discuss this aspect a bit later.
Finally, the grow/shrink factor used for rehashing is equal to 2. So, as soon as a Map gets 4 entries, the next insert will lead to a rehashing process where a new hash table of a twice as big (or less) size will be built.
To have a confirmation of what may be seen in the source code, I have modified V8 bundled in Node.js to expose the number of buckets as a custom buckets property available on Maps. You may find the result here. With this custom Node.js build we can run the following script:
The above script simply inserts 100 entries into an empty Map. It produces the following output:
As we see here, the Map grows as a power of two when map capacity is reached. So, our theory is now confirmed. Now, let’s try to shrink a Map by deleting all items from it:
This script produces the following output:
Again, we see that the Map shrinks as a power of two, once there are fewer remaining entries than number_of_buckets / 2.
Hash Function
So far, we did not discuss how V8 calculates hash codes for keys stored in Maps, while this is a good topic.
For number-like values (Smis and heap numbers, BigInts and other similar internal stuff), it uses one or another well-known hash function with low collision probability.
For string-like values (strings and symbols), it calculates hash code based on the string contents and then caches it in the internal header.
Finally, for objects, V8 calculates the hash code based on a random number and then caches it in the internal header.
Time Complexity
Most Map operations, like set or delete, require a lookup. Just like with the “classical” hash table, the lookup has O(1) time complexity.
Let’s consider the worst-case when the table has N out of N entries (it is full), all entries belong to a single bucket, and the required entry is located at the tail. In such a scenario, a lookup requires N moves through the chain elements.
On the other hand, in the best possible scenario when the table is full, but each bucket has 2 entries, a lookup will require up to 2 moves.
It is a well-known fact that while individual operations in hash tables are “cheap”, rehashing is not. Rehashing has O(N) time complexity and requires allocation of the new hash table on the heap. Moreover, rehashing is performed as a part of insertion or deletion operations, when necessary. So, for instance, a map.set() call could be more expensive than you would expect. Luckily, rehashing is a relatively infrequent operation.
Memory Footprint
Of course, the underlying hash table has to be somehow stored on the heap, in a so-called “backing store”. And here comes another interesting fact. The whole table (and thus, Map) is stored as a single array of fixed length. The array layout may be illustrated with the below diagram.
Specific fragments of the backing store array correspond to the header (contains necessary information, like bucket count or deleted entry count), buckets, and entries. Each entry of a bucket chain occupies three elements of the array: one for the key, one for the value, and one for the “pointer” to the next entry in the chain.
As for the array size, we can roughly estimate it as N * 3.5, where N is the table capacity. To have an understanding of what it means in terms of memory footprint, let’s assume that we have a 64-bit system, and pointer compression feature of V8 is disabled. In this setup, each array element requires 8 bytes, and a Map with the capacity of 2²⁰ (~1M) should take around 29 MB of heap memory.
Summary
Gosh, that was a long journey. To wraps things up, here is a shortlist of what we have learned about Maps in V8:
- V8 uses deterministic hash table algorithm to implement Maps, and it is very likely that other JS engines do so.
- Maps are implemented in C++ and exposed via JS API.
- Just like with “classical” hash maps, lookups required for Map operations are O(1) and rehashing is O(N).
- On a 64-bit system, when pointer compression is disabled, a Map with 1M entries occupies ~29 MB on the heap.
- Most of the things described in this blog post can also be applied to Sets.
That’s it for this time. Please share your ideas for the next V8 Deep Dive.
Posted on October 6, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.