Degeneralize your Data Structures
Patrick Kelly
Posted on September 12, 2020
In my last two posts, I've talked up generic programming. In fact, this is my favorite thing in programming. The sheer power of it, and how efficiently it implements its goals, is fantastic. However, you should beware anyone who never shows you downsides or times to avoid the same things they talk up. Everything in our programming toolbox is suitable for specific tasks and fails miserably at others. This is a case where generics, and the abstraction so favored in academia, is best avoided.
So first, so we're all on the same page, what is a data structure? Some might be quick to shout out things like "stack" or "list", although this is incorrect. So, quick computer science primer, because our terminology is going to matter. Abstract Data Types are models that describe the user-viewed semantics of a type: a stack, a list, a heap, etc. Data Structures are how that data is actually structured for the computer: the difference between a resizing array or a linked list or a tree who's leaves form a list. In more practical terms, your Abstract Data Type is your interface
, and the Data Structure is the class
or struct
that implements that interface
.
We're going to talk about why Data Structures should be degeneralized; that is, made not generic.
Let me get out of the way what this isn't: I'm not saying don't ever write generic data structures. In fact, a good language standard library must provide some common ADT's as generic data structures. It's always best that your initial implementation have an adequate solution and get released, rather than trying to perfect it and never be released in the first place.
Consider the case of a table, the simplest of all tables being an associative array. An obvious way to speed up the performance of many operations, such as look ups, is to hash the contents, and use the hashes to fetch the buckets. At long as no collisions had occurred, this is highly effective. This is commonly done with Dictionary<Key, Value>
types across languages. But what about when the Key
is a numeric type? Numeric types aren't reference types, they always have unique identifiers regardless of the instance, and therefore there's no need to hash them at all. With the fully generalized data structure we don't have a choice. While the hash function could just return the instance value, the simple fact is, the hash set functionality internal to the associative array is still occurring, even though we don't need it. Why pay for something you aren't going to use?
Some languages have fantastic capabilities for addressing these situations. C++ has template specialization, for instance, where the template/generic, given certain parameters, instantiates differently. In cases where a language doesn't support this, there is luckily still ways of addressing it, although not as elegantly. A close relative, C#, lacks this possibility, but does have work arounds. The mechanisms for doing so are likely, but not universally, applicable to other generic languages lacking specialization.
If the degeneralization only needs to occur on operations, then generic functions or generic extension methods can potentially address this. In every language I've come across where generic subroutines exist, overload resolution occurs with starting preference to the most specialized, and will therefore select your specialized operation when possible.
However, if degeneralization would obviate internals, you need to write a new data structure implementing that specialization. There are practices, like OOP hierarchies and CRTP, which allow for substantial code reuse and can make this far less daunting than it originally seems. You'd still have your generic associative array which uses a hash set to assist with faster look ups. Only now, you'd also have associative arrays that accept numeric types as the keys and don't use any hash set internals at all. That means less memory usage, and avoiding completely useless parts of the operations, which improves performance.
But wait, these are different types and that requires relearning API's!
Remember where I said understanding the difference between ADT's and DS's was important for this? Here's where. As long as every associative array shares the same ADT (interface), then code written to the interface can have the DS (implementation) swapped out, recompiled, and just work. No Todd Howarding going on here.
Not only can we get improved performance or lighter weight out of this, but it can sometimes even offer new features. Recently, I needed a specialized Trie, itself a specialized Tree structure optimized around retrievals. This Trie held a single character in each node, and only ever a single character. It specifically did not allow the possibility of holding chunks of text in a node. Furthermore, there was atypically the requirement that any node could be a terminal, not exclusively the leaves. This clearly is outside of the relm of generic data structures. So, I implemented it by hand. Didn't actually take too long, maybe 40 minutes at most, including some debugging. Because I knew so much about the Trie, including the "every node is a character", that means the Trie could also be used for... parsing. Not only was the tree optimized for retrievals, making it a trie, but it was done so in a way that allowed character-by-character parsing to occur. Furthermore, because these were the same operation, the entire process could be done as a single traversal through the Trie, finding either a leaf, or reaching the end of the sequence being retrieved.
Don't be afraid to degeneralize and specialize your data structures. They're easier to maintain the more specialized they are, because they aren't trying to cater to everything out there.
Posted on September 12, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.