I’ve been writing C as my preferred language for many years now, but in the kinds of projects I’ve worked on I’ve never needed to use a hashmap. That’s not because I’ve never encountered a use case where it makes sense, but because the software I’ve worked on has always been dealing with a few thousand elements at the very most, and not doing too many searches of those elements.
With so few elements, any processor with a cache and prefetcher worth its salt will yield excellent performance using just a plain dynamic array, which is trivial to implement as needed. So that’s what I’ve used.
However, on day 23 of 2021’s Advent of Code I finally had need of a hashmap as an optimisation. For that programming problem I found myself dealing with hundreds of thousands of elements and doing (probably) millions of lookups. Switching from a dynamic array to a hashmap improved the runtime of my solution from 10 seconds to 0.3 seconds. For that I used klib’s excellent hashmap implementation.
While I know how it works conceptually, there’s a huge difference between reading about how something works and implementing it yourself, so I wanted to finally learn how to implement my own hashmap from scratch. In this blog post, I’ll be sharing what I came up with.
Design considerations
As Wikipedia will tell you, there are a lot of ways to build a hashmap, each with its own set of pros and cons. I’ll be focusing on the simplest possible approach, just to get something working. I can always make more sophisticated optimisations later as needed.
Hashing
A hash map needs a hash function, but I’ve never actually written a hash function before either. I don’t know what a good hash function for this use case would actually be, since I’ve only ever heard of checksums and cryptographically secure hashes.
After asking some friends on IRC, I settled on djb2, a very simple, but apparently effective, hash function by the famous DJB.
The function is very simple:
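Here it is as commonly published, seeding the hash with 5381 and multiplying by 33 via a shift and add:

```c
unsigned long
djb2(const unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++) != 0)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
```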
That’s actually simple enough that I could pull it from memory if the need should ever arise in future. Nice!
Collision resolution
So what happens if two keys’ hashes point to the same bucket? The two most common solutions seem to be separate chaining and open addressing. The former means each bucket holds a linked list containing all the key-value pairs that live in that bucket. The latter says to just roll over onto the next adjacent bucket until you find one that has space.
In the interests of simplicity I’m going to use open addressing. That way I won’t also have to muck around with linked lists; I’ll just have a big array of buckets to oversee.
Open addressing does have an issue though. Let’s look at what happens if a conflict occurs but then the key that ‘won’ the conflict is deleted:
- The hashmap is initially empty
- `fizz` is looked up, which maps to bucket 1
  - Bucket 1 is empty, so the lookup fails
- `foo=bar` is inserted, mapping to bucket 1
- `fizz=buzz` is inserted, also mapping to bucket 1
  - Bucket 1 is occupied, so it goes into bucket 2 instead
- `fizz` is looked up
  - Bucket 1 is occupied but has the wrong key
  - Bucket 2 is occupied and has the right key, so the lookup succeeds
- `foo` is deleted, leaving bucket 1 empty
- `fizz` is looked up, which maps to bucket 1
  - Bucket 1 is empty, so the lookup fails
The key for `fizz` is now in the wrong bucket, with no indication that there was ever a conflict. We could check the next bucket just in case, but what if there were two conflicts and `fizz` is actually in bucket 3? We don’t want to fall back to a full search of the entire hashmap for a missing key. So what do we do?
A peek at klib’s code gives the solution. When deleting a value, instead of marking its bucket as empty, mark it as ‘deleted’ instead. When writing, treat a deleted bucket the same as an empty one. When reading, skip over deleted buckets but don’t stop searching until an empty non-deleted bucket is found.
That way, even though `fizz` is in the wrong bucket, the search will still move past the ‘deleted’ bucket onto the one `fizz` is actually stored in.
Load factor
How full should we allow the hashmap to get before increasing its size? If we let it get too full then we’ll start to have many collisions and long chains of contiguous entries that have to be searched linearly, losing that O(1) performance we’re looking for.
If we oversize it, we waste memory.
Wikipedia suggests a factor of 0.6-0.75 is good. I’m going to keep the maths simple and double the size of the hashmap any time we get beyond a load factor of 0.75, and halve the size of the hashmap any time that it drops below 0.25.
Type specialisation
I quite like klib’s approach of using macros to generate type specific versions of its hashmap as needed, similar to how C++’s templating works. I’m not going to incorporate that in my first attempt though. Instead I’ll use hardcoded key and value types for now, but typedef them so that they’re easy to change out later.
The types
With those decisions made, we’re ready to begin:
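Here’s a sketch of those definitions. The exact identifiers are illustrative rather than gospel, and string keys with `int` values are placeholder choices:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hardcoded for now, but typedef'd so they're easy to swap out later. */
typedef const char *hashmap_key_t;
typedef int hashmap_value_t;

/* The three states a bucket can be in. EMPTY is zero so that a
 * zeroed-out hashmap is a valid empty one. */
enum hashmap_state {
    HASHMAP_STATE_EMPTY = 0,
    HASHMAP_STATE_VALID,
    HASHMAP_STATE_DELETED,
};

/* Keys, values, and states live in three separate parallel arrays
 * rather than one array of buckets; more on that choice later. */
struct hashmap {
    hashmap_key_t *keys;
    hashmap_value_t *vals;
    enum hashmap_state *states;
    size_t cap; /* buckets allocated */
    size_t len; /* valid entries stored */
};
```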
So first we typedef some custom types for easy replacement later. Then we create an enum for tracking the state of each bucket in our hashmap. Lastly we define the hashmap struct itself.
The functions
Initialisation
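A sketch of `hashmap_init`, which on this design needs to do no more than zero the struct:

```c
#include <string.h>

void
hashmap_init(struct hashmap *map)
{
    /* A zeroed-out struct is a valid, empty hashmap. */
    memset(map, 0, sizeof(*map));
}
```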
I’m a fan of data structures that are valid when zeroed out. That allows me to instantiate them on the stack effortlessly like so:
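```c
/* A zeroed-out struct is already a valid, empty hashmap. */
struct hashmap map = {0};
```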
But users of it shouldn’t need to know that, so if they `malloc` the hashmap it’s good to have the `hashmap_init` function for them to use without needing to know about the design choices I’ve made.
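Next is `hashmap_free`; a sketch, assuming the three parallel arrays from the struct above:

```c
#include <stdlib.h>

void
hashmap_free(struct hashmap *map)
{
    free(map->keys);
    free(map->vals);
    free(map->states);
    /* Zeroing is optional, but stale pointers in a core dump are much
     * harder to spot than null ones. */
    memset(map, 0, sizeof(*map));
}
```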
Again, this is nothing exciting, just some good housekeeping. Zeroing the struct out isn’t strictly necessary but will help generate nice clean crashes if misused. A null pointer is way more obvious than a stale one in a core dump.
Hashing
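The same djb2 loop, now as a static helper; the `hashmap_hash` name is my own:

```c
static size_t
hashmap_hash(const char *key)
{
    size_t hash = 5381;
    int c;

    while ((c = (unsigned char)*key++) != 0)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
```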
This is exactly as I described before, except that it’s now returning a `size_t`. This is just done as a convenience, since I know I’ll be using it as an index into my buckets, albeit with the modulo operator. It saves me some casting, but I also know that it’ll be 64 bits wide most everywhere, unlike `unsigned int`.
Getters & setters
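Something along the lines of klib’s `kh_key`/`kh_val`/`kh_exist` accessors; these exact macro names are stand-ins:

```c
/* Does the iterator point at a live entry? */
#define hashmap_exists(map, it) \
    ((it) < (map)->cap && (map)->states[(it)] == HASHMAP_STATE_VALID)

/* Direct access to a bucket's key and value. These expand to lvalues,
 * so they can be assigned through as well as read. */
#define hashmap_key(map, it) ((map)->keys[(it)])
#define hashmap_val(map, it) ((map)->vals[(it)])
```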
Again inspired by klib, I decided to make these accessor functions preprocessor macros, to give the compiler the best opportunity to optimise them. Arguably these could just be `static inline` functions in the header, but for the key and value macros there is one nice advantage:
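```c
/* Assigning straight through the macro, which expands to an lvalue.
 * (Illustrative: `map` and `it` come from an earlier lookup.) */
hashmap_val(&map, it) = 42;
```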
That kind of assignment can be done without having to dance around pointers.
Insertion
With the boilerplate done, we can get onto the meat of the hashmap.
Insertion is a nice simple operation conceptually:
- Hash the key, modulo by the number of buckets to get the bucket index
- If the bucket is occupied with a different key, try the next bucket
- Write the key and value into the bucket, mark it as valid, and update `len`
In code, this only takes a few minutes to whip up:
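What follows is a sketch; `hashmap_set` is a stand-in name, and note that it stores the key pointer as-is rather than copying the string:

```c
bool hashmap_resize(struct hashmap *map); /* defined later */

bool
hashmap_set(struct hashmap *map, hashmap_key_t key, hashmap_value_t val)
{
    size_t i;

    /* Make sure buckets exist and the load factor stays reasonable. */
    if (!hashmap_resize(map))
        return false;

    /* Probe from the key's home bucket until we find its existing
     * entry or a free (empty or deleted) bucket. */
    i = hashmap_hash(key) % map->cap;
    while (map->states[i] == HASHMAP_STATE_VALID &&
            strcmp(map->keys[i], key) != 0)
        i = (i + 1) % map->cap;

    if (map->states[i] != HASHMAP_STATE_VALID)
        map->len++; /* a new entry rather than an overwrite */

    map->keys[i] = key;
    map->vals[i] = val;
    map->states[i] = HASHMAP_STATE_VALID;
    return true;
}
```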
One operation I didn’t mention before is the `hashmap_resize` function that I call. That is going to be the most difficult part of the hashmap, so I’ll save it for last. We can just assume that if it returns true then the hashmap has buckets allocated and a reasonable load factor.
Removal
Compared to insertion, removal is trivial. We’re going to assume that the caller already has an iterator to the bucket they wish to remove, so we just need to check it’s a valid iterator, mark the bucket deleted if so, and then shrink the hashmap if necessary.
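A sketch, with `hashmap_remove` as a stand-in name and the bucket index serving as the iterator:

```c
void
hashmap_remove(struct hashmap *map, size_t it)
{
    if (!hashmap_exists(map, it))
        return;

    /* Tombstone the bucket rather than emptying it, so that lookups
     * keep probing past it. */
    map->states[it] = HASHMAP_STATE_DELETED;
    map->len--;

    /* May shrink the hashmap if the load factor dropped below 0.25. */
    hashmap_resize(map);
}
```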
Lookup
Now for the most useful function of all: looking up a value in the hashmap by key. Conceptually it’s quite similar to what we do for insertion. We hash the key and modulo it to get the bucket index the value should live in. If it’s the right bucket, return its index. If not, but the bucket is populated or deleted, try the next bucket in case there was a collision.
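A sketch; `hashmap_get` is a stand-in name, and ‘not found’ is signalled by returning `map->cap`, an index no real bucket can have:

```c
size_t
hashmap_get(struct hashmap *map, hashmap_key_t key)
{
    size_t i;

    if (map->cap == 0)
        return map->cap; /* no buckets at all, so nothing to find */

    /* Relies on the resize logic always leaving some empty buckets,
     * otherwise a missing key would probe forever. */
    i = hashmap_hash(key) % map->cap;
    while (map->states[i] != HASHMAP_STATE_EMPTY) {
        if (map->states[i] == HASHMAP_STATE_VALID &&
                strcmp(map->keys[i], key) == 0)
            return i; /* found: the index doubles as an iterator */
        i = (i + 1) % map->cap; /* occupied or deleted: keep probing */
    }
    return map->cap; /* out of range, so hashmap_exists() is false */
}
```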
Resizing
We can read from and write to the hashmap, but we’ve yet to provide that resize function we called before.
Resizing is going to be an expensive operation, since it involves allocating memory, iterating through every element in the hashmap, and rehashing all of their keys, so we don’t want to do it regularly. The first thing to determine, then, is whether we should resize at all. Most of the time we won’t: that’s whenever our load factor is between 0.25 and 0.75. And even when the load factor drops below 0.25, we should maintain a minimum size; I’ve picked 128 somewhat arbitrarily.
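One way to phrase that check with integer maths; the helper and its name are my own framing:

```c
static bool
hashmap_should_resize(const struct hashmap *map)
{
    if (map->cap == 0)
        return true; /* no buckets yet: grow to the minimum size */
    if (map->len * 4 > map->cap * 3)
        return true; /* load factor above 0.75: grow */
    if (map->cap > 128 && map->len * 4 < map->cap)
        return true; /* below 0.25 and above the minimum: shrink */
    return false;
}
```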
Once we’ve decided to resize, we need to know what the new size will be. We want this to be a large enough increment to avoid having to resize regularly but without growing too large or too small too rapidly. I’ve settled on doubling and halving in size respectively, which seems to be fairly standard.
Knowing the new size we can allocate a new set of buckets. Then we simply have to iterate through all the existing buckets and for any populated ones, rehash the key modulo the new capacity, and write the key-value pairs to the new buckets.
I’ve made this slightly more complicated by defining my key, value, and states arrays separately. They could have been grouped together under a single `struct hashmap_bucket`, but this is how I’ve done it. Arguably this way should have better performance, since the states array is going to be accessed and iterated through frequently, so keeping it densely packed in memory will be more cache friendly. This is absolutely a premature and untested optimisation, but so be it.
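Putting those pieces together, a sketch of the whole function:

```c
bool
hashmap_resize(struct hashmap *map)
{
    struct hashmap old = *map;
    size_t newcap, i, j;

    if (!hashmap_should_resize(map))
        return true;

    /* Double when growing, halve when shrinking, start at 128. */
    if (old.cap == 0)
        newcap = 128;
    else if (old.len * 4 > old.cap * 3)
        newcap = old.cap * 2;
    else
        newcap = old.cap / 2;

    /* calloc zeroes the states array, i.e. every bucket starts EMPTY. */
    map->keys = calloc(newcap, sizeof(*map->keys));
    map->vals = calloc(newcap, sizeof(*map->vals));
    map->states = calloc(newcap, sizeof(*map->states));
    if (map->keys == NULL || map->vals == NULL || map->states == NULL) {
        free(map->keys);
        free(map->vals);
        free(map->states);
        *map = old; /* leave the hashmap untouched on failure */
        return false;
    }
    map->cap = newcap;
    map->len = 0;

    /* Rehash every valid entry into the new buckets; tombstones are
     * dropped along the way. */
    for (i = 0; i < old.cap; i++) {
        if (old.states[i] != HASHMAP_STATE_VALID)
            continue;
        j = hashmap_hash(old.keys[i]) % map->cap;
        while (map->states[j] == HASHMAP_STATE_VALID)
            j = (j + 1) % map->cap;
        map->keys[j] = old.keys[i];
        map->vals[j] = old.vals[i];
        map->states[j] = HASHMAP_STATE_VALID;
        map->len++;
    }

    free(old.keys);
    free(old.vals);
    free(old.states);
    return true;
}
```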
Putting it all together
With all of the above combined, we have a fully functioning hashmap, and that item is checked off my bucket list. Pardon the pun.
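For instance, a quick end-to-end run using the sketches above:

```c
#include <stdio.h>

int
main(void)
{
    struct hashmap map = {0};
    size_t it;

    hashmap_set(&map, "foo", 1);
    hashmap_set(&map, "fizz", 2);

    it = hashmap_get(&map, "fizz");
    if (hashmap_exists(&map, it))
        printf("%s = %d\n", hashmap_key(&map, it), hashmap_val(&map, it));

    hashmap_remove(&map, it);
    hashmap_free(&map);
    return 0;
}
```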
There are many, many improvements that could be made to this in terms of versatility, robustness, and performance, but in my rough testing, using my Advent of Code solution as a benchmark, I found this performed within 10% of the much more mature klib implementation.
As a first stab at a hashmap I’m very happy with that. I hope that you found this exercise as helpful as I did. It turns out that hashmaps are very easy to build so long as you have a decent hash function to use.