Harry Jeffery

Let's write a hashmap in C

2022-01-02

I’ve been writing C as my preferred language for many years now, but in the kinds of projects I’ve worked on I’ve never needed to use a hashmap. That’s not because I’ve never encountered a use case where it makes sense, but because the software I’ve worked on has always been dealing with a few thousand elements at the very most, and not doing too many searches of those elements.

With so few elements, any processor with a cache and prefetcher worth its salt will yield excellent performance using just a plain dynamic array, which is trivial to implement as needed. So that’s what I’ve used.

However, on day 23 of 2021’s Advent of Code I finally had need of a hashmap as an optimisation. For that programming problem I found myself dealing with hundreds of thousands of elements and doing (probably) millions of lookups. Switching from a dynamic array to a hashmap improved the runtime of my solution from 10 seconds to 0.3 seconds. For that I used klib’s excellent hashmap implementation.

While I know how it works conceptually, there’s a huge difference between reading about how something works and implementing it yourself, so I wanted to finally learn how to implement my own hashmap from scratch. In this blog post, I’ll be sharing what I came up with.

Design considerations

As Wikipedia will tell you, there are a lot of ways to build a hashmap, each with its own set of pros and cons. I’ll be focusing on the simplest possible approach, just to get something working. I can always make more sophisticated optimisations later as needed.

Hashing

A hashmap needs a hash function, but I’ve never actually written a hash function before either. I don’t know what a good hash function for this use case would actually be, since I’ve only ever heard of checksums and cryptographically secure hashes.

After asking some friends on IRC, I settled on djb2, a very simple, but apparently effective, hash function by the famous DJB (Daniel J. Bernstein).

The function is very simple:

unsigned int djb2(const char *bytes, size_t len)
{
  unsigned int hash = 5381;
  for (size_t i = 0; i < len; ++i)
    hash = hash * 33 + bytes[i];
  return hash;
}

That’s actually simple enough that I could pull it from memory if the need should ever arise in future. Nice!

Collision resolution

So what happens if two keys’ hashes point to the same bucket? The two most common solutions seem to be separate chaining and open addressing. The former means each bucket holds a linked list containing all the key-value pairs that live in that bucket. The latter says to just roll over onto the next adjacent bucket until you find one that has space.

In the interests of simplicity I’m going to use open addressing. That way I won’t also have to muck around with linked lists; I’ll just have a big array of buckets to oversee.
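For contrast, a separate-chaining design would hang a linked list off each bucket. A hypothetical node type for that approach (just a sketch for illustration, not something used anywhere in this post) might look like:

// Hypothetical separate-chaining node, shown only for contrast.
// Each bucket would hold a pointer to the head of a list like this.
struct chain_node {
  char *key;
  char *value;
  struct chain_node *next; // next pair whose key hashed to this bucket
};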

Open addressing does have an issue though. Let’s look at what happens if a conflict occurs but then the key that ‘won’ the conflict is deleted:

  1. The hashmap is initially empty
  2. fizz is looked up, which maps to bucket 1
    • Bucket 1 is empty, so the lookup fails
  3. foo=bar is inserted, mapping to bucket 1
  4. fizz=buzz is inserted, also mapping to bucket 1
    • Bucket 1 is occupied, so it goes into bucket 2 instead
  5. fizz is looked up
    • Bucket 1 is occupied but has the wrong key
    • Bucket 2 is occupied and has the right key, so the lookup succeeds
  6. foo is deleted, leaving bucket 1 empty
  7. fizz is looked up, which maps to bucket 1
    • Bucket 1 is empty, so the lookup fails

The key for fizz is now in the wrong bucket with no indication that there was ever a conflict. We could check the next bucket just in case, but what if there were two conflicts and fizz is actually in bucket 3? We don’t want to fall back to a full search of the entire hashmap for a missing key. So what do we do?

A peek at klib’s code gives the solution. When deleting a value, instead of marking its bucket as empty, mark it as ‘deleted’ instead. When writing, treat a deleted bucket the same as an empty one. When reading, skip over deleted buckets but don’t stop searching until an empty non-deleted bucket is found.

That way, even though fizz is in the wrong bucket, the search will still move past the ‘deleted’ bucket onto the one fizz is actually stored in.

Load factor

How full should we allow the hashmap to get before increasing its size? If we let it get too full then we’ll start to have many collisions and long chains of contiguous entries that have to be searched linearly, losing that O(1) performance we’re looking for.

If we oversize it, we waste memory.

Wikipedia suggests a factor of 0.6-0.75 is good. I’m going to keep the maths simple and double the size of the hashmap any time we get beyond a load factor of 0.75, and halve the size of the hashmap any time that it drops below 0.25.
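Checking those thresholds doesn’t need any floating point: multiplying both sides of each comparison keeps everything in integer arithmetic. A quick sketch of the checks (the same expressions the resize function later in this post uses; the variable names here are just for illustration):

// len / cap > 0.75  rearranges to  len * 4 > cap * 3
bool too_full = hm->len * 4 > hm->cap * 3;
// len / cap < 0.25  rearranges to  len * 4 < cap
bool too_empty = hm->len * 4 < hm->cap;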

Type specialisation

I quite like klib’s approach of using macros to generate type specific versions of its hashmap as needed, similar to how C++’s templating works. I’m not going to incorporate that in my first attempt though. Instead I’ll use hardcoded key and value types for now, but typedef them so that they’re easy to change out later.
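For the curious, the macro-generation approach looks roughly like this hypothetical sketch (my own illustration of the idea, not klib’s actual macros):

// Expanding this macro stamps out a struct specialised for the given
// types; a full version would generate the functions too.
#define HASHMAP_DECLARE(name, key_t, val_t) \
  struct name {                             \
    size_t len, cap;                        \
    key_t *keys;                            \
    val_t *values;                          \
  };

HASHMAP_DECLARE(int_map, int, double) // defines struct int_map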

The types

With those decisions made, we’re ready to begin:

#include <stdbool.h> // bool
#include <stddef.h>  // size_t
#include <stdlib.h>  // calloc, free
#include <string.h>  // memset, strcmp

typedef char* hm_key;
typedef char* hm_value;

enum hm_state {
  HM_EMPTY = 0,
  HM_VALID = 1,
  HM_DELETED = 2,
};

struct hashmap {
  size_t len; // number of buckets in use, for tracking load
  size_t cap; // number of buckets allocated
  enum hm_state *states; // array of bucket states
  hm_key *keys; // array of bucket keys
  hm_value *values; // array of bucket values
};

So after pulling in the standard headers the rest of the code will need, we first typedef some custom types for easy replacement later. Then we create an enum for tracking the state of each bucket in our hashmap. Lastly we define the hashmap struct itself.

The functions

Initialisation

static void hashmap_init(struct hashmap *hm)
{
  memset(hm, 0, sizeof *hm);
}

I’m a fan of data structures that are valid when zeroed out. That allows me to instantiate them on the stack effortlessly like so:

struct hashmap hmap = {0};

But users of it shouldn’t need to know that, so if they malloc the hashmap it’s good to have the hashmap_init function for them to use without needing to know about the design choices I’ve made.
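For example, a heap-allocating caller can stay oblivious to the zeroing trick:

// Works whether or not the caller knows the struct is valid when zeroed
struct hashmap *hm = malloc(sizeof *hm);
if (hm)
  hashmap_init(hm);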

static void hashmap_free(struct hashmap *hm)
{
  if (!hm)
    return;
  if (hm->cap) {
    free(hm->keys);
    free(hm->values);
    free(hm->states);
  }
  memset(hm, 0, sizeof *hm);
}

Next is hashmap_free, which again is nothing exciting, just some good housekeeping. Zeroing the struct out isn’t strictly necessary but will help generate nice clean crashes if misused. A null pointer is way more obvious than a stale one in a core dump.

Hashing

static size_t hashmap_hash_key(hm_key key)
{
  size_t v = 5381;
  for (size_t i = 0; key[i]; ++i)
    v = v * 33 + key[i];
  return v;
}

This is exactly as I described before, except that it’s now returning a size_t. This is just a convenience since I know I’ll be using it as an index into my buckets, albeit with the modulo operator. It saves me some casting, and I also know that size_t will be 64 bits wide almost everywhere, unlike unsigned int.

Getter & Setters

#define hashmap_begin(hm) ((size_t)(0))
#define hashmap_end(hm) (((hm)->cap))
#define hashmap_states(hm, it) ((hm)->states[(it)])
#define hashmap_key(hm, it) ((hm)->keys[(it)])
#define hashmap_value(hm, it) ((hm)->values[(it)])
#define hashmap_exists(hm, it) ((it) < (hm)->cap && hashmap_states((hm), (it)) == HM_VALID)

Again inspired by klib, I decided to make these accessor functions preprocessor macros, to give the compiler the best opportunity to optimise them. Arguably these could just be static inline functions in the header, but the key and value macros have one nice advantage:

struct hashmap hm; // our hashmap, assumed already populated
size_t it; // assume this is a valid index or 'iterator' into the hashmap
hashmap_value(&hm, it) = "value"; // assign a value directly

That kind of assignment can be done without having to dance around pointers.
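The begin/end/exists macros also make iteration straightforward. Here’s a sketch of walking every valid entry, assuming hm has been populated:

// Walk every bucket, skipping any not currently holding a value
for (size_t it = hashmap_begin(&hm); it != hashmap_end(&hm); ++it) {
  if (hashmap_states(&hm, it) != HM_VALID)
    continue;
  printf("%s => %s\n", hashmap_key(&hm, it), hashmap_value(&hm, it));
}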

Insertion

With the boilerplate done, we can get onto the meat of the hashmap.

Insertion is a nice simple operation conceptually:

  1. Hash the key, modulo by the number of buckets to get the bucket index
  2. If the bucket is occupied with a different key, try the next bucket.
  3. Write the key and value into the bucket, mark it as valid, and update len.

In code, this only takes a few minutes to whip up:

// Forward declaration; hashmap_resize is defined in the Resizing section
static bool hashmap_resize(struct hashmap *hm);

static size_t hashmap_insert(struct hashmap *hm, hm_key key, hm_value value, bool *existed)
{
  // First see if we need to resize the hashmap
  // If that fails, abort and return an invalid iterator
  if (!hashmap_resize(hm))
    return hm->cap;

  // Hash the key, modulo by the number of buckets
  size_t it = hashmap_hash_key(key) % hm->cap;

  // Skip over full buckets until we find an available one,
  // either empty or deleted is fine. We know this can't get
  // into an infinite loop due to lack of space since we limit
  // the load factor to 0.75.
  while (hm->states[it] == HM_VALID && strcmp(key, hm->keys[it]))
    it = (it + 1) % hm->cap;

  // If we're not overwriting an existing value with the same key then
  // we need to increment the count of how many buckets are in use
  if (hm->states[it] != HM_VALID)
    hm->len += 1;
  // If we've been given a valid pointer, use it to report whether the
  // key already existed in the hashmap or not.
  if (existed)
    *existed = hm->states[it] == HM_VALID;
  // Lastly, mark the bucket as in use and set its key and value.
  hm->states[it] = HM_VALID;
  hm->keys[it] = key;
  hm->values[it] = value;
  // And return an iterator to the bucket
  return it;
}

One operation I didn’t mention before is the hashmap_resize function that I call. That is going to be the most difficult part of the hashmap so I’ll save it for last. We can just assume that if it returns true then the hashmap has buckets allocated and a reasonable load factor.

Removal

Compared to insertion, removal is trivial. We’re going to assume that the caller already has an iterator to the bucket they wish to remove, so we just need to check it’s a valid iterator, mark the bucket deleted if so, and then shrink the hashmap if necessary.

static void hashmap_remove(struct hashmap *hm, size_t it)
{
  if (hashmap_exists(hm, it)) {
    hm->states[it] = HM_DELETED;
    hm->len -= 1;
  }
  hashmap_resize(hm);
}

Lookup

Now for the most useful function of all: looking up a value in the hashmap by key. Conceptually it’s quite similar to what we do for insertion. We hash the key and modulo it to get the bucket index the value should live in. If it’s the right bucket, return its index. If not, but the bucket is populated or deleted, try the next bucket in case there was a collision.

static size_t hashmap_find(const struct hashmap *hm, hm_key key)
{
  // Avoid dereferencing null pointers if we've not allocated any buffers yet
  if (hm->cap == 0)
    return hm->cap;

  // Calculate the bucket the key corresponds to
  size_t it = hashmap_hash_key(key) % hm->cap;
  size_t start = it;

  // Search for a bucket with a matching key.
  // Keep going for deleted buckets, in case there was a collision
  // but then the original entry was deleted.
  while (hm->states[it] == HM_DELETED ||
         (hm->states[it] == HM_VALID && strcmp(key, hm->keys[it]))) {
    it = (it + 1) % hm->cap;
    // If we've wrapped all the way around, every bucket is valid or
    // deleted and the key isn't here. Bail out rather than loop forever.
    if (it == start)
      return hm->cap;
  }

  // If we found the right bucket, return the index. Otherwise return an invalid iterator
  if (hm->states[it] != HM_VALID)
    return hm->cap;
  return it;
}

Resizing

We can read from and write to the hashmap, but we’ve yet to provide that resize function we called before.

Resizing is going to be an expensive operation, since it involves allocating memory, iterating through every element in the hashmap, and rehashing all the keys. So we don’t want to do it regularly. The first thing to determine, then, is whether we should resize at all. Most of the time we won’t: whenever our load factor is between 0.25 and 0.75 we leave the hashmap alone. Above 0.75 we grow, and below 0.25 we shrink, though never below a minimum size. I’ve picked 128 somewhat arbitrarily.

#define HM_MIN_CAP 128

Once we’ve decided to resize, we need to know what the new size will be. We want this to be a large enough increment to avoid having to resize regularly but without growing too large or too small too rapidly. I’ve settled on doubling and halving in size respectively, which seems to be fairly standard.

Knowing the new size we can allocate a new set of buckets. Then we simply have to iterate through all the existing buckets and for any populated ones, rehash the key modulo the new capacity, and write the key-value pairs to the new buckets.

I’ve made this slightly more complicated by defining my key, value, and states arrays separately. They could have been grouped together under a single struct hashmap_bucket, but this is how I’ve done it. Arguably this way should have better performance, since the states array is going to be accessed and iterated through frequently, so keeping it densely packed in memory will be more cache friendly. This is absolutely a premature and untested optimisation, but so be it.
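For reference, the grouped alternative would have looked something like this (a hypothetical layout, not what the code below uses):

// Array-of-structs layout: one allocation of these instead of three
// parallel arrays. Shown for contrast only.
struct hashmap_bucket {
  enum hm_state state;
  hm_key key;
  hm_value value;
};

With that aside, here’s the actual resize function: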

static bool hashmap_resize(struct hashmap *hm)
{
  size_t oldCap = hm->cap;
  size_t newCap;

  // Calculate the new capacity depending on our current load
  // factor
  if (!hm->cap || hm->len * 4 > hm->cap * 3) {
    newCap = oldCap > 0 ? oldCap * 2 : HM_MIN_CAP;
  } else if (hm->cap > HM_MIN_CAP && hm->len * 4 < hm->cap) {
    newCap = oldCap / 2;
  } else {
    // Or if no resizing required, return success early
    return true;
  }

  // Allocate our new buckets
  hm_key *newKeys = calloc(newCap, sizeof *hm->keys);
  hm_value *newValues = calloc(newCap, sizeof *hm->values);
  enum hm_state *newStates = calloc(newCap, sizeof *hm->states);
  // If any of the allocations failed, we need to clean them up
  // and abort. free on a null pointer is a no-op, helpfully.
  if (!newStates || !newKeys || !newValues) {
    free(newStates);
    free(newKeys);
    free(newValues);
    return false;
  }

  // Now rehash all the old buckets, keeping only those
  // holding a value
  for (size_t i = 0; i < oldCap; ++i) {
    if (hm->states[i] != HM_VALID)
      continue;
    size_t it = hashmap_hash_key(hm->keys[i]) % newCap;
    while (newStates[it] == HM_VALID)
      it = (it + 1) % newCap;
    newStates[it] = HM_VALID;
    newKeys[it] = hm->keys[i];
    newValues[it] = hm->values[i];
  }

  // Clean up the old buckets and finally install our new ones
  free(hm->keys);
  free(hm->values);
  free(hm->states);
  hm->keys = newKeys;
  hm->values = newValues;
  hm->states = newStates;
  hm->cap = newCap;

  return true;
}

Putting it all together

With all of the above combined, we have a fully functioning hashmap, and that item is checked off my bucket list. Pardon the pun.

There are many, many improvements that could be made to this in terms of versatility, robustness, and performance, but in my rough testing, using my Advent of Code solution as a benchmark, I found this performed within 10% of the much more mature klib implementation.

As a first stab at a hashmap I’m very happy with that. I hope that you found this exercise as helpful as I did. It turns out that hashmaps are very easy to build so long as you have a decent hash function to use.
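To see it all in action, here’s a minimal usage sketch, assuming all of the code above lives in one file. Note that the map stores the key and value pointers directly, so they must outlive the map:

#include <stdio.h>

int main(void)
{
  struct hashmap hm = {0}; // valid when zeroed

  // Insert a couple of entries
  bool existed;
  hashmap_insert(&hm, "foo", "bar", &existed);
  hashmap_insert(&hm, "fizz", "buzz", &existed);

  // Look up a key and print its value if found
  size_t it = hashmap_find(&hm, "fizz");
  if (hashmap_exists(&hm, it))
    printf("fizz => %s\n", hashmap_value(&hm, it));

  // Remove it and confirm it's gone
  hashmap_remove(&hm, it);
  if (!hashmap_exists(&hm, hashmap_find(&hm, "fizz")))
    printf("fizz removed\n");

  hashmap_free(&hm);
  return 0;
}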