
How to calculate hash of a complex object in C# efficiently?


Rokas Balevičius

Senior Software Engineer

2023-02-23



Caching is one of the best solutions when it comes to cutting latencies and increasing the throughput of a C# web application. Replacing heavy database queries and code computations with a single lookup by key enables noticeable gains without too much work. All you really need to do is figure out what your key is and what you want to store as a value. Once that part is done, just select the place you want to store the key-value pair and enjoy the results in your Prometheus, Zabbix or other dashboard.

It is all fun and games as long as your key can be built by concatenating a few strings. Or maybe you are lucky and you can use one of the fields directly (say, the ID of an order). In other cases though, you might run into a problem.

Let's say you are working on a system where the user is performing searches. In your application, users can set many filters, some of which can be long lists or nested structures. At the same time, your service is heavily A/B tested, which means you get another set of knobs that can be turned for every request - this time by the system itself. To top it all off, you might be serving different markets, and there might be an admin panel where other settings can be changed by business people (or other developers). This leads to a situation where your “key” depends on a large object. To add more complexity, not all properties of the object might matter for the cache key. You might be putting data into the cache not at the end of the pipeline but somewhere in the middle, and some of the properties might not affect anything up to that point.


Under such circumstances we need to achieve the following:

  1. Calculate a hash from a large object quickly and efficiently.
  2. The hashing algorithm has to have good enough distribution to avoid excessive collisions.
  3. Include all properties of the object automatically. It is better to include something you should not have and see hit rates plummet, than forget to include something important and return incorrect data to the user.
  4. Allow developers to easily exclude some of the properties in case they are not important for the cache key.
  5. The solution has to be reasonably easy to maintain.

Hash length

Selecting the proper hash length to keep the chances of collision acceptably low requires some diligence. The main input to the decision is the number of distinct data variations. In the case of temporary caching (think minutes), the time span also plays a role: short cache durations reduce the chances of hashes colliding. This is a whole different topic, and for our purposes we can simplify and say that a 64-bit hash should be good enough.
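
As a rough sanity check (my back-of-the-envelope numbers, not from the article), the birthday bound shows why 64 bits goes a long way:

using System;

// Rough birthday-bound approximation: with k keys alive at the same time and a
// b-bit hash, P(at least one collision) ≈ k * (k - 1) / 2^(b + 1).
static double CollisionProbability(double keys, int bits) =>
    keys * (keys - 1) / Math.Pow(2.0, bits + 1);

// One million concurrently cached keys with a 64-bit hash:
Console.WriteLine(CollisionProbability(1_000_000, 64)); // ≈ 2.7E-08, comfortably low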

Test data

To figure out what works and what doesn't, we need to define some kind of stable test data. To keep things simple (and because naming is hard), we will use the following data structure - you can find it in the GitHub repository.
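
The exact class lives in the linked repository; purely for illustration, its shape is along these lines (the property names here are invented):

using System.Collections.Generic;

// Hypothetical illustration of the shape - the real class is in the linked repo.
public class SearchContext
{
    public string Market { get; set; } = "";
    public int? PageSize { get; set; }                  // nullable simple property
    public List<string> Filters { get; set; } = new();  // can be empty or hold hundreds of strings
    public Dictionary<string, string> AbTestFlags { get; set; } = new();
    public SearchContext? Nested { get; set; }          // nested object, one level deep in the "heavy" case
}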

In reality, such a data structure is going to have various levels of “filling”. In some cases, the list might be completely empty, while in others it might contain strings in the low hundreds. So it makes sense to create three different levels of “filling” to test against:

  1. Light - empty object. Only default values.
  2. Medium - half of the collections have data. Half of the nullable properties have data. No nested objects.
  3. Heavy - all collections have data. All nullable properties have data. Objects are nested on one level.

In reality, most of the cases will fall somewhere in the “medium” category, but it makes sense to check both extremes to see how well the hashing strategy scales with the size of the input. You are also free to clone the repo, and change the test data to your liking.

Strategy number one - Serialize and hash

The very first idea on how to tackle such a problem is to serialize the data we want included in the hash and run it through a well-established hashing algorithm to obtain the key.

Selection of hash function

There are a lot of different hashing functions around. Most developers are familiar with the cryptographic ones - think SHA-1 or the infamous MD5 - and with the password-hashing functions built on top of them.

Using a cryptographic hash function in our case would be counterproductive. Cryptographic hash functions trade raw speed for security guarantees, and the password-hashing schemes built on them are deliberately slow to force attackers to take more time - definitely not something we are looking for.

What we need is a different type of hash function - a non-cryptographic one. An example would be CRC32, which is often used to create checksums for validating downloaded packages or archives. This type of hash function is made exactly for our use case - mapping a bunch of data into a smaller set of bytes as fast as possible, with good distribution.

After some research, the most promising candidates available in C# were CRC64, Murmur3 (128-bit), CityHash64 and XXHash64. Murmur3 does not have a 64-bit variant, so I decided to try the 128-bit version instead. Some of the hash algorithms have multiple implementations in C#, hence the decision to test the most popular ones.

To benchmark the hashing performance, all we need to do is generate some test data using our “filling” levels, serialize it to JSON and run it through the hashing functions. JSON does add some extra bytes to the data, but we do not care about absolute numbers for now. A fast algorithm will remain fast, and that's all we need.

All the benchmarks were run on a MacBook Pro 2020 (Core i7, 32 GB RAM, macOS Monterey 12.6) using the .NET 6.0.403 SDK and BenchmarkDotNet.

Here are the results for the “low” filling level:

Here are the results for the “medium” filling level:

Here are the results for “high” filling level:

The same algorithm implemented by different packages has quite a large performance variance. From the results it is clear that the XXHash64 implementations from the StandardHash and K4os.Hash libraries are the fastest options available to us, and as a bonus they do their work without any heap allocations.

Due to the StandardHash and K4os.Hash XXHash implementations being so close, it makes sense to continue with both of them for the next steps. One reason is that the two implementations provide different ways of feeding data in, and this might come in handy later down the line.

Selection of serialization

The next step is to figure out how we want to serialize objects (so we can get bytes to feed into the function). There are two important considerations here:

  1. we want minimal changes to the code;
  2. we need to be able to easily exclude a field from being serialized (thus excluding it from the hash).

The candidates selected are - JSON and Protobuf.

JSON gets its spot due to the fact that it is easy to use, and modern .NET has optimized the serialization quite a bit. It also allows us to exclude properties by using the JsonIgnore attribute, and requires no annotations for our use case to work. We can also optimize it somewhat by excluding null properties via the serializer options.
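
As a sketch of how the pieces fit together (type and member names are my own, reusing a trimmed-down version of the hypothetical SearchContext from earlier; the XXH64.DigestOf call assumes the K4os.Hash.xxHash API):

using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;
using K4os.Hash.xxHash;

public class SearchContext
{
    public string Market { get; set; } = "";
    public List<string> Filters { get; set; } = new();

    [JsonIgnore] // not relevant for the cache key
    public string? DebugInfo { get; set; }
}

public static class JsonKeyBuilder
{
    private static readonly JsonSerializerOptions Options = new()
    {
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull // drop null properties
    };

    public static ulong ComputeKey(SearchContext context)
    {
        byte[] bytes = JsonSerializer.SerializeToUtf8Bytes(context, Options);
        return XXH64.DigestOf(bytes, 0, bytes.Length); // assumed K4os.Hash.xxHash API
    }
}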

Protobuf is considered to be one of the fastest, if not the fastest, serialization formats at the moment. The downside is that we must mark every property we want to include with an attribute. If we want to exclude a property, we simply do not mark it. This is more invasive than the JSON approach, but it is acceptable.
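
Assuming protobuf-net is the library in question, the same hypothetical class would be annotated roughly like this:

using System.Collections.Generic;
using ProtoBuf;

[ProtoContract]
public class SearchContext
{
    [ProtoMember(1)]
    public string Market { get; set; } = "";

    [ProtoMember(2)]
    public List<string> Filters { get; set; } = new();

    // No ProtoMember attribute: excluded from serialization, and therefore from the hash.
    public string? DebugInfo { get; set; }
}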

Due to our hash functions having different ways to feed the data, I decided to try them all.

Here are the results for “low” filling level:

Here are the results for “medium” filling level:

Here are the results for “high” filling level:

Benchmarks with a low filling level do not show us much. All the approaches are very fast and have minimal impact. Once we get to the medium test case, we can observe two things.

First, Protobuf is faster than JSON, but also has a higher memory impact. At a high fill rate, Protobuf not only widens the lead but also does it with a lower impact on the GC. The second interesting thing to note is that the same library with the same data can have different performance depending on the data input method.

What we can clearly see with this approach is that large objects do not serialize all that quickly, and serialization also has an impact on the GC. Allocating 0.5 MB to 1 MB of memory just to calculate the hash of an object does not feel right, especially given that we already have all the data in place (the object itself).

Another problem might await you with the exclusion of properties. Let's say the object you want to use as part of your hash is also an object you deserialize from JSON (say, part of an API request, or it's stored as JSON in some sort of storage). Excluding a property will not only exclude it from the hash, but from the initial deserialization as well. This might or might not be a problem for you, but keep this situation in mind.
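
To make that concrete, here is a hypothetical request type (assuming System.Text.Json): the same attribute that keeps a property out of the cache key also drops it when the incoming payload is deserialized.

using System;
using System.Text.Json;
using System.Text.Json.Serialization;

var request = JsonSerializer.Deserialize<SearchRequest>(
    "{\"Query\":\"shoes\",\"SessionId\":\"abc\"}");

// SessionId is null here, even though the payload contained a value - the plain
// [JsonIgnore] applies on the way in as well as on the way out.
Console.WriteLine(request!.SessionId ?? "(null)");

public class SearchRequest
{
    public string Query { get; set; } = "";

    [JsonIgnore] // not part of the cache key...
    public string? SessionId { get; set; }
}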

Second strategy - Hashing without serialization

The only way to avoid serialization and its problems is to build the hash on the fly while traversing the object graph. We already have the data we need, it's just in the wrong format.

To test this idea, we can craft our own hashing algorithm (please do not use this in production).

What a hashing function essentially does is map input data of arbitrary length into N buckets. The simplest algorithm would be a mod on an integer: if we had 10 buckets, all we would have to do is mod by ten and use the remainder as our bucket id. 1 would go to bucket 1, 2 to bucket 2, 11 back to bucket 1, and so on.

Such an approach is naive and has its pitfalls, so we have to get more clever.

One of the ways to make a somewhat decent hashing function is this (again please do not use in production):

var accumulator = 17; // seed; must be a prime number
// foreach property of the hashable object:
accumulator = (accumulator * 397) ^ property.GetHashCode();

This allows us to leverage the GetHashCode method to obtain a hash representation of each item in our object, and combine all of the property hashes into one. I also assume that there is a mathematician somewhere shaking his head with disapproval right now.

Calling GetHashCode directly on reference types will hash the object reference (its identity) rather than its contents, and that's not something we want. To avoid that, we need to traverse the object graph until we hit the underlying simple types.

A small note on strings. In the classic .NET Framework, string.GetHashCode returned the same hash from run to run, because it used the same hardcoded seed. In .NET Core this was changed to a random seed for each run. You can read more about it in Andrew Lock's blog post.

We do need a stable hash, so I have added the original implementation to the code and use it instead of GetHashCode for strings. Also notice that the algorithm for accumulating the hash of a string looks a lot like the hash algorithm we are going to use.
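
For reference, here is the kind of deterministic string hash meant above, adapted from the classic framework algorithm as described in Andrew Lock's post (a sketch; the repository has the exact version used):

static int GetDeterministicHashCode(string str)
{
    // Stable across runs, unlike string.GetHashCode() on .NET Core.
    unchecked
    {
        int hash1 = (5381 << 16) + 5381;
        int hash2 = hash1;

        for (int i = 0; i < str.Length; i += 2)
        {
            hash1 = ((hash1 << 5) + hash1) ^ str[i];
            if (i == str.Length - 1)
                break;
            hash2 = ((hash2 << 5) + hash2) ^ str[i + 1];
        }

        return hash1 + hash2 * 1566083941;
    }
}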

As we need to traverse the object, .NET provides us with four options:

  1. Write code by hand - very easy to forget to include data members into the hash.
  2. Reflection - solves the problem of “I forgot to add it”, but is slow and inefficient.
  3. Code generation - should be as fast as handwritten code, and solves the “I forgot to add it”.
  4. Expression trees - should be as fast as handwritten code, and solves the “I forgot to add it”. 

In theory, both code generation and expression trees should yield more or less the same performance. To keep things simple (and to showcase a cool feature), I decided to do the traversal using expression trees.

Expression trees are one of those features developers tend to forget about. They are very powerful, but carry a somewhat hefty maintenance penalty, and usually there is no good justification for using them.

For those uninitiated, expression trees allow developers to build code on the fly (as expression objects connected into a tree structure), compile it, and execute it as if it was handwritten.
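
To make that concrete, here is a minimal sketch (my illustration, not the article's actual builder) that compiles a Func<T, int> combining the hash codes of all public properties with the accumulator pattern from above:

using System;
using System.Linq.Expressions;
using System.Reflection;

public static class NaiveHasherBuilder
{
    // Builds and compiles: input => ((17 * 397) ^ P1.GetHashCode()) combined with P2, P3, ...
    public static Func<T, int> Build<T>()
    {
        ParameterExpression input = Expression.Parameter(typeof(T), "input");
        Expression accumulator = Expression.Constant(17);

        foreach (PropertyInfo property in typeof(T).GetProperties())
        {
            // Box the property value and call GetHashCode on it. Note: this sketch
            // throws on null values and does not recurse into nested objects -
            // the real traversal has to handle both.
            Expression hashCall = Expression.Call(
                Expression.Convert(Expression.Property(input, property), typeof(object)),
                typeof(object).GetMethod(nameof(object.GetHashCode))!);

            accumulator = Expression.ExclusiveOr(
                Expression.Multiply(accumulator, Expression.Constant(397)),
                hashCall);
        }

        return Expression.Lambda<Func<T, int>>(accumulator, input).Compile();
    }
}

The reflection cost is paid only once, at build time; the compiled delegate can be cached in a static field and then runs at roughly handwritten-code speed for every request.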

The very first try did not yield results as good as I expected. Both performance and memory allocations were disappointing. The traversal part worked well, but getting the bytes of the underlying objects still created a bunch of heap allocations and caused lots of memory copy operations.

It was clear that it was time to apply some “unsafe” magic to make things work. As a side note, look into the code for Protobuf serialization and at what the xxHash libraries do - they also heavily rely on the same “tricks”. So I figured it is not cheating, but rather good engineering.
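
The kind of trick meant here (my illustration of the general technique, not the article's exact code) is reinterpreting a value's memory as bytes instead of copying it into a new array:

using System;
using System.Runtime.InteropServices;

int quantity = 42;

// View the int's 4 bytes in place - no copy, no heap allocation.
ReadOnlySpan<byte> intBytes =
    MemoryMarshal.AsBytes(MemoryMarshal.CreateReadOnlySpan(ref quantity, 1));

// Strings can be viewed the same way: reinterpret the char span as raw bytes
// (UTF-16 as stored in memory), again without allocating a new array.
ReadOnlySpan<byte> stringBytes = MemoryMarshal.AsBytes("market-LT".AsSpan());

Console.WriteLine($"{intBytes.Length} + {stringBytes.Length} bytes, zero heap allocations");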

After all was done, results got much better.

Here are the results for “low” filling level:

Here are the results for “medium” filling level:

Here are the results for “high” filling level:

As you can clearly see, this approach is not only at least two times as fast as the fastest alternative, but also reduces heap memory allocations by 30-40 times! It could be improved further still, but at this point it makes very little sense due to diminishing returns.

How to avoid custom hash logic

Even though we made things way faster and heavily reduced memory allocations, we still have a problem - namely, the custom hash function. On the surface level, our hash function seems ok, but where is the proof?

As we saw from the “serialize and hash” approach, the main culprit is not the hashing function itself, but the fact that we need to build the input bytes from the object. Luckily, the K4os.Hash XXHash64 implementation provides a way to feed bytes into it piecemeal and calculate the hash incrementally.

All we have to do is traverse the object as we did before, but instead of calling GetHashCode and our custom combining logic, we can just get the bytes and feed them directly into the library. As a bonus, this also made the expression tree code simpler and easier to understand.
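
In hand-written form (assuming the K4os.Hash.xxHash incremental API of XXH64 with Update and Digest; the values below stand in for what the expression-tree traversal would visit), the idea looks like this:

using System;
using System.Runtime.InteropServices;
using K4os.Hash.xxHash;

var hasher = new XXH64(); // assumed incremental API: Update(...) then Digest()

// Pretend these values came from walking the object graph.
int pageSize = 50;
string market = "LT";

// Feed raw bytes straight into the hasher - no intermediate serialization.
hasher.Update(MemoryMarshal.AsBytes(MemoryMarshal.CreateReadOnlySpan(ref pageSize, 1)));
hasher.Update(MemoryMarshal.AsBytes(market.AsSpan()));

ulong cacheKey = hasher.Digest(); // the final 64-bit hash
Console.WriteLine(cacheKey);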

After some modifications of the original expression trees, we get ourselves the following results:

Here are the results for “low” filling level:

Here are the results for “medium” filling level:

Here are the results for “high” filling level:

As you can see, the performance dropped a bit, but it's still the second fastest time (after the naive custom hash), and we keep the same advantage in memory allocations (notice that they are identical).

This looks like the winning formula for fast and efficient hashing. We have a proven hash algorithm, fast speed, and low GC impact.

There are only two drawbacks to this approach:

  1. The current expression tree code does not cover all the types we would like to “traverse”. This might not be a big deal, as the builder can simply throw an exception during startup if it encounters something it cannot handle.
  2. Not every developer has experience with expression trees, and stumbling on such code for the first time might be frustrating. For me personally, it became easier with every line of code, but I can see how it might be much harder to make changes to such code after not touching it for a year or two.

Also, please note that objects with deep nesting and/or cyclical references have to be handled. The test code throws an exception once nesting reaches 5 levels. This is not a problem unique to the approach described above - every serializer runs into it. The important thing is to keep this scenario in mind and choose the most acceptable approach for the situation.
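
For illustration, such a guard can be as simple as a depth counter checked on every recursion step (a sketch only; the limit of 5 matches the test code mentioned above):

using System;

// Stops runaway recursion caused by cycles or unexpectedly deep nesting.
static class GraphDepthGuard
{
    private const int MaxDepth = 5;

    public static void Ensure(int depth)
    {
        if (depth > MaxDepth)
            throw new InvalidOperationException(
                $"Object graph is nested deeper than {MaxDepth} levels - possible cycle?");
    }
}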

Conclusions

For some systems, it will be more than enough to go with the “serialize and hash” approach, especially if the objects being hashed are not that large or the application is not latency sensitive. Performance tends to be good enough, and having near-zero maintenance costs is an enticing proposition. But it is also nice to know that for the latency conscious there is an option in C# to take performance to the next level. Expression trees might feel daunting at first, but once you get the hang of them, they are really not that hard to work with. Also, it is highly unlikely that you will need to touch the traversal code very often - under most circumstances it's a “write once” type of deal. If you are in doubt about which way to go, the most sensible option is to start with “serialize and hash”, monitor performance, and switch to “traverse and hash” if need be.