How to calculate hash of a complex object in C# efficiently?

How to calculate hash of a complex object in C# efficiently?

Rokas Balevičius

Senior Software Engineer


How to calculate hash of a complex object in C# efficiently

Caching is one of the best solutions when it comes to cutting latencies and increasing the throughput of a C# web application. Replacing heavy database queries and code computations with a single query on a key - enables noticeable gains without too much work. All you really need to do is figure out what your key is and what you want to store as a value. Once that part is done, just select the place you want to store the key-value and enjoy the results in your prometheus, zabbix or other dashboard.

It is all fun and games as long as your key can be built by concatenating a few strings. Or maybe you are lucky and you can use one of the fields directly (say, the ID of an order). In other cases though, you might run into a problem.

Let's say you are working on a system where the user is performing searches. In your application, users can set many filters, some of which can be long lists or nested structures. At the same time, your service is heavily A/B tested, which means that you will get another set of knobs that can be turned for every request - this time by the system itself. To top it all off, you might be serving different markets and there might be an admin panel where other settings might be changed by business people (or other developers). This leads to a situation where your “key” depends on a large object. To add more complexity, not all properties of an object might be important for a cache key. You might be putting data into cache not at the end of the pipeline, but somewhere in the middle, and some of the properties might not be impacting anything up to that point.

Under such circumstances we need to achieve the following:

  1. Calculate a hash from a large object quickly and efficiently.
  2. The hashing algorithm has to have good enough distribution to avoid excessive collisions.
  3. Include all properties of the object automatically. It is better to include something you should not have and see hit rates plummet, than forget to include something important and return incorrect data to the user.
  4. Allow developers to easily exclude some of the properties in case they are not important for the cache key.
  5. The solution has to be reasonably easy to maintain.

Hash length

Selecting the proper hash length to keep the chances of collision acceptably low requires some diligence. The main input to the decision should be the amount of data variations. In the case of temporary caching (think minutes), time span also plays a role. Short cache durations reduce the chances of hashes colliding. This is a whole different topic, and for our purposes, we can simplify and say that 64-bit hash should be good enough.

Test data

To figure out what works and what doesn’t, we need to define some kind of stable test data. To keep things simple (and because naming is hard), we will use the following data structure - do follow to GitHub

In reality, such a data structure is going to have various levels of “filling”. In some cases, the list might be completely empty, while in others it might contain strings in the low hundreds. So it makes sense to create three different levels of “filling” to test against:

  1. Light - empty object. Only default values.
  2. Medium - half of the collections have data. Half of the nullable properties have data. No nested objects.
  3. Heavy - all collections have data. All nullable properties have data. Objects are nested on one level.

In reality, most of the cases will fall somewhere in the “medium” category, but it makes sense to check both extremes to see how well the hashing strategy scales with the size of the input. You are also free to clone the repo, and change the test data to your liking.

Strategy number one - Serialise and hash

The very first idea on how to tackle such a problem is to serialize the data we want to be involved into a hash and run a well-established hashing algorithm to obtain the key.

Selection of hash function

There are a lot of different hashing functions around. Most developers are familiar with cryptographically hardened ones. Such functions are most commonly used for password hashing. An example would be SHA1 or the infamous MD5.

Using a cryptographically hardened hash function in our case would be counterproductive. Cryptographically hardened hash functions are deliberately slow to force attackers to take more time, so definitely not something we are looking for. 

What we need is a different type of hash function. An example of such a hash function would be crc32, which is used quite often to create checksums to validate downloaded packages or archives. Such a type of hash function is made exactly for our use case - mapping a bunch of data into a smaller set of bytes as fast as possible with good distribution.

After some research, the most promising candidates available in C# were - CRC64, murmur3 (128-bit), CityHash64, XXHash64. Murmur3 does not have a 64-bit variant, so I decided to try out the 128-bit version instead. Some of the hash algorithms have multiple implementations in C#, hence the decision to test the most popular ones.

To benchmark, the hashing performance all we need to do is to generate some test data using our “filing” levels, serialize it to JSON and run it through hashing functions. JSON does add some extra bytes to the data, but we do not care about absolute numbers for now. The fast algorithm will remain fast, and that's all we need.

All the benchmarks were run on MacBook Pro 2020, core i7, 32 GB ram, MacOS Monteray 12.6 using DotNet 6.0.403 SDK and BenchmarkDotNet.

Here are the results for the “low” filling level:

Here are the results for “medium” filing level:

Here are the results for “high” filling level: