In my previous post, I asked why this change would result in a better-performing system, since the total amount of work done is the same.
The answer is quite simple. The amount of work that our code is doing is the same, sure, but that isn’t all the code that runs.
In the first version, we allocate the string and then start a bunch of async operations. Those operations are likely to take some time and involve I/O (otherwise, they wouldn’t be async).
It is very likely that in the meantime, we’ll get a GC run. At that point, the string pointed to by the ids variable will be promoted to an older generation (since it survived a GC). That means that it will be collected much later.
Using the new code, the scope of the ids string is far shorter. That means that the GC is more likely to catch it very early and significantly reduce the cost of releasing the memory.
Take a look at the following code:
public async Task<ComputeResult> Execute(List<Item> items)
{
    var sw = Stopwatch.StartNew();
    var ids = string.Join(", ", items.Select(x => x.Id));
    foreach (var item in items)
    {
        await Write(item);
    }
    await FlushToClient();
    var result = await ReadResult();
    log.Info($"Executed computation for '{ids}' in {sw.Elapsed}");
    return result;
}
If we move line 4 to line 11, we can improve the performance of this code significantly. Here is what this looks like:
The question is, why? The exact same amount of work is being done in both cases, after all. How can this cause a big difference?
I ran into this blog post about the Hare language and its approach to generic data structures. From the blog post, we have this quote:
…it’s likely that the application of a higher-level data structure will provide a meaningful impact to your program. Instead of providing such data structures in the standard library (or even, through generics, in third-party libraries), Hare leaves this work to you.
And this one, at the end:
Hare doesn’t provide us with a generic hash map, but we were able to build one ourselves in just a few lines of code. A hash map is one of the simpler data structures we could have shown here, but even for more complex ones, you’ll generally find that it’s not too difficult to implement them yourself in Hare.
I… don’t really know where to begin. The relevant code is here, by the way, and you can see how this works.
A hash table is not a simple data structure, let’s start with that. It is the subject of much research, and a ton of effort has been spent on optimizing it. Hash tables are not the sort of thing that you roll out yourself. To give some context, here are some conference talks that cover this:
Abseil's Open Source Hashtables: 2 Years In - Matt Kulukundis - CppCon 2019
C++Now 2018: You Can Do Better than std::unordered_map: New Improvements to Hash Table Performance
CppCon 2017: Matt Kulukundis “Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step”
So in a quick search, we can see that there is a lot to discuss here. For that matter, here are some benchmark results, which compare:
tsl::hopscotch_map
tsl::robin_map
tsl::sparse_map
std::unordered_map
google::dense_hash_map
QHash
Why are there so many of those?
Well, because that matters. Each of those implementations is optimizing for something specific in different ways. There isn’t a single hash table algorithm; the details matter. A lot.
The fact that Hare believes that a hash table or a map does not need to have a provided solution is pure insanity in my opinion. Let’s look at the example that is provided in the post, shall we? You can see the raw code here.
Let’s take a look to understand what is going on here. There is a static array with 64 buckets that are used as the module cache. In each one of those buckets, you have an array of entries that match that bucket. The hash key here is the FNV32 of the AST node in question.
Let’s see how many things just pop to mind immediately in here as issues. Let’s start with the fact that this is a statically sized hash table, which may be appropriate for this scenario, but won’t fit many others. If we need to handle growing the underlying array, the level of complexity will shoot up significantly.
The code is also not handling deletes (another complicated topic), and the hash collision mode is chaining (via growing the array). In other words, for many other scenarios, you’ll need to roll your own hash table (and see above about the complexities involved).
But let’s take it a bit further. The code is using FNV to compute the hash key. It is also making an assumption here, that the hash keys of different entries will never collide. Let’s see how well that holds up, shall we?
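import pyhash

hasher = pyhash.fnv1_32()  # 32-bit FNV-1 hash function
hashes = {}  # hash value -> last word seen with that hash
with open("/usr/share/dict/words", "rt") as f:
    for word in f:
        h = hasher(word)
        if h in hashes:
            # two entries from the word list share the same 32-bit hash
            print(hashes[h] + " " + word + " " + str(h))
        hashes[h] = word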
In other words, it took me a few minutes of work and under 130 ms of run time to find hash collisions for this scenario. The code above does not handle them. For example, here are a couple of collisions:
“intoxicated” and “tonsillectomy's”
“Helvetius2” and “Rochester0”
Those are going to be counted as the same value by the Hare code above. Fixing this requires a non-trivial amount of code changes.
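To make the problem concrete, here is a minimal sketch in Python (not the Hare code, and not a production hash table) of the same shape of structure: a fixed number of buckets, chaining inside each bucket, and FNV-1 as the hash. The names `fnv1_32` and `FixedChainedMap` are made up for illustration. The only point it tries to show is that a lookup has to confirm the actual key, and cannot trust a matching hash on its own.

```python
from typing import Callable

FNV_OFFSET_BASIS_32 = 2166136261
FNV_PRIME_32 = 16777619


def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply by the prime, then XOR in each byte."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
        h ^= byte
    return h


class FixedChainedMap:
    """Fixed number of buckets; each bucket is a growing list (chaining)."""

    def __init__(self, hash_fn: Callable[[bytes], int] = fnv1_32, buckets: int = 64):
        self._hash_fn = hash_fn
        self._buckets = [[] for _ in range(buckets)]

    def _locate(self, key: str):
        h = self._hash_fn(key.encode("utf-8"))
        return h, self._buckets[h % len(self._buckets)]

    def put(self, key: str, value) -> None:
        h, bucket = self._locate(key)
        for i, (entry_hash, entry_key, _) in enumerate(bucket):
            # Cheap check on the hash first, then confirm with the real key.
            if entry_hash == h and entry_key == key:
                bucket[i] = (h, key, value)
                return
        bucket.append((h, key, value))

    def get(self, key: str, default=None):
        h, bucket = self._locate(key)
        for entry_hash, entry_key, value in bucket:
            if entry_hash == h and entry_key == key:
                return value
        return default
```

Even this toy version leaves out resizing and deletes, which is where most of the real complexity lives.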
For that matter, let’s talk for a second about the manner in which I found it. If I were trying to write the same code in Hare, what would I have to do?
Well, the answer to that is to write a lot of code, of course. Because I would have to re-implement a hash table from scratch.
And the design of the Hare language doesn’t even allow me to provide that as a library. I have to fall down to code generation at best.
These sorts of things matter. In C, you don’t have a hash table, and the most natural data structure is some form of a linked list. So that gets used a lot. You can bring in a hash table, of course, but adapting it for use is non-trivial, so hash tables are used a lot less often. Try writing the same in Hare, and then compare the cost in time to write (and time to execute).
In modern languages, the inclusion of a hash table in the base language is a basic requirement. Languages like C++, Rust or Zig have one in the standard library and have the facilities to allow you to write your own generic data structures. That means that good data structures exist, and that it makes sense to spend the time writing them, because they’ll be broadly applicable. Languages like C# or Java took this further and made sure that all objects have GetHashCode() and Equals() methods (hashCode() and equals() in Java), specifically to support the hash table scenario. It is that important.
Even Go, before it had generics, had a dedicated syntax carved out in the language to allow you to use maps natively. And now that Go has generics, that is actually far faster.
In many systems, a hash table is one of the core data structures. It is used everywhere, and its lack makes the ecosystem a lot more painful. Take a look at how Hare handles query string parameters:
I mean, I guess it would be nice to have a way to do streaming on query strings? But the most natural way to do that is to use a hash table directly. The same applies for things like headers in web requests, how would you even model that in Hare?
I couldn’t disagree more with the premise of the original post. A hash table is not something that you should punt on; the consequences for your users are dire.
In a previous post, I explained how to monitor a .NET application using OpenTelemetry. The telemetry data includes traces, metrics, and logs. When using OpenTelemetry, the application publishes the data to the OpenTelemetry Collector or exposes endpoints to get the data. However, .NET provides a wa
A customer called us with an interesting issue. They have a decently large database (around 750GB or so) that they want to replicate to another node. They did all the usual things that you need to do and the process started running as expected. However… that wouldn’t make for an interesting postmortem post if everything actually went right…
Their problem was that the replication stalled midway through. There were no resource limits, but the replication didn’t progress even though the network traffic was high. So something was going on, but it didn’t move the replication for some reason.
We first ruled out the usual suspects (replication issue causing a loop, bad network, etc.) and we were left scratching our heads. Everything seemed to be fine, the replication was working, but at a rate of about 1 – 2 documents a minute. In almost 12 hours since the replication started, only about 15GB were replicated to the other side. That was way outside our expectations: we had assumed that the whole replication wouldn’t take this long.
It turns out that the numbers we got were a lie. Not because the customer misled us, but because RavenDB does some smarts behind the scenes that end up being pretty hard on us down the road. To get the full picture, we need to understand exactly what we have in the customer’s database.
Let’s say that you store data about Players in a game. Each player has a bunch of stats, characters, etc. Whenever a player gets an achievement, the game will store a screenshot of the achievement. This isn’t the actual scenario, but it should make it clear what is going on. As players play the game, they earn achievements. The screenshots are stored as attachments inside of RavenDB. That means that for about 8 million players, we have about 72 million attachments or so.
That explains the size of the database, of course, but not why we aren’t making progress in the replication process. Digging deeper, it turns out that most of the achievements are common across players (naturally), and that in many cases, the screenshots that you store in RavenDB are also identical.
What happens when you store the same attachment multiple times in RavenDB? Well, there is no point in storing it twice, RavenDB does transparent de-duplication behind the scenes and only stores the attachment’s data once. Attachments are de-duplicated based on their content, not their name or the associated document. In this scenario, completely accidentally, the customer set up an environment where they would upload a lot of attachments to RavenDB, which are then de-duplicated by RavenDB.
None of that is intentional, it just came out that way. To be honest, I’m pretty proud of that feature, and it certainly helped a lot in this scenario. Most of the disk space for this database was taken by attachments, but only a small number of the attachments are actually unique. Let’s do some math, then.
Total attachments' size is: 700GB. There are about half a million unique attachments. There are a total of 72 million attachments. That means that the average size of an attachment is about 1.4MB or so. And the total size of attachments (without de-duplication) is over 100 TB.
I’ll repeat that again, the actual size of the data is 100 TB. It is just that RavenDB was able to optimize that using de-duplication to have significantly less on disk due to the pattern of data that is stored in the database.
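The arithmetic is easy to reproduce. This is a back-of-the-envelope sketch using the rounded numbers quoted above, with decimal units (1 TB = 1,000,000 MB) assumed:

```python
total_on_disk_gb = 700          # de-duplicated attachment data on disk
unique_attachments = 500_000    # "about half a million"
total_attachments = 72_000_000

# Average size of a single (unique) attachment, in MB.
avg_size_mb = total_on_disk_gb * 1000 / unique_attachments

# Size of all attachments if every copy were stored separately, in TB.
logical_size_tb = avg_size_mb * total_attachments / 1_000_000

# How many times, on average, each unique attachment is referenced.
copies_per_unique = total_attachments / unique_attachments

print(f"average attachment size: {avg_size_mb:.1f} MB")
print(f"size without de-duplication: {logical_size_tb:.1f} TB")
print(f"references per unique attachment: {copies_per_unique:.0f}")
```

In other words, each unique attachment is referenced roughly 144 times on average, which is where the gap between 700GB and 100 TB comes from.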
However, that applies at the node level. What happens when we have replication? Well, when we send an attachment to the other side, even if it is de-duplicated on our end, we don’t know if it is on the other side already. So we always send the attachments. In this scenario, where we have so many duplicate attachments, we end up sending way too much data to the other side. The replication process isn’t sending 750GB to the other side but 100 TB of data.
The customer was running RavenDB 5.2 at the time, so the first thing to do when we figured this out was to upgrade to RavenDB 5.3. In RavenDB 5.3 we have implemented TCP compression for internal data (replication, subscription, etc). Here are the results of this change:
In other words, we were able to compress the 1.7 TB we sent to under 65 GB. That is a nice improvement. But the situation is still not ideal.
De-duplication over the wire is a pretty tough problem. We don’t know what the state is on the other side, and the cost of asking each time can be pretty high.
Luckily, RavenDB has a relevant feature that we can lean on. RavenDB has to handle a scenario where the following sequence of events occurs (two nodes, A & B, with one way replication happening from A to B):
Node A: Create document – users/1
Node B: Replication of document users/1
Node A: Add attachment to users/1 (also modifies users/1)
Node B: Replication of attachment for users/1 & users/1 document
Node A: Modify users/1
Node B: Replication of users/1 (but not the attachment, it was already sent)
Node B: Delete users/1 document (and the associated attachment)
Node A: Modify users/1
Node B: Replication of users/1 (but not the attachment, it was already sent)
Node B is now in trouble, since it has a missing attachment
Note that this sequence of events can happen in a distributed system, and we don’t want to leave “holes” in the system. As such, RavenDB knows to detect this properly. Node B will tell Node A that it is missing an attachment and Node A will send it over.
We can utilize the same approach. RavenDB will now remember the last 16K attachments that it sent in the current connection to a node. If the attachment was already sent, we can skip sending it. But if it is missing on the other side, we fall back to the missing attachment behavior and send it anyway.
In a scenario like the one we face, where we have a lot of duplicated attachments, that can reduce the workload by a significant amount, without having to change the manner in which we replicate data between nodes.
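The idea can be sketched in a few lines of Python. This is an illustration only, not RavenDB’s actual implementation: the class and function names are hypothetical, attachments are identified by a content hash, and the eviction policy (drop the oldest entry once the limit is reached) is an assumption on my part.

```python
from collections import OrderedDict


class RecentlySentAttachments:
    """Bounded, per-connection memory of attachment hashes already sent."""

    def __init__(self, capacity: int = 16 * 1024):
        self._capacity = capacity
        self._sent = OrderedDict()  # insertion-ordered set of content hashes

    def should_send(self, content_hash: str) -> bool:
        if content_hash in self._sent:
            return False  # already sent on this connection, skip it
        self._sent[content_hash] = None
        if len(self._sent) > self._capacity:
            self._sent.popitem(last=False)  # forget the oldest entry
        return True


def replicate_batch(batch, cache, send):
    """Send only the attachments that were not recently sent."""
    for content_hash, data in batch:
        if cache.should_send(content_hash):
            send(content_hash, data)


def on_missing_attachment(content_hash, load, send):
    """Fallback path: the destination reported a hole, so send regardless."""
    send(content_hash, load(content_hash))
```

The cache is allowed to be wrong in one direction only: skipping an attachment that the other side does not have is safe, because the missing attachment path will catch it.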
I’m happy to announce that we have released v1.0 of RavenDB’s Grafana Data Source. This RavenDB data source plugin allows you to query and visualize your RavenDB data in Grafana. Note that this is distinct from monitoring RavenDB itself in Grafana (which is also possible, of course, see here).
You can see what this looks like here:
For more details, see the detailed guide.
RavenDB 4.2 came out in May 2019. It was our first LTS (long term support) edition for RavenDB and it has had a great run. All good things must come to an end, and RavenDB 4.2 is scheduled to go out of regular support on June 30, 2022.
Today, RavenDB Cloud completed the migration of the last remaining 100 or so customers from RavenDB 4.2 to RavenDB 5.2.
Everyone who is still running RavenDB 4.2 in production is encouraged to move to RavenDB 5.2 (LTS) or later.
After June 30, 2022 – everything will still work for RavenDB 4.2 instances, but we’ll not be providing support for them (nor updates, patches, etc).
Customers with extended support contracts will not be affected, but we won’t be offering extended support contracts for RavenDB 4.2 past the June 30 deadline.
Please upgrade, there are good things in store there.