Relatively General .NET

Fight for every byte it takes

by Oren Eini

posted on: April 24, 2023

I write databases for a living, which means that I’m thinking a lot about persistence. Here is a fun challenge that we went through recently. We need to store a list of keys and values and then look up a value by key. Pretty standard stuff. The keys and values are both 64-bit integers. In other words, what I would like to have is: Dictionary<long,long> lookup; That would be perfect, except that I have to persist the data, which means that I have to work with raw bytes. It’s easiest to think about it if we have some code in front of us. Here is the interface that I need to implement (Page.cs):

void Set(Span<byte> page, long key, long val)
{
    Debug.Assert(page.Length == 8192);
    // impl...
}

bool TryGetValue(Span<byte> page, long key, out long val)
{
    Debug.Assert(page.Length == 8192);
    // impl...
}

As you can see, we have a byte buffer (8KB in size) and we want to add or look up values from the buffer. All the data resides in the buffer, nothing is external. And we cannot unpack it in memory, because this is used for lookups, so this needs to be really fast. The keys we are storing are file offsets, so they correlate quite nicely to the overall size of the file. Meaning that you’ll have a lot of small values, but also large ones. Given a key, we need to be able to look up its value quickly, since we may run this lookup billions of times.

Given that I have 8KB of data, I can do the following: just treat the buffer as a sorted array, which gives me a pretty easy way to search for a particular value and a simple way to actually store things. Theoretically, given an 8KB page and 16 bytes per (key, value) entry, we can store up to 512 entries per page. But it turns out that this is just a theory. We also need to keep track of the number of items that we have, and that takes some space. Just a couple of bytes, but it means that we don’t have those bytes available. A page can now contain up to 511 entries, and even at full capacity we have 14 bytes wasted (2 for the number of entries, and the rest are unused). Here is what this looks like in code (SimpleArray.cs):
bool Set(Span<byte> page, long key, long val)
{
    Debug.Assert(page.Length == 8192);
    // takes two bytes, but we reserve 8 bytes to make alignment easier
    ref var numberOfItems = ref Unsafe.AsRef(MemoryMarshal.Cast<byte, ushort>(page)[0]);
    // actual usable range
    var entries = MemoryMarshal.CreateSpan(ref MemoryMarshal.Cast<byte, long>(page)[1],
        page.Length / sizeof(long) - 1);
    var keys = entries[..numberOfItems];                      // from the bottom
    var values = entries[(entries.Length - numberOfItems)..]; // from the top
    var idx = keys.BinarySearch(key); // makes searching easy
    if (idx >= 0) // update existing key
    {
        values[idx] = val;
        return true;
    }
    if (numberOfItems >= entries.Length / 2)
        return false; // we are full, need to split the page
    idx = ~idx; // find where it *should* go
    numberOfItems++;
    // now extend the ranges
    keys = entries[..numberOfItems];
    values = entries[(entries.Length - numberOfItems)..];
    // move the existing entries to make room for the new one
    keys[idx..^1].CopyTo(keys[(idx + 1)..]);
    values[1..(idx + 1)].CopyTo(values[0..idx]);
    keys[idx] = key;
    values[idx] = val;
    return true;
}

bool TryGetValue(Span<byte> page, long key, out long val)
{
    Debug.Assert(page.Length == 8192);
    // takes two bytes, but we reserve 8 bytes to make alignment easier
    ref var numberOfItems = ref Unsafe.AsRef(MemoryMarshal.Cast<byte, ushort>(page)[0]);
    // actual usable range
    var entries = MemoryMarshal.CreateSpan(ref MemoryMarshal.Cast<byte, long>(page)[1],
        page.Length / sizeof(long) - 1);
    var keys = entries[..numberOfItems];
    var values = entries[(entries.Length - numberOfItems)..];
    var idx = keys.BinarySearch(key);
    if (idx >= 0) // found it
    {
        val = values[idx];
        return true;
    }
    val = default;
    return false;
}

As you can see, we are creating two arrays: the keys grow from the bottom of the page and the values grow from the top. The idea is that I can use the BinarySearch() method to quickly find the index of a key (or where it ought to go). From there, I can look at the corresponding values array to get the actual value. The fact that they grow separately (and toward each other) means that I don’t need to move as much memory if I’m adding values out of order.

For now, I want to set up the playground in which we’ll operate. The type of data that you write into such a system is important. I decided to use the following code to generate the test set (Generate.cs):

IEnumerable<long> Generate()
{
    var rand = new Random(2023_04_21);
    while (true)
    {
        yield return rand.Next(100) switch
        {
            < 3 => rand.NextInt64(0, 1L << 7),   //  3%
            < 10 => rand.NextInt64(0, 1L << 15), //  7%
            < 35 => rand.NextInt64(0, 1L << 23), // 25%
            < 75 => rand.NextInt64(0, 1L << 31), // 40%
            < 90 => rand.NextInt64(0, 1L << 39), // 15%
            < 95 => rand.NextInt64(0, 1L << 47), //  5%
            < 98 => rand.NextInt64(0, 1L << 55), //  3%
            _ => rand.NextInt64(0, 1L << 62)     //  2%
        };
    }
}

The idea is that we’ll generate a random set of numbers with the given distribution. Most of the values are in the range of 8MB to 512GB, representing a pretty good scenario overall, I think. And with that, we just need to figure out what metrics we want to use for this purpose.
My goal is to push as many values as I can into the buffer, while maintaining the ability to get a value by its key as fast as possible. The current approach, for example, does a binary search on a sorted array plus an extra lookup in the companion values array. You really can’t beat this if you have to allow arbitrary keys. Here is my test bench (Tester.cs):

Span<byte> buffer = new byte[8192];
var enumerator = Generate().GetEnumerator();
var index = 0;
while (true)
{
    enumerator.MoveNext();
    var k = enumerator.Current;
    enumerator.MoveNext();
    var v = enumerator.Current;
    if (Set(buffer, k, v) == false)
    {
        Console.WriteLine("Inserted: " + Validate(buffer, index));
        break;
    }
    index++;
    Validate(buffer, index);
}

int Validate(Span<byte> page, int index)
{
    var enumerator = Generate().GetEnumerator();
    var dic = new Dictionary<long, long>();
    for (int i = 0; i < index; i++)
    {
        enumerator.MoveNext();
        var k = enumerator.Current;
        enumerator.MoveNext();
        var v = enumerator.Current;
        dic[k] = v; // duplicates update the existing key
    }
    foreach (var (k, expected) in dic)
    {
        if (TryGetValue(page, k, out var actual) == false || expected != actual)
        {
            Console.WriteLine($"Missed: {k}! {expected} != {actual}");
            break;
        }
    }
    return dic.Count;
}

This will insert key/value pairs into the page until it is full. Note that we allow duplicates (we’ll just update the value), so we need to keep track of the number of entries inserted, not just the number of insertions. We also validate the structure at every step of the way, to ensure that we always get the right behavior. This code runs as expected, and we can put 511 values into the page before it gives up. This approach works, it is simple to reason about, and it has very few flaws. It is also quite wasteful in terms of information density. I would like to do better than 511 entries per page. Is it possible to drop below 16 bytes per entry? Give it some thought, I’m going to present several ways of doing just that in my next post…

Retrying a bash command

by Gérald Barré

posted on: April 24, 2023

There are many reasons for commands to fail. Some of them provide retry options, but others don't. In this post, I describe how to retry a bash command multiple times until it succeeds. To retry a command in Bash, you can use a while loop. The loop will execute the command and if the command returns

Allocation, sizing and freeing with double-wide bitmaps: Oh my!

by Oren Eini

posted on: April 21, 2023

A few years ago I wrote about how you can use a bitmap to efficiently find space to allocate. I ran into a similar scenario recently and had another thought about it. Let’s say that we want to allocate from a buffer, which you can see below: set bits in the buffer (marked by X) are busy, and cleared bits are empty. We can find a range of cleared bits to allocate easily enough, as I have shown in my previous post. The question we now need to answer is one of freeing the memory. In the image above, you can see that we have two allocations, the yellow one (the first item) and the green one (which is composed of two cells). The problem is: how do we know what to free in this case? In the case of yellow it is really easy, since we can see that it is a single element. But freeing the green element is much harder, since we can’t tell its size. Usually you need to keep that size somewhere, and usually you store it in the memory itself (taking some part of the allocation for yourself as metadata overhead).

The advantage of bitmaps is that they are simple, memory efficient, and quite fast. The problem is that they are really limited in what information they give us. The buffer above shows us busy or free, nothing more. What if we make it more interesting? Instead of using a single bit per cell, we’ll use two. Then we have the following states:

0b00 – free
0b11 – allocated (and has next)
0b01 – end of allocation

This doubles the memory we use, but it also gives us a really important property. Given an index in the bitmap, we can easily ask what the allocation size is. We just need to find the next cleared bit and compute the distance. Let’s consider this simple bitmap:

0b_00_01_00_01_11_11_00_01_11_01_01

As you can see, we have the following allocations (I separated each pair of bits to make them easy to show):

0 – 1 cell
1 – 1 cell
2 – 2 cells
5 – 3 cells
9 – 1 cell

The code to do the actual accounting looks like this (double-bitmap.cs):

int AllocationSize(ulong bitmap, int pos)
{
    var offset = pos * 2;
    if (((bitmap >> offset) & 1) == 0)
        return 0; // not allocated
    var mask = ulong.MaxValue << (offset + 1);
    ulong bitset = ~bitmap & mask;
    var nextSetBit = BitOperations.TrailingZeroCount(bitset);
    return (nextSetBit + 1 - offset) / 2;
}

And now we have a bitmap where the cost of tracking the size of an allocation is a single additional bit per cell. If you are expecting a lot of big allocations, then this may not be worth it, but if you have many small allocations, removing the need to track the size separately is a major benefit.
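The post stops at sizing, but the same trick makes freeing straightforward. Here is a minimal sketch of what a Free() could look like on top of the AllocationSize() above; this is my own illustration under the same layout, not code from the post.

// Illustration only (not from the original post): free an allocation by
// computing its size and then clearing both bits of every cell it covers.
ulong Free(ulong bitmap, int pos)
{
    var size = AllocationSize(bitmap, pos);
    if (size == 0)
        return bitmap; // nothing was allocated at this position
    var offset = pos * 2;
    var bits = size * 2;
    // build a mask that covers 'size' bit pairs starting at 'offset'
    ulong mask = (bits == 64 ? ulong.MaxValue : (1UL << bits) - 1) << offset;
    return bitmap & ~mask; // cleared bits mean the cells are free again
}

With the example bitmap above, Free(bitmap, 2) would clear both cells of the two-cell allocation at position 2 in a single call, without any separate size metadata.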

Top Free Tools for .NET Web API Load Testing and Benchmarking

by Ardalis

posted on: April 20, 2023

I'm evaluating different approaches to data access and other cross-cutting concerns in .NET 7 Web APIs, both for clients and in preparation…

Deploying RavenDB with Helm Chart

by Oren Eini

posted on: April 19, 2023

Helm is the package manager for Kubernetes. It allows you to deploy applications and systems to a Kubernetes cluster easily, safely, and in a reproducible manner. We provide you with a chart so you can use Helm to deploy RavenDB clusters. You can visit this link for a full discussion of how to do so.

Smoother rebases with auto-squashing Git commits

by Andrew Lock

posted on: April 18, 2023

In this post I introduce the --autosquash flag for git interactive rebases, describe what it does and how to format your commits to use it…

Looking into Corax’s posting lists: Part III

by Oren Eini

posted on: April 17, 2023

We looked into the internals of Corax’s posting lists, and in the last post I mentioned that we have a problem with the Baseline of the page. We are by no means the first people to work with posting lists, and there is a wide body of knowledge on the topic. When it came time to figure out the compression format for our posting list, we used the PFOR format (Patched Frame Of Reference). It, like pretty much all other integer compression methods, uses 32-bit integers. Corax uses 64-bit integers as document ids, so that was a problem. We solved it by using a Baseline for each page. In other words, each page would be able to contain values within 2.1 billion of one another. That is a very reasonable range, given that a page is 8KB in size.

There was a problem as we built more features into Corax. The posting list needed to store not just the document id, but also the frequency of the term in the source document. It turns out that we need 8 bits to do so, and we already had a 64-bit range, so… Instead of creating another location to store the frequencies, we put them directly inside the posting list. But that meant that we reserved a range of bits. We have 64 bits overall, so not a big problem, right? Except that on a page basis, we have a lot less. Before, a page could contain a range of 2.1 billion, but we reserved 10 bits (frequency and tag, the details of which are not important to our story) and we ended up with a range of about 4 million per page.

That little tidbit meant that we could only store items that were within 4MB of one another in a single page. And that was a problem. Whenever we had a posting list where two values were more than 4MB from one another, we would need to split the page. And since the posting list and the entries live in the same file, more page splits mean that entries are now further apart. Here is an example of what this looks like: the index is taking more space than the source data, and most of that is used to store… nothing, since we ended up with a very wide posting list containing very few entries. One of those cases where two independent issues compound each other very quickly.

So we changed things again: instead of limiting ourselves to a 32-bit range per page, we changed the PFOR format to allow for 64-bit integers directly. Once again, that led to a simplification in the codebase and greatly reduced the amount of disk space that we are actually using. To give some context, here is the current amount of disk space taken by the same entity that previously took 800+GB: the problem wasn’t with the number of entries, but that each entry would consume 8KB of disk space on its own, and in the image, you are seeing the number of posting lists, not the number of posting list entries.
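To make the arithmetic above concrete, here is a rough sketch of what packing the frequency and tag into each 64-bit posting-list entry might look like. The exact bit layout Corax uses isn’t shown in the post; the 8-bit frequency plus 2-bit tag split below is my assumption, used only to illustrate why reserving 10 bits shrinks the per-page range from roughly 2.1 billion to roughly 4 million.

// Illustration only: an assumed layout that reserves the low 10 bits of each
// 64-bit entry for an 8-bit frequency and a 2-bit tag, keeping the document id
// in the upper bits.
const int TagBits = 2;
const int FrequencyBits = 8;
const int ReservedBits = TagBits + FrequencyBits; // 10 reserved bits per entry

ulong Pack(ulong docId, byte frequency, byte tag) =>
    (docId << ReservedBits) | ((ulong)frequency << TagBits) | tag;

ulong UnpackDocId(ulong entry) => entry >> ReservedBits;
byte UnpackFrequency(ulong entry) => (byte)((entry >> TagBits) & 0xFF);
byte UnpackTag(ulong entry) => (byte)(entry & 0b11);

// With a 32-bit Baseline per page, only 32 - 10 = 22 bits remain for the
// delta from the Baseline, i.e. a range of about 4 million per page.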

Detecting Dark and Light themes in a WPF application

by Gérald Barré

posted on: April 17, 2023

If an application supports both light and dark themes, it is important to detect the current system theme and update the application accordingly. In this post, I describe how to detect if a WPF application should use a light or dark theme. #Method 1: Using the registry and the Windows message loop
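The summary is truncated here, but as a hedged sketch of the registry half of Method 1: on Windows 10/11 the per-user app theme preference is stored in the AppsUseLightTheme registry value, which a WPF app can read as shown below. Reacting to theme changes through the Windows message loop, which the article also covers, is not shown.

// Minimal sketch: read the user's app theme preference from the registry.
// A value of 1 (or a missing value) means light theme, 0 means dark theme.
using Microsoft.Win32;

static bool IsLightTheme()
{
    const string key = @"HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Themes\Personalize";
    return Registry.GetValue(key, "AppsUseLightTheme", 1) is not 0;
}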

Looking into Corax’s posting lists: Part II

by Oren Eini

posted on: April 14, 2023

In a previous post (which went out a long time ago) I explained that we have the notion of a set of uint64 values that are used for document IDs. We build a B+Tree with different behaviors for branch pages and leaf pages, allowing us to pack a lot of document IDs (thousands or more) per page. The problem is that this structure holds the data compressed, so when we add or remove a value, we don’t know if it already exists or not. That is a problem, because while we are able to do any relevant fixups to skip duplicates and erase removed values, we end up in a position where the number of entries in the set is not accurate. That is a Problem, with a capital P, since we use that for query optimizations.

The solution is to move to a different manner of storing the data in the leaf page. Instead of going with a model where we add the data directly to the page and compress when the raw values section overflows, we’ll use the following format instead: basically, I removed the raw values section from the design entirely. That means that whenever we want to add a new value, we need to find the relevant compressed segment inside the page and add to it (potentially creating a page split, etc). Obviously, that is not going to perform well for writes, since on each addition we would need to decompress the segment, add to it, and then compress it again.

The idea here is that we don’t need to do that. Instead of trying to store the entries in the set immediately, we are going to keep them in memory for the duration of the transaction. Just before we commit the transaction, we are going to have two lists of document IDs to go through: one of added documents and one of removed documents. We can then sort those IDs, walk over the list, find the relevant page for each section of the list, and merge it with the compressed values. By moving the buffering stage from the per-page model to the per-transaction model, we actually gain quite a lot of performance, since if we have a lot of changes to a single page, we can handle compression of the data only once.

It is a very strange manner of working, to be honest, because I’m used to doing the operation immediately. By delaying the cost to the end of the transaction, we are able to gain two major benefits. First, we have a big opportunity for batching and optimizing work on large datasets. Second, we have a single code path for this operation. It’s always: “Get a batch of changes and apply them as a unit”. It turns out that this is far easier to understand and work with. And that is the writes portion of Corax. Remember, however, that Corax is a search engine, so we expect a lot of reads. For reads, we can now stream the results directly from the compressed segments. Given that we can usually pack a lot of numbers into a segment, and that we don’t need to compare against an uncompressed portion, that ended up benefiting us significantly on the read side as well, surprisingly. Of course, there is also another issue: remember the Baseline in the Page Header? We’ll discuss that in the next post; it turned out that it wasn’t such a good idea.
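As a rough sketch of the per-transaction buffering described above (my own illustration; the names and structure are assumptions, not Corax’s actual API):

// Illustration only: buffer additions and removals in memory for the duration
// of the transaction, then apply them to the compressed leaf pages as one
// sorted batch at commit time.
using System;
using System.Collections.Generic;

class PostingListTxn
{
    private readonly List<ulong> _added = new();
    private readonly List<ulong> _removed = new();

    public void Add(ulong docId) => _added.Add(docId);
    public void Remove(ulong docId) => _removed.Add(docId);

    public void Commit(Action<ulong[], ulong[]> mergeIntoPages)
    {
        // sorting lets us walk the B+Tree leaf pages in order, so each page is
        // decompressed, merged and recompressed at most once per transaction
        _added.Sort();
        _removed.Sort();
        mergeIntoPages(_added.ToArray(), _removed.ToArray());
    }
}

The hypothetical mergeIntoPages callback stands in for the step that locates the compressed segment for each range of IDs and rewrites it once with all the batched changes applied.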

That weird slow benchmark and its root cause

by Oren Eini

posted on: April 12, 2023

We care a lot about the performance of RavenDB, like a whole lot. To the point where we have a dedicated team that is continuously burning money (well, CPU cycles) testing out all sorts of scenarios with RavenDB. You can see the performance page on the website for some of their results. It got to the point where we stock NVMe drives at the office because we go through them often enough that we need them available for replacement. The benchmark must run, and the numbers must rise, after all.

But the story today isn’t about the costs we pay to reach our performance goals. Rather, it is about a curious little snafu that we ran into when looking at the results. Here are the benchmark results; I intentionally stripped out everything that would give context to this story. What you can see is the same scenario being run on two identical machines, with the difference being the disk that is used to host the database. In this case, the blue line is an io1 disk (high IOPS, low latency, and high cost) versus gp3 (reasonable IOPS, much higher latency, and lower cost). In this case, lower numbers are better.

If you look at the benchmark, you can see that it makes complete sense. RavenDB is a database product, we are running a benchmark, and we use the disk. It’s predictable that the disk latency will have an impact on the performance of the benchmark. Except… in this case, we are looking at a benchmark that is read-only, and it is meant to run completely from memory. We wrote it so the data size is less than the amount of memory on the system. So why does the disk have an impact at all in this case? We even have a warmup phase before we actually start measuring, to ensure that everything is in memory. Something here does not line up.

After investigating deeper, we discovered the following: when running the automated benchmark, the performance was always correlated with the disk type. When running the same benchmark manually, there was much better performance and no slowdown related to the slower disk. That is a really annoying bug, because the act of observing it somehow makes it go away. What is going on?

After a while, we finally figured it out. The problem was the warmup phase. Basically, the warmup is just running the benchmark itself and discarding the results. Can you guess what the problem was? The warmup phase runs when the system is cold (naturally), and we were hitting the server with enough requests up front that it was unable to process them all (it was queuing on the disk). That meant that a very large portion of the requests in the warmup would time out or just fail. When we started the benchmark phase, significant portions of the data were still on disk, so the benchmark became a disk-bound test, with predictable results. When we ran it manually, we would always do that after the benchmark had already run, so our system would be warm (and fast, with no disk access).

The solution was to scale down the number of requests that the warmup phase sends, to allow gradual loading of the data into memory without overloading the hardware. A case where our expectations and what really happened did not line up, creating some… interesting surprises in the end result.