Relatively General .NET

Optimizing a three-way merge

by Oren Eini

posted on: September 04, 2023

Deep inside the Corax indexing engine in RavenDB there is the notion of a posting list. A posting list is just an ordered set of ids of the entries that contain a particular term. During the indexing process, we need to add and remove items from that posting list. This ends up being something like this:

```csharp
// All three parameters are sorted; the output is the merged list in
// the existing parameter.
long[] Merge(long[] existing, long[] additions, long[] removals);
```

For fun, go and ask ChatGPT to write you the code for this task. You can assume that there are no duplicates between the removals and additions, and that adding an existing item is a no-op (so just one value would be in the end result). Here is a quick solution for this task (not actually tested that much, mind, but sufficient to understand what I'm trying to do):

```csharp
static long[] Merge(long[] existing, long[] additions, long[] removals)
{
    List<long> result = new List<long>();
    int existingIndex = 0;
    int additionsIndex = 0;
    int removalsIndex = 0;
    for (; existingIndex < existing.Length; existingIndex++)
    {
        bool removed = false;
        for (; removalsIndex < removals.Length; removalsIndex++)
        {
            if (removals[removalsIndex] > existing[existingIndex])
                break; // will remove in the future
            if (removals[removalsIndex] < existing[existingIndex])
                continue; // not in the existing, can skip
            removed = true;
            break;
        }
        if (removed)
            continue;

        if (additionsIndex >= additions.Length)
            break; // no work remaining

        for (; additionsIndex < additions.Length; additionsIndex++)
        {
            if (additions[additionsIndex] > existing[existingIndex])
                break;
            else if (additions[additionsIndex] < existing[existingIndex])
                result.Add(additions[additionsIndex]);
            else
            {
                // it's equal, so we'll move to the next one and let the existing add it
            }
        }
        result.Add(existing[existingIndex]);
    }
    // add the remainders
    result.AddRange(additions[additionsIndex..]);
    result.AddRange(existing[existingIndex..]);
    return result.ToArray();
}
```
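To make the contract concrete, here is what a call might look like (the values are made up for illustration):

```csharp
long[] existing  = { 4, 8, 12 };
long[] additions = { 8, 20 };   // 8 is already present, so it ends up in the result only once
long[] removals  = { 12 };

long[] merged = Merge(existing, additions, removals);
// merged: [4, 8, 20]
```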
If you look at this merge code in terms of performance, you'll realize that it is quite expensive. In terms of complexity it is actually pretty good: we iterate over the arrays just once, and the number of comparisons is bounded by the lengths of the lists. However, there is a big issue here: the number of branches that you have to deal with. Basically, every if and every for loop adds a tiny bit of latency to the system, because these are unpredictable branches, which are pretty nasty to deal with.

It turns out that the values we put in the posting list are always a multiple of 4, so the bottom 2 bits are always cleared. That means that we actually have a different way to deal with it. Here is the new logic:

```csharp
static long[] MergeOptimized(long[] existing, long[] additions, long[] removals)
{
    for (int i = 0; i < removals.Length; i++)
    {
        removals[i] |= 1; // flag removals using the always-clear bottom bit
    }

    List<long> result = new List<long>();
    result.AddRange(existing);
    result.AddRange(additions);
    result.AddRange(removals);
    result.Sort();

    int outputIdx = 0;
    for (int i = 0; i < result.Count - 1; i++)
    {
        outputIdx += Convert.ToInt32((result[i + 1] & ~1) != result[i]);
        result[outputIdx] = result[i + 1];
        outputIdx -= (int)(result[i + 1] & 1);
    }
    // outputIdx points at the last value we kept, so the merged length is outputIdx + 1
    return result.Take(outputIdx + 1).ToArray();
}
```

This code was written with an eye to being able to explain the algorithm, mind, not performance. The idea goes like this: we flag the removals with a bit, concatenate all the arrays together, sort them, and then do a single scan over the whole thing, dropping duplicates and removals. In the real code we are using raw pointers, not a List, so there are no access checks, etc.

From an algorithmic perspective, this code makes absolutely no sense at all. We concatenate all the values together, then sort them (an O(N log N) operation), then scan the result again?! How can that be faster than a single scan across all three arrays? The answer is simple: we have a really efficient sort primitive (vxsort) that is able to sort things really fast (GB/sec). There is a really good series of posts that explains how that is achieved. Since we consider sorting to be cheap, the rest of the work is just a single scan over the list, and there are no branches at all there. The code plays with the offset that we write into, figuring out whether we need to overwrite the current value (a duplicate) or go back (a removal), but in general it means that it can execute very quickly.

This approach also has another really important aspect. Take a look at the profiler output for the actual code we have in production, from about an hour's worth of a busy indexing session, and at the more common code path. In both of them, you'll notice something really important: there isn't a call to sorting at all in here. In fact, when I search for the relevant function, I find that it accounts for just 25 ms out of over an hour. How can this be? As efficient as the sorting is, we are supposed to be calling it a lot. Well, consider one scenario. What happens if:

- There are no removals
- All additions happen after the last existing item in the list

In this case, I don't need to do anything beyond concatenating the lists. I can skip the entire process, just copy the existing and additions to the output, and call it a day. Even when I do have a lot of removals and a complicated merge process, the code structure means that the CPU can get through it very quickly. This isn't super friendly for humans to read, but for the CPU, this is chump change.
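The production fast path isn't shown in the post, but a minimal sketch of the check it describes could look like this (the method name and the exact copy mechanics are assumptions for illustration):

```csharp
static long[] MergeWithFastPath(long[] existing, long[] additions, long[] removals)
{
    // Fast path from the scenario above: nothing to remove, and every addition
    // sorts after the last existing entry, so the merge is just a concatenation.
    if (removals.Length == 0 &&
        (existing.Length == 0 || additions.Length == 0 || additions[0] > existing[^1]))
    {
        var output = new long[existing.Length + additions.Length];
        existing.CopyTo(output, 0);
        additions.CopyTo(output, existing.Length);
        return output;
    }

    // Otherwise, fall back to the sort-based merge shown above.
    return MergeOptimized(existing, additions, removals);
}
```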

Turn off monitors when locking the computer

by Gérald Barré

posted on: September 04, 2023

When I lock my computer, it's because I'm going to leave it for a while. I don't want to waste energy by keeping the monitors on, and I don't like having a useless source of light. So, I want to turn them off automatically when I lock my computer. To turn off monitors on Windows, you can use the Send…
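The excerpt above is cut off before naming the API; the usual Win32 route for this is SendMessage with WM_SYSCOMMAND and SC_MONITORPOWER, which may or may not be what the full article uses. A minimal C# sketch of that approach, assuming you want a helper to call when the lock happens:

```csharp
using System;
using System.Runtime.InteropServices;

static class MonitorPower
{
    private const int WM_SYSCOMMAND = 0x0112;
    private const int SC_MONITORPOWER = 0xF170;
    private static readonly IntPtr HWND_BROADCAST = new IntPtr(0xFFFF);

    [DllImport("user32.dll")]
    private static extern IntPtr SendMessage(IntPtr hWnd, int msg, IntPtr wParam, IntPtr lParam);

    // lParam: 2 = turn monitors off, 1 = low power, -1 = turn them back on
    public static void TurnOff() =>
        SendMessage(HWND_BROADCAST, WM_SYSCOMMAND, (IntPtr)SC_MONITORPOWER, (IntPtr)2);
}
```

Reacting to the lock itself is commonly done with a SystemEvents.SessionSwitch handler (checking for SessionSwitchReason.SessionLock), though the full article may take a different route.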

Not all O(1) operations are considered equal

by Oren Eini

posted on: August 30, 2023

At some point in any performance optimization sprint, you are going to run into a super annoying problem: the dictionary. The reasoning is quite simple: one of the most powerful optimization techniques is to use a cache, which is usually implemented as a dictionary. Today's tale is about a dictionary, but surprisingly enough, not about a cache.

Let's set up the background. I'm looking at optimizing a big indexing batch deep inside RavenDB, and my current focus is the RecordTermsForEntries method, which you can see takes 4% of the overall indexing time in the profiler. That is… a lot, as you can imagine. What is more interesting here is why. The simplified version of the code looks like this:

```csharp
private readonly Dictionary<long, List<RecordedTerm>> _termsPerEntryId = new();

private void RecordTermsForEntries(List<TermInEntryModification> entriesForTerm, long termContainerId)
{
    for (int i = 0; i < entriesForTerm.Count; i++)
    {
        var entry = entriesForTerm[i];
        ref var entryTerms = ref CollectionsMarshal.GetValueRefOrAddDefault(_termsPerEntryId, entry.EntryId, out var exists);
        if (exists == false)
        {
            entryTerms = new List<RecordedTerm>();
        }
        entryTerms.Add(new RecordedTerm
        {
            TermContainerId = termContainerId
        });
    }
}
```

Basically, we are registering, for each entry, all the terms that belong to it. This is complicated by the fact that we are doing the process in stages:

- Create the entries
- Process the terms for the entries
- Write the terms to persistent storage (giving them the recorded term id)
- Update the entries to record the term ids that they belong to

The part of the code that we are looking at now is the last one, where we have already written the terms to persistent storage and need to update the entries. This is needed so that when we read them, we'll be able to find the relevant terms. At any rate, you can see that this method's cost is absolutely dominated by the dictionary call. In fact, we are already using an optimized method here, CollectionsMarshal.GetValueRefOrAddDefault(), to avoid doing a TryGetValue() and then an Add() when the value is not already in the dictionary.
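For context, here is a rough side-by-side of the two patterns, using the same fields as the method above (the naive variant is illustrative, not taken from the post):

```csharp
// Naive: potentially two hash lookups per call (TryGetValue, then Add on a miss).
if (_termsPerEntryId.TryGetValue(entry.EntryId, out var terms) == false)
{
    terms = new List<RecordedTerm>();
    _termsPerEntryId.Add(entry.EntryId, terms);   // second lookup
}
terms.Add(new RecordedTerm { TermContainerId = termContainerId });

// Optimized: a single lookup that hands back a writable ref to the value slot.
ref var entryTerms = ref CollectionsMarshal.GetValueRefOrAddDefault(
    _termsPerEntryId, entry.EntryId, out var exists);
if (exists == false)
{
    entryTerms = new List<RecordedTerm>();
}
entryTerms.Add(new RecordedTerm { TermContainerId = termContainerId });
```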
If we actually look at the metrics, this is actually kind of awesome. We are calling the dictionary almost 400 million times, and it is able to do the work in under 200 nanoseconds per call. That is pretty awesome, but it still means that we have over 2% of our total indexing time spent doing lookups. Can we do better? In this case, absolutely. Here is how this works: instead of doing a dictionary lookup, we are going to store a list, and the entry will record the index of its item in the list. Here is what this looks like:

```csharp
private readonly List<List<RecordedTerm>> _termsPerEntryId = new();

private void RecordTermsForEntries(List<TermInEntryModification> entriesForTerm, long termContainerId)
{
    for (int i = 0; i < entriesForTerm.Count; i++)
    {
        var entry = entriesForTerm[i];
        if (entry.TermsPerEntryIndex == -1)
        {
            entry.TermsPerEntryIndex = _termsPerEntryId.Count;
            _termsPerEntryId.Add(new List<RecordedTerm>());
        }
        ref var entryTerms = ref CollectionsMarshal.AsSpan(_termsPerEntryId)[entry.TermsPerEntryIndex];
        entryTerms.Add(new RecordedTerm
        {
            TermContainerId = termContainerId
        });
    }
}
```

There isn't much to this process, I admit. I was lucky that in this case, we were able to reorder things in such a way that skipping the dictionary lookup is a viable method. In other cases, we would need to record the index at the creation of the entry (effectively reserving the position) and then use that later (see the sketch below). And the result is pretty good, even if I say so myself: the cost went down from 3.6 microseconds per call to 1.3 microseconds. That is almost a threefold improvement.
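A minimal sketch of that reservation approach, with hypothetical names for the entry-creation side (this is not the actual RavenDB code):

```csharp
// Hypothetical: reserve the list slot as soon as the entry is created, so every
// later stage can index straight into _termsPerEntryId without any lookup.
private TermInEntryModification CreateEntry(long entryId)
{
    var entry = new TermInEntryModification { EntryId = entryId };
    entry.TermsPerEntryIndex = _termsPerEntryId.Count;   // reserve the position
    _termsPerEntryId.Add(new List<RecordedTerm>());      // slot for this entry's terms
    return entry;
}
```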

Using RavenDB from Cloudflare Workers

by Oren Eini

posted on: August 29, 2023

RavenDB is a multi-primary database, which means that it allows you to write to multiple nodes at the same time, without needing synchronization between them. This ability to run independently from the other nodes in the cluster (or even across clusters) makes RavenDB highly suitable for running on the edge. We have recently published a guide on using RavenDB from Cloudflare Workers, as well as a full template so you can get up to speed in a few minutes. The ability to run in a Cloudflare Worker (and use a nearby RavenDB server) means that your logic runs closer to the client, which can greatly reduce latency and improve the overall user experience.

Recording

by Oren Eini

posted on: August 28, 2023

I spoke for over an hour about how you can build high-performance systems and how we utilize these techniques inside of RavenDB.

How to convert YAML to JSON with PowerShell Core

by Gérald Barré

posted on: August 28, 2023

PowerShell Core doesn't provide a built-in cmdlet to parse YAML. However, there is a PowerShell module called powershell-yaml that provides the ConvertFrom-Yaml and ConvertTo-Yaml cmdlets. The article shows how to convert a YAML file to a JSON file with PowerShell Core.

A twisted tale of memory optimization

by Oren Eini

posted on: August 24, 2023

I was looking into reducing the allocations in a particular part of our code, and I ran into what was basically the following code (boiled down to the essentials):

```csharp
using System.IO;
using System.Text;

public class Scenario
{
    Stream _stream;

    public void Write(char[] buffer, int len)
    {
        var bytes = Encoding.UTF8.GetBytes(new string(buffer, 0, len));
        _stream.Write(bytes);
    }
}
```

As you can see, this does a lot of allocations. The actual method in question was a pretty good size, and all those operations happened in different locations and weren't as obvious. Take a moment to look at the code: how many allocations can you spot here? The first one, obviously, is the string allocation, but there is another one, inside the call to GetBytes(). Let's fix that first by allocating the buffer once (I'm leaving aside the allocation of the reusable buffer; you can assume it is big enough to cover all our needs):

```csharp
public class Scenario
{
    Stream _stream;
    byte[] _reusableBuffer;

    public void Write(char[] buffer, int len)
    {
        var bytes = Encoding.UTF8.GetBytes(new string(buffer, 0, len), _reusableBuffer);
        _stream.Write(_reusableBuffer[..bytes]);
    }
}
```

For that matter, we can also easily fix the second problem, by avoiding the string allocation:

```csharp
public class Scenario
{
    Stream _stream;
    byte[] _reusableBuffer;

    public void Write(char[] buffer, int len)
    {
        var bytes = Encoding.UTF8.GetBytes(buffer[..len], _reusableBuffer);
        _stream.Write(_reusableBuffer[..bytes]);
    }
}
```

That is a few minutes of work, and we are good to go. This method is called a lot, so we can expect a huge reduction in the amount of memory that we allocate. Except… that didn't happen. In fact, the amount of memory that we allocate remained pretty much the same. Digging into the details, we allocate roughly the same number of byte arrays (how?!) and, instead of allocating a lot of strings, we now allocate a lot of character arrays. I broke the code apart into multiple lines, which made things a lot clearer (in fact, I threw it into SharpLab, to be honest). Take a look:
```csharp
public class Scenario
{
    Stream _stream;
    byte[] _reusableBuffer;

    public void Write(char[] buffer, int len)
    {
        char[] charBuffer = buffer[..len];
        var bytes = Encoding.UTF8.GetBytes(charBuffer, _reusableBuffer);
        byte[] bytesBuffer = _reusableBuffer[..bytes];
        _stream.Write(bytesBuffer);
    }
}
```

The expression buffer[..len] is actually translated to:

```csharp
char[] charBuffer = RuntimeHelpers.GetSubArray(buffer, Range.EndAt(len));
```

That will, of course, allocate. I had to change the code to be very explicit about the types that I wanted to use:

```csharp
public class Scenario
{
    Stream _stream;
    byte[] _reusableBuffer;

    public void Write(char[] buffer, int len)
    {
        Span<char> bufferSpan = buffer;
        Span<char> charBuffer = bufferSpan[..len];
        Span<byte> reusableBufferSpan = _reusableBuffer;
        var bytes = Encoding.UTF8.GetBytes(charBuffer, reusableBufferSpan);
        Span<byte> bytesBuffer = reusableBufferSpan[..bytes];
        _stream.Write(bytesBuffer);
    }
}
```

This will not allocate, but if you look at the changes, you can see that the use of var really tripped me up here, because of the number of overloads and the automatic type coercions that did not happen. For that matter, note that any slicing on arrays will generate a new array, including this code:

```csharp
char[] buffer = ...;
Span<char> span = buffer[0..10]; // hidden allocation here
```

This makes perfect sense once you realize what is going on, but it can still be a big surprise. I looked at the code for quite a while before I figured out what was happening, and that was with profiler output that pinpointed the fault.
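A quick way to see the difference for yourself (a standalone sketch, not from the post) is to compare the bytes allocated by each form of slicing; the exact numbers will include some runtime noise, but the array slice shows up clearly:

```csharp
using System;

char[] buffer = new char[1024];

long before = GC.GetAllocatedBytesForCurrentThread();
Span<char> fromArraySlice = buffer[0..10];          // allocates a new char[10] first
long afterArray = GC.GetAllocatedBytesForCurrentThread();

Span<char> fromSpanSlice = buffer.AsSpan()[0..10];  // pure slice, no allocation
long afterSpan = GC.GetAllocatedBytesForCurrentThread();

Console.WriteLine($"array slice: {afterArray - before} bytes allocated");
Console.WriteLine($"span slice : {afterSpan - afterArray} bytes allocated");
```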