skip to content
Relatively General .NET

Optimizing old code: StreamBitArray refactoring

by Oren Eini

posted on: August 16, 2024

RavenDB is a pretty old codebase, hitting 15+ years in production recently. In order to keep it alive & well, we make sure to follow the rule of always leaving the code in a better shape than we found it. Today’s tale is about the StreamBitArray class, deep in the guts of Voron, RavenDB’s storage engine. The class itself isn’t really that interesting, it is just an implementation of a Bit Array that we have for a bitmap. We wrote it (based on Mono’s code, it looks like) very early in the history of RavenDB and have never really touched it since.The last time anyone touched it was 5 years ago (fixing the namespace), 7 years ago we created an issue from a TODO comment, etc. Most of the code dates back to 2013, actually. And even then it was moved from a different branch, so we lost the really old history. To be clear, that class did a full tour of duty. For over a decade, it has served us very well. We never found a reason to change it, never got a trace of it in the profiler, etc. As we chip away at various hurdles inside RavenDB, I ran into this class and really looked at it with modern sensibilities. I think that this makes a great test case for code refactoring from the old style to our modern one.Here is what the class looks like:Already, we can see several things that really bug me. That class is only used in one context, to manage the free pages bitmap for Voron. That means we create it whenever Voron frees a page. That can happen a lot, as you might imagine. A single bitmap here covers 2048 pages, so when we create an instance of this class we also allocate an array with 64 ints. In other words, we need to allocate 312 bytes for each page we free. That isn’t fun, and it actually gets worse. Here is a typical example of using this class:using (freeSpaceTree.Read(section, out Slice result)) { sba = !result.HasValue ? new StreamBitArray() : new StreamBitArray(result.CreateReader()); } sba.Set((int)(pageNumber % NumberOfPagesInSection), true); using (sba.ToSlice(tx.Allocator, out Slice val)) freeSpaceTree.Add(section, val);And inside the ToSlice() call, we have:public ByteStringContext.InternalScope ToSlice(ByteStringContext context, ByteStringType type, out Slice str) { var buffer = ToBuffer(); var scope = context.From(buffer, 0, buffer.Length, type, out ByteString byteString); str = new Slice(byteString); return scope; } private unsafe byte[] ToBuffer() { var tmpBuffer = new byte[(_inner.Length + 1)*sizeof (int)]; unsafe { fixed (int* src = _inner) fixed (byte* dest = tmpBuffer) { *(int*) dest = SetCount; Memory.Copy(dest + sizeof (int), (byte*) src, tmpBuffer.Length - 1); } } return tmpBuffer; }In other words, ToSlice() calls ToBuffer(), which allocates an array of bytes (288 bytes are allocated here), copies the data from the inner buffer to a new one (using fixed on the two arrays, which is a performance issue all in itself) and then calls a method to do the actual copy. Then in ToSlice() itself we allocate it again in native memory, which we then write to Voron, and then discard the whole thing. In short, somehow it turns out that freeing a page in Voron costs us ~1KB of memory allocations. That sucks, I have to say. And the only reasoning I have for this code is that it is old. Here is the constructor for this class as well:public StreamBitArray(ValueReader reader) { SetCount = reader.ReadLittleEndianInt32(); unsafe { fixed (int* i = _inner) { int read = reader.Read((byte*)i, _inner.Length * sizeof(int)); if (read < _inner.Length * sizeof(int)) throw new EndOfStreamException(); } } }This accepts a reader to a piece of memory and does a bunch of things. It calls a few methods, uses fixed on the array, etc., all to get the data from the reader to the class. That is horribly inefficient. Let’s write it from scratch and see what we can do. The first thing to notice is that this is a very short-lived class, it is only used inside methods and never held for long. This usage pattern tells me that it is a good candidate to be made into a struct, and as long as we do that, we might as well fix the allocation of the array as well. Note that I have a hard constraint, I cannot change the structure of the data on disk for backward compatibility reasons. So only in-memory changes are allowed. Here is my first attempt at refactoring the code:public unsafe struct StreamBitArray { private fixed uint _inner[64]; public int SetCount; public StreamBitArray() { SetCount = 0; Vector256<uint>.Zero.StoreUnsafe(ref _inner[0]); Vector256<uint>.Zero.StoreUnsafe(ref _inner[8]); Vector256<uint>.Zero.StoreUnsafe(ref _inner[16]); Vector256<uint>.Zero.StoreUnsafe(ref _inner[24]); Vector256<uint>.Zero.StoreUnsafe(ref _inner[32]); Vector256<uint>.Zero.StoreUnsafe(ref _inner[40]); Vector256<uint>.Zero.StoreUnsafe(ref _inner[48]); Vector256<uint>.Zero.StoreUnsafe(ref _inner[56]); } public StreamBitArray(byte* ptr) { var ints = (uint*)ptr; SetCount = (int)*ints; var a = Vector256.LoadUnsafe(ref ints[1]); var b = Vector256.LoadUnsafe(ref ints[9]); var c = Vector256.LoadUnsafe(ref ints[17]); var d = Vector256.LoadUnsafe(ref ints[25]); var e = Vector256.LoadUnsafe(ref ints[33]); var f = Vector256.LoadUnsafe(ref ints[41]); var g = Vector256.LoadUnsafe(ref ints[49]); var h = Vector256.LoadUnsafe(ref ints[57]); a.StoreUnsafe(ref _inner[0]); b.StoreUnsafe(ref _inner[8]); c.StoreUnsafe(ref _inner[16]); d.StoreUnsafe(ref _inner[24]); e.StoreUnsafe(ref _inner[32]); f.StoreUnsafe(ref _inner[40]); g.StoreUnsafe(ref _inner[48]); h.StoreUnsafe(ref _inner[56]); } }That looks like a lot of code, but let’s see what changes I brought to bear here. Using a struct instead of a class saves us an allocation.Using a fixed array means that we don’t have a separate allocation for the buffer.Using [SkipLocalsInit] means that we ask the JIT not to zero the struct. We do that directly in the default constructor.We are loading the data from the ptr in the second constructor directly.The fact that this is a struct and using a fixed array means that we can create a new instance of this without any allocations, we just need 260 bytes of stack space (the 288 we previously allocated also included object headers).Let’s look at the actual machine code that these two constructors generate. Looking at the default constructor, we have:StreamBitArray..ctor() L0000: push ebp L0001: mov ebp, esp L0003: vzeroupper L0006: xor eax, eax L0008: mov [ecx+0x100], eax L000e: vxorps ymm0, ymm0, ymm0 L0012: vmovups [ecx], ymm0 L0016: vmovups [ecx+0x20], ymm0 L001b: vmovups [ecx+0x40], ymm0 L0020: vmovups [ecx+0x60], ymm0 L0025: vmovups [ecx+0x80], ymm0 L002d: vmovups [ecx+0xa0], ymm0 L0035: vmovups [ecx+0xc0], ymm0 L003d: vmovups [ecx+0xe0], ymm0 L0045: vzeroupper L0048: pop ebp L0049: retThere is the function prolog and epilog, but the code of this method uses 4 256-bit instructions to zero the buffer. If we were to let the JIT handle this, it would use 128-bit instructions and a loop to do it. In this case, our way is better, because we know more than the JIT.As for the constructor accepting an external pointer, here is what this translates into:StreamBitArray..ctor(Byte*) L0000: push ebp L0001: mov ebp, esp L0003: vzeroupper L0006: mov eax, [edx] L0008: mov [ecx+0x100], eax L000e: vmovups ymm0, [edx+4] L0013: vmovups ymm1, [edx+0x24] L0018: vmovups ymm2, [edx+0x44] L001d: vmovups ymm3, [edx+0x64] L0022: vmovups ymm4, [edx+0x84] L002a: vmovups ymm5, [edx+0xa4] L0032: vmovups ymm6, [edx+0xc4] L003a: vmovups ymm7, [edx+0xe4] L0042: vmovups [ecx], ymm0 L0046: vmovups [ecx+0x20], ymm1 L004b: vmovups [ecx+0x40], ymm2 L0050: vmovups [ecx+0x60], ymm3 L0055: vmovups [ecx+0x80], ymm4 L005d: vmovups [ecx+0xa0], ymm5 L0065: vmovups [ecx+0xc0], ymm6 L006d: vmovups [ecx+0xe0], ymm7 L0075: vzeroupper L0078: pop ebp L0079: retThis code is exciting to me because we are also allowing instruction-level parallelism. We effectively allow the CPU to execute all the operations of reading and writing in parallel. Next on the chopping block is this method:public int FirstSetBit() { for (int i = 0; i < _inner.Length; i++) { if (_inner[i] == 0) continue; return i << 5 | HighestBitSet(_inner[i]); } return -1; } private static int HighestBitSet(int v) { v |= v >> 1; // first round down to one less than a power of 2 v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; return MultiplyDeBruijnBitPosition[(uint)(v * 0x07C4ACDDU) >> 27]; }We are using vector instructions to scan 8 ints at a time, trying to find the first one that is set. Then we find the right int and locate the first set bit there. Here is what the assembly looks like:StreamBitArray.FirstSetBit() L0000: push ebp L0001: mov ebp, esp L0003: vzeroupper L0006: xor edx, edx L0008: cmp [ecx], cl L000a: vmovups ymm0, [ecx+edx*4] L000f: vxorps ymm1, ymm1, ymm1 L0013: vpcmpud k1, ymm0, ymm1, 6 L001a: vpmovm2d ymm0, k1 L0020: vptest ymm0, ymm0 L0025: jne short L0039 L0027: add edx, 8 L002a: cmp edx, 0x40 L002d: jl short L000a L002f: mov eax, 0xffffffff L0034: vzeroupper L0037: pop ebp L0038: ret L0039: vmovmskps eax, ymm0 L003d: tzcnt eax, eax L0041: add eax, edx L0043: xor edx, edx L0045: tzcnt edx, [ecx+eax*4] L004a: shl eax, 5 L004d: add eax, edx L004f: vzeroupper L0052: pop ebp L0053: retIn short, the code is simpler, shorter, and more explicit about what it is doing. The machine code that is running there is much tighter. And I don’t have allocations galore. This particular optimization isn’t about showing better numbers in a specific scenario that I can point to. I don’t think we ever delete enough pages to actually see this in a profiler output in such an obvious way. The goal is to reduce allocations and give the GC less work to do, which has a global impact on the performance of the system.

.NET 9 Preview 7 is now available!

by .NET Team

posted on: August 15, 2024

Try out the latest features in .NET 9 Preview 7 across the .NET runtime, SDK, libraries, ASP.NET Core, Blazor, C#, .NET MAUI, and more!

Useful resources to write Roslyn Analyzers

by Gérald Barré

posted on: August 12, 2024

This post is part of the series 'Roslyn Analyzers'. Be sure to check out the rest of the blog posts of the series!Writing a Roslyn analyzerWriting language-agnostic Roslyn Analyzers using IOperationWorking with types in a Roslyn analyzerReferencing an analyzer from a projectPackaging a Roslyn Analy

Improving RavenDB's Node.js bulk insert performance

by Oren Eini

posted on: August 08, 2024

During a performance evaluation internally, we ran into a strange situation. Our bulk insert performance using the node.js API was significantly worse than the performance of other clients. In particular, when we compared that to the C# version, we saw that the numbers were significantly worse than expected.To be fair, this comparison is made between our C# client, which has been through the wringer in terms of optimization and attention to performance, and the Node.js client. The focus of the Node.js client was on correctness and usability. It isn’t fair to expect the same performance from Node.js and C#, after all. However, that difference in performance was annoying enough to make us take a deeper look into what was going on.Here is the relevant code:const store = new DocumentStore('http://localhost:8080', 'bulk'); store.initialize(); const bulk = store.bulkInsert(); for (let i = 0; i < 100_000_000; i++) { await bulk.store(new User('user' + i)); } await bulk.finish();As you can see, the Node.js numbers are respectable. Running at a rate of over 85,000 writes per second is nothing to sneeze at.But I also ran the exact same test with the C# client, and I got annoyed. The C# client was able to hit close to 100,000 more writes per second than the Node.js client. And in both cases, the actual limit was on the client side, not on the server side. For fun, I ran a few clients and hit 250,000 writes/second without really doing much. The last time we properly tested ingest performance for RavenDB we achieved 150,000 writes/second. So it certainly looks like we are performing significantly better.Going back to the Node.js version, I wanted to know what exactly was the problem that we had there. Why are we so much slower than the C# version? It’s possible that this is just the limits of the node.js platform, but you gotta check to know. Node.js has an --inspect flag that you can use, and Chrome has a built-in profiler (chrome://inspect) that can plug into that. Using the DevTools, you can get a performance profile of a Node.js process.I did just that and go the following numbers:That is… curious. Really curious, isn’t it?Basically, none of my code appears here at all, most of the time is spent dealing with the async machinery. If you look at the code above, you can see that we are issuing an await for each document stored.  The idea with bulk insert is that under the covers, we split the writing to an in-memory buffer and the flushing of the buffer to the network. In the vast majority of cases, we’ll not do any async operations in the store() call. If the buffer is full, we’ll need to flush it to the network, and that may force us to do an actual await operation. In Node.js, awaiting an async function that doesn’t actually perform any async operation appears to be super expensive.We threw around a bunch of ideas on how to resolve this issue. The problem is that Node.js has no equivalent to C#’s ValueTask. We also have a lot of existing code out there in the field that we must remain compatible with.Our solution to this dilemma was to add another function that you can call, like so:for (let i = 0; i < 100_000_000; i++) { const user = new User('user' + i); const id = "users/" + i; if (bulk.tryStoreSync(user, id) == false) { await bulk.store(user, id); } }The idea is that if you call tryStoreSync() we’ll try to do everything in memory, but it may not be possible (e.g. if we need to flush the buffer). In that case, you’ll need to call the async function store() explicitly. Given that the usual reason for using the dedicated API for bulk insert is performance, this looks like a reasonable thing to ask. Especially when you can see the actual performance results. We are talking about over 55%(!!!) improvement in the performance of bulk insert.It gets even better. That was just the mechanical fix to avoid generating a promise per operation. While we are addressing this performance issue, there are a few other low-hanging fruits that could improve the bulk insert performance in Node.js. For example, it turns out that we pay a hefty cost to generate the metadata for all those documents (runtime reflection cost, mostly). We can generate it once and be done with it, like so:const bulk = store.bulkInsert(); const metadata = { "@collection": "Users", "Raven-Node-Type": "User" }; for (let i = 0; i < 100_000_000; i++) { const user = new User('user' + i); const id = "users/" + i; if (bulk.tryStoreSync(user, id, metadata) == false) { await bulk.store(user, id, metadata); } } await bulk.finish();And this code in particular gives us:That is basically near enough to the C#’s speed that I don’t think we need to pay more attention to performance. Overall, that was time very well spent in making things go fast.

Adding .NET Aspire to your existing .NET apps

by Jon Galloway

posted on: August 07, 2024

.NET Aspire can really simplify local development for your existing apps, large or small. In this post, we'll look at how easy it is to make your current solutions better with just a few lines of code.