Page 64 • Relatively General .NET

When and How to Use Blazor Components

by Ardalis

posted on: May 19, 2023

When and How to Use Blazor Components Blazor is a powerful framework for building web applications using C# instead of JavaScript. One of…Keep Reading →

Concurrent Hosted Service Start and Stop in .NET 8

by Steve Gordon

posted on: May 17, 2023

In this post, I will describe a new feature of the Microsoft.Extensions.Hosting library coming in .NET 8 (available since preview 4) affecting hosted services. Let’s first begin with a brief recap of hosted services. The hosting library for .NET, used in both the ASP.NET Core project template and the Worker Service template, provides the capability […]

Using Octokit.GraphQL to interact with the GitHub discussions API

by Andrew Lock

posted on: May 16, 2023

In this post I describe a .NET script/tool I created to migrate the comments on my blog posts from my legacy Disqus account to GitHub discussions…

Generate large files using PowerShell

by Gérald Barré

posted on: May 15, 2023

If you need to create a large file with random data for testing purposes, you can use PowerShell or PowerShell Core to quickly generate it.PowerShellcopy$path = Join-Path $pwd "test.txt" $size = 1GB $content = New-Object byte[] $size (New-Object System.Random).NextBytes($content) # Set-Content is

Avoid Proliferating DbContext or IQueryable in .NET Apps

by Ardalis

posted on: May 10, 2023

Most .NET apps use EF Core and a DbContext for data access, but maintainability can suffer when the use of a DbContext or an IQueryable…Keep Reading →

Migrating comments from Disqus to giscus

by Andrew Lock

posted on: May 09, 2023

In this post I describe a .NET script/tool I created to migrate the comments on my blog posts from my legacy Disqus account to GitHub discussions…

How to Control Visual Studio from an external application

by Gérald Barré

posted on: May 08, 2023

There are multiple use cases where you need to get information from running instances of Visual Studio. For instance, if you create a git client, you may suggest the repositories that correspond to the opened solution, or you want to know if a file is not saved in the editor and show a warning. May

Bug chasing, narrowing down the scope

by Oren Eini

posted on: May 05, 2023

I just completed a major refactoring of a piece of code inside RavenDB that is responsible for how we manage sorted queries. The first two tiers of tests all passed, which is great. Now was the time to test how this change performed. I threw 50M records into RavenDB and indexed them. I did… not like the numbers I got back. It makes sense, since I was heavily refactoring to get a particular structure, I could think of a few ways to improve performance, but I like doing this based on profiler output. When running the same scenario under the profiler, the process crashed. That is… quite annoying, as you can imagine. In fact, I discovered a really startling issue. If I index the data and query on it, I get the results I expect. If I restart the process and run the same query, I get an ExecutionEngineException. Trying to debug those is a PITA. In this case, I’m 100% at fault, we are doing a lot of unsafe things to get better performance, and it appears that I messed up something along the way. But my only reproduction is a 50M records dataset. To give some context, this means 51GB of documents to be indexed and 18 GB of indexing. Indexing this in release mode takes about 20 minutes. In debug mode, it takes a lot longer. Trying to find an error there, especially one that can only happen after you restart the process is going to be a challenging task. But this isn’t my first rodeo. Part of good system design is knowing how to address just these sorts of issues. The indexing process inside RavenDB is single-threaded per index. That means that we can rule out a huge chunk of issues around race conditions. It also means that we can play certain tricks. Allow me to present you with the nicest tool for debugging that you can imagine: repeatable traces. Here is what this looks like in terms of code: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters private readonly struct IndexOperationsDumper : IDisposable { #if GENERATE_INDEX_ACTION_TRACE private readonly FileStream _fs; private readonly BinaryWriter _bw; public IndexOperationsDumper(IndexFieldsMapping fieldsMapping) { _fs = File.OpenWrite("index.bin-log"); _fs.Position = _fs.Length; _bw = new BinaryWriter(_fs); if (_fs.Length == 0) { _bw.Write7BitEncodedInt(fieldsMapping.Count); for (int i = 0; i < fieldsMapping.Count; i++) { IndexFieldBinding indexFieldBinding = fieldsMapping.GetByFieldId(i); _bw.Write(indexFieldBinding.FieldName.ToString()); } } } public void Index(string id, Span<byte> data) { _bw.Write(id); _bw.Write7BitEncodedInt(data.Length); _bw.Write(data); } public void Commit() { _bw.Write("!Commit!"); } public void Dispose() { _bw.Dispose(); } #else public IndexOperationsDumper(IndexFieldsMapping fieldsMapping) { } public void Index(string id, Span<byte> data) { } public void Commit() { } public void Dispose() { } #endif } view raw IndexOperationsDumper.cs hosted with ❤ by GitHub In this case, you can see that this is a development only feature, so it is really bare-bones one. What it does is capture the indexing and commit operations on the system and write them to a file. I have another piece of similarly trivial code that reads and applies them, as shown below. Don’t bother to dig into that, the code itself isn’t really that interesting. What is important is that I have captured the behavior of the system and can now replay it at will. This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters [Fact] public void ShouldNotCorrupt() { RequireFileBasedPager(); using var stream = File.OpenRead(@"/path/to/index.bin-log"); using var br = new BinaryReader(stream); using var bsc = new ByteStringContext(SharedMultipleUseFlag.None); using var builder = IndexFieldsMappingBuilder.CreateForWriter(false); var fieldsCount = br.Read7BitEncodedInt(); for (int i = 0; i < fieldsCount; i++) { var name = br.ReadString(); builder.AddBinding(i, name); } var fields = builder.Build(); int txns = 0; int items = 0; var wtx = Env.WriteTransaction(); try { var iw = new IndexWriter(wtx, fields); while (true) { string id; try { id = br.ReadString(); } catch (EndOfStreamException) { iw.Commit(); iw.Dispose(); wtx.Commit(); wtx.Dispose(); break; } if (id == "!Commit!") { FlushIndexAndRenewWriteTransaction(); continue; } int len = br.Read7BitEncodedInt(); var buffer = br.ReadBytes(len); iw.Index(id, buffer); items++; } using (var rtx = Env.ReadTransaction()) QueryAll(rtx); void FlushIndexAndRenewWriteTransaction() { iw.Commit(); iw.Dispose(); wtx.Commit(); wtx.Dispose(); StopDatabase(); StartDatabase(); using (var rtx = Env.ReadTransaction()) QueryAll(rtx); txns++; Console.WriteLine(txns + " With " + items); items = 0; wtx = Env.WriteTransaction(); iw = new IndexWriter(wtx, fields); } } finally { wtx.Dispose(); } StopDatabase(); StartDatabase(); using (var rtx = Env.ReadTransaction()) QueryAll(rtx); void QueryAll(Transaction rtx) { var matches = new long[1024]; for (int i = 0; i < fieldsCount; i++) { var searcher = new IndexSearcher(rtx, fields); var field = FieldMetadata.Build(fields.GetByFieldId(i).FieldName, default, i, default, default); var sort = searcher.OrderBy(searcher.AllEntries(), new OrderMetadata(field, true, MatchCompareFieldType.Sequence)); while (sort.Fill(matches) != 0) { } } } } view raw ShouldNotCorrupt.cs hosted with ❤ by GitHub The code itself isn’t much, but it does the job. What is more important, note that we have calls to StopDatabase() and StartDatabase(), I was able to reproduce the crash using this code. That was a massive win, since it dropped my search area from 50M documents to merely 1.2 million. The key aspect of this is that I now have a way to play around with things. In particular, instead of using the commit points in the trace, I can force a commit (and start / stop the database) every 10,000 items (by calling FlushIndexAndRenewWriteTransaction). When using that, I can reproduce this far faster. Here is the output when I run this in release mode: 1 With 0 2 With 10000 3 With 10000 4 With 10000 5 With 10000 6 With 10000 7 With 10000 8 With 10000 9 With 10000 10 With 10000 11 With 10000 Fatal error. Internal CLR error. (0x80131506) So now I dropped the search area to 120,000 items, which is pretty awesome. Even more important, when I run this in debug mode, I get this: 1 With 0 2 With 10000 Process terminated. Assertion failed. at Voron.Data.Containers.Container.Get(Low... So now I have a repro in 30,000 items, what is even better, a debug assertion was fired, so I have a really good lead into what is going on. The key challenge in this bug is that it is probably triggered as a result of a commit and an index of the next batch. There is a bunch of work that we do around batch optimizations that likely cause this sort of behavior. By being able to capture the input to the process and play with the batch size, we were able to reduce the amount of work required to generate a reproduction from 50M records to 30,000 and have a lead into what is going on. With that, I can now start applying more techniques to narrow down what is going on. But by far the most important aspect as far as I’m concerned is the feedback cycle. I can now hit F5 to run the code and encounter the problem in a few seconds. It looks like we hit a debug assertion because we keep a reference to an item that was already freed. That is really interesting, and now I can find out which item and then figure out why this is the case. And at each point, I can simply go one step back in the investigation, and reproduce the state, I have to hit F5 and wait a bit. This means that I can be far more liberal in how I figure out this bug. This is triggered by a query on the indexed data, and if I follow up the stack, I have: This is really interesting, I wonder… what happens if I query before I restart the database? With this structure, this is easy to do. This is actually a big relief. I had no idea why restarting the database would cause us to expose this bug. Another thing to note is that when I ran into the problem, I reproduced this on a query that sorted on a single field. In the test code, I’m testing on all fields, so that may be an asset in exposing this faster. Right now the problem reproduces on the id field, which is unique. That helps, because it removes a large swath of code that deals with multiple terms for an entry. The current stage is that I can now reproduce this issue without running the queries, and I know exactly where it goes wrong. And I can put a breakpoint on the exact location where this entry is created: By the way, note that I’m modifying the code instead of using a conditional breakpoint. This is because of the performance difference. For a conditional breakpoint, the debugger has to stop execution, evaluate the condition and decide what to do. If this is run a lot, it can have a huge impact on performance. Easier to modify the code. The fact that I can do that and hit F5 and get to the same state allows me to have a lot more freedom in the ergonomics of how I work. This allows me to discover that the entry in question was created during the second transaction. But the failure happens during the third, which is really interesting. More to the point, it means that I can now do this: With the idea that this will trigger the assert on the exact entry that cause the problem. This is a good idea, and I wish that it worked, but we are actually doing a non-trivial amount of work during the commit process, so now we have a negative feedback and another clue. This is happening in the commit phase of the indexing process. It’s not a big loss, I can do the same in the commit process as well. I have done just that and now I know that I have a problem when indexing the term: “tweets/1212163952102137856”. Which leads to this code: And at this point, I can now single step through this and figure out what is going on, I hope. When working on complex data structures, one of the things that you need to do is to allow to visualize them. Being able to manually inspect the internal structure of your data structures can save you a lot of debugging. As I mentioned, this isn’t my first rodeo. So when I narrowed it down to a specific location, I started looking into exactly what is going on. Beforehand, I need to explain a couple of terms (pun intended): tweets/1212163952102137856 – this is the entry that triggers the error. tweets/1212163846623727616 – this is the term that should be returned for 1679560 Here is what the structure looks like at the time of the insert: You can notice that the value here for the last page is the same as the one that we are checking for 1679560. To explain what is going on will take us down a pretty complex path that you probably don’t care about, but the situation is that we are keeping track of the id in two locations. Making sure to add and remove it in both locations as appropriate. However, at certain points, we may decide to shuffle things around inside the tree, and we didn’t sync that up properly with the rest of the system, leading to a dangling reference. Now that I know what is going on, I can figure out how to fix it. But the story of this post was mostly about how I figured it out, not the bug itself. The key aspect was to get to the point where I can reproduce this easily, so I can repeat it as many times that is needed to slowly inch closer to the solution.

Bug chasing, the process is more important than the result

by Oren Eini

posted on: May 04, 2023

I’m doing a pretty major refactoring inside of RavenDB right now. I was able to finish a bunch of work and submitted things to the CI server for testing. RavenDB has several layers of tests, which we run depending on context. During development, we’ll usually run the FastTests. About 2,300 tests are being run to validate various behaviors for RavenDB, and on my machine, they take just over 3 minutes to complete. The next tier is the SlowTests, which run for about 3 hours on the CI server and run about 26,000 tests. Beyond that we actually have a few more layers, like tests that are being run only on the nightly builds and stress tests, which can take several minutes each to complete. In short, the usual process is that you write the code and write the relevant tests. You also validate that you didn’t break anything by running the FastTests locally. Then we let CI pick up the rest of the work. At the last count, we had about 9 dedicated machines as CI agents and given our workload, an actual full test run of a PR may complete the next day. I’m mentioning all of that to explain that when I reviewed the build log for my PR, I found that there were a bunch of tests that failed. That was reasonable, given the scope of my changes. I sat down to grind through them, fixing them one at a time. Some of them were quite important things that I didn’t take into account, after all. For example, one of the tests failed because I didn’t account for sorting on a dynamic numeric field. Sorting on a numeric field worked, and a dynamic text field also worked. But dynamic numeric field didn’t. It’s the sort of thing that I would never think of, but we got the tests to cover us. But when I moved to the next test, it didn’t fail. I ran it again, and it still didn’t fail. I ran it in a loop, and it failed on the 5th iteration. That… sucked. Because it meant that I had a race condition in there somewhere. I ran the loop again, and it failed again on the 5th. In fact, in every iteration I tried, it would only fail on the 5th iteration. When trying to isolate a test failure like that, I usually run that in a loop, and hope that with enough iterations, I’ll get it to reproduce. Having it happen constantly on the 5th iteration was… really strange. I tried figuring out what was going on, and I realized that the test was generating 1000 documents using a random. The fact that I’m using Random is the reason it is non-deterministic, of course, except that this is the code inside my test base class: So this is properly initialized with a seed, so it will be consistent. Read the code again, do you see the problem? That is a static value. So there are two problems here: I’m getting the bad values on the fifth run in a consistent manner because that is the set of results that reproduce the error. This is a shared instance that may be called from multiple tests at once, leading to the wrong result because Random is not thread safe. Before fixing this issue so it would run properly, I decided to use an ancient debugging technique. It’s called printf(). In this case, I wrote out all the values that were generated by the test and wrote a new test to replay them. That one failed consistently. The problem was that it was still too big in scope. I iterated over this approach, trying to end up with a smaller section of the codebase that I could invoke to repeat this issue. That took most of the day. But the end result is a test like this: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters [Fact] public void CanAddAndRemoveItems() { using (var wtx = Env.WriteTransaction()) { var c = wtx.OpenContainer("test"); var items = Items; for (int i = 0; i < items.Length; i++) { switch (items[i].Item1) { case '+': Container.Allocate(wtx.LowLevelTransaction, c, items[i].Item2, out _); break; case '-': Container.Delete(wtx.LowLevelTransaction, c, items[i].Item2); break; } } wtx.Commit(); } } private (char, int)[] Items = new[] { ('+', 32), ('+', 64), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('-', 16416), ('+', 64), ('-', 16424), ('+', 96), ('-', 16432), ('+', 64), ('-', 16440), ('+', 64), ('-', 16448), ('+', 96), ('-', 16456), ('+', 96), ('-', 16464), ('+', 96), ('-', 16472), ('+', 64), ('-', 16480), ('+', 64), ('-', 16488), ('+', 96), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('-', 16528), ('+', 64), ('-', 16536), ('+', 64), ('-', 16584), ('+', 64), ('-', 16560), ('+', 64), ('-', 16520), ('+', 64), ('-', 16592), ('+', 64), ('-', 16544), ('+', 64), ('-', 16416), ('+', 96), ('-', 16424), ('+', 128), ('-', 16432), ('+', 96), ('-', 16440), ('+', 96), ('-', 16448), ('+', 128), ('-', 16456), ('+', 128), ('-', 16472), ('+', 128), ('-', 16480), ('+', 96), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('-', 16520), ('+', 96), ('-', 16568), ('+', 64), ('-', 16536), ('+', 96), ('-', 16552), ('+', 64), ('-', 16576), ('+', 64), ('-', 16416), ('+', 128), ('-', 16432), ('+', 128), ('-', 16440), ('+', 128), ('-', 16448), ('+', 160), ('-', 16456), ('+', 160), ('-', 16464), ('+', 160), ('-', 16472), ('+', 160), ('-', 16480), ('+', 128), ('-', 16488), ('+', 128), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('+', 32), ('-', 16592), ('+', 96), ('-', 16528), ('+', 96), ('-', 16584), ('+', 96), ('-', 16416), ('+', 160), ('-', 16424), ('+', 192), ('-', 16432), ('+', 160), ('-', 16440), ('+', 160), ('-', 16448), ('+', 192), ('-', 16456), ('+', 192), ('-', 16464), ('+', 192), ('-', 16472), ('+', 192), ('-', 16488), ('+', 160), ('+', 32), }; view raw SmallTest.cs hosted with ❤ by GitHub As you can see, in terms of the amount of code that it invokes, it is pretty minimal. Which is pretty awesome, since that allowed me to figure out what the problem was: I’ve been developing software professionally for over two decades at this point. I still get caught up with things like that, sigh.

Fight for every byte it takes

by Oren Eini

posted on: May 01, 2023

In this series so far, we reduced the storage cost of key/value lookups by a lot. And in the last post we optimized the process of encoding the keys and values significantly. This is great, but the toughest challenge is ahead of us, because as much as encoding efficiency matters, the absolute cost we have is doing lookups. This is the most basic operation, which we do billions of times a second. Any amount of effort we’ll spend here will be worth it. That said, let’s look at the decoding process we have right now. It was built to be understandable over all else, so it is a good start. This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters private static (long Key, long Val) Decode(Span<byte> buffer, ushort offset) { var offsetInPage = ((offset & 0xFFF0) >> 3); var (keyLen, valLen) = (offset & 0xF) switch { 1 => (1, 1), 2 => (2, 2), 3 => (3, 3), 4 => (4, 4), 5 => (2, 1), 6 => (2, 3), 7 => (2, 4), 8 => (3, 2), 9 => (3, 4), 10 => (3, 5), 11 => (4, 2), 12 => (4, 3), 13 => (5, 4), 14 => (5, 5), 15 => (4, 5), _ => (buffer[offsetInPage] >> 4, buffer[offsetInPage++] & 0xF), // note, inc offsetInPage }; long k = 0; long v = 0; var entry = buffer.Slice(offsetInPage); entry[..keyLen].CopyTo(MemoryMarshal.AsBytes(new Span<long>(ref k))); entry[(keyLen)..(keyLen + valLen)].CopyTo(MemoryMarshal.AsBytes(new Span<long>(ref v))); return (k, v); } view raw Decode.cs hosted with ❤ by GitHub What this code does is to accept a buffer and an offset into the buffer. But the offset isn’t just a number, it is composed of two values. The first 12 bits contain the offset in the page, but since we use 2-byte alignment for the entry position, we can just assume a zero bit at the start. That is why we compute the actual offset in the page by clearing the first four bits and then shifting left by three bits. That extracts the actual offset to the file, (usually a 13 bits value) using just 12 bits. The first four bits in the offset are the indicator for the key and value lengths. There are 15 known values, which we computed based on probabilities and one value reserved to say: Rare key/val length combination, the actual sizes are stored as the first byte in the entry. Note that in the code, we handle that scenario by reading the key and value lengths (stored as two nibbles in the first byte) and incrementing the offset in the page. That means that we skip past the header byte in those rare situations. The rest of the code is basically copying the key and value bytes to the relevant variables, taking advantage of partial copy and little-endian encoding. The code in question takes 512 bytes and has 23 branches. In terms of performance, we can probably do much better, but the code is clear in what it is doing, at least. The first thing I want to try is to replace the switch statement with a lookup table, just like we did before. Here is what the new version looks like: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters static ReadOnlySpan<byte> DecodeTable => new byte[16] { (0 /*rare*/), (1 << 4 | 1), (2 << 4 | 2), (3 << 4 | 3), (4 << 4 | 4), (2 << 4 | 1), (2 << 4 | 3), (2 << 4 | 4), (3 << 4 | 2), (3 << 4 | 4), (3 << 4 | 5), (4 << 4 | 2), (4 << 4 | 3), (5 << 4 | 4), (5 << 4 | 5), (4 << 4 | 5), }; private static (long Key, long Val) Decode(Span<byte> buffer, ushort offset) { var offsetInPage = ((offset & 0xFFF0) >> 3); var indicator = DecodeTable[offset & 0xF]; if (indicator == 0) { indicator = buffer[offsetInPage++]; } var keyLen = indicator >> 4; var valLen = indicator & 0xF; long k = 0; long v = 0; var entry = buffer.Slice(offsetInPage); entry[..keyLen].CopyTo(MemoryMarshal.AsBytes(new Span<long>(ref k))); entry[(keyLen)..(keyLen + valLen)].CopyTo( MemoryMarshal.AsBytes(new Span<long>(ref v))); return (k, v); } view raw Decode.cs hosted with ❤ by GitHub The size of the function dropped by almost half and we have only 7 more branches involved. There are also a couple of calls to the memory copy routines that weren’t inlined. In the encoding phase, we reduced branches due to bound checks using raw pointers, and we skipped the memory calls routines by copying a fixed size value at varied offsets to be able to get the data properly aligned. In this case, we can’t really do the same. One thing that we have to be aware of is the following situation: In other words, we may have an entry that is at the end of the page, if we’ll try to read unconditionally 8 bytes, we may read past the end of the buffer. That is not something that we can do. In the Encode() case, we know that the caller gave us a buffer large enough to accommodate the largest possible size, so that isn’t an issue. That complicates things, sadly, but we can go the other way around. The Decode() function will always be called on an entry, and that is part of the page. The way we place entries means that we are starting at the top and moving down. The structure of the page means that we can never actually place an entry below the first 8 bytes of the page. That is where the header and the offsets array are going, after all. Given that, we can do an unconditional read backward from the entry. As you can see in the image below, we are reading some data that we don’t care about, but this is fine, we can fix it later, and without any branches. The end result is that we can have the following changes: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters private unsafe static (long Key, long Val) Decode(byte* buffer, ushort offset) { var offsetInPage = ((offset & 0xFFF0) >> 3); var indicator = DecodeTable[offset & 0xF]; if (indicator == 0) { indicator = buffer[offsetInPage++]; } var keyLen = indicator >> 4; var valLen = indicator & 0xF; long k = ReadBackward(buffer, offsetInPage, keyLen); long v = ReadBackward(buffer, offsetInPage + keyLen, valLen); return (k, v); long ReadBackward(byte* buffer, int offsetInPage, int len) { var shift = 8 - len; var l = Unsafe.ReadUnaligned<long>(ref buffer[offsetInPage - shift]); l >>>= shift * 8; return l; } } view raw Decode.cs hosted with ❤ by GitHub I changed the code to use a raw pointer, avoiding bound checks that we already reasoned about. Most interesting is the ReadBackward function. This is an inner function, and was properly inlined during JIT compilation, it implements the backward reading of the value. Here is what the assembly looks like: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters L0041: mov r10d, r9d L0044: neg r10d L0047: add r10d, 8 L004b: mov r11d, r8d L004e: sub r11d, r10d L0051: movsxd r11, r11d L0054: mov r11, [r11+rdx] L0058: shl r10d, 3 L005c: shrx r10, r11, r10 view raw ReadBackward.asm hosted with ❤ by GitHub With this in place, we are now at 133 bytes and a single branch operation. That is pretty awesome, but we can do better still. Consider the following code (explanations to follow): This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters static ReadOnlySpan<byte> DecodeTable => new byte[16] { ( 1 << 3), // special! ( 1 << 4 | 1), ( 2 << 4 | 2), ( 3 << 4 | 3), ( 4 << 4 | 4), ( 2 << 4 | 1), ( 2 << 4 | 3), ( 2 << 4 | 4), ( 3 << 4 | 2), ( 3 << 4 | 4), ( 3 << 4 | 5), ( 4 << 4 | 2), ( 4 << 4 | 3), ( 5 << 4 | 4), ( 5 << 4 | 5), ( 4 << 4 | 5), }; public unsafe static void Decode(byte* buffer, ushort offset, out long key, out long val) { var offsetInPage = ((offset & 0xFFF0) >> 3); var decoderByte = DecodeTable[offset & 0xF]; var offload = decoderByte & 8; byte decodeParts = (byte)((buffer[offsetInPage] << 8 | decoderByte) >> offload); int keyLen = (decodeParts >> 4); int valLen = (decodeParts & 0xF); offload >>= 3; key = ReadBackward(buffer, offsetInPage + offload, keyLen); val = ReadBackward(buffer, offsetInPage + offload + keyLen, valLen); long ReadBackward(byte* buffer, int offsetInPage, int len) { var shift = 8 - len; var l = Unsafe.ReadUnaligned<long>(ref buffer[offsetInPage - shift]); l >>>= shift * 8; return l; } } view raw Decode.cs hosted with ❤ by GitHub Note that the first element in the table here is different, it is now setting the 4th bit. This is because we are making use of that. The structure of the bytes in the table are two nibbles, but no other value in the table sets the 4th bit. That means that we can operate on that. Indeed, what we are doing is use the decoder byte to figure out what sort of shift we want. We have the byte from the table and the byte from the buffer. And we use the fact that masking this with 8 gives (just for this value) the value of 8. We can then use that to select the appropriate byte. If we have an offloaded byte, then we’ll shift the value by 8, getting the byte from the buffer. For any other value, we’ll get 0 as the shift index, resulting in us getting the value from the table. That gives us a function with zero branches, and 141 bytes. I spent a lot of time thinking about this, so now that we have those two approaches, let's benchmark them. The results were surprising: | Method | Mean | Error | StdDev | |------------------------ |-----------:|---------:|---------:| | DecodeBranchlessShifts | 2,107.1 ns | 20.69 ns | 18.34 ns | | DecodeBranchy | 936.2 ns | 1.89 ns | 1.68 ns | It turns out that the slightly smaller code with the branches is able to beat up the branchless code. When looking into what they are doing, I think that I can guess why. Branches aren’t a huge problem if they are predictable, and in our case, the whole point of all of this is that the offload process where we need to go to the entry to get the value is meant to be a rare event. In branchless code, on the other hand, you have to do something several times to avoid a branch (like shifting the value from the buffer up and maybe shifting it down, etc). You can’t really argue with a difference like that. We also tried an AVX version, to see if this would have better performance. It turns out that there is really no way for us to beat the version with the single branch. Everything else was at least twice as slow. At this point, I believe that we have a winner.