Relatively General .NET

Filtering negative numbers, fast

by Oren Eini

posted on: September 12, 2023

In the previous post, we looked into what it would take to reduce the cost of filtering negative numbers. We got into the assembly and analyzed exactly what was going on. At that level, I don't think that even hand-optimized assembly would take us further. Let's see if there are other options available to us to get better speed. The first thing that comes to mind is loop unrolling. After all, we have a very tight loop; if we can do more work per loop iteration, we might get better performance, no? Here is my first version:

```csharp
public static int FilterCmp_Unroll(Span<long> items)
{
    if (items.Length <= 0)
        return 0;

    int outputIdx = 0;
    int i = 0;
    ref var output = ref items[i];
    for (; i + 4 < items.Length; i += 4)
    {
        var i0 = items[i + 0];
        var i1 = items[i + 1];
        var i2 = items[i + 2];
        var i3 = items[i + 3];
        AddToOutput(ref output, i0);
        AddToOutput(ref output, i1);
        AddToOutput(ref output, i2);
        AddToOutput(ref output, i3);
    }
    for (; i < items.Length; i++)
    {
        AddToOutput(ref output, items[i]);
    }
    return outputIdx;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    void AddToOutput(ref long output, long input)
    {
        if (input < 0)
            return;
        Unsafe.Add(ref output, outputIdx++) = input;
    }
}
```

And here are the benchmark results:

| Method | N | Mean | Error | StdDev | Ratio | Code Size |
|---|---|---|---|---|---|---|
| FilterCmp | 23 | 274.6 ns | 0.40 ns | 0.35 ns | 1.00 | 411 B |
| FilterCmp_Unroll | 23 | 257.5 ns | 0.94 ns | 0.83 ns | 0.94 | 606 B |
| FilterCmp | 1047 | 748.1 ns | 2.91 ns | 2.58 ns | 1.00 | 411 B |
| FilterCmp_Unroll | 1047 | 702.5 ns | 5.23 ns | 4.89 ns | 0.94 | 606 B |
| FilterCmp | 1048599 | 501,545.2 ns | 4,985.42 ns | 4,419.45 ns | 1.00 | 411 B |
| FilterCmp_Unroll | 1048599 | 446,311.1 ns | 3,131.42 ns | 2,929.14 ns | 0.89 | 606 B |
| FilterCmp | 33554455 | 29,637,052.2 ns | 184,796.17 ns | 163,817.00 ns | 1.00 | 411 B |
| FilterCmp_Unroll | 33554455 | 29,275,060.6 ns | 145,756.53 ns | 121,713.31 ns | 0.99 | 606 B |

That is quite a jump; 6% – 11% savings is no joke. Let's look at what is actually going on at the assembly level and see if we can optimize this further. As expected, the code size is bigger, 264 bytes versus the 55 we previously got. But more importantly, we got the range checks back, and a lot of them:

```
lea r11d,[r8+1]      ; var i01_idx = i + 1
cmp r11d,ecx         ; if i01_idx >= items.Length
jae near ptr M02_L10 ; jump to the range check failure
```

The JIT isn't able to reason about our first for loop and see that all our accesses are within bounds, which leads to a lot of range checks and likely slows us down. Even with that, we are still showing significant improvements here. Let's see what we can do about it:

```csharp
for (; i + 4 < items.Length; i += 4)
{
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 0));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 1));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 2));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 3));
}
```

With that, we expect to have no range checks and still be able to benefit from the unrolling.
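To make the unrolling idea concrete outside of .NET, here is a minimal C sketch of the same scalar transformation: process four items per iteration, then mop up the remainder in a plain loop. The function name and layout are mine, not the post's code; it mirrors the structure, not the JIT's output.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact the non-negative values to the front of the array, four items
   per iteration, and return how many values were kept. A sketch of the
   unrolled loop described in the post, not the author's exact code. */
size_t filter_unroll4(int64_t *items, size_t len)
{
    size_t out = 0, i = 0;
    for (; i + 4 <= len; i += 4) {
        int64_t i0 = items[i + 0], i1 = items[i + 1];
        int64_t i2 = items[i + 2], i3 = items[i + 3];
        if (i0 >= 0) items[out++] = i0;
        if (i1 >= 0) items[out++] = i1;
        if (i2 >= 0) items[out++] = i2;
        if (i3 >= 0) items[out++] = i3;
    }
    for (; i < len; i++)              /* handle the remainder */
        if (items[i] >= 0) items[out++] = items[i];
    return out;
}
```

The remainder loop is why the benchmark sizes are "1K + 23" and so on: the tail exercises the non-unrolled path.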
| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| FilterCmp | 23 | 275.4 ns | 2.31 ns | 2.05 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_Unroll | 23 | 253.6 ns | 2.59 ns | 2.42 ns | 0.92 | 0.01 | 563 B |
| FilterCmp | 1047 | 741.6 ns | 5.95 ns | 5.28 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_Unroll | 1047 | 665.5 ns | 2.38 ns | 2.22 ns | 0.90 | 0.01 | 563 B |
| FilterCmp | 1048599 | 497,624.9 ns | 3,904.39 ns | 3,652.17 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_Unroll | 1048599 | 444,489.0 ns | 2,524.45 ns | 2,361.38 ns | 0.89 | 0.01 | 563 B |
| FilterCmp | 33554455 | 29,781,164.3 ns | 361,625.63 ns | 320,571.70 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_Unroll | 33554455 | 29,954,093.9 ns | 588,614.32 ns | 916,401.59 ns | 1.01 | 0.04 | 563 B |

That helped, by quite a lot it seems, for most cases. The 32M items case, however, was slightly slower, which is quite a surprise. Looking at the assembly, I can see that we still have branches, like so:

```
M02_L03:
    lea r9d,[r8+2]
    movsxd r9,r9d
    mov r9,[rdx+r9*8]
    test r9,r9           ; if items[i+2] < 0:
    jl short M02_L04     ;     continue
    mov r10d,eax
    lea eax,[r10+1]
    movsxd r10,r10d
    mov [rdx+r10*8],r9
M02_L04:
    lea r9d,[r8+3]
    movsxd r9,r9d
    mov r9,[rdx+r9*8]
    test r9,r9           ; if items[i+3] < 0:
    jl short M02_L05     ;     continue
    mov r10d,eax
    lea eax,[r10+1]
    movsxd r10,r10d
    mov [rdx+r10*8],r9
```

And here is why this is the case:

```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
void AddToOutput(ref long output, long input)
{
    if (input < 0)
        return;
    Unsafe.Add(ref output, outputIdx++) = input;
}
```

Now, can we do better here? It turns out that we can, by using a branchless version of the operation. Here is another way to write the same thing (note that the index must advance only for non-negative values, matching the assembly below):

```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
unsafe void AddToOutput(ref long output, long input)
{
    bool isNonNegative = (input >= 0);
    Unsafe.Add(ref output, outputIdx) = input;
    outputIdx += *(byte*)&isNonNegative; // 1 when input >= 0, otherwise 0
}
```

What happens here is that we unconditionally set the value in the array, but only increment the index if the value is greater than or equal to zero. That saves us branches and will likely result in less code. In fact, let's see what sort of assembly the JIT will output:

```
movsxd r9,r8d       ; r9 = i
mov r9,[rdx+r9*8]   ; r9 = items[i]
mov r10,r9          ; copy the item to the r10 temporary
not r10             ; flip the bits, important for the next instruction
shr r10,3F          ; r10 = ~items[i] >> 63 (1 if the item >= 0, otherwise 0)
movsxd r11,eax      ; r11 = outputIdx
mov [rdx+r11*8],r9  ; items[outputIdx] = items[i]
add eax,r10d        ; outputIdx += (items[i] < 0 ? 0 : 1)
lea r9d,[r8+1]      ; i + 1
movsxd r9,r9d
mov r9,[rdx+r9*8]   ; r9 = items[i + 1]
mov r10,r9          ; and the same thing again for the next item...
not r10
shr r10,3F
movsxd r11,eax
mov [rdx+r11*8],r9
add eax,r10d
; ... the same pattern repeats for the remaining items in the unrolled block
```

What about the performance? I decided to pit the two versions (normal and branchless) head to head and see what this will give us:

| Method | N | Mean | Error | StdDev | Ratio | Code Size |
|---|---|---|---|---|---|---|
| FilterCmp_Unroll | 23 | 276.3 ns | 4.13 ns | 3.86 ns | 1.00 | 411 B |
| FilterCmp_Unroll_Branchleses | 23 | 263.6 ns | 0.95 ns | 0.84 ns | 0.96 | 547 B |
| FilterCmp_Unroll | 1047 | 743.7 ns | 9.41 ns | 8.80 ns | 1.00 | 411 B |
| FilterCmp_Unroll_Branchleses | 1047 | 733.3 ns | 3.54 ns | 3.31 ns | 0.99 | 547 B |
| FilterCmp_Unroll | 1048599 | 502,631.1 ns | 3,641.47 ns | 3,406.23 ns | 1.00 | 411 B |
| FilterCmp_Unroll_Branchleses | 1048599 | 495,590.9 ns | 335.33 ns | 297.26 ns | 0.99 | 547 B |
| FilterCmp_Unroll | 33554455 | 29,356,331.7 ns | 207,133.86 ns | 172,966.15 ns | 1.00 | 411 B |
| FilterCmp_Unroll_Branchleses | 33554455 | 29,709,835.1 ns | 86,129.58 ns | 71,922.10 ns | 1.01 | 547 B |

Surprisingly enough, the branchless version is very slightly slower. That is a surprise, since I would expect reducing the branches to be more efficient. Looking at the assembly of the two, the branchless version is slightly bigger (10 bytes, not that meaningful). I think that the key here is that there is a 0.5% chance of actually hitting the branch, which is pretty low. That means that the branch predictor can likely do a really good job, and we aren't going to see any big benefits from the branchless version. That said… what would happen if we tested that with 5% negatives? That difference in behavior may cause us to see a different result.
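The `not` + `shr` pair the JIT emitted is a classic sign-bit trick, and it translates directly to C. Here is a hedged sketch of the branchless compaction (my own function name; the increment expression is the C spelling of the two instructions above):

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless compaction: always store the value, then advance the
   output index by (~v) >> 63, which is 1 for non-negative v and 0 for
   negative v -- the same trick as the `not` + `shr r10,3F` pair in the
   JIT output shown in the post. */
size_t filter_branchless(int64_t *items, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        int64_t v = items[i];
        items[out] = v;                          /* unconditional store */
        out += (size_t)(((uint64_t)~v) >> 63);   /* 1 if v >= 0, else 0 */
    }
    return out;
}
```

The unconditional store is safe because `out <= i` always holds, so we only ever overwrite positions we have already read.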
I tried that, and the results were quite surprising. In the case of the 1K and 32M items, we see a slight cost for the branchless version (an additional 1% – 4%), while for the 1M entries there is an 18% reduction in latency for the branchless version. I ran the tests again with a 15% chance of negatives, to see what would happen. In that case, we get:

| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| FilterCmp_Unroll | 23 | 273.5 ns | 3.66 ns | 3.42 ns | 1.00 | 0.00 | 537 B |
| FilterCmp_Unroll_Branchleses | 23 | 280.2 ns | 4.85 ns | 4.30 ns | 1.03 | 0.02 | 547 B |
| FilterCmp_Unroll | 1047 | 1,675.7 ns | 29.55 ns | 27.64 ns | 1.00 | 0.00 | 537 B |
| FilterCmp_Unroll_Branchleses | 1047 | 1,676.3 ns | 16.97 ns | 14.17 ns | 1.00 | 0.02 | 547 B |
| FilterCmp_Unroll | 1048599 | 2,206,354.4 ns | 6,141.19 ns | 5,444.01 ns | 1.00 | 0.00 | 537 B |
| FilterCmp_Unroll_Branchleses | 1048599 | 1,688,677.3 ns | 11,584.00 ns | 10,835.68 ns | 0.77 | 0.01 | 547 B |
| FilterCmp_Unroll | 33554455 | 205,320,736.1 ns | 2,757,108.01 ns | 2,152,568.58 ns | 1.00 | 0.00 | 537 B |
| FilterCmp_Unroll_Branchleses | 33554455 | 199,520,169.4 ns | 2,097,285.87 ns | 1,637,422.86 ns | 0.97 | 0.01 | 547 B |

As you can see, with 15% negatives we have basically the same cost for small values, a big improvement in the 1M scenario, and not much improvement in the 32M scenario. All in all, that is very interesting information. Digging into the exact why and how of that means pulling out a CPU instruction profiler and starting to look at where we have stalls, which is a bit further than I want to invest in this scenario. What if we rearrange the code a little bit? The code currently loads the value and calls AddToOutput() immediately:

```csharp
AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 0));
```

What if we split it a little bit, so the code will look like this:

```csharp
for (; i + 4 < items.Length; i += 4)
{
    var i0 = Unsafe.Add(ref itemsRef, i + 0);
    var i1 = Unsafe.Add(ref itemsRef, i + 1);
    var i2 = Unsafe.Add(ref itemsRef, i + 2);
    var i3 = Unsafe.Add(ref itemsRef, i + 3);
    AddToOutput(ref itemsRef, i0);
    AddToOutput(ref itemsRef, i1);
    AddToOutput(ref itemsRef, i2);
    AddToOutput(ref itemsRef, i3);
}
```

The idea here is to get the JIT / CPU to fetch the items before they are actually needed, so there is more time for the memory to arrive. Remember that for the 1M scenario, we are dealing with 8MB of memory, and for the 32M scenario, we have 256MB. Looking at the loop prolog, we can see that it is indeed first fetching all the items from memory, then doing the work:

```
movsxd r9,r8d
mov r9,[rdx+r9*8]
lea r10d,[r8+1]
movsxd r10,r10d
mov r10,[rdx+r10*8]
lea r11d,[r8+2]
movsxd r11,r11d
mov r11,[rdx+r11*8]
lea esi,[r8+3]
movsxd rsi,esi
mov rsi,[rdx+rsi*8]
test r9,r9
```

In terms of performance, that gives us a small win (in the 1% – 2% range) for the 1M and 32M entries scenarios. The last thing I wanted to test is unrolling the loop even further: what would happen if we did 8 items per loop iteration, instead of 4?

```csharp
for (; i + 8 < items.Length; i += 8)
{
    Sse.Prefetch2(Unsafe.AsPointer(ref Unsafe.Add(ref itemsRef, i + 0)));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 0));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 1));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 2));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 3));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 4));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 5));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 6));
    AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 7));
}
```

There is some improvement (4% in the 1K scenario, 1% in the 32M scenario) but also slowdowns (2% in the 1M scenario). I think that this is probably roughly the end of the line as far as scalar code goes. We already made quite a few strides in trying to parallelize the work the CPU is doing by just laying out the code as we would like it to be, and we tried to control the manner in which it touches memory; in general, those are pretty advanced techniques. To close this post, I would like to take a look at the gains we got. I'm comparing the first version of the code, the last version we had in the previous post, and the unrolled version, both branchy and branchless, with 8 operations at once and memory prefetching.
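The same 8-wide structure with an explicit prefetch hint can be sketched in C. This is my own approximation, not the post's code: `__builtin_prefetch` is a GCC/Clang extension standing in for `Sse.Prefetch2`, and I prefetch one block ahead rather than the current block, which is a common variant of the same idea.

```c
#include <stddef.h>
#include <stdint.h>

/* 8-wide unrolled filter with a prefetch hint, mirroring the structure
   of the Sse.Prefetch2 version in the post (a sketch, not the original). */
size_t filter_unroll8_prefetch(int64_t *items, size_t len)
{
    size_t out = 0, i = 0;
    for (; i + 8 <= len; i += 8) {
#if defined(__GNUC__)
        __builtin_prefetch(&items[i + 8]);   /* hint: the next block */
#endif
        for (size_t j = 0; j < 8; j++) {     /* constant trip count, easily unrolled */
            int64_t v = items[i + j];
            if (v >= 0) items[out++] = v;
        }
    }
    for (; i < len; i++)                     /* remainder */
        if (items[i] >= 0) items[out++] = items[i];
    return out;
}
```

Prefetch hints are exactly that, hints; whether they help depends on how far ahead the hardware prefetcher is already running, which matches the mixed results the post reports.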
| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| FilterCmp | 23 | 277.3 ns | 0.69 ns | 0.64 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 23 | 270.7 ns | 0.42 ns | 0.38 ns | 0.98 | 0.00 | 397 B |
| FilterCmp_Unroll_8 | 23 | 257.6 ns | 1.45 ns | 1.21 ns | 0.93 | 0.00 | 672 B |
| FilterCmp_Unroll_8_Branchless | 23 | 259.9 ns | 1.96 ns | 1.84 ns | 0.94 | 0.01 | 682 B |
| FilterCmp | 1047 | 754.3 ns | 1.38 ns | 1.22 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 1047 | 749.0 ns | 1.81 ns | 1.69 ns | 0.99 | 0.00 | 397 B |
| FilterCmp_Unroll_8 | 1047 | 647.2 ns | 2.23 ns | 2.09 ns | 0.86 | 0.00 | 672 B |
| FilterCmp_Unroll_8_Branchless | 1047 | 721.2 ns | 1.23 ns | 1.09 ns | 0.96 | 0.00 | 682 B |
| FilterCmp | 1048599 | 499,675.6 ns | 2,639.97 ns | 2,469.43 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 1048599 | 494,388.4 ns | 600.46 ns | 532.29 ns | 0.99 | 0.01 | 397 B |
| FilterCmp_Unroll_8 | 1048599 | 426,940.7 ns | 1,858.57 ns | 1,551.99 ns | 0.85 | 0.01 | 672 B |
| FilterCmp_Unroll_8_Branchless | 1048599 | 483,940.8 ns | 517.14 ns | 458.43 ns | 0.97 | 0.00 | 682 B |
| FilterCmp | 33554455 | 30,282,334.8 ns | 599,306.15 ns | 531,269.30 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 33554455 | 29,410,612.5 ns | 29,583.56 ns | 24,703.61 ns | 0.97 | 0.02 | 397 B |
| FilterCmp_Unroll_8 | 33554455 | 29,102,708.3 ns | 42,824.78 ns | 40,058.32 ns | 0.96 | 0.02 | 672 B |
| FilterCmp_Unroll_8_Branchless | 33554455 | 29,761,841.1 ns | 48,108.03 ns | 42,646.51 ns | 0.98 | 0.02 | 682 B |

The unrolled 8 version is the winner by far in this scenario (0.5% negatives). Since that is the scenario we have in the real code, that is what I'm focusing on. Is there anything left to do here? My next step is to explore whether using vector instructions will be a good option for us.

Filtering negative numbers, fast

by Oren Eini

posted on: September 11, 2023

While working deep in the guts of RavenDB, I found myself with a seemingly simple task. Given a list of longs, I need to filter out all negative numbers as quickly as possible. The actual scenario is that we run a speculative algorithm: given a potentially large list of items, we check if we can fulfill the request in an optimal fashion. However, if that isn't possible, we need to switch to a slower code path that does more work. Conceptually, this looks something like this:

```csharp
void Run(Span<long> entries, long switchOverPoint, Aggregation state)
{
    long cost = 0;
    for (var i = 0; i < entries.Length; i++)
    {
        var (item, searchCost) = FetchCheaply(entries[i]);
        cost += searchCost;
        if (item is not null)
        {
            entries[i] = -entries[i]; // mark as processed
            if (state.Aggregate(item))
                return; // we have enough results to bail early
        }
        if (cost > switchOverPoint)
        {
            // speculative execution failed, we need to do this the hard way
            RunManually(entries, state);
            return;
        }
    }
}
```

That is the setup for this story. The problem is that we now need to filter the results we pass to the RunManually() method. We marked the entries that we already used in the list by negating them, but RunManually() does not allow negative values, and its internal implementation is not friendly to ignoring those values. In other words, given a Span&lt;long&gt;, I need to write the code that would filter out all the negative numbers. Everything else about the list of numbers should remain the same (the order of elements, etc).
From a coding perspective, this is as simple as:

```csharp
public Span<long> FilterNegative(Span<long> entries)
{
    return entries.ToArray().Where(l => l > 0).ToArray();
}
```

Please note, just looking at this code makes me cringe a lot. This does the work, but it has an absolutely horrible performance profile. It allocates multiple arrays, uses a lambda, etc. We don't actually care about the entries here, so we are free to modify them without allocating a new value. As such, let's see what kind of code we can write to do this work in an efficient manner. Here is what I came up with:

```csharp
public static int FilterCmp(Span<long> items)
{
    int output = 0;
    for (int i = 0; i < items.Length; i++)
    {
        if (items[i] < 0)
            continue;
        items[output++] = items[i];
    }
    return output;
}
```

The way this works is that we scan through the list, skipping the writes for the negative values, so we effectively "move down" all the non-negative values on top of the negative ones. This has a cost of O(N) and will modify the entire array; the final output is the number of valid items that we have there. In order to test the performance, I wrote the following harness:

```csharp
[RPlotExporter]
[DisassemblyDiagnoser]
public class SimdVsScalar
{
    private long[] items;

    [Params(1024 + 23, 1024 * 1024 + 23, 32 * 1024 * 1024 + 23)]
    public int N;

    [GlobalSetup]
    public void Setup()
    {
        items = new long[N];
        var r = new Random(2391);
        for (int i = 0; i < items.Length; i++)
        {
            items[i] = r.NextInt64();
        }
    }

    private void SprinkleNegatives()
    {
        var r = new Random(13245);
        var negatives = Math.Max((int)(items.Length * 0.005), 1);
        for (int i = 0; i < negatives; i++)
        {
            var idx = r.Next(items.Length);
            items[idx] = -items[idx];
        }
    }

    [Benchmark]
    public int FilterOr()
    {
        SprinkleNegatives();
        return Filter.FilterOr(items);
    }

    [Benchmark]
    public int FilterCmp()
    {
        SprinkleNegatives();
        return Filter.FilterCmp(items);
    }

    [Benchmark]
    public int Base()
    {
        SprinkleNegatives();
        return items.Length;
    }
}
```

We compare 1K, 1M and 32M element arrays, each of which has about 0.5% negatives, randomly spread across the range. Because we modify the values directly, we need to sprinkle the negatives across the array on each call. In this case, I'm testing two options for this task, one that uses a direct comparison (shown above) and one that uses bitwise or, like so:

```csharp
public static int FilterOr(Span<long> items)
{
    int output = 0;
    for (int i = 0; i < items.Length; i++)
    {
        if ((items[i] & ~long.MaxValue) != 0)
            continue;
        items[output++] = items[i];
    }
    return output;
}
```

I'm testing the cost of sprinkling negatives as well, since that has to be done before each benchmark call (since we modify the array during the call, we need to "reset" its state for the next one).
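For readers outside of .NET, the two variants boil down to a signed comparison versus a test against the sign bit (`~long.MaxValue` is exactly `1 << 63`). Here is a hedged C sketch of both, with my own names; both compact the non-negative values in place, as the post's versions do:

```c
#include <stddef.h>
#include <stdint.h>

/* Signed-comparison variant, the C analogue of FilterCmp. */
size_t filter_cmp(int64_t *items, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++)
        if (items[i] >= 0) items[out++] = items[i];
    return out;
}

/* Sign-bit mask variant, the C analogue of FilterOr:
   a value is negative iff its top bit (1 << 63) is set. */
size_t filter_or(int64_t *items, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++)
        if (((uint64_t)items[i] & 0x8000000000000000ULL) == 0)
            items[out++] = items[i];
    return out;
}
```

Both forms keep zero (it is non-negative); the `Where(l => l > 0)` LINQ version earlier would drop it, which is one more reason it was only a strawman.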
Given the two options, before we discuss the results, what would you expect to be the faster option? How would the size of the array matter here? I really like this example, because it is simple; there isn't any real complexity in what we are trying to do. And there is a very straightforward implementation that we can use as our baseline. That also means that I get to analyze what is going on at a very deep level. You might have noticed the disassembler attribute on the benchmark code; we are going to dive deep into that. For the same reason, we aren't using exactly 1K, 1M, or 32M arrays, but slightly higher than that, so we'll have to deal with remainders later on. Let's first look at what the JIT actually did here. Here is the annotated assembly for the FilterCmp function:

```
; Filter.FilterCmp(System.Span`1<Int64>)
       sub rsp,28          ; reserve stack space
       mov rax,[rcx]       ; rax now holds the pointer from the span
       mov edx,[rcx+8]     ; edx now holds the length of the span
       xor ecx,ecx         ; zero ecx (output)
       xor r8d,r8d         ; zero r8d (i)
       test edx,edx        ; if items.Length <= 0
       jle short M02_L02   ; jump to the epilog & return
M02_L00:
       mov r9d,r8d
       mov r9,[rax+r9*8]   ; r9 = items[i]
       test r9,r9          ; r9 < 0
       jl short M02_L01    ; continue to the next iteration
       ; items[output] -- range check
       lea r10d,[rcx+1]    ; r10d = output + 1
       cmp ecx,edx         ; check against items.Length
       jae short M02_L03   ; jump to the out of range exception
       mov ecx,ecx         ; clear high bits in rcx (overflow from the addition?)
       mov [rax+rcx*8],r9  ; items[output] = items[i]
       mov ecx,r10d        ; output = r10d
M02_L01:
       inc r8d             ; i++
       cmp r8d,edx         ; i < items.Length
       jl short M02_L00    ; back to the start of the loop
M02_L02:
       mov eax,ecx         ; return output
       add rsp,28
       ret
M02_L03:                   ; range check failure
       call CORINFO_HELP_RNGCHKFAIL
       int 3
; Total bytes of code 69
```

For the FilterOr, the code is pretty much the same, except that the key part is:

```diff
- test r9,r9
- jl short M02_L01
+ mov r10,8000000000000000
+ test r10,r9
+ jne short M02_L01
```

As you can see, the cmp option is slightly smaller in terms of code size. In terms of performance, we have:

| Method | N | Mean |
|---|---|---|
| FilterOr | 1047 | 745.6 ns |
| FilterCmp | 1047 | 745.8 ns |
| FilterOr | 1048599 | 497,463.6 ns |
| FilterCmp | 1048599 | 498,784.8 ns |
| FilterOr | 33554455 | 31,427,660.7 ns |
| FilterCmp | 33554455 | 30,024,102.9 ns |

The costs are very close to one another, with Or being very slightly faster on low numbers, and Cmp being slightly faster on the larger sizes. Note that the difference between them is basically noise; they have the same performance. The question is, can we do better here? Looking at the assembly, there is an extra range check in the main loop that the JIT couldn't elide (the access in items[output++]). Can we do something about it, and would it make any difference in performance? Here is how I can remove the range check:

```csharp
public static int FilterCmp_NoRangeCheck(Span<long> items)
{
    int outputIdx = 0;
    int i = 0;
    ref var output = ref items[i];
    for (; i < items.Length; i++)
    {
        if (items[i] < 0)
            continue;
        ref var outputDest = ref Unsafe.Add(ref output, outputIdx++);
        outputDest = items[i];
    }
    return outputIdx;
}
```

Here I'm telling the JIT: "I know what I'm doing", and it shows. Let's look at the assembly changes between those two methods, first the prolog:

```
; Filter.FilterCmp_NoRangeCheck(System.Span`1<Int64>)
       sub rsp,28        ; set up stack space
       mov rax,[rcx]     ; rax now holds the pointer for the span
       mov edx,[rcx+8]   ; edx is the length of the span
       xor ecx,ecx       ; zero ecx (i = 0)
       xor r8d,r8d       ; zero r8d (outputIdx = 0)
       test edx,edx      ; if items.Length == 0
       je short M02_L03  ; jump to CORINFO_HELP_RNGCHKFAIL
       mov r9,rax        ; output = ref items[0]
       test edx,edx      ; if items.Length <= 0
       jle short M02_L02 ; jump past the for loop
```

Here you can see what we are actually doing. Note the last four instructions: we have a range check for the items, and then we have another check for the loop. The first will get you an exception, the second will just skip the loop. In both cases, we test the exact same thing. The JIT had a chance to actually optimize that, but didn't. Here is a funny scenario where adding code may reduce the amount of code generated. Let's do another version of this method:

```csharp
public static int FilterCmp_NoRangeCheck(Span<long> items)
{
    if (items.Length == 0)
        return 0;

    int outputIdx = 0;
    int i = 0;
    ref var output = ref items[i];
    for (; i < items.Length; i++)
    {
        if (items[i] < 0)
            continue;
        ref var outputDest = ref Unsafe.Add(ref output, outputIdx++);
        outputDest = items[i];
    }
    return outputIdx;
}
```

In this case, I added a check to handle the scenario of items being empty. What can the JIT do with this now? It turns out, quite a lot. We dropped 10 bytes from the method, which is a nice result of our diet. Here is the annotated version of the assembly:

```
; Filter.FilterCmp_NoRangeCheck(System.Span`1<Int64>)
       mov rdx,[rcx]       ; rdx = pointer of items
       mov ecx,[rcx+8]     ; ecx = length of items
       test ecx,ecx        ; items.Length == 0
       jne short M02_L00   ; if not true, jump ahead
       xor eax,eax         ; return 0
       ret
M02_L00:
       xor eax,eax         ; outputIdx = 0
       xor r8d,r8d         ; i = 0
       mov r9,rdx          ; output = items pointer
       test ecx,ecx        ; if items.Length <= 0
       jle short M02_L03   ; jump to the function exit
M02_L01:
       mov r10d,r8d
       mov r10,[r9+r10*8]  ; r10 = items[i]
       test r10,r10        ; if items[i] < 0
       jl short M02_L02
       lea r11d,[rax+1]    ; r11d = outputIdx + 1
       cdqe                ; sign extend RAX (32-bit modulo on the addition)
       mov [rdx+rax*8],r10 ; copy r10 (items[i]) to items[outputIdx]
       mov eax,r11d        ; outputIdx = r11d
M02_L02:
       inc r8d             ; i++
       cmp r8d,ecx         ; i < items.Length
       jl short M02_L01    ; do the loop again
M02_L03:
       ret                 ; return
; Total bytes of code 59
```

A lot of the space savings in this case come from just not having to do a range check, but you'll note that we still do an extra check there (the second `test ecx,ecx` / `jle` pair), even though we already checked that. I think that the JIT knows that the value is not zero at this point, but has to consider that the value may be negative. If we change the initial guard clause to items.Length &lt;= 0, what do you think will happen? At this point, the JIT is smart enough to just elide everything; we are at 55 bytes of code and it is a super clean assembly (not a sentence I ever thought I would use). I'll spare you going through more assembly listings, but you can find the output here. And after all of that, where are we at?

| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| FilterCmp | 23 | 274.5 ns | 1.91 ns | 1.70 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 23 | 269.7 ns | 1.33 ns | 1.24 ns | 0.98 | 0.01 | 397 B |
| FilterCmp | 1047 | 744.5 ns | 4.88 ns | 4.33 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 1047 | 745.8 ns | 3.44 ns | 3.22 ns | 1.00 | 0.00 | 397 B |
| FilterCmp | 1048599 | 502,608.6 ns | 3,890.38 ns | 3,639.06 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 1048599 | 490,669.1 ns | 1,793.52 ns | 1,589.91 ns | 0.98 | 0.01 | 397 B |
| FilterCmp | 33554455 | 30,495,286.6 ns | 602,907.86 ns | 717,718.92 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 33554455 | 29,952,221.2 ns | 442,176.37 ns | 391,977.84 ns | 0.99 | 0.02 | 397 B |

There is a very slight benefit to the NoRangeCheck version, but even when we talk about 32M items, we aren't talking about a lot of time. The question is, what can we do better here?

Listing Windows virtual desktops using .NET

by Gérald Barré

posted on: September 11, 2023

In the blog post Per virtual desktop single-instance application, I introduced the IVirtualDesktopManager interface to detect if a window is on the current virtual desktop. In this post, I describe how to list the virtual desktops. This can be useful to provide a list of virtual desktops to the user.

Building a Resilient Email Sending Method in .NET with SmtpClient, Retry Support, and the Outbox Pattern

by Ardalis

posted on: September 08, 2023

Introduction In the world of software applications, email sending functionalities are indispensable. From password resets to notifications…

Introducing the Identity API endpoints

by Andrew Lock

posted on: September 05, 2023

Exploring the .NET 8 preview - Part 8

Optimizing a three-way merge

by Oren Eini

posted on: September 04, 2023

Deep inside of the Corax indexing engine inside of RavenDB there is the notion of a posting list. A posting list is just an ordered set of entry ids that contain a particular term. During the indexing process, we need to add and remove items from that posting list. This ends up being something like this:

```csharp
// All three parameters are sorted; the output is the merged list in
// the existing parameter
long[] Merge(long[] existing, long[] additions, long[] removals);
```

For fun, go and ask ChatGPT to write you the code for this task. You can assume that there are no duplicates between the removals and additions, and that adding an existing item is a no-op (so just one value would be in the end result). Here is a quick solution for this task (not actually tested that much, mind, but sufficient to understand what I’m trying to do):
```csharp
static long[] Merge(long[] existing, long[] additions, long[] removals)
{
    List<long> result = new List<long>();
    int existingIndex = 0;
    int additionsIndex = 0;
    int removalsIndex = 0;
    for (; existingIndex < existing.Length; existingIndex++)
    {
        bool removed = false;
        for (; removalsIndex < removals.Length; removalsIndex++)
        {
            if (removals[removalsIndex] > existing[existingIndex])
                break; // will remove in the future
            if (removals[removalsIndex] < existing[existingIndex])
                continue; // not in the existing, can skip
            removed = true;
            break;
        }
        if (removed)
            continue;
        if (additionsIndex >= additions.Length)
            break; // no work remaining
        for (; additionsIndex < additions.Length; additionsIndex++)
        {
            if (additions[additionsIndex] > existing[existingIndex])
                break;
            else if (additions[additionsIndex] < existing[existingIndex])
                result.Add(additions[additionsIndex]);
            else
            {
                // it's equal, so we'll move to the next one and let the existing add it
            }
        }
        result.Add(existing[existingIndex]);
    }
    // add the remainders
    result.AddRange(additions[additionsIndex..]);
    result.AddRange(existing[existingIndex..]);
    return result.ToArray();
}
```

If you look at this code in terms of performance, you’ll realize that it is quite expensive. In terms of complexity, it is actually pretty good: we iterate over the arrays just once, and the number of comparisons is bounded by the lengths of the lists. However, there is a big issue here: the number of branches that you have to deal with. Basically, every if and every for loop is going to add a tiny bit of latency to the system, because these are unpredictable branches, which are pretty nasty to deal with. It turns out that the values that we put in the posting list are actually always a multiple of 4, so the bottom 2 bits are always cleared. That means that we actually have a different way to deal with it.
Here is the new logic:

```csharp
static long[] MergeOptimized(long[] existing, long[] additions, long[] removals)
{
    for (int i = 0; i < removals.Length; i++)
    {
        removals[i] |= 1;
    }
    List<long> result = new List<long>();
    result.AddRange(existing);
    result.AddRange(additions);
    result.AddRange(removals);
    result.Sort();
    int outputIdx = 0;
    for (int i = 0; i < result.Count - 1; i++)
    {
        outputIdx += Convert.ToInt32((result[i + 1] & ~1) != result[i]);
        result[outputIdx] = result[i + 1];
        outputIdx -= (int)(result[i + 1] & 1);
    }
    return result.Take(outputIdx).ToArray();
}
```

This code was written with an eye to explaining the algorithm, mind, not performance. The idea goes like this: we flag the removals with a bit, then concatenate all the arrays together, sort them, and do a single scan over the whole thing, removing duplicates and removals. In the real code, we are using raw pointers, not a List, so there are no access checks, etc. From an algorithmic perspective, this code makes absolutely no sense at all. We concatenate all the values together, then sort them (an O(N log N) operation), then scan them again?! How can that be faster than a single scan across all three arrays? The answer is simple: we have a really efficient sort primitive (vxsort) that is able to sort things really fast (GB/sec). There is a really good series of posts that explain how that is achieved. Since we consider sorting to be cheap, the rest of the work is just a single scan on the list, and there are no branches at all there.
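To watch the flag-and-sort scan run on concrete input, here is a self-contained variant — my own sketch, not the published code. Two caveats to note: it assumes every removal matches an existing value (and that all values are multiples of 4), and since outputIdx ends up pointing at the last valid slot, I take outputIdx + 1 elements at the end:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MergeDemo
{
    // Sketch of the flag-and-sort merge: tag removals in the always-clear
    // low bit, concatenate, sort, then dedup/remove in one branchless pass.
    static long[] MergeOptimized(long[] existing, long[] additions, long[] removals)
    {
        for (int i = 0; i < removals.Length; i++)
            removals[i] |= 1; // values are multiples of 4, so this bit is free

        var result = new List<long>();
        result.AddRange(existing);
        result.AddRange(additions);
        result.AddRange(removals);
        result.Sort();

        int outputIdx = 0;
        for (int i = 0; i < result.Count - 1; i++)
        {
            // advance only when the next value (ignoring the flag) is new
            outputIdx += Convert.ToInt32((result[i + 1] & ~1L) != result[i]);
            result[outputIdx] = result[i + 1];
            // step back over a value that turned out to be a removal
            outputIdx -= (int)(result[i + 1] & 1);
        }
        // outputIdx points at the last valid slot in this sketch
        return result.Take(outputIdx + 1).ToArray();
    }

    public static void Main()
    {
        long[] existing = { 4, 8, 16 };
        long[] additions = { 8, 12 };
        long[] removals = { 16 };
        Console.WriteLine(string.Join(",", MergeOptimized(existing, additions, removals)));
    }
}
```

Tracing the loop by hand on the sorted buffer [4, 8, 8, 12, 16, 17] shows the duplicate 8 being overwritten in place and the tagged 17 pulling outputIdx back over the 16 it removes.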
The code plays with the offset that we write into, figuring out whether we need to overwrite the current value (duplicate) or go back (removal), but in general it means that it can execute very quickly. This approach also has another really important aspect. Take a look at the actual code that we have in production, from about an hour’s worth of profiling a busy indexing session, and at the more common code path. In both of them, you’ll notice something really important: there isn’t a call to sorting at all in here. In fact, when I search for the relevant function, I find that it accounts for 25 ms out of over an hour. How can this be? As efficient as the sorting can be, we are supposed to be calling it a lot. Well, consider one scenario, what happens if:

- There are no removals
- All additions happen after the last existing item in the list

In this case, I don’t need to do anything beyond concatenate the lists. I can skip the entire process, just copy the existing and additions to the output and call it a day. Even when I do have a lot of removals and complicated merge processes, the code structure means that the CPU can get through this code very quickly. This isn’t super friendly for humans to read, but for the CPU, this is chump change.
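That fast path can be sketched as follows — a hypothetical reconstruction of the check described above, not the production code:

```csharp
using System;

public static class FastPathDemo
{
    // Sketch: when there are no removals and every addition sorts after
    // the last existing item, the merge degenerates into a concatenation.
    static long[] Merge(long[] existing, long[] additions, long[] removals)
    {
        if (removals.Length == 0 &&
            (existing.Length == 0 || additions.Length == 0 ||
             additions[0] > existing[^1]))
        {
            var output = new long[existing.Length + additions.Length];
            existing.CopyTo(output, 0);
            additions.CopyTo(output, existing.Length);
            return output;
        }
        // otherwise fall back to the flag-sort-scan merge (not shown here)
        throw new NotImplementedException();
    }

    public static void Main()
    {
        long[] existing = { 4, 8, 12 };
        long[] additions = { 16, 20 }; // all sort after existing[^1]
        Console.WriteLine(string.Join(",", Merge(existing, additions, Array.Empty<long>())));
    }
}
```

In an append-mostly indexing workload this branch is taken almost every time, which is consistent with the sort barely showing up in the profile.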

Turn off monitors when locking the computer

by Gérald Barré

posted on: September 04, 2023

When I lock my computer, it's because I'm going to leave it for a while. I don't want to waste energy by keeping the monitors on and I don't like having a useless source of light. So, I want to turn them off automatically when I lock my computer.To turn off monitors on Windows, you can use the Send

Not all O(1) operations are considered equal

by Oren Eini

posted on: August 30, 2023

At some point in any performance optimization sprint, you are going to run into a super annoying problem: the dictionary. The reasoning is quite simple. One of the most powerful optimization techniques is to use a cache, which is usually implemented as a dictionary. Today’s tale is about a dictionary, but surprisingly enough, not about a cache. Let’s set up the background: I’m looking at optimizing a big indexing batch deep inside RavenDB, and here is my current focus. You can see that RecordTermsForEntries takes 4% of the overall indexing time. That is… a lot, as you can imagine. What is more interesting here is why. The simplified version of the code looks like this:

```csharp
private readonly Dictionary<long, List<RecordedTerm>> _termsPerEntryId = new();

private void RecordTermsForEntries(List<TermInEntryModification> entriesForTerm, long termContainerId)
{
    for (int i = 0; i < entriesForTerm.Count; i++)
    {
        var entry = entriesForTerm[i];
        ref var entryTerms = ref CollectionsMarshal.GetValueRefOrAddDefault(_termsPerEntryId, entry.EntryId, out var exists);
        if (exists == false)
        {
            entryTerms = new List<RecordedTerm>();
        }
        entryTerms.Add(new RecordedTerm
        {
            TermContainerId = termContainerId
        });
    }
}
```

Basically, we are registering, for each entry, all the terms that belong to it.
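For context, CollectionsMarshal.GetValueRefOrAddDefault (System.Runtime.InteropServices, available since .NET 6) is what gives the code a single hash lookup instead of a TryGetValue followed by an Add. A minimal standalone sketch with hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class DictDemo
{
    static readonly Dictionary<long, List<long>> _termsPerEntry = new();

    // One hash lookup: the call returns a ref to the value slot, adding a
    // default (null) entry if the key was missing; we fill it in place.
    static void Record(long entryId, long termId)
    {
        ref var terms = ref CollectionsMarshal.GetValueRefOrAddDefault(_termsPerEntry, entryId, out bool exists);
        if (!exists)
            terms = new List<long>(); // writes straight into the dictionary slot
        terms.Add(termId);
    }

    public static void Main()
    {
        Record(1, 100);
        Record(1, 200);
        Record(2, 300);
        Console.WriteLine(_termsPerEntry[1].Count);
        Console.WriteLine(_termsPerEntry[2][0]);
    }
}
```

The ref-returning shape is what makes this cheaper than the classic two-call pattern: the hash and bucket walk happen once, whether or not the key already exists.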
This is complicated by the fact that we are doing the process in stages:

1. Create the entries
2. Process the terms for the entries
3. Write the terms to persistent storage (giving them the recorded term id)
4. Update the entries to record the term ids that they belong to

The part of the code that we are looking at now is the last one, where we already wrote the terms to persistent storage and we need to update the entries. This is needed so that when we read them, we’ll be able to find the relevant terms. At any rate, you can see that this method’s cost is absolutely dominated by the dictionary call. In fact, we are actually using an optimized method here to avoid doing a TryGetValue() and then an Add() in case the value is not already in the dictionary. If we actually look at the metrics, this is actually kind of awesome. We are calling the dictionary almost 400 million times and it is able to do the work in under 200 nanoseconds per call. That is pretty awesome, but it still means that we have over 2% of our total indexing time spent doing lookups. Can we do better? In this case, absolutely. Here is how this works: instead of doing a dictionary lookup, we are going to store a list, and the entry will record the index of the item in the list. Here is what this looks like:
```csharp
private readonly List<List<RecordedTerm>> _termsPerEntryId = new();

private void RecordTermsForEntries(List<TermInEntryModification> entriesForTerm, long termContainerId)
{
    for (int i = 0; i < entriesForTerm.Count; i++)
    {
        var entry = entriesForTerm[i];
        if (entry.TermsPerEntryIndex == -1)
        {
            entry.TermsPerEntryIndex = _termsPerEntryId.Count;
            _termsPerEntryId.Add(new List<RecordedTerm>());
        }
        ref var entryTerms = ref CollectionsMarshal.AsSpan(_termsPerEntryId)[entry.TermsPerEntryIndex];
        entryTerms.Add(new RecordedTerm
        {
            TermContainerId = termContainerId
        });
    }
}
```

There isn’t much to this process, I admit. I was lucky that in this case, we were able to reorder things in such a way that skipping the dictionary lookup is a viable approach. In other cases, we would need to record the index at the creation of the entry (effectively reserving the position) and then use that later. And the result is pretty good, even if I say so myself: the cost went down from 3.6 microseconds per call to 1.3 microseconds. That is almost a threefold improvement.
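Reduced to a standalone sketch with hypothetical names, the trick looks like this: the entry carries an index into a plain list, so the hot path is an array index rather than a hash lookup:

```csharp
using System;
using System.Collections.Generic;

public class Entry
{
    public int TermsIndex = -1; // reserved lazily; replaces the dictionary key
}

public static class ListIndexDemo
{
    static readonly List<List<long>> _termsPerEntry = new();

    static void Record(Entry entry, long termId)
    {
        if (entry.TermsIndex == -1)
        {
            // reserve a slot once; afterwards every access is a plain index
            entry.TermsIndex = _termsPerEntry.Count;
            _termsPerEntry.Add(new List<long>());
        }
        _termsPerEntry[entry.TermsIndex].Add(termId);
    }

    public static void Main()
    {
        var a = new Entry();
        var b = new Entry();
        Record(a, 100);
        Record(a, 200);
        Record(b, 300);
        Console.WriteLine(_termsPerEntry[a.TermsIndex].Count);
        Console.WriteLine(_termsPerEntry[b.TermsIndex][0]);
    }
}
```

Both the dictionary and the list version are O(1) per operation; the point of the post is that the constant factors differ by nearly 3x, because an index into contiguous storage skips the hashing and bucket probing entirely.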

Using RavenDB from Cloudflare Workers

by Oren Eini

posted on: August 29, 2023

RavenDB is a multi-primary database, which means that it allows you to write to multiple nodes at the same time, without needing synchronization between them. This ability to run independently from the other nodes in the cluster (or even across clusters) makes RavenDB highly suitable for running on the edge. We have recently published a guide on using RavenDB from Cloudflare Workers, as well as a full template so you can get up to speed in a few minutes. The ability to run in a Cloudflare Worker (and use a nearby RavenDB server) means that your logic is running closer to the client, which can greatly reduce your overall latency and improve the overall user experience.

Form binding in minimal APIs

by Andrew Lock

posted on: August 29, 2023

Exploring the .NET 8 preview - Part 7