skip to content
Relatively General .NET

Challenge

by Oren Eini

posted on: September 19, 2023

The following bug cost me a bunch of time, can you see what I’m doing wrong? This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters private static int AddAndSort(Dictionary<long, long> freeListAdditions, Span<long> keys, Span<long> vals) { int index = 0; foreach (var (k, v) in freeListAdditions) { keys[index] = k; vals[index] = v; index++; } keys.Sort(vals); return index; } view raw nasty.cs hosted with ❤ by GitHub For fun, it’s so nasty because usually, it will accidentally work.

Accessing private members without reflection in C#

by Gérald Barré

posted on: September 18, 2023

Reflection allows you to access private members of a class. This is very useful when you want to access private members of a class that you don't own. However, it is slow and doesn't work well with Native AOT. In this post, I describe how to access private members without reflection in .NET 8.Let's

Filtering negative numbers, fast

by Oren Eini

posted on: September 15, 2023

In the previous post, I was able to utilize AVX to get some nice speedups. In general, I was able to save up to 57%(!) of the runtime in processing arrays of 1M items. That is really amazing, if you think about it. But my best effort only gave me a 4% improvement when using 32M items. I decided to investigate what is going on in more depth, and I came up with the following benchmark. Given that I want to filter negative numbers, what would happen if the only negative number in the array was the first one? In other words, let’s see what happens when we could write this algorithm as the following line of code: array[1..].CopyTo(array); The idea here is that we should measure the speed of raw memory copy and see how that compares to our code. Before we dive into the results, I want to make a few things explicit. We are dealing here with arrays of long, when I’m talking about an array with 1M items, I’m actually talking about an 8MB buffer, and for the 32M items, we are talking about 256MB of memory. I’m running these benchmarks on the following machine:     AMD Ryzen 9 5950X 16-Core Processor     Base speed:    3.40 GHz     L1 cache:    1.0 MB     L2 cache:    8.0 MB     L3 cache:    64.0 MB     Utilization    9%     Speed    4.59 GHz In other words, when we look at this, the 1M items (8MB) can fit into L2 (barely, but certainly backed by the L3 cache. For the 32M items (256MB), we are far beyond what can fit in the cache, so we are probably dealing with memory bandwidth issues. I wrote the following functions to test it out: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters [Benchmark] public int CopyTo() { items[0] = -items[0]; Span<long> itemsSpan = items; itemsSpan[1..].CopyTo(itemsSpan); return items.Length - 1; ; } [Benchmark] public unsafe int MemoryCopy() { items[0] = -items[0]; fixed(long * p = items) { Buffer.MemoryCopy(p, p + 1, items.Length * sizeof(long), (items.Length - 1) * sizeof(long)); } return items.Length - 1; ; } [Benchmark] public unsafe int MoveMemory() { items[0] = -items[0]; fixed (long* p = items) { MoveMemory(p + 1, p, items.Length * sizeof(long)); } return items.Length - 1; ; } [DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)] static unsafe extern void MoveMemory(void* dest, void* src, int size); view raw CopyBench.cs hosted with ❤ by GitHub Let’s look at what I’m actually testing here. CopyTo() – using the span native copying is the most ergonomic way to do things, I think. MemoryCopy() – uses a built-in unsafe API in the framework. That eventually boils down to a heavily optimized routine, which… calls to Memove() if the buffer overlaps (as they do in this case). MoveMemory() – uses a pinvoke to call to the Windows API to do the moving of memory for us. Here are the results for the 1M case (8MB): Method N Mean Error StdDev Ratio FilterCmp 1048599 441.4 us 1.78 us 1.58 us 1.00 FilterCmp_Avx 1048599 141.1 us 2.70 us 2.65 us 0.32 CopyTo 1048599 872.8 us 11.27 us 10.54 us 1.98 MemoryCopy 1048599 869.7 us 7.29 us 6.46 us 1.97 MoveMemory 1048599 126.9 us 0.28 us 0.25 us 0.29 We can see some real surprises here. I’m using the FilterCmp (the very basic implementation) that I wrote. I cannot explain why CopyTo() and MemoryMove() are so slow. What is really impressive is that the FilterCmp_Avx() and MoveMemory() are so close in performance, and so much faster. To put it in another way, we are already at a stage where we are within shouting distance from the MoveMemory() performance. That is.. really impressive. That said, what happens with 32M (256MB) ? Method N Mean Error StdDev Ratio FilterCmp 33554455 22,763.6 us 157.23 us 147.07 us 1.00 FilterCmp_Avx 33554455 20,122.3 us 214.10 us 200.27 us 0.88 CopyTo 33554455 27,660.1 us 91.41 us 76.33 us 1.22 MemoryCopy 33554455 27,618.4 us 136.16 us 127.36 us 1.21 MoveMemory 33554455 20,152.0 us 166.66 us 155.89 us 0.89 Now we are faster in the FilterCmp_Avx than MoveMemory. That is… a pretty big wow, and a really nice close for this blog post series, right? Except that we won’t be stopping here. The way the task I set out works, we are actually filtering just the first item out, and then we are basically copying the memory. Let’s do some math: 256MB in 20.1ms means 12.4 GB/sec! On this system, I have the following memory setup:     64.0 GB     Speed:    2133 MHz     Slots used:    4 of 4     Form factor:    DIMM     Hardware reserved:    55.2 MB I’m using DDR4 memory, so I can expect a maximum speed of 17GB/sec. In theory, I might be able to get more if the memory is located on different DIMMs, but for the size in question, that is not likely. I’m going to skip the training montage of VTune, understanding memory architecture and figuring out what is actually going on. Let’s drop everything and look at what we have with just AVX vs. MoveMemory: Method N Mean Error StdDev Median Ratio FilterCmp_Avx 1048599 141.6 us 2.28 us 2.02 us 141.8 us 1.12 MoveMemory 1048599 126.8 us 0.25 us 0.19 us 126.8 us 1.00               FilterCmp_Avx 33554455 21,105.5 us 408.65 us 963.25 us 20,770.4 us 1.08 MoveMemory 33554455 20,142.5 us 245.08 us 204.66 us 20,108.2 us 1.00 The new baseline is MoveMemory, and in this run, we can see that the AVX code is slightly slower. It’s sadly not uncommon to see numbers shift by those ranges when we are testing such micro-optimizations, mostly because we are subject to so many variables that can affect performance. In this case, I dropped all the other benchmarks, which may have changed things. At any rate, using those numbers, we have 12.4GB/sec for MoveMemory() and 11.8GB/sec for the AVX version. The hardware maximum speed is 17GB/sec. So we are quite close to what can be done. For that matter, I would like to point out that the trivial code completed the task in 11GB/sec, so that is quite respectable and shows that the issue here is literally getting the memory fast enough to the CPU. Can we do something about that? I made a pretty small change to the AVX version, like so: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters +if (i + Vector256<long>.Count <= len) +{ + var next = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i)); + len -= Vector256<long>.Count; for (; i + Vector256<long>.Count <= len; i += Vector256<long>.Count) { - var v = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i)); + var v = next; + next = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i + Vector256<long>.Count)); ... view raw Avx_Next.diff hosted with ❤ by GitHub What are we actually doing here? Instead of loading the value and immediately using it, we are loading the next value, then we are executing the loop and when we iterate again, we will start loading the next value and process the current one. The idea is to parallelize load and compute at the instruction level. Sadly, that didn’t seem to do the trick. I saw a 19% additional cost for that version compared to the vanilla AVX one on the 8MB run and a 2% additional cost on the 256MB run. I then realized that there was one really important test that I had to also make, and wrote the following: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters [Benchmark()] public unsafe int MoveMemory() { items[0] = -items[0]; fixed (long* p = items) { MoveMemory(p + 1, p, items.Length * sizeof(long)); } return items.Length - 1; ; } [DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)] static unsafe extern void MoveMemory(void* dest, void* src, int size); [Benchmark(Baseline = true)] public unsafe int FillMemory() { items[0] = -items[0]; fixed (long* p = items) { var b = (byte)Random.Shared.Next(); FillMemory(p,items.Length * sizeof(long), b); } return items.Length - 1; ; } [DllImport("Kernel32.dll", EntryPoint = "RtlFillMemory", SetLastError = false)] static unsafe extern void FillMemory(void* dest, IntPtr size, byte fill); view raw Final.cs hosted with ❤ by GitHub In other words, let's test the speed of moving memory and filling memory as fast as we possibly can. Here are the results: Method N Mean Error StdDev Ratio RatioSD Code Size MoveMemory 1048599 126.8 us 0.36 us 0.33 us 0.25 0.00 270 B FillMemory 1048599 513.5 us 10.05 us 10.32 us 1.00 0.00 351 B                 MoveMemory 33554455 20,022.5 us 395.35 us 500.00 us 1.26 0.02 270 B FillMemory 33554455 15,822.4 us 19.85 us 17.60 us 1.00 0.00 351 B This is really interesting, for a small buffer (8MB), MoveMemory is somehow faster. I don’t have a way to explain that, but it has been a pretty consistent result in my tests. For the large buffer (256MB), we are seeing results that make more sense to me. MoveMemory – 12.5 GB / sec FIllMemory – 15.8 GB / sec In other words, for MoveMemory, we are both reading and writing, so we are paying for memory bandwidth in both directions. For filling the memory, we are only writing, so we can get better performance (no need for reads). In other words, we are talking about hitting the real physical limits of what the hardware can do. There are all sorts of tricks that one can pull, but when we are this close to the limit, they are almost always context-sensitive and dependent on many factors. To conclude, here are my final results: Method N Mean Error StdDev Ratio RatioSD Code Size FilterCmp_Avx 1048599 307.9 us 6.15 us 12.84 us 0.99 0.05 270 B FilterCmp_Avx_Next 1048599 308.4 us 6.07 us 9.26 us 0.99 0.03 270 B CopyTo 1048599 1,043.7 us 15.96 us 14.93 us 3.37 0.11 452 B ArrayCopy 1048599 1,046.7 us 15.92 us 14.89 us 3.38 0.14 266 B UnsafeCopy 1048599 309.5 us 6.15 us 8.83 us 1.00 0.04 133 B MoveMemory 1048599 310.8 us 6.17 us 9.43 us 1.00 0.00 270 B                 FilterCmp_Avx 33554455 24,013.1 us 451.09 us 443.03 us 0.98 0.02 270 B FilterCmp_Avx_Next 33554455 24,437.8 us 179.88 us 168.26 us 1.00 0.01 270 B CopyTo 33554455 32,931.6 us 416.57 us 389.66 us 1.35 0.02 452 B ArrayCopy 33554455 32,538.0 us 463.00 us 433.09 us 1.33 0.02 266 B UnsafeCopy 33554455 24,386.9 us 209.98 us 196.42 us 1.00 0.01 133 B MoveMemory 33554455 24,427.8 us 293.75 us 274.78 us 1.00 0.00 270 B As you can see, just the AVX version is comparable or (slightly) beating the MoveMemory function. I tried things like prefetching memory, both the next item, the next cache item and from the next page, using non-temporal load and stores and many other things, but this is a pretty tough challenge. What is really interesting is that the original, very simple and obvious implementation, clocked at 11 GB/sec. After pulling pretty much all the stops and tricks, I was able to hit 12.5 GB / sec. I don’t think anyone can look / update / understand the resulting code in any way without going through deep meditation. That is not a bad result at all. And along the way, I learned quite a bit about how the lowest level of the machine architecture is working.

Filtering negative numbers, fast

by Oren Eini

posted on: September 13, 2023

In the previous post I discussed how we can optimize the filtering of negative numbers by unrolling the loop, looked into branchless code and in general was able to improve performance by up to 15% from the initial version we started with. We pushed as much as we could on what can be done using scalar code. Now it is the time to open a whole new world and see what we can do when we implement this challenge using vector instructions. The key problem with such tasks is that SIMD, AVX and their friends were designed by… an interesting process using a perspective that makes sense if you can see in a couple of additional dimensions. I assume that at least some of that is implementation constraints, but the key issue is that when you start using SIMD, you realize that you don’t have general-purpose instructions. Instead, you have a lot of dedicated instructions that are doing one thing, hopefully well, and it is your role to compose them into something that would make sense. Oftentimes, you need to turn the solution on its head in order to successfully solve it using SIMD. The benefit, of course, is that you can get quite an amazing boost in speed when you do this. The algorithm we use is basically to scan the list of entries and copy to the start of the list only those items that are positive. How can we do that using SIMD? The whole point here is that we want to be able to operate on multiple data, but this particular task isn’t trivial. I’m going to show the code first, then discuss what it does in detail: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters public static int FilterCmp_Avx(Span<long> items) { var len = items.Length; if (len <= 0) return 0; ref var permuteStart = ref Unsafe.AsRef(PermuteTable[0]); int outputIdx = 0; int i = 0; ref var output = ref items[i]; for (; i + Vector256<long>.Count <= len; i += Vector256<long>.Count) { var v = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i)); var bits = v.ExtractMostSignificantBits(); if (bits == 0) // do we have _any_ negatives here? { v.StoreUnsafe(ref Unsafe.Add(ref output, outputIdx)); outputIdx += Vector256<long>.Count; continue; } // complex case, we have to deal with some negatives var permute = Vector256.LoadUnsafe(ref Unsafe.Add(ref permuteStart, (int)bits * (sizeof(int) * 8))).AsInt32(); var m = Avx2.PermuteVar8x32(v.AsInt32(), permute).AsInt64(); m.StoreUnsafe(ref Unsafe.Add(ref output, outputIdx)); outputIdx += 4 - BitOperations.PopCount(bits); } // remainder, do that in a scalar fashion for (; i < len; i++) { ref var cur = ref Unsafe.Add(ref output, i); if (cur < 0) continue; Unsafe.Add(ref output, outputIdx++) = cur; } return outputIdx; } view raw FilterCmp_Avx.cs hosted with ❤ by GitHub We start with the usual check (if you’ll recall, that ensures that the JIT knows to elide some range checks, then we load the PremuteTable. For now, just assume that this is magic (and it is). The first interesting thing happens when we start iterating over the loop. Unlike before, now we do that in chunks of 4 int64 elements at a time. Inside the loop, we start by loading a vector of int64 and then we do the first odd thing. We call ExtractMostSignificantBits(), since the sign bit is used to mark whether a number if negative or not. That means that I can use a single instruction to get an integer with the bits set for all the negative numbers. That is particularly juicy for what we need, since there is no need for comparisons, etc. If the mask we got is all zeroes, it means that all the numbers we loaded to the vector are positives, so we can write them as-is to the output and move to the next part. Things get interesting when that isn’t the case. We load a permute value using some shenanigans (we’ll touch on that shortly) and call the PermuteVar8x32() method. The idea here is that we pack all the non-negative numbers to the start of the vector, then we write the vector to the output. The key here is that when we do that, we increment the output index only by the number of valid values.  The rest of this method just handles the remainder that does not fit into a vector. The hard part in this implementation was to figure out how to handle the scenario where we loaded some negative numbers. We need a way to filter them, after all. But there is no SIMD instruction that allows us to do so. Luckily, we have the Avx2.PermuteVar8x32() method to help here. To confuse things, we don’t actually want to deal with 8x32 values. We want to deal with 4x64 values. There is Avx2.Permute4x64() method, and it will work quite nicely, with a single caveat. This method assumes that you are going to pass it a constant value. We don’t have such a constant, we need to be able to provide that based on whatever the masked bits will give us. So how do we deal with this issue of filtering with SIMD? We need to move all the values we care about to the front of the vector. We have the method to do that, PermuteVar8x32() method, and we just need to figure out how to actually make use of this. PermuteVar8x32() accepts an input vector as well as a vector of the premutation you want to make. In this case, we are basing this on the 4 top bits of the 4 elements vector of int64. As such, there are a total of 16 options available to us. We have to deal with 32bits values rather than 64bits, but that isn’t that much of a problem. Here is the premutation table that we’ll be using: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters public static readonly int[] PermuteTableInts = new int[] { 0,1,2,3,4,5,6,7, // 0000 2,3,4,5,6,7,0,0, // 0001 0,1,4,5,6,7,0,0, // 0010 4,5,6,7,0,0,0,0, // 0011 0,1,2,3,6,7,0,0, // 0100 2,3,6,7,0,0,0,0, // 0101 0,1,6,7,0,0,0,0, // 0110 6,7,0,0,0,0,0,0, // 0111 0,1,2,3,4,5,0,0, // 1000 2,3,4,5,0,0,0,0, // 1001 0,1,4,5,0,0,0,0, // 1010 4,5,0,0,0,0,0,0, // 1011 0,1,2,3,0,0,0,0, // 1100 2,3,0,0,0,0,0,0, // 1101 0,1,0,0,0,0,0,0, // 1110 0,0,0,0,0,0,0,0, // 1111 }; view raw PermuteTableInts.cs hosted with ❤ by GitHub What you can see here is that when we have a 1 in the bits (shown in comments) we’ll not copy that to the vector. Let’s take a look at the entry of 0101, which may be caused by the following values [1,-2,3,-4]. When we look at the right entry at index #5 in the table: 2,3,6,7,0,0,0,0 What does this mean? It means that we want to put the 2nd int64 element in the source vector and move it as the first element of the destination vector, take the 3rd element from the source as the second element in the destination and discard the rest (marked as 0,0,0,0 in the table). This is a bit hard to follow because we have to compose the value out of the individual 32 bits words, but it works quite well. Or, at least, it would work, but not as efficiently. This is because we would need to load the PermuteTableInts into a variable and access it, but there are better ways to deal with it. We can also ask the JIT to embed the value directly. The problem is that the pattern that the JIT recognizes is limited to ReadOnlySpan<byte>, which means that the already non-trivial int32 table got turned into this: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters // same as the one above, just in bytes for JIT optimizations public static ReadOnlySpan<byte> PermuteTable => new byte[] { 0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,4,0,0,0,5,0,0,0,6,0,0,0,7,0,0,0, 2,0,0,0,3,0,0,0,4,0,0,0,5,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,1,0,0,0,4,0,0,0,5,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0, 4,0,0,0,5,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0, 2,0,0,0,3,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,1,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,4,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0, 2,0,0,0,3,0,0,0,4,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,1,0,0,0,4,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 4,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 2,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, }; view raw PermuteTable.cs hosted with ❤ by GitHub This is the exact same data as before, but using ReadOnlySpan<byte> means that the JIT can package that inside the data section and treat it as a constant value. The code was heavily optimized, to the point where I noticed a JIT bug where these two versions of the code give different assembly output: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters int A(int i) { return i * 4 * 8; } int B(int i) { return i * (4 * 8); } view raw code.cs hosted with ❤ by GitHub Here is what we get out: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters Program.<<Main>$>g__A|0_0(Int32) L0000: mov eax, ecx L0002: shl eax, 2 L0005: shl eax, 3 L0008: ret Program.<<Main>$>g__B|0_1(Int32) L0000: mov eax, ecx L0002: shl eax, 5 L0005: ret view raw code.asm hosted with ❤ by GitHub This looks like an unintended consequence of Roslyn and the JIT each doing their (separate jobs), but not reaching the end goal. Constant folding looks like it is done mostly by Roslyn, but it does a scan like that from the left, so it wouldn’t convert $A * 4 * 8 to $A * 32. That is because it stopped evaluating the constants when it found a variable. When we add parenthesis, we isolate the value and now understand that we can fold it. Speaking of assembly, here is the annotated assembly version of the code: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters ; Filter.FilterCmp_Avx(System.Span`1<Int64>) push rsi ; push rsi to stack - we'll be using a lot of state, so need to preserve it vzeroupper ; clear AVX state, since we don't know how the caller left it mov rdx,[rcx] ; rdx is now the pointer of the items' span mov ecx,[rcx+8] ; ecx is now the length of the span test ecx,ecx ; if items.Length > 0 jg short M02_L00 ; jump ahead to do work xor eax,eax ; return 0 vzeroupper ; not actually needed, I think pop rsi ret M02_L00: mov rax,1CA3A753F78 ; load permuteStart xor r8d,r8d ; outputIdx = 0 xor r9d,r9d ; i = 0 cmp ecx,4 ; if items.Length < 4 jl short M02_L04 ; jump the AVX portion loop to the next one M02_L01: movsxd r10,r9d ; r10 = current i value vmovdqu ymm0,ymmword ptr [rdx+r10*8] ; load vector register from output[i] (4 elements) vmovmskpd r10d,ymm0 ; move the top bits from 4 int64 elements to r10d (bits) test r10d,r10d ; if bits != 0 jne short M02_L02 ; jump to the complex piece of the code movsxd r10,r8d vmovdqu ymmword ptr [rdx+r10*8],ymm0 ; store vector to output[outputIdx] add r8d,4 ; outputIdx += 4 jmp short M02_L03 ; restart loop M02_L02: movsxd r11,r8d ; r11 - current outputIdx value mov esi,r10d ; esi = bits shl esi,5 ; bits * 4 * 8 - optimized movsxd rsi,esi vmovdqu ymm1,ymmword ptr [rax+rsi] ; load vector from premutation table vpermd ymm0,ymm1,ymm0 ; permute value vmovdqu ymmword ptr [rdx+r11*8],ymm0 ; write to output popcnt r10d,r10d ; count the bits neg r10d ; negate the value lea r8d,[r10+r8+4] ; add to outputIdx + 4 M02_L03: add r9d,4 ; for loop handling lea r10d,[r9+4] cmp r10d,ecx jle short M02_L01 ; jump to start of loop M02_L04: cmp r9d,ecx ; if there are no elements remaining jge short M02_L07 ; jump to end M02_L05: movsxd rax,r9d ; here is the scalar version, already discussed previously lea rax,[rdx+rax*8] mov rax,[rax] test rax,rax jl short M02_L06 mov r10d,r8d lea r8d,[r10+1] movsxd r10,r10d mov [rdx+r10*8],rax M02_L06: inc r9d cmp r9d,ecx jl short M02_L05 M02_L07: mov eax,r8d vzeroupper pop rsi ret ; Total bytes of code 179 view raw FilterCmp_Avx.asm hosted with ❤ by GitHub And after all of this work, where are we standing? Method N Mean Error StdDev Ratio RatioSD Code Size FilterCmp 23 285.7 ns 3.84 ns 3.59 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 23 272.6 ns 3.98 ns 3.53 ns 0.95 0.01 397 B FilterCmp_Unroll_8 23 261.4 ns 1.27 ns 1.18 ns 0.91 0.01 672 B FilterCmp_Avx 23 261.6 ns 1.37 ns 1.28 ns 0.92 0.01 521 B                 FilterCmp 1047 758.7 ns 1.51 ns 1.42 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 1047 756.8 ns 1.83 ns 1.53 ns 1.00 0.00 397 B FilterCmp_Unroll_8 1047 640.4 ns 1.94 ns 1.82 ns 0.84 0.00 672 B FilterCmp_Avx 1047 426.0 ns 1.62 ns 1.52 ns 0.56 0.00 521 B                 FilterCmp 1048599 502,681.4 ns 3,732.37 ns 3,491.26 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 1048599 499,472.7 ns 6,082.44 ns 5,689.52 ns 0.99 0.01 397 B FilterCmp_Unroll_8 1048599 425,800.3 ns 352.45 ns 312.44 ns 0.85 0.01 672 B FilterCmp_Avx 1048599 218,075.1 ns 212.40 ns 188.29 ns 0.43 0.00 521 B                 FilterCmp 33554455 29,820,978.8 ns 73,461.68 ns 61,343.83 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 33554455 29,471,229.2 ns 73,805.56 ns 69,037.77 ns 0.99 0.00 397 B FilterCmp_Unroll_8 33554455 29,234,413.8 ns 67,597.45 ns 63,230.70 ns 0.98 0.00 672 B FilterCmp_Avx 33554455 28,498,115.4 ns 71,661.94 ns 67,032.62 ns 0.96 0.00 521 B So it seems that the idea of using SIMD instruction has a lot of merit. Moving from the original code to the final version, we see that we can complete the same task in up to half the time. I’m not quite sure why we aren’t seeing the same sort of performance on the 32M, but I suspect that this is likely because we far exceed the CPU cache and we have to fetch it all from memory, so that is as fast as it can go. If you are interested in learning more, Lemire solves the same problem in SVE (SIMD for ARM) and Paul has a similar approach in Rust. If you can think of further optimizations, I would love to hear your ideas.

Filtering negative numbers, fast

by Oren Eini

posted on: September 12, 2023

In the previous post, we looked into what it would take to reduce the cost of filtering negative numbers. We got into the assembly and analyzed exactly what was going on. In terms of this directly, I don’t think that even hand-optimized assembly would take us further. Let’s see if there are other options that are available for us to get better speed. The first thing that pops to mind here is to do a loop unrolling. After all, we have a very tight loop, if we can do more work per loop iteration, we might get better performance, no? Here is my first version: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters public static int FilterCmp_Unroll(Span<long> items) { if (items.Length <= 0) return 0; int outputIdx = 0; int i = 0; ref var output = ref items[i]; for (; i + 4 < items.Length; i += 4) { var i0 = items[i + 0]; var i1 = items[i + 1]; var i2 = items[i + 2]; var i3 = items[i + 3]; AddToOutput(ref output, i0); AddToOutput(ref output, i1); AddToOutput(ref output, i2); AddToOutput(ref output, i3); } for (; i < items.Length; i++) { AddToOutput(ref output, items[i]); } return outputIdx; [MethodImpl(MethodImplOptions.AggressiveInlining)] void AddToOutput(ref long output, long input) { if (input < 0) return; Unsafe.Add(ref output, outputIdx++) = input; } } view raw FilterCmp_Unroll.cs hosted with ❤ by GitHub And here are the benchmark results: Method N Mean Error StdDev Ratio Code Size FilterCmp 23 274.6 ns 0.40 ns 0.35 ns 1.00 411 B FilterCmp_Unroll 23 257.5 ns 0.94 ns 0.83 ns 0.94 606 B               FilterCmp 1047 748.1 ns 2.91 ns 2.58 ns 1.00 411 B FilterCmp_Unroll 1047 702.5 ns 5.23 ns 4.89 ns 0.94 606 B               FilterCmp 1048599 501,545.2 ns 4,985.42 ns 4,419.45 ns 1.00 411 B FilterCmp_Unroll 1048599 446,311.1 ns 3,131.42 ns 2,929.14 ns 0.89 606 B               FilterCmp 33554455 29,637,052.2 ns 184,796.17 ns 163,817.00 ns 1.00 411 B FilterCmp_Unroll 33554455 29,275,060.6 ns 145,756.53 ns 121,713.31 ns 0.99 606 B That is quite a jump, 6% – 11% savings is no joke. Let’s look at what is actually going on at the assembly level and see if we can optimize this further. As expected, the code size is bigger, 264 bytes versus the 55 we previously got. But more importantly, we got the range check back, and a lot of them: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters lea r11d,[r8+1] ; var i01_idx = i+1 cmp r11d,ecx ; if i01_idx >= jae near ptr M02_L10 ; jump to range check view raw FilterCmp_Unroll.asm hosted with ❤ by GitHub The JIT isn’t able to reason about our first for loop and see that all our accesses are within bounds, which leads to doing a lot of range checks, and likely slows us down. Even with that, we are still showing significant improvements here. Let’s see what we can do with this: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters for (; i + 4 < items.Length; i += 4) { AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 0)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 1)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 2)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 3)); } view raw ElidedRangeChecks.cs hosted with ❤ by GitHub With that, we expect to have no range checks and still be able to benefit from the unrolling. Method N Mean Error StdDev Ratio RatioSD Code Size FilterCmp 23 275.4 ns 2.31 ns 2.05 ns 1.00 0.00 411 B FilterCmp_Unroll 23 253.6 ns 2.59 ns 2.42 ns 0.92 0.01 563 B                 FilterCmp 1047 741.6 ns 5.95 ns 5.28 ns 1.00 0.00 411 B FilterCmp_Unroll 1047 665.5 ns 2.38 ns 2.22 ns 0.90 0.01 563 B                 FilterCmp 1048599 497,624.9 ns 3,904.39 ns 3,652.17 ns 1.00 0.00 411 B FilterCmp_Unroll 1048599 444,489.0 ns 2,524.45 ns 2,361.38 ns 0.89 0.01 563 B                 FilterCmp 33554455 29,781,164.3 ns 361,625.63 ns 320,571.70 ns 1.00 0.00 411 B FilterCmp_Unroll 33554455 29,954,093.9 ns 588,614.32 ns 916,401.59 ns 1.01 0.04 563 B That helped, by quite a lot, it seems, for most cases, the 32M items case, however, was slightly slower, which is quite a surprise. Looking at the assembly, I can see that we still have branches, like so: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters M02_L03: lea r9d,[r8+2] movsxd r9,r9d mov r9,[rdx+r9*8] test r9,r9 ; if items[i+2] < 0: jl short M02_L04 ; continue mov r10d,eax lea eax,[r10+1] movsxd r10,r10d mov [rdx+r10*8],r9 M02_L04: lea r9d,[r8+3] movsxd r9,r9d mov r9,[rdx+r9*8] test r9,r9 ; if items[i+3] < 0: jl short M02_L05 ; continue mov r10d,eax lea eax,[r10+1] movsxd r10,r10d mov [rdx+r10*8],r9 view raw branches.asm hosted with ❤ by GitHub And here is why this is the case: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters [MethodImpl(MethodImplOptions.AggressiveInlining)] void AddToOutput(ref long output, long input) { if (input < 0) return; Unsafe.Add(ref output, outputIdx++) = input; } view raw AddToOutput.cs hosted with ❤ by GitHub Now, can we do better here? It turns out that we can, by using a branchless version of the operation. Here is another way to write the same thing: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters [MethodImpl(MethodImplOptions.AggressiveInlining)] unsafe void AddToOutput(ref long output, long input) { bool isNegative = (input < 0); Unsafe.Add(ref output, outputIdx) = input; outputIdx += *(byte*)&isNegative; } view raw AddToOutput.Branchless.cs hosted with ❤ by GitHub What happens here is that we are unconditionally setting the value in the array, but only increment if the value is greater than or equal to zero. That saves us in branches and will likely result in less code. In fact, let’s see what sort of assembly the JIT will output: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters movsxd r9,r8d ; r9 = i mov r9,[rdx+r9*8] ; r9 = items[i] mov r10,r9 ; copy the item to r10 temporary not r10 ; flips the bits, important for the next instruction shr r10,3F ; r10 = r9 >> 63 (basically, if < 0, it's 1, otherwise 0) movsxd r11,eax ; r11 = outputIdx mov [rdx+r11*8],r9 ; items[outputIdx] = items[i] add eax,r10d ; outputIdx += (items[i] < 0 ? 0 : 1); lea r9d,[r8+1] ; i++ movsxd r9,r9d mov r9,[rdx+r9*8] ; r9 = items[+1] mov r10,r9 ; and the same here for the next round... not r10 shr r10,3F movsxd r11,eax mov [rdx+r11*8],r9 add eax,r10d movsxd r9,r8d mov r9,[rdx+r9*8] mov r10,r9 ; shr r10,3F movsxd r11,eax ; r11 is the outputIdx mov [rdx+r11*8],r9 ; items[outputIdx] add eax,r10d ; outputIdx += r10 (which is 1 i lea r9d,[r8+1] movsxd r9,r9d mov r9,[rdx+r9*8] mov r10,r9 shr r10,3F movsxd r11,eax mov [rdx+r11*8],r9 add eax,r10d view raw Unrolled.asm hosted with ❤ by GitHub What about the performance? I decided to pit the two versions (normal and branchless) head to head and see what this will give us: Method N Mean Error StdDev Ratio Code Size FilterCmp_Unroll 23 276.3 ns 4.13 ns 3.86 ns 1.00 411 B FilterCmp_Unroll_Branchleses 23 263.6 ns 0.95 ns 0.84 ns 0.96 547 B               FilterCmp_Unroll 1047 743.7 ns 9.41 ns 8.80 ns 1.00 411 B FilterCmp_Unroll_Branchleses 1047 733.3 ns 3.54 ns 3.31 ns 0.99 547 B               FilterCmp_Unroll 1048599 502,631.1 ns 3,641.47 ns 3,406.23 ns 1.00 411 B FilterCmp_Unroll_Branchleses 1048599 495,590.9 ns 335.33 ns 297.26 ns 0.99 547 B               FilterCmp_Unroll 33554455 29,356,331.7 ns 207,133.86 ns 172,966.15 ns 1.00 411 B FilterCmp_Unroll_Branchleses 33554455 29,709,835.1 ns 86,129.58 ns 71,922.10 ns 1.01 547 B Surprisingly enough, it looks like the branchless version is very slightly slower. That is a surprise, since I would expect reducing the branches to be more efficient. Looking at the assembly of those two, the branchless version is slightly bigger (10 bytes, not that meaningful). I think that the key here is that there is a 0.5% chance of actually hitting the branch, which is pretty low. That means that the branch predictor can likely do a really good job and we aren’t going to see any big benefits from the branchless version. That said… what would happen if we tested that with 5% negatives? That difference in behavior may cause us to see a different result. I tried that, and the results were quite surprising. In the case of the 1K and 32M items, we see a slightl cost for the branchless version (additional 1% – 4%) while for the 1M entries there is an 18% reduction in latency for the branchless version. I ran the tests again with a 15% change of negative, to see what would happen. In that case, we get: Method N Mean Error StdDev Ratio RatioSD Code Size FilterCmp_Unroll 23 273.5 ns 3.66 ns 3.42 ns 1.00 0.00 537 B FilterCmp_Unroll_Branchleses 23 280.2 ns 4.85 ns 4.30 ns 1.03 0.02 547 B                 FilterCmp_Unroll 1047 1,675.7 ns 29.55 ns 27.64 ns 1.00 0.00 537 B FilterCmp_Unroll_Branchleses 1047 1,676.3 ns 16.97 ns 14.17 ns 1.00 0.02 547 B                 FilterCmp_Unroll 1048599 2,206,354.4 ns 6,141.19 ns 5,444.01 ns 1.00 0.00 537 B FilterCmp_Unroll_Branchleses 1048599 1,688,677.3 ns 11,584.00 ns 10,835.68 ns 0.77 0.01 547 B                 FilterCmp_Unroll 33554455 205,320,736.1 ns 2,757,108.01 ns 2,152,568.58 ns 1.00 0.00 537 B FilterCmp_Unroll_Branchleses 33554455 199,520,169.4 ns 2,097,285.87 ns 1,637,422.86 ns 0.97 0.01 547 B As you can see, we have basically the same cost under 15% negatives for small values, a big improvement on the 1M scenario and not much improvement on the 32M scenario. All in all, that is very interesting information. Digging into the exact why and how of that means pulling a CPU instruction profiler and starting to look at where we have stalls, which is a bit further that I want to invest in this scenario. What if we’ll try to rearrange the code a little bit. The code looks like this (load the value and AddToOutput() immediately): AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 0)); What if we split it a little bit, so the code will look like this: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters for (; i + 4 < items.Length; i += 4) { var i0 = Unsafe.Add(ref itemsRef, i + 0); var i1 = Unsafe.Add(ref itemsRef, i + 1); var i2 = Unsafe.Add(ref itemsRef, i + 2); var i3 = Unsafe.Add(ref itemsRef, i + 3); AddToOutput(ref itemsRef, i0); AddToOutput(ref itemsRef, i1); AddToOutput(ref itemsRef, i2); AddToOutput(ref itemsRef, i3); } view raw ReShuffle.cs hosted with ❤ by GitHub The idea here is that we are trying to get the JIT / CPU to fetch the items before they are actually needed, so there would be more time for the memory to arrive. Remember that for the 1M scenario, we are dealing with 8MB of memory and for the 32M scenario, we have 256MB. Here is what happens when we look at the loop prolog, we can see that it is indeed first fetching all the items from memory, then doing the work: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters movsxd r9,r8d mov r9,[rdx+r9*8] lea r10d,[r8+1] movsxd r10,r10d mov r10,[rdx+r10*8] lea r11d,[r8+2] movsxd r11,r11d mov r11,[rdx+r11*8] lea esi,[r8+3] movsxd rsi,esi mov rsi,[rdx+rsi*8] test r9,r9 view raw shuffled.asm hosted with ❤ by GitHub In terms of performance, that gives us a small win (1% – 2% range) for the 1M and 32M entries scenario. The one last thing that I wanted to test is if we’ll unroll the loop even further, what would happen if we did 8 items per loop, instead of 4. This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters for (; i + 8 < items.Length; i += 8) { Sse.Prefetch2(Unsafe.AsPointer(ref Unsafe.Add(ref itemsRef, i + 0))); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 0)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 1)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 2)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 3)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 4)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 5)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 6)); AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 7)); } view raw Roll8AndPreftech.cs hosted with ❤ by GitHub There is some improvement, (4% in the 1K scenario, 1% in the 32M scenario) but also slowdowns  (2% in the 1M scenario). I think that this is probably roughly the end of the line as far as we can get for scalar code. We already made quite a few strides in trying to parallelize the work the CPU is doing by just laying out the code as we would like it to be. We tried to control the manner in which it touches memory and in general, those are pretty advanced techniques. To close this post, I would like to take a look at the gains we got. I’m comparing the first version of the code, the last version we had on the previous post and the unrolled version for both branchy and branchless with 8 operations at once and memory prefetching. Method N Mean Error StdDev Ratio RatioSD Code Size FilterCmp 23 277.3 ns 0.69 ns 0.64 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 23 270.7 ns 0.42 ns 0.38 ns 0.98 0.00 397 B FilterCmp_Unroll_8 23 257.6 ns 1.45 ns 1.21 ns 0.93 0.00 672 B FilterCmp_Unroll_8_Branchless 23 259.9 ns 1.96 ns 1.84 ns 0.94 0.01 682 B                 FilterCmp 1047 754.3 ns 1.38 ns 1.22 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 1047 749.0 ns 1.81 ns 1.69 ns 0.99 0.00 397 B FilterCmp_Unroll_8 1047 647.2 ns 2.23 ns 2.09 ns 0.86 0.00 672 B FilterCmp_Unroll_8_Branchless 1047 721.2 ns 1.23 ns 1.09 ns 0.96 0.00 682 B                 FilterCmp 1048599 499,675.6 ns 2,639.97 ns 2,469.43 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 1048599 494,388.4 ns 600.46 ns 532.29 ns 0.99 0.01 397 B FilterCmp_Unroll_8 1048599 426,940.7 ns 1,858.57 ns 1,551.99 ns 0.85 0.01 672 B FilterCmp_Unroll_8_Branchless 1048599 483,940.8 ns 517.14 ns 458.43 ns 0.97 0.00 682 B                 FilterCmp 33554455 30,282,334.8 ns 599,306.15 ns 531,269.30 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 33554455 29,410,612.5 ns 29,583.56 ns 24,703.61 ns 0.97 0.02 397 B FilterCmp_Unroll_8 33554455 29,102,708.3 ns 42,824.78 ns 40,058.32 ns 0.96 0.02 672 B FilterCmp_Unroll_8_Branchless 33554455 29,761,841.1 ns 48,108.03 ns 42,646.51 ns 0.98 0.02 682 B The unrolled 8 version is the winner by far, in this scenario (0.5% negatives). Since that is the scenario we have in the real code, that is what I’m focusing on. Is there anything left to do here? My next step is to explore whether using vector instructions will be a good option for us.

Filtering negative numbers, fast

by Oren Eini

posted on: September 11, 2023

While working deep on the guts of RavenDB, I found myself with a seemingly simple task. Given a list of longs, I need to filter out all negative numbers as quickly as possible. The actual scenario is that we run a speculative algorithm, given a potentially large list of items, we check if we can fulfill the request in an optimal fashion. However, if that isn’t possible, we need to switch to a slower code path that does more work. Conceptually, this looks something like this: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters void Run(Span<long> entries, long switchOverPoint, Aggregation state) { long cost = 0; for (var i = 0; i < entries.Length; i++) { var (item, searchCost) = FetchCheaply(entries[i]); cost += searchCost; if(item is not null) { entries[i] = -entries[i]; // mark as processed if(state.Aggregate(item)) return; // we have enough results to bail early } if(cost > switchOverPoint) { // speculative execution failed, we need to do this the hard way RunManually(entries, state); return; } } } view raw Choices.cs hosted with ❤ by GitHub That is the setup for this story. The problem we have now is that we now need to filter the results we pass to the RunManually() method. There is a problem here, however. We marked the entries that we already used in the list by negating them. The issue is that RunManually() does not allow negative values, and its internal implementation is not friendly to ignoring those values. In other words, given a Span<long>, I need to write the code that would filter out all the negative numbers. Everything else about the list of numbers should remain the same (the order of elements, etc). From a coding perspective, this is as simple as: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters public Span<long> FilterNegative(Span<long> entries) { return entries.ToArray().Where(l => l > 0).ToArray(); } view raw VeryBad.cs hosted with ❤ by GitHub Please note, just looking at this code makes me cringe a lot. This does the work, but it has an absolutely horrible performance profile. It allocates multiple arrays, uses a lambda, etc. We don’t actually care about the entries here, so we are free to modify them without allocating a new value. As such, let’s see what kind of code we can write to do this work in an efficient manner. Here is what I came up with: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters public static int FilterCmp(Span<long> items) { int output = 0; for (int i = 0; i < items.Length; i++) { if (items[i] < 0) continue; items[output++] = items[i]; } return output; } view raw Cmp.cs hosted with ❤ by GitHub The way this works is that we scan through the list, skipping writing the negative lists, so we effectively “move down” all the non-negative lists on top of the negative ones. This has a cost of O(N) and will modify the entire array, the final output is the number of valid items that we have there. In order to test the performance, I wrote the following harness: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters [RPlotExporter] [DisassemblyDiagnoser] public class SimdVsScalar { private long[] items; [Params(1024 + 23, 1024 * 1024 + 23, 32 * 1024 * 1024 + 23)] public int N; [GlobalSetup] public void Setup() { items = new long[N]; var r = new Random(2391); for (int i = 0; i < items.Length; i++) { items[i] = r.NextInt64(); } } private void SprinkleNegatives() { var r = new Random(13245); var negatives = Math.Max((int)(items.Length * 0.005), 1); for (int i = 0; i < negatives; i++) { var idx = r.Next(items.Length); items[idx] = -items[idx]; } } [Benchmark] public int FilterOr() { SprinkleNegatives(); return Filter.FilterOr(items); } [Benchmark] public int FilterCmp() { SprinkleNegatives(); return Filter.FilterCmp(items); } [Benchmark] public int Base() { SprinkleNegatives(); return items.Length; } } view raw Bench.cs hosted with ❤ by GitHub We compare 1K, 1M and 32M elements arrays, each of which has about 0.5% negative, randomly spread across the range. Because we modify the values directly, we need to sprinkle the negatives across the array on each call. In this case, I’m testing two options for this task, one that uses a direct comparison (shown above) and one that uses bitwise or, like so: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters public static int FilterOr(Span<long> items) { int output = 0; for (int i = 0; i < items.Length; i++) { if ((items[i] & ~long.MaxValue) != 0) continue; items[output++] = items[i]; } return output; } view raw FilterOr.cs hosted with ❤ by GitHub I’m testing the cost of sprinkling negatives as well, since that has to be done before each benchmark call (since we modify the array during the call, we need to “reset” its state for the next one). Given the two options, before we discuss the results, what would you expect to be the faster option? How would the size of the array matter here? I really like this example, because it is simple, there isn’t any real complexity in what we are trying to do. And there is a very straightforward implementation that we can use as our baseline. That also means that I get to analyze what is going on at a very deep level. You might have noticed the disassembler attribute on the benchmark code, we are going to dive deep into that. For the same reason, we aren’t using exactly 1K, 1M, or 32M arrays, but slightly higher than that, so we’ll have to deal with remainders later on. Let’s first look at what the JIT actually did here. Here is the annotated assembly for the FilterCmp function: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters ; Filter.FilterCmp(System.Span`1<Int64>) sub rsp,28 ; Reserve stack space mov rax,[rcx] ; rax now holds the pointer from the span mov edx,[rcx+8] ; edx now holds the length of the span xor ecx,ecx ; zero ecx (i) xor r8d,r8d ; zero r8d (output) test edx,edx ; if items.Length <= 0 jle short M02_L02 ; jump the epilog & return M02_L00: mov r9d,r8d mov r9,[rax+r9*8] ; r9 = items[i] test r9,r9 ; r9 < 0 jl short M02_L01 ; continue to next iteration ; items[output] -- range check lea r10d,[rcx+1] ; r10d = output+1 cmp ecx,edx ; check against items.length jae short M02_L03 ; jump to out of range exception mov ecx,ecx ; clear high bits in rcx (overflow from addition?) mov [rax+rcx*8],r9 ; items[output] = items[i] mov ecx,r10d ; output = r10d M02_L01: inc r8d ; i++ cmp r8d,edx ; i < item.Length jl short M02_L00 ; back to start of the loop M02_L02: mov eax,ecx ; return output add rsp,28 ret M02_L03: ; range check failure call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 69 view raw FilterCmp.asm hosted with ❤ by GitHub For the FilterOr, the code is pretty much the same, except that the key part is: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters - test r9,r9 - jl short M02_L01 + mov r10,8000000000000000 + test r10,r9 + jne short M02_L01 view raw CmpVsOr.diff hosted with ❤ by GitHub As you can see, the cmp option is slightly smaller, in terms of code size. In terms of performance, we have: Method N Mean FilterOr 1047 745.6 ns FilterCmp 1047 745.8 ns — – – FilterOr 1048599 497,463.6 ns FilterCmp 1048599 498,784.8 ns — – – FilterOr 33554455 31,427,660.7 ns FilterCmp 33554455 30,024,102.9 ns The costs are very close to one another, with Or being very slightly faster on low numbers, and Cmp being slightly faster on the larger sizes. Note that the difference level between them is basically noise. They have the same performance. The question is, can we do better here? Looking at the assembly, there is an extra range check in the main loop that the JIT couldn’t elide (the call to items[output++]). Can we do something about it, and would it make any difference in performance? Here is how I can remove the range check: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters public static int FilterCmp_NoRangeCheck(Span<long> items) { int outputIdx = 0; int i = 0; ref var output = ref items[i]; for (; i < items.Length; i++) { if (items[i] < 0) continue; ref var outputDest = ref Unsafe.Add(ref output, outputIdx++); outputDest = items[i]; } return outputIdx; } view raw FilterCmp_NoRangeCheck.cs hosted with ❤ by GitHub Here I’m telling the JIT: “I know what I’m doing”, and it shows. Let’s look at the assembly changes between those two methods, first the prolog: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters ; Filter.FilterCmp_NoRangeCheck(System.Span`1<Int64>) sub rsp,28 ; setup stack space mov rax,[rcx] ; rax is not the pointer for the span mov edx,[rcx+8] ; edx is the length of the span xor ecx,ecx ; zero ecx (i = 0) xor r8d,r8d ; zero r8d (outputIdx = 0) test edx,edx ; if items.Length == 0 je short M02_L03 ; jump to CORINFO_HELP_RNGCHKFAIL mov r9,rax ; output = items[0] test edx,edx ; if items.Length <= 0 jle short M02_L02 ; jump past the for loop view raw FilterCmp_NoRangeCheck.prolog.asm hosted with ❤ by GitHub Here you can see what we are actually doing here. Note the last 4 instructions, we have a range check for the items, and then we have another check for the loop. The first will get you an exception, the second will just skip the loop. In both cases, we test the exact same thing. The JIT had a chance to actually optimize that, but didn’t. Here is a funny scenario where adding code may reduce the amount of code generated. Let’s do another version of this method: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters public static int FilterCmp_NoRangeCheck(Span<long> items) { if (items.Length == 0) return 0; int outputIdx = 0; int i = 0; ref var output = ref items[i]; for (; i < items.Length; i++) { if (items[i] < 0) continue; ref var outputDest = ref Unsafe.Add(ref output, outputIdx++); outputDest = items[i]; } return outputIdx; } view raw FilterCmp_NoRangeCheck2.cs hosted with ❤ by GitHub In this case, I added a check to handle the scenario of items being empty. What can the JIT do with this now? It turns out, quite a lot. We dropped 10 bytes from the method, which is a nice result of our diet.  Here is the annotated version of the assembly: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters ; Filter.FilterCmp_NoRangeCheck(System.Span`1<Int64>) mov rdx,[rcx] ; rdx = pointer of items mov ecx,[rcx+8] ; ecx = length of items test ecx,ecx ; items.Length == 0 jne short M02_L00 ; if not true, jump ahead xor eax,eax ; return 0 ret M02_L00: xor eax,eax ; outputIdx = 0 xor r8d,r8d ; i = 0 mov r9,rdx ; output = items.pointer test ecx,ecx ; if items.Length <= 0 jle short M02_L03 ; jump to end function exit M02_L01: mov r10d,r8d mov r10,[r9+r10*8] ; r10 = items[i] test r10,r10 ; if items[i] < 0 jl short M02_L02 lea r11d,[rax+1] ; r11d = output +1 cdqe ; sign extent RAX (32 bits module on the addition) mov [rdx+rax*8],r10 ; copy r10 (items[i]) to items[r11d] mov eax,r11d ; outputIdx = r11d M02_L02: inc r8d ; i++ cmp r8d,ecx ; i < items.Length jl short M02_L01 ; do the loop again M02_L03: ret ; return ; Total bytes of code 59 view raw FilterCmp_NoRangeCheck.asm hosted with ❤ by GitHub A lot of the space savings in this case come from just not having to do a range check, but you’ll note that we still do an extra check there (lines 12..13), even though we already checked that. I think that the JIT knows that the value is not zero at this point, but has to consider that the value may be negative. If we’ll change the initial guard clause to: items.Length <= 0, what do you think will happen? At this point, the JIT is smart enough to just elide everything, we are at 55 bytes of code and it is a super clean assembly (not a sentence I ever thought I would use). I’ll spare you going through more assembly listing, but you can find the output here. And after all of that, where are we at? Method N Mean Error StdDev Ratio RatioSD Code Size FilterCmp 23 274.5 ns 1.91 ns 1.70 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 23 269.7 ns 1.33 ns 1.24 ns 0.98 0.01 397 B                 FilterCmp 1047 744.5 ns 4.88 ns 4.33 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 1047 745.8 ns 3.44 ns 3.22 ns 1.00 0.00 397 B                 FilterCmp 1048599 502,608.6 ns 3,890.38 ns 3,639.06 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 1048599 490,669.1 ns 1,793.52 ns 1,589.91 ns 0.98 0.01 397 B                 FilterCmp 33554455 30,495,286.6 ns 602,907.86 ns 717,718.92 ns 1.00 0.00 411 B FilterCmp_NoRangeCheck 33554455 29,952,221.2 ns 442,176.37 ns 391,977.84 ns 0.99 0.02 397 B There is a very slight benefit to the NoRangeCheck, but even when we talk about 32M items, we aren’t talking about a lot of time. The question what can we do better here?

Listing Windows virtual desktops using .NET

by Gérald Barré

posted on: September 11, 2023

In the blog post Per virtual desktop single-instance application, I introduced the IVirtualDesktopManager interface to detect if a window is on the current virtual desktop. In this post, I describe how to list the virtual desktops. This can be useful to provide a list of virtual desktops to the use