Relatively General .NET

RavenDB version 6.0 is now live

by Oren Eini

posted on: October 02, 2023

Today marks the end of a very long journey for RavenDB, as we release version 6.0 into the wild. This completes a multi-year effort on our part, which I'm really excited to share with you. In version 6.0 the RavenDB team had a number of goals, with the primary one being, as always, providing a simple and powerful platform for your data and documents. The full list of changes in the release is here and you can already download the binaries of the new release to try it out. Below you'll find more information about some of the highlights of the new release, but the executive summary for the technically inclined is:

- Deep integration with Kafka & RabbitMQ - allowing you to send and consume messages directly from RavenDB.
- Sharding support allows you to run massively large databases easily and cheaply.
- Corax is a new indexing engine for RavenDB that can provide a 10x performance improvement for both indexing and queries.

With this new release we are also announcing a pricing update for our commercial customers. You can find more details at the bottom of this post.

Before we get to the gist of the changes, I want to spend a minute talking about this release. One of the reasons that we took so long to craft this release was an uncompromising eye toward ease of use, performance, and quality of experience. It may sound strange to talk about "quality of experience" when discussing a database engine; we aren't talking about a weekend trip or a romantic getaway, after all. The reason RavenDB exists is that we found other databases to be complicated and delicate beasts, and we wanted to create a database that would be a joy to work with. Something that you could throw your data and queries at, and it would just work. With version 6.0, I'm happy to say that we managed to keep on doing just that while providing you with a host of new features and capabilities. Without further ado, allow me to tell you about the highlights of this release.

Sharding

Sharding is the process of splitting your data among multiple nodes, to allow you to scale beyond the terabyte range. RavenDB actually had this feature as far back as 2010, when the 1.0 version came out. I'm speaking about this today because we have decided to revamp the entire approach to sharding in the new release. Sharding in RavenDB version 6.0 consists of selecting the sharded option at database creation time, and everything else is handled by the cluster. Splitting the data across the nodes, balancing load, and giving you the feeling that you are working on a unified database are all our responsibility. A chief goal of the new sharding feature in RavenDB was to make sure that you don't have to be aware that you are running in a sharded environment. In fact, you can even switch the backend to a sharded database behind your application's back, with no modifications to the application. Your systems can take advantage of the sharding feature, but they aren't forced to. This makes it easier when you need to scale because you don't need an expensive rewrite or a stop-the-world migration to move to the new architecture. RavenDB, with or without sharding, offers you the same capabilities. Features such as indexing, map/reduce, ETL tasks, or subscriptions all work in the same way regardless of how you choose to store your data. A successful metric for us is that after you switch over to sharding, the only thing that you'll notice is that you can now scale your systems even more. And, of course, you can do that easily, safely, and quickly.
I recorded a webinar showcasing our new sharding approach that you can watch.

Corax Indexing Engine

The Corax indexing engine is now available in version 6.0. We have been working on it since 2014, and as you can imagine, it hasn't been a simple challenge to solve. Corax has a single task: to do everything that we use Lucene for, but faster. In fact, much faster. RavenDB has used Lucene as its indexing engine from the start, and Lucene has been the industry benchmark for search engines for a long time. In the context of RavenDB, we have optimized Lucene heavily for the past 15 years. When building Corax, it wasn't sufficient to match what Lucene can do; we had to significantly outperform it. With version 6.0, we now offer both Corax and Lucene as indexing engines for RavenDB. You can choose to use either engine (globally or per index). Upgrading from an older version of RavenDB, you'll keep on using the Lucene engine but can create new indexes with Corax. Corax outperforms Lucene on a wide range of scenarios by an order of magnitude. This includes both indexing time and query latency. Corax also manages to be far faster than Lucene while utilizing less CPU and memory. One of the ways it does that is by preparing, in advance, the relevant data structures that Lucene needs to compute at runtime. Corax indexes tend to consume about 10% - 20% more disk space than Lucene, as a trade-off for lower memory usage, far faster querying performance, and reduced indexing time. In the same manner as sharding, switching to the new indexing engine will get you performance improvements, with no difference in capabilities or features.

Kafka & RabbitMQ Integration

Kafka & RabbitMQ integration has been extended in version 6.0. RavenDB already supported writing to Kafka and RabbitMQ, and we have now extended this capability to allow you to read messages from Kafka and RabbitMQ and turn them into documents. This capability provides complete support for queuing infrastructure integration. RavenDB can work with existing pipelines and easily expose its data to the rest of the organization, as well as consume messages from the queue and create or update documents accordingly. Such documents are then available for querying, aggregation, etc. This makes it trivial to use a low-code solution to quickly and efficiently plug your system into Kafka and RabbitMQ. In the style of all RavenDB features, we take upon ourselves all the complexity inherent in such a solution. You only need to provide the connection string and the details about what you want to pull, and the full management of pulling or pushing data, handling distributed state, failover, and monitoring is the responsibility of the RavenDB cluster.

Data Archiving

Data archiving is a new capability in RavenDB, aiming to help users who deal with very large amounts of data. While a business wants to retain all its data forever, it typically works primarily with recent data, usually from the last year or two. As time goes by, old data accumulates, and going through irrelevant data consumes computing and memory resources from the database. RavenDB version 6.0 offers a way to mark a document with an attribute specifying when RavenDB should move that document to archive mode. Aside from telling RavenDB at what time we should archive a document, there is nothing further that you need to do. When the time comes to archive the document, RavenDB will change the document mode, which will have a number of effects.
The document itself is retained; it is not deleted. If you aren't already using document compression, the archived document is stored in a compressed manner, to further reduce disk usage. Subscriptions and indexes will ignore archived documents unless explicitly requested; archived documents remain on disk and consume no memory resources. Archiving reduces the resources consumed by archived documents while keeping them available if you need to look them up. You can also construct indexes that operate over both archived and regular documents if you still want to allow queries over archived data. In this way, you can retain your entire dataset in the main system of record but operate on a significantly smaller chunk of it during normal operations.

Performance, observability, and usability enhancements have also been on the menu for the version 6.0 release. There are far too many individual features and improvements to count them all in this format. I encourage you to check the full release details on the website.

You are probably well aware of the drastic changes that happened in the business environment in the last few years. As part of adjusting to this new reality, we are updating our pricing to allow us to keep delivering top-notch solutions and support to our valued customers. The new pricing policy is effective starting from January 1st, 2024. What does this mean to you right now? Absolutely nothing until January 1st, 2024. All subscription renewals in 2023 keep their current price point. An early-bird benefit for our existing customers: renew your 2024 subscription and lock in your 2023 price point. *Available for Cloud and on-premises customers. In the weeks ahead, we'll provide detailed updates about these changes and ensure a smooth transition. Our goal is to empower you to make optimized choices about your RavenDB subscription. Your satisfaction remains our priority. Our team is ready to assist you. Please reach out to us at sales@ravendb.net for any questions you have.

Finally, I would like to thank the RavenDB team for their hard work in delivering this version. Our goal for this release was to deliver a lot more while making sure that you'll not be exposed to the underlying complexity of the problems that we are solving for you. I believe that as you try out the new release, you'll see how successful we have been in providing you with an excellent database platform to take your systems to new heights. And with that, all that remains is to encourage you to try out the new version and let us know what you think about it.

How to test the logs from ILogger in .NET

by Gérald Barré

posted on: October 02, 2023

You sometimes need to validate that a log message is written to the log. In this post, I describe how to unit test ILogger in .NET Core. There are multiple strategies: you can use a mock framework like Moq to mock the ILogger interface and validate the right calls are made, or you can store all logs in-memory.
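To make the second strategy concrete, here is a minimal sketch of an in-memory logger you could hand to the code under test and assert against afterwards (the type and member names are my own, not necessarily what the post uses):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

// Collects every formatted log message so a test can assert on it afterwards.
public sealed class InMemoryLogger : ILogger
{
    public List<string> Messages { get; } = new();

    public IDisposable? BeginScope<TState>(TState state) where TState : notnull => null;

    public bool IsEnabled(LogLevel logLevel) => true;

    public void Log<TState>(LogLevel logLevel, EventId eventId, TState state,
        Exception? exception, Func<TState, Exception?, string> formatter)
        => Messages.Add($"{logLevel}: {formatter(state, exception)}");
}

// Usage in a test (OrderProcessor is a hypothetical class under test):
// var logger = new InMemoryLogger();
// new OrderProcessor(logger).Process();
// Assert.Contains(logger.Messages, m => m.Contains("Order processed"));
```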

Introduction to MassTransit: A Guide to Streamlined Messaging in C#

by Ardalis

posted on: September 28, 2023

If you're interested in building distributed, scalable, and robust applications in C#, you may have heard of MassTransit. This open-source…
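As a taste of what the guide covers, here is a hedged sketch of the typical MassTransit shape: a message contract, a consumer, and the bus registration. I'm using the in-memory transport and made-up type names, so treat it as an illustration rather than the article's own sample:

```csharp
using System;
using System.Threading.Tasks;
using MassTransit;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

public record OrderSubmitted(Guid OrderId);

public class OrderSubmittedConsumer : IConsumer<OrderSubmitted>
{
    public Task Consume(ConsumeContext<OrderSubmitted> context)
    {
        Console.WriteLine($"Order received: {context.Message.OrderId}");
        return Task.CompletedTask;
    }
}

public static class Program
{
    public static async Task Main()
    {
        var host = Host.CreateDefaultBuilder()
            .ConfigureServices(services =>
            {
                services.AddMassTransit(x =>
                {
                    x.AddConsumer<OrderSubmittedConsumer>();
                    // In-memory transport; swap for RabbitMQ/Azure Service Bus in real use.
                    x.UsingInMemory((context, cfg) => cfg.ConfigureEndpoints(context));
                });
            })
            .Build();

        await host.StartAsync();

        // Publish a message; MassTransit routes it to the consumer's endpoint.
        var bus = host.Services.GetRequiredService<IBus>();
        await bus.Publish(new OrderSubmitted(Guid.NewGuid()));

        await Task.Delay(500); // give the consumer a moment before shutting down
        await host.StopAsync();
    }
}
```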

How to write logs from ILogger to xUnit.net ITestOutputHelper

by Gérald Barré

posted on: September 25, 2023

When a test fails, it's important to understand what happened. All information could be useful. If you are using ILogger in your application, you can use the logs to understand what happened. However, the logs are not displayed in the output of xUnit. You can change this behavior by using a custom ILoggerProvider.
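A rough sketch of that idea: an ILoggerProvider that forwards every message to xUnit's ITestOutputHelper (the names and structure here are my own; the post's implementation likely differs in its details):

```csharp
using System;
using Microsoft.Extensions.Logging;
using Xunit.Abstractions;

public sealed class XunitLoggerProvider : ILoggerProvider
{
    private readonly ITestOutputHelper _output;
    public XunitLoggerProvider(ITestOutputHelper output) => _output = output;

    public ILogger CreateLogger(string categoryName) => new XunitLogger(_output, categoryName);
    public void Dispose() { }

    private sealed class XunitLogger : ILogger
    {
        private readonly ITestOutputHelper _output;
        private readonly string _category;

        public XunitLogger(ITestOutputHelper output, string category)
            => (_output, _category) = (output, category);

        public IDisposable? BeginScope<TState>(TState state) where TState : notnull => null;
        public bool IsEnabled(LogLevel logLevel) => true;

        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state,
            Exception? exception, Func<TState, Exception?, string> formatter)
            => _output.WriteLine($"[{logLevel}] {_category}: {formatter(state, exception)}");
    }
}

// In a test class: receive ITestOutputHelper via the constructor, then:
// using var factory = LoggerFactory.Create(b => b.AddProvider(new XunitLoggerProvider(output)));
// var logger = factory.CreateLogger<MyService>(); // MyService is a hypothetical class under test
```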

Challenge

by Oren Eini

posted on: September 19, 2023

The following bug cost me a bunch of time; can you see what I'm doing wrong?

```csharp
private static int AddAndSort(Dictionary<long, long> freeListAdditions, Span<long> keys, Span<long> vals)
{
    int index = 0;
    foreach (var (k, v) in freeListAdditions)
    {
        keys[index] = k;
        vals[index] = v;
        index++;
    }
    keys.Sort(vals);
    return index;
}
```

For fun, it's so nasty because, usually, it will accidentally work.

Should you use the .NET 8 Identity API endpoints?

by Andrew Lock

posted on: September 19, 2023

In this post I discuss the new Identity API endpoints and talk about some of the security and architectural issues they may introduce.…
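For context, this is roughly what wiring up those endpoints looks like in .NET 8; a minimal sketch with placeholder types of my own, not code from Andrew's post:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Identity;
using Microsoft.AspNetCore.Identity.EntityFrameworkCore;
using Microsoft.AspNetCore.Routing;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Placeholder EF Core store; a real app would use a proper database provider.
builder.Services.AddDbContext<AppDbContext>(o => o.UseInMemoryDatabase("identity"));
builder.Services.AddAuthorization();

// Registers the user manager, token providers, etc. needed by the endpoints.
builder.Services.AddIdentityApiEndpoints<IdentityUser>()
    .AddEntityFrameworkStores<AppDbContext>();

var app = builder.Build();

// Maps /register, /login, /refresh and friends under the /identity prefix.
app.MapGroup("/identity").MapIdentityApi<IdentityUser>();

app.Run();

public class AppDbContext : IdentityDbContext<IdentityUser>
{
    public AppDbContext(DbContextOptions<AppDbContext> options) : base(options) { }
}
```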

Accessing private members without reflection in C#

by Gérald Barré

posted on: September 18, 2023

Reflection allows you to access private members of a class. This is very useful when you want to access private members of a class that you don't own. However, it is slow and doesn't work well with Native AOT. In this post, I describe how to access private members without reflection in .NET 8. Let's
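The .NET 8 mechanism this post refers to is presumably the new [UnsafeAccessor] attribute; here is a hedged sketch of how it is used (the Widget type and its members are invented for illustration):

```csharp
using System.Runtime.CompilerServices;

public class Widget
{
    private int _count = 42;          // private field we want to reach
    private void Reset() => _count = 0;
}

public static class WidgetAccessors
{
    // The runtime generates a direct field accessor - no reflection involved.
    [UnsafeAccessor(UnsafeAccessorKind.Field, Name = "_count")]
    public static extern ref int GetCount(Widget widget);

    // Same idea for a private method.
    [UnsafeAccessor(UnsafeAccessorKind.Method, Name = "Reset")]
    public static extern void CallReset(Widget widget);
}

// Usage:
// var w = new Widget();
// ref int count = ref WidgetAccessors.GetCount(w); // read/write the private field
// WidgetAccessors.CallReset(w);                    // invoke the private method
```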

Filtering negative numbers, fast

by Oren Eini

posted on: September 15, 2023

In the previous post, I was able to utilize AVX to get some nice speedups. In general, I was able to save up to 57%(!) of the runtime in processing arrays of 1M items. That is really amazing, if you think about it. But my best effort only gave me a 4% improvement when using 32M items. I decided to investigate what is going on in more depth, and I came up with the following benchmark. Given that I want to filter negative numbers, what would happen if the only negative number in the array was the first one? In other words, let's see what happens when we could write this algorithm as the following line of code:

```csharp
array[1..].CopyTo(array);
```

The idea here is that we should measure the speed of raw memory copy and see how that compares to our code. Before we dive into the results, I want to make a few things explicit. We are dealing here with arrays of long; when I'm talking about an array with 1M items, I'm actually talking about an 8MB buffer, and for the 32M items, we are talking about 256MB of memory. I'm running these benchmarks on the following machine:

- AMD Ryzen 9 5950X 16-Core Processor
- Base speed: 3.40 GHz
- L1 cache: 1.0 MB
- L2 cache: 8.0 MB
- L3 cache: 64.0 MB
- Utilization: 9%
- Speed: 4.59 GHz

In other words, when we look at this, the 1M items (8MB) can fit into L2 (barely), and is certainly backed by the L3 cache. For the 32M items (256MB), we are far beyond what can fit in the cache, so we are probably dealing with memory bandwidth issues. I wrote the following functions to test it out:

```csharp
[Benchmark]
public int CopyTo()
{
    items[0] = -items[0];
    Span<long> itemsSpan = items;
    itemsSpan[1..].CopyTo(itemsSpan);
    return items.Length - 1;
}

[Benchmark]
public unsafe int MemoryCopy()
{
    items[0] = -items[0];
    fixed (long* p = items)
    {
        Buffer.MemoryCopy(p, p + 1, items.Length * sizeof(long), (items.Length - 1) * sizeof(long));
    }
    return items.Length - 1;
}

[Benchmark]
public unsafe int MoveMemory()
{
    items[0] = -items[0];
    fixed (long* p = items)
    {
        MoveMemory(p + 1, p, items.Length * sizeof(long));
    }
    return items.Length - 1;
}

[DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
static unsafe extern void MoveMemory(void* dest, void* src, int size);
```

Let's look at what I'm actually testing here:

- CopyTo() – using the span native copying is the most ergonomic way to do things, I think.
- MemoryCopy() – uses a built-in unsafe API in the framework. That eventually boils down to a heavily optimized routine, which… calls into Memmove() if the buffers overlap (as they do in this case).
- MoveMemory() – uses a P/Invoke call to the Windows API to do the moving of memory for us.

Here are the results for the 1M case (8MB):

| Method | N | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|
| FilterCmp | 1048599 | 441.4 us | 1.78 us | 1.58 us | 1.00 |
| FilterCmp_Avx | 1048599 | 141.1 us | 2.70 us | 2.65 us | 0.32 |
| CopyTo | 1048599 | 872.8 us | 11.27 us | 10.54 us | 1.98 |
| MemoryCopy | 1048599 | 869.7 us | 7.29 us | 6.46 us | 1.97 |
| MoveMemory | 1048599 | 126.9 us | 0.28 us | 0.25 us | 0.29 |

We can see some real surprises here. I'm using the FilterCmp (the very basic implementation) that I wrote. I cannot explain why CopyTo() and MemoryCopy() are so slow.
What is really impressive is that FilterCmp_Avx() and MoveMemory() are so close in performance, and so much faster. To put it another way, we are already at a stage where we are within shouting distance of the MoveMemory() performance. That is… really impressive.

That said, what happens with 32M (256MB)?

| Method | N | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|
| FilterCmp | 33554455 | 22,763.6 us | 157.23 us | 147.07 us | 1.00 |
| FilterCmp_Avx | 33554455 | 20,122.3 us | 214.10 us | 200.27 us | 0.88 |
| CopyTo | 33554455 | 27,660.1 us | 91.41 us | 76.33 us | 1.22 |
| MemoryCopy | 33554455 | 27,618.4 us | 136.16 us | 127.36 us | 1.21 |
| MoveMemory | 33554455 | 20,152.0 us | 166.66 us | 155.89 us | 0.89 |

Now we are faster in FilterCmp_Avx than MoveMemory. That is… a pretty big wow, and a really nice close for this blog post series, right? Except that we won't be stopping here. The way the task I set out works, we are actually filtering just the first item out, and then we are basically copying the memory. Let's do some math: 256MB in 20.1ms means 12.4 GB/sec! On this system, I have the following memory setup:

- 64.0 GB
- Speed: 2133 MHz
- Slots used: 4 of 4
- Form factor: DIMM
- Hardware reserved: 55.2 MB

I'm using DDR4 memory, so I can expect a maximum speed of 17GB/sec. In theory, I might be able to get more if the memory is located on different DIMMs, but for the size in question, that is not likely. I'm going to skip the training montage of VTune, understanding memory architecture and figuring out what is actually going on. Let's drop everything and look at what we have with just AVX vs. MoveMemory:

| Method | N | Mean | Error | StdDev | Median | Ratio |
|---|---|---|---|---|---|---|
| FilterCmp_Avx | 1048599 | 141.6 us | 2.28 us | 2.02 us | 141.8 us | 1.12 |
| MoveMemory | 1048599 | 126.8 us | 0.25 us | 0.19 us | 126.8 us | 1.00 |
| FilterCmp_Avx | 33554455 | 21,105.5 us | 408.65 us | 963.25 us | 20,770.4 us | 1.08 |
| MoveMemory | 33554455 | 20,142.5 us | 245.08 us | 204.66 us | 20,108.2 us | 1.00 |

The new baseline is MoveMemory, and in this run, we can see that the AVX code is slightly slower. It's sadly not uncommon to see numbers shift by those ranges when we are testing such micro-optimizations, mostly because we are subject to so many variables that can affect performance. In this case, I dropped all the other benchmarks, which may have changed things. At any rate, using those numbers, we have 12.4GB/sec for MoveMemory() and 11.8GB/sec for the AVX version. The hardware maximum speed is 17GB/sec, so we are quite close to what can be done. For that matter, I would like to point out that the trivial code completed the task at 11GB/sec, so that is quite respectable and shows that the issue here is literally getting the memory to the CPU fast enough. Can we do something about that? I made a pretty small change to the AVX version, like so:

```diff
+if (i + Vector256<long>.Count <= len)
+{
+    var next = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i));
+    len -= Vector256<long>.Count;
     for (; i + Vector256<long>.Count <= len; i += Vector256<long>.Count)
     {
-        var v = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i));
+        var v = next;
+        next = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i + Vector256<long>.Count));
         ...
```

What are we actually doing here?
Instead of loading the value and immediately using it, we load the next value, then execute the loop body, and when we iterate again, we start loading the value after that while processing the current one. The idea is to parallelize load and compute at the instruction level.

Sadly, that didn't seem to do the trick. I saw a 19% additional cost for that version compared to the vanilla AVX one on the 8MB run and a 2% additional cost on the 256MB run. I then realized that there was one really important test that I had to also make, and wrote the following:

```csharp
[Benchmark()]
public unsafe int MoveMemory()
{
    items[0] = -items[0];
    fixed (long* p = items)
    {
        MoveMemory(p + 1, p, items.Length * sizeof(long));
    }
    return items.Length - 1;
}

[DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
static unsafe extern void MoveMemory(void* dest, void* src, int size);

[Benchmark(Baseline = true)]
public unsafe int FillMemory()
{
    items[0] = -items[0];
    fixed (long* p = items)
    {
        var b = (byte)Random.Shared.Next();
        FillMemory(p, items.Length * sizeof(long), b);
    }
    return items.Length - 1;
}

[DllImport("Kernel32.dll", EntryPoint = "RtlFillMemory", SetLastError = false)]
static unsafe extern void FillMemory(void* dest, IntPtr size, byte fill);
```

In other words, let's test the speed of moving memory and filling memory as fast as we possibly can. Here are the results:

| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| MoveMemory | 1048599 | 126.8 us | 0.36 us | 0.33 us | 0.25 | 0.00 | 270 B |
| FillMemory | 1048599 | 513.5 us | 10.05 us | 10.32 us | 1.00 | 0.00 | 351 B |
| MoveMemory | 33554455 | 20,022.5 us | 395.35 us | 500.00 us | 1.26 | 0.02 | 270 B |
| FillMemory | 33554455 | 15,822.4 us | 19.85 us | 17.60 us | 1.00 | 0.00 | 351 B |

This is really interesting: for a small buffer (8MB), MoveMemory is somehow faster. I don't have a way to explain that, but it has been a pretty consistent result in my tests. For the large buffer (256MB), we are seeing results that make more sense to me:

- MoveMemory – 12.5 GB/sec
- FillMemory – 15.8 GB/sec

In other words, for MoveMemory, we are both reading and writing, so we are paying for memory bandwidth in both directions. For filling the memory, we are only writing, so we can get better performance (no need for reads). In other words, we are talking about hitting the real physical limits of what the hardware can do. There are all sorts of tricks that one can pull, but when we are this close to the limit, they are almost always context-sensitive and dependent on many factors.
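As a quick sanity check of the GB/sec figures quoted above, the arithmetic is just bytes processed divided by elapsed time. A minimal sketch (my own, not part of the original benchmark suite):

```csharp
// 33,554,455 longs ≈ 256 MB; the mean times are taken from the tables above.
const long N = 33_554_455;
double bytes = N * sizeof(long);

double GiBPerSec(double meanMicroseconds) =>
    bytes / (meanMicroseconds / 1_000_000) / (1024.0 * 1024 * 1024);

Console.WriteLine(GiBPerSec(20_142.5)); // MoveMemory -> ~12.4 GB/sec
Console.WriteLine(GiBPerSec(15_822.4)); // FillMemory -> ~15.8 GB/sec
```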
To conclude, here are my final results:

| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| FilterCmp_Avx | 1048599 | 307.9 us | 6.15 us | 12.84 us | 0.99 | 0.05 | 270 B |
| FilterCmp_Avx_Next | 1048599 | 308.4 us | 6.07 us | 9.26 us | 0.99 | 0.03 | 270 B |
| CopyTo | 1048599 | 1,043.7 us | 15.96 us | 14.93 us | 3.37 | 0.11 | 452 B |
| ArrayCopy | 1048599 | 1,046.7 us | 15.92 us | 14.89 us | 3.38 | 0.14 | 266 B |
| UnsafeCopy | 1048599 | 309.5 us | 6.15 us | 8.83 us | 1.00 | 0.04 | 133 B |
| MoveMemory | 1048599 | 310.8 us | 6.17 us | 9.43 us | 1.00 | 0.00 | 270 B |
| FilterCmp_Avx | 33554455 | 24,013.1 us | 451.09 us | 443.03 us | 0.98 | 0.02 | 270 B |
| FilterCmp_Avx_Next | 33554455 | 24,437.8 us | 179.88 us | 168.26 us | 1.00 | 0.01 | 270 B |
| CopyTo | 33554455 | 32,931.6 us | 416.57 us | 389.66 us | 1.35 | 0.02 | 452 B |
| ArrayCopy | 33554455 | 32,538.0 us | 463.00 us | 433.09 us | 1.33 | 0.02 | 266 B |
| UnsafeCopy | 33554455 | 24,386.9 us | 209.98 us | 196.42 us | 1.00 | 0.01 | 133 B |
| MoveMemory | 33554455 | 24,427.8 us | 293.75 us | 274.78 us | 1.00 | 0.00 | 270 B |

As you can see, the AVX version alone is comparable to, or slightly beats, the MoveMemory function. I tried things like prefetching memory (the next item, the next cache line, and the next page), using non-temporal loads and stores, and many other things, but this is a pretty tough challenge. What is really interesting is that the original, very simple and obvious implementation clocked in at 11 GB/sec. After pulling pretty much all the stops and tricks, I was able to hit 12.5 GB/sec, with code that I don't think anyone can read, update, or understand in any way without going through deep meditation. That is not a bad result at all. And along the way, I learned quite a bit about how the lowest level of the machine architecture works.

Trunk-Based Development vs. Long-Lived Feature Branches: Which One is Right for Your Software Team?

by Ardalis

posted on: September 14, 2023

When it comes to effective software development strategies, two distinct approaches often lock horns: Trunk-Based Development and Long-Lived…

Filtering negative numbers, fast

by Oren Eini

posted on: September 13, 2023

In the previous post I discussed how we can optimize the filtering of negative numbers by unrolling the loop, looked into branchless code, and in general was able to improve performance by up to 15% over the initial version we started with. We pushed as much as we could on what can be done using scalar code. Now it is time to open a whole new world and see what we can do when we implement this challenge using vector instructions.

The key problem with such tasks is that SIMD, AVX and their friends were designed by… an interesting process, using a perspective that makes sense if you can see in a couple of additional dimensions. I assume that at least some of that is implementation constraints, but the key issue is that when you start using SIMD, you realize that you don't have general-purpose instructions. Instead, you have a lot of dedicated instructions that do one thing, hopefully well, and it is your role to compose them into something that makes sense. Oftentimes, you need to turn the solution on its head in order to successfully solve it using SIMD. The benefit, of course, is that you can get quite an amazing boost in speed when you do this.

The algorithm we use is basically to scan the list of entries and copy to the start of the list only those items that are positive. How can we do that using SIMD? The whole point here is that we want to be able to operate on multiple data, but this particular task isn't trivial. I'm going to show the code first, then discuss what it does in detail:

```csharp
public static int FilterCmp_Avx(Span<long> items)
{
    var len = items.Length;
    if (len <= 0)
        return 0;

    ref var permuteStart = ref Unsafe.AsRef(PermuteTable[0]);
    int outputIdx = 0;
    int i = 0;
    ref var output = ref items[i];
    for (; i + Vector256<long>.Count <= len; i += Vector256<long>.Count)
    {
        var v = Vector256.LoadUnsafe(ref Unsafe.Add(ref output, i));
        var bits = v.ExtractMostSignificantBits();
        if (bits == 0) // do we have _any_ negatives here?
        {
            v.StoreUnsafe(ref Unsafe.Add(ref output, outputIdx));
            outputIdx += Vector256<long>.Count;
            continue;
        }

        // complex case, we have to deal with some negatives
        var permute = Vector256.LoadUnsafe(ref Unsafe.Add(ref permuteStart, (int)bits * (sizeof(int) * 8))).AsInt32();
        var m = Avx2.PermuteVar8x32(v.AsInt32(), permute).AsInt64();
        m.StoreUnsafe(ref Unsafe.Add(ref output, outputIdx));
        outputIdx += 4 - BitOperations.PopCount(bits);
    }

    // remainder, do that in a scalar fashion
    for (; i < len; i++)
    {
        ref var cur = ref Unsafe.Add(ref output, i);
        if (cur < 0)
            continue;
        Unsafe.Add(ref output, outputIdx++) = cur;
    }

    return outputIdx;
}
```

We start with the usual check (if you'll recall, that ensures that the JIT knows to elide some range checks), then we load the PermuteTable. For now, just assume that this is magic (and it is). The first interesting thing happens when we start iterating over the loop. Unlike before, now we do that in chunks of 4 int64 elements at a time. Inside the loop, we start by loading a vector of int64 and then we do the first odd thing. We call ExtractMostSignificantBits(), since the sign bit is used to mark whether a number is negative or not.
That means that I can use a single instruction to get an integer with the bits set for all the negative numbers. That is particularly juicy for what we need, since there is no need for comparisons, etc. If the mask we got is all zeroes, it means that all the numbers we loaded into the vector are positive, so we can write them as-is to the output and move on to the next part.

Things get interesting when that isn't the case. We load a permute value using some shenanigans (we'll touch on that shortly) and call the PermuteVar8x32() method. The idea here is that we pack all the non-negative numbers to the start of the vector, then we write the vector to the output. The key here is that when we do that, we increment the output index only by the number of valid values. The rest of this method just handles the remainder that does not fit into a vector.

The hard part in this implementation was to figure out how to handle the scenario where we loaded some negative numbers. We need a way to filter them, after all. But there is no SIMD instruction that allows us to do so. Luckily, we have the Avx2.PermuteVar8x32() method to help here. To confuse things, we don't actually want to deal with 8x32 values. We want to deal with 4x64 values. There is an Avx2.Permute4x64() method, and it would work quite nicely, with a single caveat: this method assumes that you are going to pass it a constant value. We don't have such a constant; we need to be able to provide it based on whatever the masked bits give us.

So how do we deal with this issue of filtering with SIMD? We need to move all the values we care about to the front of the vector. We have the method to do that, PermuteVar8x32(), and we just need to figure out how to actually make use of it. PermuteVar8x32() accepts an input vector as well as a vector of the permutation you want to make. In this case, we base it on the top (sign) bits of the 4-element vector of int64. As such, there are a total of 16 options available to us. We have to deal with 32-bit values rather than 64-bit ones, but that isn't much of a problem. Here is the permutation table that we'll be using:

```csharp
public static readonly int[] PermuteTableInts = new int[]
{
    0,1,2,3,4,5,6,7, // 0000
    2,3,4,5,6,7,0,0, // 0001
    0,1,4,5,6,7,0,0, // 0010
    4,5,6,7,0,0,0,0, // 0011
    0,1,2,3,6,7,0,0, // 0100
    2,3,6,7,0,0,0,0, // 0101
    0,1,6,7,0,0,0,0, // 0110
    6,7,0,0,0,0,0,0, // 0111
    0,1,2,3,4,5,0,0, // 1000
    2,3,4,5,0,0,0,0, // 1001
    0,1,4,5,0,0,0,0, // 1010
    4,5,0,0,0,0,0,0, // 1011
    0,1,2,3,0,0,0,0, // 1100
    2,3,0,0,0,0,0,0, // 1101
    0,1,0,0,0,0,0,0, // 1110
    0,0,0,0,0,0,0,0, // 1111
};
```

What you can see here is that when we have a 1 in the bits (shown in the comments), we won't copy that element to the vector. Let's take a look at the entry for 0101, which may be caused by values such as [-1, 2, -3, 4]. Looking at the entry at index #5 in the table: 2,3,6,7,0,0,0,0. What does this mean? It means that we want to take the 2nd int64 element in the source vector and move it to the first element of the destination vector, take the 4th element from the source as the second element in the destination, and discard the rest (marked as 0,0,0,0 in the table).
This is a bit hard to follow because we have to compose the value out of the individual 32-bit words, but it works quite well. Or, at least, it would work, but not as efficiently. This is because we would need to load the PermuteTableInts into a variable and access it, and there are better ways to deal with that. We can ask the JIT to embed the value directly. The problem is that the pattern the JIT recognizes is limited to ReadOnlySpan<byte>, which means that the already non-trivial int32 table got turned into this:

```csharp
// same as the one above, just in bytes for JIT optimizations
public static ReadOnlySpan<byte> PermuteTable => new byte[]
{
    0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,4,0,0,0,5,0,0,0,6,0,0,0,7,0,0,0,
    2,0,0,0,3,0,0,0,4,0,0,0,5,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,1,0,0,0,4,0,0,0,5,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,
    4,0,0,0,5,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,
    2,0,0,0,3,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,1,0,0,0,6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    6,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,4,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,
    2,0,0,0,3,0,0,0,4,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,1,0,0,0,4,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    4,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    2,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
};
```

This is the exact same data as before, but using ReadOnlySpan<byte> means that the JIT can package it inside the data section and treat it as a constant value. The code was heavily optimized, to the point where I noticed a JIT bug where these two versions of the code give different assembly output:

```csharp
int A(int i)
{
    return i * 4 * 8;
}

int B(int i)
{
    return i * (4 * 8);
}
```

Here is what we get out:

```asm
Program.<<Main>$>g__A|0_0(Int32)
    L0000: mov eax, ecx
    L0002: shl eax, 2
    L0005: shl eax, 3
    L0008: ret

Program.<<Main>$>g__B|0_1(Int32)
    L0000: mov eax, ecx
    L0002: shl eax, 5
    L0005: ret
```

This looks like an unintended consequence of Roslyn and the JIT each doing their (separate) jobs, but not reaching the end goal. Constant folding looks like it is done mostly by Roslyn, but it does a scan like that from the left, so it wouldn't convert i * 4 * 8 to i * 32.
That is because it stopped evaluating the constants when it found a variable. When we add parentheses, we isolate the value, and the compiler now understands that it can fold it. Speaking of assembly, here is the annotated assembly version of the code:

```asm
; Filter.FilterCmp_Avx(System.Span`1<Int64>)
       push rsi                              ; push rsi to stack - we'll be using a lot of state, so need to preserve it
       vzeroupper                            ; clear AVX state, since we don't know how the caller left it
       mov rdx,[rcx]                         ; rdx is now the pointer of the items' span
       mov ecx,[rcx+8]                       ; ecx is now the length of the span
       test ecx,ecx                          ; if items.Length > 0
       jg short M02_L00                      ; jump ahead to do work
       xor eax,eax                           ; return 0
       vzeroupper                            ; not actually needed, I think
       pop rsi
       ret
M02_L00:
       mov rax,1CA3A753F78                   ; load permuteStart
       xor r8d,r8d                           ; outputIdx = 0
       xor r9d,r9d                           ; i = 0
       cmp ecx,4                             ; if items.Length < 4
       jl short M02_L04                      ; skip the AVX loop and go to the scalar one
M02_L01:
       movsxd r10,r9d                        ; r10 = current i value
       vmovdqu ymm0,ymmword ptr [rdx+r10*8]  ; load vector register from output[i] (4 elements)
       vmovmskpd r10d,ymm0                   ; move the top bits from 4 int64 elements to r10d (bits)
       test r10d,r10d                        ; if bits != 0
       jne short M02_L02                     ; jump to the complex piece of the code
       movsxd r10,r8d
       vmovdqu ymmword ptr [rdx+r10*8],ymm0  ; store vector to output[outputIdx]
       add r8d,4                             ; outputIdx += 4
       jmp short M02_L03                     ; restart loop
M02_L02:
       movsxd r11,r8d                        ; r11 = current outputIdx value
       mov esi,r10d                          ; esi = bits
       shl esi,5                             ; bits * 4 * 8 - optimized
       movsxd rsi,esi
       vmovdqu ymm1,ymmword ptr [rax+rsi]    ; load vector from permutation table
       vpermd ymm0,ymm1,ymm0                 ; permute value
       vmovdqu ymmword ptr [rdx+r11*8],ymm0  ; write to output
       popcnt r10d,r10d                      ; count the bits
       neg r10d                              ; negate the value
       lea r8d,[r10+r8+4]                    ; add to outputIdx + 4
M02_L03:
       add r9d,4                             ; for loop handling
       lea r10d,[r9+4]
       cmp r10d,ecx
       jle short M02_L01                     ; jump to start of loop
M02_L04:
       cmp r9d,ecx                           ; if there are no elements remaining
       jge short M02_L07                     ; jump to end
M02_L05:
       movsxd rax,r9d                        ; here is the scalar version, already discussed previously
       lea rax,[rdx+rax*8]
       mov rax,[rax]
       test rax,rax
       jl short M02_L06
       mov r10d,r8d
       lea r8d,[r10+1]
       movsxd r10,r10d
       mov [rdx+r10*8],rax
M02_L06:
       inc r9d
       cmp r9d,ecx
       jl short M02_L05
M02_L07:
       mov eax,r8d
       vzeroupper
       pop rsi
       ret
; Total bytes of code 179
```

And after all of this work, where are we standing?
| Method | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| FilterCmp | 23 | 285.7 ns | 3.84 ns | 3.59 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 23 | 272.6 ns | 3.98 ns | 3.53 ns | 0.95 | 0.01 | 397 B |
| FilterCmp_Unroll_8 | 23 | 261.4 ns | 1.27 ns | 1.18 ns | 0.91 | 0.01 | 672 B |
| FilterCmp_Avx | 23 | 261.6 ns | 1.37 ns | 1.28 ns | 0.92 | 0.01 | 521 B |
| FilterCmp | 1047 | 758.7 ns | 1.51 ns | 1.42 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 1047 | 756.8 ns | 1.83 ns | 1.53 ns | 1.00 | 0.00 | 397 B |
| FilterCmp_Unroll_8 | 1047 | 640.4 ns | 1.94 ns | 1.82 ns | 0.84 | 0.00 | 672 B |
| FilterCmp_Avx | 1047 | 426.0 ns | 1.62 ns | 1.52 ns | 0.56 | 0.00 | 521 B |
| FilterCmp | 1048599 | 502,681.4 ns | 3,732.37 ns | 3,491.26 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 1048599 | 499,472.7 ns | 6,082.44 ns | 5,689.52 ns | 0.99 | 0.01 | 397 B |
| FilterCmp_Unroll_8 | 1048599 | 425,800.3 ns | 352.45 ns | 312.44 ns | 0.85 | 0.01 | 672 B |
| FilterCmp_Avx | 1048599 | 218,075.1 ns | 212.40 ns | 188.29 ns | 0.43 | 0.00 | 521 B |
| FilterCmp | 33554455 | 29,820,978.8 ns | 73,461.68 ns | 61,343.83 ns | 1.00 | 0.00 | 411 B |
| FilterCmp_NoRangeCheck | 33554455 | 29,471,229.2 ns | 73,805.56 ns | 69,037.77 ns | 0.99 | 0.00 | 397 B |
| FilterCmp_Unroll_8 | 33554455 | 29,234,413.8 ns | 67,597.45 ns | 63,230.70 ns | 0.98 | 0.00 | 672 B |
| FilterCmp_Avx | 33554455 | 28,498,115.4 ns | 71,661.94 ns | 67,032.62 ns | 0.96 | 0.00 | 521 B |

So it seems that the idea of using SIMD instructions has a lot of merit. Moving from the original code to the final version, we see that we can complete the same task in up to half the time. I'm not quite sure why we aren't seeing the same sort of performance on the 32M case, but I suspect that this is likely because we far exceed the CPU cache and have to fetch everything from memory, so that is as fast as it can go. If you are interested in learning more, Lemire solves the same problem in SVE (SIMD for ARM) and Paul has a similar approach in Rust. If you can think of further optimizations, I would love to hear your ideas.
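If you want to play with the code above, a tiny driver along these lines should be enough to see it in action (my own sketch; it assumes FilterCmp_Avx and PermuteTable are compiled into a Filter class, as in the assembly listing above):

```csharp
using System;

// Hypothetical driver for the FilterCmp_Avx method shown above.
var items = new long[] { 1, -2, 3, -4, 5, 6, -7, 8, 9 };
int count = Filter.FilterCmp_Avx(items);

// The first `count` elements now hold the non-negative values, in order.
Console.WriteLine(string.Join(", ", items.AsSpan(0, count).ToArray()));
// Expected output: 1, 3, 5, 6, 8, 9
```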