In the code of the simdcomp library there is a 25KLOC file that does evil things to SIMD registers to get bit packing to work. When I looked at the code the first few dozen times, I had a strong desire to run away screaming. Luckily, this isn’t just some pile of complicated code, but a well-thought-out set of functions that are meant to provide optimal code for specific tasks. In particular, the code is specialized for each bit width that you may want to bit pack (0 .. 32). Even better, no one actually sat down to write it out by hand: there is a Python script that generates the code.

The first step was to understand what exactly is going on in the code, and then see how we can translate that to C#. Even just a few years ago, that would have been either an impossible dream or required the use of a native library (with the associated PInvoke overhead). However, .NET today has very robust intrinsics support, and vectors / SIMD instructions are natively supported.

I actually had a tougher challenge, since I want this code to run on both x64 and ARM64 instances. The original code is x64 only, of course. The nice thing about SIMD support in .NET is that most of this can be done in a cross-platform manner, with the JIT deciding which instructions will actually be emitted. There is still a very strong correlation between the vectorized code and the instructions that are emitted, which means that I get both great control over what is going on and the appropriate abstraction. I was actually able to implement the whole thing without dropping to architecture-specific code, which makes me very happy.

Before we get any deeper, let’s take a simple challenge. We want to take an array of 32 integers and pack each one of them, using 4 bits per integer, into an array of 16 bytes. Here is what the code will look like:

```csharp
public void Pack32IntsWith4Bits(int* items, byte* output)
{
    var v = (Vector128<int>*)items;
    var mask = Vector128.Create(0x0f);
    // Mask each vector down to its low 4 bits first, then shift it into place.
    // Masking after the shift would clear the bits that were just moved.
    var mixed = (v[0] & mask)
        | Vector128.ShiftLeft(v[1] & mask, 4)
        | Vector128.ShiftLeft(v[2] & mask, 8)
        | Vector128.ShiftLeft(v[3] & mask, 12)
        | Vector128.ShiftLeft(v[4] & mask, 16)
        | Vector128.ShiftLeft(v[5] & mask, 20)
        | Vector128.ShiftLeft(v[6] & mask, 24)
        | Vector128.ShiftLeft(v[7] & mask, 28);
    mixed.AsByte().Store(output);
}
```

This is a bit dense, but let’s see if I can explain what is going on here. We load a vector (4 items) from the array at offsets 0, 4, 8, 12, 16, 20, 24, and 28. For each one of those, we keep the low 4 bits of every item, shift the values by the required offset, and or all of them together. Note that this means that the first item’s four bits go in the first nibble, but the second item’s bits go in the ninth nibble (the start of the fifth byte), etc. The idea is that we are operating on 4 items at a time, reducing the total number of operations we have to perform.

It may be easier to understand if you picture the result: each 32-bit lane of the output holds eight items. Lane 0 holds items 0, 4, 8, ..., 28, lane 1 holds items 1, 5, 9, ..., 29, and so on.

What is happening here, however, is that we are able to do this transformation in very compact code. That isn’t just an issue of high-level code. Let’s take a look at the assembly instructions generated for this kind of code:

```asm
Pack32IntsWith4Bits(Int32*, Byte*)
    L0000: vzeroupper
    L0003: vmovupd xmm0, [rdx]
    L0007: vpand xmm0, xmm0, [0x1ed3a030110]
    L000f: vmovupd xmm1, [rdx+0x10]
    L0014: vpslld xmm1, xmm1, 4
    L0019: vpand xmm1, xmm1, [0x1ed3a030110]
    L0021: vpor xmm0, xmm0, xmm1
    L0025: vmovupd xmm1, [rdx+0x20]
    L002a: vpslld xmm1, xmm1, 8
    L002f: vpand xmm1, xmm1, [0x1ed3a030110]
    L0037: vpor xmm0, xmm0, xmm1
    L003b: vmovupd xmm1, [rdx+0x30]
    L0040: vpslld xmm1, xmm1, 0xc
    L0045: vpand xmm1, xmm1, [0x1ed3a030110]
    L004d: vpor xmm0, xmm0, xmm1
    L0051: vmovupd xmm1, [rdx+0x40]
    L0056: vpslld xmm1, xmm1, 0x10
    L005b: vpand xmm1, xmm1, [0x1ed3a030110]
    L0063: vpor xmm0, xmm0, xmm1
    L0067: vmovupd xmm1, [rdx+0x50]
    L006c: vpslld xmm1, xmm1, 0x14
    L0071: vpand xmm1, xmm1, [0x1ed3a030110]
    L0079: vpor xmm0, xmm0, xmm1
    L007d: vmovupd xmm1, [rdx+0x60]
    L0082: vpslld xmm1, xmm1, 0x18
    L0087: vpand xmm1, xmm1, [0x1ed3a030110]
    L008f: vpor xmm0, xmm0, xmm1
    L0093: vmovupd xmm1, [rdx+0x70]
    L0098: vpslld xmm1, xmm1, 0x1c
    L009d: vpand xmm1, xmm1, [0x1ed3a030110]
    L00a5: vpor xmm0, xmm0, xmm1
    L00a9: vmovdqu [r8], xmm0
    L00ae: ret
```

I’m going to assume that you aren’t well versed in assembly, so let’s explain what is going on. This code contains zero branches: it does eight reads from memory (one per vector of four integers), masks the elements, shifts them, and ors them together. The relevant instructions are:

vmovupd – read 4 integers to the register
vpand – binary and with a value (masking)
vpslld – shift to the left
vpor – binary or
vmovdqu – write 16 bytes to memory

One thing to watch for: in this listing vpand comes after vpslld, which would mask away the bits that were just shifted into place. The mask has to be applied before the shift, as the C# above does, so expect those two instructions to trade places in each group; the instruction mix is otherwise the same.

There are no branches, nothing to complicate the code at all. This is about as tight as you can get, at the level of machine instructions.

Of particular interest here is the C# code. The entire thing is basically a couple of lines of code, and I could express the whole thing as a single expression in a readable manner.
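
To make the nibble placement described above concrete, here is a minimal scalar sketch in C of the same layout. It is not part of simdcomp or of the C# port; the name `pack32_4bits_scalar` is hypothetical, and the sketch assumes the output bytes follow little-endian lane order, which is what x64 and typical ARM64 targets produce when a 128-bit vector is stored. A slow reference like this is mostly useful for checking a vectorized version against known inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: packs 32 integers, 4 bits each, into 16 bytes
 * using the same lane layout as the vectorized code. */
void pack32_4bits_scalar(const uint32_t *in, uint8_t *out)
{
    assert(in != NULL && out != NULL);

    for (unsigned lane = 0; lane < 4; lane++) {
        uint32_t acc = 0;
        for (unsigned step = 0; step < 8; step++) {
            /* Vector number `step` holds items 4*step .. 4*step+3.
             * Keep the low 4 bits first, then shift them into place. */
            acc |= (in[4 * step + lane] & 0x0Fu) << (4 * step);
        }
        /* Write the 32-bit lane out in little-endian byte order. */
        for (unsigned b = 0; b < 4; b++) {
            out[4 * lane + b] = (uint8_t)(acc >> (8 * b));
        }
    }
}
```

With the inputs 0, 1, 2, ..., 31, item 0 and item 4 share the first output byte, while item 1 lands in the fifth byte, which matches the nibble placement described earlier.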
Let’s look at the C code, to compare what is going on:

```c
static void pack32_4bits(const uint32_t *_in, __m128i *out) {
  const __m128i *in = (const __m128i *)(_in);
  __m128i OutReg, InReg;
  const __m128i mask = _mm_set1_epi32((1U << 4) - 1);
  uint32_t outer;

  InReg = _mm_and_si128(_mm_loadu_si128(in), mask);
  OutReg = InReg;
  InReg = _mm_and_si128(_mm_loadu_si128(in + 1), mask);
  OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 4));
  InReg = _mm_and_si128(_mm_loadu_si128(in + 2), mask);
  OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 8));
  InReg = _mm_and_si128(_mm_loadu_si128(in + 3), mask);
  OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 12));
  InReg = _mm_and_si128(_mm_loadu_si128(in + 4), mask);
  OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 16));
  InReg = _mm_and_si128(_mm_loadu_si128(in + 5), mask);
  OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 20));
  InReg = _mm_and_si128(_mm_loadu_si128(in + 6), mask);
  OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 24));
  InReg = _mm_and_si128(_mm_loadu_si128(in + 7), mask);
  OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 28));
  _mm_storeu_si128(out, OutReg);
}
```

Note that both should generate essentially the same machine code, but being able to rely on operator overloading means that I can now get far more readable code.

From that point, the remaining task was to rewrite the Python script so it would generate C# code, not C.

In the next post I’m going to be talking more about the constraints we have and what we are actually trying to solve with this approach.
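
To give a feel for what gets specialized per bit width, here is a rough scalar sketch in C of the pattern that the 4-bit function unrolls. It is an illustration under stated assumptions, not the simdcomp generator's output: the name `pack_lanes_generic` is hypothetical, it fills only a single 128-bit vector's worth of lanes, and it handles only widths that divide 32 evenly, so no value ever straddles two output words.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical generic form of the unrolled code: packs 4 * (32 / bits)
 * integers into four 32-bit lanes. Assumes `bits` divides 32 evenly. */
void pack_lanes_generic(const uint32_t *in, uint32_t out[4], unsigned bits)
{
    assert(bits >= 1 && bits <= 32 && 32 % bits == 0);

    /* Avoid the undefined shift by 32 when building a full-width mask. */
    const uint32_t mask = (bits == 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    const unsigned steps = 32 / bits; /* input vectors folded into each lane */

    for (unsigned lane = 0; lane < 4; lane++) {
        uint32_t acc = 0;
        for (unsigned step = 0; step < steps; step++) {
            /* Same mask-then-shift step as the specialized 4-bit version. */
            acc |= (in[4 * step + lane] & mask) << (bits * step);
        }
        out[lane] = acc;
    }
}
```

Calling it with `bits` set to 4 reproduces the eight mask, shift, and or steps shown above. A generated, per-width function replaces the loops and the runtime `bits` value with constants, which is what keeps the emitted code free of branches.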