Relatively General .NET

Tracking click on anchors in an HTML page

by Gérald Barré

posted on: June 26, 2023

There are many ways to track when a user clicks on a link in an HTML page. Most of them require loading a JS script and registering the click event for all <a> elements. More recently, the ping attribute was introduced to the <a> element. It allows you to specify a URL to ping when the link…

Using encryption to verify a license key

by Oren Eini

posted on: June 23, 2023

A long while ago I had this project, which allows you to define software licenses that you can distribute. The basic idea is pretty simple: we want to be able to define a key (it needs to be short and copy/pastable) that we'll be able to provide to our code that is running in a separate environment. In particular, we have to deal with the issue of not being connected to a central server. That sounds like a really hard problem, but it turns out there is a pretty simple solution if we use public key cryptography. Typically you'll utilize public key cryptography for encryption, but you can also use it for signing. The idea is that we can sign the key using a private key, then validate it (potentially offline) using the public key. Licensing can be complex, so we are going to punt all of that to someone else. In this post I'm going to just discuss how we can sign a piece of data and then verify it. We'll start by generating the keys; this is an action that you should only do once:

```csharp
using System;
using System.Security.Cryptography;

var key = ECDsa.Create();
var pem = key.ExportPkcs8PrivateKeyPem();
Console.WriteLine(pem);
var publicPem = key.ExportSubjectPublicKeyInfoPem();
Console.WriteLine(publicPem);
```

Here is the output of this code:

```
-----BEGIN PRIVATE KEY-----
MIHuAgEAMBAGByqGSM49AgEGBSuBBAAjBIHWMIHTAgEBBEIBmjSDHRRDkvP25ne4
+bG0mU2L9gedX7nuXHKoSiwMvqQ9E70cyfMwuTH/9Ck7ykll8U4+mqnWV2BWluAk
xaHtfLWhgYkDgYYABAFbfheDkNDoks25uW8EBPgSGZnwM7zzeXc3BbLNbbAEGLjy
SHiC64cHdSqSDjENiLEobrGOfpgRxkmMHJBCMzXmOwBXoDBGrtVz5Y2MPwXb9f8i
ZZEeVR3QlZtv51LboC6JPZFHV8afK3sh1UcUY2bVb4xNPKv75ws3CeX+O6bTBg22
Jw==
-----END PRIVATE KEY-----
-----BEGIN PUBLIC KEY-----
MIGbMBAGByqGSM49AgEGBSuBBAAjA4GGAAQBW34Xg5DQ6JLNublvBAT4EhmZ8DO8
83l3NwWyzW2wBBi48kh4guuHB3Uqkg4xDYixKG6xjn6YEcZJjByQQjM15jsAV6Aw
Rq7Vc+WNjD8F2/X/ImWRHlUd0JWbb+dS26AuiT2RR1fGnyt7IdVHFGNm1W+MTTyr
++cLNwnl/jum0wYNtic=
-----END PUBLIC KEY-----
```

You can now embed the public key in your software, while keeping the private key hidden. With that in place, we can now sign a license. But what is a license? In its most basic form, it is a set of attributes that describe what your license allows. As such, we can use a Dictionary<string,string> to define the capabilities of the license. With that in place, we can write very simple code to generate the license text:

```csharp
string Sign(string privateKeyPEM, Dictionary<string, string> license)
{
    using var sign = ECDsa.Create();
    sign.ImportFromPem(privateKeyPEM);
    var text = JsonConvert.SerializeObject(license, Formatting.None);
    var output = sign.SignData(Encoding.UTF8.GetBytes(text), HashAlgorithmName.SHA256);
    var outputBase64 = Convert.ToBase64String(output);
    license["@Sig"] = outputBase64;
    return JsonConvert.SerializeObject(license, Formatting.Indented);
}
```

And here is the result:

```json
{
  "Name": "John Doe",
  "Location": "Europe",
  "@Sig": "AGA3s9Vf/LI8t5uxP+cRofNOEGaBaqGLfur68kGSu0gYPxPDhDkDcc+fneAhAMSjaFJUKTHUtE+4OXimgwqRkEsVAKeETE5GWPoejyLTITapabHa5KLY7XOcehgm9pGm5IPBgIB2NgfSizLIGRrjjoa9t6ho2jDpmUnMF4HuGJQd2Dej"
}
```

The last part to deal with is verifying that the license is valid. We can do that using the following function:

```csharp
Dictionary<string, string> Verify(string publicKeyPEM, string license)
{
    var dic = JsonConvert.DeserializeObject<Dictionary<string, string>>(license);
    if (dic.Remove("@Sig", out var sigText) == false)
        return null;
    Span<byte> signature = stackalloc byte[512];
    if (Convert.TryFromBase64String(sigText, signature, out var written) == false)
        return null;
    signature = signature[..written];
    using var verify = ECDsa.Create();
    verify.ImportFromPem(publicKeyPEM);
    var textToVerify = JsonConvert.SerializeObject(dic, Formatting.None);
    if (verify.VerifyData(Encoding.UTF8.GetBytes(textToVerify), signature, HashAlgorithmName.SHA256) == false)
        return null;
    return dic;
}
```

If the license options are the same as the ones we signed and the cryptographic signature is a match, we can safely return the license options. Otherwise, we'll return null. As I said, this is pretty simple code all around. It doesn't do much, but it means that you can let the framework carry most of the load. Features such as enabling specific capabilities for a license, expiration, etc. are all possible. Define an ExpirationDate property and have your software react to that, for example. A few words about this implementation, however. I'm relying heavily on the .NET framework, obviously. But beyond just using the cryptographic primitives, we also have to take into account a bunch of other aspects. For example, I'm not bothering to normalize the license. Instead, I rely on the fact that the .NET Dictionary will iterate over keys in insertion order. Note that any change in the actual data will result in a verification failure. This is a pretty simple design, but it ensures that you cannot "crack" the algorithm used to generate the license keys. Of course, users can still always just patch the isValidLicense function instead.
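To make the flow concrete, here is a rough sketch of how the pieces above fit together end to end. The Sign and Verify functions are the ones shown in this post; the license attributes are just illustrative values:

```csharp
// Generate the key pair once (keep the private key out of the shipped software),
// sign a license with it, then verify the license offline using only the public key.
using var key = ECDsa.Create();
var privatePem = key.ExportPkcs8PrivateKeyPem();
var publicPem = key.ExportSubjectPublicKeyInfoPem();

var license = new Dictionary<string, string>
{
    ["Name"] = "John Doe",
    ["Location"] = "Europe",
    ["ExpirationDate"] = "2024-06-23" // example attribute, enforced by your own code
};

var signedLicense = Sign(privatePem, license);       // runs on your side, produces the license text
var verified = Verify(publicPem, signedLicense);     // runs on the customer's side, offline
Console.WriteLine(verified is null ? "Invalid license" : $"Licensed to {verified["Name"]}");
```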

Integer compression

by Oren Eini

posted on: June 21, 2023

After this saga, I wanted to close the series with some numbers about the impact of this algorithm. If you'll recall, I started this whole series discussing variable-sized integers. I was using this list of numbers to compare the values. There are 44,956 items in the list:

Algorithm                 Size (bytes)
Raw                       359,648
Varint                    224,780
Delta + Varint             45,970
Delta + Varint + Gzip       5,273
FastPFor                   24,717

You can see some interesting details about the various options. Delta + Varint + Gzip is a really good setup, if you can assume that the output pattern has a high degree of repetition. In some cases that is possible, but it isn't a generic property. Aside from that, FastPFor is winning hands down in terms of the output size of the data. There is also another important aspect: the size of the output here is too big to fit in a single 8KB buffer, so I'll need to use multiple buffers. That is not something that you can really do with GZip. Here is the cost across multiple buffers. This particular list would fit into 4 pages:

- 14,080 entries at 8,048 bytes
- 14,080 entries at 8,063 bytes
- 15,616 entries at 8,030 bytes
- 1,180 entries at 644 bytes

But the compression ratio is only part of that. Let's talk about the performance aspect. On my machine, you can run the encoding process in 1,115 ticks and decoding in just 462 ticks. To give some context, that means that you can do the encoding at a rate of ~400,000,000 values / second and decoding at a rate of about 1 billion per second. My perf team also asked me to mention that they haven't gotten the chance to look at the code yet, so things are likely to improve. The entire premise of FastPFor inside of RavenDB relies on these fast decoding & encoding times. It makes it super cheap to iterate over a vast amount of numbers, and the B+Tree layout we have means that updating a posting list's page is trivial. Read it, merge, and encode it back. It's fast enough that there is really no other place for meaningful optimizations / complexity.
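For readers who haven't followed the whole series, here is roughly what the "Delta + Varint" row in the table above means. This is a simplified sketch of the general technique, not the code used to produce these numbers:

```csharp
// Encode a sorted list by storing the gap from the previous value, then writing each
// gap as a variable-length integer (7 bits of payload per byte, high bit = "more bytes").
static byte[] DeltaVarintEncode(ReadOnlySpan<long> sortedValues)
{
    var output = new List<byte>();
    long prev = 0;
    foreach (var value in sortedValues)
    {
        ulong delta = (ulong)(value - prev); // small for dense, sorted posting lists
        prev = value;
        while (delta >= 0x80)
        {
            output.Add((byte)(delta | 0x80));
            delta >>= 7;
        }
        output.Add((byte)delta);
    }
    return output.ToArray();
}
```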

Integer compression

by Oren Eini

posted on: June 20, 2023

In the previous post, I discussed FastPFor encoding; now I'm going to focus on how we deal with decoding. Here is the decoder struct. Note that this is a struct for performance reasons: we expect that we'll have a lot of usage here, so saving the allocation ends up paying high dividends. Before we start, I want to pay special attention to the fields on the decoder:

```csharp
public unsafe struct FastPForDecoder
{
    private byte* _input;
    private byte* _metadata;
    private readonly byte* _end;
    private uint* _exceptions;
    private fixed int _exceptionOffsets[32];
    private Vector256<long> _prev;
}
```

Of particular interest is the _exceptionOffsets array. If you aren't familiar with it, this is a fixed-size array on the struct. Here is the constructor for the decoder:

```csharp
public FastPForDecoder(byte* input, int size)
{
    if (size <= sizeof(PForHeader))
        throw new ArgumentOutOfRangeException(nameof(size));
    _input = input;
    _end = input + size;
    ref var header = ref Unsafe.AsRef<PForHeader>(input);
    _metadata = input + header.MetadataOffset;
    _exceptions = null;
    var exceptionsBufferSize = 1024;
    var exceptionBufferOffset = 0;
    _exceptions = (uint*)NativeMemory.Alloc((uint)(exceptionsBufferSize * sizeof(uint)));
    var exception = _input + header.ExceptionsOffset;
    for (int i = 2; i <= 32; i++)
    {
        if ((header.ExceptionsBitmap & (1 << (i - 1))) == 0)
            continue;

        var count = Unsafe.Read<ushort>(exception);
        exception += sizeof(ushort);
        if (count + exceptionBufferOffset > exceptionsBufferSize)
        {
            exceptionsBufferSize *= 2;
            _exceptions = (uint*)NativeMemory.Realloc(_exceptions, (uint)(exceptionsBufferSize * sizeof(uint)));
        }
        BitPacking.UnpackSegmented(exception, count, _exceptions + exceptionBufferOffset, (uint)i);
        _exceptionOffsets[i] = exceptionBufferOffset;
        exceptionBufferOffset += count;
        exception += BitPacking.RequireSizeSegmented(count, i);
    }
    _input += sizeof(PForHeader);
    _prev = Vector256.Create(header.Baseline);
}
```

We are being provided with the encoded buffer and its size. What is actually happening here? We start by allocating memory for the exceptions buffer, then scan the exceptions bitmap and extract the exceptions data to the buffer. We use a single buffer for all of the exceptions, and _exceptionOffsets tells us where the exceptions for each bit width currently are. Finally, we set _prev to be a vector of the baseline. That is the counterpoint to how we dealt with the values during the encoding. Like the encoding process, we have to deal with three states:

- Big deltas (for the next block)
- Variable integers (at the end)
- A packed block of 256 numbers

We'll start with dealing with the first two options:
```csharp
public int Read(long* output, int outputCount)
{
    List<long> bigDeltaOffsets = null;
    var buffer = stackalloc uint[256];
    int read = 0;
    while (_metadata < _end && read < outputCount)
    {
        Debug.Assert(read + 256 <= outputCount, "We assume a minimum of 256 free spaces");
        var numOfBits = *_metadata++;
        switch (numOfBits)
        {
            case FastPForEncoder.BiggerThanMaxMarker:
                (bigDeltaOffsets ??= new List<long>()).Add((long)_metadata);
                // we don't need to worry about block fit, because we are ensured that we have at least
                // 256 items to read into the output here, and these markers are for the next block
                // batch location + 16 bytes high bits of the delta
                _metadata += 17;
                continue;
            case FastPForEncoder.VarIntBatchMarker:
                var countOfVarIntBatch = *_metadata++;
                var prevScalar = _prev.GetElement(3);
                for (int i = 0; i < countOfVarIntBatch; i++)
                {
                    var cur = ReadVarInt(_input, out var offset);
                    _input += offset;
                    cur += prevScalar;
                    output[read++] = cur;
                    prevScalar = cur;
                }
                _prev = Vector256.Create(prevScalar);
                continue;
            case > 32 and < FastPForEncoder.BiggerThanMaxMarker:
                throw new ArgumentOutOfRangeException("Unknown bits amount: " + numOfBits);
        }
```

If we have a big delta, we record that in bigDeltaOffsets. Note that we are playing an annoying game here: I'm actually storing the metadata position in the delta offsets. I can cast it to long and then back to a pointer because I'm ensured that the data is fixed during the operation. For the varint scenario (the last batch in the list), we don't have to do much: we compute the running total of the values as we read them from the input, and that's about it. Things get a lot more interesting when we start dealing with the actual compressed blocks:

```csharp
var numOfExceptions = *_metadata++;
SimdBitPacking<NoTransform>.Unpack256(0, _input, buffer, numOfBits);
_input += numOfBits * Vector256<byte>.Count;
if (numOfExceptions > 0)
{
    var maxNumOfBits = *_metadata++;
    var bitsDiff = maxNumOfBits - numOfBits;
    if (bitsDiff == 1)
    {
        var mask = 1u << numOfBits;
        for (int i = 0; i < numOfExceptions; i++)
        {
            var idx = *_metadata++;
            buffer[idx] |= mask;
        }
    }
    else
    {
        ref var offset = ref _exceptionOffsets[bitsDiff];
        for (int i = 0; i < numOfExceptions; i++)
        {
            var remains = _exceptions[offset++];
            var idx = *_metadata++;
            buffer[idx] |= remains << numOfBits;
        }
    }
}
var expectedBufferIndex = -1;
if (bigDeltaOffsets != null && bigDeltaOffsets.Count > 0)
{
    expectedBufferIndex = *(byte*)bigDeltaOffsets[0];
}
```

Here we simply decode the values to the buffer, and then we need to apply the exceptions. You can see that we saved some space by not storing exceptions of 1 bit (since we know what the value would be). For exceptions of different sizes, see how we consume the exception from _exceptionOffsets for each used exception. I'm using a ref variable here, so the offset++ operation will increment the value in the array. That ensures that we have to keep very little state in the decoding process as it moves forward. Remember that we require that the output buffer for the decoded numbers be at least 256 values, to ensure that we can fit them all in.
That doesn't mean that we have enough space to fit everything, so we may be called multiple times and need to resume from where we left off. Finally, we set the expectedBufferIndex if we have a big delta offset; we'll shortly see what we do with it. Remember that at this point, the buffer we use has int32 deltas in it, not the actual final numbers. In order to get the final values, we need to compute the sum of the deltas. This is handled here:

```csharp
for (int i = 0; i + Vector256<uint>.Count <= 256; i += Vector256<uint>.Count)
{
    var (a, b) = Vector256.Widen(Vector256.Load(buffer + i));
    if (expectedBufferIndex == i)
    {
        a |= GetDeltaHighBits();
    }
    PrefixSumAndStoreToOutput(a, ref _prev);
    if (expectedBufferIndex == i + 4)
    {
        b |= GetDeltaHighBits();
    }
    PrefixSumAndStoreToOutput(b, ref _prev);
}
```

What do we do here? We load a vector of 8 integers (32 bits each) and widen that to two vectors of 4 integers (64 bits) each. We then check whether this is the expected buffer index for big delta offsets; if so, we or the high parts of the number back into the vector. This is handled here:

```csharp
Vector256<ulong> GetDeltaHighBits()
{
    var ptr = (byte*)bigDeltaOffsets![0];
    bigDeltaOffsets.RemoveAt(0);
    var highBitsDelta = Vector128.Load(ptr)
        .AsInt32().ToVector256();
    if (bigDeltaOffsets.Count > 0)
    {
        expectedBufferIndex = *(byte*)bigDeltaOffsets[0];
    }
    else
    {
        expectedBufferIndex = -1;
    }
    // the last 4 elements are known zero, so no need to AND with zero
    highBitsDelta = Vector256.Shuffle(highBitsDelta, Vector256.Create(7, 0, 7, 1, 7, 2, 7, 3));
    return highBitsDelta.AsUInt64();
}
```

There are quite a few things happening here all at once. We first read the current delta offset from the metadata and read 16 bytes into a Vector128. We then extend that vector to 256 bits, which basically appends zeros to the vector. We set the next index to check, and then we shuffle the vector of numbers to arrange them on the high parts of the vector as longs. Let's say that we had the numbers [1,2,3,4] in the metadata. We read them into the Vector128, like so:

Vector128 = [1,2,3,4]

We then extend that to a Vector256, like so:

Vector256 = [1,2,3,4,0,0,0,0]

Finally, we shuffle the values (in this case, index 7 is known to be zero), so we have:

highBitsDelta = [0,1,0,2,0,3,0,4]

We convert that to a vector of longs (with the same bit pattern) and then or it with the values from the packed deltas. We have to do the expectedBufferIndex check twice on each iteration, but we expect this to be rare, so the branch predictor is going to be really good at predicting it. Finally, we have the part where we compute the prefix sum, here:
```csharp
void PrefixSumAndStoreToOutput(Vector256<ulong> curUl, ref Vector256<long> prev)
{
    var cur = curUl.AsInt64();
    // doing prefix sum here: https://en.algorithmica.org/hpc/algorithms/prefix/
    cur += Vector256.Shuffle(cur, Vector256.Create(0, 0, 1, 2)) &
           Vector256.Create(0, -1, -1, -1);
    cur += Vector256.Shuffle(cur, Vector256.Create(0, 0, 0, 1)) &
           Vector256.Create(0, 0, -1, -1);
    cur += prev;
    prev = Vector256.Shuffle(cur, Vector256.Create(3, 3, 3, 3));
    cur.Store(output + read);
    read += Vector256<long>.Count;
}
```

This looks like black magic, but let's break it apart into individual pieces. Let's assume that we started out with [25,27,30,35]; using our approach, we have Baseline: 21 and the deltas: [4,2,3,5]. We start with prev set to the baseline value on all elements in the vector:

prev = [21,21,21,21]

And cur is set to the deltas:

cur = [4,2,3,5]

On line 5, we shuffle and BitwiseAnd the value with a mask. Let's see what we get:

Line 5 – shuffled: [4,4,2,3]
Line 5 – masked: [0,4,2,3]

We add that to cur, giving us:

cur = [4 + 0, 2 + 4, 3 + 2, 5 + 3] = [4,6,5,8]

On line 7, we do another shuffle & mask:

Line 7 – shuffled: [4,4,4,6]
Line 7 – masked: [0,0,4,6]

And adding that to cur gives us:

cur = [4 + 0, 6 + 0, 5 + 4, 8 + 6] = [4,6,9,14]

We now add the prev vector, giving us:

cur = [4 + 21, 6 + 21, 9 + 21, 14 + 21] = [25,27,30,35]

We computed the sum again, hurray! We then set all elements of prev to the last value of cur:

prev = [35,35,35,35]

And now we are ready to compute the next element. And… this is pretty much it, to be honest. We keep reading until we are out of space in the buffer or have consumed all the metadata. We are able to decode the data really fast, and it only took a 10-part (so far) series of blog posts to explain. In the next post, I'm going to discuss what we can see from actually using this with real numbers.
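As a scalar footnote to the vectorized walkthrough above: strip away the SIMD and the whole prefix-sum step is just a running total seeded from the baseline. A sketch for illustration, not the production code:

```csharp
// Rebuild absolute values from a block of deltas: output[i] = baseline + sum of deltas[0..i].
// With baseline = 21 and deltas [4, 2, 3, 5] this produces [25, 27, 30, 35],
// matching the worked example above.
static void DecodeDeltas(long baseline, ReadOnlySpan<long> deltas, Span<long> output)
{
    long running = baseline;
    for (int i = 0; i < deltas.Length; i++)
    {
        running += deltas[i];
        output[i] = running;
    }
}
```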

Validating nested DataAnnotation IOptions recursively with MiniValidation

by Andrew Lock

posted on: June 20, 2023

In this post…

Integer compression

by Oren Eini

posted on: June 19, 2023

In the previous post I outlined the requirements we have for FastPFor in RavenDB. Now I want to dig into the actual implementation. The process starts when we start encoding the items. Each call to Encode() is going to process a different posting list, and it is expected that a single instance will process many posting lists during a single indexing run. Here is how we start the process:

```csharp
public int Encode(long* entries, int count)
{
    AssertIsSorted(entries, count);
    _entries = entries;
    _count = count;
    _offset = 0;
    _metadataPos = 0;
    for (int k = 0; k < _exceptions.Length; k++)
    {
        _exceptions[k]?.Clear();
        _exceptionsStart[k] = 0;
    }
    _metadata.Clear();
    if (_entriesOutputSize < count)
    {
        var newCount = Math.Max(256, BitOperations.RoundUpToPowerOf2((uint)count));
        _entriesOutput = (byte*)NativeMemory.Realloc(_entriesOutput, newCount * sizeof(long));
        _entriesOutputSize = newCount;
    }
```

As you can see, there isn't much here. We initialize the state of _exceptions and _exceptionsStart (I'll discuss these in more detail later in this post) and then we ensure that we have a big enough buffer to encode the data. Note that we retain a pointer to the original entries that we were passed; that will be important later as well. Here is the code of the encoding process:

```csharp
var totalSize = sizeof(PForHeader);
int i = 0;
var entriesAsInt = (uint*)_entriesOutput;
var entriesIn = _entries;
var prev = Vector256.Create(*entriesIn);
var max = Vector256.Create<long>(uint.MaxValue);
for (; i + 256 <= _count; i += 256)
{
    var blockStart = entriesAsInt;
    int j = 0;
    for (; j < 256; j += Vector256<long>.Count)
    {
        var cur = Vector256.Load(entriesIn + i + j);
        var mixed = Vector256.Shuffle(cur, Vector256.Create(0, 0, 1, 2)) & Vector256.Create(0, -1, -1, -1) |
                    Vector256.Shuffle(prev, Vector256.Create(3, 3, 3, 3)) & Vector256.Create(-1, 0, 0, 0);
        prev = cur;

        var delta = cur - mixed;
        if (Vector256.GreaterThanAny(delta, max))
        {
            HandleDeltaGreaterThanMax(j, delta);
        }
        var deltaInts = Vector256.Shuffle(delta.AsUInt32(), Vector256.Create(0u, 2, 4, 6, 0, 0, 0, 0));
        deltaInts.Store(entriesAsInt);
        // we write 8 values, but increment by 4, so we'll overwrite it next op
        entriesAsInt += Vector256<long>.Count;
    }
    totalSize += ProcessBlock(blockStart);
}
```

I create a couple of vector variables, one to hold the first item in the list and the other to hold uint.MaxValue. Then I start processing the buffers in blocks of 256 items. A good resource for using vectors and SIMD in C# can be found here. The classic FastPFor uses vectors of 128 bits (although there are implementations of up to 512 bits). I chose to use Vector256 for this implementation. Today, pretty much any x64 machine is going to have native support for AVX2 (shipped since 2011!). On ARM, the operations are implemented using Vector128, so they are still vectorized there as well.
What this means is that we can process 256 numbers as a block. But what are we doing with them? Look at the inner loop, where you can see some interesting details. We are using shuffle instructions as well as masking to shift the current value and add the previous one. In other words, let's assume that we have the following input array: 1,2,3,4,5,6,7,8 – using longs, so a Vector256 will contain 4 elements. On line 5, we create the prev vector and assign it to have 4 copies of the first item in the array:

prev = [1,1,1,1]

On line 13, we load the 4 elements from the array into cur, giving us:

cur = [1,2,3,4]

On line 14, we shuffle both the prev and the cur values. Here are the intermediate states:

curShuffled = [0,1,2,3] – note that we have 0 at the start because of the mask
prevShuffled = [1,0,0,0] – the last three items are zero because of the mask

We then or those two intermediates together, giving us:

mixed = [1,1,2,3]

We subtract that from cur to get the delta. The idea is that we massively reduce the number of operations that we have to perform to compute the delta of the values. On lines 19..22 we check if any of the deltas is greater than uint.MaxValue; we'll dig into that in a bit. The last part of the loop is where we deal with deltaInts, and we are doing some interesting things there. We are pulling the low bits from all four elements in the vector to the start of the vector. Then we write the deltas to the output entries buffer. However, we write 8 values but increment the position by 4, so we'll actually overwrite the last 4 items in the next loop iteration. The output buffer is sized based on the entries buffer, so we are ensured to have enough space, and the double writes give us better performance. The end result of the inner loop is that we are performing a running delta over the items, converting them from longs to ints. Then we can call the ProcessBlock() method to actually run the core of the FastPFor algorithm:

```csharp
private int ProcessBlock(uint* currentEntries)
{
    var (bestB, maxB, exceptionCount) = FindBestBitWidths(currentEntries);
    _metadata.Add((byte)bestB);
    _metadata.Add((byte)exceptionCount);
    if (exceptionCount > 0)
    {
        _metadata.Add((byte)maxB);
        uint maxVal = 1u << bestB;
        for (int j = 0; j < 256; j++)
        {
            if (currentEntries[j] >= maxVal)
            {
                var exList = _exceptions[maxB - bestB] ??= new();
                exList.Add(currentEntries[j] >>> bestB);
                _metadata.Add((byte)j);
            }
        }
    }
    return bestB * Vector256<byte>.Count;
}
```

I already explained how this works in this post. It is important to understand that we have multiple data channels that we operate on here:

- The entries output, where we write the packed bits
- The metadata, where we describe each block
- The exceptions, where we store the exception data

The encoding process runs through all of them, but it isn't actually going to be writing them anywhere; that happens later. We process the entries in the array in blocks of 256 numbers each, but what happens when we get to the end? The last block size may not be a multiple of 256, after all. For the last items, we use simple delta encoding plus variable-sized integers. Nothing really fancy is needed at this point, since that is very small.
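To restate the shuffle-and-mask delta trick above without the SIMD machinery, this is the transformation the inner loop performs, expressed as a plain scalar sketch (illustration only; the real code stays vectorized and also narrows the deltas to 32 bits in the same pass):

```csharp
// Replace each value with the difference from its predecessor; the first value of a
// block is diffed against the last value of the previous block (the running baseline).
static void DeltaEncodeBlock(Span<long> block, ref long previous)
{
    for (int i = 0; i < block.Length; i++)
    {
        var current = block[i];
        block[i] = current - previous;
        previous = current;
    }
}
```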
Finally, we run over all the exceptions data that we created during the encoding process and compute their size. The final result is the total size we would need to store all the encoded data as a single buffer. What about handling numbers whose delta is greater than uint.MaxValue?

```csharp
void HandleDeltaGreaterThanMax(int batchOffset, Vector256<long> delta)
{
    // we expect this to be *rare*, so not trying to optimize this in any special way
    Span<uint> deltaHighBuffer = stackalloc uint[8];
    var highAndLow = Vector256.Shuffle(delta.AsUInt32(), Vector256.Create(1u, 3, 5, 7, 0, 2, 4, 6));
    highAndLow.StoreUnsafe(ref deltaHighBuffer[0]);
    _metadata.Add(BiggerThanMaxMarker);
    _metadata.Add((byte)batchOffset);
    Span<byte> asBytes = MemoryMarshal.AsBytes(deltaHighBuffer);
    for (int j = 0; j < 16; j++)
    {
        _metadata.Add(asBytes[j]);
    }
}
```

You can see that we expect this to be rare, so we handle it to be on the safe side, but we aren't trying to be too optimal about it. We simply take the high portion of the current delta and write it to the metadata. That is wasteful in terms of space, but it is simple and doesn't require complex code. We'll see how that is used during the decoding process in the next post. Finally, it is now time to actually write things out to the buffer. We have to be careful here, since we need to support partial writes. Let's see how that works:

```csharp
public (int Count, int SizeUsed) Write(byte* output, int outputSize)
{
    Debug.Assert(outputSize <= ushort.MaxValue, "Output buffer too large, we use ushort for offsets and don't want to overflow");
    var sizeUsed = sizeof(PForHeader);
    if (sizeUsed > outputSize)
        return default;
    ref var header = ref Unsafe.AsRef<PForHeader>(output);
    var baseline = _entries[Math.Max(0, _offset - 1)];
    header.Baseline = baseline;
    var exceptionsCounts = stackalloc int[33];
    var startingMetadataPosition = _metadataPos;
    var exceptionsRequiredSize = 0;
    var entriesOutput = (uint*)_entriesOutput;
    var oldOffset = _offset;
```

We define a header for the encoded data. The header contains the initial value of the encoded data, the exceptions bitmap (discussed here) and the exception and metadata offsets. Because we always limit the size of the buffer to a maximum of 8KB, I used ushort for those offsets. The first value we set in the header is the baseline, and here we need to explain a bit. When we encode the data, we start from the very first item in the list and go forward. The problem with doing things in this manner is that when we need to split the encoded data, we need to use the baseline that is the previous value we encoded with. For that reason, we use either the first item in the entries array or the item from the end of the previous block that we consumed. Next, we need to deal with the exceptions. As a reminder, we have a set of exceptions that are shared across blocks.
When we do a partial write, we need to keep track of how many exceptions we consumed from each of the exceptions arrays. The most interesting part comes when we start iterating over the metadata values, which tell us how to deal with the data:

```csharp
while (_metadataPos < _metadata.Count)
{
    var batchMetadataStart = _metadataPos;
    var numOfBits = _metadata[_metadataPos++];
    switch (numOfBits)
    {
        case BiggerThanMaxMarker:
            // nothing to do, doesn't impact writes
            _metadataPos += 17; // batch offset + 16 bytes
            continue;
        case VarIntBatchMarker:
            var sizeOfVarIntBatch = _metadata[_metadataPos++];
            var varintSize = WriteLastBatchAsVarIntDelta(sizeOfVarIntBatch, output + sizeUsed, outputSize - sizeUsed);
            var expectedMetadataSize = _metadataPos - startingMetadataPosition;
            if (varintSize == 0 || // couldn't fit the var int buffer
                sizeUsed + varintSize + exceptionsRequiredSize + expectedMetadataSize > outputSize) // wouldn't be able to fit the exceptions & metadata
            {
                _metadataPos = batchMetadataStart;
                goto AfterLoop;
            }
            _offset += sizeOfVarIntBatch;
            sizeUsed += varintSize;
            continue;
        case > 32 and < BiggerThanMaxMarker:
            throw new ArgumentOutOfRangeException("Invalid bits value: " + numOfBits);
    }
```

Here we start with the exceptional cases: a metadata value that marks a delta greater than the max, and the varint batch marker (which occurs at the end). You can see that there isn't much there, except that we verify that we have enough space for the written buffers. For fun, I'm also using a goto here; it is needed to break out of the loop, since I cannot use a break statement inside a switch case. The more interesting aspect happens when we process full blocks of compressed 256 numbers:
```csharp
var numOfExceptions = _metadata[_metadataPos++];
var reqSize = numOfBits * Vector256<byte>.Count;
ref var exceptionCountRef = ref exceptionsCounts[0];
var amountToAddToException = 0;
if (numOfExceptions > 0)
{
    var maxNumOfBits = _metadata[_metadataPos++];
    var exceptionIndex = maxNumOfBits - numOfBits;
    if (exceptionIndex > 1)
    {
        var oldCount = exceptionsCounts[exceptionIndex];
        amountToAddToException = numOfExceptions;
        exceptionCountRef = ref exceptionsCounts[exceptionIndex];
        if (oldCount == 0)
        {
            exceptionsRequiredSize += sizeof(ushort); // size for the number of items here
        }
        exceptionsRequiredSize -= BitPacking.RequireSizeSegmented(oldCount, exceptionIndex);
        exceptionsRequiredSize += BitPacking.RequireSizeSegmented(numOfExceptions + oldCount, exceptionIndex);
    }
}
_metadataPos += numOfExceptions;
var metaSize = _metadataPos - startingMetadataPosition;
var finalSize = (sizeUsed + reqSize + exceptionsRequiredSize + metaSize);
if (finalSize > outputSize)
{
    _metadataPos = batchMetadataStart;
    break;
}
SimdBitPacking<MaskEntries>.Pack256(0, entriesOutput + _offset, output + sizeUsed, numOfBits,
    new MaskEntries((1u << numOfBits) - 1));
sizeUsed += reqSize;
_offset += 256;
exceptionCountRef += amountToAddToException;
```

This looks like a lot of code, but what it is actually doing is mostly checking that we have enough space in the buffer for the block we are about to write. Of particular interest here is that we may have enough space to write the block itself, but we also need to ensure that we have space for the metadata and exceptions data associated with the current and all previous blocks. If we have enough space, we continue to write more blocks to the output, or we exit if we are about to run out of room. After the loop, we have the header and compressed integers data in the output buffer, but we still need to write the metadata and exceptions data. That is handled here:
```csharp
AfterLoop:
uint bitmap = 0;
header.ExceptionsOffset = checked((ushort)sizeUsed);
for (int numOfBits = 2; numOfBits <= 32; numOfBits++)
{
    var count = exceptionsCounts[numOfBits];
    if (count == 0)
        continue;

    bitmap |= 1u << numOfBits - 1;
    Unsafe.Write(output + sizeUsed, (ushort)count);
    sizeUsed += sizeof(ushort);
    var span = CollectionsMarshal.AsSpan(_exceptions[numOfBits]);
    var exceptionStart = _exceptionsStart[numOfBits];
    span = span[exceptionStart..(exceptionStart + count)];
    fixed (uint* b = span)
    {
        int size = BitPacking.PackSegmented(b, span.Length, output + sizeUsed, (uint)numOfBits);
        sizeUsed += size;
    }
    _exceptionsStart[numOfBits] += count;
}
header.ExceptionsBitmap = bitmap;
header.MetadataOffset = checked((ushort)sizeUsed);
var metadataSize = (_metadataPos - startingMetadataPosition);
var metadataSpan = CollectionsMarshal.AsSpan(_metadata);
var metadataBlockRange = metadataSpan[startingMetadataPosition..(startingMetadataPosition + metadataSize)];
metadataBlockRange.CopyTo(new Span<byte>(output + sizeUsed, metadataSize));
sizeUsed += metadataSize;
return (_offset - oldOffset, sizeUsed);
```

We compute the bitmap of all the exceptions, writing them out as packed segments (basically, 256-value blocks of packed bits). Note that we have to account for how many of them we actually write out to the output, since we may have stopped in the middle. Then we can simply write out the metadata buffer and return the total size we took. I find it very amusing that we use the same bit packing routines to store the numbers and the exceptions data as well. Overall, most of the code isn't particularly interesting, but there are a few highlights that I would like to point out:

- We compute the deltas and the narrowing from int64 to int32 on the buffer during the Encode()
- We do the actual bit packing during the Write()

There is an argument to be made here that we could do the bit packing during the Encode() as well, but it is unlikely to matter, I believe. In the next post, I'm going to be diving into how we implement decoding for FastPFor. That is actually of far more interest to me, since it happens a lot more often. It's going to be interesting…
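To recap the Encode()/Write() split from the caller's point of view, here is a rough sketch of how the partial-write API might be driven. The parameterless constructor and the page helpers are assumptions made for the sketch, not RavenDB's actual storage code:

```csharp
// Encode() prepares the deltas once and reports the size needed to store everything;
// Write() then emits as much as fits into each 8KB page until all entries are consumed.
// (This fragment assumes an unsafe context, since the API works with raw pointers.)
var encoder = new FastPForEncoder();
int requiredSize = encoder.Encode(entries, count);

if (requiredSize <= 4 * 1024)
{
    encoder.Write(smallBuffer, 4 * 1024); // small posting list: a single 4KB buffer is enough
}
else
{
    int consumed = 0;
    while (consumed < count)
    {
        byte* page = AllocatePage(8 * 1024);               // hypothetical helper
        var (itemCount, sizeUsed) = encoder.Write(page, 8 * 1024);
        RegisterPage(page, sizeUsed);                      // hypothetical helper
        consumed += itemCount;
    }
}
```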

How to use RuntimeHelpers.IsReferenceOrContainsReferences to micro-optimize collections

by Gérald Barré

posted on: June 19, 2023

RuntimeHelpers.IsReferenceOrContainsReferences<T>() is a method that indicates if T is a reference type or contains a reference type. One use case for this method is to micro-optimize collections when removing items. Indeed, if T is a reference type, you need to remove the reference to the item…
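The idea, sketched out below. This mirrors what the BCL's own collections do when removing items; the SimpleStack type here is just a made-up example:

```csharp
using System.Runtime.CompilerServices;

public class SimpleStack<T>
{
    private T[] _items = new T[16];
    private int _count;

    public void Push(T item) => _items[_count++] = item; // growth handling elided

    public T Pop()
    {
        var item = _items[--_count];
        // Only clear the vacated slot when T is (or contains) a reference type, so the
        // GC can reclaim whatever it pointed to. For pure value types the write is wasted work.
        if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
            _items[_count] = default!;
        return item;
    }
}
```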

Integer compression

by Oren Eini

posted on: June 16, 2023

In this series so far, I explored several ways that we can implement integer compression. I focused on the FastPFor algorithm and dove deeply into how it works. In the last post, I showed how we can use SIMD intrinsics to generate vectorized code that can process a lot of data very quickly. With this baseline established, we need to understand the use case we have in RavenDB for integer compression. We are actually using this for the classic scenario: posting lists as part of the indexing engine. However, there are quite a few differences in how we want to actually utilize it. Let's take Lucene as a good example, since that is one of the classic examples of a search engine. From the ground up, Lucene is meant to be used in batching scenarios. That has a lot of impact on the design. In particular, it means that a posting list for Lucene is something that you write once in a big batch of operations. For RavenDB, however, we explicitly reject this approach. We want to allow on-the-fly indexing, and we cannot afford to pay the cost of merging many small writes to big files over time. The FastPFor algorithm is meant to get a lot of data, encode it in a compressed form and then allow you to iterate over the results as fast as possible. But RavenDB's needs are different. At the most basic level, RavenDB thinks about things in 8KB pages. A posting list for RavenDB is composed of the following:

- Single – just a single value (common when indexing unique values)
- Small – can fit in under 4KB
- Large – requires more than 4KB of data, in which case we split it on 8KB boundaries

What's the point here? The idea is that when we need to update a posting list (add or remove items), we have one of those three options:

- Single – we just update the value, duh!
- Small – FastPFor is fast, right? It is cheap enough to unpack the compressed values, merge with the new items (or the removals) and compress again.
- Large – here we are talking about a lot of values, potentially tens of millions or higher. How do we handle updates in this case?

The answer is that we are going to be making use of a B+ Tree, with some caveats. The basic idea is that we have two types of pages in the tree. A branch page is the classic B+ Tree concept, which points to leaf pages below it that contain some range of values. The leaf pages, on the other hand, are very different. Each leaf page is just an 8KB buffer that holds the compressed data, nothing more. When we want to update the values in the posting list, we find the relevant page that needs to be updated, then we just unpack, merge & compress the values. It's actually cheaper to go this route than any other alternative. There is another important aspect. Lucene (and most other indexing systems) assume that your posting list is using 32 bit integers. That limits the number of items in an index to 2.1 billion (up to 4.2 billion, for some implementations). I don't like that limit, not at today's scale. So we internally use 64 bit integers in our posting lists. That… however, is actually too big. In practice, we use 54 bits for the indexed entries. That gives us a maximum of 18,014,398,509,481,984 items, and at a rate of 100,000 items per second, we are good for 5,712 years. The remaining 10 bits allow us to encode additional information about the actual entry that is being indexed, and reduce the number of external lookups we need to do. Taking those 10 bits from the entry id also means that sticking with 32 bit ids would have hit the limit at around 2 million entries, which is really low.
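As a concrete illustration of the 54 + 10 bit split (the post doesn't spell out the exact bit layout RavenDB uses, so placing the extra bits at the bottom here is an assumption):

```csharp
// Pack a 54-bit entry id together with 10 bits of per-entry information into a single
// long, and pull the two parts back out again.
const int InfoBits = 10;
const long InfoMask = (1 << InfoBits) - 1;

static long Pack(long entryId, int info) => (entryId << InfoBits) | (info & InfoMask);

static (long EntryId, int Info) Unpack(long packed) =>
    (packed >>> InfoBits, (int)(packed & InfoMask));
```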
Then there is another problem: if I have a posting list with 100,000 items in it that I want to compress, I would need to compress it into multiple pages, but figuring out how to split the compressed stream in a way that remains usable to work with is not trivial at all. In short, we have two major concerns with using FastPFor in RavenDB: int64 values and paged writes. There is another aspect that I didn't mention, however. FastPFor will take an array of numbers and compress them. If you want to also use delta compression, that is something you need to add on top of that. Now that we understand the constraints we have to operate around, let's talk about the kind of API that I would like to build. The basic idea is that you'll create an encoder as part of the indexing process. For each posting list that you want to process, you'll call the Encode() method, passing it the data to be encoded. The Encode() method will return the required size for encoding the entire list as a single operation. Note that it does not actually write the data out (indeed, it has nowhere to write it). The calling code is then able to make decisions based on the returned size. If it has a buffer big enough to hold the encoded data, it can pass it to the Write() method. Things get more interesting when we don't have a suitable buffer. Remember that we have a limit of 4KB for small posting lists and 8KB per page for large posting lists. The idea with Write() is that it will write as much as it can to the output, allowing you to use multiple buffers to store the compressed posting list. In terms of work being done, the Encode() phase copies the posting list entries to its own buffer, preparing them to be written out. The process of actually writing the data to the output buffer is mostly concerned with ensuring that we don't exceed the size of the provided buffer. This approach means that we only have to pay the encoding cost once, which is quite nice. Moreover, the number of checks that we have to make to ensure that we can fit the data into the buffer is also greatly reduced. What about decoding? Here we have a very interesting observation: the Encoder is a class, but the Decoder is a struct. We assume that decoding is far more frequent, so we use a struct to avoid an unnecessary memory allocation. In order to ensure high performance, the decoder makes assumptions about the provided buffers. We will always require a buffer that has a minimum size, for example. With all of this out of the way, I'm going to dedicate the next post to discussing the actual implementation and semantics we have for FastPFor encoding.

Integer compression

by Oren Eini

posted on: June 15, 2023

In the code of the simdcomp library there is a 25KLOC file that does evil things to SIMD registers to get bit packing to work. When I looked at the code the first few dozen times, I had a strong desire to run away screaming. Luckily, this isn't just some pile of complicated code, but a well-thought-out set of functions that are meant to provide optimal code for specific tasks. In particular, the code is specialized for each bit width that you may want to bit pack (0 .. 32). Even better, no one actually sat down to write it out by hand; there is a Python script that generates the code. The first step was to understand what exactly is going on in the code, and then see how we can translate that to C#. Even just a few years ago, that would have been either an impossible dream or would have required the use of a native library (with the associated PInvoke overhead). However, .NET today has very robust intrinsics support, and vectors / SIMD instructions are natively supported. I actually had a tougher challenge, since I want this code to run on both x64 and ARM64 instances. The original code is x64 only, of course. The nice thing about SIMD support in .NET is that most of it can be done in a cross-platform manner, with the JIT deciding what instructions will actually be emitted. There is still a very strong correlation between the vectorized code and the instructions that are being emitted, which means that I get both great control over what is going on and the appropriate abstraction. I was actually able to implement the whole thing without dropping to architecture-specific code, which makes me very happy. Before we get any deeper, let's take a simple challenge. We want to take an array of 32 integers and pack each one of them in 4 bits into an array of 16 bytes. Here is what the code will look like:

```csharp
public void Pack32IntsWith4Bits(int* items, byte* output)
{
    var v = (Vector128<int>*)items;
    var mask = Vector128.Create(0x0f);
    var mixed = v[0] & mask |
                Vector128.ShiftLeft(v[1], 4) & mask |
                Vector128.ShiftLeft(v[2], 8) & mask |
                Vector128.ShiftLeft(v[3], 12) & mask |
                Vector128.ShiftLeft(v[4], 16) & mask |
                Vector128.ShiftLeft(v[5], 20) & mask |
                Vector128.ShiftLeft(v[6], 24) & mask |
                Vector128.ShiftLeft(v[7], 28) & mask;
    mixed.AsByte().Store(output);
}
```

This is a bit dense, but let's see if I can explain what is going on here. We load vectors (4 items each) from the array at offsets 0, 4, 8, 12, 16, 20, 24, and 28. For each one of those, we shift the values by the required offset and or all of them together. Note that this means that the first item's four bits go in the first nibble, but the second item's bits go in the fifth nibble, etc. The idea is that we are operating on 4 items at a time, reducing the total number of operations we have to perform. It may be easier to understand if you see those changes visually. What is happening here, however, is that we are able to do this transformation in very compact code. That isn't just an issue of high-level code; let's take a look at the assembly instructions that this generates:
```asm
Pack32IntsWith4Bits(Int32*, Byte*)
    L0000: vzeroupper
    L0003: vmovupd xmm0, [rdx]
    L0007: vpand xmm0, xmm0, [0x1ed3a030110]
    L000f: vmovupd xmm1, [rdx+0x10]
    L0014: vpslld xmm1, xmm1, 4
    L0019: vpand xmm1, xmm1, [0x1ed3a030110]
    L0021: vpor xmm0, xmm0, xmm1
    L0025: vmovupd xmm1, [rdx+0x20]
    L002a: vpslld xmm1, xmm1, 8
    L002f: vpand xmm1, xmm1, [0x1ed3a030110]
    L0037: vpor xmm0, xmm0, xmm1
    L003b: vmovupd xmm1, [rdx+0x30]
    L0040: vpslld xmm1, xmm1, 0xc
    L0045: vpand xmm1, xmm1, [0x1ed3a030110]
    L004d: vpor xmm0, xmm0, xmm1
    L0051: vmovupd xmm1, [rdx+0x40]
    L0056: vpslld xmm1, xmm1, 0x10
    L005b: vpand xmm1, xmm1, [0x1ed3a030110]
    L0063: vpor xmm0, xmm0, xmm1
    L0067: vmovupd xmm1, [rdx+0x50]
    L006c: vpslld xmm1, xmm1, 0x14
    L0071: vpand xmm1, xmm1, [0x1ed3a030110]
    L0079: vpor xmm0, xmm0, xmm1
    L007d: vmovupd xmm1, [rdx+0x60]
    L0082: vpslld xmm1, xmm1, 0x18
    L0087: vpand xmm1, xmm1, [0x1ed3a030110]
    L008f: vpor xmm0, xmm0, xmm1
    L0093: vmovupd xmm1, [rdx+0x70]
    L0098: vpslld xmm1, xmm1, 0x1c
    L009d: vpand xmm1, xmm1, [0x1ed3a030110]
    L00a5: vpor xmm0, xmm0, xmm1
    L00a9: vmovdqu [r8], xmm0
    L00ae: ret
```

I'm going to assume that you aren't well versed in assembly, so let's explain what is going on. This code contains zero branches; it does eight reads from memory, masks the elements, shifts them and ors them together. The relevant instructions are:

- vmovupd – read 4 integers into a register
- vpand – binary and with a value (masking)
- vpslld – shift to the left
- vpor – binary or
- vmovdqu – write 16 bytes to memory

There are no branches, nothing to complicate the code at all. This is about as tight as you can get at the level of machine instructions. Of particular interest here is the C# code. The entire thing is basically a couple of lines of code, and I could express the whole thing as a single expression in a readable manner. Let's look at the C code, to compare what is going on:
```c
static void pack32_4bits(const uint32_t *_in, __m128i *out) {
    const __m128i *in = (const __m128i *)(_in);
    __m128i OutReg, InReg;
    const __m128i mask = _mm_set1_epi32((1U << 4) - 1);
    uint32_t outer;

    InReg = _mm_and_si128(_mm_loadu_si128(in), mask);
    OutReg = InReg;

    InReg = _mm_and_si128(_mm_loadu_si128(in + 1), mask);
    OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 4));

    InReg = _mm_and_si128(_mm_loadu_si128(in + 2), mask);
    OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 8));

    InReg = _mm_and_si128(_mm_loadu_si128(in + 3), mask);
    OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 12));

    InReg = _mm_and_si128(_mm_loadu_si128(in + 4), mask);
    OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 16));

    InReg = _mm_and_si128(_mm_loadu_si128(in + 5), mask);
    OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 20));

    InReg = _mm_and_si128(_mm_loadu_si128(in + 6), mask);
    OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 24));

    InReg = _mm_and_si128(_mm_loadu_si128(in + 7), mask);
    OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 28));

    _mm_storeu_si128(out, OutReg);
}
```

Note that both should generate the same machine code, but being able to rely on operator overloading means that I can now get far more readable code. From that point, the remaining task was to re-write the Python script so it would generate C# code, not C. In the next post I'm going to be talking more about the constraints we have and what we are actually trying to solve with this approach.
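Going in the other direction isn't shown in the post, but for completeness, here is a sketch of unpacking those 16 bytes back into 32 ints, assuming the lane-interleaved layout described above (value j ends up in lane j % 4 at nibble j / 4). The method name is made up:

```csharp
public unsafe void Unpack32IntsWith4Bits(byte* input, int* output)
{
    var mixed = Vector128.Load(input).AsUInt32();
    var mask = Vector128.Create(0x0Fu);
    var v = (Vector128<uint>*)output;
    for (int k = 0; k < 8; k++)
    {
        // pull the k-th nibble of every lane down to the bottom and mask it off
        v[k] = Vector128.ShiftRightLogical(mixed, 4 * k) & mask;
    }
}
```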

Integer compression

by Oren Eini

posted on: June 14, 2023

As I mentioned, I spent quite a lot of time trying to understand the mechanism behind how the FastPFor algorithm works. A large part of that was the fact that I didn't initially distinguish the separate steps in the process. Finding the bit widths, packing the bits, and managing the metadata and exceptions are all integral to how this works, and I tried to grok it all in one shot. I kept getting lost in 14KLOC files with dense SIMD instructions. When looking at the code, it is also important to keep in mind that this is not a library that you are meant to pick up & use. This is a research project. As such, there was enough work done to show the results, but all the spit & polish associated with making a library ready for general consumption wasn't done. A simple example of that is that the code just assumes that the output buffer provided will be large enough (with no way to estimate upfront) and will overflow the buffer if that happens. Perfectly acceptable for research code, hell no for production usage, naturally. In the same manner, the repository contains a lot of code. Most of that is being used as part of comparing the FastPFor algorithm to various other options. That can make it hard to understand which parts are the core of the algorithm and which are effectively chaff. To be honest, the hardest part for me, however, was figuring out the memory allocation patterns that are going on in here. There is heavy usage of C++ containers, with implicit allocations and hidden memory management. Here is a piece of code that I had to struggle with for quite some time, for example:

```cpp
template <class STLContainer>
static uint32_t *packmeupwithoutmasksimd(STLContainer &source, uint32_t *out,
                                         const uint32_t bit) {
  const uint32_t size = static_cast<uint32_t>(source.size());
  *out = size;
  out++;
  if (source.size() == 0)
    return out;
  source.resize((source.size() + 32 - 1) / 32 * 32);
  uint32_t j = 0;
  for (; j + 128 <= size; j += 128) {
    usimdpackwithoutmask(&source[j], reinterpret_cast<__m128i *>(out), bit);
    out += 4 * bit;
  }
  for (; j < size; j += 32) {
    fastpackwithoutmask(&source[j], out, bit);
    out += bit;
  }
  out -= (j - size) * bit / 32;
  source.resize(size);
  return out;
}
```

This does bit packing using SIMD without a mask (the naming convention is something that you really have to get used to; I don't think I had encountered justlowercase before). Note the call to fastpackwithoutmask() in there: it assumes that the provided buffer has exactly 32 elements. A lot of the code in the project assumes that you are passed a buffer of a certain size and operates on that directly. That produces great results in terms of efficiency and code density. But when I read this code it was "obvious" to me that there is out-of-band work going on here. If the provided buffer isn't perfectly aligned on a 32-element boundary, that method will read or write beyond the end of the buffer. Look at line 9 there, where the source container is being resized to ensure that this is fine, since we increase the container size to be exactly 32-element aligned. We revert this operation later in line 20. For some SIMD instructions, the alignment of the buffers we are working on matters.
This is handled using a custom allocator on the containers, which is not usually something that I would pay attention to. In short, from my perspective, there is a lot of hidden behavior in non-trivial C++ code usage there that masks the actual behavior of the algorithm in question. If you are looking at FastPFor (and if you care enough to look at this series of posts, you probably should), take into consideration that this code shouldn’t be applied as is, but should probably be adapted for your needs. In the next post, I’m going to start discussing how we did just that for RavenDB.