Relatively General .NET

Production postmortem

by Oren Eini

posted on: July 03, 2023

We got a support call from a client in the early hours of the morning; they were getting out-of-memory errors from their database and were understandably perturbed by that. They are running on a cloud system, so the admin's first inclination on seeing the problem was to deploy the server on a bigger instance, to at least get things running while they investigate. Doubling and then quadrupling the amount of memory in the system had no impact. A few minutes after the system booted, it would raise an error about running out of memory.

Except that it wasn't actually running out of memory. A scenario like that, where we give more memory to the system and still get out-of-memory errors, can indicate a leak or an unbounded process of some kind. That wasn't the case here. In all system configurations (including the original one), there was plenty of additional memory in the system. Something else was going on. When our support engineer looked at the actual details of the problem, it was quite puzzling. It looked something like this:

System.OutOfMemoryException: ENOMEM on Failed to munmap at Sparrow.Server.Platform.Posix.Syscall.munmap(IntPtr start, UIntPtr length);

That error made absolutely no sense, as you can imagine. We are trying to release memory, not allocate it. Common sense says that you can't really fail when you are freeing memory. After all, how can you run out of memory? I'm trying to give you some, damn it! It turns out that this model is too simplistic. You can actually run out of memory when trying to release it. The issue is that it isn't you that is running out of memory, but the kernel. Here we are talking specifically about the Linux kernel and how it works.

Obviously, a very important aspect of the kernel's job is managing the system memory, and to do that, the kernel itself needs memory. For managing the system memory, the kernel uses something called a VMA (virtual memory area). Each VMA has its own permissions and attributes. In general, you never need to be aware of this detail. However, there are certain pathological cases where you need to set up different permissions and behaviors on a lot of memory areas. In the case we ran into, RavenDB was using an encrypted database. When running on an encrypted database, RavenDB ensures that all plain text data is written to memory that is locked (cannot be stored on disk / swapped out). A side effect of that is that for every piece of memory that we lock, the kernel needs to create its own VMA, since each of them is operated on independently of the others.

The kernel uses VMAs to manage its own map of the memory, and eventually the number of items in the map exceeds the configured limit. In this case, the munmap call released a portion of the memory back, which means that the kernel needs to split the VMA into separate pieces. But the number of items is limited; this is controlled by the vm.max_map_count value. The default is typically 65530, but database systems often require a lot more of those. The default value is conservative, mind you. Adjusting the configuration alleviates this problem, since that gives us sufficient space to operate normally.
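For reference, inspecting and raising that limit on a Linux host is a standard sysctl change. A hedged sketch (262144 is only an illustrative value, not a recommendation from the post; size it for your workload):

sysctl vm.max_map_count                                  # check the current limit
sudo sysctl -w vm.max_map_count=262144                   # raise it for the running system
echo 'vm.max_map_count = 262144' | sudo tee /etc/sysctl.d/99-max-map-count.conf   # persist across reboots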

How to cancel GitHub workflows when pushing new commits on a branch

by Gérald Barré

posted on: July 03, 2023

When you push multiple commits to a branch, you may want to cancel in-progress workflows and run only the latest one. This can be useful to reduce costs if you use paid runners. On free runners, GitHub limits the number of concurrent jobs, so cancelling superseded runs also frees resources for other workflows.
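The mechanism GitHub Actions provides for this is the workflow-level concurrency setting. A minimal example (the group expression is a common convention; adjust it to your branching scheme):

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

With this in place, pushing a new commit to the same branch cancels the in-progress run for that branch before starting the new one.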

RavenDB Docker image updates for the v6.0 release

by Oren Eini

posted on: June 30, 2023

We are going to be making some changes to our RavenDB 6.0 docker image. You can already access them using the nightly builds: docker pull ravendb/ravendb-nightly:6.0-ubuntu-latest The purpose of those changes is to make it easier to run RavenDB in your environment and to make it a more natural fit for a Linux system. The primary reason we made this change is that we wanted to enable running RavenDB containers as non-root users. Most of the other changes are internal, primarily about file paths and how we are actually installing RavenDB on the container instance. We now share a single installation process across all Linux systems, which makes it easier to support and manage. This does have an impact on users' migration from RavenDB 5.4 docker images, but the initialization process should migrate them seamlessly. Note that if you have an existing 5.4 docker instance and you want to update that to 6.0 and run as non-root, you may need to explicitly grant permissions to the RavenDB data folder for the RavenDB user (uid: 999). As usual, I would like to invite you to take the system for a spin. We would really love your feedback.
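To illustrate that last point (my example, not an official migration step): if your 5.4 data lives in a host directory mounted into the container, giving the non-root RavenDB user ownership of the existing files before starting the 6.0 container comes down to something like the following, where the path is a placeholder for wherever your volume actually lives:

sudo chown -R 999:999 /path/to/ravendb/data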

How to access the Roslyn compilation from a VS extension

by Gérald Barré

posted on: June 29, 2023

The following code is tested on Visual Studio 2022 17.7 and uses the Community toolkit for Visual Studio extensions. Don't forget to install the item templates to create a new extension project. After creating a new extension, you need to reference the Microsoft.VisualStudio.LanguageServices package.
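The full article goes further; as a rough sketch of the general shape, getting a Roslyn Compilation from the Visual Studio workspace looks something like the following. This assumes the toolkit's VS.GetMefServiceAsync helper and a placeholder project name, so treat it as an illustration rather than the article's exact code:

using System.Linq;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis;
using Microsoft.VisualStudio.LanguageServices;

private async Task ShowCompilationInfoAsync()
{
    // Resolve the Roslyn workspace that Visual Studio exposes through MEF
    var workspace = await VS.GetMefServiceAsync<VisualStudioWorkspace>();

    // Pick a project from the current solution; "MyProject" is a placeholder name
    var project = workspace.CurrentSolution.Projects.FirstOrDefault(p => p.Name == "MyProject");
    if (project is not null)
    {
        // The Compilation gives access to symbols, syntax trees and diagnostics
        Compilation? compilation = await project.GetCompilationAsync();
    }
}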

Generating sequential numbers in a distributed manner

by Oren Eini

posted on: June 26, 2023

On its face, we have a simple requirement:

Generate sequential numbers
Ensure that there can be no gaps
Do that in a distributed manner

Generating the next number in the sequence is literally as simple as ++, so surely that is a trivial task, right? The problem is with the second requirement. The need to ensure that there are no gaps comes up often when dealing with things like invoices. The tax authorities are really keen on "show me all your invoices", and if there are gaps in the numbers, you may have to provide Good Answers. You may think that the third one, running in a distributed environment, is the tough challenge, but that isn't actually the case. If we are running in a single location, that is fairly easy. Run the invoice id generation as a transaction, and you are done. But the normal methods of doing that are usually wrong in edge cases.

Let's assume that we use an Oracle database, which uses the following mechanism to generate the new invoice id:

invoice_seq.NEXTVAL

Or SQL Server with an identity column:

CREATE TABLE invoices
(
    invoice_id INT IDENTITY(1,1) PRIMARY KEY,
    ...
);

In both cases, we may insert a new value into the invoices table, consuming an invoice id. At some later point in time, we may roll back the transaction. Care to guess what happens then? You have INVOICE #1000 and INVOICE #1002, but nothing in between. In fact, there is usually no way to even tell what happened. In other words, identity, sequence, serial, or autonumber, regardless of what database platform you use, are not suitable for generating gapless numbers.

The reasoning is quite simple. Assume that you have two concurrent transactions, which generate two new invoices at roughly the same time. You commit the later one before the first one, and roll back the first. You now have:

Invoice #1
Invoice #2
…
Invoice #1000
Invoice #1002

However, you don't have Invoice #1001, and you cannot roll back the sequence value there, because if you did, it would re-generate #1002 on the next call. Instead, for gapless numbers, we need to make this a dedicated part of the transaction. So there would be a record in our system that contains the NextInvoiceId, which will be incremented as part of the new invoice creation. In order to ensure that there are no gaps, you need to ensure that the NextInvoiceId record increment is handled as a user operation, not a database operation. In other words, in SQL Server, that is a row in a table that you manually increment as part of adding a new invoice. Here is what this will look like (new_invoice.sql):

CREATE PROCEDURE CreateNewInvoice
    @customer_email VARCHAR(50),
    -- Other invoice parameters...
AS
BEGIN
    DECLARE @next_id INT;

    UPDATE next_gapless_ids
    SET @next_id = invoice_id = invoice_id + 1
    WHERE owner = 'invoices';

    -- Insert the new invoice with the generated ID
    INSERT INTO invoices (invoice_id, customer_email, ...)
    VALUES (@next_id, @customer_email, ...);
END

As you can see, we increment the row directly, so it will be rolled back as well. The downside here is that we can no longer create two invoices concurrently. The second transaction would have to wait on the lock on the row in the next_gapless_ids table. All of that happens inside a single database server.
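For completeness, the stored procedure above assumes a per-sequence counter row. A sketch of such a table (mine, not from the original post, with names mirroring the procedure) could look like this:

CREATE TABLE next_gapless_ids
(
    owner      VARCHAR(50) NOT NULL PRIMARY KEY,
    invoice_id INT         NOT NULL
);

INSERT INTO next_gapless_ids (owner, invoice_id) VALUES ('invoices', 0);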
What happens when we are running in a distributed environment? The answer in this case is exactly the same thing. You need a transaction, a distributed one, using a consensus algorithm. Here is how you can achieve this using RavenDB's cluster-wide transactions, which use the Raft protocol behind the scenes (GaplessRavenDB.cs):

while (true)
{
    using (var session = store.OpenSession(new SessionOptions
    {
        TransactionMode = TransactionMode.ClusterWide
    }))
    {
        var gaplessId = session.Load<GapLessId>("gapless/invoices");
        var nextId = gaplessId.Value++;
        var invoice = new Invoice
        {
            InvoiceId = nextId,
            // other properties
        };
        session.Store(invoice, "invoices/" + nextId);
        try
        {
            session.SaveChanges();
            break;
        }
        catch (ConcurrencyException)
        {
            continue; // re-try
        }
    }
}

The idea is simple: we have a transaction that modifies the gapless ids document and creates a new invoice at the same time. We have to handle a concurrency exception if two transactions try to create a new invoice at the same time (because they both want to use the same invoice id value), but in essence this is pretty much exactly the same behavior as when you are running on a single node. In other words, to ensure the right behavior, you need to use a transaction. And if you need a distributed transaction, that is just a flag away with RavenDB.
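The post doesn't show the document classes involved; a minimal sketch that would satisfy the snippet above (property names taken from it, everything else assumed) is:

public class GapLessId
{
    public long Value { get; set; }
}

public class Invoice
{
    public long InvoiceId { get; set; }
    // other invoice properties
}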

Tracking click on anchors in an HTML page

by Gérald Barré

posted on: June 26, 2023

There are many ways to track when a user clicks on a link in an HTML page. Most of them require loading a JS script and registering the click event for all <a> elements. More recently, the ping attribute was introduced to the <a> element. It allows you to specify a URL to ping when the link is followed.
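For illustration, the markup looks like this (the URLs are placeholders):

<a href="https://example.com/article" ping="https://example.com/track-click">Read the article</a>

When the user follows the link, the browser sends a POST request to each URL listed in the ping attribute, without any JavaScript involved.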

Using encryption to verify a license key

by Oren Eini

posted on: June 23, 2023

A long while ago I had this project, which allows you to define software licenses that you can distribute. The basic idea is pretty simple: we want to be able to define a key (it needs to be short and copy/pastable) that we'll be able to provide to our code that is running in a separate environment. In particular, we have to deal with the issue of not being connected to a central server. That sounds like a really hard problem, but it turns out that there is a pretty simple solution if we use public key cryptography. Typically you'll utilize public key cryptography for encryption, but you can also use it for signing. The idea is that we can sign the key using a private key, then validate it (potentially offline) using the public key. Licensing can be complex, so we are going to punt all of that to someone else. In this post I'm going to just discuss how we can sign a piece of data and then verify it. We'll start by generating the keys; this is an action that you should only do once (GenerateKeys.cs):

using System;
using System.Security.Cryptography;

var key = ECDsa.Create();
var pem = key.ExportPkcs8PrivateKeyPem();
Console.WriteLine(pem);
var publicPem = key.ExportSubjectPublicKeyInfoPem();
Console.WriteLine(publicPem);

Here is the output of this code (keys.pem):

-----BEGIN PRIVATE KEY-----
MIHuAgEAMBAGByqGSM49AgEGBSuBBAAjBIHWMIHTAgEBBEIBmjSDHRRDkvP25ne4
+bG0mU2L9gedX7nuXHKoSiwMvqQ9E70cyfMwuTH/9Ck7ykll8U4+mqnWV2BWluAk
xaHtfLWhgYkDgYYABAFbfheDkNDoks25uW8EBPgSGZnwM7zzeXc3BbLNbbAEGLjy
SHiC64cHdSqSDjENiLEobrGOfpgRxkmMHJBCMzXmOwBXoDBGrtVz5Y2MPwXb9f8i
ZZEeVR3QlZtv51LboC6JPZFHV8afK3sh1UcUY2bVb4xNPKv75ws3CeX+O6bTBg22
Jw==
-----END PRIVATE KEY-----
-----BEGIN PUBLIC KEY-----
MIGbMBAGByqGSM49AgEGBSuBBAAjA4GGAAQBW34Xg5DQ6JLNublvBAT4EhmZ8DO8
83l3NwWyzW2wBBi48kh4guuHB3Uqkg4xDYixKG6xjn6YEcZJjByQQjM15jsAV6Aw
Rq7Vc+WNjD8F2/X/ImWRHlUd0JWbb+dS26AuiT2RR1fGnyt7IdVHFGNm1W+MTTyr
++cLNwnl/jum0wYNtic=
-----END PUBLIC KEY-----

You can now embed the public key in your software, while keeping the private key hidden. With that in place, we can now sign a license. But what is a license? In its most basic form, it is a set of attributes that describe what your license allows. As such, we can use a Dictionary<string,string> to define the capabilities of the license. With that in place, we can write very simple code to generate the license text (Sign.cs):
string Sign(string privateKeyPEM, Dictionary<string, string> license)
{
    using var sign = ECDsa.Create();
    sign.ImportFromPem(privateKeyPEM);
    var text = JsonConvert.SerializeObject(license, Formatting.None);
    var output = sign.SignData(Encoding.UTF8.GetBytes(text), HashAlgorithmName.SHA256);
    var outputBase64 = Convert.ToBase64String(output);
    license["@Sig"] = outputBase64;
    return JsonConvert.SerializeObject(license, Formatting.Indented);
}

And here is the result (signed.json):

{
    "Name": "John Doe",
    "Location": "Europe",
    "@Sig": "AGA3s9Vf/LI8t5uxP+cRofNOEGaBaqGLfur68kGSu0gYPxPDhDkDcc+fneAhAMSjaFJUKTHUtE+4OXimgwqRkEsVAKeETE5GWPoejyLTITapabHa5KLY7XOcehgm9pGm5IPBgIB2NgfSizLIGRrjjoa9t6ho2jDpmUnMF4HuGJQd2Dej"
}

The last part to deal with is verifying that the license is valid. We can do that using the following function (Verify.cs):

Dictionary<string, string> Verify(string publicKeyPEM, string license)
{
    var dic = JsonConvert.DeserializeObject<Dictionary<string, string>>(license);
    if (dic.Remove("@Sig", out var sigText) == false)
        return null;

    Span<byte> signature = stackalloc byte[512];
    if (Convert.TryFromBase64String(sigText, signature, out var bytesWritten) == false)
        return null;
    signature = signature[..bytesWritten];

    using var verify = ECDsa.Create();
    verify.ImportFromPem(publicKeyPEM);
    var textToVerify = JsonConvert.SerializeObject(dic, Formatting.None);
    if (verify.VerifyData(Encoding.UTF8.GetBytes(textToVerify), signature, HashAlgorithmName.SHA256) == false)
        return null;

    return dic;
}

If the license options are the same as the ones we signed and the cryptographic signature is a match, we can safely return the license options. Otherwise, we'll return null. As I said, this is pretty simple code all around. It doesn't do much, but it means that you can let the framework carry most of the load. Features such as enabling specific capabilities for a license, expiration, etc. are all possible. Define an ExpirationDate property and have your software react to that, for example. A few words about this implementation, however. I'm relying heavily on the .NET framework, obviously. But beyond just using the cryptographic primitives, we also have to take into account a bunch of other aspects. For example, I'm not bothering to normalize the license. Instead, I rely on the fact that the .NET Dictionary will iterate over keys in insertion order. Note that any change in the actual data will result in a verification failure. This is a pretty simple design, but it ensures that you cannot "crack" the algorithm used to generate the license keys. Of course, users can still always just patch the isValidLicense function instead.
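To tie the pieces together, usage looks roughly like this (a sketch; privateKeyPem and publicKeyPem stand for the PEM strings produced by the key-generation snippet above):

var license = new Dictionary<string, string>
{
    ["Name"] = "John Doe",
    ["Location"] = "Europe",
};

var signedLicense = Sign(privateKeyPem, license);     // done on your side, with the private key
var verified = Verify(publicKeyPem, signedLicense);   // done in the customer's environment

if (verified == null)
{
    // the license was tampered with (or never signed by us)
}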

Integer compression

by Oren Eini

posted on: June 21, 2023

After this saga, I wanted to close the series with some numbers about the impact of this algorithm. If you'll recall, I started this whole series discussing variable-sized integers. I was using this list of numbers to compare the values. There are 44,956 items in the list.

Raw: 359,648 bytes
Varint: 224,780 bytes
Delta + Varint: 45,970 bytes
Delta + Varint + Gzip: 5,273 bytes
FastPFor: 24,717 bytes

You can see some interesting details about the various options. Delta + Varint + Gzip is a really good setup, if you can assume that the output pattern has a high degree of repetition. In some cases that is possible, but it isn't a generic property. Aside from that, FastPFor is winning hands down in terms of the output size of the data.

There is also another important aspect: the size of the output here is too big to fit in a single 8KB buffer, so I'll need to use multiple buffers for that. This is not something that you can really do with GZip. Here is the cost across multiple buffers; this particular list would fit into 4 pages:

14,080 entries at 8,048 bytes
14,080 entries at 8,063 bytes
15,616 entries at 8,030 bytes
 1,180 entries at   644 bytes

But the compression ratio is only part of that. Let's talk about the performance aspect. On my machine, you can run the encoding process in 1,115 ticks and decoding in just 462 ticks. To give some context, that means that you can do the encoding at a rate of ~400,000,000 values / second and decoding at a rate of about 1 billion per second. My perf team also asked me to mention that they haven't gotten the chance to look at the code yet, so things are likely to improve.

The entire premise of FastPFor inside of RavenDB relies on these fast decoding & encoding times. It makes it super cheap to iterate over a vast amount of numbers, and the B+Tree layout we have means that updating a posting list's page is trivial. Read it, merge, and encode it back. It's fast enough that there is really no other place for meaningful optimizations / complexity.

Integer compression

by Oren Eini

posted on: June 20, 2023

In the previous post, I discussed FastPFor encoding; now I'm going to focus on how we deal with decoding. Here is the decoder struct. Note that this is a struct for performance reasons. We expect that we'll have a lot of usage here, so saving the allocation ends up paying high dividends. Before we start, I want to pay special attention to the fields on the decoder (FastPForDecoder.fields.cs):

public unsafe struct FastPForDecoder
{
    private byte* _input;
    private byte* _metadata;
    private readonly byte* _end;
    private uint* _exceptions;
    private fixed int _exceptionOffsets[32];
    private Vector256<long> _prev;
}

Of particular interest is the _exceptionOffsets array. If you aren't familiar with it, this is a fixed-size array on the struct. Here is the constructor for the decoder (FastPForDecoder.ctor.cs):

public FastPForDecoder(byte* input, int size)
{
    if (size <= sizeof(PForHeader))
        throw new ArgumentOutOfRangeException(nameof(size));
    _input = input;
    _end = input + size;
    ref var header = ref Unsafe.AsRef<PForHeader>(input);
    _metadata = input + header.MetadataOffset;
    _exceptions = null;
    var exceptionsBufferSize = 1024;
    var exceptionBufferOffset = 0;

    _exceptions = (uint*)NativeMemory.Alloc((uint)(exceptionsBufferSize * sizeof(uint)));

    var exception = _input + header.ExceptionsOffset;
    for (int i = 2; i <= 32; i++)
    {
        if ((header.ExceptionsBitmap & (1 << (i - 1))) == 0)
            continue;

        var count = Unsafe.Read<ushort>(exception);
        exception += sizeof(ushort);

        if (count + exceptionBufferOffset > exceptionsBufferSize)
        {
            exceptionsBufferSize *= 2;
            _exceptions = (uint*)NativeMemory.Realloc(_exceptions, (uint)(exceptionsBufferSize * sizeof(uint)));
        }

        BitPacking.UnpackSegmented(exception, count, _exceptions + exceptionBufferOffset, (uint)i);
        _exceptionOffsets[i] = exceptionBufferOffset;
        exceptionBufferOffset += count;

        exception += BitPacking.RequireSizeSegmented(count, i);
    }

    _input += sizeof(PForHeader);
    _prev = Vector256.Create(header.Baseline);
}

We are being provided with the encoded buffer and its size. What is actually happening here? We start by allocating memory for the exceptions buffer, then scan the exceptions bitmap and extract the exceptions data to the buffer. We use a single buffer for all of the exceptions, and the _exceptionOffsets array indicates where the exceptions of each bit width currently start. Finally, we set _prev to be a vector of the baseline. That is the counterpoint to how we dealt with the values during the encoding. Like the encoding process, we have to deal with three states:

Big deltas (for the next block)
Variable integer (at the end)
Packed block of 256 numbers

We'll start by dealing with the first two options (Read_Start.cs):
public int Read(long* output, int outputCount)
{
    List<long> bigDeltaOffsets = null;
    var buffer = stackalloc uint[256];
    int read = 0;
    while (_metadata < _end && read < outputCount)
    {
        Debug.Assert(read + 256 <= outputCount, "We assume a minimum of 256 free spaces");
        var numOfBits = *_metadata++;
        switch (numOfBits)
        {
            case FastPForEncoder.BiggerThanMaxMarker:
                (bigDeltaOffsets ??= new List<long>()).Add((long)_metadata);
                // we don't need to worry about block fit, because we are ensured that we have at least
                // 256 items to read into the output here, and these markers are for the next block
                // batch location + 16 bytes high bits of the delta
                _metadata += 17;
                continue;
            case FastPForEncoder.VarIntBatchMarker:
                var countOfVarIntBatch = *_metadata++;
                var prevScalar = _prev.GetElement(3);
                for (int i = 0; i < countOfVarIntBatch; i++)
                {
                    var cur = ReadVarInt(_input, out var offset);
                    _input += offset;
                    cur += prevScalar;
                    output[read++] = cur;
                    prevScalar = cur;
                }
                _prev = Vector256.Create(prevScalar);
                continue;
            case > 32 and < FastPForEncoder.BiggerThanMaxMarker:
                throw new ArgumentOutOfRangeException("Unknown bits amount: " + numOfBits);
        }

If we have a big delta, we record that in bigDeltaOffsets. Note that we are playing an annoying game here: I'm actually storing the metadata position in the delta offsets. I can cast it to long and then to a pointer because I'm ensured that the data is fixed during the operation. For the varint scenario, the last item in the batch, we don't have to do much; we compute the running total of the values as we read them from the input, and that's about it. Things get a lot more interesting when we start dealing with the actual compressed blocks (Read_Decode.cs):

var numOfExceptions = *_metadata++;

SimdBitPacking<NoTransform>.Unpack256(0, _input, buffer, numOfBits);
_input += numOfBits * Vector256<byte>.Count;

if (numOfExceptions > 0)
{
    var maxNumOfBits = *_metadata++;
    var bitsDiff = maxNumOfBits - numOfBits;
    if (bitsDiff == 1)
    {
        var mask = 1u << numOfBits;
        for (int i = 0; i < numOfExceptions; i++)
        {
            var idx = *_metadata++;
            buffer[idx] |= mask;
        }
    }
    else
    {
        ref var offset = ref _exceptionOffsets[bitsDiff];
        for (int i = 0; i < numOfExceptions; i++)
        {
            var remains = _exceptions[offset++];
            var idx = *_metadata++;
            buffer[idx] |= remains << numOfBits;
        }
    }
}

var expectedBufferIndex = -1;
if (bigDeltaOffsets != null && bigDeltaOffsets.Count > 0)
{
    expectedBufferIndex = *(byte*)bigDeltaOffsets[0];
}

Here we simply decode the values to the buffer, then apply the exceptions. You can see that we saved some space by not storing exceptions of 1 bit (since we know what the value would be). For exceptions of different sizes, see how we consume the exception from _exceptionOffsets for each used exception. I'm using a ref variable here, so the offset++ operation increments the value in the array. That ensures that we have to keep very little state in the decoding process as it moves forward. Remember that we require that the output buffer for the decoded numbers be at least 256 values, to ensure that we can fit them all in.
That doesn't mean that we have enough space to fit everything, so we may be called multiple times and need to resume from where we left off. Finally, we set the expectedBufferIndex if we have a big delta offset. We'll shortly see what we do with this. Remember that at this point, the buffer we use has int32 deltas in it, not the actual final numbers. In order to get the final values, we need to compute the sum of the deltas; this is handled here (Read_Sum.cs):

for (int i = 0; i + Vector256<uint>.Count <= 256; i += Vector256<uint>.Count)
{
    var (a, b) = Vector256.Widen(Vector256.Load(buffer + i));
    if (expectedBufferIndex == i)
    {
        a |= GetDeltaHighBits();
    }
    PrefixSumAndStoreToOutput(a, ref _prev);
    if (expectedBufferIndex == i + 4)
    {
        b |= GetDeltaHighBits();
    }
    PrefixSumAndStoreToOutput(b, ref _prev);
}

What do we do here? We load a vector of 8 integers (32 bits each) and widen that to two vectors of 4 integers (64 bits each). We then check whether this is the expected buffer index for big delta offsets; if so, we or the high parts of the number back into the vector. This is handled here (GetDeltaHighBits.cs):

Vector256<ulong> GetDeltaHighBits()
{
    var ptr = (byte*)bigDeltaOffsets![0];
    bigDeltaOffsets.RemoveAt(0);
    var highBitsDelta = Vector128.Load(ptr)
        .AsInt32().ToVector256();
    if (bigDeltaOffsets.Count > 0)
    {
        expectedBufferIndex = *(byte*)bigDeltaOffsets[0];
    }
    else
    {
        expectedBufferIndex = -1;
    }
    // the last 4 elements are known zero, so no need to AND with zero
    highBitsDelta = Vector256.Shuffle(highBitsDelta, Vector256.Create(7, 0, 7, 1, 7, 2, 7, 3));
    return highBitsDelta.AsUInt64();
}

There are quite a few things happening here all at once. We first read the current delta offset from the metadata and read 16 bytes into a Vector128. We then extend that vector to 256 bits, which pads it with zeros. We set the next index to check, and then we shuffle the vector so that the numbers land in the high 32 bits of each long lane. Let's say that we had the numbers [1,2,3,4] in the metadata; we read them into the Vector128, like so:

Vector128 = [1,2,3,4]

We then extend that to a Vector256, like so:

Vector256 = [1,2,3,4,0,0,0,0]

Finally, we shuffle the values (in this case, index 7 is known to be zero), so we have:

highBitsDelta = [0,1,0,2,0,3,0,4]

We reinterpret that as a vector of longs (with the same bit pattern) and then or it with the values from the packed deltas. We have to check expectedBufferIndex twice on each iteration, but we expect big deltas to be rare, so the branch predictor is going to be really good at predicting this. Finally, we have the part where we compute the prefix sum, here (Read_PrefixSum.cs):
void PrefixSumAndStoreToOutput(Vector256<ulong> curUl, ref Vector256<long> prev)
{
    var cur = curUl.AsInt64();
    // doing prefix sum here: https://en.algorithmica.org/hpc/algorithms/prefix/
    cur += Vector256.Shuffle(cur, Vector256.Create(0, 0, 1, 2)) &
           Vector256.Create(0, -1, -1, -1);
    cur += Vector256.Shuffle(cur, Vector256.Create(0, 0, 0, 1)) &
           Vector256.Create(0, 0, -1, -1);
    cur += prev;
    prev = Vector256.Shuffle(cur, Vector256.Create(3, 3, 3, 3));
    cur.Store(output + read);
    read += Vector256<long>.Count;
}

This looks like black magic, but let's break it apart into individual pieces. Let's assume that we started out with [25, 27, 30, 35]; using our approach, we have Baseline: 21 and the deltas: [4, 2, 3, 5]. We start with prev set to the baseline value in all elements of the vector:

prev = [21, 21, 21, 21]

And cur is set to the deltas:

cur = [4, 2, 3, 5]

In the first shuffle-and-mask step, let's see what we get:

shuffled: [4, 4, 2, 3]
masked:   [0, 4, 2, 3]

We add that to cur, giving us:

cur = [4 + 0, 2 + 4, 3 + 2, 5 + 3] = [4, 6, 5, 8]

In the second shuffle-and-mask step, we do it again:

shuffled: [4, 4, 4, 6]
masked:   [0, 0, 4, 6]

And adding that to cur gives us:

cur = [4 + 0, 6 + 0, 5 + 4, 8 + 6] = [4, 6, 9, 14]

We now add the prev vector, giving us:

cur = [4 + 21, 6 + 21, 9 + 21, 14 + 21] = [25, 27, 30, 35]

We computed the running sum again, hurray! We then set all elements of prev to the last value of cur:

prev = [35, 35, 35, 35]

And now we are ready for the next elements to compute. And… this is pretty much it, to be honest. We keep reading until we are out of space in the buffer or have consumed all the metadata. We are able to decode the data really fast, and it only took a 10-part (so far) series of blog posts to explain. In the next post, I'm going to discuss what we can see from actually using this with real numbers.
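If the vector shuffles are hard to follow, the scalar equivalent of what this routine computes is just a running sum of deltas on top of the baseline. A throwaway sketch (mine, not RavenDB code) that reproduces the worked example above:

long[] DecodeDeltas(long baseline, int[] deltas)
{
    // prefix sum: each output value is the previous value plus the current delta
    var output = new long[deltas.Length];
    long prev = baseline;
    for (int i = 0; i < deltas.Length; i++)
    {
        prev += deltas[i];
        output[i] = prev;
    }
    return output;
}

// DecodeDeltas(21, new[] { 4, 2, 3, 5 }) => [25, 27, 30, 35]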