Page 63 • Relatively General .NET

In the last post, I talked about how the simdcomp library is actually just doing bit-packing. Given a set of numbers, it will put the first N bits in a packed format. That is all. On its own, that isn’t very useful, we have to pair it with something else to make it do something more interesting. Previously, I talked about delta compression, what if we put both of those together? Given a sorted list of numbers, we’ll compute the delta between the numbers, then we can figure out how many bits we need to store them, and we can pack that tightly. Analyzing the sample posting list I’m using, we get the following batches (of 128 numbers each) 328 batches at 13 bits 11 batches at 22 bits 7 batches at 23 bits 3 batches as 27 bits 1 batch at 31 bits 1 batch at 28 bits Doing the math, we can fit all of those using 77Kb or so. The idea is that for each batch, we can use simdpack() to write just the relevant bits. We’ll need some extra metadata to keep track of the number of bits per batch, but that is just one byte per every 128 values. For that matter, we can bit-pack that as well . Note that in this case, we are using the simd packing routines just for bit packing. The actual compression aspect comes not from the packing, but from computing the deltas and then figuring out how densely we can pack those values. This approach is called Frame Of Reference (FOR), and it has some really nice advantages. By splitting the data into blocks, we can take advantage of the patterns in the data. Here are the first 16 deltas in the posting list: Value Bits [0] 1871143144 31 [1] 4 3 [2] 4 3 [3] 4 3 [4] 4 3 [5] 4 3 [6] 4 3 [7] 4 3 [8] 4 3 [9] 4 3 [10] 4 3 [11] 7984 13 [12] 4 3 [13] 4 3 [14] 4 3 [15] 4 3 Let’s assume that the batch size is 4 numbers, what would be the numbers here? Basically, we take the max number of bits for each batch of 4 numbers. The first block will require us to use 31 bits, the second would be just 3 bits, but the third would require 13 bits and the last is 3 bits again. The idea is that for each block, we have a separate frame of reference, the number of bits that are required for us to record & retrieve the data. Of course, if you’ll look at the data itself, we are wasting quite a lot of space here needlessly. There are just two numbers (0, 11) that require more than 3 bits. There is an algorithm called Patched Frame Of Reference (PFor) that handles this. Basically, when we compute a batch, we’ll record it using 3 bits only in this case. What about the rest of the values? Those we’ll store outside the batch, as an exception. On reading, we’ll need to patch the data to make it whole. That adds a bit of complexity, but it also means that you can compress the data a lot more densely. In the next post, I’m going to be talking about FastPFor, which combines simd bit packing, patched frame of references and some really clever tricks to get a very interesting integer compression algorithm.

Integer compression

by Oren Eini

posted on: June 07, 2023

In the previous post, I showed how you can use integer compression using variable-size integers. That is a really straightforward approach for integer compression, but it isn’t ideal. To start with, it means that we take at least one byte for each number, but in many cases, especially if we are using delta encoding, we can pack things a lot more tightly. For reference, here are the first 15 records we have in the posting list I’m using for testing, delta encoded: 1871143144 4 4 4 4 4 4 4 4 4 4 7984 4 4 4 See the range of 4, repeating. Each one of them can be represented by 3 bits, but we’ll waste 32 – 64 bits to hold it. When you are searching for data on “database stuff”, the building blocks that are being used to build databases, you’ll usually find Lemire there. In this case, I would like to talk about the SimdComp library. From the GitHub repository: A simple C library for compressing lists of integers using binary packing and SIMD instructions. This library can decode at least 4 billion of compressed integers per second on most desktop or laptop processors. Those are some really nice numbers, but I actually have a pet peeve with the name of the library. It isn’t actually doing compression, what it does is bit packing, which is something quite different. Bit packing just takes numbers that are 32 bits in length and stores them using fewer bits. For example, let’s consider the following list: 1,2,3,4,5,6,7,8,9,10. If I would store that in 32 bits integers, that would take 40 bytes. But given that the maximum size is 10, I can use 4 bits integers, instead. Taking only 5 bytes. The core of the library is this set of routines: simdpack() simdpackwithoutmask() simdunpack() All of them have roughly the same structure, something like this: This switch statement goes on for 32 bits. Each time selecting a different function. This code makes a lot of assumptions. You’ll always give it exactly an input of 128 numbers and it will pack them and write them to output. The idea is to be as compact as you possibly can. Reading this code is.. a challenge, because there is so much mental weight behind it, or so it seems. For example, take a look at how we pack a set of numbers into 2-bit integers: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters static void __SIMD_fastpack2_32(const uint32_t *_in, __m128i *out) { const __m128i *in = (const __m128i *)(_in); __m128i OutReg; const __m128i mask = _mm_set1_epi32((1U << 2) - 1); __m128i InReg = _mm_and_si128(_mm_loadu_si128(in), mask); OutReg = InReg; InReg = _mm_and_si128(_mm_loadu_si128(++in), mask); OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 2)); InReg = _mm_and_si128(_mm_loadu_si128(++in), mask); OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 4)); InReg = _mm_and_si128(_mm_loadu_si128(++in), mask); // redacted view raw __SIMD_fastpack2_32.c hosted with ❤ by GitHub The actual code goes on for quite a while (104 of dense SIMD code), so let’s try to break it up a bit. If we’ll translate this from SIMD to scalar, we’ll have code like this: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters uint mask = (1u << 2) -1; var input = *in++ & mask; var output = input; input = *in++ & mask; output |= input << 2; input = *in++ & mask; output |= input << 4; view raw scalar.c hosted with ❤ by GitHub If you are used to manipulating bits, it’s fairly straightforward. We start by setting the output to the first value, then the second value was shifted by 2 bits and added to the value, then the third value was shifted by 4 bits, etc. With SIMD, the interesting part is that we aren’t operating on a single value, but 4 at a time. That also means that we have a difference in how the data actually end up being packed. Given a list of values from 1 .. 16, bit packing them will usually result in the data looking like this: In contrast, the SIMD bit packing model will store the data like so: This is a result of operating on 4 items at a time, but in practice, the actual placement of the bits does not matter much, just that we are able to store and retrieve them. Note that for packing, we have two options, one that uses a mask and the other that does not. That will be important later. The only difference between those options is whether a mask is applied before we splice the bits into the output, nothing else. The reason I said that this isn’t compression is that this library is focused solely on packing the bits, it has no behavior / concept beyond that stage. In the next post, we’ll start looking into how we can make use of this in more interesting ways.

Integer compression

by Oren Eini

posted on: June 06, 2023

If you are building a database, the need to work with a list of numbers is commonplace. For example, when building an index, we may want to store all the ids of documents of orders from Europe. You can just store the raw values, of course, but that can be incredibly wasteful in terms of disk space, memory bandwidth and performance. Consider a database where millions of orders are from Europe. We’ll need multiple megabytes just to store the document ids. We can utilize patterns in the numbers to reduce the total storage size. For example, if you are storing a list of sorted numbers, you can store the delta of the numbers, instead of the whole value. The delta tends to be much smaller, after all. Take a look at the following code: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters static void DeltaEncode(List<long> sortedList) { if (sortedList.Count == 0) return; long prev = sortedList [0]; for (int i = 1; i < sortedList.Count; i++) { long current = sortedList [i]; sortedList [i] = current - prev; prev = current; } } static void WriteListToStream(List<long> deltas, Stream outputStream) { BinaryWriter writer = new BinaryWriter(outputStream); foreach (var num in deltas) { writer.Write7BitEncodedInt64(num); } } view raw delta_encode.cs hosted with ❤ by GitHub This will generate (in place) delta for the sorted list. We can then write it to a stream, using 7-bit integer (also called variable size integers) encoding. This way, smaller numbers take less space, which is the point of first doing delta encoding. For testing purposes, I have a list of 44,956 items, which looks like this (full list is here): 1871143144 1871143148 1871143152 1871143156 1871143160 1871143164 1871143168 1871143172 1871143176 1871143180 Note that I’m using int64, so if we would store the data as is, it would take ~351KB. Using delta encoding, we can pack that into just ~45Kb. I care a lot more about the decoding of the values, so let’s see how we read it back, shall we? This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters static List<long> ReadListFromStream(Stream inputStream) { var nums = new List<long>(); var reader = new BinaryReader(inputStream); while (reader.BaseStream.Position < reader.BaseStream.Length) { nums.Add(reader.Read7BitEncodedInt64()); } return nums; } static void DeltaDecode(List<long> list) { for (int i = 1; i < list.Count; i++) { list[i] = list[i - 1] + list[i]; } } view raw delta_decode.cs hosted with ❤ by GitHub With very little code, we can gain quite a jump in information density. That is quite good. For reference purposes, taking the same data and compressing it using GZip would cost us ~58KB, so significantly more than this simpler mechanism. On the other hand, using GZip on the delta encoded list will put us under 8KB. This is because in this context, there is a great deal of repetition in the data, there are many runs that are exactly 4 values apart from one another, so we can get quite a good compression ratio out of this. That does require that we’ll have patterns in the data, which is by no means a guarantee. There is a whole field out there of research into integer compression, since it is such an important aspect of how database internals are working. In this series, I’m going to focus on detailing how we implemented FastPFor inside of RavenDB, what we learned along the way and what the final results were.

Understanding SameSite cookies

by Andrew Lock

posted on: June 06, 2023

In this post I discuss SameSite cookies, what they are, why they're useful, and the limitations when you use them.…

Debug ASP.NET Core application configuration

by Gérald Barré

posted on: June 05, 2023

Configuration in ASP.NET Core is performed using one or more configuration providers. Configuration providers read configuration data from key-value pairs using a variety of configuration sources such as the appsettings.json file, environment variables, vaults and so on. All these sources are merge

What's the latest .NET roadmap?

by Ardalis

posted on: May 31, 2023

.NET Roadmap The latest version of .NET is 7. The current roadmap is for .NET 8. .NET 8 is scheduled to ship in November 2023 and will be an…Keep Reading →

Disabling HSTS for localhost on Chromium-based browsers

by Gérald Barré

posted on: May 29, 2023

TipA newer version of this post is available: Avoid HSTS issues on localhostHttp Strict Transport Security (HSTS) is a security mechanism that instructs the browser to automatically redirect http requests to https before sending a request to the server. When you are developing a web application, yo

Making authenticated cross-origin requests with ASP.NET Core Identity

by Andrew Lock

posted on: May 23, 2023

In this post I describe how to make authenticated cross-origin requests from the browser to an app using ASP.NET Core Identity.…

How to detect Globalization-Invariant mode in .NET

by Gérald Barré

posted on: May 22, 2023

Some libraries don't work when the application is running using Globalization-Invariant mode. This mode disabled all globalization features and force the use of the Invariant culture. This mode is useful for applications that don't need globalization features and want to reduce the application size

Talk

Integer compression

Integer compression

Integer compression

Understanding SameSite cookies

Debug ASP.NET Core application configuration

What's the latest .NET roadmap?

Disabling HSTS for localhost on Chromium-based browsers

Making authenticated cross-origin requests with ASP.NET Core Identity

How to detect Globalization-Invariant mode in .NET