In the previous post, I discussed FastPFor encoding; now I’m going to focus on how we deal with decoding. Here is the decode struct:
Note that this is a struct for performance reasons. We expect heavy usage here, so saving the allocation ends up paying high dividends. Before we start, I want to pay special attention to the fields on the decoder:
public unsafe struct FastPForDecoder
{
    private byte* _input;
    private byte* _metadata;
    private readonly byte* _end;
    private uint* _exceptions;
    private fixed int _exceptionOffsets[33];
    private Vector256<long> _prev;
}
Of particular interest is the _exceptionOffsets array. If you aren’t familiar with it, this is a fixed-size buffer (the C# fixed keyword) stored inline on the struct.
Here is the constructor for the decoder:
public FastPForDecoder(byte* input, int size)
{
    if (size <= sizeof(PForHeader))
        throw new ArgumentOutOfRangeException(nameof(size));
    _input = input;
    _end = input + size;
    ref var header = ref Unsafe.AsRef<PForHeader>(input);
    _metadata = input + header.MetadataOffset;
    _exceptions = null;
    var exceptionsBufferSize = 1024;
    var exceptionBufferOffset = 0;
    _exceptions = (uint*)NativeMemory.Alloc((uint)(exceptionsBufferSize * sizeof(uint)));
    var exception = _input + header.ExceptionsOffset;
    for (int i = 2; i <= 32; i++)
    {
        if ((header.ExceptionsBitmap & (1 << (i - 1))) == 0)
            continue;
        var count = Unsafe.Read<ushort>(exception);
        exception += sizeof(ushort);
        if (count + exceptionBufferOffset > exceptionsBufferSize)
        {
            exceptionsBufferSize *= 2;
            _exceptions = (uint*)NativeMemory.Realloc(_exceptions, (uint)(exceptionsBufferSize * sizeof(uint)));
        }
        BitPacking.UnpackSegmented(exception, count, _exceptions + exceptionBufferOffset, (uint)i);
        _exceptionOffsets[i] = exceptionBufferOffset;
        exceptionBufferOffset += count;
        exception += BitPacking.RequireSizeSegmented(count, i);
    }
    _input += sizeof(PForHeader);
    _prev = Vector256.Create(header.Baseline);
}
We are being provided with the encoded buffer and its size. What is actually happening here?
We start by allocating memory for the exceptions buffer, then scan the exceptions bitmap and extract the exceptions data into the buffer. We use a single buffer for all of the exceptions, and _exceptionOffsets indicates where the exceptions for each bit width currently start.
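To make that layout concrete, here is a small sketch in Python of the idea: one flat buffer holding all exception values, grouped by bit width, plus an offset table recording where each width’s run begins (the names are mine, purely illustrative):

```python
def build_exception_buffer(exceptions_by_width):
    """exceptions_by_width: {bit_width: [values]} -> (flat buffer, offsets).

    Mirrors the decoder's setup: all exceptions live in one buffer, and the
    offsets table tells us where each bit width's exceptions start.
    """
    flat, offsets = [], {}
    for width in sorted(exceptions_by_width):
        offsets[width] = len(flat)  # this width's exceptions begin here
        flat.extend(exceptions_by_width[width])
    return flat, offsets

buf, offs = build_exception_buffer({2: [1, 3], 5: [17, 29, 4]})
# buf holds all values contiguously; offs records the starting positions
```

During decoding, consuming an exception is then just reading flat[offsets[width]] and bumping the offset, which is exactly what the ref-based offset++ in the decoder does.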
Finally, we set the _prev to be a vector of the baseline. That is the counterpoint to how we dealt with the values during the encoding. Like the encoding process, we have to deal with three states:
Big deltas (for the next block)
Variable integer (at the end)
Packed block of 256 numbers
We’ll start with dealing with the first two options:
public int Read(long* output, int outputCount)
{
    List<long> bigDeltaOffsets = null;
    var buffer = stackalloc uint[256];
    int read = 0;
    while (_metadata < _end && read < outputCount)
    {
        Debug.Assert(read + 256 <= outputCount, "We assume a minimum of 256 free spaces");
        var numOfBits = *_metadata++;
        switch (numOfBits)
        {
            case FastPForEncoder.BiggerThanMaxMarker:
                (bigDeltaOffsets ??= new List<long>()).Add((long)_metadata);
                // we don't need to worry about block fit, because we are ensured that we have at least
                // 256 items to read into the output here, and these markers are for the next block
                // batch location + 16 bytes high bits of the delta
                _metadata += 17;
                continue;
            case FastPForEncoder.VarIntBatchMarker:
                var countOfVarIntBatch = *_metadata++;
                var prevScalar = _prev.GetElement(3);
                for (int i = 0; i < countOfVarIntBatch; i++)
                {
                    var cur = ReadVarInt(_input, out var offset);
                    _input += offset;
                    cur += prevScalar;
                    output[read++] = cur;
                    prevScalar = cur;
                }
                _prev = Vector256.Create(prevScalar);
                continue;
            case > 32 and < FastPForEncoder.BiggerThanMaxMarker:
                throw new ArgumentOutOfRangeException("Unknown bits amount: " + numOfBits);
        }
If we have a big delta, we record that in bigDeltaOffsets. Note that we are playing an annoying game here: I’m actually storing the metadata position in the delta offsets. I can cast it to long and then back to a pointer because I’m guaranteed that the data is fixed during the operation.
For the varint scenario, which covers the last items in the batch, we don’t have to do much: we compute the running total of the values as we read them from the input, and that’s about it.
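The idea can be sketched in a few lines of Python (assuming a LEB128-style varint; an illustration of the running-total decoding, not the actual RavenDB code):

```python
def read_varint(data, pos):
    """LEB128-style unsigned varint: 7 bits per byte, high bit = continuation."""
    result, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7

def decode_varint_batch(data, count, prev):
    """Decode `count` varint deltas, keeping a running total like the decoder."""
    out, pos = [], 0
    for _ in range(count):
        delta, pos = read_varint(data, pos)
        prev += delta
        out.append(prev)
    return out

# deltas 1, 200, 3 decoded against a running prev of 10
decode_varint_batch(bytes([0x01, 0xC8, 0x01, 0x03]), 3, 10)  # -> [11, 211, 214]
```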
Things get a lot more interesting when we start dealing with the actual compressed blocks:
var numOfExceptions = *_metadata++;
SimdBitPacking<NoTransform>.Unpack256(0, _input, buffer, numOfBits);
_input += numOfBits * Vector256<byte>.Count;
if (numOfExceptions > 0)
{
    var maxNumOfBits = *_metadata++;
    var bitsDiff = maxNumOfBits - numOfBits;
    if (bitsDiff == 1)
    {
        var mask = 1u << numOfBits;
        for (int i = 0; i < numOfExceptions; i++)
        {
            var idx = *_metadata++;
            buffer[idx] |= mask;
        }
    }
    else
    {
        ref var offset = ref _exceptionOffsets[bitsDiff];
        for (int i = 0; i < numOfExceptions; i++)
        {
            var remains = _exceptions[offset++];
            var idx = *_metadata++;
            buffer[idx] |= remains << numOfBits;
        }
    }
}
var expectedBufferIndex = -1;
if (bigDeltaOffsets != null && bigDeltaOffsets.Count > 0)
{
    expectedBufferIndex = *(byte*)bigDeltaOffsets[0];
}
Here we simply decode the values into the buffer, then we need to apply the exceptions. You can see that we saved some space by not storing exceptions of 1 bit (since we know what the value would be). For exceptions of other sizes, note how we consume each exception from the shared buffer via _exceptionOffsets. I’m using a ref variable here, so the offset++ operation increments the value in the array itself. That ensures that we have to keep very little state in the decoding process as it moves forward. Remember that we require the output buffer for the decoded numbers to have room for at least 256 values, to ensure that a whole block can fit. That doesn’t mean that we have enough space to fit everything, so we may be called multiple times and need to resume from where we left off.
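The patching itself boils down to OR-ing the stored high bits back in above the packed low bits. A minimal sketch of the idea (illustrative Python, not the actual code):

```python
def apply_exception(low_bits, exception, num_of_bits):
    """A block packed at num_of_bits keeps only the low bits of each value;
    an exception stores the remaining high bits, shifted back on decode."""
    return low_bits | (exception << num_of_bits)

# the 1-bit case is special: the extra bit is always 1, so only the index
# is stored, and the decoder just ORs in the mask 1 << num_of_bits
def apply_one_bit_exception(low_bits, num_of_bits):
    return low_bits | (1 << num_of_bits)

apply_exception(0b101, 0b11, 3)  # -> 0b11101 (29)
```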
Finally, we set the expectedBufferIndex if we have a big delta offset. We’ll shortly see what we’ll do with this.
Remember that at this point, the buffer we use has int32 deltas in it, not the actual final numbers. In order to get the final values, we need to compute the sum of the deltas, this is handled here:
for (int i = 0; i + Vector256<uint>.Count <= 256; i += Vector256<uint>.Count)
{
    var (a, b) = Vector256.Widen(Vector256.Load(buffer + i));
    if (expectedBufferIndex == i)
    {
        a |= GetDeltaHighBits();
    }
    PrefixSumAndStoreToOutput(a, ref _prev);
    if (expectedBufferIndex == i + 4)
    {
        b |= GetDeltaHighBits();
    }
    PrefixSumAndStoreToOutput(b, ref _prev);
}
What do we do here? We load a vector of 8 integers (32 bits) and widen that to two vectors of 4 integers (64 bits) each.
We then check whether this is the expected buffer index for big delta offsets; if so, we OR the high parts of the number back into the vector. This is handled here:
Vector256<ulong> GetDeltaHighBits()
{
    var ptr = (byte*)bigDeltaOffsets![0];
    bigDeltaOffsets.RemoveAt(0);
    var highBitsDelta = Vector128.Load(ptr)
        .AsInt32().ToVector256();
    if (bigDeltaOffsets.Count > 0)
    {
        expectedBufferIndex = *(byte*)bigDeltaOffsets[0];
    }
    else
    {
        expectedBufferIndex = -1;
    }
    // the last 4 elements are known zero, so no need to AND with zero
    highBitsDelta = Vector256.Shuffle(highBitsDelta, Vector256.Create(7, 0, 7, 1, 7, 2, 7, 3));
    return highBitsDelta.AsUInt64();
}
There are quite a few things happening here all at once. We first read the current delta offset from the metadata, and read 16 bytes into a Vector128. We then extend that vector to 256 bits, which basically pads the vector with zeros. We set the next index to check, and then we shuffle the vector of numbers to arrange them in the high parts of the vector as longs.
Let’s say that we had the numbers [1,2,3,4] in the metadata, we read them into the Vector128, like so:
Vector128 = [1,2,3,4]
We then extended that to a Vector256, like so:
Vector256 = [1,2,3,4,0,0,0,0]
Finally, we shuffle the values (in this case, index 7 is known to be zero), so we have:
highBitsDelta = [0,1,0,2,0,3,0,4]
We convert that to a vector of longs (with the same bits pattern) and then or that with the values from the packed deltas.
We have to check expectedBufferIndex twice on each iteration, but we expect big deltas to be rare, so the branch predictor is going to be really good at predicting this.
Finally, we have the part where we compute the prefix sum, here:
void PrefixSumAndStoreToOutput(Vector256<ulong> curUl, ref Vector256<long> prev)
{
    var cur = curUl.AsInt64();
    // doing prefix sum here: https://en.algorithmica.org/hpc/algorithms/prefix/
    cur += Vector256.Shuffle(cur, Vector256.Create(0, 0, 1, 2)) &
           Vector256.Create(0, -1, -1, -1);
    cur += Vector256.Shuffle(cur, Vector256.Create(0, 0, 0, 1)) &
           Vector256.Create(0, 0, -1, -1);
    cur += prev;
    prev = Vector256.Shuffle(cur, Vector256.Create(3, 3, 3, 3));
    cur.Store(output + read);
    read += Vector256<long>.Count;
}
This looks like black magic, but let’s break it apart into individual pieces.
Let’s assume that we started out with [25,27,30, 35] – using our approach, we have Baseline: 21 and the deltas: [4,2,3,5].
We start with prev being set to the baseline value on all elements in the vector:
prev = [21,21,21,21]
And cur is set to deltas:
cur = [4,2,3,5]
In the first shuffle & BitwiseAnd step, let’s see what we get:
first step, shuffled: [4,4,2,3]
first step, masked: [0,4,2,3]
We add that to cur, giving us:
cur = [4 + 0, 2 + 4, 3 + 2, 5 + 3] = [4,6,5,8]
We then do another shuffle & mask:
second step, shuffled: [4,4,4,6]
second step, masked: [0,0,4,6]
And adding that to cur, will give us:
cur = [4 + 0, 6 + 0, 5 + 4, 8 + 6] = [4,6,9,14]
We now add the prev vector, giving us:
cur = [4 + 21, 6 + 21, 9 + 21, 14 + 21] = [25, 27, 30, 35]
We computed the sum again, hurray!
We then set all elements of prev to the last value of cur:
prev = [35,35,35,35]
And now we are ready for the next element to compute.
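The walkthrough above can be emulated in scalar Python, one 4-lane vector at a time (a sketch of the shuffle-and-add prefix sum, purely illustrative):

```python
def prefix_sum_lane4(cur, prev_last):
    """Emulates the two shuffle+mask+add steps, then carries prev into all lanes."""
    s = [cur[0], cur[0], cur[1], cur[2]]          # shuffle [0,0,1,2]
    cur = [c + (x if i >= 1 else 0) for i, (c, x) in enumerate(zip(cur, s))]
    s = [cur[0], cur[0], cur[0], cur[1]]          # shuffle [0,0,0,1]
    cur = [c + (x if i >= 2 else 0) for i, (c, x) in enumerate(zip(cur, s))]
    cur = [c + prev_last for c in cur]            # add the running baseline
    return cur, cur[3]                            # new values, new carry

prefix_sum_lane4([4, 2, 3, 5], 21)  # -> ([25, 27, 30, 35], 35)
```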
And… this is pretty much it, to be honest. We keep reading until we run out of space in the output buffer or have consumed all the metadata. We are able to decode the data really fast, and it only took a 10-part (so far) series of blog posts to explain.
In the next post, I’m going to discuss what we can see from actually using this with real numbers.
In the previous post I outlined the requirements we have for FastPFor in RavenDB. Now I want to dig into the actual implementation. Here is the shape of the class in question:
The process starts when we start encoding the items. Each call to encode is going to process a different posting list, and it is expected that a single instance will process many posting lists during a single indexing run.
Here is the internal state of this class:
Here is how we start the process:
public int Encode(long* entries, int count)
{
    AssertIsSorted(entries, count);
    _entries = entries;
    _count = count;
    _offset = 0;
    _metadataPos = 0;
    for (int k = 0; k < _exceptions.Length; k++)
    {
        _exceptions[k]?.Clear();
        _exceptionsStart[k] = 0;
    }
    _metadata.Clear();
    if (_entriesOutputSize < count)
    {
        var newCount = Math.Max(256, BitOperations.RoundUpToPowerOf2((uint)count));
        _entriesOutput = (byte*)NativeMemory.Realloc(_entriesOutput, newCount * sizeof(long));
        _entriesOutputSize = newCount;
    }
As you can see, there isn’t much here. We initialize the state of _exceptions and _exceptionsStart (I’ll discuss these in more detail later in this post) and then ensure that we have a big enough buffer to encode the data into. Note that we retain a pointer to the original entries that we were passed; that will be important later as well.
Here is the code of the encoding process:
var totalSize = sizeof(PForHeader);
int i = 0;
var entriesAsInt = (uint*)_entriesOutput;
var entriesIn = _entries;
var prev = Vector256.Create(*entriesIn);
var max = Vector256.Create<long>(uint.MaxValue);
for (; i + 256 <= _count; i += 256)
{
    var blockStart = entriesAsInt;
    int j = 0;
    for (; j < 256; j += Vector256<long>.Count)
    {
        var cur = Vector256.Load(entriesIn + i + j);
        var mixed = Vector256.Shuffle(cur, Vector256.Create(0, 0, 1, 2)) & Vector256.Create(0, -1, -1, -1) |
                    Vector256.Shuffle(prev, Vector256.Create(3, 3, 3, 3)) & Vector256.Create(-1, 0, 0, 0);
        prev = cur;
        var delta = cur - mixed;
        if (Vector256.GreaterThanAny(delta, max))
        {
            HandleDeltaGreaterThanMax(j, delta);
        }
        var deltaInts = Vector256.Shuffle(delta.AsUInt32(), Vector256.Create(0u, 2, 4, 6, 0, 0, 0, 0));
        deltaInts.Store(entriesAsInt);
        // we write 8 values, but increment by 4, so we'll overwrite it next op
        entriesAsInt += Vector256<long>.Count;
    }
    totalSize += ProcessBlock(blockStart);
}
I create a couple of vector variables, one to hold the first item in the list and the other to hold uint.MaxValue. Then I start processing the buffers in blocks of 256 items.
A good resource for using vectors and SIMD in C# can be found here.
The classic FastPFor implementation uses vectors of 128 bits (although there are implementations of up to 512 bits). I chose to use Vector256 for this implementation. Today, pretty much any x64 machine is going to have native support for AVX2 (shipped since 2013!). On ARM, the operations are implemented using Vector128, so they are still vectorized there as well.
What this means is that we can process 256 numbers as a block. But what are we doing with them? Look at the inner loop, where you can see some interesting details. We are using shuffle instructions as well as masking to shift the current value and add the previous one.
In other words, let’s assume that we have the following input array: 1,2,3,4,5,6,7,8 – using longs, so Vector256 will contain 4 elements.
Before the loop, we create the prev vector and assign it 4 copies of the first item in the array:
prev = [1,1,1,1]
Inside the inner loop, we load 4 elements from the array into cur, giving us:
cur = [1,2,3,4]
Next, we shuffle both the prev and the cur values; here are the intermediate states:
curShuffled = [0,1,2,3] – note that we have 0 at the start because of the mask
prevShuffled = [1,0,0,0] – the last three items are zero because of the mask.
We then or those two intermediates together, giving us:
mixed = [1,1,2,3]
We subtract that from cur to get the delta. The idea is that we massively reduce the number of operations that we have to perform to compute the delta of the values.
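Here is a scalar emulation of that delta trick for a single 4-element lane (an illustrative Python sketch of the shuffle/mask logic, not the vectorized code):

```python
def lane_deltas(cur, prev):
    """mixed is cur shifted right by one lane, with prev's last lane carried
    into lane 0, so every lane subtracts its predecessor in one step."""
    mixed = [prev[3], cur[0], cur[1], cur[2]]
    return [c - m for c, m in zip(cur, mixed)], cur  # deltas, new prev

lane_deltas([1, 2, 3, 4], [1, 1, 1, 1])  # -> ([0, 1, 1, 1], [1, 2, 3, 4])
```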
We then check whether any of the deltas is greater than uint.MaxValue; we’ll dig into that in a bit.
The last part of the loop is where we deal with deltaInts, and we are doing something interesting there. We pull the low bits from all four elements in the vector to the start of the vector, then write the deltas to the output entries buffer.
However, we write 8 values but increment the position by only 4, so we’ll actually overwrite the last 4 items in the next loop iteration. The output buffer is sized based on the number of entries, so we are ensured to have enough space, and the double writes give us better performance.
The end result of the inner loop is that we perform a running delta over the items, converting them from longs to ints. Then we can call the ProcessBlock() method to do the core of the FastPFor algorithm:
private int ProcessBlock(uint* currentEntries)
{
    var (bestB, maxB, exceptionCount) = FindBestBitWidths(currentEntries);
    _metadata.Add((byte)bestB);
    _metadata.Add((byte)exceptionCount);
    if (exceptionCount > 0)
    {
        _metadata.Add((byte)maxB);
        uint maxVal = 1u << bestB;
        for (int j = 0; j < 256; j++)
        {
            if (currentEntries[j] >= maxVal)
            {
                var exList = _exceptions[maxB - bestB] ??= new();
                exList.Add(currentEntries[j] >>> bestB);
                _metadata.Add((byte)j);
            }
        }
    }
    return bestB * Vector256<byte>.Count;
}
I already explained how this works in this post.
It is important to understand that we have multiple data channels that we operate on here:
The entries output, where we write the packed bits
Metadata, where we describe each block
Exceptions, where we store the exception data
The encoding process runs through all of them, but isn’t actually going to be writing them anywhere, that happens later.
We process the entries in the array in blocks of 256 numbers each, but what happens when we get to the end? The last block may not be a multiple of 256 items, after all. For those last items, we use simple delta encoding plus variable-sized integers. Nothing really fancy is needed at this point, since the amount of data involved is very small.
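That tail handling amounts to delta encoding followed by varints, something like this (illustrative Python, assuming a LEB128-style varint; not the actual implementation):

```python
def write_varint(value, out):
    """LEB128-style unsigned varint: 7 bits per byte, high bit = continuation."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)

def encode_tail(entries, prev):
    """Delta-encode the remaining (< 256) sorted entries as varints."""
    out = bytearray()
    for v in entries:
        write_varint(v - prev, out)
        prev = v
    return bytes(out)

encode_tail([11, 211, 214], 10)  # deltas 1, 200, 3 as varint bytes
```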
Finally, we run over all the exceptions data that we created during the encoding process and compute their size. The final result we have is the total size we would need to store all the encoded data as a single buffer.
What about handling numbers whose delta is greater than uint.MaxValue?
void HandleDeltaGreaterThanMax(int batchOffset, Vector256<long> delta)
{
    // we expect this to be *rare*, so not trying to optimize this in any special way
    Span<uint> deltaHighBuffer = stackalloc uint[8];
    var highAndLow = Vector256.Shuffle(delta.AsUInt32(), Vector256.Create(1u, 3, 5, 7, 0, 2, 4, 6));
    highAndLow.StoreUnsafe(ref deltaHighBuffer[0]);
    _metadata.Add(BiggerThanMaxMarker);
    _metadata.Add((byte)batchOffset);
    Span<byte> asBytes = MemoryMarshal.AsBytes(deltaHighBuffer);
    for (int j = 0; j < 16; j++)
    {
        _metadata.Add(asBytes[j]);
    }
}
You can see that we expect this to be rare, so we handle that to be on the safe side, but aren’t trying to be too optimal about it. We simply take the high portion of the current delta and write it to the metadata. That is wasteful in terms of space, but it is simple and doesn’t require complex code. We’ll see how that is being used during the decoding process in the next post.
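Conceptually, this is just splitting the 64-bit delta into two 32-bit halves: the low half goes through the normal packed path, while the high half rides along in the metadata and is OR-ed back in during decoding. A trivial Python sketch of the round trip:

```python
def split_big_delta(delta):
    """Low 32 bits take the normal packed path; high 32 bits go to metadata."""
    return delta & 0xFFFFFFFF, delta >> 32

def rejoin_big_delta(low, high):
    return low | (high << 32)

low, high = split_big_delta(5_000_000_000)  # a delta bigger than uint.MaxValue
rejoin_big_delta(low, high)                 # -> 5_000_000_000
```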
Finally, it is now time to actually write things out to the buffer. We have to be careful here, since we need to support partial writes. Let’s see how that works:
public (int Count, int SizeUsed) Write(byte* output, int outputSize)
{
    Debug.Assert(outputSize <= ushort.MaxValue, "Output buffer too large, we use ushort for offsets and don't want to overflow");
    var sizeUsed = sizeof(PForHeader);
    if (sizeUsed > outputSize)
        return default;
    ref var header = ref Unsafe.AsRef<PForHeader>(output);
    var baseline = _entries[Math.Max(0, _offset - 1)];
    header.Baseline = baseline;
    var exceptionsCounts = stackalloc int[33];
    var startingMetadataPosition = _metadataPos;
    var exceptionsRequiredSize = 0;
    var entriesOutput = (uint*)_entriesOutput;
    var oldOffset = _offset;
We define a header for the encoded data, which looks like this:
The header contains the initial value of the encoded data, the exceptions bitmap (discussed here), and the exceptions and metadata offsets. Because we always limit the size of the buffer to a maximum of 8KB, I used ushort for the offsets.
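The header struct itself isn’t shown in this excerpt, but based on the fields described (a long Baseline, an exceptions bitmap, and two ushort offsets) a plausible layout can be sketched as follows; this is a hypothetical reconstruction, not the actual RavenDB definition:

```python
import struct

# hypothetical PForHeader layout: Baseline (int64), ExceptionsBitmap (uint32),
# ExceptionsOffset (uint16), MetadataOffset (uint16); little-endian, no padding
PFOR_HEADER = struct.Struct('<qIHH')

def pack_header(baseline, exceptions_bitmap, exceptions_offset, metadata_offset):
    return PFOR_HEADER.pack(baseline, exceptions_bitmap,
                            exceptions_offset, metadata_offset)

PFOR_HEADER.size  # 16 bytes under this assumed layout
```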
The first value we set in the header is the baseline, and here we need to explain a bit. When we encode the data, we start from the very first item in the list and go forward. The problem with doing things in this manner is that when we need to split the encoded data, we need to use as the baseline the previous value that we encoded with. For that reason, we use either the first item in the entries array or the last item from the previous block that we consumed.
Next, we need to deal with the exceptions. As a reminder, we have a set of exceptions that are shared across blocks. When we do a partial write, we need to keep track of how many exceptions we consumed from each of the exception arrays.
The most interesting part comes when we start iterating over the metadata values, which tells us how to deal with the data:
while (_metadataPos < _metadata.Count)
{
    var batchMetadataStart = _metadataPos;
    var numOfBits = _metadata[_metadataPos++];
    switch (numOfBits)
    {
        case BiggerThanMaxMarker:
            // nothing to do, doesn't impact writes
            _metadataPos += 17; // batch offset + 16 bytes
            continue;
        case VarIntBatchMarker:
            var sizeOfVarIntBatch = _metadata[_metadataPos++];
            var varintSize = WriteLastBatchAsVarIntDelta(sizeOfVarIntBatch, output + sizeUsed, outputSize - sizeUsed);
            var expectedMetadataSize = _metadataPos - startingMetadataPosition;
            if (varintSize == 0 || // couldn't fit the var int buffer
                sizeUsed + varintSize + exceptionsRequiredSize + expectedMetadataSize > outputSize) // wouldn't be able to fit the exceptions & metadata
            {
                _metadataPos = batchMetadataStart;
                goto AfterLoop;
            }
            _offset += sizeOfVarIntBatch;
            sizeUsed += varintSize;
            continue;
        case > 32 and < BiggerThanMaxMarker:
            throw new ArgumentOutOfRangeException("Invalid bits value: " + numOfBits);
    }
Here we start with the exceptional cases: a metadata value that marks a delta greater than the max, or the varint batch marker (which occurs at the end).
You can see that there isn’t much there, except that we verify that we have enough space for the written buffers. For fun, I’m also using a goto here; it is needed to break out of the loop, since I cannot use a break statement inside a switch case.
The more interesting aspect happens when we process full blocks of compressed 256 numbers:
var numOfExceptions = _metadata[_metadataPos++];
var reqSize = numOfBits * Vector256<byte>.Count;
ref var exceptionCountRef = ref exceptionsCounts[0];
var amountToAddToException = 0;
if (numOfExceptions > 0)
{
    var maxNumOfBits = _metadata[_metadataPos++];
    var exceptionIndex = maxNumOfBits - numOfBits;
    if (exceptionIndex > 1)
    {
        var oldCount = exceptionsCounts[exceptionIndex];
        amountToAddToException = numOfExceptions;
        exceptionCountRef = ref exceptionsCounts[exceptionIndex];
        if (oldCount == 0)
        {
            exceptionsRequiredSize += sizeof(ushort); // size for the number of items here
        }
        exceptionsRequiredSize -= BitPacking.RequireSizeSegmented(oldCount, exceptionIndex);
        exceptionsRequiredSize += BitPacking.RequireSizeSegmented(numOfExceptions + oldCount, exceptionIndex);
    }
}
_metadataPos += numOfExceptions;
var metaSize = _metadataPos - startingMetadataPosition;
var finalSize = (sizeUsed + reqSize + exceptionsRequiredSize + metaSize);
if (finalSize > outputSize)
{
    _metadataPos = batchMetadataStart;
    break;
}
SimdBitPacking<MaskEntries>.Pack256(0, entriesOutput + _offset,
    output + sizeUsed, numOfBits, new MaskEntries((1u << numOfBits) - 1));
sizeUsed += reqSize;
_offset += 256;
exceptionCountRef += amountToAddToException;
This looks like a lot of code, but what it is actually doing is mostly checking that we have enough space in the buffer for the block we are about to write. Of particular interest to us here is that we may have enough space to write the block itself, but we also need to ensure that we have space for the metadata and exceptions data that are associated with the current and all previous blocks.
If we have enough space, we continue to write more blocks to the output, or we exit if we are about to run out of room. After the loop, we have the header and compressed integers data in the output buffer, but we still need to write the metadata and exceptions data, this is handled here:
AfterLoop:
uint bitmap = 0;
header.ExceptionsOffset = checked((ushort)sizeUsed);
for (int numOfBits = 2; numOfBits <= 32; numOfBits++)
{
    var count = exceptionsCounts[numOfBits];
    if (count == 0)
        continue;
    bitmap |= 1u << numOfBits - 1;
    Unsafe.Write(output + sizeUsed, (ushort)count);
    sizeUsed += sizeof(ushort);
    var span = CollectionsMarshal.AsSpan(_exceptions[numOfBits]);
    var exceptionStart = _exceptionsStart[numOfBits];
    span = span[exceptionStart..(exceptionStart + count)];
    fixed (uint* b = span)
    {
        int size = BitPacking.PackSegmented(b, span.Length, output + sizeUsed, (uint)numOfBits);
        sizeUsed += size;
    }
    _exceptionsStart[numOfBits] += count;
}
header.ExceptionsBitmap = bitmap;
header.MetadataOffset = checked((ushort)sizeUsed);
var metadataSize = (_metadataPos - startingMetadataPosition);
var metadataSpan = CollectionsMarshal.AsSpan(_metadata);
var metadataBlockRange = metadataSpan[startingMetadataPosition..(startingMetadataPosition + metadataSize)];
metadataBlockRange.CopyTo(new Span<byte>(output + sizeUsed, metadataSize));
sizeUsed += metadataSize;
return (_offset - oldOffset, sizeUsed);
We compute the bitmap of all the exceptions, writing them out as packed segments (basically, blocks of 256 packed-bit values). Note that we have to account for how many of them we actually wrote to the output, since we may have stopped in the middle.
Then we can simply write out the metadata buffer and return the total size we took.
I find it very amusing that we use the same bit packing routines to store the numbers and the exceptions data as well.
Overall, most of the code isn’t particularly interesting, but there are a few highlights that I would like to point out:
We compute the deltas and the narrowing from int64 to int32 on the buffer during the Encode()
We do the actual bit packing during the Write()
There is some argument to be made that we could do the bit packing during the Encode() as well, but I believe it is unlikely to matter.
In the next post, I’m going to be diving into how we implement decoding for FastPFor. That is actually of far more interest to me, since that happens a lot more often. It’s going to be interesting…
In this series so far, I explored several ways that we can implement integer compression. I focused on the FastPFor algorithm and dove deeply into how it works. In the last post, I showed how we can use the SIMD intrinsics to generate vectorized code that can process a lot of data very quickly.
With this baseline established, we need to understand the use case we have in RavenDB for integer compression. We are actually using this for the classic scenario, posting lists as part of the indexing engine. However, there are quite a few differences in how we want to actually utilize it.
Let’s take Lucene as a good example, since it is one of the classic search engines. From the ground up, Lucene is meant to be used in batching scenarios. That has a lot of impact on the design. In particular, it means that a posting list for Lucene is something that you write once, in a big batch of operations. For RavenDB, however, we explicitly reject this approach. We want to allow on-the-fly indexing, and we cannot afford to pay the cost of merging many small writes into big files over time.
The FastPFor algorithm is meant to get a lot of data, encode that in a compressed form and then allow you to iterate over the results as fast as possible. But RavenDB’s needs are different. At the most basic level, RavenDB thinks about things in 8KB pages. A posting list for RavenDB is composed of the following:
Single – just a single value (common when indexing unique values)
Small – can fit in under 4KB
Large – requires more than 4KB of data, in which case we split it on 8KB boundaries.
What’s the point here? The idea is that when we need to update a posting list (add or remove items), we have one of those three options:
Single – we just update the value, duh!
Small – FastPFor is fast, right? It is cheap enough to unpack the compressed values, merge with the new items (or the removals) and compress again.
Large – here we are talking about a lot of values, potentially tens of millions or higher. How do we handle updates in this case?
The answer is that we are going to be making use of a B+ Tree, with some caveats. The basic idea is that we have two types of pages in the tree. A branch page is the classic B+ Tree concept, which points to leaf pages below it that contain some range of values. However, the leaf pages, on the other hand, are very different.
Each leaf page is just an 8KB buffer that holds the compressed data, nothing more. When we want to update the values in the posting list, we find the relevant page that needs to be updated, then we just unpack, merge & compress the values. It’s actually cheaper to go this route than any other alternative.
There is another important aspect. Lucene (and most other indexing systems) assumes that your posting list is using 32-bit integers. That limits the number of items in an index to 2.1 billion (up to 4.2 billion, for some implementations). I don’t like that limit, not at today’s scale. So we internally use 64-bit integers in our posting lists.
That… however, is actually too big. In practice, we use 54 bits for the indexed entries. That gives us a maximum number of 18,014,398,509,481,984 items and at a range of 100,000 items per second, we are good for 5,712 years. Those 10 bits allow us to encode additional information about the actual entry that is being indexed, and reduce the number of external lookups we need to do. Taking 10 bits from the entry id means that we are going to hit the 32 bits limit at around 2 million entries, which is really low.
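As a concrete illustration, here is a minimal C sketch of such a split. The names, and the exact placement of the 10 extra bits, are my own assumptions for illustration, not RavenDB’s actual layout:

```c
#include <stdint.h>

/* Hypothetical layout: low 54 bits hold the entry id, top 10 bits hold
   the extra per-entry information. Names are illustrative only. */
enum { ENTRY_ID_BITS = 54 };
static const uint64_t ENTRY_ID_MASK = ((uint64_t)1 << ENTRY_ID_BITS) - 1;

static uint64_t make_entry(uint64_t id, uint16_t extra)
{
    return (id & ENTRY_ID_MASK) | ((uint64_t)extra << ENTRY_ID_BITS);
}

static uint64_t entry_id(uint64_t entry)    { return entry & ENTRY_ID_MASK; }
static uint16_t entry_extra(uint64_t entry) { return (uint16_t)(entry >> ENTRY_ID_BITS); }
```

With 54 bits the maximum id is 2^54 = 18,014,398,509,481,984, matching the number quoted above, and the 10-bit field holds up to 1,024 distinct values.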
Then there is another problem: if I have a posting list with 100,000 items that I want to compress, I would need to spread the compressed data over multiple pages, but figuring out how to split the compressed stream in a way that keeps it usable is not trivial at all.
In short, we have two major concerns with using FastPFor in RavenDB: int64 values and paged writes. There is another aspect that I didn’t mention, however. FastPFor will take an array of numbers and compress them. If you want to also use delta compression, that is something that you need to add on top of that.
Now that we understand the constraints we have to operate around, let’s talk about the kind of API that I would like to build:
What is going on here? Well, the basic idea is that you’ll create an encoder as part of the indexing process. For each posting list that you want to process, you’ll call the Encode() method, passing it the data to be encoded.
The Encode() method will return the required size for encoding the entire list as a single operation. Note that it does not actually write the data out (indeed, it has nowhere to write it). The calling code is then able to make decisions based on the returned size. If they have a buffer big enough to hold the encoded data, they can pass it to the Write() method.
Things get more interesting when we don’t have a suitable buffer. Remember that we have a limit of 4KB for small posting lists and 8KB per page for large posting lists. The idea with Write() is that it will write as much as it can to the output, allowing you to use multiple buffers to store the compressed posting list.
In terms of work being done, the Encode() phase copies the posting list entries to its own buffer, preparing them to be written out. The process of actually writing the data to the output buffer is mostly concerned with ensuring that we don’t exceed the size of the provided buffer.
This approach means that we only have to pay the encoding cost once, which is quite nice. Moreover, the number of checks that we have to make to ensure that we can fit the data into the buffer is also greatly reduced.
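To make the two-phase flow concrete, here is a toy C sketch of the pattern, with invented names; the real encoder performs the FastPFor encoding where this sketch merely copies. encode() stages the data and reports the total size required, then write() drains as much as fits into each buffer the caller provides:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative two-phase encoder: encode() stages the entries and returns
   the total size needed; write() drains as much as fits into the caller's
   buffer. A real implementation compresses here instead of copying. */
typedef struct {
    uint8_t staged[64 * 1024]; /* arbitrary capacity for the sketch */
    size_t  total;             /* bytes staged by encode() */
    size_t  written;           /* bytes drained so far by write() */
} encoder;

static size_t encoder_encode(encoder *e, const int64_t *entries, size_t count)
{
    e->total = count * sizeof(int64_t);
    e->written = 0;
    memcpy(e->staged, entries, e->total);
    return e->total; /* the caller decides how to buffer based on this */
}

static size_t encoder_write(encoder *e, uint8_t *out, size_t capacity)
{
    size_t remaining = e->total - e->written;
    size_t n = remaining < capacity ? remaining : capacity;
    memcpy(out, e->staged + e->written, n);
    e->written += n;
    return n; /* 0 means the whole list has been written out */
}
```

Calling write() repeatedly with 4KB or 8KB buffers matches the page-at-a-time flow described above.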
What about decoding? That is handled using:
Here we have a very interesting observation: the Encoder is a class, but the Decoder is a struct. We assume that decoding is far more frequent, so we use a struct to avoid an unnecessary memory allocation.
In order to ensure high performance, the decoder makes assumptions about the provided buffers. For example, we will only accept a buffer that meets a minimum size.
With all of this out of the way, I’m going to dedicate the next post to discussing the actual implementation and semantics we have for FastPFor encoding.
In the code of the simdcomp library there is a 25KLOC file that does evil things to SIMD registers to get bit packing to work. When I looked at the code the first few dozen times, I had a strong desire to run away screaming. Luckily, this isn’t just some pile of complicated code, but a well-thought-out set of functions that are meant to provide optimal code for specific tasks. In particular, the code is specialized for each bit width that you may want to bit pack (0 .. 32). Even better, no one actually sat down to write it all out by hand; there is a Python script that generates the code.
The first step was to understand what exactly is going on in the code, and then see how we can translate that to C#. Even just a few years ago, that would have been either an impossible dream or required the use of a native library (with the associated PInvoke overhead). However, .NET today has a very robust intrinsics support, and vectors / SIMD instructions are natively supported.
I actually had a tougher challenge, since I want this code to run on x64 and ARM64 instances. The original code is x64 only, of course. The nice thing about SIMD support for .NET is that most of that can be done in a cross platform manner, with the JIT deciding what instructions will actually be emitted. There is still a very strong correlation between the vectorized code and the instructions that are being emitted, which means that I get both great control of what is going on and the appropriate abstraction. I was actually able to implement the whole thing without dropping to architecture-specific code, which makes me very happy.
Before we get any deeper, let’s take a simple challenge. We want to take an array of 32 integers and pack each one of them in 4 bits into an array of 16 bytes. Here is what the code will look like:
public void Pack32IntsWith4Bits(int* items, byte* output) {
var v = (Vector128<int>*)items;
var mask = Vector128.Create(0x0f);
var mixed =
v[0] & mask |
Vector128.ShiftLeft(v[1] & mask, 4) |
Vector128.ShiftLeft(v[2] & mask, 8) |
Vector128.ShiftLeft(v[3] & mask, 12) |
Vector128.ShiftLeft(v[4] & mask, 16) |
Vector128.ShiftLeft(v[5] & mask, 20) |
Vector128.ShiftLeft(v[6] & mask, 24) |
Vector128.ShiftLeft(v[7] & mask, 28);
mixed.AsByte().Store(output);
}
This is a bit dense, but let’s see if I can explain what is going on here.
We load a vector (4 items) from the array at the 0, 4, 8, 12, 16, 20, 24, and 28 offsets. For each one of those, we mask off the low four bits, shift them by the required offset and or all of them together. Note that this means the values are not packed sequentially: the first item’s four bits go in the first nibble of the first lane, but the second item’s bits go in the first nibble of the second lane, since each of the four lanes is packed independently.
The idea is that we are operating on 4 items at a time, reducing the total number of operations we have to perform. It may be easier to understand if you see those changes visually:
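The visual from the original post isn’t reproduced here, but the same layout can be expressed as a scalar sketch (plain C, no SIMD, written for this post): item 4*j+k contributes the j-th nibble of output lane k, which is exactly what the eight vector ORs above compute.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the vectorized 4-bit packing: output lane k accumulates
   the values items[4*j + k], each masked to 4 bits and shifted left by
   4*j — the same result the SIMD version produces. */
static void pack32_4bits_scalar(const uint32_t *items, uint8_t *out)
{
    uint32_t lanes[4] = {0, 0, 0, 0};
    for (int j = 0; j < 8; j++)
        for (int k = 0; k < 4; k++)
            lanes[k] |= (items[4 * j + k] & 0xF) << (4 * j);
    memcpy(out, lanes, 16);
}

static void unpack32_4bits_scalar(const uint8_t *in, uint32_t *items)
{
    uint32_t lanes[4];
    memcpy(lanes, in, 16);
    for (int j = 0; j < 8; j++)
        for (int k = 0; k < 4; k++)
            items[4 * j + k] = (lanes[k] >> (4 * j)) & 0xF;
}
```

Running the scalar version next to the vectorized one is also a handy way to verify the SIMD code during development.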
What is happening here, however, is that we are able to do this transformation in very compact code. That isn’t just an issue of high-level code, let’s take a look at the assembly instructions that this generates:
Pack32IntsWith4Bits(Int32*, Byte*)
L0000: vzeroupper
L0003: vmovupd xmm0, [rdx]
L0007: vpand xmm0, xmm0, [0x1ed3a030110]
L000f: vmovupd xmm1, [rdx+0x10]
L0014: vpslld xmm1, xmm1, 4
L0019: vpand xmm1, xmm1, [0x1ed3a030110]
L0021: vpor xmm0, xmm0, xmm1
L0025: vmovupd xmm1, [rdx+0x20]
L002a: vpslld xmm1, xmm1, 8
L002f: vpand xmm1, xmm1, [0x1ed3a030110]
L0037: vpor xmm0, xmm0, xmm1
L003b: vmovupd xmm1, [rdx+0x30]
L0040: vpslld xmm1, xmm1, 0xc
L0045: vpand xmm1, xmm1, [0x1ed3a030110]
L004d: vpor xmm0, xmm0, xmm1
L0051: vmovupd xmm1, [rdx+0x40]
L0056: vpslld xmm1, xmm1, 0x10
L005b: vpand xmm1, xmm1, [0x1ed3a030110]
L0063: vpor xmm0, xmm0, xmm1
L0067: vmovupd xmm1, [rdx+0x50]
L006c: vpslld xmm1, xmm1, 0x14
L0071: vpand xmm1, xmm1, [0x1ed3a030110]
L0079: vpor xmm0, xmm0, xmm1
L007d: vmovupd xmm1, [rdx+0x60]
L0082: vpslld xmm1, xmm1, 0x18
L0087: vpand xmm1, xmm1, [0x1ed3a030110]
L008f: vpor xmm0, xmm0, xmm1
L0093: vmovupd xmm1, [rdx+0x70]
L0098: vpslld xmm1, xmm1, 0x1c
L009d: vpand xmm1, xmm1, [0x1ed3a030110]
L00a5: vpor xmm0, xmm0, xmm1
L00a9: vmovdqu [r8], xmm0
L00ae: ret
I’m going to assume that you aren’t well versed in assembly, so let’s explain what is going on. This code contains zero branches; it does eight reads from memory, masks the elements, shifts them and ors them together.
The relevant instructions are:
vmovupd – read 4 integers to the register
vpand – binary and with a value (masking)
vpslld – shift to the left
vpor – binary or
vmovdqu – write 16 bytes to memory
There are no branches, nothing to complicate the code at all. This is about as tight as you can get, at the level of machine instructions.
Of particular interest here is the C# code. The entire thing is basically a couple of lines of code, and I could express the whole thing as a single expression in a readable manner. Let’s look at the C code, to compare what is going on:
static void pack32_4bits(const uint32_t *_in, __m128i *out) {
const __m128i *in = (const __m128i *)(_in);
__m128i OutReg, InReg;
const __m128i mask = _mm_set1_epi32((1U << 4) - 1);
uint32_t outer;
InReg = _mm_and_si128(_mm_loadu_si128(in), mask);
OutReg = InReg;
InReg = _mm_and_si128(_mm_loadu_si128(in + 1), mask);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 4));
InReg = _mm_and_si128(_mm_loadu_si128(in + 2), mask);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 8));
InReg = _mm_and_si128(_mm_loadu_si128(in + 3), mask);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 12));
InReg = _mm_and_si128(_mm_loadu_si128(in + 4), mask);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 16));
InReg = _mm_and_si128(_mm_loadu_si128(in + 5), mask);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 20));
InReg = _mm_and_si128(_mm_loadu_si128(in + 6), mask);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 24));
InReg = _mm_and_si128(_mm_loadu_si128(in + 7), mask);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 28));
_mm_storeu_si128(out, OutReg);
}
Note that both should generate the same machine code, but being able to rely on operator overloading means that I can now get a far more readable code.
From that point, the remaining task was to re-write the Python script so it would generate C# code, not C. In the next post I’m going to be talking more about the constraints we have and what we are actually trying to solve with this approach.
As I mentioned, I spent quite a lot of time trying to understand the mechanism behind how the FastPFor algorithm works. A large part of that was the fact that I didn’t initially distinguish the separate steps in the process. Finding the bit widths, packing the bits, metadata and exception management are all integral to how this works, and I tried to grok it all in one shot. I kept getting lost in 14KLOC files with dense SIMD instructions.
When looking at the code, it is also important to keep in mind that this is not a library that you are meant to pick up & use. This is a research project. As such, there was enough work done to show the results, but all the spit & polish associated with making a library ready for general consumption wasn’t done.
A simple example of that is that the code just assumes that the output buffer provided will be large enough (with no way to estimate upfront) and will overflow the buffer if that happens. Perfectly acceptable for research code, hell no for production usage, naturally.
In the same manner, the repository contains a lot of code. Most of that is used for comparing the FastPFor algorithm to various other options. That can make it hard to understand which are the core parts of the algorithm and what is effectively chaff.
To be honest, the hardest part for me, however, was figuring out the memory allocation patterns that are going on in here. There is a high usage of C++ containers, with implicit allocations and hidden memory management. Here is a piece of code that I had to struggle to understand for quite some time, for example:
template <class STLContainer>
static uint32_t *packmeupwithoutmasksimd(STLContainer &source, uint32_t *out,
const uint32_t bit) {
const uint32_t size = static_cast<uint32_t>(source.size());
*out = size;
out++;
if (source.size() == 0)
return out;
source.resize((source.size() + 32 - 1) / 32 * 32);
uint32_t j = 0;
for (; j + 128 <= size; j += 128) {
usimdpackwithoutmask(&source[j], reinterpret_cast<__m128i *>(out), bit);
out += 4 * bit;
}
for (; j < size; j += 32) {
fastpackwithoutmask(&source[j], out, bit);
out += bit;
}
out -= (j - size) * bit / 32;
source.resize(size);
return out;
}
This does bit-packing using SIMD without a mask (the naming convention is something that you really have to get used to; I don’t think I had encountered this sort of all-lowercase, run-together naming before).
Note the call to fastpackwithoutmask() in there; it assumes that the provided buffer has exactly 32 elements. A lot of the code in the project assumes that you are passed a buffer of a certain size and operates on that directly. That produces great results in terms of efficiency and code density. But when I read this code it was “obvious” to me that there is out-of-band work going on here. If the provided buffer isn’t perfectly aligned on a 32-element boundary, that method will read or write beyond the end of the buffer.
Look at the first source.resize() call, which handles exactly that: we grow the container so its size is aligned on a 32-element boundary. The second resize() call, at the end, reverts this operation.
For some SIMD instructions, the alignment of the buffers we are working on matters. This is handled using a custom allocator on the containers, which is not usually something that I would pay attention to.
In short, from my perspective, there is a lot of hidden behavior in non-trivial C++ code usage there that masks the actual behavior of the algorithm in question.
If you are looking at FastPFor (and if you care enough to look at this series of posts, you probably should), take into consideration that this code shouldn’t be applied as is, but should probably be adapted for your needs.
In the next post, I’m going to start discussing how we did just that for RavenDB.
FastPFor is an integer compression algorithm that was initially published in 2012. You can read the paper about it here: Decoding billions of integers per second through vectorization.
I’ve run into this algorithm many times in the past. You pretty much can’t be in the database arena and not run into that. It is an interesting paper, and it has a GitHub repository with the code, which is great. Except that I couldn’t figure out what was going on there.
I actually tried stepping through the code a bunch of times, and I always ended up getting lost. The code is in C++ and makes heavy use of templates, containers and non-trivial amounts of magic. To give some context, I gave up when I ran into code snippets like these:
/*assumes that integers fit in the prescribed number of bits */
void __fastpackwithoutmask6(const uint32_t *__restrict__ in,
uint32_t *__restrict__ out) {
Unroller<6>::PackNoMask(in, out);
}
/*assumes that integers fit in the prescribed number of bits */
void __fastpackwithoutmask7(const uint32_t *__restrict__ in,
uint32_t *__restrict__ out) {
Unroller<7>::PackNoMask(in, out);
}
static void __SIMD_fastpackwithoutmask7_32(const uint32_t *__restrict__ _in,
__m128i *__restrict__ out) {
const __m128i *in = reinterpret_cast<const __m128i *>(_in);
__m128i OutReg;
__m128i InReg = _mm_loadu_si128(in);
OutReg = InReg;
InReg = _mm_loadu_si128(++in);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 7));
InReg = _mm_loadu_si128(++in);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 14));
InReg = _mm_loadu_si128(++in);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 21));
InReg = _mm_loadu_si128(++in);
OutReg = _mm_or_si128(OutReg, _mm_slli_epi32(InReg, 28));
_mm_storeu_si128(out, OutReg);
++out;
OutReg = _mm_srli_epi32(InReg, 7 - 3);
InReg = _mm_loadu_si128(++in);
// this method goes on for another 100 lines...
}
The paper itself describes the algorithm, but not in a way that actually made sense to me. I tried looking at the code, and I basically noped out. Too complex for me to understand, it seems.
But I kept running into this, and we recently had a strong need for something similar. So I basically took the time to meditate on this for a while.
On a personal level, I realized that I was trying to understand how the whole thing worked by just stepping through the code and reading the paper. The problem was that I was mixing several different layers and wasn’t able to create the proper separation between them.
FastPFor builds upon the previously discussed simdcomp library; once you understand that, you can go through FastPFor itself in isolation. For this reason, I first talked about SIMD bit-packing and showed an example usage, before actually diving into how FastPFor itself works.
As a reminder, simdcomp provides routines for quick packing and unpacking of integers, nothing more. FastPFor is where the actual compression happens. Let’s see what is actually going on here. Consider the following list, which we previously discussed:
1871143144
4
4
4
4
4
4
4
4
4
4
7984
4
4
4
4
If we try to bit-pack this list, we’ll find that we aren’t gaining much. The first value is 31 bits in length, after all, and that means that we aren’t able to save much. We talked about Frame of Reference (FOR) as a way to handle that. We can treat every 128-number block as a block that has its own reference point and compute the maximum number of bits for that location. That would actually save us a lot of space, but it isn’t ideal. In the case above, all but 2 of the entries in the list can be packed in just 3 bits. That is where PFor comes into play, which stands for Patched Frame of Reference.
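A quick sanity check is to compute the number of bits each value needs. Here is a small C sketch of that calculation (the real code uses a leading-zero-count instruction rather than a loop):

```c
#include <stdint.h>

/* Number of bits needed to represent v (0 needs 0 bits in this convention). */
static int bits_needed(uint32_t v)
{
    int bits = 0;
    while (v != 0) { bits++; v >>= 1; }
    return bits;
}
```

For the list above: bits_needed(4) is 3, bits_needed(7984) is 13, and bits_needed(1871143144) is 31 — so naive bit packing must pay 31 bits per value for a list that is almost entirely 3-bit values.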
The key aspects of FastPFor are how it figures out what is the best size to use to pack the numbers, the way it detects what should be patched, and the manner in which it stores this.
The code boils down to a single function:
void getBestBFromData(const uint32_t *in, uint8_t &bestb, uint8_t &bestcexcept, uint8_t &maxb) {
uint32_t freqs[33];
for (uint32_t k = 0; k <= 32; ++k)
freqs[k] = 0;
for (uint32_t k = 0; k < BlockSize; ++k) {
freqs[asmbits(in[k])]++;
}
bestb = 32;
while (freqs[bestb] == 0)
bestb--;
maxb = bestb;
uint32_t bestcost = bestb * BlockSize;
uint32_t cexcept = 0;
bestcexcept = static_cast<uint8_t>(cexcept);
for (uint32_t b = bestb - 1; b < 32; --b) {
cexcept += freqs[b + 1];
uint32_t thiscost = cexcept * overheadofeachexcept +
cexcept * (maxb - b) + b * BlockSize +
8; // the extra 8 is the cost of storing maxbits
if (thiscost < bestcost) {
bestcost = thiscost;
bestb = static_cast<uint8_t>(b);
bestcexcept = static_cast<uint8_t>(cexcept);
}
}
}
What is going on here? This function accepts an array of integers (whose size is 128) and computes the best number of bits to use to pack it.
How does that work? The first thing this does is compute how many numbers we have for each bit width. This is done using the asmbits() call, which is essentially 32 - LeadingZeroCount(), i.e. the number of bits the value needs. The end result is a frequency map that we can use to estimate how many bits will be required to store the items from this block.
We start from the maximum number of bits that we need to store all the numbers in the block, and we go down, checking what would be the best cost (in terms of space). Since we need to store exceptions somewhere, we compute the cost of that as we go down in the list.
This function gives us:
bestb – the number of bits needed to pack all the items in the block that would result in the smallest output size
bestcexcept – the count of exceptions (numbers that do not fit into bestb bits)
maxb – the maximum number of bits that we need to store for this block
That is the key function, the one that is responsible for getting the best compression ratio.
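To see the cost model in action, here is a C re-implementation of that search (my own transcription, assuming the per-exception overhead constant is 8 bits, which is what the offset byte in the metadata suggests), run over a block shaped like our earlier list: 126 values that fit in 3 bits and two copies of 7984, which needs 13.

```c
#include <stdint.h>

enum { BLOCK_SIZE = 128, EXCEPTION_OVERHEAD = 8 /* assumed, in bits */ };

static int bits_of(uint32_t v) { int b = 0; while (v) { b++; v >>= 1; } return b; }

/* Mirrors getBestBFromData(): pick the bit width that minimizes the packed
   size plus the cost of storing the values that overflow that width. */
static void best_b(const uint32_t *in, uint8_t *bestb, uint8_t *bestc, uint8_t *maxb)
{
    uint32_t freqs[33] = {0};
    for (int k = 0; k < BLOCK_SIZE; k++) freqs[bits_of(in[k])]++;
    uint8_t b = 32;
    while (freqs[b] == 0) b--;
    *maxb = b; *bestb = b; *bestc = 0;
    uint32_t bestcost = (uint32_t)b * BLOCK_SIZE, cexcept = 0;
    for (int cand = b - 1; cand >= 0; cand--) {
        cexcept += freqs[cand + 1];
        uint32_t cost = cexcept * EXCEPTION_OVERHEAD + cexcept * (*maxb - cand)
                      + (uint32_t)cand * BLOCK_SIZE + 8; /* +8: storing maxbits */
        if (cost < bestcost) {
            bestcost = cost; *bestb = (uint8_t)cand; *bestc = (uint8_t)cexcept;
        }
    }
}
```

For that block, this picks bestb = 3 with 2 exceptions and maxb = 13: packing everything at 13 bits would cost 1,664 bits, while packing at 3 bits plus two patched exceptions costs only 428.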
What I found that made FastPFor challenging to understand is that it has a lot of state that you need to keep track of. What I described so far is a single block, but FastPFor operates over sets of numbers, and so needs to tie all of them together.
At any given point, FastPFor has:
The block that is being outputted
The metadata about the entire process
32 arrays of exceptions
The process interleaves all of them together in interesting ways, and I had a hard time figuring out how it all ties together.
Let’s talk about the way FastPFor processes a single block. We process the data in the array in blocks of 128 numbers at a time, like so:
for (const uint32_t *const final = in + length; (in + BlockSize <= final);
in += BlockSize) {
uint8_t bestb, bestcexcept, maxb;
getBestBFromData(in, bestb, bestcexcept, maxb);
*bc++ = bestb;
*bc++ = bestcexcept;
if (bestcexcept > 0) {
*bc++ = maxb;
std::vector<uint32_t, cacheallocator> &thisexceptioncontainer =
datatobepacked[maxb - bestb];
const uint32_t maxval = 1U << bestb;
for (uint32_t k = 0; k < BlockSize; ++k) {
if (in[k] >= maxval) {
// we have an exception
thisexceptioncontainer.push_back(in[k] >> bestb);
*bc++ = static_cast<uint8_t>(k);
}
}
}
out = packblockupsimd(in, out, bestb);
}
The first thing that happens is the computation of the best bits widths for the current block. Then, we use the bc value to record the metadata about the block.
That is an array of bytes with the following format:
Packed number of bits
Number of exceptions
If there are exceptions for this block, we also add:
Maximum number of bits
Offset of the exception in the block (repeated once per exception)
The metadata is shared for the entire encoding operation, across blocks.
You can see in the code that if bestcexcept (which holds the number of exceptions) is greater than zero, we find the appropriate exceptions buffer to use. That requires a better explanation.
The getBestBFromData() call gave us two important data points: the best number of bits to pack the numbers, and the maximum number. We are going to pack all the numbers in the block to the output, but what about the exceptions? Where do they go?
It turns out that we need to store the exceptions, but we don’t actually need to store max bits, only the difference between the best number of bits and the maximum. This is what thisexceptioncontainer is holding, the remaining bits for exceptions. It’s important to understand that this is done across blocks. So the thisexceptioncontainer value holds exceptions from many different blocks. That will turn out to be very important in just a little bit. We then scan the block and write the remaining bits to the container, and write to the metadata the offset of the value in the block. Since we are using blocks of 128 numbers, this is ensured to fit inside a byte.
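A tiny worked example, using values from our earlier list, may help. With bestb = 3 and maxb = 13, the low 3 bits of an exception like 7984 are packed with the block, while the high bits go to the container for width maxb - bestb = 10 (C sketch written for this post):

```c
#include <stdint.h>

enum { BESTB = 3 }; /* the block's chosen bit width in our example */

/* Split an exception value into the part packed with the block and the
   part stored in the shared (maxb - bestb)-bit exception container. */
static uint32_t low_part(uint32_t v)  { return v & ((1u << BESTB) - 1); }
static uint32_t high_part(uint32_t v) { return v >> BESTB; }

/* On decode, out[pos] |= high << bestb restores the original value. */
static uint32_t reconstruct(uint32_t low, uint32_t high)
{
    return low | (high << BESTB);
}
```

For 7984, the 10-bit container receives 998 (7984 >> 3), and the decoder ORs it back over the 3 packed low bits.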
The last step that we do for the block is to call packblockupsimd(), which ends up calling simdpack() and packs all the numbers from the block to bestb bits in the output.
It’s really important to understand that we still have two data items that haven’t been written. The first is the metadata for the encoding process (bits in blocks, exceptions offsets, etc). The second is the set of exceptions themselves.
This process repeats itself for each block, until the end of the buffer. It is at this point that we need to write the remaining data to the output. Here is what the code looks like:
const uint32_t bytescontainersize =
static_cast<uint32_t>(bc - &bytescontainer[0]);
*(out++) = bytescontainersize;
memcpy(out, &bytescontainer[0], bytescontainersize);
uint8_t* pad8 = (uint8_t*)out + bytescontainersize;
out += (bytescontainersize + sizeof(uint32_t) - 1) / sizeof(uint32_t);
while (pad8 < (uint8_t*)out)
*pad8++ = 0; // clear padding bytes
uint32_t bitmap = 0;
for (uint32_t k = 2; k <= 32; ++k) {
if (datatobepacked[k].size() != 0)
bitmap |= (1U << (k - 1));
}
*(out++) = bitmap;
for (uint32_t k = 2; k <= 32; ++k) {
if (datatobepacked[k].size() > 0)
out = packmeupwithoutmasksimd(datatobepacked[k], out, k);
}
What is going on? This is dense, and it takes a while to unpack (pun intended) what is going on here.
First, we write the size of the metadata and then we copy the metadata that we have to the output. That is the description of all the blocks that we just went over. Then, we run over the set of exceptions that we gathered. Remember, the datatobepacked is an array that holds lists of exception data for each bit range that we have to store. We iterate over all the bit widths where we wrote exception data and generate a bitmap. That will help us later during the decoding process.
Finally, we run over the same exceptions and write them out. Note that we do that using the same simd bit-packing approach as the rest of the code.
The end result is that we have the following format:
For decoding, we start from the metadata offset and jump to that. Then we read the exceptions bitmap. That tells us which exception lists we have and how many bits each of them uses. In the image above, you can see that we have three exception lists, for 4 bits, 12 bits and 19 bits.
We use the simd bit-packing to decode those values into memory. We’ll need them shortly. Now, we start iterating over the metadata; each block has an overhead of a minimum of two metadata bytes (number of bits and number of exceptions). Let’s assume that we don’t have any exceptions in the block. In that case, the process of decoding is simple. Just do the unpacking from bits to numbers and you are done.
If we have exceptions, on the other hand, we have to deal with them. At that point, the next metadata byte would contain the maximum number of bits for the exceptions in this block. Using that and the normal number of bits, we can tell where the extra bits are located. Here is the relevant code:
for (uint32_t run = 0; run < nvalue / BlockSize; ++run, out += BlockSize) {
const uint8_t b = *bytep++;
const uint8_t cexcept = *bytep++;
in = unpackblocksimd(in, out, b);
if (cexcept > 0) {
const uint8_t maxbits = *bytep++;
if (maxbits - b == 1) {
for (uint32_t k = 0; k < cexcept; ++k) {
const uint8_t pos = *(bytep++);
out[pos] |= static_cast<uint32_t>(1) << b;
}
} else {
std::vector<uint32_t, cacheallocator>::const_iterator &exceptionsptr =
unpackpointers[maxbits - b];
for (uint32_t k = 0; k < cexcept; ++k) {
const uint8_t pos = *(bytep++);
out[pos] |= (*(exceptionsptr++)) << b;
}
}
}
}
Note that there is a special case here if the difference between the block bits and the maximum number of bits is one. In that case, we don’t need to store the exception data at all; we already know what the value of the extra bit is (it’s one, after all). So we can compute it once and set it in the output.
For scenarios where we can’t tell from the bit width alone, we read the relevant exception array based on the difference between the bits in the block and the exceptions bits. That gives us an array that is shared across blocks. The idea is that the metadata contains the offset in the block, and we read from the relevant array one item at a time. This two-step process will result in us setting the right value and getting ready for the next call, which may be done in a different block.
Note that the exception bits buffers only care about the number of bits, not where they come from. So two blocks with (bestb: 4, maxb: 7) and (bestb: 7, maxb: 10) will both go to the 3 bits container.
Okay, this blog post is getting long enough, so I think that would be it. Hopefully, it will make it easier to understand exactly how the FastPFor format is working. I’m going to be talking more about FastPFor and the format implications in the next post, then move on to actually using that in C#.
I talked a bit before about the nature of bit packing and how the simdcomp library isn’t actually doing compression. Why do I care about that, then?
Because the simdcomp library provides a very useful building block. A simple set of functions that allows you to very quickly pack and unpack bits. That is important, since those operations are fast enough that we can treat them as effectively free. Once that exists, we can now start finding interesting usages for those.
Let’s assume that we have the following functions:
Pack(Span<uint> input, Stream output, int bit);
Unpack(Stream input, Span<uint> output, int bit);
They take a set of numbers and the number of bits, and encode or decode them as needed.
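Since those two routines carry the whole post, here is a scalar C sketch of what such a pair could look like (writing to a byte buffer rather than a Stream, with names of my own; this is not the simdcomp API):

```c
#include <stdint.h>
#include <stddef.h>

/* Sequentially write each value using exactly `bit` bits. Assumes every
   value fits in `bit` bits and the output buffer is zeroed and large enough. */
static void pack_bits(const uint32_t *in, size_t count, uint8_t *out, int bit)
{
    size_t pos = 0; /* current bit position in the output */
    for (size_t i = 0; i < count; i++, pos += bit)
        for (int b = 0; b < bit; b++)
            if (in[i] & (1u << b))
                out[(pos + b) / 8] |= (uint8_t)(1u << ((pos + b) % 8));
}

static void unpack_bits(const uint8_t *in, size_t count, uint32_t *out, int bit)
{
    size_t pos = 0;
    for (size_t i = 0; i < count; i++, pos += bit) {
        out[i] = 0;
        for (int b = 0; b < bit; b++)
            if (in[(pos + b) / 8] & (1u << ((pos + b) % 8)))
                out[i] |= 1u << b;
    }
}
```

This bit-at-a-time version is deliberately naive; the whole point of simdcomp is doing the same transformation a full vector at a time.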
Because those methods are fast, it means that we can play with various interesting designs. For example, let’s take dictionary compression. Assume that we have a set of values, such as country names. This looks like this:
[ "United States", "China", "Japan", "India", "United States", "Brazil", "Germany", "India", "France", "China", "Japan", "Brazil", … ]
This array contains thousands or millions of entries. We want to encode that in an efficient manner, how can we do that? Here is a simple way to utilize the speed of bit-packing to do something useful:
void Encode(List<string> countries, Stream output)
{
var nums = new List<int>();
var dic = new Dictionary<string, int>();
foreach (var country in countries)
{
if(dic.TryGetValue(country, out var index) == false)
{
dic[country] = index = dic.Count;
}
nums.Add(index);
}
var bits = 32 - BitOperations.LeadingZeroCount((uint)dic.Count);
var br = new BinaryWriter(output);
br.Write(dic.Count);
foreach(var country in dic.Keys)
{
br.Write(country);
}
br.Write(nums.Count);
Pack(nums, output, bits);
}
What is going on here?
We first iterate over the set of countries, finding the unique names and giving an index for each one of them.
Along the way, we produce an integer array of the indexes of country names.
Then we compute the maximum number of bits that we have to use to store those indexes.
To the output, we write the number of unique country names, the names themselves and then the bit-packed set of indexes.
Let’s assume that we have 30 unique values in this list, but the list itself is 5 million items in size. What is the actual cost here?
We’ll assume an average country name of 8 bytes, to make the math easy. That means that storing the raw list of countries would cost us about 38MB, if we did that using the most obvious approach.
Using the code above, assuming 30 unique values, we’ll have 5 million values with 5 bits each, leading to a total cost of about 3MB, instead of 38MB.
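The arithmetic can be checked with a couple of helpers (a C sketch, names mine):

```c
#include <stdint.h>

/* Bits needed to index `unique` distinct values (32 minus the leading
   zero count, matching the computation in the Encode() example above). */
static int index_bits(uint32_t unique)
{
    int bits = 0;
    while (unique != 0) { bits++; unique >>= 1; }
    return bits;
}

/* Total bytes needed to bit-pack `count` indexes at `bit` bits each. */
static uint64_t packed_bytes(uint64_t count, int bit)
{
    return (count * (uint64_t)bit + 7) / 8;
}
```

index_bits(30) is 5, and packed_bytes(5,000,000, 5) comes out to 3,125,000 bytes — the roughly 3MB quoted above, against about 40 million bytes (~38MB) for 5 million 8-byte strings.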
The nice thing about this approach is that once you have fast bit-packing routines, you can now start using them in all sorts of interesting ways.
The reason I talked about the topic before starting to discuss FastPFor is that it makes a lot more sense to understand how bit-packing can be applied separately from FastPFor, since that is just making interesting use of the capability.