In this series so far, we have reduced the storage cost of key/value lookups considerably, and in the last post we significantly optimized the process of encoding the keys and values. This is great, but the toughest challenge is still ahead of us, because as much as encoding efficiency matters, the absolute majority of the cost is in doing lookups. This is the most basic operation, one that we do billions of times a second. Any amount of effort we spend here will be worth it. That said, let’s look at the decoding process we have right now. It was built to be understandable above all else, so it is a good starting point.
private static (long Key, long Val) Decode(Span<byte> buffer, ushort offset)
{
var offsetInPage = ((offset & 0xFFF0) >> 3);
var (keyLen, valLen) = (offset & 0xF) switch
{
1 => (1, 1),
2 => (2, 2),
3 => (3, 3),
4 => (4, 4),
5 => (2, 1),
6 => (2, 3),
7 => (2, 4),
8 => (3, 2),
9 => (3, 4),
10 => (3, 5),
11 => (4, 2),
12 => (4, 3),
13 => (5, 4),
14 => (5, 5),
15 => (4, 5),
_ => (buffer[offsetInPage] >> 4, buffer[offsetInPage++] & 0xF), // note, inc offsetInPage
};
long k = 0;
long v = 0;
var entry = buffer.Slice(offsetInPage);
entry[..keyLen].CopyTo(MemoryMarshal.AsBytes(new Span<long>(ref k)));
entry[(keyLen)..(keyLen + valLen)].CopyTo(MemoryMarshal.AsBytes(new Span<long>(ref v)));
return (k, v);
}
What this code does is accept a buffer and an offset into the buffer. But the offset isn’t just a number, it is composed of two values. The top 12 bits contain the offset in the page, but since we use 2-byte alignment for the entry position, we can assume a zero bit at the bottom of that value. That is why we compute the actual offset in the page by clearing the bottom four bits and then shifting right by three bits. That extracts the actual offset in the page (usually a 13-bit value) using just 12 bits. The bottom four bits of the offset are the indicator for the key and value lengths. There are 15 known values, which we computed based on probabilities, and one value reserved to say: rare key/val length combination, the actual sizes are stored as the first byte in the entry.
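To make that concrete, here is a small worked trace of the offset decoding (the numbers are illustrative, not taken from the real dataset):

// offset = 51,113 (0xC7A9)
// indicator    = offset & 0xF           = 9      -> case 9 maps to keyLen = 3, valLen = 4
// offsetInPage = (offset & 0xFFF0) >> 3 = 6,388  -> the entry starts 6,388 bytes into the page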
Note that in the code, we handle that scenario by reading the key and value lengths (stored as two nibbles in the first byte) and incrementing the offset in the page. That means that we skip past the header byte in those rare situations.
The rest of the code is basically copying the key and value bytes to the relevant variables, taking advantage of partial copy and little-endian encoding.
The code in question takes 512 bytes and has 23 branches. In terms of performance, we can probably do much better, but the code is clear in what it is doing, at least.
The first thing I want to try is to replace the switch statement with a lookup table, just like we did before. Here is what the new version looks like:
static ReadOnlySpan<byte> DecodeTable => new byte[16] {
(0 /*rare*/), (1 << 4 | 1), (2 << 4 | 2), (3 << 4 | 3),
(4 << 4 | 4), (2 << 4 | 1), (2 << 4 | 3), (2 << 4 | 4),
(3 << 4 | 2), (3 << 4 | 4), (3 << 4 | 5), (4 << 4 | 2),
(4 << 4 | 3), (5 << 4 | 4), (5 << 4 | 5), (4 << 4 | 5),
};
private static (long Key, long Val) Decode(Span<byte> buffer, ushort offset) {
var offsetInPage = ((offset & 0xFFF0) >> 3);
var indicator = DecodeTable[offset & 0xF];
if (indicator == 0) {
indicator = buffer[offsetInPage++];
}
var keyLen = indicator >> 4;
var valLen = indicator & 0xF;
long k = 0;
long v = 0;
var entry = buffer.Slice(offsetInPage);
entry[..keyLen].CopyTo(MemoryMarshal.AsBytes(new Span<long>(ref k)));
entry[(keyLen)..(keyLen + valLen)].CopyTo(
MemoryMarshal.AsBytes(new Span<long>(ref v)));
return (k, v);
}
The size of the function dropped by almost half, and we are down to just 7 branches. There are also a couple of calls to the memory copy routines that weren’t inlined. In the encoding phase, we reduced branches due to bounds checks by using raw pointers, and we skipped the memory copy routines by copying a fixed-size value at varied offsets, which let us get the data properly aligned. In this case, we can’t really do the same. One thing that we have to be aware of is the following situation:
In other words, we may have an entry at the very end of the page. If we unconditionally read 8 bytes, we may read past the end of the buffer. That is not something that we can do. In the Encode() case, we know that the caller gave us a buffer large enough to accommodate the largest possible size, so that isn’t an issue. That complicates things, sadly, but we can go the other way around.
The Decode() function will always be called on an entry, and that is part of the page. The way we place entries means that we are starting at the top and moving down. The structure of the page means that we can never actually place an entry below the first 8 bytes of the page. That is where the header and the offsets array are going, after all. Given that, we can do an unconditional read backward from the entry. As you can see in the image below, we are reading some data that we don’t care about, but this is fine, we can fix it later, and without any branches.
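Here is a small sketch of the idea, using a made-up entry position, matching the ReadBackward helper shown below:

// Suppose the value is 3 bytes long and starts at page offset 100:
//   shift = 8 - 3 = 5
//   read the 8 bytes at offsets 95..102 as a little-endian long
//     -> bytes 95..99 land in the low 5 bytes, bytes 100..102 in the high 3 bytes
//   l >>>= 5 * 8 discards the 5 unwanted low bytes, leaving only the value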
The end result is that we can have the following changes:
private unsafe static (long Key, long Val) Decode(byte* buffer, ushort offset)
{
var offsetInPage = ((offset & 0xFFF0) >> 3);
var indicator = DecodeTable[offset & 0xF];
if (indicator == 0)
{
indicator = buffer[offsetInPage++];
}
var keyLen = indicator >> 4;
var valLen = indicator & 0xF;
long k = ReadBackward(buffer, offsetInPage, keyLen);
long v = ReadBackward(buffer, offsetInPage + keyLen, valLen);
return (k, v);
long ReadBackward(byte* buffer, int offsetInPage, int len)
{
var shift = 8 - len;
var l = Unsafe.ReadUnaligned<long>(ref buffer[offsetInPage - shift]);
l >>>= shift * 8;
return l;
}
}
I changed the code to use a raw pointer, avoiding the bounds checks that we already reasoned about. Most interesting is the ReadBackward function. This is a local function that was properly inlined during JIT compilation; it implements the backward reading of the value. Here is what the assembly looks like:
L0041: mov r10d, r9d
L0044: neg r10d
L0047: add r10d, 8
L004b: mov r11d, r8d
L004e: sub r11d, r10d
L0051: movsxd r11, r11d
L0054: mov r11, [r11+rdx]
L0058: shl r10d, 3
L005c: shrx r10, r11, r10
With this in place, we are now at 133 bytes and a single branch operation. That is pretty awesome, but we can do better still. Consider the following code (explanations to follow):
static ReadOnlySpan<byte> DecodeTable => new byte[16] {
( 1 << 3), // special!
( 1 << 4 | 1),
( 2 << 4 | 2),
( 3 << 4 | 3),
( 4 << 4 | 4),
( 2 << 4 | 1),
( 2 << 4 | 3),
( 2 << 4 | 4),
( 3 << 4 | 2),
( 3 << 4 | 4),
( 3 << 4 | 5),
( 4 << 4 | 2),
( 4 << 4 | 3),
( 5 << 4 | 4),
( 5 << 4 | 5),
( 4 << 4 | 5),
};
public unsafe static void Decode(byte* buffer, ushort offset, out long key, out long val)
{
var offsetInPage = ((offset & 0xFFF0) >> 3);
var decoderByte = DecodeTable[offset & 0xF];
var offload = decoderByte & 8;
byte decodeParts = (byte)((buffer[offsetInPage] << 8 | decoderByte) >> offload);
int keyLen = (decodeParts >> 4);
int valLen = (decodeParts & 0xF);
offload >>= 3;
key = ReadBackward(buffer, offsetInPage + offload, keyLen);
val = ReadBackward(buffer, offsetInPage + offload + keyLen, valLen);
long ReadBackward(byte* buffer, int offsetInPage, int len)
{
var shift = 8 - len;
var l = Unsafe.ReadUnaligned<long>(ref buffer[offsetInPage - shift]);
l >>>= shift * 8;
return l;
}
}
Note that the first element in the table here is different: it now sets the 4th bit (its value is 1 &lt;&lt; 3). The bytes in the table are structured as two nibbles, and no other value in the table sets that bit, which means we can use it as a flag.
Indeed, what we are doing is using the decoder byte to figure out what sort of shift we want. We have the byte from the table and the byte from the buffer, and we use the fact that masking the table byte with 8 gives (just for this special value) the value 8. We can then use that to select the appropriate byte. If we have an offloaded byte, we’ll shift the combined value by 8, getting the byte from the buffer. For any other value, we’ll get 0 as the shift amount, resulting in us getting the value from the table. That gives us a function with zero branches, at 141 bytes.
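To see how that selection plays out, here is a worked trace of the two cases (the nibble values are illustrative):

// Common case: indicator nibble = 2 (keyLen = 2, valLen = 2)
//   decoderByte = 0x22, offload = 0x22 & 8 = 0
//   decodeParts = (byte)((buffer[offsetInPage] << 8 | 0x22) >> 0) = 0x22   // the cast to byte drops the buffer byte
//   offload >>= 3 leaves 0, so we read the entry at offsetInPage as-is
// Rare case: indicator nibble = 0
//   decoderByte = 0x08, offload = 0x08 & 8 = 8
//   decodeParts = (byte)((buffer[offsetInPage] << 8 | 0x08) >> 8) = buffer[offsetInPage]
//   offload >>= 3 leaves 1, skipping past that header byte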
I spent a lot of time thinking about this, so now that we have those two approaches, let's benchmark them. The results were surprising:
| Method | Mean | Error | StdDev |
|------------------------ |-----------:|---------:|---------:|
| DecodeBranchlessShifts | 2,107.1 ns | 20.69 ns | 18.34 ns |
| DecodeBranchy | 936.2 ns | 1.89 ns | 1.68 ns |
It turns out that the slightly smaller code with the branch is able to beat the branchless code. When looking into what they are doing, I think that I can guess why. Branches aren’t a huge problem if they are predictable, and in our case, the whole point of all of this is that the offload process, where we need to go to the entry to get the sizes, is meant to be a rare event. In branchless code, on the other hand, you have to do extra work several times to avoid a branch (like shifting the value from the buffer up and maybe shifting it down again, etc.).
You can’t really argue with a difference like that. We also tried an AVX version, to see if this would have better performance. It turns out that there is really no way for us to beat the version with the single branch. Everything else was at least twice as slow.
At this point, I believe that we have a winner.
In my previous post, I showed how we use the nibble offload approach to store the size of entries in space that would otherwise be unused. My goal in that post was clarity, so I tried to make sure that the code was as nice to read as possible. In terms of machine code, that isn’t really ideal. Let’s talk about how we can make it better. Here is the starting version:
private static (int EntrySize, byte SizeNibble) Encode(Span<byte> buffer, long key, long val)
{
var keyLen = (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)key) / 8);
var valLen = (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)val) / 8);
byte inlined = (keyLen, valLen) switch
{
(1, 1) => 1,
(2, 2) => 2,
(3, 3) => 3,
(4, 4) => 4,
(2, 1) => 5,
(2, 3) => 6,
(2, 4) => 7,
(3, 2) => 8,
(3, 4) => 9,
(3, 5) => 10,
(4, 2) => 11,
(4, 3) => 12,
(5, 4) => 13,
(5, 5) => 14,
(4, 5) => 15,
// note the trickery here with incrementing the write offset
_ => 0
};
var writeOffset = 0;
if (inlined == 0) // cannot fit, need to write to entry itself
{
writeOffset = 1;
buffer[0] = (byte)(keyLen << 4 | valLen);
}
MemoryMarshal.AsBytes(new Span<long>(ref key))[..keyLen].CopyTo(buffer[writeOffset..]);
MemoryMarshal.AsBytes(new Span<long>(ref val))[..valLen].CopyTo(buffer[(keyLen + writeOffset)..]);
return (writeOffset + keyLen + valLen, inlined);
}
This code generates a function whose size exceeds 600 bytes and contains 24(!) branches. I already talked before about why this is a problem; there is a good discussion of the details of branches and their effect on performance here. In short, fewer branches are better. And when looking at machine instructions, the smaller the function, the better off we are.
The first thing to do then is to remove the switch statement and move to a table-based approach. Given that this is a lookup of a small set of values, we can precompute all the options and just do a lookup like that. Here is what the code looks like:
static ReadOnlySpan<byte> EncodingTable => new byte[64]
{
1, 0, 0, 0, 0, 0, 0, 0,
5, 2, 6, 7, 0, 0, 0, 0,
0, 8, 3, 9, 10, 0, 0, 0,
0, 11, 12, 4, 15, 0, 0, 0,
0, 0, 0, 13, 14, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0
};
static (int EntrySize, byte SizeNibble) Encode(Span<byte> buffer, long key, long val)
{
var keyLen = (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)key) / 8);
var valLen = (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)val) / 8);
var inlined = EncodingTable[(keyLen-1)*8 + valLen-1];
var writeOffset = 0;
if (inlined == 0) // cannot fit, need to write to entry itself
{
writeOffset = 1;
buffer[0] = (byte)(keyLen << 4 | valLen);
}
MemoryMarshal.AsBytes(new Span<long>(ref key))[..keyLen].CopyTo(buffer[writeOffset..]);
MemoryMarshal.AsBytes(new Span<long>(ref val))[..valLen].CopyTo(buffer[(keyLen + writeOffset)..]);
return (writeOffset + keyLen + valLen, inlined);
}
This is already a win, since we are now at about half the code size (340 bytes) and there are just 7 branches in the function. Let’s take a look at the code and its assembly:
Code:
if (inlined == 0) { writeOffset = 1; buffer[0] = (byte)(keyLen << 4 | valLen); }
Assembly:
L0065: test r14d, r14d
L0068: jne short L0081
L006a: mov r15d, 1
L0070: mov ecx, ebx
L0072: shl ecx, 4
L0075: or ecx, ebp
L0077: test edi, edi
L0079: je L014f
L007f: mov [rsi], cl
As you can see, in the assembly, we first test the value, and if it isn’t zero, we jump past the if statement. If it is 0, we compute a shift left by 4 and then or the values, then we do another check and finally set the value in the buffer.
Where does that extra check come from? There is no if there.
Well, that is the bounds checking that comes with using Span. In fact, most of the checks here are because of Span or because of the safe intrinsics that are used.
Let’s get rid of this. There are actually several interesting iterations in the middle, but let’s jump directly to the final result I have. It’s under 10 lines of code, and it is quite beautiful.
static ReadOnlySpan<byte> EncodingTable => new byte[64]
{
1, 16, 16, 16, 16, 16, 16, 16,
5, 2, 6, 7, 16, 16, 16, 16,
16, 8, 3, 9, 10, 16, 16, 16,
16, 11, 12, 4, 15, 16, 16, 16,
16, 16, 16, 13, 14, 16, 16, 16,
16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 16, 16, 16, 16, 16,
};
unsafe static void Encode(byte* buffer, long key, long val, out int entrySize, out byte nibble)
{
var keyLen = (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)key) / 8);
var valLen = (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)val) / 8);
var inlined = Unsafe.Add(ref MemoryMarshal.GetReference(EncodingTable), (keyLen - 1) * 8 + valLen - 1);
nibble = (byte)(inlined & 0xF);
var writeOffset = inlined >> 4;
buffer[0] = (byte)(keyLen << 4 | valLen);
Unsafe.WriteUnaligned(ref buffer[writeOffset], key);
Unsafe.WriteUnaligned(ref buffer[writeOffset + keyLen], val);
entrySize = writeOffset + keyLen + valLen;
}
I’m sure that you’ll look at this versus the previous iterations and go… huh?! I mean, even the reference table is different now.
Let’s analyze what is going on here in detail. The first thing you’ll note that changed is the method signature. Before, we returned a tuple of results; now we use out parameters. This turns out to generate slightly less code, so I went with that approach instead.
Next, computing the number of bytes we need to copy is the same. Once we have the sizes of the key and the value, we fetch the relevant instruction from the reference table. We do that in a way that skips the bounds checking on the table, since we know that we cannot exceed the length of the array.
Unlike before, we have new values in the table: where before we had 0 for entries that we didn’t care about, now we put 16. That means that we need to mask it off when we set the nibble parameter. The reason for this is that we use the same value to compute the writeOffset. For the cases where we can offload the nibble, the shift by 4 clears everything, leaving us with a zero. For the values we cannot offload, we get 16, and shifting by 4 gives us 1.
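Here is how the two cases shake out (the length values are just an illustration):

// Offloadable case: keyLen = 3, valLen = 4 -> table entry 9
//   nibble      = 9 & 0xF = 9   (returned to the caller, stored in the offset)
//   writeOffset = 9 >> 4  = 0   (the key overwrites the unconditionally written header byte)
// Non-offloadable case: keyLen = 6, valLen = 2 -> table entry 16
//   nibble      = 16 & 0xF = 0  (the caller sees 0, meaning the sizes live in the entry itself)
//   writeOffset = 16 >> 4  = 1  (the key is written after the header byte)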
The reason we do it this way is that we do not use any conditional in this method. We unconditionally set the first byte of the buffer, because it is cheaper to do the work and discard that than check if it needs to be done.
Finally, we previously used Span.CopyTo() to move the data. That works, but it turns out that it is not ideal for very small writes, like the ones we have here. At the same time, we write a different size each time; how can we manage that?
The answer is that we know that the buffer we are passed is large enough to contain the largest size possible. So we don’t need to worry about bound checking, etc.
We take advantage of the fact that the data is laid out in little-endian format and just write the whole 8 bytes of the key to the buffer at the right location, possibly shifted by the computed writeOffset. We then write the value immediately following the computed key length. The idea is that the later write overwrites part of what we just wrote (because those parts turned out not to be interesting). Using this approach, we were able to drop the code for this function to 114 bytes(!). Even with the encoding table, that is under three cache lines for the whole thing. That is really small.
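As a concrete sketch of those overlapping writes (with a made-up key and value):

// key = 0x1234 (keyLen = 2), val = 0x0A0B0C (valLen = 3) -> nibble = 6, writeOffset = 0
//   buffer[0] = 0x23 is written unconditionally, then immediately overwritten:
//   WriteUnaligned(buffer + 0, key): 34 12 00 00 00 00 00 00
//   WriteUnaligned(buffer + 2, val): 34 12 0C 0B 0A 00 00 00 00 00
//   entrySize = 0 + 2 + 3 = 5, so only the first 5 bytes count as the entry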
There are also no conditionals or branches throughout the process. This is a piece of code that is ready and willing to be inlined anywhere. The amount of effort to understand what is going on here is balanced against how much this function can achieve in its lifetime.
For reference, here is the assembly of the encoding process:
Main.Encode(Byte*, Int64, Int64, Int32 ByRef, Byte ByRef)
L0000: push rdi
L0001: push rsi
L0002: xor eax, eax
L0004: lzcnt rax, rdx
L0009: shr rax, 3
L000d: neg eax
L000f: add eax, 8
L0012: xor r10d, r10d
L0015: lzcnt r10, r8
L001a: shr r10, 3
L001e: neg r10d
L0021: add r10d, 8
L0025: lea r11d, [r10+rax*8-9]
L002a: movsxd r11, r11d
L002d: mov rsi, 0x216751e0b80
L0037: movzx r11d, byte ptr [r11+rsi]
L003c: mov esi, r11d
L003f: and esi, 0xf
L0042: mov rdi, [rsp+0x38]
L0047: mov [rdi], sil
L004a: sar r11d, 4
L004e: mov esi, eax
L0050: shl esi, 4
L0053: or esi, r10d
L0056: mov [rcx], sil
L0059: movsxd rsi, r11d
L005c: mov [rcx+rsi], rdx
L0060: add eax, r11d
L0063: movsxd rdx, eax
L0066: mov [rdx+rcx], r8
L006a: add eax, r10d
L006d: mov [r9], eax
L0070: pop rsi
L0071: pop rdi
L0072: ret
The EncodingTable deserves a better explanation. I generated it using the following code:
for (int i = 0; i < 8; i++)
{
for (int j = 0; j < 8; j++)
{
byte inlined = (i + 1, j + 1) switch
{
(1, 1) => 1,
(2, 2) => 2,
(3, 3) => 3,
(4, 4) => 4,
(2, 1) => 5,
(2, 3) => 6,
(2, 4) => 7,
(3, 2) => 8,
(3, 4) => 9,
(3, 5) => 10,
(4, 2) => 11,
(4, 3) => 12,
(5, 4) => 13,
(5, 5) => 14,
(4, 5) => 15,
// note the trickery here with incrementing the write offset
_ => 16
};
Console.Write($"{inlined,4}, ");
}
Console.WriteLine();
}
In other words, I basically wrote out all the options and generated the appropriate value for each one of them. I’m using the value of 16 here so that I can get to the right result using some bit shifts, without any conditionals. So instead of doing many conditions, I replaced this with a (small) table lookup.
In my next post, I’m going to try to handle the same process for decoding.
Moving to nibble encoding gave us a measurable improvement in the density of the entries in the page. The problem is that we have pretty much run out of room to continue doing so. We are currently using a byte per entry to hold the size of the entry (as two nibbles, of 4 bits each). You can’t really go lower than that.
Let’s review again what we know about the structure of the data: we have an 8KB page with three sections, a fixed-size header and variable-size offsets and entries arrays. Here is what this looks like:
This is called a slotted page design. The idea is that the offsets array at the bottom of the page maintains the sort order of the entries, while we write the entries themselves from the top of the page. When we need to sort the entries, we just need to touch the offsets array (shown in yellow in the image).
Given that we are talking about size and density, we spent a lot of time trying to reduce the size of the entries, but can we do something with the header or the offsets? The header is just 4 bytes right now, two shorts that denote the locations of the bottom and the top positions in the page. Given that the page is 8KB in size, we have to use a 16-bit integer to cover the range. For offsets, the situation is the same. We have to be able to point to the entry location on the page, and that means that we have to reach 8KB. So the offsets are also 16-bit integers and take two bytes each.
In other words, there is a hidden overhead of 2 bytes per entry that we didn’t even consider. In the case of our latest success, we were able to push 759 entries into the page, which means that we are actually using 18.5% of the page just to hold the offsets of the entries. That is 1.48 KB that is being used.
The problem is that we need to use this. We have to be able to point to an entry anywhere in the page, which means that we have to reach 0 .. 8192. The minimum size we can use is 16 bits or two bytes.
Or do we?
16 bits gives us a range of 0 .. 65,535, after all. That is far in excess of what we need. We could use a 64KB page, but there are other reasons to want to avoid that.
To cover 8KB, we only need 13 bits, after all. For that matter, we can do even better: if we decide that an entry should be aligned on a 2-byte boundary, we can address the entire page with just 12 bits.
That means that we have 4 whole free bits to play with. The first idea is to change the offsets array from 16 bits ints to 12 bits ints. That would save us 380 bytes at 759 entries per page. That is quite a lot. Unfortunately, working with bits in this manner would be super awkward. We are doing a lot of random access and moves while we are building the page. It is possible to do this using bits, but not fun.
So we can set things up so we have a nibble free to use. We just used nibbles to save on the cost of variable size ints, to great success.
However, we don’t need just a nibble, we need two of them. We need to store the size of the key and the value in bytes. Actually, we don’t need two nibbles. The size of the key and the value maxes at 8 bytes, after all. We can encode that in 3 bits. In other words, we need 6 bits to encode this information.
We only have 4 bits, however. It is a really nice idea, though, and I kept turning it over in my head, trying to come up with all sorts of clever ways to figure out how we can push 64 values into 4 bits. The impact of that would be pretty amazing.
Eventually, I realized that it is fairly easy to prove, using math, that there is no way to do so: there are 8 possible key sizes and 8 possible value sizes, so 64 combinations, while 4 bits can distinguish only 16 of them. Faced with this failure, I realigned my thinking and found a solution. I don’t need to have a perfect answer, I can have a good one.
4 bits give me a range of 16 values (out of the possible 64). If I give up on trying to solve the whole problem, can I solve a meaningful part of it?
And I came up with the following idea: we can do a two-stage approach. We’ll map the most common 15 combinations of key and value sizes to those 4 bits. The last value will be a marker that says you have to go and look elsewhere.
Using just the data in the offset, I’m able to figure out what the location of the entry in the page is as well as the size of the key and value for most cases. For the (hopefully rare) scenarios where that is not the case, we fall back to storing the size information as two nibbles preceding the entry data.
This is a pretty neat idea, even if I say so myself, and it has a good chance to allow us to save about 1 byte per entry in the common case. In fact, I tested that and about 90% of the cases in my test case are covered by the top 15 cases. That is a pretty good indication that I’m on the right track.
All of that said, let’s look at how this looks in code:
private static (int EntrySize, byte SizeNibble) Encode(Span<byte> buffer, long key, long val)
{
var keyLen = (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)key) / 8);
var valLen = (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)val) / 8);
byte inlined = (keyLen, valLen) switch
{
(1, 1) => 1,
(2, 2) => 2,
(3, 3) => 3,
(4, 4) => 4,
(2, 1) => 5,
(2, 3) => 6,
(2, 4) => 7,
(3, 2) => 8,
(3, 4) => 9,
(3, 5) => 10,
(4, 2) => 11,
(4, 3) => 12,
(5, 4) => 13,
(5, 5) => 14,
(4, 5) => 15,
// note the trickery here with incrementing the write offset
_ => 0
};
var writeOffset = 0;
if (inlined == 0) // cannot fit, need to write to entry itself
{
writeOffset = 1;
buffer[0] = (byte)(keyLen << 4 | valLen);
}
MemoryMarshal.AsBytes(new Span<long>(ref key))[..keyLen].CopyTo(buffer[writeOffset..]);
MemoryMarshal.AsBytes(new Span<long>(ref val))[..valLen].CopyTo(buffer[(keyLen + writeOffset)..]);
return (writeOffset + keyLen + valLen, inlined);
}
I’m using a switch expression here for readability, so it is clear what is going on. If the key and value sizes are in one of the known patterns, we can put that in the nibble we’ll return. If the value is not, we’ll write it to the entry buffer.
The Set method itself had to change in some subtle but crucial ways, let’s look at it first, then I’ll discuss those changes:
public static bool Set(Span<byte> page, long key, long val)
{
Debug.Assert(page.Length == 8192);
// header is 4 bytes [bottom, top]
ref var bottom = ref Unsafe.AsRef(MemoryMarshal.Cast<byte, ushort>(page)[0]);
ref var top = ref Unsafe.AsRef(MemoryMarshal.Cast<byte, ushort>(page)[1]);
if (bottom == 0) // init empty page
{
top = 8192; // top of the page
bottom = 4; // leave 4 bytes for the bottom/top header
}
var usableSpace = top - bottom;
Span<byte> temp = stackalloc byte[17]; // max 8 bytes per, plus 1
var (reqEntryLen, nibble) = Encode(temp, key, val);
var idx = BinarySearch(page, key);
Span<ushort> offsets;
if (idx >= 0)
{
// let's check if we can update?
offsets = OffsetsFor(page);
var actualEntryOffset = ((offsets[idx] & 0xFFF0) >> 3);
var curEntryLen = DecodeEntryLength(page[actualEntryOffset..]);
if (reqEntryLen <= curEntryLen) // can fit without needing a new allocation
{
// can fit, just copy & done
offsets[idx] = (ushort)(actualEntryOffset << 3 | nibble);
temp[..reqEntryLen].CopyTo(page[actualEntryOffset..]);
return true;
}
// is there enough space for a new entry?
if (reqEntryLen > usableSpace)
return false;
}
else // increase size of offsets array
{
var expectedStart = (top - reqEntryLen) & ~1; // aligned on 2 bytes boundary
// is there enough space for a new entry + it's offset?
if (bottom + sizeof(ushort) > expectedStart)
return false;
idx = ~idx;
bottom += 2;
offsets = OffsetsFor(page);
offsets[idx..(offsets.Length - 1)].CopyTo(offsets[(idx + 1)..]);
}
top = (ushort)((top - reqEntryLen) & ~1); // align on two bytes boundary
offsets[idx] = (ushort)(top << 3 | nibble);
temp[..reqEntryLen].CopyTo(page[top..]);
return true;
}
As before, we encode the entry into a temporary buffer. Now, in addition to getting the length of the entry, we are also getting the nibble that we’ll need to store.
You can see the changes in how we work with the offsets array following that. When we need to update an existing value, we are using this construction to figure out the actual entry offset:
var actualEntryOffset = ((offsets[idx] & 0xFFF0) >> 3);
What exactly is going on here? Don’t try to figure it out yet, let’s see how we are writing the data:
top = (ushort)((top - reqEntryLen) & ~1); // align on two bytes boundary
offsets[idx] = (ushort)(top << 3 | nibble);
Those two code snippets may look very odd, so let’s go over them in detail.
First, remember that we have an 8KB page to work with, but we need to use 4 bits for the size nibble we got from encoding the entry. To address all 8,192 positions in the page, we would need to reserve 13 bits. That is… a problem. We solve that by saying that entry addresses must always be aligned on a two-byte boundary. That is handled by clearing the lowest bit in the new top computation. Since we are growing down, that has the effect of keeping everything two-byte aligned.
Then, we merge the top location and the nibble together. We know that the bottom-most bit of top is cleared, so after shifting the value left by 3 bits we have 4 cleared bits ready to hold the nibble.
Conversely, when we want to read, we clear the bottom 4 bits and then shift right by three. That has the effect of returning us to the original value.
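A worked example, with a made-up entry position, may help:

// entry written at page offset 6,388 (two-byte aligned), size nibble = 9
//   stored offset = 6,388 << 3 | 9         = 51,113  (still fits in a ushort)
//   offset back   = (51,113 & 0xFFF0) >> 3 = 6,388
//   nibble back   =  51,113 & 0xF          = 9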
A little bit confusing, but we managed to squeeze 784 entries into the page using the realistic dataset and 765 using the full one. That is another 3.5% of space savings over the previous nibble attempt and over a 10% increase in capacity over the variable integer approach.
And at this point, I don’t believe that there is anything more that I can do to reduce the size in a significant manner without having a negative impact elsewhere.
We are not done yet, however. We are done with the size aspect, but we also have much to do in terms of performance and optimizations for runtime.
In the meantime, you can see my full code here. In the next post, we are going to start talking about the actual machine code and how we can optimize it.
In my last post we implemented variable-sized encoding to be able to pack even more data into the page. We were able to achieve 40% better density because of that. This is pretty awesome, but we would still like to do better. There are two disadvantages to variable size integers:
They may take more space than the actual raw numbers.
The number of branches is high, and non-predictable.
Given that we need to encode the key and value together, let’s see if we can do better. We know that both the key and the value are 8 bytes long. On little-endian systems, we can consider the number as a byte array.
Consider this number: 139,713,513,353, which is composed of the following bytes: [137, 7, 147, 135, 32, 0, 0, 0]. This is how it looks in memory. This means that we only need the first 5 bytes, not the last 3 zero ones.
It turns out that there is a very simple way to compute the number of used bytes, like so:
int UsedBytes(long l)
{
return (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)l) / 8);
}
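Running that on the example number above, as a quick sanity check:

// 139,713,513,353: the highest set bit is bit 37
//   LeadingZeroCount = 64 - 38 = 26
//   26 / 8 = 3    (three whole zero bytes at the top)
//   8 - 3  = 5    bytes actually used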
This translates into the following assembly:
UsedBytes|0_0(Int64)
L0000: xor eax, eax
L0002: lzcnt rax, rcx
L0007: shr rax, 3
L000b: neg eax
L000d: add eax, 8
L0010: ret
Which is about as tight as you can want it to be.
Of course, there is a problem. In order to read the value back, we need to store the number of bytes we used somewhere. Variable-sized integers use the top bit of each byte to signal whether more bytes follow. But we cannot do that here.
Remember, however, that we encode two numbers here, and the maximum length of each number is 8 bytes. In binary, that means that we need 4 bits to encode the length of each number. This means that if we take one additional byte, we can fit the lengths of both numbers into a single byte.
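For example (a made-up entry):

// keyLen = 5, valLen = 3
//   header byte = 5 << 4 | 3 = 0x53
//   layout: [0x53][5 key bytes][3 value bytes] -> 9 bytes in total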
The length of the key and the value would each fit on a nibble inside that byte. Here is what the encoding step looks like now:
static unsafe uint Encode(byte* buffer, long key, long val)
{
var keyLen = (uint)(8 - Lzcnt.X64.LeadingZeroCount((ulong)key) / 8);
var valLen = (uint)(8 - Lzcnt.X64.LeadingZeroCount((ulong)val) / 8);
buffer[0] = (byte)(keyLen << 4 | valLen);
Unsafe.CopyBlock(buffer + 1, &key, keyLen);
Unsafe.CopyBlock(buffer + 1 + keyLen, &val, valLen);
return 1 + keyLen + valLen;
}
And in assembly:
Encode(ulong,long,long):uint:
push r15
push r14
push r12
push rbx
sub rsp, 24
mov qword ptr [rsp+10H], rsi
mov qword ptr [rsp+08H], rdx
mov rbx, rdi
xor edi, edi
lzcnt rdi, qword ptr [rsp+10H]
shr rdi, 3
mov r14d, edi
neg r14d
add r14d, 8
xor edi, edi
lzcnt rdi, qword ptr [rsp+08H]
shr rdi, 3
mov r15d, edi
neg r15d
add r15d, 8
mov edi, r14d
shl edi, 4
or edi, r15d
mov byte ptr [rbx], dil
lea rdi, [rbx+01H]
lea rsi, [rsp+10H]
mov r12d, r14d
mov rdx, r12
call [CORINFO_HELP_MEMCPY]
mov edi, r14d
lea rdi, [rbx+rdi+01H]
lea rsi, [rsp+08H]
mov rdx, r12
call [CORINFO_HELP_MEMCPY]
lea eax, [r14+r15+01H]
add rsp, 24
pop rbx
pop r12
pop r14
pop r15
ret
Note that there are no branches at all here, which I’m really stoked about. As for decoding, we just have to go the other way around:
Main:Decode(ulong):System.ValueTuple`2[long,long]:
push r15
push r14
push rbx
sub rsp, 16
xor eax, eax
mov qword ptr [rsp+08H], rax
mov qword ptr [rsp], rax
mov rbx, rdi
movzx r14, byte ptr [rbx]
mov r15d, r14d
sar r15d, 4
and r14d, 15
lea rdi, [rsp+08H]
lea rsi, [rbx+01H]
mov edx, r15d
call [CORINFO_HELP_MEMCPY]
lea rdi, [rsp]
mov esi, r15d
lea rsi, [rbx+rsi+01H]
mov edx, r14d
call [CORINFO_HELP_MEMCPY]
mov rax, qword ptr [rsp+08H]
mov rdx, qword ptr [rsp]
add rsp, 16
pop rbx
pop r14
pop r15
ret
unsafe static (long Key, long Val) Decode(byte* buffer)
{
var keyLen = (uint)(buffer[0] >> 4);
var valLen = (uint)(buffer[0] & 0xF);
long k = 0;
long v = 0;
Unsafe.CopyBlock(&k, buffer + 1, keyLen);
Unsafe.CopyBlock(&v, buffer + 1 + keyLen, valLen);
return (k, v);
}
No branches, and really predictable code.
That is all great, but what about the sizes? We are always taking 4 additional bits per number, so it is actually a single additional byte for each entry we encode. With varint, the moment we encode numbers that are beyond the 2GB range, we’re already winning: encoding 3,221,241,856, for example, will cost 5 bytes (since we limit the range of each byte to 7 bits), while the raw little-endian form needs only 4. The key advantage in our case is that whenever either the key or the value needs an additional varint byte, we are at parity with the nibble method. If both of them do, we are winning, since the nibble method uses a single additional byte while the variable size integers take two (one for each number).
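A quick back-of-the-envelope comparison, using that example value for both the key and the value:

// key = val = 3,221,241,856 (each fits in 4 raw bytes)
//   varint encoding : 5 + 5     = 10 bytes
//   nibble encoding : 1 + 4 + 4 =  9 bytes  (one shared header byte)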
Now that we understand encoding and decoding, the rest of the code is basically the same. We just changed the internal format of the entry, nothing about the rest of the code changes.
And the results?
For the realistic dataset, we can fit 759 entries versus 712 for the variable integer model.
For the full dataset, we can fit 752 entries versus 710 for the variable integer model.
That is a 7% improvement in size, but it also comes with a really important benefit. Fewer branches.
This is the sort of code that runs billions of times a second. Reducing its latency has a profound impact on overall performance. One of the things that we pay attention to in high-performance code is the number of branches, because on superscalar CPUs multiple instructions may execute in parallel at the chip level. A branch may cause us to stall (we have to wait until the result is known before we can execute the next instruction), so the processor will try to predict what the result of the branch will be. If it is a highly predictable branch (an error check that is almost never taken, for example), there is very little cost to that.
The variable integer code, on the other hand, is nothing but branches, and as far as the CPU is concerned, there is no way to actually predict what the result will be, so it has to wait. Branchless or well-predicted code is a key aspect of high-performance code. And this approach can have a big impact.
As a reminder, we started at 511 items, and we are now at 759. The question is, can we do more?
I’ll talk about it in the next post…
In my previous post, we stored keys and values as raw numbers inside the 8KB page. That was simple, but wasteful. For many scenarios, we are never going to need the full 8-byte range of a long. Most numbers are far smaller.
In the example I gave in the last post, we are storing the following range of numbers (file offsets, basically). I’m using two test scenarios, one where I’m testing the full range (for correctness) and one where I’m testing files under 512 GB in size. Given that we are trying to compress the space, once we hit the 512GB mark, it is probably less urgent, after all.
Here are the number generations that I’m using:
IEnumerable<long> Generate_Full()
{
var rand = new Random(2023_04_21);
while (true)
{
yield return rand.Next(100) switch
{
< 3 => rand.NextInt64(0, 1L << 7), // 3%
< 10 => rand.NextInt64(0, 1L << 15),// 7%
< 35 => rand.NextInt64(0, 1L << 23),// 25%
< 75 => rand.NextInt64(0, 1L << 31),// 35%
< 90 => rand.NextInt64(0, 1L << 39),// 15%
< 95 => rand.NextInt64(0, 1L << 47),// 5%
< 98 => rand.NextInt64(0, 1L << 55),// 3%
_ => rand.NextInt64(0, 1L << 62) // 2%
};
}
}
IEnumerable<long> Generate_Realistic()
{
var rand = new Random(2023_04_21);
while (true)
{
yield return rand.Next(100) switch
{
< 1 => rand.NextInt64(0, 1L << 7), // 1%
< 3 => rand.NextInt64(0, 1L << 15), // 2%
< 30 => rand.NextInt64(0, 1L << 23),// 27%
< 75 => rand.NextInt64(0, 1L << 31),// 35%
_ => rand.NextInt64(0, 1L << 39), // 25%
};
}
}
What this means is:
| Full data set | Realistic data set |
|------------------------------|------------------------------|
| 3% in the first 128 bytes | 1% in the first 128 bytes |
| 7% in the first 64 KB | 2% in the first 64 KB |
| 25% in the first 8 MB | 27% in the first 8 MB |
| 35% in the first 2 GB | 35% in the first 2 GB |
| 15% in the first 512 GB | 25% in the first 512 GB |
| 5% in the first 128 TB | |
| 3% in the first 32 Petabytes | |
| 2% in the first 4 Exabytes | |
This is meant to verify that we can handle any scenario; in practice, we can usually focus on the first 512 GB, which is far more common.
With either dataset, my previous approach lets me fit up to 511 entries per page. That makes sense, since we are storing the data raw, so how can we do better? Most of the time, we don’t need anywhere near 8 bytes per value. For that reason, we have variable length encoding, which goes by many names, such as variable size int, 7-bit integers, etc. I adapted some methods from the .NET codebase to allow me to operate on Spans, like so:
private static int Write7BitEncodedInt64(long value, Span<byte> buffer)
{
// Adapted from: https://github.com/dotnet/runtime/blob/2087ceb0c70f0c13e79fbbdec90efd242e1f2cd4/src/libraries/System.Private.CoreLib/src/System/IO/BinaryWriter.cs#L492
ulong uValue = (ulong)value;
// Write out an int 7 bits at a time. The high bit of the byte,
// when on, tells reader to continue reading more bytes.
//
// Using the constants 0x7F and ~0x7F below offers smaller
// codegen than using the constant 0x80.
int index = 0;
while (uValue > 0x7Fu)
{
buffer[index++] = ((byte)((uint)uValue | ~0x7Fu));
uValue >>= 7;
}
buffer[index++] = ((byte)uValue);
return index;
}
private static long Read7BitEncodedInt64(Span<byte> buffer, out int len)
{
// Adapted from: https://github.com/dotnet/runtime/blob/2087ceb0c70f0c13e79fbbdec90efd242e1f2cd4/src/libraries/System.Private.CoreLib/src/System/IO/BinaryReader.cs#L588
ulong result = 0;
byte byteReadJustNow;
len = 0;
// Read the integer 7 bits at a time. The high bit
// of the byte when on means to continue reading more bytes.
//
// There are two failure cases: we've read more than 10 bytes,
// or the tenth byte is about to cause integer overflow.
// This means that we can read the first 9 bytes without
// worrying about integer overflow.
const int MaxBytesWithoutOverflow = 9;
for (int shift = 0; shift < MaxBytesWithoutOverflow * 7; shift += 7)
{
// ReadByte handles end of stream cases for us.
byteReadJustNow = buffer[len++];
result |= (byteReadJustNow & 0x7Ful) << shift;
if (byteReadJustNow <= 0x7Fu)
{
return (long)result; // early exit
}
}
// Read the 10th byte. Since we already read 63 bits,
// the value of this byte must fit within 1 bit (64 - 63),
// and it must not have the high bit set.
byteReadJustNow = buffer[len++];
if (byteReadJustNow > 0b_1u)
{
throw new FormatException("Format_Bad7BitInt");
}
result |= (ulong)byteReadJustNow << (MaxBytesWithoutOverflow * 7);
return (long)result;
}
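Here is a worked trace of the encoding side, using a small made-up value:

// Write7BitEncodedInt64(300, buffer)
//   300 = 0b10_0101100
//   first byte : 0b0101100 | 0x80 = 0xAC   (low 7 bits, continuation bit set)
//   second byte: 0b10             = 0x02
//   -> returns 2, buffer now starts with [0xAC, 0x02]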
Let’s check what sort of savings we can get using this approach:
Under 127 bytes – 1 byte
128 bytes .. 32 KB – 2 bytes
32KB .. 8MB – 3 bytes
8MB .. 2GB – 4 bytes
2 GB .. 512 GB – 5 bytes
512GB .. 128 TB – 6 bytes
128TB .. 32 Petabytes – 7 bytes
32 Petabytes .. 8 Exabytes – 8 bytes
Greater than 8 Exabytes – 9 bytes
That is really cool, since for the realistic data set, we can pack a lot more data into the page.
It comes with a serious issue, however: the data is no longer fixed size (well, that is the point, no?). Why is that a problem? Because we want to be able to do a binary search on the page, which means that we need to be able to access the data by index. As usual, the solution is to utilize indirection. We’ll dedicate the bottom of the page to an array of fixed-size integers (16 bits each, sufficient to cover the 8KB range of the page) that point to the actual location of each entry. Like before, we are going to reserve the first few bytes as a header; in this case we’ll use 4 bytes, divided into two shorts. Those keep track of the writes to the bottom and the top of the page.
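Roughly, the page layout we are aiming for looks like this (a text sketch, sizes not to scale):

// [ushort bottom][ushort top][offset 0][offset 1]...   ...free space...   ...[entry][entry][entry]
//  ^ 4-byte header            offsets grow up from the bottom       entries grow down from the top ^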
At the bottom, we’ll have the actual offsets that point to the entries, and at the top, we write the actual entries. Here is what this looks like:
private static Span<ushort> OffsetsFor(Span<byte> page)
{
var asShorts = MemoryMarshal.Cast<byte, ushort>(page);
var bottom = asShorts[0];
var top = asShorts[1]; // unused here, included for clarity
return MemoryMarshal.CreateSpan(ref asShorts[2], (bottom - 4) / 2);
}
Let’s see how our reading from the page will look now. As you can see, it is very similar to what we had before, but instead of going directly to the key by its offset, we have to use the indirection:
public static bool TryGetValue(Span<byte> page, long key, out long val)
{
var idx = BinarySearch(page, key);
if (idx < 0) // not found
{
val = default;
return false;
}
var offsets = OffsetsFor(page);
var entryOffset = offsets[idx];
_ = Read7BitEncodedInt64(page[entryOffset..], out var keyLen);
val = Read7BitEncodedInt64(page[(entryOffset + keyLen)..], out _);
return true;
}
The offsets array contains the location of the entry in the page, and that is laid out as the [varint-key][varint-val]. So we read (and discard) the key from the offset we found (we have to do that to discover its size) and then we read and return the actual value.
Let’s look at how we implemented the actual binary search in the page:
private static int BinarySearch(Span<byte> page, long key)
{
var offsets = OffsetsFor(page);
int lo = 0, hi = offsets.Length - 1;
while (lo <= hi)
{
int mid = (lo + hi) >> 1;
var entryOffset = offsets[mid];
var cur = Read7BitEncodedInt64(page[entryOffset..], out _);
int cmp = key.CompareTo(cur);
if (cmp == 0)
return mid;
if (cmp < 0)
lo = mid + 1;
else
hi = mid - 1;
}
// where it is *supposed* to go
return ~lo;
}
This is a bog standard binary search, with the only interesting bit being that we go through the offsets array to find the actual location of the key, which we then read using variable size decoding.
The interesting part of this model happens when we need to set a value. Here is what this looks like, with my notes following the code.
public static bool Set(Span<byte> page, long key, long val)
{
Debug.Assert(page.Length == 8192);
// header is 4 bytes [bottom, top]
ref var bottom = ref Unsafe.AsRef(MemoryMarshal.Cast<byte, ushort>(page)[0]);
ref var top = ref Unsafe.AsRef(MemoryMarshal.Cast<byte, ushort>(page)[1]);
if (bottom == 0) // init empty page
{
top = 8192; // top of the page
bottom = 4; // leave 4 bytes for the bottom/top header
}
var usableSpace = top - bottom;
Span<byte> temp = stackalloc byte[18]; // max varint = 9 - we need to encode 2
var reqEntryLen = Write7BitEncodedInt64(key, temp);
reqEntryLen += Write7BitEncodedInt64(val, temp[reqEntryLen..]);
var idx = BinarySearch(page, key);
Span<ushort> offsets;
if (idx >= 0)
{
// let's check if we can update?
offsets = OffsetsFor(page);
var entryOffset = offsets[idx];
Read7BitEncodedInt64(page[entryOffset..], out var keyLen);
Read7BitEncodedInt64(page[(entryOffset + keyLen)..], out var valLen);
if (reqEntryLen > keyLen + valLen) // cannot fit, need new allocation
{
if (reqEntryLen > usableSpace)
return false; // we are full, need to split the page
}
else
{
// can fit, just copy & done
temp[..reqEntryLen].CopyTo(page[entryOffset..]);
return true;
}
}
else // increase size of offsets array
{
if (reqEntryLen + sizeof(ushort) > usableSpace)
return false; // we are full, need to split the page
idx = ~idx;
bottom += 2;
offsets = OffsetsFor(page);
offsets[idx..(offsets.Length - 1)].CopyTo(offsets[(idx + 1)..]);
}
top -= (ushort)reqEntryLen;
offsets[idx] = top; // new data location
temp[..reqEntryLen].CopyTo(page[top..]);
return true;
}
This is quite a lot, I’ll admit. Let’s try to break down what is going on here into individual pieces.
First, we get the header values (bottom, top) and initialize them if empty (note that bottom is set to 4, after the header, while top is set to the end of the buffer). The idea is that the bottom grows up and the top grows down. This is called Slotted Page design and it is a staple of database design.
We then encode the key and the value into a temporary buffer. We need to do that so we’ll know what size the entry will take. Then we need to figure out if we are updating an existing record or creating a new one.
Updating an existing record is complex. This is because the size of the new record may be greater than the size of the old one. So we can’t put it in the same location. I’m handling this by just allocating new space for this entry, ignoring the old space that was allocated to it.
I’m not handling any deletes / space reclamation in this series. That is a separate subject, not complex, but fairly tedious to do properly. So I’m going to focus solely on writes.
Updates to an existing entry that also change its size aren’t in my test dataset, so I’m not worried about it too much here. I mention this to point out that variable length records bring with them considerations that we wouldn’t have run into with the fixed-size model.
And after all of this work? What are the results?
With the fixed-size version, we could fit 511 entries into the page. With the variable size int, however, we can do better.
For the realistic dataset, I can fit 712 entries for the page, and for the full dataset, 710 (there are very few very big elements even there, but we can see that it has an impact).
511 vs. 712 may not sound like much, but that is a 40% increase in the number of entries that I can fit. To give some context, using 8KB pages, that is a difference of roughly 5 MB per million entries. That adds up.
The question is, can we do better? More on that in my next post…
I write databases for a living, which means that I’m thinking a lot about persistence. Here is a fun challenge that we went through recently. We have the need to store a list of keys and values and then lookup a value by key. Pretty standard stuff. The keys and values are both 64 bits integers. In other words, what I would like to have is:
Dictionary<long,long> lookup;
That would be perfect, except that I have to persist the data, which means that I have to work with raw bytes. It’s easiest to think about it if we have some code in front of us. Here is the interface that I need to implement:
void Set(Span<byte> page, long key, long val)
{
Debug.Assert(page.Length == 8192);
// impl...
}
bool TryGetValue(Span<byte> page, long key, out long val)
{
Debug.Assert(page.Length == 8192);
// impl...
}
As you can see, we have a byte buffer (8KB in size) and we want to add or lookup values from the buffer. All the data resides in the buffer, nothing is external. And we cannot unpack it in memory, because this is used for lookups, so this needs to be really fast.
The keys we are storing are file offsets, so they correlate quite nicely to the overall size of the file. Meaning that you’ll have a lot of small values, but also large ones. Given a key, we need to be able to look up its value quickly, since we may run this lookup billions of times.
Given that I have 8KB of data, I can do the following: just treat the buffer as a sorted array. That gives me a pretty easy way to search for a particular value and a simple way to actually store things.
Theoretically, given an 8KB page and 16 bytes for each (key, value) entry, we can store up to 512 entries per page. But it turns out that this is just a theory. We also need to keep track of the number of items that we have, and that takes some space. Just a couple of bytes, but it means that we don’t have those bytes available. A page can now contain up to 511 entries, and even at full capacity, we have 14 bytes wasted (2 for the number of entries, and the rest unused).
Here is what this looks like in code:
bool Set(Span<byte> page, long key, long val)
{
    Debug.Assert(page.Length == 8192);
    // takes two bytes, but we reserve 8 bytes to make alignment easier
    ref var numberOfItems = ref Unsafe.AsRef(MemoryMarshal.Cast<byte, ushort>(page)[0]);
    // actual usable range
    var entries = MemoryMarshal.CreateSpan(ref MemoryMarshal.Cast<byte, long>(page)[1], page.Length / sizeof(long) - 1);
    var keys = entries[..numberOfItems]; // from the bottom
    var values = entries[(entries.Length - numberOfItems)..]; // from the top
    var idx = keys.BinarySearch(key); // makes searching easy
    if (idx >= 0) // update
    {
        values[idx] = val;
        return true;
    }
    if (numberOfItems >= entries.Length / 2)
        return false; // we are full, need to split the page
    idx = ~idx; // find where it *should* go
    numberOfItems++;
    // now extend the ranges
    keys = entries[..numberOfItems];
    values = entries[(entries.Length - numberOfItems)..];
    // move the existing entries to make room for the new one
    keys[idx..^1].CopyTo(keys[(idx + 1)..]); // shift the larger keys up by one slot
    values[1..(idx + 1)].CopyTo(values[0..idx]); // shift the smaller values down by one slot
    keys[idx] = key;
    values[idx] = val;
    return true;
}

bool TryGetValue(Span<byte> page, long key, out long val)
{
    Debug.Assert(page.Length == 8192);
    // takes two bytes, but we reserve 8 bytes to make alignment easier
    ref var numberOfItems = ref Unsafe.AsRef(MemoryMarshal.Cast<byte, ushort>(page)[0]);
    // actual usable range
    var entries = MemoryMarshal.CreateSpan(ref MemoryMarshal.Cast<byte, long>(page)[1], page.Length / sizeof(long) - 1);
    var keys = entries[..numberOfItems];
    var values = entries[(entries.Length - numberOfItems)..];
    var idx = keys.BinarySearch(key);
    if (idx >= 0) // found
    {
        val = values[idx];
        return true;
    }
    val = default;
    return false;
}
As you can see, we are creating two arrays: the keys grow from the bottom of the page and the values grow from the top. The idea is that I can utilize the BinarySearch() method to quickly find the index of a key (or where it ought to go). From there, I can look at the corresponding values array to get the actual value. The fact that they grow separately (and toward each other) means that I don't need to move as much memory when I'm getting values out of order.
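To make the usage concrete, here is a tiny sketch of how the two methods fit together (my own illustration; the keys and values are made up):
Span<byte> page = new byte[8192]; // a zeroed page means numberOfItems == 0
Set(page, 1_000_000, 42);
Set(page, 500, 7); // smaller key, the keys span stays sorted
TryGetValue(page, 500, out var v); // true, v == 7
TryGetValue(page, 999, out _); // false, this key was never set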
For now, I want to set up the playground in which we’ll operate. The type of data that you write into such a system is important. I decided to use the following code to generate the test set:
IEnumerable<long> Generate()
{
    var rand = new Random(2023_04_21);
    while (true)
    {
        yield return rand.Next(100) switch
        {
            < 3 => rand.NextInt64(0, 1L << 7),   //  3%
            < 10 => rand.NextInt64(0, 1L << 15), //  7%
            < 35 => rand.NextInt64(0, 1L << 23), // 25%
            < 75 => rand.NextInt64(0, 1L << 31), // 40%
            < 90 => rand.NextInt64(0, 1L << 39), // 15%
            < 95 => rand.NextInt64(0, 1L << 47), //  5%
            < 98 => rand.NextInt64(0, 1L << 55), //  3%
            _ => rand.NextInt64(0, 1L << 62)     //  2%
        };
    }
}
The idea is that we’ll generate a random set of numbers, in the given distribution. Most of the values are in the range of 8MB to 512GB, representing a pretty good scenario overall, I think.
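If you want to eyeball that claim, a quick throwaway check (mine, not part of the original code) is to sample the generator and bucket the values by how many bytes they would need:
// requires: using System.Linq; using System.Numerics;
var counts = new int[9];
foreach (var value in Generate().Take(1_000_000))
    counts[(64 - BitOperations.LeadingZeroCount((ulong)value | 1) + 7) / 8]++; // bytes needed
for (int bytes = 1; bytes <= 8; bytes++)
    Console.WriteLine($"{bytes} byte(s): {counts[bytes] / 10_000.0:F1}%");
Values up to 8MB fit in 3 bytes and values up to 512GB fit in 5 bytes, so most of the samples should land in the 4 and 5 byte buckets.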
And with that, we just need to figure out what metrics we want to use for this purpose. My goal is to push as many values as I can into the buffer, while maintaining the ability to get a value by its key as fast as possible.
The current approach, for example, does a binary search on a sorted array plus an extra lookup into the companion values array. You really can't beat that for lookup speed if you need to support arbitrary keys (for 511 entries, binary search is at most nine comparisons). Here is my test bench:
Span<byte> buffer = new byte[8192];
var enumerator = Generate().GetEnumerator();
var index = 0;
while (true)
{
    // pull a (key, value) pair from the deterministic generator
    enumerator.MoveNext();
    var k = enumerator.Current;
    enumerator.MoveNext();
    var v = enumerator.Current;
    if (Set(buffer, k, v) == false)
    {
        // the page is full, report how many distinct entries made it in
        Console.WriteLine("Inserted: " + Validate(buffer, index));
        break;
    }
    index++;
    Validate(buffer, index);
}

int Validate(Span<byte> buffer, int index)
{
    // replay the same sequence (same seed) to rebuild the expected state
    var enumerator = Generate().GetEnumerator();
    var dic = new Dictionary<long, long>();
    for (int i = 0; i < index; i++)
    {
        enumerator.MoveNext();
        var k = enumerator.Current;
        enumerator.MoveNext();
        var v = enumerator.Current;
        dic[k] = v; // need to handle updates to the same key
    }
    foreach (var (k, expected) in dic)
    {
        if (TryGetValue(buffer, k, out var actual) == false || expected != actual)
        {
            Console.WriteLine($"Missed: {k}! {expected} != {actual}");
            break;
        }
    }
    return dic.Count;
}
This will insert key/value pairs into the page until it is full. Note that we allow duplicates (we'll just update the value), so we need to keep track of the number of distinct entries, not just the number of insertions. We also validate the structure at every step of the way, to ensure that we always get the right behavior.
This code runs as expected, and we can put 511 values into the page before it gives up. This approach works, is simple to reason about, and has very few flaws. It is also quite wasteful in terms of information density, however. I would like to do better than 511 entries per page. Is it possible to drop below 16 bytes per entry?
Give it some thought; I'm going to present several ways of doing just that in my next post…
A few years ago I wrote about how you can use a bitmap to efficiently find space to allocate. I ran into a similar scenario recently and had another thought about that. Let’s say that we want to allocate from a buffer, which you can see below:
Set bits in the buffer (marked by X) are busy, and cleared bits are empty. We can find a range of cleared bits to allocate easily enough, as I have shown in my previous post.
The question we now need to answer is one of freeing the memory. In the image above, you can see that we have two allocations, the yellow one (the first item) and the green one (which is composed of two cells).
The problem is: how do we know what to free in this case? The yellow one is really easy, since we can see that it is a single element. But freeing the green element is much harder, since we can't tell its size just from the bitmap. Usually you need to keep that size somewhere, typically inside the allocated memory itself (taking part of the allocation as metadata overhead).
The advantage of bitmaps is that they are simple, memory efficient, and quite fast. The problem is that they are really limited in what information they give us. The buffer above shows us busy or free, nothing more.
What if we make it more interesting? Instead of using a single bit per cell, we'll use two. That gives us the following states:
0b00 – free
0b11 – allocated (and has next)
0b01 – end of allocation
This doubles the memory we use, but it also gives us a really important property. Given an index in the bitmap, we can easily ask what the size of the allocation is. We just need to find the next cleared bit and compute the distance. Let's consider this simple bitmap:
0b_00_01_00_01_11_11_00_01_11_01_01
As you can see, we have the following allocations, with positions counted from the least significant (rightmost) pair (I separated each pair of bits to make it easier to read):
0 – 1 cell
1 – 1 cell
2 – 2 cells
5 – 3 cells
9 – 1 cell
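Before looking at how we read the sizes back, here is a minimal sketch (my own, not from the post) of how an allocation would be written under this scheme: every cell except the last gets 0b11, and the final cell gets 0b01.
// Hypothetical helper: mark 'cells' consecutive cells starting at 'pos'
// as one allocation in the two-bit-per-cell bitmap.
static ulong Mark(ulong bitmap, int pos, int cells)
{
    for (int i = 0; i < cells; i++)
    {
        var pair = i == cells - 1 ? 0b01UL : 0b11UL; // the last cell closes the run
        bitmap |= pair << ((pos + i) * 2);
    }
    return bitmap;
}
Building the example bitmap above is then just a series of Mark calls against a zeroed value.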
The code to do the actual accounting looks like this:
int AllocationSize(ulong bitmap, int pos)
{
    var offset = pos * 2;
    // the low bit of the pair tells us whether the cell is allocated at all
    if (((bitmap >> offset) & 1) == 0)
        return 0; // free cell
    // find the first cleared bit above the pair's low bit; that is the high
    // bit of the last pair in the allocation (the 0b01 "end of allocation" cell)
    var mask = ulong.MaxValue << (offset + 1);
    ulong bitset = ~bitmap & mask;
    var nextSetBit = BitOperations.TrailingZeroCount(bitset);
    // distance in bits from the start of our pair, converted to whole cells
    return (nextSetBit + 1 - offset) / 2;
}
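A quick sanity check against the example bitmap from above:
var bitmap = 0b_00_01_00_01_11_11_00_01_11_01_01UL;
Console.WriteLine(AllocationSize(bitmap, 0)); // 1
Console.WriteLine(AllocationSize(bitmap, 2)); // 2
Console.WriteLine(AllocationSize(bitmap, 4)); // 0, that cell is free
Console.WriteLine(AllocationSize(bitmap, 5)); // 3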
And now we have a bitmap where the cost of tracking the size of an allocation is a single additional bit per cell.
If you are expecting a lot of big allocations, this may not be worth it (a size header inside a big allocation is negligible overhead anyway), but if you have many small allocations, removing the need to track their sizes separately is a major benefit.