One of my favorite activities is teaching. I love explaining how things work and passing on knowledge. Another good way to pass the time is to learn, which can be a source of great joy and incredible frustration.
Recently I had a conversation about databases with a university student in the Computer Science track. As you might imagine, that is a topic that is near and dear to my heart, so I was very interested in hearing what kind of databases the student had been exposed to and what impression they had of them.
The student in question was familiar with MySQL from his courses and had run into PostgreSQL in some job interviews. He was very much in the PostgreSQL camp. That isn’t an uncommon reaction, but his reason for it was interesting to me. In his course, they had to set up and configure MySQL from scratch. That was annoying and hard, especially since they used the command line to make queries to the database.
In the PostgreSQL case, however, he was given access to a web front end to the database and was assigned tasks to complete. He could get started right away on the most important aspects of the task.
When I’m teaching, part of the job is the data dump (here are the things you need to know), part of it is answering questions, making sure that the students understand the material, etc. A crucial part of teaching is getting the students to do things, making sure that they are actually exercising the new knowledge. In such cases, I have noticed that I provide them with the baseline and they need to complete just the parts that are missing, the core pieces.
That is pretty much the same thing that the student ran into during their interview.
In retrospect, for teaching, I think that this approach is a serious issue.
One of the most important hurdles that I see for new developers is their ability to deal with problems. Whether it is actually reading the errors, composing some mental model for what is going on, or just being able to dig deeper and understand what is happening. I’m not talking about experience, mind, I’m talking about the approach. If most of the tasks that they have dealt with so far were of the “fill in the missing pieces” variety, they have likely never had the experience of dealing with everything else. And in many cases, the issues aren’t in the areas you are thinking of; they can be somewhere else completely.
I remember many times when I had to do something and ran into a wall. That was incredibly frustrating, especially when the issue was somewhere completely orthogonal to what I was trying to do. A great example recently was having to figure out how to do cross compilation in a GitHub Action using GCC. That took a lot of time, and all I wanted to do was call a single function from native code.
As frustrating as that is, I think that there is a huge amount of value in those kinds of “side quests”. That is especially true when someone is in the early stages of their career. Those are just the sort of hurdles that can teach you not only what is going on at your level but about the ecosystem in general and the layers beneath you.
A great example of the lack of such knowledge was a candidate in a recent interview who was asked: “Given a static HTML page and a domain name that was registered, what do you need to set up a website for that page on that domain?” The candidate had no idea what was actually involved (nothing about DNS, routing, servers, etc). They were likely able to write an application using modern practices, but everything about pushing to production, what a website is, what a domain name or an IP is… nope.
And that makes sense; they had never even run into something like that.
On the other hand, I remember building websites using FTP and Apache / IIS in the 90s. It wasn’t fun, but I had very little abstraction to deal with and was exposed to the workings of the engine.
And that sort of thing matters, because you will need to understand details such as DNS propagation times and their impact on what your system is doing, for example.
After implementing the memory management in the previous post, I set out to handle the actual I/O primitives that we need. As a reminder, we are separating the concerns here. We managed memory and reference counting in the previous post and now I want to focus on how we can read and write from the disk in as efficient a manner as possible. Before we get to the guts of the code, I want to explain a bit about the environment that I have in mind. Most of the time, the pager is able to provide the requested page from memory directly. If it can’t do that, it needs to consider the fact that there may be multiple threads that are trying to load that page. At the same time, while we are loading the page, we want to be free to do other things as well.
I decided to implement the I/O routine using async I/O. Here is the rough sketch of the API I have in mind:
```zig
pub const BackgroundRing = struct {
    const IoRingQueueSize = 32;
    const WorkReqStack = std.atomic.Stack(Work);

    pub const CallbackFn = fn (*Work) void;

    pub const Work = struct {
        tag: enum { Read, Write },
        fd: std.os.fd_t,
        buffer: []u8,
        offset: u64,
        context: u64,
        result: struct {
            bytes: u64,
            err: ?i32,
        },
        callback: CallbackFn,
    };

    allocator: std.mem.Allocator,
    event_fd: i32,
    worker: std.Thread,
    background_error: ?anyerror,
    pending: WorkReqStack,
    running: bool,

    pub fn init(allocator: std.mem.Allocator) !*BackgroundRing {
        var self = try allocator.create(BackgroundRing);
        errdefer allocator.destroy(self);
        self.allocator = allocator;
        self.pending = WorkReqStack.init();
        self.event_fd = try std.os.eventfd(0, 0);
        self.running = true;
        self.worker = try std.Thread.spawn(.{}, background_worker_wrapper, .{self});
        return self;
    }

    pub fn deinit(self: *BackgroundRing) void {
        self.running = false;
        @fence(.Acquire);
        self.wake_worker() catch {};
        self.worker.join();
        while (self.pending.pop()) |node| {
            self.allocator.destroy(node);
        }
        std.os.close(self.event_fd);
        self.allocator.destroy(self);
    }

    pub fn submit(self: *BackgroundRing, work: Work) !void {
        var ptr = try self.allocator.create(WorkReqStack.Node);
        ptr.data = work;
        self.pending.push(ptr);
        try self.wake_worker();
    }

    fn background_worker_wrapper(self: *BackgroundRing) void {
        self.background_worker() catch |err| {
            self.background_error = err;
        };
    }

    fn background_worker(self: *BackgroundRing) !void {
        while (self.running) {
            try self.wait_for_work();
            while (self.pending.pop()) |node| {
                const bytes = switch (node.data.tag) {
                    .Read => std.os.pread(node.data.fd, node.data.buffer, node.data.offset),
                    .Write => std.os.pwrite(node.data.fd, node.data.buffer, node.data.offset),
                } catch |err| val: {
                    node.data.result.err = @errorToInt(err);
                    break :val 0;
                };
                node.data.result.bytes += bytes;
                if (bytes != 0 and bytes < node.data.buffer.len) {
                    node.data.offset += bytes;
                    node.data.buffer = node.data.buffer[bytes..];
                    self.pending.push(node); // retry...
                    continue;
                }
                node.data.callback(&node.data);
                self.allocator.destroy(node);
            }
        }
    }

    fn wake_worker(self: *BackgroundRing) !void {
        var w: u64 = 1;
        _ = try std.os.write(self.event_fd, std.mem.asBytes(&w));
    }

    fn wait_for_work(self: *BackgroundRing) !void {
        var val: u64 = undefined;
        _ = try std.os.read(self.event_fd, std.mem.asBytes(&val));
    }
};
```
The idea is simple: we use a struct to which we can submit work in an asynchronous manner. At some later point in time the work will complete, and our read or write will be done. At that point we’ll invoke the provided callback for the user. The code above is about as simple as you can manage: it spawns a dedicated thread to manage the I/O and then just issues those operations directly. To save myself some headache, I’m using an eventfd as a synchronization mechanism. I don’t strictly need it, but it will be useful down the road.
In terms of the API, I can now write the following code:
```zig
var file = try std.fs.openFileAbsolute(filename, .{ .write = true, .read = true });
defer file.close();

try ring.submit(.{
    .tag = .Read,
    .buffer = &buffer,
    .offset = 0,
    .fd = file.handle,
    .context = @ptrToInt(&waiter),
    .callback = handle_completion,
    .result = .{ .bytes = 0, .err = null },
});
```
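The snippet above references a waiter and a handle_completion callback that aren’t part of the BackgroundRing listing. Here is a minimal sketch of what they might look like; the Waiter type, the atomic flag and the polling loop are my own assumptions, not code from the pager.

```zig
// Hypothetical waiter used by the usage snippet above (not part of the pager
// code). The callback runs on the background I/O thread, so it only flips an
// atomic flag that the submitting thread polls. A real implementation would
// more likely use a futex or std.Thread.ResetEvent instead of polling.
const Waiter = struct {
    done: bool = false,
};

fn handle_completion(work: *BackgroundRing.Work) void {
    var waiter = @intToPtr(*Waiter, work.context);
    @atomicStore(bool, &waiter.done, true, .Release); // publish the completion
}

// On the submitting side, after ring.submit(...):
// while (!@atomicLoad(bool, &waiter.done, .Acquire)) {
//     std.time.sleep(1 * std.time.ns_per_ms);
// }
```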
The basic idea is that the BackgroundRing struct doesn’t manage file descriptors or buffers; it is strictly a way to manage I/O. The API is pretty piss poor in terms of usability as well. No one would ever want to write generic I/O routines using this method, but we aren’t trying to do generic I/O, we are trying to ensure usability in a very constrained manner, inside the pager.
About the only nice thing that we do in this implementation is handle partial reads and writes. If we were asked to read more than we actually got, we’ll repeat the operation until we reach the end of the file or succeed.
In terms of implementation, this is also a really bad idea. We look like we are doing async I/O, but we are actually just passing it all to a background thread that will do the work off a queue. That means it will be very hard to make full use of the hardware capabilities. But I think you can guess from the name that I’m not going to leave things like that. I’m going to use the new io_uring API in Linux to handle most of those concerns. The idea is that we’ll allocate a command buffer in the kernel and allow the kernel to handle asynchronous execution of the I/O operations. We still retain the same rough structure, however, in that we are going to have a dedicated background thread to manage the commands. Amusingly enough, the io_uring API is meant to be used from a single thread, since otherwise you’ll need to orchestrate writes to the ring buffer from multiple producers, which is much harder than a single consumer, single producer scenario.
The use of io_uring is also why we are using the eventfd model. We are registering that file descriptor in the io_uring so it will let us know when an event completes. This also does double duty as the method we can use to wake the background thread when we have more work for it to do. The biggest change is inside the background worker, of course. Here is what that looks like:
```zig
fn background_worker(self: *PagerRing) !void {
    var cqes = std.mem.zeroes([IoRingQueueSize]std.os.linux.io_uring_cqe);
    while (self.running) {
        try self.wait_for_work();
        var shouldWake: bool = false;
        while (self.pending.pop()) |node| {
            const op = switch (node.data.tag) {
                .Read => IO_Uring.read,
                .Write => IO_Uring.write,
            };
            _ = op(
                &self.ring,
                @ptrToInt(node),
                node.data.fd,
                node.data.buffer,
                node.data.offset,
            ) catch |e| switch (e) {
                error.SubmissionQueueFull => {
                    self.pending.push(node);
                    shouldWake = true;
                    break;
                },
                else => {
                    self.allocator.destroy(node);
                    return e;
                },
            };
        }
        _ = self.ring.submit() catch |err| switch (err) {
            error.CompletionQueueOvercommitted => shouldWake = true,
            error.SignalInterrupt => shouldWake = true,
            else => return err,
        };
        // now let's process the completed values
        const n = try self.ring.copy_cqes(cqes[0..], 0);
        for (cqes[0..n]) |cqe| {
            var node = @intToPtr(*WorkReqStack.Node, cqe.user_data);
            if (cqe.res > 0 and cqe.res < node.data.buffer.len) {
                // partial operation, need to resubmit
                var bytes = @intCast(usize, cqe.res);
                node.data.buffer = node.data.buffer[bytes..];
                node.data.offset += bytes;
                node.data.result.bytes += bytes;
                self.pending.push(node);
                shouldWake = true;
                continue;
            }
            defer self.allocator.destroy(node);
            if (cqe.res < 0) {
                node.data.result.err = -cqe.res;
            } else {
                node.data.result.bytes += @intCast(usize, cqe.res);
            }
            node.data.callback(&node.data);
        }
        if (shouldWake) {
            try self.wake_worker();
        }
    }
}
```
We create the ring in the init function (see full code listing below) and in the background thread we are simply waiting for an event using the eventfd. When a caller submits some work, we’ll register that on the io_uring and wait for it to complete (also using the eventfd). You can see that I’m handling some basic states (partial reads, full queues, etc). The code itself ends up being pretty small. Once we are done with the operation, we let the user know about the completion.
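The init function itself isn’t reproduced in this excerpt, so here is a rough sketch of how it might wire the eventfd into the ring, assuming the same fields and helpers as the earlier BackgroundRing version plus a ring field. IO_Uring.init and register_eventfd come from std.os.linux.IO_Uring; treat the exact shape as an assumption rather than the post’s actual listing.

```zig
pub fn init(allocator: std.mem.Allocator) !*PagerRing {
    var self = try allocator.create(PagerRing);
    errdefer allocator.destroy(self);
    self.allocator = allocator;
    self.pending = WorkReqStack.init();
    self.event_fd = try std.os.eventfd(0, 0);
    // the command buffer shared with the kernel
    self.ring = try std.os.linux.IO_Uring.init(IoRingQueueSize, 0);
    // completions signal the eventfd, which doubles as our wake-up call
    try self.ring.register_eventfd(self.event_fd);
    self.running = true;
    self.worker = try std.Thread.spawn(.{}, background_worker_wrapper, .{self});
    return self;
}
```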
There are a few things here that are interesting to note. We are actually allowing interleaving of operations, so we may have many outstanding operations at any given point. We aren’t trying to guarantee any kind of ordering between the operations, nor are we providing anything but the most bare bones interface for the caller. Even so, there is quite a bit of power in the manner in which we are working here. We need to complete a few more components first, and then we can bring it all together…
Here is the full listing of the PagerRing code. In the next post, I want to focus on the actual overall pager, when we are managing multiple files and how we work with all of them together. In particular, we want to understand how we manage the memory budget across all of them.
After writing the post about handling chunk metadata, I started thinking about the overall approach. Both the method using compressed pointers and the baseline computation felt… off to me. They were certainly workable, but they were too complex and felt fragile.
I don’t like dealing with a high level of complexity; I would rather put a lot of effort into simplifying the solution. The overall approach may be complex, but the system should be nice to work with. Usually, we can get away with a great deal of simplification if we accept some constraints on what we want to do with the system. For now, I’m going to assume the following constraints:
We are using a 64 bit OS (and can assume an effectively unlimited address space).
We want to go with a file pager (instead of the memory mapped one) because I want to be able to control the I/O behavior better.
The files we use are limited to 8 GB in size (we can use more than a single file, of course).
The last one deserves some additional words. When thinking about a storage solution, accepting a maximum size is generally a bad idea (640KB, anyone?). However, if we decide that our storage solution is going to be composed of files of a specific size, we can combine them to reach any size needed.
But why accept this limitation? Why say that a single file will not exceed 8 GB? It turns out that this has several advantages.
Let’s assume that we have a dataset that is 100GB in size. Using 8 GB files, that would be 13 files for a total of 104 GB of used disk space. Now we want to delete some of that data. What do we do with the actual used disk space? It is actually quite hard to release disk space back to the operating system if you have a single file. You might need to run compaction of the data, or use advanced APIs such as hole punching (see FALLOC_FL_PUNCH_HOLE). Advanced APIs are something I would like to avoid; it is too easy to fall into some pitfall that no one else has run into. Working with sparse files (with holes in them) also typically requires you to use dedicated tools and can be awkward. If we split the data into separate files, we can retain most of the same benefits and give ourselves a simpler environment for the user to work with.
With the 8GB limitation in place, I can choose to manage the paging in the following manner:
```zig
pub const FileChunks = struct {
    pub const MaxFileSize = 8 * 1024 * 1024 * 1024; // 8 GB
    pub const ChunkSize = 2 * 1024 * 1024; // 2 MB
    pub const MaxChunksInFile = MaxFileSize / ChunkSize; // 4096
    pub const PageSize = 8 * 1024; // 8 KB
    pub const PagesInChunk = ChunkSize / PageSize; // 256

    pub const ChunkMetadata = packed union {
        pub const Tag = enum(u2) {
            Empty = 0b00,
            Error = 0b01,
            Loading = 0b10,
            Value = 0b11,
        };

        raw: u64,
        futex: packed struct {
            value: u32, // tag & version
            references: u32, // ignored
        },
        v: packed struct {
            version: u30,
            tag: Tag,
            // references == 0 - this is unused
            // references == 1 - just the pager is holding this
            // references >= 2 - external entity is holding this
            references: u32,
        },
    };

    comptime {
        if (@sizeOf(ChunkMetadata) != @sizeOf(u64)) {
            @compileError("ChunkMetadata should be exactly 64 bits in length");
        }
    }

    chunks: [MaxChunksInFile]ChunkMetadata,
    ptr: []align(mem.page_size) u8,
    allocator: mem.Allocator,
};
```
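To make the constants concrete, here is a small illustration (not part of the original listing) of how a page number maps to a chunk and an offset inside it. For example, page 300 lives in chunk 1, at byte offset 352 KB within that chunk.

```zig
// Hypothetical helpers, just to show the layout implied by the constants.
fn pageToChunk(page: u64) u64 {
    return page / FileChunks.PagesInChunk; // 256 pages (of 8 KB) per 2 MB chunk
}

fn pageOffsetInChunk(page: u64) u64 {
    return (page % FileChunks.PagesInChunk) * FileChunks.PageSize;
}
```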
The idea is pretty simple. Instead of trying to stitch together the memory for the file, we are going to just allocate a single 8GB range of virtual memory. This can be done using the following command:
```zig
var ptr = try os.mmap(
    null,
    MaxFileSize,
    os.PROT.NONE,
    os.MAP.ANONYMOUS | os.MAP.PRIVATE,
    -1,
    0,
);
```
This reserves (but does not use) 8GB of address space, and we can now allocate ranges from it safely. This is important because if we have a request for two sequential chunks, they will reside in memory right next to one another. Note that we also don’t need to handle any pointers, since we can rely on a stable base address for the whole file. The nice thing about this is that we aren’t actually allocating memory, just reserving it.
Let’s see how that works. The chunks array is used to control references to the chunks in the file. The chunk metadata is a 64 bit value that has several responsibilities at the same time. It stores the tag of a chunk, which indicates its status (loaded, error, empty, etc.), and the number of outstanding references to the chunk. That uses up 34 bits of the value; the rest of the bits are used as a version field, which is incremented on each change. That allows us to avoid the ABA problem. The actual data, of course, is managed using the ptr value.
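The snippets below rely on a few ChunkMetadata helpers (getTag, addRef, tryUpdate) that aren’t shown in these excerpts. Here is a hedged sketch of what they plausibly look like, based on how they are used; the overflow error name is made up, and the real implementation lives in the full listing.

```zig
pub fn getTag(self: ChunkMetadata) Tag {
    return self.v.tag;
}

pub fn addRef(self: *ChunkMetadata) !void {
    if (self.v.references == std.math.maxInt(u32))
        return error.ReferenceCounterOverflow; // hypothetical error name
    self.v.references += 1;
}

// Bump the version and publish the whole 64 bit value with a single CAS.
// The version bump is what lets us sidestep the ABA problem mentioned above.
pub fn tryUpdate(self: *ChunkMetadata, origin: ChunkMetadata, modified: ChunkMetadata) bool {
    var next = modified;
    next.v.version = origin.v.version +% 1;
    return @cmpxchgWeak(u64, &self.raw, origin.raw, next.raw, .SeqCst, .SeqCst) == null;
}
```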
Here is how we can get a chunk from this struct:
```zig
pub fn tryGet(self: *FileChunks, chunk: u64) !?[]align(mem.page_size) u8 {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        switch (copy.getTag()) {
            .Empty => return null,
            .Error => return @intToError(@intCast(u16, copy.v.references)),
            .Loading => return error.ValueIsLoading,
            .Value => {},
        }
        try copy.addRef();
        if (self.chunks[chunk].tryUpdate(origin, copy)) {
            var offset = chunk * ChunkSize;
            return self.ptr[offset..(offset + ChunkSize)];
        }
    }
}
```
What we are doing here is checking that the value is loaded into memory, and if it is, we increment the reference count and then return it. This code runs in a loop, because we assume that multiple threads may run it at the same time. This handles just getting data that is already loaded. If the data isn’t loaded, what will happen? We’ll get a null back. Here is the blocking version of this method:
```zig
pub fn getBlocking(self: *FileChunks, chunk: u64, timeout: ?u64) ![]align(mem.page_size) u8 {
    while (true) {
        var maybechunk = self.tryGet(chunk) catch |e| {
            if (e == error.ValueIsLoading) {
                var copy = self.chunks[chunk];
                if (copy.getTag() == .Empty) {
                    try self.chunks[chunk].wait(copy, timeout);
                }
                continue;
            }
            return e;
        };
        if (maybechunk) |c| {
            return c;
        }
        return error.ValueIsNotLoading;
    }
}
```
Just based on those two methods, you should be able to draw some conclusions. If the value isn’t loaded, we’ll always return null, but there is also the Loading stage; in that case, we may want to wait for it. How is that going to work?
This works using two important functions: markLoading() and markLoaded(). The idea is that we’ll first call tryGet() to get a chunk; if there is no value, we need to load it from disk. At that point, remember, there may be multiple threads accessing the relevant chunk, so all of them would be competing on the markLoading() function, like so:
```zig
pub fn markLoading(self: *FileChunks, chunk: u64) !?[]align(mem.page_size) u8 {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        switch (copy.v.tag) {
            .Value => return error.ValueAlreadyExists,
            .Error => return error.ValueInErrorState,
            .Loading => return null, // already marked..
            .Empty => {},
        }
        copy.setTag(.Loading);
        if (self.chunks[chunk].tryUpdate(origin, copy)) {
            var offset = chunk * ChunkSize;
            const c = self.ptr[offset..(offset + ChunkSize)];
            _ = try os.mmap(
                c.ptr,
                ChunkSize,
                os.PROT.READ | os.PROT.WRITE,
                os.MAP.ANONYMOUS | os.MAP.PRIVATE | os.MAP.FIXED,
                -1,
                0,
            );
            return c;
        }
    }
}
```
The code itself is pretty simple: we set the tag of the chunk and try to update it optimistically. We are moving the state of the chunk from Empty to Loading in a thread safe manner. If we are successful in doing so, we know that we are the only thread that owns the loading of the chunk. Note that part of the markLoading process is to ask the OS to give us the memory for the chunk (in the range that we previously reserved).
At this point, we can load the data from disk somehow and then we’ll call the markLoaded function, which completes the process:
```zig
pub fn markLoaded(self: *FileChunks, chunk: u64) ![]align(mem.page_size) u8 {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        switch (copy.v.tag) {
            .Value => return error.ValueAlreadyExists,
            .Error => return error.ValueInErrorState,
            .Empty => return error.ValueIsNotLoading,
            .Loading => {},
        }
        copy.setTag(.Value);
        try copy.addRef(); // ownership by the pager
        try copy.addRef(); // ownership by the caller
        if (self.chunks[chunk].tryUpdate(origin, copy)) {
            return self.getLoadedChunk(chunk);
        }
    }
}
```
The idea is that we are splitting the responsibility for managing the chunks references from how we load the data to memory.
In other words, the expected usage of this struct is something like this (a sketch in code follows the list):
Call tryGet() for a page in a given chunk.
If successful, do the work you wanted to do.
If not successful, compete to be the loader for this data by calling markLoading().
If you lost, call getBlocking() to wait for the winner to get the data.
If you won, load the data from the disk somehow and call markLoaded().
Proceed to make use of the data.
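Here is that flow as a single sketch. The getChunk wrapper and the loadFromDisk placeholder are mine, not the post’s; in the real pager the disk read goes through the I/O ring, and a production caller would also retry if it loses the race between tryGet() and markLoading().

```zig
fn getChunk(chunks: *FileChunks, chunk: u64) ![]align(std.mem.page_size) u8 {
    if (try chunks.tryGet(chunk)) |buf| {
        return buf; // already resident, reference taken
    }
    if (try chunks.markLoading(chunk)) |buf| {
        // we won the race and own the Loading state, so we read from disk
        try loadFromDisk(chunk, buf); // placeholder for the actual I/O
        return try chunks.markLoaded(chunk);
    }
    // someone else won; wait for them to finish loading
    return try chunks.getBlocking(chunk, null);
}
```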
Another important aspect that we have to deal with is when we want to discard the data. Basically, if we have filled our memory budget and we need to load a value from the disk, what can we do? The answer is that we need to evict some data, but before we can do that, we need to know what data is currently in use. That is why we have the calls to addRef() and release(). We use those (with atomic operations) to track the usage of the various chunks. When we need to evict data from memory, we’ll need some sort of policy to do so. I’m deferring the actual policy to a later point in time; right now I want to discuss how we know what we can evict and how that is going to work.
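release() is mentioned here but not shown in the excerpts. A hedged sketch, mirroring the optimistic CAS loop used elsewhere, might look like this; the error name and the exact semantics are assumptions.

```zig
pub fn release(self: *FileChunks, chunk: u64) !void {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        if (copy.v.references == 0)
            return error.ReleasingUnreferencedChunk; // hypothetical error
        copy.v.references -= 1;
        if (self.chunks[chunk].tryUpdate(origin, copy))
            return;
    }
}
```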
Here is the code to handle eviction, currently implementing a policy of simple scanning (not ideal by a long shot):
```zig
fn tryClaimOwnership(self: *FileChunks, index: u64) ?ChunkMetadata {
    while (true) {
        var copy = self.chunks[index];
        if (copy.mayReclaim() == false)
            return null;
        var modified = copy.reset(.Loading); // means that we own it for the duration...
        if (self.chunks[index].tryUpdate(copy, modified))
            return modified;
    }
}

fn trySetToEmpty(self: *FileChunks, index: u64) !void {
    while (true) {
        var modified = self.chunks[index];
        if (modified.getTag() != .Loading) {
            // someone else modified it while we were releasing the memory
            return error.ValueIsNotLoading;
        }
        var released = modified.reset(.Empty);
        if (self.chunks[index].tryUpdate(modified, released))
            break;
    }
}

pub fn reclaim(self: *FileChunks) !u64 {
    var reclaimed: u64 = 0;
    var index: u64 = 0;
    while (index < self.chunks.len) : (index += 1) {
        var modified: ChunkMetadata = undefined;
        if (self.tryClaimOwnership(index)) |m| {
            // at this point, m is owned by us, and no one else can use it...
            modified = m;
        } else {
            continue;
        }
        reclaimed += ChunkSize;
        _ = try os.mmap(
            self.getLoadedChunk(index).ptr,
            ChunkSize,
            os.PROT.NONE,
            os.MAP.ANONYMOUS | os.MAP.PRIVATE | os.MAP.FIXED,
            -1,
            0,
        );
        try self.trySetToEmpty(index);
    }
    return reclaimed;
}
```
In the reclaim method, we are scanning through the chunks. To be able to reclaim a chunk, the following conditions need to hold:
The chunk holds a value.
There are no outstanding references to the chunk, only the pager is holding a reference to the chunk.
Note that in order to do this safely, we have to assume that while we are trying to reclaim a chunk, another thread may be trying to use it. This behavior complicates our lives a bit. We handle that by doing a racy update of the chunk, trying to move it to the Loading state. The idea is that the Loading state is meant to be used as a busy signal. While the chunk is in the Loading state, the rest of the system knows that it cannot use it and needs to wait. Note that this means that we have the following transitions: Empty to Loading (markLoading), Loading to Value (markLoaded), Value to Loading (reclaim takes ownership of the chunk) and Loading to Empty (the chunk’s memory was discarded).
Note that most of the code we have in the struct is there to handle concurrency from multiple threads dealing with the system at once. The actual behavior is fairly simple: we check if we can reclaim the chunk (no one is looking), we take a lock on it by trying to move its state to Loading, and then we can discard the memory by calling mmap on the chunk’s memory with PROT_NONE.
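The mayReclaim() helper isn’t shown in these excerpts; based on the two conditions above, it presumably boils down to something like this sketch:

```zig
pub fn mayReclaim(self: ChunkMetadata) bool {
    // the chunk holds a value and only the pager's own reference remains
    return self.v.tag == .Value and self.v.references == 1;
}
```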
For fun, we are using 2MB chunks because that fits well into huge pages. On a properly set up system, we can significantly reduce the paging metadata overhead inside the kernel by allocating a single 2MB page for each chunk.
You can see the entire implementation here. In the next post, I want to look into handling the I/O portion of reading the data from the disk. After that we’ll talk about how we can implement a proper eviction policy.
The topic of this post is a bug in RavenDB, a pretty serious one. The end result is that a user reported getting an error from RavenDB saying that it was unable to read a stored document. In some cases, RavenDB needs to read a document on startup, which means that it wasn’t able to start up if that document was affected.
As you can imagine, this is one of those issues that gets our full and immediate attention. The error itself gave us a lot of information:
Dictionary mismatch on Dic #375
at Voron.Data.Tables.ZstdLib.AssertSuccess(UIntPtr v, CompressionDictionary dictionary)
This is related to RavenDB’s document compression behavior. In order to get a great compression ratio from our documents, we train RavenDB on the recent documents that you have and generate a compression dictionary. The problem at hand is that the compression dictionary we have and the compression dictionary that was actually used are different. As you can see from the error, we are using zstd as the compression algorithm. When zstd generates a dictionary, it will (by default) generate an id for it that is mostly based on the xxhash64 of its content, reduced to 32 bits. You can see the relevant part here. This is pretty nice, since it means that there is a good chance that we’ll detect the wrong dictionary.
So now we know what is going on, but we don’t understand why.
When we wrote this feature, we were quite aware that we would not be able to make any sort of sense of the documents if we don’t have the right dictionary. For that reason, we store the dictionaries three times: once inside RavenDB itself and twice in ancillary files, which we can use during recovery. This sort of error should be utterly impossible. And yet, we had run into it in production, so we had to dig deeper still.
The primary suspect was the dictionary training portion. One of the things that RavenDB does on a continuous basis is measure the compression ratio of the documents. If we aren’t able to hit a good compression ratio, RavenDB will try to generate a new dictionary from the most recent documents and see if that new dictionary can do better. This can be very helpful in maintaining good compression rates. As your documents change, RavenDB will detect that, realize that it can do better, retrain on the recent data and compress even further. The problem is that this code path is also quite tricky: we first compress the document using the current dictionary, then we try generating a new dictionary and see if compressing with the new dictionary is better. If that is the case, we can install the new dictionary for future operations; otherwise, we need to discard it.
I suspected that the issue was somewhere around that area; we might not be handling the rejection of the new dictionary properly. So I went into the code and started digging, but I found absolutely nothing. The entire process is covered in tests and has been in production for close to 18 months, so this wasn’t something obvious.
After spending quite a bit of time on the issue, I decided that the code was perfect: it handled everything properly and took all the right behaviors into account.
Clearly the fault was elsewhere. Before setting out to blame the nearest cat (you can never trust those), I had an idea: what if the problem wasn’t during the training process, but afterward?
Well, that doesn’t really matter, does it? RavenDB is a transactional database; if we had a failure after the training process, we would have to discard some of the data, for sure, but that would be about it. Unless… what if we have some state that isn’t transactional? As part of looking at the compression training code, I ran into just such a scenario. Running the training to generate a new compression dictionary is an expensive proposition, so we don’t want to do that often. As such, we’ll only do that about every 1K document changes where we exceed the desired compression ratio by over 10%. How do we know to act every 1K documents? Well, we have a counter that we increment on every change. That value is incremented using Interlocked.Increment() and isn’t part of the transactional state. If the transaction is aborted, the value is still incremented. The actual value doesn’t matter, mind, only that it is moving forward, so that isn’t an issue.
I mentioned the dictionary id before, but I should clarify that this is zstd’s dictionary id. Internally, RavenDB uses a different value. That value is simply the sequence number of the dictionary: RavenDB counts the number of generated dictionaries and gives the new dictionary the next available value. That value, by the way, is part of the transaction. If we roll back a transaction, we’ll reuse the same dictionary id. But that doesn’t matter, of course.
When using compression dictionaries, we need to load them from a buffer. There is quite a bit of work involved in that: memory allocation, entropy tables to load, etc. In order to save repeated work, RavenDB caches the compression dictionaries (after all, their whole point is to be used repeatedly). That cache can be used by multiple transactions at the same time (two read transactions using the same dictionary will use the same instance).
Given all of this information, here is the sequence of events that we need to get the error in question:
The user enabled documents compression.
The user runs a transaction with at least four commands, which needs to satisfy the following conditions:
A document write as the first action.
Then a write to a document whose compression ratio exceeded the expected ratio by over 10%; as a result, RavenDB tried to train a new compression dictionary.
That dictionary had a better compression ratio and was accepted as the new default compression dictionary.
RavenDB persisted the new dictionary and used it to compress the new document.
Another command (in the same transaction) stored a document in the same collection; now RavenDB will read the new dictionary and store it in a cache.
A third command runs, but this one throws an error (such as an optimistic concurrency violation).
At this point, RavenDB will roll back the entire transaction and return the error to the user. Let’s say the user then chooses to submit the same documents again, shall we?
For the first command, we’ll again discover that the compression ratio (of the old compression dictionary) is insufficient. But we will not generate a new compression dictionary. Why is that? Remember the counter that we increment using Interlocked? That one was not rolled back, so we’ll need to wait for another 1K documents for the stars to properly align for us. That doesn’t impact correctness in any way, shape or form, however.
At this point, the stage is set, but everything is still okay. The problem will happen the next time we trigger a new dictionary. At that point, we’ll again scan the most recent documents, build a dictionary, etc. However, the dictionary id that RavenDB will use will be identical to the dictionary id that we previously discarded. The data that dictionary was trained on, however, will almost certainly be different. We persist the new dictionary to disk and everyone is happy: the new document that we wrote will use the new compression dictionary and we are perfectly fine.
The next write for this collection, however, will run into a problem. It will need to use the current (new) dictionary when making the write. In order to do that, it will load the dictionary through the cache, but there is already an entry for that dictionary id in the cache: the same dictionary that was discarded. At this point, RavenDB will start compressing documents using the in-memory dictionary while the on-disk dictionary is different.
If you try to access the document which triggered the new dictionary, you’ll get an error, but documents that were modified later will continue working with no issue. Until you restart, of course.
On restart, we’ll read the dictionary from disk, where we wrote the new dictionary; at this point, all those documents that we wrote will give us the error above. Note that the sequence of events has to be very exact: you need to have dictionary training as part of a multi-command transaction which failed after the training was successful, and then write additional documents. In a year and a half of production usage under very heavy load, that happened only a couple of times, it seems.
The issue has been fixed, of course, and we’ll be rolling the fix out to both users and cloud customers. We now roll back such in-memory state on a transaction rollback as well, avoiding this issue entirely. It is amazing to me that despite very careful planning, it wasn’t the code itself that caused the problem, but a sequence of independent operations and failure modes that we had never even considered.
FizzBuzz is a well known test to show that you can program. To be rather more exact, it is a simple test that does not tell you if you can program well, but if you cannot do FizzBuzz, you cannot program. This is a fail-only kind of metric. We need it because, sadly, we see people who fail FizzBuzz coming to interviews.
I have another test, which I feel is simpler than FizzBuzz and which can significantly narrow the field of candidates. I show them this code and ask them to analyze what is going on here:
```csharp
public class ControllerBase : Controller
{
    public static bool IsAdminUser;
}
```
Acceptable answers include puking, taking a few moments to breathe into a paper bag and mild to moderate professional swearing.
This is something that I actually ran into (about 15 years ago, in the WebForms days), and I have used it ever since. It is a great way to measure just how much a candidate knows about the environment in which they operate: a static field on a controller is shared across every request and every user.
I ran into the following code during a code review and had an immediate and visceral reaction.
```csharp
private readonly List<string> _messages;

public IReadOnlyList<string> Messages
{
    get
    {
        lock (this)
        {
            return _messages;
        }
    }
}
```
This is a (bad) attempt to add thread safety. You are handing out the value through a read-only interface, but it is still the same mutable instance underneath, so a caller can observe the list while it is being mutated, outside the lock.
The proper way to handle this is to copy the list (under the lock) and return that distinct copy, for example by returning new List<string>(_messages) while holding the lock.
The end of the year is closing fast, and I ran into the following metric (below). What you can see here is one of our RavenDB production instances over the past year. We are continuously dogfooding our own software, and there is a clear indication of the results. What you can see is the total memory used by RavenDB (production load, fairly constant over time) for the past year. As we update RavenDB, we benefit from various optimizations, and the trend line is very encouraging. Around August, we had a change that saved us a single allocation in some cases; you can see the impact it had. We also started using a new feature in production around December, and that seemed to have an additional memory cost, so we optimized that as well. You can see the new build deployed around the 17th of the month.