One of my favorite activities is teaching. I love explaining how things work and passing on knowledge. Another good way to pass the time is to learn, which can be a source of great joy and incredible frustration.
Recently I had a conversation about databases with a university student in the Computer Science track. As you might imagine, that is a topic that is near and dear to my heart, so I was very interested in hearing what kind of databases the student had been exposed to and what impression they had of them.
The student in question was familiar with MySQL from his courses and had run into PostgreSQL in some job interviews. He was very much in the PostgreSQL camp. That isn’t an uncommon reaction, but his reason for it was interesting to me. In his course, they had to set up and configure MySQL from scratch. That was annoying and hard, especially since they used the command line to make queries to the database.
In the PostgreSQL case, however, he was given access to a web front end to the database and was assigned tasks to complete. He could get started right away on the most important aspects of the task.
When I’m teaching, part of the job is the data dump (here are the things you need to know), part of it is answering questions, making sure that the students understand the material, etc. A crucial part of teaching is getting the students to do things, making sure that they are actually exercising the new knowledge. In such cases, I have noticed that I provide them with the baseline and they need to complete just the parts that are missing, the core pieces.
That is pretty much the same thing that the student ran into during their interview.
In retrospect, for teaching, I think that this approach is a serious issue.
One of the most important hurdles that I see for new developers is their ability to deal with problems. Whether it is actually reading the errors, composing some mental model for what is going on, or just being able to dig deeper and understand what is happening. I’m not talking about experience, mind, I’m talking about the approach. If most of the tasks that they have dealt with so far were of the “fill in the missing pieces” variety, they have likely never had the experience of dealing with everything else. And in many cases, the issues aren’t in the areas you are thinking of; they can be somewhere else completely.
I remember many times when I had to do something and ran into a wall. That was incredibly frustrating, especially when the issue was somewhere completely orthogonal to what I was trying to do. A great example recently was having to figure out how to do cross compilation in a GitHub Action using GCC. That took a lot of time, and all I wanted to do was call a single function from native code.
As frustrating as that is, I think that there is a huge amount of value in those kinds of “side quests”. That is especially true when someone is in the early stages of their career. Those are just the sort of hurdles that can teach you not only what is going on at your level but about the ecosystem in general and the layers beneath you.
A great example of the lack of such knowledge was a candidate in a recent interview who was asked: “Given a static HTML page and a domain name that was registered, what do you need to set up a website for that page on that domain?” The candidate had no idea what was actually involved (nothing about DNS, routing, servers, etc). They were likely able to write an application using modern practices, but everything about pushing to production, what a website is, what a domain name or an IP is… nope.
And that makes sense; they had never even run into something like that.
On the other hand, I remember building websites using FTP and Apache / IIS in the 90s. It wasn’t fun, but I had very little abstraction to deal with and was exposed to the workings of the engine.
And that sort of thing matters, because you will need to understand details such as DNS propagation times and their impact on what your system is doing, for example.
After implementing the memory management in the previous post, I set out to handle the actual I/O primitives that we need. As a reminder, we are separating the concerns here. We managed memory and reference counting in the previous post and now I want to focus on how we can read and write from the disk in as efficient a manner as possible. Before we get to the guts of the code, I want to explain a bit about the environment that I have in mind. Most of the time, the pager is able to provide the requested page from memory directly. If it can’t do that, it needs to consider the fact that there may be multiple threads that are trying to load that page. At the same time, while we are loading the page, we want to be free to do other things as well.
I decided to implement the I/O routine using async I/O. Here is the rough sketch of the API I have in mind:
```zig
pub const BackgroundRing = struct {
    const IoRingQueueSize = 32;
    const WorkReqStack = std.atomic.Stack(Work);

    pub const CallbackFn = fn (*Work) void;

    pub const Work = struct {
        tag: enum { Read, Write },
        fd: std.os.fd_t,
        buffer: []u8,
        offset: u64,
        context: u64,
        result: struct {
            bytes: u64,
            err: ?i32,
        },
        callback: CallbackFn,
    };

    allocator: std.mem.Allocator,
    event_fd: i32,
    worker: std.Thread,
    background_error: ?anyerror,
    pending: WorkReqStack,
    running: bool,

    pub fn init(allocator: std.mem.Allocator) !*BackgroundRing {
        var self = try allocator.create(BackgroundRing);
        errdefer allocator.destroy(self);
        self.allocator = allocator;
        self.pending = WorkReqStack.init();
        self.event_fd = try std.os.eventfd(0, 0);
        self.running = true;
        self.worker = try std.Thread.spawn(.{}, background_worker_wrapper, .{self});
        return self;
    }

    pub fn deinit(self: *BackgroundRing) void {
        self.running = false;
        @fence(.Acquire);
        self.wake_worker() catch {};
        self.worker.join();
        while (self.pending.pop()) |node| {
            self.allocator.destroy(node);
        }
        std.os.close(self.event_fd);
        self.allocator.destroy(self);
    }

    pub fn submit(self: *BackgroundRing, work: Work) !void {
        var ptr = try self.allocator.create(WorkReqStack.Node);
        ptr.data = work;
        self.pending.push(ptr);
        try self.wake_worker();
    }

    fn background_worker_wrapper(self: *BackgroundRing) void {
        self.background_worker() catch |err| {
            self.background_error = err;
        };
    }

    fn background_worker(self: *BackgroundRing) !void {
        while (self.running) {
            try self.wait_for_work();
            while (self.pending.pop()) |node| {
                const bytes = switch (node.data.tag) {
                    .Read => std.os.pread(node.data.fd, node.data.buffer, node.data.offset),
                    .Write => std.os.pwrite(node.data.fd, node.data.buffer, node.data.offset),
                } catch |err| val: {
                    node.data.result.err = @errorToInt(err);
                    break :val 0;
                };
                node.data.result.bytes += bytes;
                if (bytes != 0 and bytes < node.data.buffer.len) {
                    node.data.offset += bytes;
                    node.data.buffer = node.data.buffer[bytes..];
                    self.pending.push(node); // retry...
                    continue;
                }
                node.data.callback(&node.data);
                self.allocator.destroy(node);
            }
        }
    }

    fn wake_worker(self: *BackgroundRing) !void {
        var w: u64 = 1;
        _ = try std.os.write(self.event_fd, std.mem.asBytes(&w));
    }

    fn wait_for_work(self: *BackgroundRing) !void {
        var val: u64 = undefined;
        _ = try std.os.read(self.event_fd, std.mem.asBytes(&val));
    }
};
```
The idea is simple: we use a struct to which we can submit work in an asynchronous manner. At some later point in time the work will complete, and our read or write will be done. At that point we’ll invoke the provided callback for the user. The code above is about as simple as you can manage: it spawns a dedicated thread to manage the I/O and then just issues those operations directly. To save myself some headache, I’m using an eventfd as a synchronization mechanism. I don’t strictly need it, but it will be useful down the road.
In terms of the API, I can now write the following code:
```zig
var file = try std.fs.openFileAbsolute(filename, .{ .write = true, .read = true });
defer file.close();

try ring.submit(.{
    .tag = .Read,
    .buffer = &buffer,
    .offset = 0,
    .fd = file.handle,
    .context = @ptrToInt(&waiter),
    .callback = handle_completion,
    .result = .{ .bytes = 0, .err = null },
});
```
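The snippet above references a waiter and a handle_completion callback that aren’t part of the BackgroundRing listing. Here is a minimal sketch of what they might look like; the Waiter type, the atomic flag and the polling loop are my own assumptions, not code from the pager.

```zig
// Hypothetical waiter used by the usage snippet above (not part of the pager
// code). The callback runs on the background I/O thread, so it only flips an
// atomic flag that the submitting thread polls. A real implementation would
// more likely use a futex or std.Thread.ResetEvent instead of polling.
const Waiter = struct {
    done: bool = false,
};

fn handle_completion(work: *BackgroundRing.Work) void {
    var waiter = @intToPtr(*Waiter, work.context);
    @atomicStore(bool, &waiter.done, true, .Release); // publish the completion
}

// On the submitting side, after ring.submit(...):
// while (!@atomicLoad(bool, &waiter.done, .Acquire)) {
//     std.time.sleep(1 * std.time.ns_per_ms);
// }
```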
The basic idea is that the BackgroundRing struct doesn’t manage file descriptors or buffers; it is strictly a way to manage I/O. The API is pretty piss poor in terms of usability as well. No one would ever want to write generic I/O routines using this method, but we aren’t trying to do generic I/O, we are trying to ensure usability in a very constrained manner, inside the pager.
About the only nice thing that we do in this implementation is handle partial reads and writes. If we were asked to read more than we actually got, we’ll repeat the operation until we reach the end of the file or succeed.
In terms of implementation, this is also a really bad idea. We look like we are doing async I/O, but we are actually just passing it all to a background thread that will do the work off a queue. That means it will be very hard to make full use of the hardware capabilities. But I think you can guess from the name that I’m not going to leave things like that. I’m going to use the new io_uring API in Linux to handle most of those concerns. The idea is that we’ll allocate a command buffer in the kernel and allow the kernel to handle asynchronous execution of the I/O operations. We still retain the same rough structure, however, in that we are going to have a dedicated background thread to manage the commands. Amusingly enough, the io_uring API is meant to be used from a single thread, since otherwise you’ll need to orchestrate writes to the ring buffer from multiple producers, which is much harder than a single consumer, single producer scenario.
The use of io_uring is also why we are using the eventfd model. We are registering that file descriptor in the io_uring so it will let us know when an event completes. This also does double duty as the method we can use to wake the background thread when we have more work for it to do. The biggest change is inside the background worker, of course. Here is what that looks like:
```zig
fn background_worker(self: *PagerRing) !void {
    var cqes = std.mem.zeroes([IoRingQueueSize]std.os.linux.io_uring_cqe);
    while (self.running) {
        try self.wait_for_work();
        var shouldWake: bool = false;
        while (self.pending.pop()) |node| {
            const op = switch (node.data.tag) {
                .Read => IO_Uring.read,
                .Write => IO_Uring.write,
            };
            _ = op(
                &self.ring,
                @ptrToInt(node),
                node.data.fd,
                node.data.buffer,
                node.data.offset,
            ) catch |e| switch (e) {
                error.SubmissionQueueFull => {
                    self.pending.push(node);
                    shouldWake = true;
                    break;
                },
                else => {
                    self.allocator.destroy(node);
                    return e;
                },
            };
        }
        _ = self.ring.submit() catch |err| switch (err) {
            error.CompletionQueueOvercommitted => shouldWake = true,
            error.SignalInterrupt => shouldWake = true,
            else => return err,
        };
        // now let's process the completed values
        const n = try self.ring.copy_cqes(cqes[0..], 0);
        for (cqes[0..n]) |cqe| {
            var node = @intToPtr(*WorkReqStack.Node, cqe.user_data);
            if (cqe.res > 0 and cqe.res < node.data.buffer.len) {
                // partial operation, need to resubmit
                var bytes = @intCast(usize, cqe.res);
                node.data.buffer = node.data.buffer[bytes..];
                node.data.offset += bytes;
                node.data.result.bytes += bytes;
                self.pending.push(node);
                shouldWake = true;
                continue;
            }
            defer self.allocator.destroy(node);
            if (cqe.res < 0) {
                node.data.result.err = -cqe.res;
            } else {
                node.data.result.bytes += @intCast(usize, cqe.res);
            }
            node.data.callback(&node.data);
        }
        if (shouldWake) {
            try self.wake_worker();
        }
    }
}
```
We create the ring in the init function (see full code listing below) and in the background thread we are simply waiting for an event using the eventfd. When a caller submits some work, we’ll register that on the io_uring and wait for it to complete (also using the eventfd). You can see that I’m handling some basic states (partial reads, full queues, etc). The code itself ends up being pretty small. Once we are done with the operation, we let the user know about the completion.
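The init function itself isn’t reproduced in this excerpt, so here is a rough sketch of how it might wire the eventfd into the ring, assuming the same fields and helpers as the earlier BackgroundRing version plus a ring field. IO_Uring.init and register_eventfd come from std.os.linux.IO_Uring; treat the exact shape as an assumption rather than the post’s actual listing.

```zig
pub fn init(allocator: std.mem.Allocator) !*PagerRing {
    var self = try allocator.create(PagerRing);
    errdefer allocator.destroy(self);
    self.allocator = allocator;
    self.pending = WorkReqStack.init();
    self.event_fd = try std.os.eventfd(0, 0);
    // the command buffer shared with the kernel
    self.ring = try std.os.linux.IO_Uring.init(IoRingQueueSize, 0);
    // completions signal the eventfd, which doubles as our wake-up call
    try self.ring.register_eventfd(self.event_fd);
    self.running = true;
    self.worker = try std.Thread.spawn(.{}, background_worker_wrapper, .{self});
    return self;
}
```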
There are a few things here that are interesting to note. We are actually allowing interleaving of operations, so we may have many outstanding operations at any given point. We aren’t trying to guarantee any kind of ordering between the operations, nor are we providing anything but the most bare bones interface for the caller. Even so, there is quite a bit of power in the manner in which we are working here. We need to complete a few more components first, and then we can bring it all together…
Here is the full listing of the PagerRing code. In the next post, I want to focus on the actual overall pager, when we are managing multiple files and how we work with all of them together. In particular, we want to understand how we manage the memory budget across all of them.
After writing the post about handling chunk metadata, I started thinking about the overall approach. Both the method using compressed pointers and the baseline computation felt… off to me. They were certainly workable, but they were too complex and felt fragile.
I don’t like dealing with a high level of complexity; I would rather put a lot of effort into simplifying the solution. The overall approach may be complex, but the system should be nice to work with. Usually, we can get away with a great deal of simplification if we accept some constraints on what we want to do with the system. For now, I’m going to assume the following constraints:
We are using a 64 bit OS (and can assume an effectively unlimited address space).
We want to go with a file pager (instead of the memory mapped one) because I want to be able to control the I/O behavior better.
The files we use are limited to 8 GB in size (we can use more than a single file, of course).
The last one deserves some additional words. When thinking about a storage solution, accepting a maximum size is generally a bad idea (640KB, anyone?). However, if we decide that our storage solution is going to be composed of files of a specific size, we can combine them to reach any size needed.
But why accept this limitation? Why say that a single file will not exceed 8 GB? It turns out that this has several advantages.
Let’s assume that we have a dataset that is 100GB in size. Using 8 GB files, that would be 13 files for a total of 104 GB of used disk space. Now we want to delete some of that data. What do we do with the actual used disk space? It is actually quite hard to release disk space back to the operating system if you have a single file. You might need to run compaction of the data, or use advanced APIs such as hole punching (see FALLOC_FL_PUNCH_HOLE). Advanced APIs are something I would like to avoid; it is too easy to fall into some pitfall that no one else has run into. Working with sparse files (with holes in them) also typically requires you to use dedicated tools and can be awkward. If we split the data into separate files, we can retain most of the same benefits and give ourselves a simpler environment for the user to work with.
With the 8GB limitation in place, I can choose to manage the paging in the following manner:
```zig
pub const FileChunks = struct {
    pub const MaxFileSize = 8 * 1024 * 1024 * 1024; // 8 GB
    pub const ChunkSize = 2 * 1024 * 1024; // 2 MB
    pub const MaxChunksInFile = MaxFileSize / ChunkSize; // 4096
    pub const PageSize = 8 * 1024; // 8 KB
    pub const PagesInChunk = ChunkSize / PageSize; // 256

    pub const ChunkMetadata = packed union {
        pub const Tag = enum(u2) {
            Empty = 0b00,
            Error = 0b01,
            Loading = 0b10,
            Value = 0b11,
        };

        raw: u64,
        futex: packed struct {
            value: u32, // tag & version
            references: u32, // ignored
        },
        v: packed struct {
            version: u30,
            tag: Tag,
            // references == 0 - this is unused
            // references == 1 - just the pager is holding this
            // references >= 2 - external entity is holding this
            references: u32,
        },
    };

    comptime {
        if (@sizeOf(ChunkMetadata) != @sizeOf(u64)) {
            @compileError("ChunkMetadata should be exactly 64 bits in length");
        }
    }

    chunks: [MaxChunksInFile]ChunkMetadata,
    ptr: []align(mem.page_size) u8,
    allocator: mem.Allocator,
};
```
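To make the constants concrete, here is a small illustration (not part of the original listing) of how a page number maps to a chunk and an offset inside it. For example, page 300 lives in chunk 1, at byte offset 352 KB within that chunk.

```zig
// Hypothetical helpers, just to show the layout implied by the constants.
fn pageToChunk(page: u64) u64 {
    return page / FileChunks.PagesInChunk; // 256 pages (of 8 KB) per 2 MB chunk
}

fn pageOffsetInChunk(page: u64) u64 {
    return (page % FileChunks.PagesInChunk) * FileChunks.PageSize;
}
```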
The idea is pretty simple. Instead of trying to stitch together the memory for the file, we are going to just allocate a single 8GB range of virtual memory. This can be done using the following command:
```zig
var ptr = try os.mmap(
    null,
    MaxFileSize,
    os.PROT.NONE,
    os.MAP.ANONYMOUS | os.MAP.PRIVATE,
    -1,
    0,
);
```
This reserves (but does not use) 8GB of address space, and we can now allocate ranges from it safely. This is important because if we have a request for two sequential chunks, they will reside in memory right next to one another. Note that we also don’t need to handle any pointers, since we can rely on a stable base address for the whole file. The nice thing about this is that we aren’t actually allocating memory, just reserving it.
Let’s see how that works. The chunks array is used to control references to the chunks in the file. The chunk metadata is a 64 bit value that has several responsibilities at the same time. It stores the tag of a chunk, which indicates its status (loaded, error, empty, etc.), and the number of outstanding references to the chunk. That uses up 34 bits of the value; the rest of the bits are used as a version field, which is incremented on each change. That allows us to avoid the ABA problem. The actual data, of course, is managed using the ptr value.
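The snippets below rely on a few ChunkMetadata helpers (getTag, addRef, tryUpdate) that aren’t shown in these excerpts. Here is a hedged sketch of what they plausibly look like, based on how they are used; the overflow error name is made up, and the real implementation lives in the full listing.

```zig
pub fn getTag(self: ChunkMetadata) Tag {
    return self.v.tag;
}

pub fn addRef(self: *ChunkMetadata) !void {
    if (self.v.references == std.math.maxInt(u32))
        return error.ReferenceCounterOverflow; // hypothetical error name
    self.v.references += 1;
}

// Bump the version and publish the whole 64 bit value with a single CAS.
// The version bump is what lets us sidestep the ABA problem mentioned above.
pub fn tryUpdate(self: *ChunkMetadata, origin: ChunkMetadata, modified: ChunkMetadata) bool {
    var next = modified;
    next.v.version = origin.v.version +% 1;
    return @cmpxchgWeak(u64, &self.raw, origin.raw, next.raw, .SeqCst, .SeqCst) == null;
}
```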
Here is how we can get a chunk from this struct:
```zig
pub fn tryGet(self: *FileChunks, chunk: u64) !?[]align(mem.page_size) u8 {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        switch (copy.getTag()) {
            .Empty => return null,
            .Error => return @intToError(@intCast(u16, copy.v.references)),
            .Loading => return error.ValueIsLoading,
            .Value => {},
        }
        try copy.addRef();
        if (self.chunks[chunk].tryUpdate(origin, copy)) {
            var offset = chunk * ChunkSize;
            return self.ptr[offset..(offset + ChunkSize)];
        }
    }
}
```
What we are doing here is checking that the value is loaded into memory, and if it is, we increment the reference count and then return it. This code runs in a loop, because we assume that multiple threads may run it at the same time. This handles just getting data that is already loaded. If the data isn’t loaded, what will happen? We’ll get a null back. Here is the blocking version of this method:
```zig
pub fn getBlocking(self: *FileChunks, chunk: u64, timeout: ?u64) ![]align(mem.page_size) u8 {
    while (true) {
        var maybechunk = self.tryGet(chunk) catch |e| {
            if (e == error.ValueIsLoading) {
                var copy = self.chunks[chunk];
                if (copy.getTag() == .Empty) {
                    try self.chunks[chunk].wait(copy, timeout);
                }
                continue;
            }
            return e;
        };
        if (maybechunk) |c| {
            return c;
        }
        return error.ValueIsNotLoading;
    }
}
```
Just based on those two methods, you should be able to draw some conclusions. If the value isn’t loaded, we’ll always return null, but there is also the Loading stage; in that case, we may want to wait for it. How is that going to work?
This works using two important functions: markLoading() and markLoaded(). The idea is that we’ll first call tryGet() to get a chunk; if there is no value, we need to load it from disk. At that point, remember, there may be multiple threads accessing the relevant chunk, so all of them would be competing on the markLoading() function, like so:
```zig
pub fn markLoading(self: *FileChunks, chunk: u64) !?[]align(mem.page_size) u8 {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        switch (copy.v.tag) {
            .Value => return error.ValueAlreadyExists,
            .Error => return error.ValueInErrorState,
            .Loading => return null, // already marked..
            .Empty => {},
        }
        copy.setTag(.Loading);
        if (self.chunks[chunk].tryUpdate(origin, copy)) {
            var offset = chunk * ChunkSize;
            const c = self.ptr[offset..(offset + ChunkSize)];
            _ = try os.mmap(
                c.ptr,
                ChunkSize,
                os.PROT.READ | os.PROT.WRITE,
                os.MAP.ANONYMOUS | os.MAP.PRIVATE | os.MAP.FIXED,
                -1,
                0,
            );
            return c;
        }
    }
}
```
The code itself is pretty simple: we set the tag of the chunk and try to update it optimistically. We are moving the state of the chunk from Empty to Loading in a thread safe manner. If we are successful in doing so, we know that we are the only thread that owns the loading of the chunk. Note that part of the markLoading process is to ask the OS to give us the memory for the chunk (in the range that we previously reserved).
At this point, we can load the data from disk somehow and then we’ll call the markLoaded function, which completes the process:
```zig
pub fn markLoaded(self: *FileChunks, chunk: u64) ![]align(mem.page_size) u8 {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        switch (copy.v.tag) {
            .Value => return error.ValueAlreadyExists,
            .Error => return error.ValueInErrorState,
            .Empty => return error.ValueIsNotLoading,
            .Loading => {},
        }
        copy.setTag(.Value);
        try copy.addRef(); // ownership by the pager
        try copy.addRef(); // ownership by the caller
        if (self.chunks[chunk].tryUpdate(origin, copy)) {
            return self.getLoadedChunk(chunk);
        }
    }
}
```
The idea is that we are splitting the responsibility for managing the chunks references from how we load the data to memory.
In other words, the expected usage of this struct is something like this (a sketch in code follows the list):
Call tryGet() for a page in a given chunk.
If successful, do the work you wanted to do.
If not successful, compete to be the loader for this data by calling markLoading().
If you lost, call getBlocking() to wait for the winner to get the data.
If you won, load the data from the disk somehow and call markLoaded().
Proceed to make use of the data.
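Here is that flow as a single sketch. The getChunk wrapper and the loadFromDisk placeholder are mine, not the post’s; in the real pager the disk read goes through the I/O ring, and a production caller would also retry if it loses the race between tryGet() and markLoading().

```zig
fn getChunk(chunks: *FileChunks, chunk: u64) ![]align(std.mem.page_size) u8 {
    if (try chunks.tryGet(chunk)) |buf| {
        return buf; // already resident, reference taken
    }
    if (try chunks.markLoading(chunk)) |buf| {
        // we won the race and own the Loading state, so we read from disk
        try loadFromDisk(chunk, buf); // placeholder for the actual I/O
        return try chunks.markLoaded(chunk);
    }
    // someone else won; wait for them to finish loading
    return try chunks.getBlocking(chunk, null);
}
```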
Another important aspect that we have to deal with is when we want to discard the data. Basically, if we have filled our memory budget and we need to load a value from the disk, what can we do? The answer is that we need to evict some data, but before we can do that, we need to know what data is currently in use. That is why we have the calls to addRef() and release(). We use those (with atomic operations) to track the usage of the various chunks. When we need to evict data from memory, we’ll need some sort of policy to do so. I’m deferring the actual policy to a later point in time; right now I want to discuss how we know what we can evict and how that is going to work.
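release() is mentioned here but not shown in the excerpts. A hedged sketch, mirroring the optimistic CAS loop used elsewhere, might look like this; the error name and the exact semantics are assumptions.

```zig
pub fn release(self: *FileChunks, chunk: u64) !void {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        if (copy.v.references == 0)
            return error.ReleasingUnreferencedChunk; // hypothetical error
        copy.v.references -= 1;
        if (self.chunks[chunk].tryUpdate(origin, copy))
            return;
    }
}
```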
Here is the code to handle eviction, currently implementing a policy of simple scanning (not ideal by a long shot):
```zig
fn tryClaimOwnership(self: *FileChunks, index: u64) ?ChunkMetadata {
    while (true) {
        var copy = self.chunks[index];
        if (copy.mayReclaim() == false)
            return null;
        var modified = copy.reset(.Loading); // means that we own it for the duration...
        if (self.chunks[index].tryUpdate(copy, modified))
            return modified;
    }
}

fn trySetToEmpty(self: *FileChunks, index: u64) !void {
    while (true) {
        var modified = self.chunks[index];
        if (modified.getTag() != .Loading) {
            // someone else modified it while we were releasing the memory
            return error.ValueIsNotLoading;
        }
        var released = modified.reset(.Empty);
        if (self.chunks[index].tryUpdate(modified, released))
            break;
    }
}

pub fn reclaim(self: *FileChunks) !u64 {
    var reclaimed: u64 = 0;
    var index: u64 = 0;
    while (index < self.chunks.len) : (index += 1) {
        var modified: ChunkMetadata = undefined;
        if (self.tryClaimOwnership(index)) |m| {
            // at this point, m is owned by us, and no one else can use it...
            modified = m;
        } else {
            continue;
        }
        reclaimed += ChunkSize;
        _ = try os.mmap(
            self.getLoadedChunk(index).ptr,
            ChunkSize,
            os.PROT.NONE,
            os.MAP.ANONYMOUS | os.MAP.PRIVATE | os.MAP.FIXED,
            -1,
            0,
        );
        try self.trySetToEmpty(index);
    }
    return reclaimed;
}
```
In the reclaim method, we are scanning through the chunks. To be able to reclaim a chunk, the following conditions need to hold:
The chunk holds a value.
There are no outstanding references to the chunk, only the pager is holding a reference to the chunk.
Note that in order to do this safely, we have to assume that while we are trying to reclaim a chunk, another thread may be trying to use it. This behavior complicates our lives a bit. We handle that by doing a racy update of the chunk, trying to move it to the Loading state. The idea is that the Loading state is meant to be used as a busy signal. While the chunk is in the Loading state, the rest of the system knows that it cannot use it and needs to wait. Note that this means that we have the following transitions: Empty to Loading (markLoading), Loading to Value (markLoaded), Value to Loading (reclaim takes ownership of the chunk) and Loading to Empty (the chunk’s memory was discarded).
Note that most of the code we have in the struct is there to handle concurrency from multiple threads dealing with the system at once. The actual behavior is fairly simple: we check if we can reclaim the chunk (no one is looking), we take a lock on it by trying to move its state to Loading, and then we can discard the memory by calling mmap on the chunk’s memory with PROT_NONE.
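The mayReclaim() helper isn’t shown in these excerpts; based on the two conditions above, it presumably boils down to something like this sketch:

```zig
pub fn mayReclaim(self: ChunkMetadata) bool {
    // the chunk holds a value and only the pager's own reference remains
    return self.v.tag == .Value and self.v.references == 1;
}
```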
For fun, we are using 2MB chunks because that fits well into huge pages. On a properly set up system, we can significantly reduce the paging metadata overhead inside the kernel by allocating a single 2MB page for each chunk.
You can see the entire implementation here. In the next post, I want to look into handling the I/O portion of reading the data from the disk. After that we’ll talk about how we can implement a proper eviction policy.
The topic of this post is a bug in RavenDB, a pretty serious one. The end result is that a user reported getting an error from RavenDB saying that it was unable to read a stored document. In some cases, RavenDB needs to read a document on startup, which means that it wasn’t able to start up if that document was affected.
As you can imagine, this is one of those issues that gets our full and immediate attention. The error itself gave us a lot of information:
Dictionary mismatch on Dic #375
at Voron.Data.Tables.ZstdLib.AssertSuccess(UIntPtr v, CompressionDictionary dictionary)
This is related to RavenDB’s document compression behavior. In order to get a great compression ratio from our documents, we train RavenDB on the recent documents that you have and generate a compression dictionary. The problem at hand is that the compression dictionary we have and the compression dictionary that was actually used are different. As you can see from the error, we are using zstd as the compression algorithm. When zstd generates a dictionary, it will (by default) generate an id for it that is mostly based on the xxhash64 of its content, reduced to 32 bits. You can see the relevant part here. This is pretty nice, since it means that there is a good chance that we’ll detect the wrong dictionary.
So now we know what is going on, but we don’t understand why.
When we wrote this feature, we were quite aware that we would not be able to make any sort of sense of the documents if we don’t have the right dictionary. For that reason, we store the dictionaries three times: once inside RavenDB itself and twice in ancillary files, which we can use during recovery. This sort of error should be utterly impossible. And yet, we had run into it in production, so we had to dig deeper still.
The primary suspect was the dictionary training portion. One of the things that RavenDB does on a continuous basis is measure the compression ratio of the documents. If we aren’t able to hit a good compression ratio, RavenDB will try to generate a new dictionary from the most recent documents and see if that new dictionary can do better. This can be very helpful in maintaining good compression rates. As your documents change, RavenDB will detect that, realize that it can do better, retrain on the recent data and compress even further. The problem is that this code path is also quite tricky: we first compress the document using the current dictionary, then we try generating a new dictionary and see if compressing with the new dictionary is better. If that is the case, we can install the new dictionary for future operations; otherwise, we need to discard it.
I suspected that the issue was somewhere around that area; we might not be handling the rejection of the new dictionary properly. So I went into the code and started digging, but I found absolutely nothing. The entire process is covered in tests and has been in production for close to 18 months, so this wasn’t something obvious.
After spending quite a bit of time on the issue, I decided that the code was perfect: it handled everything properly and took all the right behaviors into account.
Clearly the fault was elsewhere. Before setting out to blame the nearest cat (you can never trust those), I had an idea: what if the problem wasn’t during the training process, but afterward?
Well, that doesn’t really matter, does it? RavenDB is a transactional database; if we had a failure after the training process, we would have to discard some of the data, for sure, but that would be about it. Unless… what if we have some state that isn’t transactional? As part of looking at the compression training code, I ran into just such a scenario. Running the training to generate a new compression dictionary is an expensive proposition, so we don’t want to do that often. As such, we’ll only do that about every 1K document changes where we exceed the desired compression ratio by over 10%. How do we know to act every 1K documents? Well, we have a counter that we increment on every change. That value is incremented using Interlocked.Increment() and isn’t part of the transactional state. If the transaction is aborted, the value is still incremented. The actual value doesn’t matter, mind, only that it is moving forward, so that isn’t an issue.
I mentioned the dictionary id before, but I should clarify that this is zstd’s dictionary id. Internally, RavenDB uses a different value. That value is simply the sequence number of the dictionary: RavenDB counts the number of generated dictionaries and gives the new dictionary the next available value. That value, by the way, is part of the transaction. If we roll back a transaction, we’ll reuse the same dictionary id. But that doesn’t matter, of course.
When using compression dictionaries, we need to load them from a buffer. There is quite a bit of work involved in that: memory allocation, entropy tables to load, etc. In order to save repeated work, RavenDB caches the compression dictionaries (after all, their whole point is to be used repeatedly). That cache can be used by multiple transactions at the same time (two read transactions using the same dictionary will use the same instance).
Given all of this information, here is the sequence of events that we need to get the error in question:
The user enabled documents compression.
The user runs a transaction with at least four commands, which needs to satisfy the following conditions:
A document write as the first action.
Then a write to a document whose compression ratio exceeded the expected ratio by over 10%; as a result, RavenDB tried to train a new compression dictionary.
That dictionary had a better compression ratio and was accepted as the new default compression dictionary.
RavenDB persisted the new dictionary and used it to compress the new document.
Another command (in the same transaction) stored a document in the same collection; now RavenDB will read the new dictionary and store it in a cache.
A third command runs, but this one throws an error (such as an optimistic concurrency violation).
At this point, RavenDB will roll back the entire transaction and return the error to the user. Let’s say the user then chooses to submit the same documents again, shall we?
For the first command, we’ll again discover that the compression ratio (of the old compression dictionary) is insufficient. But we will not generate a new compression dictionary. Why is that? Remember the counter that we increment using Interlocked? That one was not rolled back, so we’ll need to wait for another 1K documents for the stars to properly align for us. That doesn’t impact correctness in any way, shape or form, however.
At this point, the stage is set, but everything is still okay. The problem will happen the next time we trigger a new dictionary. At that point, we’ll again scan the most recent documents, build a dictionary, etc. However, the dictionary id that RavenDB will use will be identical to the dictionary id that we previously discarded. The data that dictionary was trained on, however, will almost certainly be different. We persist the new dictionary to disk and everyone is happy: the new document that we wrote will use the new compression dictionary and we are perfectly fine.
The next write for this collection, however, will run into a problem. It will need to use the current (new) dictionary when making the write. In order to do that, it will load the dictionary through the cache, but there is already an entry for that dictionary id in the cache: the same dictionary that was discarded. At this point, RavenDB will start compressing documents using the in-memory dictionary while the on-disk dictionary is different.
If you try to access the document which triggered the new dictionary, you’ll get an error, but documents that were modified later will continue working with no issue. Until you restart, of course.
On restart, we’ll read the dictionary from disk, where we wrote the new dictionary; at this point, all those documents that we wrote will give us the error above. Note that the sequence of events has to be very exact: you need to have dictionary training as part of a multi-command transaction which failed after the training was successful, and then write additional documents. In a year and a half of production usage under very heavy load, that happened only a couple of times, it seems.
The issue has been fixed, of course, and we’ll be rolling the fix out to both users and cloud customers. We now roll back such in-memory state on a transaction rollback as well, avoiding this issue entirely. It is amazing to me that despite very careful planning, it wasn’t the code itself that caused the problem, but a sequence of independent operations and failure modes that we had never even considered.
FizzBuzz is a well known test to show that you can program. To be rather more exact, it is a simple test that does not tell you if you can program well, but if you cannot do FizzBuzz, you cannot program. This is a fail-only kind of metric. We need it because, sadly, we see people who fail FizzBuzz coming to interviews.
I have another test, which I feel is simpler than FizzBuzz and which can significantly narrow the field of candidates. I show them this code and ask them to analyze what is going on here:
```csharp
public class ControllerBase : Controller
{
    public static bool IsAdminUser;
}
```
Acceptable answers include puking, taking a few moments to breathe into a paper bag and mild to moderate professional swearing.
This is something that I actually ran into (about 15 years ago, in the WebForms days), and I have used it ever since. It is a great way to measure just how much a candidate knows about the environment in which they operate: a static field on a controller is shared across every request and every user.
I ran into the following code during a code review and had an immediate and visceral reaction.
```csharp
private readonly List<string> _messages;

public IReadOnlyList<string> Messages
{
    get
    {
        lock (this)
        {
            return _messages;
        }
    }
}
```
This is a (bad) attempt to add thread safety. You are handing out the value through a read-only interface, but it is still the same mutable instance underneath, so a caller can observe the list while it is being mutated, outside the lock.
The proper way to handle this is to copy the list (under the lock) and return that distinct copy, for example by returning new List<string>(_messages) while holding the lock.
The end of the year is closing fast, and I ran into the following metric (below). What you can see here is one of our RavenDB production instances over the past year. We are continuously dogfooding our own software, and there is a clear indication of the results. What you can see is the total memory used by RavenDB (production load, fairly constant over time) for the past year. As we update RavenDB, we benefit from various optimizations, and the trend line is very encouraging. Around August, we had a change that saved us a single allocation in some cases; you can see the impact it had. We also started using a new feature in production around December, and that seemed to have an additional memory cost, so we optimized that as well. You can see the new build deployed around the 17th of the month.