Relatively General .NET

Don’t assume the result of read()

by Oren Eini

posted on: January 20, 2022

I read this post and it took me very little time to spot a pretty nasty bug. Here is the relevant section:

bug.rs:

loop {
    let read_count = encrypted_file.read(&mut buffer)?;

    if read_count == BUFFER_LEN {
        let plaintext = stream_decryptor
            .decrypt_next(buffer.as_slice())
            .map_err(|err| anyhow!("Decrypting large file: {}", err))?;
        dist_file.write(&plaintext)?;
    } else if read_count == 0 {
        break;
    } else {
        let plaintext = stream_decryptor
            .decrypt_last(&buffer[..read_count])
            .map_err(|err| anyhow!("Decrypting large file: {}", err))?;
        dist_file.write(&plaintext)?;
        break;
    }
}

The bug is on line #4. The code assumes that a call to read() will return less than the requested number of bytes only at the end of the file. The problem with that approach is that it is explicitly documented not to work this way: "It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e.g. being close to end-of-file) or because read() was interrupted by a signal." This is a super common error. And in the vast majority of cases, that would work. Except when it wouldn't. The underlying implementation of File::read() will call read() or ReadFile(). ReadFile() (Windows) is documented to read as much as you requested, unless you hit the end of file. The read() call, on Unix, is documented to allow returning less than requested: "It is not an error if this number is smaller than the number of bytes requested." Aside from signals, the file system is free to do a partial read if it has some of the data in memory and some not. I'm not sure if this is implemented in this manner, but it is allowed to do so. And the results for the code above in this case are absolutely catastrophic (decryption will fail, encryption will emit partial information with no error, etc.). I'm writing this blog post because reading the code made the error jump out at me. I was bitten by this assumption too many times.
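The simplest fix is to stop calling read() directly and instead fill the buffer in a loop, treating a short result as end-of-file only after read() itself has reported it. Here is a minimal sketch of such a helper (read_full is a name I made up for illustration; it is not from the original gist):

use std::io::Read;

/// Keep calling read() until `buf` is full or the stream ends.
/// Returns how many bytes were actually placed in `buf`; anything
/// short of `buf.len()` really does mean end-of-file here.
fn read_full(reader: &mut impl Read, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // end of stream
            Ok(n) => filled += n,
            Err(e) if e.kind() == std::io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

With that, the original loop can keep its structure: a full buffer goes to decrypt_next(), and a short (or empty) result is genuinely the last chunk.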

Generate SSH RSA Key Pairs on Windows with WSL

by Ardalis

posted on: January 20, 2022

Secure Shell Protocol (SSH) keys provide an alternate way to authenticate with many services like GitHub. Creating them on Windows is simple…Keep Reading →

Implementing a file pager in Zig

by Oren Eini

posted on: January 19, 2022

In the previous post I outlined some ideas about how to implement a more efficient write behind. The idea is that whenever we write pages to the Pager, we'll not trigger an immediate write to the disk. Instead, we'll keep the data in memory and only write to the disk when we hit a certain threshold. In many write scenarios, there are certain pages that are modified a lot (like the root page in a B+Tree) and pages that are modified rarely (a leaf page that got entries and will not be modified again). There is no point in writing the popular page to the disk; we'll likely get another write to it shortly anyway. That calls for a Least Frequently Modified approach. We don't need to use a more complex approach (like the Clock Sweep algorithm we use for the Pager), because we don't have to deal with the same scenarios. There aren't likely to be cases similar to scans, which throw a lot of complexity into buffer pool implementations. Write operations are far more predictable in general and follow a pretty strict power law distribution. The task is simple: we have the list of pages that were modified, and at capacity, we'll select some to send to the disk. The question is how to make that decision. The simplest option is to go with the least recently used model. That is trivial to implement. The idea is that we have the following sequence of writes (assuming we have a capacity of 4 pages):

1, 2, 3, 2, 3, 1, 2, 3, 4, 2, 1, 2, 3, 4, 4, 2, 1, 6

In this case, the page that will be selected for eviction is #3, since it is the one that wasn't modified for the longest time. The other alternative is to use least frequently used, in which case we have the following frequency table:

Page    Usages
1       4
2       5
3       4
4       3
6       1

In this case, we'll want to select page #4 for eviction, since it is the one least used. (We don't consider #6 because it is the one we just inserted.) I can make arguments for both sides, to be frank. It makes sense that the least frequently used is going to be the most relevant, right? The problem is that we need to also account for decaying usage over time. What do I mean by this? We may have a page that is very hot; it gets used a lot for a certain period of time. After that point, however, it is no longer being written to, but because it was frequently used, it will take a long time to evict from the pool. A good example of such a scenario is when we have a B+Tree and we are inserting values in ascending order. All the values for the tree are going to be placed in the same page, so if we have a lot of inserts, that page is going to be hot. Once it is full, however, we'll start using another page as the target and then the page will reside in memory until some other page has more usage. A good discussion of a least frequently used implementation is in this blog post. A nice way to deal with the issue of decaying priorities over time in an LFU setup is to use the following formula to compute the priority of the pages:
priority.zig:

pub const NodePriority = struct {
    prevAccess: u64,
    deltaSum: u64,
    usages: u64,
    priority: u64,

    fn computePriority(self: *NodePriority) void {
        var now = @intCast(u64, std.time.milliTimestamp());
        if (self.prevAccess == 0) {
            self.prevAccess = now - std.time.ms_per_s * 60;
        }
        var delta = now - self.prevAccess;
        self.deltaSum += delta;
        self.usages += 1;
        self.priority = now - (self.deltaSum / self.usages);
    }
};

The idea is that we compute the priority of the page based on the last access, so we are very close to the most recently used option. However, note that we compute the distance between accesses to the page. A page that is infrequently accessed will have low usages and a high delta sum. That will reduce its priority. Conversely, a page that is heavily used will have a low delta sum and high usage, so its value will be near the top. Another option is to go with a clock sweep variant. In this case, we use a simple least recently used model, but we keep count of the frequency of usages. In that case, if we have to evict a page that is heavily used, we can reduce its usages and give it another chance. The advantage here is that this is a far simpler model to work with, but it gives roughly the same results. Another option we have is to use the midpoint insertion LRU. There is also another consideration to take into account. The I/O cost isn't linear. If I'm writing page #3 to disk, it is basically free from my perspective to write nearby pages. It is the same exact cost, after all, so why not do that? We'll need to write our own doubly linked list. Zig's standard library only contains a singly linked list. It doesn't take long to write such a data structure, but it is fun to do so. I absolutely get why implementing linked lists used to be such a common practice in interviews:
DoublyLinkedList.zig:

pub fn DoublyLinkedList(comptime T: type) type {
    return struct {
        const Self = @This();

        pub const Node = struct {
            next: ?*Node,
            prev: ?*Node,
            item: T,

            pub fn removeFromSiblings(node: *Node) void {
                if (node.prev) |prev| { prev.next = node.next; }
                if (node.next) |next| { next.prev = node.prev; }
            }
        };

        head: ?*Node = null,
        tail: ?*Node = null,

        pub fn deinit(self: *Self, allocator: std.mem.Allocator) void {
            while (self.head) |n| {
                self.head = n.next;
                allocator.destroy(n);
            }
        }

        pub fn prepend(self: *Self, allocator: std.mem.Allocator, item: T) !*Node {
            var node = try allocator.create(Node);
            errdefer allocator.destroy(node);
            node.* = .{ .next = self.head, .prev = null, .item = item };
            if (self.head) |head| { head.prev = node; }
            if (self.tail == null) { self.tail = node; }
            self.head = node;
            return node;
        }

        pub fn append(self: *Self, allocator: std.mem.Allocator, item: T) !*Node {
            var node = try allocator.create(Node);
            errdefer allocator.destroy(node);
            node.* = .{ .next = null, .prev = self.tail, .item = item };
            if (self.tail) |tail| { tail.next = node; }
            if (self.head == null) { self.head = node; }
            self.tail = node;
            return node;
        }

        pub fn moveFirst(self: *Self, node: *Node) void {
            if (self.head == node) return;
            if (self.tail == node) { self.tail = node.prev; }
            node.removeFromSiblings();
            if (self.head) |h| { h.prev = node; }
            node.next = self.head;
            self.head = node;
            node.prev = null;
        }

        pub fn moveLast(self: *Self, node: *Node) void {
            if (self.tail == node) return;
            if (self.head == node) { self.head = node.next; }
            node.removeFromSiblings();
            if (self.tail) |t| { t.next = node; }
            node.prev = self.tail;
            self.tail = node;
            node.next = null;
        }

        pub fn remove(self: *Self, allocator: std.mem.Allocator, node: *Node) void {
            defer allocator.destroy(node);
            node.removeFromSiblings();
            if (self.head == node) { self.head = node.next; }
            if (self.tail == node) { self.tail = node.prev; }
        }

        pub const IteratorDirection = enum { forward, backward };

        pub const Iterator = struct {
            it: ?*Node,
            direction: IteratorDirection,

            pub fn next(self: *Iterator) ?T {
                if (self.it) |cur| {
                    self.it = if (self.direction == .forward) cur.next else cur.prev;
                    return cur.item;
                }
                return null;
            }
        };

        pub fn iterate(self: *Self, direction: IteratorDirection) Iterator {
            return .{
                .direction = direction,
                .it = if (direction == .forward) self.head else self.tail,
            };
        }

        pub fn popLast(self: *Self, allocator: std.mem.Allocator) ?T {
            if (self.tail) |t| {
                var item = t.item;
                self.remove(allocator, t);
                return item;
            }
            return null;
        }

        pub fn popFirst(self: *Self, allocator: std.mem.Allocator) ?T {
            if (self.head) |h| {
                var item = h.item;
                self.remove(allocator, h);
                return item;
            }
            return null;
        }
    };
}

There isn't really much to discuss here, to be honest. There is a bunch of code here, but it is fairly simple. I just had to implement a few operations. The code itself is straightforward. It is a lot more interesting when we see it being used to implement the LRU:
Lru.zig:

pub fn Lru(comptime T: type) type {
    return struct {
        const Self = @This();
        const Item = struct { val: T, usage: usize };

        list: DoublyLinkedList(Item) = .{},
        map: std.AutoHashMapUnmanaged(T, *DoublyLinkedList(Item).Node) = .{},
        capacity: usize,

        pub fn init(capacity: usize) Self {
            return .{ .capacity = std.math.max(capacity, 2) };
        }

        pub fn deinit(self: *Self, allocator: std.mem.Allocator) void {
            self.list.deinit(allocator);
            self.map.deinit(allocator);
        }

        pub fn push(self: *Self, allocator: std.mem.Allocator, val: T) !?T {
            if (self.map.get(val)) |existing| {
                existing.item.usage += 1;
                self.list.moveFirst(existing);
                return null;
            }
            var node = try self.list.prepend(allocator, .{ .val = val, .usage = 1 });
            try self.map.put(allocator, val, node);
            if (self.map.count() <= self.capacity) return null;
            while (self.list.tail) |last| {
                if (last.item.usage > 1 or last == node) {
                    // let's give it another try...
                    last.item.usage /= 2;
                    self.list.moveFirst(last);
                    continue;
                }
                var dropped = last.item.val;
                self.list.remove(allocator, last);
                _ = self.map.remove(dropped);
                return dropped;
            }
            unreachable;
        }
    };
}

The push() method is where it all happens. We have two options here:

- We have a page already inside the LRU. In that case, we increment its usage counter and move it to the front of the list.
- This is a new page, so we have to add it to the LRU. If there is enough capacity, we can just add it to the front of the list and be done with it.

However, things get interesting when we are at capacity. At that point, we actually need to select a page to evict. How can we do that? We scan the end of the list (the oldest page) and check its usage. If it has more than a single usage, we halve its usage counter and move it to the front. We continue to work on the tail in this manner. In essence, a high usage counter will get reset rather quickly, but this will still give us a fairly balanced approach, with more popular pages remaining in the pool for longer. When we evict a page, we can return it back to the caller, which can then write it to the disk. Of course, you probably don't want to write just a single page. We need to check if we have additional pages nearby, so we can consolidate all of them and write them to the disk at once. I'll touch on that in my next post.
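To show the eviction policy on its own, here is a rough Rust sketch of the same idea, with the intrusive linked list swapped for a VecDeque and a HashMap purely for brevity (the names and structure are illustrative, not a port of the Zig code above): a page with a usage count above one gets its counter halved and another trip through the queue instead of being evicted.

use std::collections::{HashMap, VecDeque};

// front of `order` = most recently modified page
struct WriteBehindQueue {
    order: VecDeque<u64>,
    usages: HashMap<u64, u32>,
    capacity: usize,
}

impl WriteBehindQueue {
    fn new(capacity: usize) -> Self {
        Self { order: VecDeque::new(), usages: HashMap::new(), capacity: capacity.max(2) }
    }

    /// Record a modified page. Returns Some(page) when a page should be
    /// flushed to disk to stay within capacity.
    fn push(&mut self, page: u64) -> Option<u64> {
        if let Some(count) = self.usages.get_mut(&page) {
            *count += 1;
            self.order.retain(|&p| p != page); // O(n) here; the linked list avoids this
            self.order.push_front(page);
            return None;
        }
        self.usages.insert(page, 1);
        self.order.push_front(page);
        if self.usages.len() <= self.capacity {
            return None;
        }
        // scan from the oldest entry, giving hot pages a second chance
        while let Some(candidate) = self.order.pop_back() {
            let count = self.usages.get_mut(&candidate).unwrap();
            if *count > 1 || candidate == page {
                *count /= 2; // decay the usage so nothing stays hot forever
                self.order.push_front(candidate);
                continue;
            }
            self.usages.remove(&candidate);
            return Some(candidate);
        }
        unreachable!()
    }
}

Every time push() returns Some(page), that page is the least interesting one to keep in memory and is the natural candidate to hand to the write-behind code.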

Implementing a file pager in Zig

by Oren Eini

posted on: January 18, 2022

In the last blog post I presented the manner in which the Pager can write data to disk. Here is a reminder:

usage.zig:

var w = pager.acquireWriter();
defer w.release();

try w.write(0, buffer);

try w.flushWrites();

We acquire the writer (and a lock on it), call write on the pages we want to write (passing the buffer to write to them), and finalize the process by calling flushWrites() to actually write the data to disk. As a reminder, we assume that the caller of the Pager is responsible for coordination. While we are writing to a specific page, it is the responsibility of the caller to ensure that there are no reads to that page. The API above is intentionally simplistic; it doesn't give us a lot of knobs to play with. But that is sufficient to do some fairly sophisticated things. One of the interesting observations is that we split the process of updating the data file into discrete steps. There is the part in which we are updating the in-memory data, which allows other threads to immediately observe it (since they'll read the new details from the Pager's cache). Separately, there is the portion in which we write to the disk. The reason that I built the API in this manner is that it provides me with the flexibility to make decisions. Here are some of the things that I can do with the current structure:

- I can decide not to write the data to the disk. If the number of modified pages is small (very common if I'm continuously modifying the same set of pages), I can skip the I/O costs entirely and do everything in memory.
- Flushing the data to disk can be done in an asynchronous manner. In fact, it is already done in an asynchronous manner, but we are waiting for it to complete. That isn't actually required. The way the Pager works, we deposit the writes in the pager, and at some future point the Pager will persist them to disk. The durability aspect of a database is not reliant on the Pager; it is usually a property of the Write Ahead Log.
- If I wanted to implement a more sophisticated approach for writing to the disk, I could implement a least recently used cache for the written pages. When the number of pages in memory exceeds a certain size, we'll start writing the oldest to disk. That keeps the most used pages in memory and avoids needless I/O.
- At certain points, we can ask the Pager to flush everything to the disk. This gives us a checkpoint, where we can safely trim the Write Ahead Log. A good place to do that is whenever we reach the file size limit of the log and need to create a new one.

So far, by the way, you'll notice that I'm not actually talking about durability, just writing to the disk. The durability aspect is coming from something we did long ago, but didn't really pay attention to. Let's look at how we are opening files, shall we:
PagerFile.zig:

const PagerFile = struct {
    chunks: *FileChunks,
    fd: std.os.fd_t,

    pub fn init(allocator: std.mem.Allocator, file: []const u8) !PagerFile {
        var chunks = try FileChunks.init(allocator);
        errdefer chunks.deinit();
        var fd = try std.os.open(
            file,
            std.os.O.RDWR | std.os.O.CREAT | std.os.O.CLOEXEC | std.os.O.DIRECT | std.os.O.DSYNC,
            std.os.S.IRUSR | std.os.S.IWUSR,
        );
        return PagerFile{
            .chunks = chunks,
            .fd = fd,
        };
    }

    pub fn deinit(self: *PagerFile) void {
        self.chunks.deinit();
        std.os.close(self.fd);
    }
};

Take a look at the flags that we pass to the open() command: we are asking the OS to use direct I/O (bypassing the buffer pool, since we'll use our own) as well as the DSYNC write mode. The two together mean that the write will skip any buffering / caching along the way and hit the disk in a durable manner. The fact that we are using async I/O means that we need to ensure that the buffers we write are not modified while we are saving them. As we currently have the API, there is a strong boundary for consistency. We acquire the writer, write whatever pages we need and flush immediately. A more complex system would be needed to manage higher performance levels. The issue is that in order to do that, we have to give up a level of control. Instead of knowing exactly where something will happen, we can have a more sophisticated approach, but we'll need to be aware that we don't really know at which point the data will be persisted. At this point, however, there is a good reason to ask: do we even need to write durably? If we are limiting the consistency of the data to specific times requested by the caller (such as when we replace the Write Ahead Log), we can just call fsync() at the appropriate times, no? That would allow us to use buffered writes for most I/O. I don't think that this would be a good idea. Remember that we are using multiple files. If we use buffered I/O and fsync(), we'll need to issue multiple fsync() calls, which can be quite expensive. It also means higher memory usage on the system because of the file system cache, for data we have already decided doesn't need to stay in memory. It is simpler to use direct I/O for the whole thing, after all. In the next post, I'm going to show how to implement a more sophisticated write-behind algorithm and discuss some of the implications of such a design.
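For comparison, here is roughly what those same open() flags look like from Rust on Linux; this is a sketch that assumes the libc crate for the flag constants, and the function name is mine. As with the Zig code, O_DIRECT means every buffer passed to subsequent writes has to be properly aligned and sized:

use std::fs::{File, OpenOptions};
use std::os::unix::fs::OpenOptionsExt;

fn open_data_file(path: &str) -> std::io::Result<File> {
    OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        // bypass the OS page cache and make each completed write durable
        .custom_flags(libc::O_DIRECT | libc::O_DSYNC)
        .mode(0o600)
        .open(path)
}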

Guard Clauses and Exceptions or Validation?

by Ardalis

posted on: January 18, 2022

Guard Clauses provide an elegant way to ensure code inputs are valid, typically by throwing exceptions. Validation provides a solution to a…Keep Reading →

Implementing a file pager in Zig

by Oren Eini

posted on: January 17, 2022

At long last, we are now at the point where we can write data back to the disk. Before we can do that, however, we need to figure out what sort of writes we want to allow. The idea that I have in mind for the Pager is to follow the same path as Voron does. In other words, for writes, we can make the following assumptions:

- There is only a single thread doing writes to the pager.
- There are no readers for the writes until the write is completed.
- There is a distinction between writing the data to the pager and writing the data to the disk.

Let's break those assumptions apart and see what they bring to the table. The fact that we can assume only a single writer thread at any given point is pretty important. It means that the level of complexity that we have to face is greatly reduced. In the same sense, the fact that we don't need to deal with concurrent readers or any consistency boundary for the data while it is being written will greatly simplify things for us. Finally, we make a distinction between writing to the pager and writing to the disk. Writing to the disk is _slow_, so we want to avoid doing that in any critical areas and push it to the background. Finally, there is another aspect to consider. Internally, the Pager works with 2MB chunks, but to the outside world, it is using 8KB pages. When we write, we always write at the 8KB page level, not chunks. How would that work for the Pager? The Pager itself is concurrent, but we only allow a single writer at a time; we can achieve this by centralizing all the write activities in the Writer struct, like so:

Writer.zig:

// inside Pager's struct...
pub fn acquireWriter(self: *Pager) *Writer {
    self.writer.lock.lock();
    return &self.writer;
}

pub const Writer = struct {
    writeResults: std.ArrayListUnmanaged(anyerror),
    writtenPages: std.AutoArrayHashMapUnmanaged(u64, u32),
    loadedChunksForWrites: ChunksSet,
    writeFlushCompleted: std.atomic.Atomic(u32),
    lock: std.Thread.Mutex,
    parent: *Pager,

    pub fn release(self: *Writer) void {
        self.lock.unlock();
    }
};

For now, I'm going to ignore the fields in the Writer struct; we'll touch on them in detail later. In order to use the writer, you need to acquire it, write as many pages as you need, then release it. Here is a usage example:

usage.zig:

var w = pager.acquireWriter();
defer w.release();

try w.write(0, buffer);

try w.flushWrites();

The basic idea is fairly simple. With the writer, we operate at the page boundary to write as many pages as we need. Once we are done, the call to flushWrites() persists the data to disk and then we can release the writer. Let's dig a bit deeper and see how that works, shall we?
write.zig:

pub fn write(self: *Writer, page: u64, buffer: []align(mem.page_size) u8) !void {
    std.debug.assert(buffer.len % FileChunks.PageSize == 0);
    var count = @intCast(u32, buffer.len / FileChunks.PageSize);
    var dst = try self.parent.getPage(&self.loadedChunksForWrites, page, count, .{ .forWrites = true });
    std.mem.copy(u8, dst, buffer);
    try self.writtenPages.put(self.parent.allocator, page, count);
}

The write() call is about as basic as you can get. We use the getPage() function to get the right page, memcpy the data and that is about it, right? There are only two other things here that are important:

- We record which chunks (the 2MB chunks of memory, out of which we carve the 8KB pages) are in use at the writer's level, using the loadedChunksForWrites value.
- We remember which pages we wrote to using the writtenPages hash table.

This is intentionally bare bones, because that is actually sufficient for our needs. The fact that we remember which chunks we loaded (and keep a reference to them) will prevent us from reclaiming them, so even though we just wrote to memory, another thread can get the data and start using it without waiting for the disk. Of course, we still need to hit the disk eventually; that is what flushWrites() is about.

flushWrites.zig:

pub fn flushWrites(self: *Writer) !void {
    var count = self.writtenPages.count();
    if (count == 0) return;

    var tmp = try self.parent.allocator.alloc(u64, count);
    defer self.parent.allocator.free(tmp);
    for (self.writtenPages.keys()) |p, i| {
        tmp[i] = p;
    }
    try self.writeResults.ensureTotalCapacity(self.parent.allocator, count);
    std.sort.sort(u64, tmp, {}, comptime std.sort.asc(u64));

    var page = tmp[0];
    var fileIdx = page / FileChunks.PagesInFile;
    var len = self.writtenPages.get(page) orelse unreachable;
    var index: usize = 1;
    while (index < tmp.len) : (index += 1) {
        if (page + len == tmp[index] and fileIdx == (tmp[index] / FileChunks.PagesInFile)) {
            // consecutive and in same file, just increase...
            len += self.writtenPages.get(tmp[index]) orelse unreachable;
            continue;
        }
        try self.writePageToDisk(page, len);
        page = tmp[index];
        fileIdx = page / FileChunks.PagesInFile;
        len = self.writtenPages.get(page) orelse unreachable;
    }
    // last one...
    try self.writePageToDisk(page, len);

    try self.waitForAllDiskWritesToComplete();
}

There is a lot going on here; let's break it up:

- We start by allocating a temporary array and copying the keys from the writtenPages hash table to it.
- We then sort the array. This is done so we'll be able to process the writes in a sequential manner, which is likely to be faster, even with async I/O.
- We then scan the list of pages in order, trying to merge writes together. The idea is to issue the minimum number of write calls.
- Finally, we wait for all the writes to complete.

Okay, maybe it isn't that complex. There is a bunch of code here, but it is mostly straightforward. Note that we also prepare the writeResults list to accept the results of the write to the disk. As for writing to the disk, this is done using the PagerRing we previously looked at:
writePageToDisk.zig:

fn writePageToDisk(self: *Writer, page: u64, count: u32) !void {
    var fileIdx = page / FileChunks.PagesInFile;
    var file = self.parent.files[fileIdx] orelse return error.FileIsNotLoaded;
    var pageBuffer = try self.parent.getPage(&self.loadedChunksForWrites, page, count, .{});

    _ = self.writeFlushCompleted.fetchAdd(1, .Release);
    try self.parent.ring.submit(.{
        .tag = .Write,
        .fd = file.fd,
        .buffer = pageBuffer,
        .offset = page % FileChunks.PagesInFile,
        .context = @ptrToInt(self),
        .callback = completeFlush,
    });
}

fn completeFlush(work: *PagerRing.Work) void {
    var self = @intToPtr(*Pager.Writer, work.context);
    if (work.result.err) |err| {
        // need to report the error before releasing waiters
        self.writeResults.appendAssumeCapacity(err);
    }
    if (self.writeFlushCompleted.fetchSub(1, .Release) == 1) {
        std.Thread.Futex.wake(&self.writeFlushCompleted, std.math.maxInt(u32));
    }
}

To write a buffer to the disk, we simply get the buffer from the Pager (reusing all the work we did in getPage()), increment the number of outstanding writes and then submit the work to the ring for processing. We set up completeFlush as the callback function on completion. The PagerRing will call us when it is done writing to the disk. If there is an error, we'll record it and reduce the number of outstanding writes. If there are no more outstanding writes, we'll wake any waiters. That part is handled in waitForAllDiskWritesToComplete():

waitForAllDiskWritesToComplete.zig:

fn waitForAllDiskWritesToComplete(self: *Writer) !void {
    while (true) {
        var cur = self.writeFlushCompleted.load(.Acquire);
        if (cur == 0) // no outstanding writes...
            break;
        try std.Thread.Futex.wait(&self.writeFlushCompleted, cur, null);
    }

    self.writtenPages.shrinkRetainingCapacity(0);
    var it = self.loadedChunksForWrites.keyIterator();
    while (it.next()) |chunk| {
        try self.parent.releaseChunk(chunk.*);
    }
    self.loadedChunksForWrites.clearRetainingCapacity();

    defer self.writeResults.clearRetainingCapacity();
    if (self.writeResults.items.len != 0) {
        return self.writeResults.items[0];
    }
}

We start by waiting for the outstanding writes to complete, waiting if needed. Then we can reset the state of the Writer. We start by resetting the written pages and then iterate over all the loaded chunks and release them. After the call, the Pager may decide to remove them from memory. This is fine, since they were already written to disk. Except… if there was an error. You might have noticed that we are gathering the errors on each individual write operation we send, but we actually only look at the first one. For that matter, we clear the state of the Writer regardless of whether there were errors or not. In general, an I/O error from the disk is not something that is recoverable. What you can do at this stage is to raise the error higher and run whatever recovery you have on startup. In the next post, I'm going to be talking about durability and the overall expected performance of the system under this sort of model.
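The merging step is the interesting part of flushWrites(), and it stands on its own. Here is a hedged sketch of the same idea in Rust, detached from the Writer struct (PAGES_IN_FILE and the function name are placeholders, not constants from the series): sort the modified pages, then fold runs of consecutive pages that live in the same file into single write requests.

use std::collections::HashMap;

const PAGES_IN_FILE: u64 = 1 << 18; // placeholder for the real pages-per-file constant

/// Turn a map of page number -> length (in pages) into the smallest set of
/// contiguous (start, length) write requests, never crossing a file boundary.
fn coalesce_writes(written: &HashMap<u64, u32>) -> Vec<(u64, u32)> {
    let mut pages: Vec<u64> = written.keys().copied().collect();
    pages.sort_unstable(); // sequential order is kinder to the disk, even with async I/O
    let mut runs: Vec<(u64, u32)> = Vec::new();
    for page in pages {
        let len = written[&page];
        if let Some(last) = runs.last_mut() {
            let contiguous = last.0 + last.1 as u64 == page;
            let same_file = last.0 / PAGES_IN_FILE == page / PAGES_IN_FILE;
            if contiguous && same_file {
                last.1 += len; // extend the current run instead of issuing another write
                continue;
            }
        }
        runs.push((page, len));
    }
    runs
}

Each entry in the result then maps to one submission to the ring, which is exactly what the inline loop in flushWrites() is doing.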

Why Use DateTimeOffset

by Ardalis

posted on: January 17, 2022

Raise your hand if you've stored entity values in a database as DateTime. Ok, everybody has their hand up. You can put your hand down - you…Keep Reading →

Deleting GitHub Actions artifacts using the GitHub REST API

by Gérald Barré

posted on: January 17, 2022

GitHub Actions supports sharing data between jobs in any workflow as artifacts. This is very convenient but the storage is limited (or you have to pay for it). If you use too much storage, you may get one of those notifications: You've used 75% of included services for GitHub Storage (GitHub Actions

re: Are You Sure You Want to Use MMAP in Your Database Management System?

by Oren Eini

posted on: January 14, 2022

I was pointed to this paper on twitter: Are You Sure You Want to Use MMAP in Your Database Management System? As you can imagine, this is a topic near and dear to my heart. This is especially the case since I am currently writing the Implementing a file pager in Zig post series. I implemented the same low level mechanics using mmap; with mmap, I have < 100 lines of code and can start building higher level concepts almost immediately. Writing my own pager is currently a 10-post series and the end doesn't seem to be in sight. I'm going to use this post to respond to the article. As a reminder, I'm the founder of RavenDB and I wrote Voron, an mmap-based storage engine, and have been running it across hundreds of clients and literally tens of millions of instances in production. I am also writing a book about building a storage engine that uses mmap internally. The paper itself does a great job of outlining the issues of using mmap as the buffer pool in a DBMS. What it doesn't cover, however, is the alternative. I will touch on specific points from the paper shortly, but I want to point out that the article compares apples to camels in the benchmarks and conclusions. Note that I don't necessarily disagree with some of the statements; mmap certainly has challenges that you need to deal with, but if you avoid it, you can't just wave away everything that it brings to the table. When building a database, mmap has the following advantages; the OS will take care of:

- Reading the data from disk
- Concurrency between different threads reading the same data
- Caching and buffer management
- Eviction of pages from memory
- Playing nice with other processes on the machine
- Tracking dirty pages and writing to disk*

I put an asterisk on the last one because it probably requires your attention as well. If you aren't using mmap, on the other hand, you still need to handle all those issues. That is a key point that I believe isn't addressed in the paper. Solving those issues properly (and efficiently) is a seriously challenging task. Given that you are building a specialized solution, you can probably do better than the generic mmap, but it will absolutely have a cost. That cost is both in terms of runtime overhead as well as increased development time. The comparison that was made by the paper was done using the fio benchmark tool, which is great if you want to test your storage system, but is pretty much irrelevant if you are trying to benchmark a buffer pool. Consider the following:

buffer_pool.c:

// pool.h
void* getPage(pool* p, long id);

// mmap_pool.c
void* getPage(pool* p, long id) {
    return p->base_ptr + (id * PAGE_SIZE);
}

// fio_pool.c
void* getPage(pool* p, long id) {
    // * lookup if the page is in memory
    // * if not, schedule its load
    // *** block the current thread? return an error and have it try later?
    // *** other threads that need this page should not initiate another load
    // * record that the page is used (to handle evictions later)
    // * record that this page is in use (to avoid evicting a page while it is in use)
}

For the mmap version, we need to compute the address of the page and that is pretty much it. For the manual buffer pool, the list of tasks that we need to handle is long.
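To make that asymmetry concrete, here is roughly what the mmap side looks like as runnable Rust (a sketch using the libc crate; the names are mine and this is not Voron's code). Everything on the fio_pool.c checklist above is delegated to the kernel:

use std::fs::File;
use std::os::unix::io::AsRawFd;

const PAGE_SIZE: usize = 8192;

struct MmapPool {
    base: *mut u8,
    len: usize,
}

impl MmapPool {
    /// Map the whole data file once; unmapping and file growth are elided.
    fn open(file: &File, len: usize) -> std::io::Result<MmapPool> {
        let base = unsafe {
            libc::mmap(
                std::ptr::null_mut(),
                len,
                libc::PROT_READ,
                libc::MAP_SHARED,
                file.as_raw_fd(),
                0,
            )
        };
        if base == libc::MAP_FAILED {
            return Err(std::io::Error::last_os_error());
        }
        Ok(MmapPool { base: base as *mut u8, len })
    }

    /// The whole "buffer pool": faulting, caching and eviction all happen
    /// behind this bit of pointer arithmetic.
    fn get_page(&self, id: usize) -> &[u8] {
        assert!((id + 1) * PAGE_SIZE <= self.len);
        unsafe { std::slice::from_raw_parts(self.base.add(id * PAGE_SIZE), PAGE_SIZE) }
    }
}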
Some of those tasks also require us to be thread safe. For example, if we handed a page to a transaction, we need to keep track of that page's status as being in use. We cannot evict this page until the transaction is done with it. That means that we probably need to do atomic reference counting, which can have very high costs. There are other alternatives, of course, but they all have even higher degrees of complexity. In practice, data access within a database isn't actually random, even if you are doing random reads. There are pages that are going to be referenced almost always. The root page in the B+Tree is a good example. It is always going to be used. Under atomic reference counting, that page is going to be a bottleneck. Ignoring the overhead of buffer pool management means that you aren't actually comparing equivalent structures. I should also point out that I'm probably forgetting a few other tasks that the buffer pool needs to manage as well, which complicate its life significantly. Here is an example of such a buffer pool implementation from what is effectively a random GitHub repository. You can see what the code is trying to do here. The reason I point to this is that there is a mutex there (and I/O under the lock), which is fairly typical for many buffer pools. And not accounting for the overhead of buffer pool management seriously skews the results of the paper. All of this said, I absolutely agree that mmap can be challenging. The paper outlines 4 different problems, which I want to address.

Problem #1 – Transactional safety

A database needs to know when the data is persisted to disk. When using mmap, we explicitly give up that knowledge. That can be a challenge, but I don't see it as seriously different from not using mmap. Let's consider the manner in which Postgres works. It has its own buffer pool, and may modify the pages as a result of a write. Postgres may need to evict modified pages to disk before the transaction that modified them is committed. The overhead of managing that is just… part of the challenge that we need to deal with. For RavenDB, as the paper points out, we modify the pages outside of the mmap memory. This is actually not done for the reason the paper describes. I don't actually care if the data is written to memory behind my back. What I care about is MVCC (a totally separate concern from buffer management). The fact that I'm copying the modified data to the side means that I can support concurrent transactions with far greater ease. In a similar fashion, Postgres handles MVCC using multiple entries for the same row in the same page. When the transaction commits and older transactions no longer need the old version of the data, I can push the data from the modified buffers to the mmap region. That tends to be fairly fast (given that I'm basically doing memcpy(), which runs at memory speed) unless I have to page data in; more on that later. The paper mentions the issue of the single writer in LMDB, and I wanted to point out that a single writer model is actually far more common (and again, not really related to the buffer pool issue). Off the top of my head, most embedded databases implement a single writer model:

- LMDB
- Voron (RavenDB's storage engine)
- LevelDB
- Lucene

The one I can think of that doesn't have a single writer is RocksDB (where allow_concurrent_memtable_write applies to writes to the memtable, not to file I/O).
The reason this matters is that embedded systems can typically assume that all operations in a transaction will complete as a unit. Compare that to Postgres, where we may have a transaction spanning multiple network calls; there, interleaving writes is a must. If we could avoid such concurrency, that would be far preferable. You can get additional concurrency by sharding writes, but that is usually not needed.

Problem #2 – I/O Stalls

The paper points out, quite correctly, that not having control over the I/O means that you may incur a page fault at any time. In particular, you may end up blocked on I/O without really noticing. This can be a killer, especially if you are currently holding a lock and are blocked on a page fault. Indeed, I consider this to be the most serious issue that you have to deal with in mmap-based systems. In practice, however, the situation isn't so clear cut. Until quite recently, the state of asynchronous I/O on Linux was quite iffy. Until the arrival of io_uring, certain operations that you expected to be async would block occasionally, ruining your day. The paper mentions that you can use async I/O to issue I/O requests to load the next pages (non sequentially) from the disk when you are performing certain operations. You can do the same with mmap as well, and RavenDB does just that. When you start a scan on a B+Tree, RavenDB will ask the OS to ensure that the memory we are interested in is in memory before we actually get to it. On Linux, this is done with the madvise(WILL_NEED) call. That call may be blocking, so we actually have a dedicated thread that is meant to handle such a scenario. In practice, this isn't really that different from how you'll handle it with async I/O. Another consideration to deal with is the cost of mapping at the kernel level. I'm not talking about the I/O cost, but if you have many threads that are faulting pages, you'll run into problems with the page table lock. We have run into that before; it is considered an OS level bug, but it obviously has an impact on the database. In practice, however, the overhead of memory management is the same in most cases. Whether you are reading via mmap or allocating directly, you'll need to orchestrate things. Note that the same page table lock is also in effect if you are heavily allocating / freeing, since you're also modifying the process page table.

Problem #3 – Error Handling

Error handling is a serious concern for a database. The paper points out that databases such as SQL Server may run a checksum when reading data from disk. When you use a buffer pool, the boundary of reading from the disk is obvious and you can easily validate the read from the disk. Voron is using mmap exclusively, and we do have checksums. We validate the page from the disk the first time that we access it (there is an internal bitmap that is used for that). There isn't a big difference between the two anyway. We only check a given page once per run, because to do otherwise is meaningless. When you use read() to get data from the disk, you have no guarantees that the data wasn't fetched from a cache along the way. So you may validate that the data you read is "correct", while the on-disk representation is broken. For that reason, we only do the check once, instead of each time. A far greater issue to deal with is I/O errors. What do you do when a read or a write fails? If you are using system calls to manage that, you get a return code and can react accordingly.
If you are using mmap, the system will generate a SIGBUS that you'll have to (somehow) handle. For a database, dealing with I/O errors has a single correct answer: crash and then run recovery from scratch. If the I/O system has returned an error, there is no longer any way to know what the state of the data is. See: fsync-gate. The only way to recover is to stop, reload everything (apply the WAL, run recovery, etc.) and get back into a stable state. SIGBUS isn't my cup of tea with regards to handling this, but there isn't much you can actually do about an I/O error anyway, so just restarting the process ends up being more acceptable than you might initially think.

Problem #4 – Performance issues

The paper points out three common reasons for performance issues with mmap usage:

- page table contention
- single threaded page eviction
- TLB shootdowns

The first issue is something that I have run into in the past. It was a bug in the operating system, which was fixed. There is no longer a single page table lock in either Windows or Linux. The single threaded eviction, on the other hand, is something that we never ran into. When using Voron, we map the memory using MAP_SHARED, and most of the time, the memory isn't dirty. If the system needs memory, it can get it by simply discarding the memory of an unmodified shared page. In this model, we typically see most of the memory as shared and clean, so there isn't a lot of pressure to evict things, and it can be done on an as-needed basis. The TLB shootdown issue is not something that we ever ran into as a problem. We have run TB-range databases on a Raspberry Pi with 4GB of RAM and hammered it in benchmarks (far exceeding the memory capacity). The interesting thing here is that the B+Tree nature means that the upper tiers of the tree were already in memory, so we mostly ended up with a single page fault per request. In order to actually observe the cost of TLB shootdowns in a significant manner, you need to have:

- really fast I/O
- a working set that significantly exceeds memory
- no other work that needs to be done to process a request

In practice, if you have really fast I/O, you spent money on that, so you'll more than likely get more RAM as well. And you typically need to do something with the data you read, which means that you won't notice the TLB shootdowns as much. Finally, going back to how I started this post: all of this assumes that there are no other costs to not using mmap and using direct I/O. The benchmark doesn't account for those extra costs. For example, without mmap, who is doing evictions? In practice, that will lead to the same sort of considerations that you'll have when dealing with mmap. This is especially the case with TLB shootdowns when we start talking about high memory traffic (which likely modifies page allocations for the process, leading to the same scenario). The paper has been quite interesting to read and it presents a number of real problems that occur with mmap-based systems, but I'm afraid that it doesn't present the alternatives properly and vastly underestimates both the costs and the complexity of not using mmap and writing your own buffer pool.
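For reference, the prefetch described under Problem #2 boils down to a single madvise() call over the part of the mapping you are about to scan. A minimal sketch, assuming the libc crate and a made-up function name (the address handed to madvise must be page-aligned):

fn prefetch(base: *mut u8, offset: usize, len: usize) -> std::io::Result<()> {
    // Ask the kernel to start paging this range in; the actual reads happen
    // in the background, so later accesses are less likely to fault.
    let rc = unsafe {
        libc::madvise(base.add(offset) as *mut libc::c_void, len, libc::MADV_WILLNEED)
    };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

Running this on a dedicated thread, as described above, keeps whatever blocking the call itself might do off the request path.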