Relatively General .NET

What are Abstractions in Software Development

by Ardalis

posted on: January 05, 2022

Software developers deal with abstractions every day. But just what is an abstraction? There are differing definitions that can sometimes…

Implementing a file pager in Zig

by Oren Eini

posted on: January 04, 2022

After writing the post about handling chunk metadata, I started thinking about the overall approach. Both the method using compressed pointers and the baseline computation felt… off to me. They were certainly workable, but they were too complex and felt fragile. I don't like dealing with a high level of complexity; I would rather put a lot of effort into simplifying the solution. The overall approach may be complex, but the system should be nice to work with. Usually, we can get away with a great deal of simplification if we accept some constraints on what we want to do with the system. For now, I'm going to assume the following constraints:

- We are using a 64 bits OS (and can assume an effectively unlimited address space).
- We want to go with a file pager (instead of the memory mapped one) because I want to be able to control the I/O behavior better.
- The files we use are limited to 8 GB in size (we can use more than a single file, of course).

The last one deserves some additional words. When thinking about a storage solution, accepting a maximum size is generally a bad idea (640KB, anyone?). However, if we decide that our storage solution is going to be composed of files of a specific size, we can combine them to reach any size needed. But why accept this limitation? Why say that a single file will not exceed 8 GB? It turns out that this has several advantages. Let's assume that we have a dataset that is 100GB in size. Using 8 GB files, that would be 13 files, for a total of 104 GB of used disk space. Now we want to delete some of that data. What do we do with the actual used disk space? It is actually quite hard to release disk space back to the operating system if you have a single file. You might need to run compaction of the data, or use an advanced API such as hole punching (see FALLOC_FL_PUNCH_HOLE). Advanced APIs are something that I would like to avoid; it is too easy to fall into some pitfall that no one else has run into. Working with sparse files (with holes in them) also typically requires you to utilize dedicated tools and can be awkward. If we split the data into separate files, we can retain most of the same benefits and give ourselves a simpler environment for the user to work with. With the 8GB limitation in place, I can choose to manage the paging in the following manner:
```zig
// FileChunks.zig
pub const FileChunks = struct {
    pub const MaxFileSize = 8 * 1024 * 1024 * 1024; // 8 GB
    pub const ChunkSize = 2 * 1024 * 1024; // 2 MB
    pub const MaxChunksInFile = MaxFileSize / ChunkSize; // 4096
    pub const PageSize = 8 * 1024; // 8 KB
    pub const PagesInChunk = ChunkSize / PageSize; // 256

    pub const ChunkMetadata = packed union {
        pub const Tag = enum(u2) {
            Empty = 0b00,
            Error = 0b01,
            Loading = 0b10,
            Value = 0b11,
        };
        raw: u64,
        futex: packed struct {
            value: u32, // tag & version
            references: u32, // ignored
        },
        v: packed struct {
            version: u30,
            tag: Tag,
            // references == 0 - this is unused
            // references == 1 - just the pager is holding this
            // references >= 2 - external entity is holding this
            references: u32,
        },
    };

    comptime {
        if (@sizeOf(ChunkMetadata) != @sizeOf(u64)) {
            @compileError("ChunkMetadata should be exactly 64 bits in length");
        }
    }

    chunks: [MaxChunksInFile]ChunkMetadata,
    ptr: []align(mem.page_size) u8,
    allocator: mem.Allocator,
};
```

The idea is pretty simple. Instead of trying to stitch together the memory for the file, we are going to just allocate a single 8GB range of virtual memory. This can be done using the following command:

```zig
// mmap.zig
var ptr = try os.mmap(
    null,
    MaxFileSize,
    os.PROT.NONE,
    os.MAP.ANONYMOUS | os.MAP.PRIVATE,
    -1,
    0,
);
```

This reserves (but does not use) 8GB of address space. We can now allocate ranges from that safely. This is important because if we get a request for two sequential chunks, they will reside in memory right next to one another. Note that we also don't need to handle any pointers, since we can rely on a stable base address for the whole file. The nice thing about this is that we aren't actually allocating memory, just reserving it. Let's see how that will work. The chunks array is used to control references to the chunks in the file. The chunk metadata is a 64 bits value that has several responsibilities at the same time. It stores the tag of a chunk, which indicates its status (loaded, error, empty, etc.), and the number of outstanding references to the chunk. That uses up 34 bits in the value; the rest of the bits are used as a version field, which is incremented on each change. That allows us to avoid the ABA problem. The actual data, of course, is managed using the ptr value. Here is how we can get a chunk from this struct:
```zig
// tryGet.zig
pub fn tryGet(self: *FileChunks, chunk: u64) !?[]align(mem.page_size) u8 {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        switch (copy.getTag()) {
            .Empty => return null,
            .Error => return @intToError(@intCast(u16, copy.v.references)),
            .Loading => return error.ValueIsLoading,
            .Value => {},
        }
        try copy.addRef();
        if (self.chunks[chunk].tryUpdate(origin, copy)) {
            var offset = chunk * ChunkSize;
            return self.ptr[offset..(offset + ChunkSize)];
        }
    }
}
```

What we are doing here is checking that the value is loaded into memory, and if it is, we increment the reference and then return it. This code runs in a loop, because we assume that multiple threads may run it at the same time. This handles just getting data that is already loaded. If the data isn't loaded, what will happen? We'll get a null back. Here is the blocking version of this method:

```zig
// getBlocking.zig
pub fn getBlocking(self: *FileChunks, chunk: u64, timeout: ?u64) ![]align(mem.page_size) u8 {
    while (true) {
        var maybechunk = self.tryGet(chunk) catch |e| {
            if (e == error.ValueIsLoading) {
                var copy = self.chunks[chunk];
                if (copy.getTag() == .Empty) {
                    try self.chunks[chunk].wait(copy, timeout);
                }
                continue;
            }
            return e;
        };
        if (maybechunk) |c| {
            return c;
        }
        return error.ValueIsNotLoading;
    }
}
```

Just based on those two methods, you should be able to draw some conclusions. If the value isn't loaded, we'll always return null, but there is this Loading stage as well; in that case, we may want to wait for it. How is that going to work? This works using two important functions: markLoading() and markLoaded(). The idea is that we'll first try to call tryGet() to load a chunk; if there is no value, we need to load it from disk. At that point, remember, there may be multiple threads accessing the relevant chunk. So all of them would be competing on the markLoading function, like so:

```zig
// markLoading.zig
pub fn markLoading(self: *FileChunks, chunk: u64) !?[]align(mem.page_size) u8 {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        switch (copy.v.tag) {
            .Value => return error.ValueAlreadyExists,
            .Error => return error.ValueInErrorState,
            .Loading => return null, // already marked..
            .Empty => {},
        }
        copy.setTag(.Loading);
        if (self.chunks[chunk].tryUpdate(origin, copy)) {
            var offset = chunk * ChunkSize;
            const c = self.ptr[offset..(offset + ChunkSize)];
            _ = try os.mmap(
                c.ptr,
                ChunkSize,
                os.PROT.READ | os.PROT.WRITE,
                os.MAP.ANONYMOUS | os.MAP.PRIVATE | os.MAP.FIXED,
                -1,
                0,
            );
            return c;
        }
    }
}
```

The code itself is pretty simple: we update the tag of the chunk and try to apply the change optimistically. We are moving the state of the chunk from Empty to Loading in a thread safe manner. If we are successful in doing so, we know that we are the only thread that owns the loading portion of the chunk.
Note that part of the markLoading process is to ask the OS to give us the memory for the chunk (in the range that we previously reserved). At this point, we can load the data from disk somehow, and then we'll call the markLoaded function, which completes the process:

```zig
// markLoaded.zig
pub fn markLoaded(self: *FileChunks, chunk: u64) ![]align(mem.page_size) u8 {
    while (true) {
        var copy = self.chunks[chunk];
        var origin = copy;
        switch (copy.v.tag) {
            .Value => return error.ValueAlreadyExists,
            .Error => return error.ValueInErrorState,
            .Empty => return error.ValueIsNotLoading,
            .Loading => {},
        }
        copy.setTag(.Value);
        try copy.addRef(); // ownership by the pager
        try copy.addRef(); // ownership by the caller
        if (self.chunks[chunk].tryUpdate(origin, copy)) {
            return self.getLoadedChunk(chunk);
        }
    }
}
```

The idea is that we are splitting the responsibility for managing the chunk references from how we load the data into memory. In other words, the expected usage of this struct is something like this (a sketch in code follows the list):

- Call tryGet() for a page in a given chunk.
- If successful, do the work you wanted to do.
- If not successful, compete to be the loader for this data by calling markLoading(). If you lost, call getBlocking() to wait for the winner to get the data.
- Somehow load the data from the disk and call markLoaded().
- Proceed to make use of the data.
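To make that protocol concrete, here is a minimal sketch (not code from the post) tying the four methods together; loadChunkFromDisk is a hypothetical I/O helper that does not appear in the post:

```zig
// A sketch only: tryGet / markLoading / markLoaded / getBlocking in concert.
// `loadChunkFromDisk` is a hypothetical helper, not part of the post's code.
fn getChunkForUse(chunks: *FileChunks, chunk: u64) ![]align(mem.page_size) u8 {
    const existing = chunks.tryGet(chunk) catch |e| switch (e) {
        // Someone else is mid-load; block until they publish the value.
        error.ValueIsLoading => return try chunks.getBlocking(chunk, null),
        else => return e,
    };
    if (existing) |buf| return buf; // already resident, reference acquired

    // Empty: compete to be the loader for this chunk.
    if (try chunks.markLoading(chunk)) |buf| {
        // We won the race: the chunk's memory is committed, do the I/O.
        // (A full implementation would move the chunk to an error state on failure.)
        try loadChunkFromDisk(chunk, buf);
        return try chunks.markLoaded(chunk);
    }
    // Lost the race to another loader; wait for it to finish.
    return try chunks.getBlocking(chunk, null);
}
```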
Another important aspect that we have to deal with is when we want to discard the data. Basically, if we filled our memory budget and we need to load a value from the disk, what can we do? The answer is that we need to evict some data, and before we can do that, we need to know what data is currently in use. That is why we have the calls to addRef() and release(); we use those (with atomic operations) to track the usage of the various chunks. When we need to evict data from memory, we'll need to have some sort of policy to do so. I'm deferring the actual policy to a later point in time; right now I want to discuss how we know what we can evict and how that is going to work. Here is the code to handle eviction, currently implementing a policy of simple scanning (not ideal by a long shot):

```zig
// eviction.zig
fn tryClaimOwnership(self: *FileChunks, index: u64) ?ChunkMetadata {
    while (true) {
        var copy = self.chunks[index];
        if (copy.mayReclaim() == false)
            return null;
        var modified = copy.reset(.Loading); // means that we own it for the duration...
        if (self.chunks[index].tryUpdate(copy, modified))
            return modified;
    }
}

fn trySetToEmpty(self: *FileChunks, index: u64) !void {
    while (true) {
        var modified = self.chunks[index];
        if (modified.getTag() != .Loading) {
            // someone else modified it while we were releasing the memory
            return error.ValueIsNotLoading;
        }
        var released = modified.reset(.Empty);
        if (self.chunks[index].tryUpdate(modified, released))
            break;
    }
}

pub fn reclaim(self: *FileChunks) !u64 {
    var reclaimed: u64 = 0;
    var index: u64 = 0;
    while (index < self.chunks.len) : (index += 1) {
        var modified: ChunkMetadata = undefined;
        if (self.tryClaimOwnership(index)) |m| {
            // at this point, m is owned by us, and no one else can use it...
            modified = m;
        } else {
            continue;
        }
        reclaimed += ChunkSize;
        _ = try os.mmap(
            self.getLoadedChunk(index).ptr,
            ChunkSize,
            os.PROT.NONE,
            os.MAP.ANONYMOUS | os.MAP.PRIVATE | os.MAP.FIXED,
            -1,
            0,
        );
        try self.trySetToEmpty(index);
    }
    return reclaimed;
}
```

In the reclaim method, we scan through the chunks. To be able to reclaim a chunk, the following conditions need to hold:

- The chunk holds a value.
- There are no outstanding references to the chunk; only the pager is holding a reference to it.

Note that in order to do this safely, we have to assume that while we are trying to reclaim a chunk, another thread may be trying to use it. This behavior complicates our lives a bit. We handle it by doing a racy update of the chunk, trying to move it to the Loading state. The idea is that the Loading state is meant to be used as a busy signal: while the chunk is in the Loading state, the rest of the system knows that it cannot use it and needs to wait. Note that this means that we have the following transitions: Empty to Loading to Value (or Error) when loading a chunk, and Value to Loading to Empty when reclaiming one. Most of the code that we have in the struct is there to handle concurrency from multiple threads dealing with the system at once; the actual behavior is fairly simple. We check if we can reclaim the chunk (no one is looking), take a lock on it by trying to move its state to Loading, and then discard the memory by calling mmap on the chunk's memory with PROT_NONE. For fun, we are using 2MB chunks because they fit well into huge pages. On a properly set up system, we can significantly reduce the paging metadata overhead inside the kernel by allocating a single 2MB page for each chunk. You can see the entire implementation here. In the next post, I want to look into handling the I/O portion of reading the data from the disk. After that, we'll talk about how we can implement a proper eviction policy.

Production postmortem

by Oren Eini

posted on: January 03, 2022

The topic of this post is a bug in RavenDB, a pretty serious one. The end result was that a user reported getting an error from RavenDB saying that it was unable to read a stored document. In some cases, RavenDB needs to read a document on startup, which means that it wasn't able to start up if that document had this behavior. As you can imagine, this is one of those issues that gets our full and immediate attention. The error itself gave us a lot of information:

Dictionary mismatch on Dic #375 at Voron.Data.Tables.ZstdLib.AssertSuccess(UIntPtr v, CompressionDictionary dictionary)

This is related to RavenDB's document compression behavior. In order to get a great compression ratio from your documents, we train RavenDB on the recent documents that you have and generate a compression dictionary. The problem at hand is that the compression dictionary we have and the compression dictionary that was actually used are different. As you can see from the error, we are using zstd as the compression algorithm. When zstd generates a dictionary, it will (by default) generate an id for it that is mostly based on the xxhash64 of its content, rounded to 32 bits. You can see the relevant part here. This is pretty nice, since it means that there is a good chance that we'll detect the wrong dictionary. So now we know what is going on, but we don't understand why. When we wrote this feature, we were quite aware that we wouldn't be able to make any sort of sense of the documents if we didn't have the right dictionary. For that reason, we store the dictionaries three times: once inside RavenDB itself and twice in ancillary files, which we can use during recovery. This sort of error should be utterly impossible. And yet we ran into it in production, so we had to dig deeper still. The primary suspect was the dictionary training portion. One of the things that RavenDB does on a continuous basis is measure the compression ratio of the documents. If we aren't able to hit a good compression ratio, RavenDB will try to generate a new dictionary from the most recent documents and see if that new dictionary can do better. This can be very helpful in maintaining good compression rates; as your documents change, RavenDB will detect that, realize that it can do better, retrain on the recent data and compress even further. The problem is that this code path is also quite tricky. We first compress the document using the current dictionary, then we try generating a new dictionary and see if compressing with the new dictionary is better. If that is the case, we can install the new dictionary for future operations; otherwise, we need to discard it. I suspected that the issue was somewhere around that area, that we might not be handling the rejection of the new dictionary properly. So I went into the code and started digging, but I found absolutely nothing. The entire process is covered in tests and has been in production for close to 18 months, so this isn't something obvious. After spending quite a bit of time on the issue, I decided that the code was perfect: it handled everything properly and took all the right behaviors into account. Clearly the fault was elsewhere. Before setting out to blame the nearest cat (you can never trust those), I had an idea: what if the problem wasn't during the training process, but afterward? Well, that doesn't really matter, does it?
RavenDB is a transactional database. If we had a failure after the training process, we would have to discard some of the data, for sure, but that would be about it. Unless… what if we have some state that isn't transactional? As part of looking at the compression training code, I ran into just such a scenario. Running the training to generate a new compression dictionary is an expensive proposition, so we don't want to do that often. As such, we'll only do that once for roughly every 1K document changes in which we exceed the desired compression ratio by over 10%. How do we know to act every 1K documents? We have a counter that we increment on every change. That value is incremented using Interlocked.Increment() and isn't part of the transactional state. If the transaction is aborted, the value is still incremented. The actual value doesn't matter, mind you, only that it is moving forward, so that isn't an issue. I mentioned the dictionary id before, but I should clarify that this is zstd's dictionary id. Internally, RavenDB uses a different value, which is simply the sequence number of the dictionary: RavenDB counts the number of generated dictionaries and gives the new dictionary the next available value. That value, by the way, is part of the transaction. If we roll back a transaction, we'll use the same dictionary id again. But that doesn't matter, of course. When using compression dictionaries, we need to load them from a buffer. There is quite a bit of work involved in that: memory allocation, entropy tables to load, etc. In order to save repeated work, RavenDB caches the compression dictionaries (after all, their whole point is to be used repeatedly). That cache can be used by multiple transactions at the same time (two read transactions using the same dictionary will use the same instance). Given all of this information, here is the sequence of events needed to get the error in question:

- The user enabled document compression.
- The user ran a transaction with at least four commands, satisfying the following conditions:
  - A document write as the first action.
  - Then a write to a document whose compression ratio exceeded the expected ratio by over 10%. As a result, RavenDB trained a new compression dictionary. That dictionary had a better compression ratio and was accepted as the new default compression dictionary. RavenDB persisted the new dictionary and used it to compress the new document.
  - Another command (in the same transaction) stored a document in the same collection, so RavenDB read the new dictionary and stored it in a cache.
  - A final command runs, but this one throws an error (such as an optimistic concurrency violation).

At this point, RavenDB will roll back the entire transaction and return the error to the user. Let's say the user has chosen to submit the same two documents again. For the first command, we'll again discover that the compression ratio (of the old compression dictionary) is insufficient, but we will not generate a new compression dictionary. Why is that? Remember the counter that we increment using Interlocked? That one was not rolled back, so we'll need to wait for another 1K documents for the stars to properly align for us. That doesn't impact correctness in any way, shape or form, however. At this stage, the stage is set, but everything is still okay. The problem will happen the next time that we trigger a new dictionary. At that point, we'll again scan the most recent documents, build a dictionary, etc.
However, the dictionary id that RavenDB will use will be identical to the dictionary id of the one we previously discarded. The data that this dictionary was trained on, however, will almost certainly be different. We persist the new dictionary to disk and everyone is happy; the new document that we wrote will use the new compression dictionary and we are perfectly fine. The next write for this collection, however, will run into a problem. It will need to use the current (new) dictionary when making the write, and in order to do that, it will load the value using the cache. But there is already a value for that dictionary id in the cache: the same dictionary that was previously discarded. At this point, RavenDB will start compressing documents using the in-memory dictionary while the on-disk dictionary is different. If you try to access the document which triggered the new dictionary, you'll get an error, but documents that were modified later will continue working with no issue. Until you restart, of course. On restart, we'll read the dictionary from disk, where we wrote the new dictionary; at that point, all those documents that we wrote will give us the error above. Note that the sequence of events has to be very exact: you need dictionary training as part of a multi-command transaction that wrote additional documents and then failed, after the dictionary training had been successful. In a year and a half of production usage under very heavy load, that happened only a couple of times, it seems. The issue has been fixed, of course, and we'll be rolling the fix out to both users and cloud customers. We'll now roll back such in-memory state on a transaction rollback as well, avoiding this issue entirely. It is amazing to me that despite very careful planning, it wasn't the code itself that caused the problem, but a sequence of independent operations and failure modes that we never even considered.

Beating FizzBuzz for detecting qualified candidates

by Oren Eini

posted on: December 31, 2021

FizzBuzz is a well known test to show that you can program. To be rather more exact, it is a simple test that does not tell you if you can program well, but if you cannot do FizzBuzz, you cannot program. It is a fail-only kind of metric. We need this thing because, sadly, we see people that fail FizzBuzz coming to interviews. I have another test, which I feel is simpler than FizzBuzz, and which can significantly reduce the field of candidates. I show them this code and ask them to analyze what is going on here:

```csharp
// snap.cs
public class ControllerBase : Controller
{
    public static bool IsAdminUser;
}
```

Acceptable answers include puking, taking a few moments to breathe into a paper bag, and mild to moderate professional swearing. This is something that I actually ran into (about 15 years ago, in the WebForms days) and I have used it ever since. It is a great way to measure just how much a candidate knows about the environment in which they operate.

Code review horror in 4 lines of code

by Oren Eini

posted on: December 30, 2021

I ran into the following code during a code review and had an immediate and visceral reaction:

```csharp
// horror.cs
private readonly List<string> _messages;

public IReadOnlyList<string> Messages
{
    get
    {
        lock (this)
        {
            return _messages;
        }
    }
}
```

This is a (bad) attempt to add thread safety. You hand the value out through a read-only interface, but the mutable instance is still there at the source, so a caller can observe the instance while it is being mutated, outside the lock. The proper way to handle this is to copy the list (under the lock) and return that distinct copy.

A year of monitoring production

by Oren Eini

posted on: December 29, 2021

The end of the year is closing fast, and I ran into the following metric. What you can see here is one of our RavenDB production instances over the past year. We are continuously dogfooding our own software, and there is a clear indication of the results. The chart shows the total memory used by RavenDB (production load, fairly constant over time) for the past year. As we update RavenDB, we benefit from various optimizations, and the trend line is very encouraging. Around August, we had a change that saved us a single allocation in some cases; here is the change, and you can see the impact it had. We also started using a new feature in production around December, and that seems to have an additional memory cost, so we optimized that as well. You can see the new build deployed around the 17th of the month.

Implementing a file pager in Zig

by Oren Eini

posted on: December 28, 2021

The file pager needs to know what values it has in memory and what it needs to get from the disk. Instead of tracking values on a per page level, we are going to do that on a chunk basis, where each chunk is 2MB (256 pages). A single file is going to be limited to 8 GB in size, so we have a maximum of 4,096 chunks in a file. We can allocate a simple array of metadata for the entire file in a single shot, which means that we don't have to do any reallocation when we grow the size of the file (up to the 8GB maximum). Let's consider what metadata we need to know about the chunks we have:

- What is the status of the chunk (in memory, on the disk, being loaded, or errored)?
- How many outstanding references do we have for the chunk?
- Where do we find the actual chunk data in memory, when it is loaded?

The whole thing is made complex because we have to consider concurrency. Multiple threads may try to load a chunk at the same time, we may need to release the memory of a chunk to make room for loading another, etc. We also need to consider issues such as I/O failures, optimizing I/O patterns, and so on. For now, I/O will be handled in another post; I want to focus just on how we deal with the metadata. A major PITA with concurrency is how to handle reference tracking. If a thread is reading from a chunk, we cannot release it. That leads us to reference counting, but that is tough to do atomically. You have to deal with the ABA problem, to start with. For that reason, we want to limit the chunk metadata to 8 bytes in total, which will allow us to use atomic instructions to modify it safely. Using just 8 bytes is a very low budget, though. We know that the chunks we'll use are 2MB in size, and we can assume that we'll also align them on a 2MB boundary. That means that the lower 20 bits of a chunk pointer are unused, and we can repurpose them. On x64 and ARM64, the top 16 bits are also unused (not always true, since 2019 we have Ice Lake with PML5, which uses 57 bits, but it is very likely to be the case). In most systems, the 47th bit is used to distinguish kernel from user memory, so that can be cleared as well. That means that we actually only need 64 - 17 - 20 = 27 bits to store the pointer value, and we can repurpose the other 37 bits. There are actually several ways in which we can do this. The compressed pointer method is just one of them, and I decided not to go that route. Instead, we are going to have the following structure:

```zig
// metadata_struct.zig
packed struct { // 64 bits in total
    tag: enum(u2) {
        Empty = 0b00,
        Error = 0b01,
        Loading = 0b10,
        Value = 0b11,
    },
    version: u16,
    references: u20,
    offsetInPages: u26,
}
```

This is a packed bit field struct which fits into a 64 bits value. Note that we have fields for the type of the value, the version (for ABA) and the number of references. In addition to that, we also have the actual value, which is specified in offsetInPages. Let's talk about sizes here. The tag field has four options, as you can see. The version field is 16 bits, which means that it can have 65,536 possible values; it will be incremented on every change to the value and is used to avoid false successes when updating the value concurrently. The references field is 20 bits in size, giving us about a million values here. That is the number of concurrent references that a chunk can support, which looks big enough that we shouldn't have to care about it. The offsetInPages field is 26 bits in size; assuming 4 KB pages, that lets us reference up to 256 GB of memory.
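As a quick sanity check on the numbers above (a sketch, not code from the post), the bit budget works out exactly:

```zig
// Not from the post: verifying at comptime that the fields pack into 64 bits
// and that 26 bits of 4KB page offsets cover exactly 256 GB.
comptime {
    const bits = 2 + 16 + 20 + 26; // tag + version + references + offsetInPages
    if (bits != 64) @compileError("chunk metadata must fit in a single u64");
    const addressable = (1 << 26) * 4096; // 2^26 pages of 4 KB each
    if (addressable != 256 * 1024 * 1024 * 1024) @compileError("expected 256 GB");
}
```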
We'll want to support machines with more memory than that, which is why we also add the concept of a base. For a single file, all the allocations must come from the same 256 GB range. I don't expect that to be a big problem, and different files can have different bases. The fact that all of that fits in 64 bits means that we can use simple compare-and-swap atomic operations and avoid the need for 128 bits atomic instructions. (To be fair, cmpxchg16b has been around forever, and I believe that you can do the same on ARM as well, but I'm not sure how.) At any rate, let's look at the ChunkMetadata struct in all its glory, then we'll discuss what is going on:

```zig
// metadata.zig
pub const ChunkMetadata = packed union {
    v: packed struct {
        tag: enum(u2) {
            Empty = 0b00,
            Error = 0b01,
            Loading = 0b10,
            Value = 0b11,
        },
        version: u16,
        references: u20,
        offsetInPages: u26,
    },
    raw: u64,
    half: u32,

    comptime {
        if (@sizeOf(ChunkMetadata) != @sizeOf(u64)) {
            @compileError("ChunkMetadata must be 64 bits in size!");
        }
    }

    pub fn get(self: *ChunkMetadata, base: usize) !?*u8 {
        while (true) {
            var cur = self.*;
            switch (cur.v.tag) {
                .Empty => return null,
                .Error => return @intToError(@intCast(u16, cur.v.offsetInPages)),
                .Loading => return error.ValueIsLoading,
                .Value => {},
            }
            if (cur.v.offsetInPages == 0) {
                return error.ValueIsInvalid;
            }
            var val = @intToPtr(*u8, @intCast(u64, cur.v.offsetInPages) * std.mem.page_size + base);
            var updated = cur;
            if (updated.v.references == std.math.maxInt(@TypeOf(updated.v.references))) {
                // more than ~1M concurrent references is unlikely
                return error.ChunkReferencesOverflow;
            }
            updated.v.references += 1;
            updated.v.version +%= 1;
            var result = @cmpxchgWeak(u64, &self.raw, cur.raw, updated.raw, .Monotonic, .Monotonic);
            if (result == null) return val; // successfully incremented the ref count
        }
    }

    pub fn trySetError(self: *ChunkMetadata, err: anyerror) !void {
        return trySet(self, null, err);
    }

    pub const PtrWithBase = struct { ptr: *u8, base: usize };

    pub fn trySetValue(self: *ChunkMetadata, value: PtrWithBase) !void {
        return trySet(self, &value, null);
    }

    pub fn tryLoading(self: *ChunkMetadata) !void {
        return trySet(self, null, null);
    }

    fn trySet(self: *ChunkMetadata, value: ?*const PtrWithBase, err: ?anyerror) !void {
        while (true) {
            var cur = self.*;
            switch (cur.v.tag) {
                .Error => return @intToError(@intCast(u16, cur.v.offsetInPages)),
                .Value => return error.ValueIsAlreadySet,
                .Loading => {
                    if (value == null and err == null) {
                        return error.ValueIsAlreadyLoading;
                    }
                },
                .Empty => {},
            }
            var updated = cur;
            updated.v.references = 1;
            updated.v.version +%= 1;
            if (value) |val| {
                updated.v.tag = .Value;
                var v = (@ptrToInt(val.ptr) - val.base) / std.mem.page_size;
                if ((try std.math.mod(usize, @ptrToInt(val.ptr), std.mem.page_size)) != 0) {
                    return error.ValuePtrMustBePageAligned;
                }
                if (v == 0) {
                    return error.ValuePtrCannotBeNullOrSamePageAsBase;
                }
                if (v > std.math.maxInt(@TypeOf(updated.v.offsetInPages))) {
                    return error.ValuePtrIsTooFarFromBase;
                }
                updated.v.offsetInPages = @intCast(@TypeOf(updated.v.offsetInPages), v);
            } else if (err) |e| {
                updated.v.tag = .Error;
                updated.v.offsetInPages = @errorToInt(e);
            } else {
                updated.v.tag = .Loading;
            }
            var result = @cmpxchgWeak(u64, &self.raw, cur.raw, updated.raw, .Monotonic, .Monotonic);
            if (result == null) {
                std.Thread.Futex.wake(self.futexPtr(), std.math.maxInt(u32));
                return;
            }
        }
    }

    pub fn release(self: *ChunkMetadata) void {
        while (true) {
            var cur = self.*;
            var updated = cur;
            if (updated.v.tag != .Value) {
                @panic("Attempted to release a chunk whose tag isn't set to Value.");
            }
            if (updated.v.references == 0) {
                @panic("Attempted to release the chunk more times than you got it");
            }
            updated.v.references -= 1;
            updated.v.version +%= 1;
            var result = @cmpxchgWeak(u64, &self.raw, cur.raw, updated.raw, .Monotonic, .Monotonic);
            if (result == null) {
                return;
            }
        }
    }

    pub fn waitForValue(self: *ChunkMetadata, base: usize, timeout: ?u64) !?*u8 {
        while (true) {
            var cur = self.*;
            if (cur.v.tag != .Loading) {
                return try self.get(base);
            }
            try std.Thread.Futex.wait(self.futexPtr(), cur.half, timeout);
        }
    }

    fn futexPtr(self: *ChunkMetadata) *std.atomic.Atomic(u32) {
        // this covers the tag, version & some of the references fields
        // given that the version field is always changing, it is a good futex value
        return @ptrCast(*std.atomic.Atomic(u32), &self.half);
    }
};
```
The ChunkMetadata can be in one of four states:

- Empty – there is no value.
- Error – we tried to load the chunk, but failed for some reason. In that case, the actual error code is stored in offsetInPages.
- Loading – we are currently loading the chunk, and callers can decide to wait for this or try again later.
- Value – there is a value in the chunk and it is available immediately.

When we get() a value, we check what the current state of the metadata is, and in all but the Value case we'll return immediately. If there is a value, we can't just return it to the caller; we need to increment the reference count first. That is most of the code in the get() method. We increment the references, do a wrapping increment for the version (so each change will be unique) and then use an atomic operation to update the value. The idea is that two concurrent threads getting the value at the same time will always increment or decrement the references properly. That will be quite important later on. After you are done with the chunk, you can release() it, which will decrement the reference count. Note that a reference count of 0 is wrong; we aren't handling the actual releasing of values yet. That will come in another post. The trySet() function is responsible for the other side: it will set the value or the error, taking care of the concurrency aspects of the ChunkMetadata. Of particular interest here, however, is the Futex.wake() call. That deserves some detail. Consider the sequence of events for accessing a chunk. We may have two threads that try to get a chunk, but they find that it is not resident in memory. It needs to be loaded, but we don't want both threads to do so at once. Therefore, the threads will compete on moving the chunk from the Empty state to the Loading state, after which the thread that won the race will need to schedule the actual I/O. What does the other thread do in the meantime? It needs to wait until the I/O is completed. This is done using the waitForValue() method, where we interpret the first half of the chunk metadata (the one holding the version field) as a Futex.wait value. In other words, the thread will sleep until the trySet() call wakes it.
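To see how those pieces fit together, here is a minimal sketch (not code from the post) of a thread driving a single ChunkMetadata through that protocol; readChunkFromDisk is a hypothetical I/O helper:

```zig
// A sketch only: the compete / load / wait protocol described above.
// `readChunkFromDisk` is a hypothetical helper, not part of the post's code.
fn acquireChunk(meta: *ChunkMetadata, base: usize) !?*u8 {
    const existing = meta.get(base) catch |e| switch (e) {
        // Someone is mid-load; sleep on the futex until trySet() wakes us.
        error.ValueIsLoading => return try meta.waitForValue(base, null),
        else => return e,
    };
    if (existing) |ptr| return ptr; // already loaded, reference acquired

    // Empty: compete to move the chunk from Empty to Loading.
    meta.tryLoading() catch |e| switch (e) {
        error.ValueIsAlreadyLoading => return try meta.waitForValue(base, null),
        else => return e,
    };

    // We won the race, so we own the Loading state and schedule the I/O.
    const ptr = readChunkFromDisk() catch |e| {
        try meta.trySetError(e); // propagate the failure to any waiters
        return e;
    };
    try meta.trySetValue(.{ .ptr = ptr, .base = base });
    return ptr;
}
```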
That is enough talking about the ChunkMetadata, I think; we went over it in detail. In my next post, I want to talk about how we deal with what is likely to be the most interesting bit of the file pager: managing the actual chunks.

Implementing a file pager in Zig

by Oren Eini

posted on: December 27, 2021

In the previous post, I showed how we can get a pretty nice pager (important for building a storage system) in under 100 lines of code using mmap(). If that was all of it, this would be a pretty short series of posts. However, I want to explore what it would take to take ownership of that part of the storage system and build our own from scratch. Let's see what it would take to build a pager when we are doing the I/O ourselves. In the mmap() implementation, I didn't really have a lot of state, just the mapping, and that was pretty much it. When building our own, we need to track a whole lot more state. Off the top of my head, we need to:

- Track what pages we handed out to callers.
- Track usage of pages so we'll know when to release them.
- Manage concurrency explicitly between threads.
- Handle several scenarios that were just… working in the mmap() implementation.

For example, let's talk about what kind of state I need. Zig comes with a hash table (and does that beautifully for an unmanaged language), so I can do this, right?

pages: std.AutoHashMap(u64, Page),

That would be a mapping between the pages in memory and the memory we allocated for them. Except… it doesn't quite work like that. One of the key issues that we have to deal with is the fact that while most of the time we will ask for a single page, we can also ask for a continuous run of pages. We can safely assume that the caller is responsible for ensuring that there is no duplication. In other words, the following sequence of calls is invalid:

```zig
// invalid.zig
var p1 = try pager.tryGet(5, 4); // get 4 pages, from the 5th page (32KB)
var p2 = try pager.tryGet(6, 1); // get the 6th page, overlapping with p1
```

That is a very important limit on how much complexity we have to deal with, I have to note. Another thing to deal with is concurrency: how do we handle scenarios where two threads want to get pages (which may or may not be the same)? Another consideration is reducing the overall cost of I/O. We don't want to issue too many operations, both for reads and for writes, and that pushes us toward batching operations as much as possible. Here is the overall design that I have for the file pager: even though we are dealing with 8KB pages, the pager itself will issue work in chunks of 2MB each time. The idea is that we can amortize the cost of going to the disk by ensuring that we'll do bulk I/O. That, in turn, means that we have to consider some aspects of our system very early on. In the case of the mmap pager, we didn't really need to think about caching; that was the responsibility of the operating system. In the case of this pager, we must have a cache, and if we cache a chunk, we can probably benefit greatly from locality of reference, which is always nice. The 2MB chunk size design decision complicates our lives, though. The pager needs to handle both single page accesses and values that may span multiple pages. As long as they reside in a single chunk, that is pretty easy. But we need to consider how we'll manage to work with values that are bigger than 2MB in size.
It's interesting, because even at this very early stage, a design decision about how much we fetch from the disk has an impact on the implementation of the entire system. As early as we are, we can make the following assumptions about (and requirements from) our callers:

- Most of the accesses are going to be for single pages.
- Some of the accesses will be for multiple pages, but under the 2 MB chunk limit.
- Few accesses will need to work with multiple pages over the 2 MB limit.

That is important because it impacts the way we think about the system. Earlier in this post, I mentioned using a hash map to store the references to the pages. With chunks, we could probably adjust slightly and be done with it, right? Except that we really can't. One of the primary issues that we have to deal with is the fact that this is meant to be concurrent. A hash map isn't going to support that and would need to be protected by a lock. Interestingly, most concurrent data structures pretty much require garbage collection of some sort, and building them in an unmanaged system is quite complex. How do we deal with this issue? It turns out that it is far simpler to have an array to hold those references and access each element using atomic instructions. Here we run into another design decision: are we going to have a single file or multiple files? That matters because if we have a single file, we need to deal with increasing the file size on the fly, which means that the array of references would need to grow, and that is also complex with concurrent access. If we have multiple files, we can just create a completely new file as needed, and allocate a single array at the maximum file size without worrying about it. There are other reasons why we might want to use multiple files (such as making it easier to release space back to the file system), so we'll go with multiple files. That means that we can reasonably set the maximum file size at 8GB (remember the big values issue; I think it is reasonable to set the max size of a value at 2GB, so 8GB is plenty). With 8GB files, we are talking about 4,096 chunks of 2 MB each. Assuming that we'll use an 8 bytes struct to hold the data about each chunk, we can safely allocate the maximum size of 32KB upfront; if we need to increase the size of the file, we have already allocated the place for its metadata. That gives us a far simpler system (no need to manage concurrent access to a growing array) at a small memory cost; the sizing sketch below spells out this arithmetic. Now, we can require that page allocations that are below 2 MB in size will always be aligned inside a chunk boundary. But what happens when we have a value whose size exceeds 2MB? The answer is that we are going to require the calling code to follow specific patterns for that. We require that any value that is greater than 2MB will be aligned on a 2MB boundary from the end of the final chunk. (The original post shows a diagram here: the allocated pages span two separate chunks, aligned to the end of the last one.) The nice thing about this approach is that we know that the caller will not do partial calls. If we asked for pages 5 - 10, there can be no call to page 6 on an independent basis. As such, when we ask for a value that is bigger than a single chunk, it will always be expressed as a load from the starting chunk to the end. That means that we can load the full value in a single I/O call. Here, again, we have very low level concerns affecting how we lay out the data on disk.
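Here is that sizing arithmetic spelled out as a small sketch (assumed constant names; not code from the post):

```zig
// Not from the post: the sizing arithmetic for a single file's chunk metadata.
const MaxFileSize: u64 = 8 * 1024 * 1024 * 1024; // 8 GB per file
const ChunkSize: u64 = 2 * 1024 * 1024; // 2 MB chunks
const PageSize: u64 = 8 * 1024; // 8 KB pages, so 256 pages per chunk
const MaxChunksInFile = MaxFileSize / ChunkSize; // 4,096 chunks
const MetadataBytesPerFile = MaxChunksInFile * @sizeOf(u64); // 32 KB, allocated upfront
```

Because 32KB covers the chunk metadata for a full 8GB file, the array can be allocated once at its maximum size and never has to grow while other threads are accessing it.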
There are other aspects that we need to consider, such as eviction policies, how to handle concurrency, etc., but that is enough for one post. I intentionally want to limit the scope of what we do to avoid getting mired in the details. Expect more in the next post in the series.

Implementing a file pager in Zig

by Oren Eini

posted on: December 24, 2021

Now that we know what we want to implement, let's dig a bit deeper and see how to do it. An interesting way to implement a file pager is to… not do that. Instead, we can rely on the OS' memory mapping to do most of the heavy lifting. Let's see how we can do that. The first thing that we need to manage is the setup and teardown of the pager, which you can see here:

```zig
// init.zig
pub const MmapPager = struct {
    pub const PageSize = 8 * 1024;

    ptr: []align(std.mem.page_size) u8,
    len: u64,
    allocator: *std.mem.Allocator,

    pub fn init(fd: std.os.fd_t, allocator: *std.mem.Allocator) !MmapPager {
        var stats = try std.os.fstat(fd);
        var ptr = try std.os.mmap(
            null,
            @intCast(usize, stats.size),
            std.os.PROT_READ | std.os.PROT_WRITE,
            std.os.MAP_SHARED,
            fd,
            0,
        );
        return MmapPager{
            .ptr = ptr,
            .len = @intCast(u64, stats.size),
            .allocator = allocator,
        };
    }

    pub fn deinit(self: *MmapPager) void {
        std.os.munmap(self.ptr);
        self.ptr = undefined;
        self.len = 0;
    }
};
```

There isn't much here; we simply call mmap() and that is about… it. Let's see how we can implement the actual pager behavior. We'll start with the easy pieces, getting and releasing memory from the pager:

```zig
// get_release.zig
pub fn getBlocking(self: *MmapPager, page: u64, count: u32) !Page {
    return Page{
        .buffer = self.ptr[page * PageSize .. (page * PageSize + count * PageSize)],
        .numberOfPages = count,
        .page = page,
    };
}

pub fn release(self: *MmapPager, page: *Page) void {
    _ = self; // we don't need to actually release anything here
    page.buffer = undefined;
    page.numberOfPages = undefined;
    page.page = undefined;
}
```

You'll notice that we don't actually have anything here. Even the act of checking that the page is within the bounds of the mapped memory is done by slicing ptr directly. What about the blocking part? How do we actually move the data to memory? The answer is that we aren't. When you access the pointer we return from get(), we'll just get a page fault and the OS will read the data from the disk. The release() function also doesn't need to do much; all the behavior is inside the mmap() implementation, after all. A bit more complex is the part where we try to get the pages from the disk. Here is the tryGet() implementation:
```zig
// tryGet.zig
pub fn tryGet(self: *MmapPager, page: u64, count: u32) !?Page {
    const buf = try self.allocator.alloc(u8, count);
    defer self.allocator.free(buf);
    std.mem.set(u8, buf, 0);
    const start = self.ptr[page * PageSize ..];
    const size = count * PageSize;
    const rc = c.mincore(&start[0], size, &buf[0]);
    if (rc != 0) {
        return @intToError(@intCast(u16, std.os.errno(rc)));
    }
    for (buf) |b| {
        if (b & 1 == 0) {
            try std.os.madvise(@ptrCast([*]u8, start), size, std.os.MADV_WILLNEED);
            return null; // not all in memory
        }
    }
    // can return to the caller immediately
    return try getBlocking(self, page, count);
}
```

That is quite a bit of code for not much in practice. We create a temporary array and then call mincore() on the range of memory that we'll return. If the entire range is not already in memory, we'll call madvise() to load it in the background and return null. If the range is already in memory, we just return it. This isn't 100% safe to do, by the way; there may be race conditions that would cause us to think that the data is in memory just as it is swapped to disk, but that is good enough for our needs, especially because the whole thing is quite simple overall. The next stage is to handle writes and syncing to disk, which is simplicity itself in this model:

```zig
// write_sync.zig
pub fn write(self: *MmapPager, page: Page) !void {
    // nothing to do, the data is already written to the memory map
    _ = self;
    _ = page;
}

pub fn sync(self: *MmapPager) !void {
    const rc = c.msync(&self.ptr[0], self.ptr.len, c.MS_SYNC);
    if (rc != 0) {
        return @intToError(@intCast(u16, std.os.errno(rc)));
    }
}
```

Since we handed out a buffer from the memory map itself, we don't need to do any copying; we already modified that range of memory. And when we sync to the file, we can do that with a single msync() call. There are a few things to note here, though. Because we are writing directly to the memory mapped file, it is possible that our changes will show up in the file before write and sync are called. And msync() will sync the entire range; if we made smaller changes, we could try to reduce the amount of memory that is synced by remembering which parts we have written to, but that ends up being quite a chore, and since the OS is already doing that for us, we can shell that out to it directly. And that is pretty much it. The whole pager is under 100 lines of code. There are some things that I don't handle, such as what happens if we want to extend the size of the file. That requires us to re-wire the mapping, if we are going by the strict reading of the API. But on both Linux & Windows, you can define a memory mapping that is greater than the file, and it will automatically adjust as you grow the file. That is quite a nice feature for us and can save us a lot of management overhead internally. With that out of the way, we can start implementing higher level functions in a storage system. But notice how we moved pretty much everything to the OS? What would it look like if we wanted to build that ourselves?
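For reference, here is a minimal sketch of how the MmapPager above might be used, assuming the Zig version and std APIs from this post; test.db is a hypothetical, pre-existing file:

```zig
// A usage sketch only, not code from the post. Assumes the MmapPager, Page,
// and PageSize definitions shown above are in scope.
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = &gpa.allocator;

    // `test.db` is a hypothetical file that already exists on disk.
    var file = try std.fs.cwd().openFile("test.db", .{ .write = true });
    defer file.close();

    var pager = try MmapPager.init(file.handle, allocator);
    defer pager.deinit();

    var page = try pager.getBlocking(0, 1); // the first 8KB page
    std.mem.set(u8, page.buffer, 0); // mutating the buffer writes to the map
    try pager.write(page); // a no-op; the data is already in the map
    try pager.sync(); // msync() flushes the whole mapping to disk
    pager.release(&page);
}
```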