A file pager is the component in a database system responsible for reading and writing pages (typically 8KB blocks) to and from the file system. It owns the I/O operations and is crucial for the overall performance of the system. Ideally, it should manage details such as caching pages in memory, reducing I/O costs, and continuously optimizing the overall behavior of the storage.
That can be a pretty big chunk of a storage system, and it can have a significant impact on the way the storage system behaves. Here is the most basic version that I can think of:
pub const Page = struct {
    page: u64,          // first page number covered by this instance
    numberOfPages: u32, // how many sequential pages the buffer spans
    buffer: []u8,       // the in-memory data for those pages
};

pub const Pager = struct {
    // Returns the page if it is already in memory; does not block.
    pub fn tryGet(self: *Pager, page: u64, count: u32) !?Page {}
    // Waits until the requested pages are loaded into memory.
    pub fn getBlocking(self: *Pager, page: u64, count: u32) !Page {}
    // Tells the Pager that the memory is no longer in active use.
    pub fn release(self: *Pager, page: Page) void {}
    // Writes the data from the in-memory page to storage.
    pub fn write(self: *Pager, page: Page) !void {}
    // Ensures that all previous writes are durable on disk.
    pub fn sync(self: *Pager) !void {}
};
The idea is that whenever you need a particular page, you'll request it using tryGet(), which will return the page if it is already in memory, but will not block. You can call getBlocking() to force the current thread to wait until the page is in memory. That allows the calling code to perform some really nice optimizations.
Once we have the page, the Pager is charged with keeping it in memory until we release it. Note that I'm talking about a Page, but that might actually span multiple sequential pages. The release() call tells the Pager that the memory is no longer in active use; the Pager may decide to do something about that.
Finally, we have the write() method, which will write the data from the in-memory page to storage, and the sync() method, which will ensure that all previous writes are durably stored on disk.
There aren't that many moving pieces, right? Note in particular that we don't have the notion of transactions here; this is lower level than that. This API has the following properties:
The same page will always be represented in memory by the same location. However, if we release and get the page again, it may move.
The methods tryGet(), getBlocking() and release() have no locking or threading limits. You may call them in any context and the Pager will deal with any concurrency internally.
The write() and sync() calls, on the other hand, require synchronization by the client. There can be no concurrency between the two.
With that in place, we can build quite a sophisticated storage system. But we’ll focus on how the pager works for now.
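To make the calling pattern concrete, here is a minimal usage sketch. It is written in C# against a hypothetical IPager interface that mirrors the Zig definition above; all of the names here are mine, purely for illustration, not an actual API. The idea is to grab everything that is already cached without blocking, and only wait for the pages that missed:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical C# mirror of the pager API above, for illustration only.
public interface IPager
{
    bool TryGet(long page, int count, out PageHandle handle); // non-blocking
    PageHandle GetBlocking(long page, int count);             // waits for the I/O
    void Release(PageHandle handle);
    void Write(PageHandle handle);
    void Sync();
}

public readonly struct PageHandle
{
    public long Page { get; init; }
    public int NumberOfPages { get; init; }
    public ReadOnlyMemory<byte> Buffer { get; init; }
}

public static class PagerUsage
{
    // Read a set of pages: take whatever is already cached without blocking,
    // then fall back to blocking reads only for the pages that missed.
    public static void ReadPages(IPager pager, IEnumerable<long> pageNumbers)
    {
        var handles = new List<PageHandle>();
        var misses = new List<long>();

        foreach (var pageNum in pageNumbers)
        {
            if (pager.TryGet(pageNum, 1, out var handle))
                handles.Add(handle); // already in memory, no waiting
            else
                misses.Add(pageNum); // deal with the cold pages later
        }

        foreach (var pageNum in misses)
            handles.Add(pager.GetBlocking(pageNum, 1)); // now we wait

        try
        {
            // ... use the buffers ...
        }
        finally
        {
            foreach (var handle in handles)
                pager.Release(handle); // the pager may now evict this memory
        }
    }
}
```

The split between the non-blocking and the blocking calls is exactly what lets the caller schedule the expensive waits, instead of stalling on the first cold page it encounters.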
There are a bunch of ways to implement this, so I’ll have at least a couple of posts on the topic. How would you approach implementing this?
I got into a good discussion about how RavenDB implements some optimizations with transaction handling. The details got big enough (and hopefully interesting enough) that they warrant their own post.
When we are talking about transactions, one of the key factors in the cost of a transaction is the amount of time that it takes to persist it. This is different for local and distributed transactions.
For a local transaction, we can consider the transaction committed if it is durably stored on the disk.
For a distributed transaction, we can consider the transaction committed if it is durably stored on a majority of the disks in the cluster.
That factor right there is the primary reason why a distributed transaction is more expensive. For a local transaction, you are waiting for the disk. For a distributed transaction, you are waiting for multiple disks and the network.
One of the core optimizations for speeding up transactions is the ability to batch things. The cost of writing to the disk is more or less the same, regardless of how much you write (within an order of magnitude or so). In other words, writing 8 KB and writing 32 KB have pretty much the same cost. Writing 1 MB and writing 100 MB do not, but writing 1 MB vs. 4 MB isn't meaningfully different (for a sequential durable write, which is what we care about in the case of transactions).
The point of this post is how this is actually handled. RavenDB utilizes a process called transaction merging to reduce the number of times that we have to go to the disk. Concurrent transactions will be bundled into the same write call, massively increasing our throughput. To give you some context, without transaction merging, you peak at a few hundred transactions per second. With transaction merging, you can jump to high thousands of transactions per second. Here is how this works.
RavenDB actually takes this further: in addition to transaction merging, we also apply something we call async commit. Consider the timeline of a single transaction.
A transaction is actually composed of two separate steps. First we need to execute whatever commands we have in the transaction, then we have to write the transaction changes to disk.
RavenDB is able to start processing the next transaction as soon as the previous one has started its write to the disk. The idea is to parallelize compute and I/O, and we benefit greatly as a result. Note that this is safe to do, since the next transaction won't be committed until the prior transaction has been durably stored.
How does this work in practice? Whenever we have a new transaction, we add it to a queue. A dedicated thread pulls those transactions from the queue and merges them, running the separate transactions as one big operation. When we run out of pending transactions or hit certain size / timing limits, we'll commit the merged transaction and start working on the next one while the commit is completing in the background.
There are certain algorithms, such as Nagle's algorithm, that try to maximize throughput by waiting for additional transactions to arrive before actually going to the disk. RavenDB doesn't use that approach. If the system is idle and we get a single transaction, we'll execute and commit it immediately.
But the fact that we don't explicitly do Nagle doesn't mean that it isn't useful. Because we have to wait for the disk, what ends up happening is that under load, we start getting more pending transactions in the queue, which will then be executed as a merged unit. In other words, RavenDB implements a dynamic batching approach, shaped by the actual I/O constraints and the load on the system. If we get the occasional lone transaction, we'll execute it immediately. As the load increases, we'll start creating larger merged transactions. This way we can keep a fairly consistent response time even when the load on the system grows by leaps and bounds.
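Here is a minimal sketch of that dynamic batching loop. This is not RavenDB's actual code, just the shape of the idea described above: a queue, a dedicated merger thread, no artificial delays, and the next batch being executed while the previous one is still flushing to disk. The PendingTx type and the WriteAndSyncAsync method are placeholders I made up for the example:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: PendingTx and WriteAndSyncAsync are stand-ins, not real RavenDB types.
public class TransactionMerger
{
    private readonly BlockingCollection<PendingTx> _queue = new();
    private const int MaxBatchSize = 1024; // illustrative limit

    public void Enqueue(PendingTx tx) => _queue.Add(tx);

    public void Run(CancellationToken token)
    {
        Task previousCommit = Task.CompletedTask;

        while (!token.IsCancellationRequested)
        {
            // Wait only for the first transaction - there is no Nagle-style delay.
            var first = _queue.Take(token);
            var batch = new List<PendingTx> { first };

            // Grab whatever piled up while we were idle or flushing,
            // without waiting for anything more to arrive.
            while (batch.Count < MaxBatchSize && _queue.TryTake(out var next))
                batch.Add(next);

            // Execute all the commands as one big merged transaction.
            // This overlaps with the previous batch's write to disk.
            foreach (var tx in batch)
                tx.ExecuteCommands();

            // The new batch may only commit after the prior one is durable.
            previousCommit.Wait();
            previousCommit = WriteAndSyncAsync(batch); // flush in the background
        }
    }

    private static Task WriteAndSyncAsync(List<PendingTx> batch)
    {
        // Placeholder: write the merged changes, fsync, then complete the callers.
        return Task.Run(() =>
        {
            foreach (var tx in batch)
                tx.NotifyCommitted();
        });
    }
}

public class PendingTx
{
    public void ExecuteCommands() { /* apply the transaction's commands in memory */ }
    public void NotifyCommitted() { /* signal the waiting caller that it is durable */ }
}
```

Because the loop never waits for additional input, a lone transaction on an idle system is committed immediately; under load, the queue fills up while the previous flush is in flight, and the batches grow on their own.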
The situation is similar when we are talking about distributed transactions. RavenDB uses the Raft protocol for managing its distributed behavior. I’m going to focus just on the batching aspect of the behavior. RavenDB will send an AppendEntries message to the other members in the cluster every 200 ms or so. However, if we have a new command to send to the cluster, it will go out immediately over the network. An important factor here is that we are using TCP, and we require acknowledgment from the other side before we send the next message. As a result of those behaviors, we have pretty much the same situation. Depending on the network latency and the processing time, we’ll send more entries in a single roundtrip.
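The same pattern can be sketched for the network side. Again, this is not RavenDB's Raft implementation, just an illustration of the single-message-in-flight discipline described above: if nothing is in flight, a new command goes out immediately; otherwise it accumulates and rides the next AppendEntries round trip. SendAppendEntriesAsync is a placeholder for the actual send-and-wait-for-acknowledgment step:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative only: SendAppendEntriesAsync stands in for the real message
// send plus acknowledgment round trip to a cluster member.
public class EntrySender
{
    private readonly object _lock = new();
    private readonly List<byte[]> _pending = new();
    private bool _inFlight;

    public void Submit(byte[] entry)
    {
        lock (_lock)
        {
            _pending.Add(entry);
            if (_inFlight)
                return; // a round trip is in progress; this entry rides the next one
            _inFlight = true;
        }
        _ = SendLoopAsync();
    }

    private async Task SendLoopAsync()
    {
        while (true)
        {
            List<byte[]> batch;
            lock (_lock)
            {
                if (_pending.Count == 0)
                {
                    _inFlight = false;
                    return; // nothing left to send, go idle
                }
                batch = new List<byte[]>(_pending);
                _pending.Clear();
            }

            // One message in flight at a time; everything submitted while we
            // wait for the acknowledgment is coalesced into the next message.
            await SendAppendEntriesAsync(batch);
        }
    }

    private Task SendAppendEntriesAsync(List<byte[]> batch) => Task.CompletedTask; // placeholder
}
```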
In short, the overall behavior of RavenDB is that we'll start the operation immediately on the first action (both for disk and network), and then batch anything that arrives while the first operation is in flight, sending it as the next unit of work.
After over a decade of working in this manner, I can say that this has proven to be a highly adaptable system that requires a minimal number of knobs to mess with. It favors latency over throughput when there isn't a lot of activity and shifts toward favoring throughput over latency as the load grows.
The phrase “work well under pressure” is something that I consider to be a red flag in a professional environment. My company builds a database that is used as the backend of business critical systems. If something breaks, there is a need to fix it. It costs money (sometimes a lot of money) for every minute of downtime.
Under such a scenario, I absolutely want the people handling the issue to remain calm, collected and analytical. In such a case, being able to work well under pressure is a huge benefit.
That is not how this term is typically used, however. You'll usually hear this phrase in reference to the everyday working environment, for example, working under time pressure to deliver certain functionality. That sort of pressure is toxic over time.
Excess stress is a well-known contributor to health issues (both mental and physical); it causes you to make mistakes and adds friction all around.
From my perspective, the ability to work well under pressure is an absolutely important quality, but it is one that should be hoarded. You may need to draw on it to deal with a blocking customer issue, but you should be careful not to spend it on non-critical stuff.
And by definition, most things are not critical. If everything is critical, you have a different problem.
That means that part of the task of the manager is to identify the places where pressure is applied and remove it. In the context of software, that may mean delaying a release date or removing features to reduce the amount of work.
When working with technology, the most valuable asset you have is the people and the knowledge they have. And one of the easiest ways to lose that is to burn the candle at both ends. You get more light, sure, but you also get no candle.
A RavenDB user has accidentally deleted a collection. They intended to do something else, probably, but… They have a backup, but as you can imagine, this is a bad place to be in.
They talked to us and mentioned that they want a feature where deletion in the studio can be locked by a password or even two factor authentication, to prevent such a scenario.
We are not going to implement such a feature. From a technical perspective, this is a pretty easy thing to do, of course. My issue is that it doesn’t make sense for such a feature to exist. Let me dig into the details and explain what the problem is.
Locking deletes behind a password or two-factor authentication is a security feature, and that has a major impact on all aspects of the design. However, what we are dealing with here is preventing mistakes on the part of the user, not another security capability (this user can do deletes, this one cannot).
As such, this isn’t a security feature, but a UX one. The delete is already asking for confirmation, but it is the sort of thing that you rarely read, as we all know.
The distinction between a security feature and a UX feature is important. If this is a security feature, that means that I need to prevent mass deletes everywhere: as the result of queries, when iterating over ids, in patch operations, and so on. If this is a UX issue, that is a whole different level.
Looking at other such destructive operations, where the user is allowed to do the operation but we want to prevent accidents, leads me to consider requiring the user to perform some explicit action when there is a major risk, along the lines of the sketch below. That shifts the burden to the user, but it means that we now need to consider how to apply this.
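As an illustration of the kind of friction I have in mind (my own sketch, not the studio's actual behavior), think of the familiar type-the-name-to-confirm prompt:

```csharp
using System;

public static class DestructiveOperationGuard
{
    // Illustration only: before a risky operation, make the user type the exact
    // name of the thing that is about to be destroyed.
    public static bool ConfirmByTypingName(string operation, string targetName)
    {
        Console.WriteLine($"You are about to {operation} '{targetName}'. This cannot be undone.");
        Console.Write("Type the name to confirm: ");
        var typed = Console.ReadLine();
        return string.Equals(typed?.Trim(), targetName, StringComparison.Ordinal);
    }
}

// Hypothetical usage:
// if (DestructiveOperationGuard.ConfirmByTypingName("delete collection", "Orders"))
//     DeleteCollection("Orders");
```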
Are we dealing with just mass deletes? What about update queries?
The purpose here isn't to prevent the user from performing the operation, but to have them stop and consider for a moment. The problem is that for common operations, this adds a significant amount of friction to your daily work.
When working on importing data, for example, it is common to delete the previous run each time you run (think development time, every 3 minutes). Adding hurdles along the way is a PITA.
When a user reports an error in a Roslyn Analyzer or a Source Generator, you may need to know the version of Roslyn and the language version that are currently being used. The compiler provides everything needed. Indeed, you simply need to add #error version to any source file and check the compilation output.
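For example, something along these lines (the exact diagnostic text varies by compiler version):

```csharp
// Add this line to any .cs file in the project and build:
#error version

// The build then fails with a diagnostic that reports the compiler version and
// the effective language version, roughly along the lines of:
//   error CS8304: Compiler version: '4.x.y-...'. Language version: 12.0.
// Remove the line once you have captured the information.
```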
I asked about a slow piece of code and why it is so slow. The answer is pretty simple: I messed up. Take a look at the code below. When the Value is an int, I'm creating a span from the address of a long variable (initialized to zero). That means that I have a lot of hash collisions, which means that adding to the dictionary turns into a sequential operation, which means that the actual cost here is O(N²).

Interestingly enough, you can't write this code without the use of unsafe. I actually didn't know that the scope of the variable was even valid here to have its address taken. This was very hard to debug, because the problem was hidden away somewhere very far from where we were looking.
The following code takes a long time to run. In fact, I’m writing this blog post while this is running, and I’m not sure how long that will take.
// Requires compiling with unsafe enabled and the Standart.Hash.xxHash NuGet package.
using System;
using System.Collections.Generic;
using System.Diagnostics;

unsafe class Program
{
    public class NumericValue
    {
        public string Name;
        public object Value;

        public override bool Equals(object obj)
        {
            return obj is NumericValue value &&
                   Name == value.Name &&
                   Equals(Value, value.Value);
        }

        public unsafe override int GetHashCode()
        {
            Span<byte> span;
            if (Value is long l)
            {
                span = new Span<byte>(&l, sizeof(long));
            }
            else if (Value is int i)
            {
                span = new Span<byte>(&l, sizeof(int));
            }
            else
            {
                throw new NotSupportedException("Unknown value");
            }
            return (int)Standart.Hash.xxHash.xxHash32.ComputeHash(span,
                span.Length, (uint)Name.GetHashCode());
        }
    }

    static void Main(string[] args)
    {
        var dic = new Dictionary<NumericValue, int>();
        var sp = Stopwatch.StartNew();
        for (int i = 0; i < 100_000; i++)
        {
            dic.Add(new NumericValue
            {
                Name = "test",
                Value = i
            }, i);
        }
        Console.WriteLine(sp.ElapsedMilliseconds);
    }
}
Update: This took over a minute to complete on my machine (which is pretty nice). The killer is that this code is correct: it does the right thing, but it can be slow. I stripped a much larger scenario down to ~50 lines of code. Can you see what is going on? And why?
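For completeness, here is one way the bug described in the answer above could be fixed: hash the value that was actually matched, instead of always taking the address of the long. This is my own sketch of a fix, not the code from the original scenario; a fully managed alternative would be to simply return HashCode.Combine(Name, Value) and skip the unsafe code entirely.

```csharp
// One possible fix, sketched here: hash the bytes of the value that was actually
// matched, so int values no longer all hash from the zero-initialized long.
public unsafe override int GetHashCode()
{
    if (Value is long l)
    {
        var span = new Span<byte>(&l, sizeof(long));
        return (int)Standart.Hash.xxHash.xxHash32.ComputeHash(span,
            span.Length, (uint)Name.GetHashCode());
    }
    if (Value is int i)
    {
        var span = new Span<byte>(&i, sizeof(int));
        return (int)Standart.Hash.xxHash.xxHash32.ComputeHash(span,
            span.Length, (uint)Name.GetHashCode());
    }
    throw new NotSupportedException("Unknown value");
}
```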