I run into this fascinating blog post discussing the performance of BonsaiDb. To summarize the post, the author built a database and benchmarked it, but wasn’t actually calling fsync() to ensure durability. When they added the fsync() call at the right time, the performance didn’t degrade as expected. It turned out that they were running benchmarks on tmpfs, which is running in memory and where fsync() has no impact. Once they tested on a real system, there was a much higher cost, to the point where the author is questioning whether to continue writing the database library he has developed.
Forgetting to call fsync() is an understandable issue (they called the flush() method, but that didn’t translate to an fsync() call). I recall that at one point the C#’s API once had a bug where a Flush() would not call fsync() if you were making writes with 4KB alignment (that is… super bad).
After figuring out the issue, the author has out to figure exactly how to ensure durability. That is a journey that is fraught with peril, and he has some really interesting tidbits there. Including sync_file_range(), fdatasync() and how you cannot write a durable database for Mac or iOS.
From my perspective, you cannot really trust anything beyond O_DIRECT | O_DSYNC or fdatasync() for durability. Almost a decade ago I wrote about performance testing that I did for various databases. My code was the 2nd fastest around for the tested scenarios. It was able to achieve almost 23,000 writes, almost 25% of the next slowest database. However, the fastest database around was Esent, which clocked at 786,782 writes.
I dug deep into how this is done and I realized that there is a fundamental difference between how all other databases were working and how Esent was handling things. All other databases issued fsync() calls (or fdatasync()). While Esent skipped that altogether. Instead, it opened a file with FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_DIRECT (the Unix version is O_DIRECT | O_DSYNC). That change alone was responsible for a major performance difference. When using O_DIRECT | O_DSYNC, the write is sent directly to persistent medium, skipping all buffers. That means that you don’t have to flush anything else that is waiting to be written.
If you are interested, I wrote a whole chapter on the topic of durable writes. It is a big topic.
The other thing that has a huge impact on performance is whether you are doing transaction merging or not. If you have multiple operations running at roughly the same time, are you going to do a separate disk write for each one of them, or will you be able to do that in a single write. The best example that I can think of is the notion of taking the bus. If you send a whole bus for each passenger, you’ll waste a lot of time and fuel. If you pack the bus as much as possible, for almost the same cost, you’ll get a far better bang.
In other words, your design has to include a way for the database to coalesce such operations into a single write.
Yesterday there was an update to this task, which more or less followed that direction. The blog post covers quite a lot of ground and is going in the right direction, in my opinion. However, there are a few things there that I want to comment upon.
First, pre-allocation of disk space can make a huge impact on the overall performance of the system. Voron does that by allocating up to 1GB of disk space at a time, which dramatically reduces the amount of I/O work you have to do. Just to give some context, that turns a single disk write to multiple fsyncs that you have to do, on both your file and the parent directory, on each write. That is insanely expensive. The storage engine discussed here used append only mode, which makes this a bit of a problem, but not that much. You can still preallocate the disk space. You have to scan the file from the end on startup anyway, and the only risk here is the latency for reading the whole pre-allocation size on startup if we shut down immediately after the preallocation happened. It’s not ideal, but it is good enough.
Second, the way you manage writes shouldn’t rely on fsync and friends. That is why we have the log for, and you can get away with a lot by letting just the log handle the durability issue. The log is pre-allocated to a certain size (Voron uses dynamic sizes, with a max of 256MB) and written to using O_DIRECT | O_
O_DSYNC each time. But because this is expensive, we have something like this (Python code, no error handling, demo code, etc):
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
class WriteAheadLog:
def __init__(self) -> None:
self.queue = []
self.hasWrites = threading.Event()
self.log = open("write-ahead.log", O_DIRECT | O_DSYNC)
th = threading.Thread(target=self.backgroundThread)
th.start()
def writeToLog(self, pages):
future = loop.create_future()
self.queue.append((pages, future))
return future
def backgroundThread(self):
while self.hasWrites.wait():
self.hasWrites.clear()
pending = self.queue
self.queue = []
pagesToWrite = []
for (pages, _) in pending:
pagesToWrite = pagesToWrite + pages
dataToWrite = CompressPages(pagesToWrite)
self.log.write(dataToWrite)
for (_, future) in pending:
future.set_result(True)
view raw
WriteAheadLog.py
hosted with ❤ by GitHub
The idea is that you can call writeToLog() each time and you’ll get a future on the write to the log file. You can continue with your transaction when the log holds the data. Note that in this model, if you have concurrent writes, they will be merged into a single disk write. You can also benefit significantly from reduced amount we write to disk by applying compression.
Third, something that has been raised as an option here is a new storage format. I’m not sure that I 100% get what is the intention, but… what I understand I don’t like. I think that looking at how LMDB does things would help a lot here. It is a COW model (which the append only is very similar to). The key difference is that the new model is going to store every header twice. Probably with a CURRENT and NEXT model, where you can switch between the two configurations. That… works, but it is pretty complex. Given that you have a log involved, there is no need for any of this. You can just store the most recent changes in the log and use that to find what the current version is of the data is.
I don’t like append only models, since they require you to do compaction at some point. A better model is what LMDB does, where it re-used the same pages (with copy on write). It requires you to manage a free list of pages, of course, but that isn’t that big a task.
If you're an employee, you probably have a boss, manager, supervisor, or similarly titled person to whom you report. Normally, you figure it…Keep Reading →
Round-robin DNS is a load-balancing technique where the DNS returns multiple IP addresses for a host. The client can choose any of the IP addresses to connect to the server. For example, bing.com has 2 IPv4 addresses:In .NET, the HttpClient connects to the server using the first IP address in the l
I’m teaching a college class about Cloud Computing and part of that is giving various assignments to build stuff on the cloud. That part is pretty routine.
One of my requests for the tasks is to identify failure mode in the system, and one of the students that I’m currently grading had this doozy scenario:
If you’ll run this code you may have to deal with this problem. Just nuke the system and try again, it only fails because of this once in a while.
The underlying issue is that he is setting up a Redis instance that is publicly accessible to the Internet with no password. On a regular basis, automated hacking tools will scan, find and ransom the relevant system. To the point where the student included a note on that in the exercise.
A great reminder that the Network is Hostile. And yes, I’m aware of Redis security model, but I don’t agree with it.
I’m honestly not sure how I should grade such an assignment. On the one hand, I don’t think that a “properly” secured system is reasonable to ask from a student. On the other hand, they actually got hacked during their development process.
I tried setting up a Redis honeypot to see how long it would take to get hacked, but no one bit during the ~10 minutes or so that I waited.
I do wonder if the fact that such attacks are so prevalent, immediate and destructive means that through the process of evolution, you’ll end up with a secured system (since unsecured isn’t going to be working).
A few weeks ago I wrote about the Hare language and its lack of generic data structures. I don’t want to talk about this topic again, instead I want to discuss something more generic (pun intended). In my view, any modern programming language that aims for high performance should have some form of generics in it. To not have that in place is a major mistake and a huge cause for additional complexity and loss of performance. One aspect of that is the fact that generic data structures get a lot more optimizations than one-off implementations. But I already talked about that in the previous post.
The other issue is that by not having generics, there is a huge barrier for optimizations in front of you. You lack the ability to build certain facilities at all. Case in point, let us take a topic that is near and dear to my heart, sorting. Working on sorted data is pretty much the one thing that makes databases work. Everything else is just details on top of that, nothing more. Let’s consider how you sort data (in memory) using a few programming languages, using their definitions
Using C:
void qsort (void *array, size_t count, size_t size, comparison_fn_t compare);int comparison_fn_t (const void *, const void *);
Using C++:
template <class RandomAccessIterator> void sort (RandomAccessIterator first, RandomAccessIterator last);
Using Java:
public static void sort(int [] a);public static void sort(long[] a);public static void sort(Object[] a);
Using C#:
public static void Sort<T> (T[] array);
Using Hare:
type cmpfunc = fn(a: const *void , b: const *void ) int ; fn sort([]void , size, *cmpfunc) void ;
Using Rust:
impl<T> [T] { pub fn sort(&mut self) where T: Ord,
}
Using Zig:
pub fn sort( comptime T: type, items: []T, context: anytype, comptime lessThan: fn (context: @TypeOf(context), lhs: T, rhs: T) bool, ) void
I’m looking only at the method declaration, not the implementation. In fact, I don’t care about how this is implemented at this point. Let’s assume that I want to sort an array of integers, what would be the result in all of those languages?
Well, they generally fall into one of a few groups:
C & Hare – will require you to write something like this:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
int cmp_asc_int(const void *a, const void *b) {
return *(int*)a > *(int*)b;
}
qsort(array, len, sizeof(int), cmp_asc_int);
view raw
sort.c
hosted with ❤ by GitHub
In other words, we are passing a function pointer to the sorting routine and we’ll invoke that on each comparison.
C++, C#, Rust, Zig – will specialize the routine for the call. On invocation, this will look like this:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
std::sort(array.begin(), array.end());
view raw
sort.cp
hosted with ❤ by GitHub
The idea is that the compiler is able to emit code specifically for the invocation we use. Instead of having to emit a function call on each invocation, the compare call will usually be inlined and the cost of invocation is completely eliminated.
Java is the only one on this list that has a different approach. Instead of using generics at compile time, it is actually doing a dispatch of the code to optimized routines based on runtime types. That does mean that they had to write the same sort code multiple times, of course.
Note that this isn’t anything new or novel. Here is a discussion on the topic when Go got generics, in the benchmark there, there is a 20% performance improvement from moving to the generics version. That results from avoiding the call overhead as well as giving the compiler more optimization opportunities.
Going back to the premise of this post, you can see how a relatively straightforward decision (having generics in the language) can have a huge impact on the performance of what is one of the most common scenarios in computer science.
The counter to this argument is that we can always specialize the code for our needs, right? Except… that this isn’t something that happens. If you have generics, you get this behavior for free. If you don’t, well, this isn’t being done.
I write databases for a living, and the performance of our sorting code is something that we analyze at the assembly level. Pretty much every database developer will have the same behavior, I believe. The performance of sorting is pretty key to everything a database does. I run into this post, talking about performance optimizations in Postgres, and one of the interesting ones there was exactly this topic. Changing the implementation of sorting from using function pointers to direct calls. You can see the commit here. Here is what the code looks like:
Postgres is 25 years old(!) and this is a very well known weakness of C vs. C++. Postgres is also making a lot of sorting calls, and this is the sort of thing that is a low hanging fruit for performance optimization.
As for the effect, this blog post shows 4% – 6% improvement in overall performance as a result of this change. That means that for those particular routines, the effect is pretty amazing.
I can think of very few scenarios where a relatively simple change can bring about 6% performance improvement on a well-maintained and actively worked-on 25-year-old codebase.
Why am I calling it out in this manner, however?
Because when I ran into this blog post and the optimization, it very strongly resonated with the previous discussion on generics. It is a great case study for the issue. Because the language (C, in the case of Postgres) isn’t supporting generics in any meaningful way, those sorts of changes aren’t happening, and they are very costly.
A modern language that is aiming for performance should take this very important aspect of language design into account. To not do so means that your users will have to do something similar to what Postgres is doing. And as we just saw, that sort of stuff isn’t done.
Not having generics means that you are forcing your users to leave performance on the table.
Indeed, pretty much all the modern languages that care for high performance have generics. The one exception that I can think of is Java, and that is because it chose backward compatibility when it added generics.
Adding this conclusion to the previous post about generics data structure, I think that the final result is glaringly obvious. If you want high-performance system, you should choose a language that allows you to express it easily and succinctly. And generics are mandatory tooling in the box for that.
Delegates are used to pass methods as arguments to other methods. The most common delegates are Action, Func<T>, and EventHandler. You can use a lambda expression to provide a delegate or you can use a method group. You can also cache the delegate into a field and reuse the instance when need