Page 13 • Relatively General .NET

.NET MAUI Performance Features in .NET 9

by Jonathan,Simon

posted on: February 20, 2025

Optimize .NET MAUI application size and startup times with trimming and NativeAOT. Learn about `dotnet-trace` and `dotnet-gcdump` for measuring performance.

RavenDB 7.1

by Oren Eini

posted on: February 19, 2025

I have been delaying the discussion about the performance numbers for a reason. Once we did all the work that I described in the previous posts, we put it to an actual test and ran it against the benchmark suite. In particular, we were interested in the following scenario:High insert rate, with about 100 indexes active at the same time. Target: Higher requests / second, lowered latencyPreviously, after hitting a tipping point, we would settle at under 90 requests/second and latency spikes that hit over 700(!) ms. That was the trigger for much of this work, after all. The initial results were quite promising, we were showing massive improvement across the board, with over 300 requests/second and latency peaks of 350 ms.On the one hand, that is a really amazing boost, over 300% improvement. On the other hand, just 300 requests/second - I hoped for much higher numbers. When we started looking into exactly what was going on, it became clear that I seriously messed up.Under load, RavenDB would issue fsync() at a rate of over 200/sec. That is… a lot, and it means that we are seriously limited in the amount of work we can do. That is weird since we worked so hard on reducing the amount of work the disk has to do. Looking deeper, it turned out to be an interesting combination of issues.Whenever Voron changes the active journal file, we’ll register the new journal number in the header file, which requires an fsync() call. Because we are using shared journals, the writes from both the database and all the indexes go to the same file, filling it up very quickly. That meant we were creating new journal files at a rate of more than one per second. That is quite a rate for creating new journal files, mostly resulting from the sheer number of writes that we funnel into a single file. The catch here is that on each journal file creation, we need to register each one of the environments that share the journal. In this case, we have over a hundred environments participating, and we need to update the header file for each environment. With the rate of churn that we have with the new shared journal mode, that alone increases the number of fsync() generated.It gets more annoying when you realize that in order to actually share the journal, we need to create a hard link between the environments. On Linux, for example, we need to write the following code:bool create_hard_link_durably(const char *src, const char *dest) { if (link(src, dest) == -1) return false; int dirfd = open(dest, O_DIRECTORY); if (dirfd == -1) return false; int rc = fsync(dirfd); close(dirfd); return rc != -1; }You need to make 4 system calls to do this properly, and most crucially, one of them is fsync() on the destination directory. This is required because the man page states:Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.Shared journals mode requires that we link the journal file when we record a transaction from that environment to the shared journal. In our benchmark scenario, that means that each second, we’ll write from each environment to each journal. We need an fsync() for the directory of each environment per journal, and in total, we get to over 200 fsync() per journal file, which we replace at a rate of more than one per second.Doing nothing as fast as possible…Even with this cost, we are still 3 times faster than before, which is great, but I think we can do better. In order to be able to do that, we need to figure out a way to reduce the number of fsync() calls being generated. The first task to handle is updating the file header whenever we create a new journal. We are already calling fsync() on the directory when we create the journal file, so we ensure that the file is properly persisted in the directory. There is no need to also record it in the header file. Instead, we can just use the directory listing to handle this scenario. That change alone saved us about 100 fsync() / second.The second problem is with the hard links, we need to make sure that these are persisted. But calling fsync() for each one is cost-prohibitive. Luckily, we already have a transactional journal, and we can re-use that. As part of committing a set of transactions to the shared journal, we’ll also record an entry in the journal with the associated linked paths. That means we can skip calling fsync() after creating the hard link, since if we run into a hard crash, we can recover the linked journals during the journal recovery. That allows us to skip the other 100 fsync() / second.Another action we can take to reduce costs is to increase the size of the journal files. Since we are writing entries from both the database and indexes to the same file, we are going through them a lot faster now, so increasing the default maximum size will allow us to amortize the new file costs across more transactions.The devil is in the detailsThe idea of delaying the fsync() of the parent directory of the linked journals until recovery is a really great one because we usually don’t recover. That means we delay the cost indefinitely, after all. Recovery is rare, and adding a few milliseconds to the recovery time is usually not meaningful. However… There is a problem when you look at this closely. The whole idea behind shared journals is that transactions from multiple storage environments are written to a single file, which is hard-linked to multiple directories. Once written, each storage environment deals with the journal files independently. That means that if the root environment is done with a journal file and deletes it before the journal file hard link was properly persisted to disk and there was a hard crash… on recovery, we won’t have a journal to re-create the hard links, and the branch environments will be in an invalid state.That is a set of circumstances that is going to be unlikely, but that is something that we have to prevent, nevertheless. The solution for that is to keep track of all the journal directories, and whenever we are about to delete a journal file, we’ll sync all the associated journal directories. The key here is that when we do that, we keep track of the current journal written to that directory. Instead of having to run fsync() for each directory per journal file, we can amortize this cost. Because we delayed the actual syncing, we have time to create more journal files, so calling fsync() on the directory ensures that multiple files are properly persisted.We still have to sync the directories, but at least it is not to the tune of hundreds of times per second.Performance resultsAfter making all of those changes, we run the benchmark again. We are looking at over 500 req/sec and latency peaks that hover around 100 ms under load. As a reminder, that is almost twice as much as the improved version (and a hugeimprovement in latency).If we compare this to the initial result, we increased the number of requests per second by over 500%.But I have some more ideas about that, I’ll discuss them in the next post…

Building .NET AI apps with Chroma

by Luis,Jiri

posted on: February 19, 2025

Get started building AI applications using Chroma DB using the C# client SDK.

Setting application environment variables in IIS without restarts

by Andrew Lock

posted on: February 18, 2025

In this post I outline how IIS works, describe how to add environment variables to app pools, and show how to prevent automatic app pool recycling on changes…

RavenDB 7.1

by Oren Eini

posted on: February 17, 2025

I wrote before about a surprising benchmark that we ran to discover the limitations of modern I/O systems. Modern disks such as NVMe have impressive capacity and amazing performance for everyday usage. When it comes to the sort of activities that a database engine is interested in, the situation is quite different.At the end of the day, a transactional database cares a lot about actually persisting the data to disk safely. The usual metrics we have for disk benchmarks are all about buffered writes, that is why we run our own benchmark. The results were really interesting (see the post), basically, it feels like there is a bottleneck writing to the disk. The bottleneck is with the number of writes, not how big they are.If you are issuing a lot of small writes, your writes will contend on that bottleneck and you’ll see throughput that is slow. The easiest way to think about it is to consider a bus carrying 50 people at once versus 50 cars with one person each. The same road would be able to transport a lot more people with the bus rather than with individual cars, even though the bus is heavier and (theoretically, at least) slower.Databases & StoragesIn this post, I’m using the term Storage to refer to a particular folder on disk, which is its own self-contained storage with its own ACID transactions. A RavenDB database is composed of many such Storages that are cooperating together behind the scenes.The I/O behavior we observed is very interesting for RavenDB. The way RavenDB is built is that a single database is actually made up of a central Storage for the data, and separate Storages for each of the indexes you have. That allows us to do split work between different cores, parallelize work, and most importantly, benefit from batching changes to the indexes. The downside of that is that a single transaction doesn’t cause a single write to the disk but multiple writes. Our test case is the absolute worst-case scenario for the disk, we are making a lot of individual writes to a lot of documents, and there are about 100 indexes active on the database in question. In other words, at any given point in time, there are many (concurrent) outstanding writes to the disk. We don’t actually care about most of those writers, mind you. The only writer that matters (and is on the critical path) is the database one. All the others are meant to complete in an asynchronous manner and, under load, will actually perform better if they stall (since they can benefit from batching).The problem is that we are suffering from this situation. In this situation, the writes that the user is actually waiting for are basically stuck in traffic behind all the lower-priority writes. That is quite annoying, I have to say. The role of the Journal in VoronThe Write Ahead Journal in Voron is responsible for ensuring that your transactions are durable. I wrote about it extensively in the past (in fact, I would recommend the whole series detailing the internals of Voron). In short, whenever a transaction is committed, Voron writes that to the journal file using unbuffered I/O. Remember that the database and each index are running their own separate storages, each of which can commit completely independently of the others. Under load, all of them may issue unbuffered writes at the same time, leading to congestion and our current problem.During normal operations, Voron writes to the journal, waits to flush the data to disk, and then deletes the journal. They are never actually read except during startup. So all the I/O here is just to verify that, on recovery, we can trust that we won’t lose any data.The fact that we have many independent writers that aren’t coordinating with one another is an absolute killer for our performance in this scenario. We need to find a way to solve this, but how?One option is to have both indexes and the actual document in the same storage. That would mean that we have a single journal and a single writer, which is great. However, Voron has a single writer model, and for very good reasons. We want to be able to process indexes in parallel and in batches, so that was a non-starter.The second option was to not write to the journal in a durable manner for indexes. That sounds… insane for a transactional database, right? But there is logic to this madness. RavenDB doesn’t actually need its indexes to be transactional, as long as they are consistent, we are fine with “losing” transactions (for indexes only, mind!). The reasoning behind that is that we can re-index from the documents themselves (who would be writing in a durable manner). We actively considered that option for a while, but it turned out that if we don’t have a durable journal, that makes it a lot more difficult to recover. We can’t rely on the data on disk to be consistent, and we don’t have a known good source to recover from. Re-indexing a lot of data can also be pretty expensive. In short, that was an attractive option from a distance, but the more we looked into it, the more complex it turned out to be. The final option was to merge the journals. Instead of each index writing to its own journal, we could write to a single shared journal at the database level. The problem was that if we did individual writes, we would be right back in the same spot, now on a single file rather than many. But our tests show that this doesn’t actually matter. Luckily, we are well-practiced in the notion of transaction merging, so this is exactly what we did. Each storage inside a database is completely independent and can carry on without needing to synchronize with any other. We defined the following model for the database:Root Storage: DocumentsBranch: Index - Users/SearchBranch: Index - Users/LastSuccessfulLoginBranch: Index - Users/ActivityThis root & branch model is a simple hierarchy, with the documents storage serving as the root and the indexes as branches. Whenever an index completes a transaction, it will prepare the journal entry to be written, but instead of writing the entry to its own journal, it will pass the entry to the root. The root (the actual database, I remind you) will be able to aggregate the journal entries from its own transaction as well as all the indexes and write them to the disk in a single system call. Going back to the bus analogy, instead of each index going to the disk using its own car, they all go on the bus together.We now write all the entries from multiple storages into the same journal, which means that we have to distinguish between the different entries. I wrote a bit about the challenges involved there, but we got it done.The end result is that we now have journal writes merging for all the indexes of a particular database, which for large databases can reduce the total number of disk writes significantly. Remember our findings from earlier, bigger writes are just fine, and the disk can process them at GB/s rate. It is the number of individual writes that matters most here.Writing is not even half the job, recovery (read) in a shared worldThe really tough challenge here wasn’t how to deal with the write side for this feature. Journals are never read during normal operations. Usually we only ever write to them, and they keep a persistent record of transactions until we flush all changes to the data file, at which point we can delete them.It is only when the Storage starts that we need to read from the journals, to recover all transactions that were committed to them. As you can imagine, even though this is a rare occurrence, it is one of critical importance for a database. This change means that we direct all the transactions from both the indexes and the database into a single journal file. Usually, each Storage environment has its own Journals/ directory that stores its journal files. On startup, it will read through all those files and recover the data file. How does it work in the shared journal model? For a root storage (the database), nothing much changes. We need to take into account that the journal files may contain transactions from a different storage, such as an index, but that is easy enough to filter. What about branch storage (an index) recovery? Well, it can probably just read the Journals/ directory of the root (the database), no?Well, maybe. Here are some problems with this approach. How do we encode the relationship between root & branch? Do we store a relative path, or an absolute path? We could of course just always use the root’s Journals/ directory, but that is problematic. It means that we could only open the branch storage if we already have the root storage open. Accepting this limitation means adding a new wrinkle into the system that currently doesn’t exist.It is highly desirable (for many reasons) to want to be able to work with just a single environment. For example, for troubleshooting a particular index, we may want to load it in isolation from its database. Losing that ability, which ties a branch storage to its root, is not something we want.The current state, by the way, in which each storage is a self-contained folder, is pretty good for us. Because we can play certain tricks. For example, we can stop a database, rename an index folder, and start it up again. The index would be effectively re-created. Then we can stop the database and rename the folders again, going back to the previous state. That is not possible if we tie all their lifetimes together with the same journal file.Additional complexity is not welcome in this projectBuilding a database is complicated enough, adding additional levels of complexity is a Bad Idea. Adding additional complexity to the recovery process (which by its very nature is both critical and rarely executed) is a Really Bad Idea.I started laying out the details about what this feature entails:A database cannot delete its journal files until all the indexes have synced their state. What happens if an index is disabled by the user?What happens if an index is in an error state?How do you manage the state across snapshot & restore?There is a well-known optimization for databases in which we split the data file and the journal files into separate volumes. How is that going to work in this model?Putting the database and indexes on separate volumes altogether is also a well-known optimization technique. Is that still possible?How do we migrate from legacy databases to the new shared journal model? I started to provide answers for all of these questions… I’ll spare you the flow chart that was involved, it looked something between abstract art and the remains of a high school party. The problem is that at a very deep level, a Voron Storage is meant to be its own independent entity, and we should be able to deal with it as such. For example, RavenDB has a feature called Side-by-Side indexes, which allows us to have two versions of an index at the same time. When both the old and new versions are fully caught up, we can shut down both indexes, delete the old one, and rename the new index with the old one’s path. A single shared journal would have to deal with this scenario explicitly, as well as many other different ones that all made such assumptions about the way the system works.Not my monkeys, not my circus, not my problemI got a great idea about how to dramatically simplify the task when I realized that a core tenet of Voron and RavenDB in general is that we should not go against the grain and add complexity to our lives. In the same way that Voron uses memory-mapped files and carefully designed its data access patterns to take advantage of the kernel’s heuristics. The idea is simple, instead of having a single shared journal that is managed by the database (the root storage) and that we need to access from the indexes (the branch storages), we’ll have a single shared journal with many references.The idea is that instead of having a single journal file, we’ll take advantage of an interesting feature: hard links. A hard link is just a way to associate the same file data with multiple file names, which can reside in separate directories. A hard link is limited to files running on the same volume, and the easiest way to think about them is as pointers to the file data. Usually, we make no distinction between the file name and the file itself, but at the file system level, we can attach a single file to multiple names. The file system will manage the reference counts for the file, and when the last reference to the file is removed, the file system will delete the file. The idea is that we’ll keep the same Journals/ directory structure as before, where each Storage has its own directory. But instead of having separate journals for each index and the database, we’ll have hard links between them. You can see how it will look like here, the numbers next to the file names are the inode numbers, you can see that there are multiple such files with the same inode number (indicating that there are multiple links to the same underlying file).. └── [ 40956] my.shop.db ├── [ 40655] Indexes │ ├── [ 40968] Users_ByName │ │ └── [ 40970] Journals │ │ ├── [ 80120] 0002.journal │ │ └── [ 82222] 0003.journal │ └── [ 40921] Users_Search │ └── [ 40612] Journals │ ├── [ 81111] 0001.journal │ └── [ 82222] 0002.journal └── [ 40812] Journals ├── [ 81111] 0014.journal └── [ 82222] 0015.journalWith this idea, here is how a RavenDB database manages writing to the journal. When the database needs to commit a transaction, it will write to its journal, located in the Journals/ directory. If an index (a branch storage) needs to commit a transaction, it does not write to its own journal but passes the transaction to the database (the root storage), where it will be merged with any other writes (from the database or other indexes), reducing the number of write operations.The key difference here is that when we write to the journal, we check if that journal file is already associated with this storage environment. Take a look at the Journals/0015.journal file, if the Users_ByName index needs to write, it will check if the journal file is already associated with it or not. In this case, you can see that Journals/0015.journal points to the same file (inode) as Indexes/Users_ByName/Journals/0003.journal. What this means is that the shared journals mode is only applicable for committing transactions, there have been no changes required for the reads / recovery side. That is a major plus for this sort of a critical feature since it means that we can rely on code that we have proven to work over 15 years.The single writer mode makes it workA key fact to remember is that there is always only a single writer to the journal file. That means that there is no need to worry about contention or multiple concurrent writes competing for access. There is one writer and many readers (during recovery), and each of them can treat the file as effectively immutable during recovery.The idea behind relying on hard links is that we let the operating system manage the file references. If an index flushes its file and is done with a particular journal file, it can delete that without requiring any coordination with the rest of the indexes or the database. That lack of coordination is a huge simplification in the process.In the same manner, features such as copying & moving folders around still work. Moving a folder will not break the hard links, but copying the folder will. In that case, we don’t actually care, we’ll still read from the journal files as normal. When we need to commit a new transaction after a copy, we’ll create a new linked file. That means that features such as snapshots just work (although restoring from a snapshot will create multiple copies of the “same” journal file). We don’t really care about that, since in short order, the journals will move beyond that and share the same set of files once again.In the same way, that is how we’ll migrate from the old system to the new one. It is just a set of files on disk, and we can just create new hard links as needed.Advanced scenarios behaviorI mentioned earlier that a well-known technique for database optimizations is to split the database file and the journals into separate volumes (which provides higher overall I/O throughput). If the database and the indexes reside on different volumes, we cannot use hard links, and the entire premise of this feature fails. In practice, at least for our users’ scenarios, that tends to be the exception rather than the rule. And shared journals are a relevant optimization for the most common deployment model.Additional optimizations - vectored I/OThe idea behind shared journals is that we can funnel the writes from multiple environments through a single pipe, presenting the disk with fewer (and larger) writes. The fact that we need to write multiple buffers at the same time also means that we can take advantage of even more optimizations.In Windows, we can use WriteFileGather to submit a single system call to merge multiple writes from many indexes and the database. On Linux, we use pwritev for the same purpose. The end result is additional optimizations beyond just the merged writes. I have been working on this set of features for a very long time, and all of them are designed to be completely invisible to the user. They either give us a lot more flexibility internallyor they are meant to just provide better performance without requiring any action from the user.I’m really looking forward to showing the performance results. We’ll get to that in the next post…

Detect missing migrations in Entity Framework Core

by Gérald Barré

posted on: February 17, 2025

Entity Framework Core allows to update the database schema using migrations. The migrations are created manually by running a CLI command. It's easy to forget to create a new migration when you change the model. To ensure the migrations are up-to-date, you can write a test that compares the current

RavenDB 7.1

by Oren Eini

posted on: February 14, 2025

After describing in detail the major refactoring we did for how RavenDB (via Voron, its storage engine) has gone through, there is one question remaining. What’s the point? The code is a lot simpler, of course, but the whole point of this much effort is to allow us to do interesting things.There is performance, of course, but we haven’t gotten around to testing that yet because something that is a lot more interesting came up: Disk space management. Voron allocates disk space from the operating system in batches of up to 1GB at a time. This is done to reduce file fragmentation and allow the file system to optimize the disk layout. It used to be something critical, but SSDs and NVMe made that a lot less important (but still a factor).What happens if we have a very large database, but we delete a big collection of documents? This is a case where the user’s expectations and Voron’s behavior diverge. A user who just deleted a few million documents would expect to see a reduction in the size of the database. But Voron will mark the area on the disk as “to-be-used-later” and won’t free the disk space back to the operating system. There were two reasons for this behavior:We designed Voron in an era where it was far more common for systems to have hard disks, where fragmentation was a very serious problem. It is really complicated to actually release disk space back to the operating system.The first reason is no longer that relevant, since most database servers can reasonably expect to run on SSD or NVMe these days, significantly reducing the cost of fragmentation. The second reason deserves a more in-depth answer.In order to release disk space back to the operating system, you have to do one of three things:Store the data across multiple files and delete a file where it is no longer in use.Run compaction, basically re-build the database from scratch in a compact form.Use advanced features such as sparse files (hole punching) to return space to the file system without changing the file size.The first option, using multiple files, is possible but pretty complex. Mostly because of the details of how you split to multiple files, whenever a single entry in an otherwise empty file will prevent its deletion, etc. There are also practical issues, such as the number of open file handles that are allowed, internal costs at the operating system level, etc.Compaction, on the other hand, requires that you have enough space available during the compaction to run. In other words, if your disk is 85% full, and you delete 30% of the data, you don’t have free space to run a compaction. Another consideration for compaction is that it can be really expensive. Running compaction on a 100GB database, for example, can easily take hours and in the cloud will very quickly exhaust your I/O credits.RavenDB & Voron have supported compaction for over a decade, but it was always something that you did on very rare occasions. A user had to manually trigger it, and the downsides are pretty heavy, as you can see.In most cases, I have to say, returning disk space back to the operating system is not something that is that interesting. That free space is handled by RavenDB and will be reused before we’ll allocate any additional new space from the OS. However, this is one of those features that keep coming up, because we go against users’ expectations. The final option I discussed is using hole punching or sparse files (the two are pretty much equivalent - different terms between operating systems). The idea is that we can go to the operating system and tell it that a certain range in the file is not used, and that it can make use of that disk space again. Any future read from that range will return zeroes. If you write to this region, the file system will allocate additional space for those writes.That behavior is problematic for RavenDB, because we used to use memory-mapped I/O to write to the disk. If there isn’t sufficient space to write, memory-mapped I/O is going to generate a segmentation fault / access violation. In general, getting an access violation because of a disk full is not okay by us, so we couldn’t use sparse files. The only option we were able to offer to reduce disk space was full compaction.You might have noticed that I used past tense in the last paragraph. That is because I am now no longer limited to using just memory-mapped I/O. Using normal I/O for this purpose works even if we run out of disk space, we will get the usual disk full error (which we are already handling anyway). Yes, that means that starting with RavenDB 7.1, we’ll automatically release free disk space directly back to the operating system, matching your likely expectations about the behavior. This is done in increments of 1MB, since we still want to reduce fragmentation and the number of file metadata that the file system needs to manage.The one MB triggerRavenDB will punch a hole in the file whenever there is a consecutive 1MB of free space. This is important to understand because of fragmentation. If you wrote 100 million documents, each 2 KB in size, and then deleted every second document, what do you think will happen? There won’t be any consecutive 1MB range for us to free.Luckily, that sort of scenario tends to be pretty rare, and it is far more common to have clustering of writes and deletes, which allow us to take advantage of locality and free the disk space back to the OS automatically.RavenDB will first use all the free space inside the file, reclaiming sparse regions as needed, before it will request additional disk space from the OS. When we do request additional space, we’ll still get it in large chunks (and without using sparse files). That is because it is far more likely to be immediately used, and we want to avoid giving the file system too much work.Note that the overall file size is going to stay the same, but the actually used disk space is going to be reduced. We updated the RavenDB Studio to report both numbers, but when browsing the files manually, you need to keep that in mind.I expect that this will be most noticeable for users who are running on cloud instances, where it is common to size the disks to be just sufficiently big enough for actual usage. It Just WorksThere is no action that you need to take to enable this behavior, and on first start of RavenDB 7.1, it will immediately release any free space already in the data files. The work was actually completed and merged in August 2024, but it is going to be released sometime in Q2/Q3 of 2025. You might have noticed that there have been a lot of low-level changes targeted at RavenDB 7.1. We need to run them through the wringer to make sure that everything works as it should.I’m looking forward to seeing this in action, there are some really nice indications about the sort of results we can expect. I’ll talk about that in more detail in another post, this one is getting long enough.

New Features for Enhanced Razor Productivity!

by Leslie Richardson

posted on: February 13, 2025

The Extract to Component refactoring and the Roslyn tokenizer are two new features designed to help improve your productivity in Razor files.

RavenDB 7.1

by Oren Eini

posted on: February 12, 2025

In the previous post, I talked about a massive amount of effort (2+ months of work) and about 25,000 lines of code changes. The only purpose of this task was to remove two locks from the system. During high load, we spent huge amounts of time contending for these locks, so removing them was well worth the effort.During this work, I essentially found myself in the guts of Voron (RavenDB’s storage engine) and mostly dealing with old code. I’m talking about code that was written between 10 and 15 years ago. I wrote a blog post about it at the time. Working with old code is an interesting experience, especially since most of this code was written by me. I can remember some of my thoughts from the time I wrote it.Old code is working code, and old code is also something that was built upon. Other parts of the codebase are making assumptions about the way the system behaves. And the more time a piece of code doesn't change, the more likely its behavior is going to ossify. Changing old code is hard because of the many ways that such dependencies can express themselves. I dug through all of this decade-plus old code and I realized something pretty horrible. It turns out that I made a mistake in understanding how Windows implements buffering for memory-mapped files. I realized my mistake around mid-2024, see the related post for theactual details.The TLDR summary, however, is that when using unbuffered file I/O with memory-mapped files on Windows, you cannot expect the mapped memory to reflect the data written using the file I/O API. Windows calls it coherence, and it was quite confusing when I first realized what the problem was. It turns out that this applies only to unbuffered I/O and there is no such problem with buffered I/O. The scenario I needed to work with can use buffered I/O, however, which has been a profound shock to me. Large portions of the architecture of Voron are actually shaped by this limitation. Because I thought that you couldn’t use both file I/O and memory-mapped files at the same time in Windows and get a consistent view of the data (the documentation literally says that, I have to add), RavenDB used memory-mapped I/O to write to the data file. That is a choice, certainly, but not one that I particularly liked. It was just that I was able to make things work and move on to other things.This is another tale of accidental complexity, actually. I had a problem and found a solution to it, which at the time I considered quite clever. Because I had a solution, I never tried to dig any deeper into it and figure out whether this is the only solution.This choice of using only memory-mapped I/O to write to the data file had consequences. In particular, it meant that:We had to map the data using read-write mode.There was no simple way to get an error if a write failed - since we just copied the data to memory, there was no actual write to fail. An error to write to disk would show up as a memory access violation (segmentation fault!) or just not show up at all.Writing to a page that isn’t in memory may require us to read it first (even if we are overwriting all of it).I accepted those limitations because I thought that this was the only way to do things. When I realized that I was wrong, that opened up so many possibilities. As far as the refactoring work, the way Voron did things changed significantly. We are now mapping the data file as read-only and writing to it using file I/O.That means we have a known point of failure if we fail to write. That probably deserves some explanation. Failure to write to the disk can come in a bunch of ways. In particular, successfully writing to a file is not enough to safely store data, you also need to sync the file before you can be assured that the data is safe. The key here is that write + sync ensures that you’ll know that this either succeeded or failed. Here is the old way we were writing to the data file. Conceptually, this looks like this:auto mem = EnsureFileSize(pagesToWrite[pagesToWriteLength - 1].EndPosition); for(auto i = 0; i < pagesToWriteLength; i++) { auto path = pagesToWrite[i]; memcpy(mem + page.Number * 8192, page.Buffer, page.Length); } // some later time if(FAILED(msync(mem)) return SYNC_FAILURE;And here is the first iteration of using the file I/O API for writes.fallocate_if_needed(pagesToWrite[pagesToWriteLength - 1].EndPosition); for(auto i = 0; i < pagesToWriteLength; i++) { auto path = pagesToWrite[i]; if(FAILED(pwrite(page.Number * 8192, page.Buffer, page.Length))) return WRITE_FAILURE; } // some time later if (FAILED(fdatasync(file)) return SYNC_FAILURE;Conceptually, this is just the same, but notice that we respond immediately to write failures here. When we started testing this feature, we realized something really interesting. The new version was much slower than the previous one, and it also generated a lot more disk writes.I moved mountains for this?Sometimes you get a deep sense of frustration when you look at benchmark results. The amount of work invested in this change is… pretty high. And from an architectural point of view, I’m loving it. The code is simpler, more robust, and allows us to cleanly do a lot more than we used to be able to.The code also should be much faster, but it wasn’t. And given that performance is a critical aspect of RavenDB, that may cause us to scrap the whole thing. Looking more deeply into the issue, it turned out that my statement about old code and the state of the system was spot on. Take a look at the two code snippets above and consider how they look from the point of view of the operating system. In the case of the memcpy() version, there is a strong likelihood that the kernel isn’t even involved (the pages are already paged in), and the only work done here is marking them as dirty (done by the CPU directly).That means that the OS will figure out that it has stuff to write to the disk either when we call msync() or when its own timer is called. On the other hand, when we call pwrite(), we involve the OS at every stage of the process, making it far more likely that it will start the actual write to the disk earlier. That means that we are wasting batching opportunities.In other words, because we used memory-mapped writes, we (accidentally, I might add) created a situation where we tried very hard to batch those writes in memory as much as possible. Another aspect here is that we are issuing a separate system call for each page. That means we are paying another high price. The good thing about this is that we now have a good way to deal with those issues. The pwrite() code above was simply the first version used to test things out. Since we now have the freedom to run, we can use whatever file I/O we want.In particular, RavenDB 7.1 now supports the notion of write modes, with the following possible options:mmap - exactly like previous versions, uses a writable memory map and memcpy() to write the values to the data file.file_io - uses pwrite() to write the data, onc page at a time, as shown above.vectored_file_io - uses pwritev() to write the data, merging adjacent writes to reduce the number of system calls we use (Posix only, since Windows has strict limits on this capability).io_ring - uses HIORING (Windows) / IO_Uring (Linux) to submit the whole set of work to the kernel as a single batch of operations.RavenDB will select the appropriate mode for the system on its own, usually selecting io_ring for modern Linux and Windows machines, and vectored_file_io for Mac. You can control that using the RAVEN_VORON_WRITER_MODE environment variable, but that is provided only because we want to have an escape hatch, not something that you are meant to configure.With those changes, we are on a much better footing for overall performance, but we aren’t done yet! I would love to give you the performance numbers, but we didn’t actually run the full test suite with just these changes. And that is because we aren’t done yet, I’ll cover that in the next post.

When QA Keeps Finding Bugs

by Ardalis

posted on: February 12, 2025

A developer manager recently reached out with a concern: their new QA team member was finding too many bugs, leading to frustration among…Keep Reading →