Relatively General .NET

Optimizing Corax: Optimizing tree operations

by Oren Eini

posted on: August 30, 2022

I love trees. Not the ones that produce oxygen, I mean, I guess they are nice too, but I’m talking about trees in software. To be rather more exact, I’m fascinated by B+Trees and all their permutations. In a very real sense, most databases are all about B+Trees. To give you some context, Voron alone, at this time, has the following:

Tree – the core, used to map an arbitrary byte string key to an arbitrarily sized value. This supports many operations, including compressed values, overflow pages, etc. Typically used to implement storage indexes over non-numeric data.

Fixed Size Tree – a more compact implementation for when more is known about the data, with an int64 key and a fixed size value (which may be zero). Typically used to implement storage indexes over numeric data.

Multi Tree – a tree whose keys are arbitrary byte strings and whose value is itself a tree (with no values) that can contain multiple values for a single entry. Used to implement non-unique storage indexes.

Compact Tree – a tree mapping arbitrary byte string keys to an int64 value. The keys are compressed and we can pack a lot of values into a single page. Used to handle the field terms in Corax.

Set – a tree used to efficiently store int64 values in an ordered manner without duplicates. The data is heavily compressed and is used to handle large posting lists in Corax.

And I’m probably forgetting a few. Those are all various permutations on a B+Tree, and together they create a really interesting set of capabilities for Voron. Today I want to talk to you about an issue we ran into when building Corax, in particular, this one:

Before we dive into exactly what is going on here, I feel that I need to give some background. Corax is an indexing engine, and at its core is an inverted index. When we need to delete an entry, we need to remove references to it from all the terms that it is a part of. The code producing the above is implemented roughly like this:

```python
def process_deletes(self):
    for idToDel in self.deletes:
        for term in self.getEntry(idToDel):
            termTree = self.getTermTree(term)
            termTree.delete(idToDel)
        self.deleteEntry(idToDel)
```

That particular piece accounts for almost 80% of the indexing commit time (and a significant fraction of the total indexing time). That is… just the way it is, I’m afraid. We have to process things in this manner. Lucene does things differently: it will mark the entry as deleted (and then ignore it afterward), eventually merging the results away. I really didn’t want to go down that route. One of the key design principles for Corax is that we want to avoid deferred payments; we want to pay everything in “cash” right now.

If we look deeper into the DeleteCommit() function, we can see some interesting details. The DeleteIdFromExactTerm() call is the piece that is doing the majority of the work, accounting for over 60% of the runtime for processing deletes. Let’s dig deeper here, and we can see that this method is actually really fast. The total cost for invoking it is 5.5 μs. The problem is that we invoke it 4 million times. And looking into the actual costs, what is the issue? It is the TryGetValue() and FindPageFor() calls, which scan through the tree structure to find the relevant place from which to remove the value.
As a reminder, here is a tree structure (it isn’t a B+Tree, mind, because we don’t need that to understand what is going on). What we have is a lot of calls to Remove() on this tree, and we always need to start searching from the root. It turns out that this is quite expensive. Consider what happens if I want to remove 15 and 20. I have to compare 25, 14, 19, 16 and 15. Then I have to go back to the root and compare 25, 14, 19, 22 and 20. There are a lot of commonalities here that I can take advantage of. If I can operate locally, I can probably do better, no? We rewrote the delete code to be more like this:

```python
def record_deletes(self):
    delsByTerm = {}
    for idToDel in self.deletes:
        for term in self.getEntry(idToDel):
            if term not in delsByTerm:
                delsByTerm[term] = []
            delsByTerm[term].append(idToDel)
        self.deleteEntry(idToDel)
    for term, idsToDel in delsByTerm.items():
        idsToDel.sort()
        termTree = self.getTermTree(term)
        termTree.deleteBatch(idsToDel)
```

The idea is that we do the deletion of terms in two stages. First, we gather all the ids we need to delete for all the terms, from all the entries that are being deleted. Then we sort those values and invoke a batch delete method. Unlike the individual RemoveValue() calls, we can now take advantage of the structure of the tree. In the case of wanting to remove [15, 20], we can scan the tree (25, 14, 19, 16, 15) to get to the first item that we remove. Then we proceed using the tree’s own structure, so deleting 20 means comparing (16, 19, 22, 20). In this case, we saved one operation, which isn’t that meaningful. But a B+Tree’s most beautiful property is that it is dense. In this case, we are removing values from posting lists, which may contain millions of entries, and it isn’t uncommon to be able to pack thousands of entries per page. That means that the savings are huge.

This particular optimization was sufficient to ensure that we just don’t need to care about deletes any longer. The ability to apply the operation in a batch fashion meant that we could amortize all of the setup costs, which led to a far greater improvement than I actually expected, to be honest. The code in question can make a lot of assumptions about what is going on (the input is sorted, the values are present in the tree, etc.). That makes things so much easier. As a result, Corax can keep paying all costs upfront and still be so much faster than Lucene (which defers things to the background). To give some context, here are the relevant costs for Lucene: Lucene spends more time processing deletes than Corax does for the whole commit process. Note that this is just a small part of the costs that deletes have for Lucene; there is an excellent blog post from Elastic describing those. The costs range between a 20% – 46% slowdown in queries! That is an important reason to avoid deferring work, I think we can agree.
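To make the difference concrete, here is a minimal sketch in C# (not the actual Voron/Corax code; the names and the flat list standing in for a posting list are invented for illustration) contrasting per-id deletes, which restart the search every time, with a batch delete over sorted input that walks forward once:

```csharp
using System;
using System.Collections.Generic;

var a = new List<long> { 14, 15, 16, 19, 20, 22, 25 };
var b = new List<long>(a);

PostingListSketch.DeleteOneByOne(a, new long[] { 15, 20 });
PostingListSketch.DeleteBatch(b, new List<long> { 15, 20 });

Console.WriteLine(string.Join(", ", a)); // 14, 16, 19, 22, 25
Console.WriteLine(string.Join(", ", b)); // same result, single forward pass

static class PostingListSketch
{
    // One search per id: every call restarts from scratch, the way each
    // Remove() in the original code restarts from the root of the tree.
    public static void DeleteOneByOne(List<long> postingList, IEnumerable<long> ids)
    {
        foreach (var id in ids)
        {
            int idx = postingList.BinarySearch(id);
            if (idx >= 0)
                postingList.RemoveAt(idx);
        }
    }

    // Sort the ids once, then make a single forward pass over the (sorted)
    // posting list - analogous to the deleteBatch() call reusing the tree's structure.
    public static void DeleteBatch(List<long> postingList, List<long> idsToDelete)
    {
        idsToDelete.Sort();
        var keep = new List<long>(postingList.Count);
        int d = 0;
        foreach (var value in postingList)
        {
            while (d < idsToDelete.Count && idsToDelete[d] < value)
                d++;
            if (d < idsToDelete.Count && idsToDelete[d] == value)
                continue; // drop this value
            keep.Add(value);
        }
        postingList.Clear();
        postingList.AddRange(keep);
    }
}
```

The batch version visits each stretch of the list only once, which is the same locality the tree version exploits at the page level.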

RavenDB 5.4 is out

by Oren Eini

posted on: August 29, 2022

That actually happened over a month ago, but I’m afraid that July / August hasn’t been kind to either my attention span or my ability to keep track of what is happening. September is around the corner (at which point the summer vacation will be over, and I’m already looking forward to some blessed routine). The 5.4 release is a Long Term Support release, meaning that we encourage users to move their applications to that version. Note: support for RavenDB 5.3 will terminate on Jan 23, and the 5.2 release (also an LTS) will move to the legacy support model a year after the 5.4 release. This release brings to the table, aside from the usual minor fixes and improvements, a few major new features. The primary one is a favorite of mine, internally named etl2q. In other words, it is native ETL support in RavenDB to push messages to Kafka and RabbitMQ. I did a whole webinar on the topic; you can watch it here. We also added support for pulling data directly from RavenDB into Grafana, so you can create great dashboards and visualizations on your data. You can deploy RavenDB 5.4 instances in the cloud or on premises, and we have some additional goodies hidden under the engine that we are still working on.

Performance: string.Create vs FormattableString

by Gérald Barré

posted on: August 29, 2022

Interpolated strings are very common in C#. For instance, you can write $"Hello {name}! You are {age} years old.". This expression is evaluated using the current culture. If you want to use the invariant culture, you can use FormattableString.Invariant($"..."). Starting with .NET 6 and C# 10, you can also use string.Create to specify the culture used to format an interpolated string.
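A minimal sketch of the two approaches being compared, assuming the string.Create(IFormatProvider, …) overload introduced in .NET 6 (this is not code taken from the post):

```csharp
using System;
using System.Globalization;

var name = "Alice";
var age = 42;

// The interpolated string is converted to a FormattableString and then
// formatted with the invariant culture.
string viaFormattable = FormattableString.Invariant($"Hello {name}! You are {age} years old.");

// .NET 6 / C# 10: string.Create accepts an IFormatProvider and an interpolated
// string handler, formatting directly without the intermediate FormattableString.
string viaCreate = string.Create(CultureInfo.InvariantCulture, $"Hello {name}! You are {age} years old.");

Console.WriteLine(viaFormattable == viaCreate); // True
```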

Webinar Recording

by Oren Eini

posted on: August 26, 2022

Modeling relationships and hierarchies is a fundamental topic taught in university – for relational databases. Document databases are quite different, and our approach to such modeling should be different as well. So how do we do it?

If No One Says No, Your Rate is Too Low

by Ardalis

posted on: August 24, 2022

Pricing is hard, and pricing yourself and your services can be one of the toughest things to get right. While I can't claim to have the…

Regex with IgnoreCase option may match more characters than expected

by Gérald Barré

posted on: August 22, 2022

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

String comparisons are harder than it seems
How to correctly count the number of characters of a string
Correctly converting a character to lower/upper case
How not to read a string from an
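To illustrate the kind of mismatch the title refers to, here is a minimal sketch; the Kelvin sign example is an assumption on my part and may not be the one used in the post:

```csharp
using System;
using System.Text.RegularExpressions;

// With IgnoreCase, matching uses Unicode case mappings, so the pattern "k"
// also matches KELVIN SIGN (U+212A), not just the ASCII 'K'/'k' pair.
Console.WriteLine(Regex.IsMatch("\u212A", "k", RegexOptions.IgnoreCase)); // True
Console.WriteLine(Regex.IsMatch("\u212A", "k"));                          // False
```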

Managing the most dangerous constructor ever

by Oren Eini

posted on: August 17, 2022

The design of X509Certificate2 is badly broken in terms of safety. If you load a certificate from the disk or a byte buffer, it will go ahead and create a file on the disk behind the scenes. If you dispose the instance, the file will be removed. However, if you don’t explicitly dispose the instance, that is too bad. The file remains. A ticking time bomb, because eventually you’ll have a lot of such files on the disk. Which is then a fun state to try to recover from. I’m not sure why this design decision was made. I assume that at the time, people didn’t need to work so much with certificates, and a lot of the issues are likely with dealing with the underlying crypto API. Regardless, it is mandatory to dispose of the certificate after you use it. And that leads to a problem. Consider the following code:

```csharp
public class TlsRequestMiddleWare
{
    private X509Certificate2 _cert;

    public async Task<Stream> Next(Stream source)
    {
        var ssl = new SslStream(source);
        await ssl.AuthenticateAsServerAsync(_cert);
        return ssl;
    }

    public void UpdateCert(X509Certificate2 newCert)
    {
        _cert = newCert;
    }
}
```

The idea is that we want to be able to switch certificates on the fly (since we need to update them before they expire, without interrupting the server). Old connections can still use the old certificate, while new ones will use the updated one. Practically speaking, the certificate itself shouldn’t be used after the call to AuthenticateAsServerAsync(), but I don’t believe that we have any such promises. Regardless, as the async designation indicates, that can take a while. How would I know to dispose of the old certificate? I have to consider multi-threading here as well: if I dispose of the certificate while it is being used to authenticate a request, that request will likely fail. Given that I’m racing a native API and disposing of its resources while they are in use, I may open up some severe issues. Ideally, X509Certificate2 should manage that for me. If it implemented a finalizer, it would dispose of itself when the GC made sure that no one was looking at it. That is what I want to happen, but in this case, we have no such support. Luckily, we have options. Behold the following code:

```csharp
public class CertificateCleaner : CriticalFinalizerObject
{
    private X509Certificate2 _certificate;

    private static ConditionalWeakTable<X509Certificate2, CertificateCleaner> _associateLifetimes = new();

    public static void RegisterForDisposalDuringFinalization(X509Certificate2 cert)
    {
        var cleaner = _associateLifetimes.GetOrCreateValue(cert);
        cleaner!._certificate = cert;
    }

    ~CertificateCleaner() => _certificate?.Reset();
}
```

What does this do? It uses several tricks to get what we want, attaching an external finalizer to an object that we don’t control. First, ConditionalWeakTable will ensure that as long as there is a reference to the certificate, the cleaner will be referenced as well.
When there is no reference to the certificate, we’ll need to run the finalizer for the cleaner. Next, we have the usage of CriticalFinalizerObject; this is done to ensure that the finalizer will be called even when the process terminates. This is the same manner in which .NET flushes file handles, so we can be sure that we are doing the utmost to ensure that we’ll properly dispose of the files. Finally, there is the dance with the GetOrCreateValue() call in RegisterForDisposalDuringFinalization(). We need to consider what would happen if we get concurrent requests to register the same certificate. If we let it race, one of the cleaners will be discarded, and then the finalizer will be called on it, causing havoc. In this manner, we let ConditionalWeakTable ensure that there is just one instance, and set the value afterward. Since the value is unique per instance, we can set it multiple times (it will always be set to the same value). The end result is that it takes less than 10 lines of code to fix this (and of course, remember to call register whenever you create a certificate instance). But I would really like that to just be the default behavior. Otherwise, that is a very risky trap.
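For completeness, here is a hypothetical usage sketch. It assumes the TlsRequestMiddleWare and CertificateCleaner classes from the snippets above; the file name and password are placeholders:

```csharp
using System.Security.Cryptography.X509Certificates;

var middleware = new TlsRequestMiddleWare();

// Register every certificate for cleanup right after creating it.
var newCert = new X509Certificate2("server.pfx", "p@ssw0rd");
CertificateCleaner.RegisterForDisposalDuringFinalization(newCert);
middleware.UpdateCert(newCert);

// Once nothing references the previous certificate anymore, the cleaner's
// finalizer calls Reset() on it, releasing the temporary file on disk.
```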

re

by Oren Eini

posted on: August 16, 2022

I read this blog post from Discord, which presents a really interesting approach to a problem that they ran into. Basically, in the cloud you have the choice between fast ephemeral disks and persistent (and much slower) disks. Technically speaking, you can try to get really fast persistent disks, but those are freaking expensive: Ultra disks on Azure, io2 disks on AWS and Extreme persistent disks on GCP. The cost for a single 1TB disk with 25K IOPS would be around 1,800 USD / month, for example. That is just for the disk! To give some context, an r6gd.16xlarge instance with 64 cores and 512 GB (!) of RAM as well as 3.8 TB of NVMe drive will cost you less than that. That machine can also do 80K IOPS, for context. At Discord scale, they ran into the limitations of I/O in the cloud with their database and had to come up with a solution for that. My approach would be to… not do anything, to be honest. They are using a database (ScyllaDB) that is meant to run on commodity hardware. That means that losing a node is an expected scenario. The rest of the nodes in the cluster will be able to pitch in and recover the data when the node comes back or is replaced. That looks like the perfect scenario for running with fast ephemeral disks, no? There are two problems with ephemeral disks. First, they are ephemeral, which means that a hardware problem can make the whole disk go poof. Not an issue, that is why you are using a replicated database, no? We’ll get to that. Another issue that is raised by Discord is that they rely on disk snapshots for backups and other important workflows. The nice thing about snapshotting a cloud disk is that it is typically fast and cheap to do. Otherwise you need to deal with higher level backup systems at the database layer. The issue is that you probably do want that, because restoring a system from a set of disk snapshots that were taken at different times and in various states is likely to be… challenging. Discord's solution to this is to create a set of ephemeral disks that are mirrored into persistent disks. Reads go to the fast disks and writes are spread around. A failure on the ephemeral disks will lead to a recovery of the data from the network disk. Their post has the gory details and a lot more explanation besides. I wanted to write this post for a simple reason: as a database guy, this system scares me to pieces. The issue is that I don’t know what kind of data consistency we can expect in such an environment. Do I get any guarantees about the equivalence of data between the fast disks and the network disk? Can they diverge from one another? What happens when you have partial errors? Consider a transaction that modifies three different pages that end up going to three different fast disks + the network disk. If one of the fast disks has an error during write, what is written to the persistent disk? Can I get an out-of-date read from this system if we read from the network disk for some reason? That may mean that two pages that were written in one transaction come back as different versions. That will likely violate assumptions and invariants and can lead to all sorts of… interesting problems. Given that this system is meant to handle the failure modes, it sounds really scary, because it is an additional layer of complexity (and one that the database is unaware of) to deal with. Aside from the snapshots, I assume that the reason for this behavior is to avoid the cost of rebuilding a node when there is a failure.
I don’t have enough information to say what the failure rate is or what the impact on the overall system would be, but the solution provided is elegant, beautiful and, quite frankly, pretty scary to me. There have been quite a few unknowns that we had to deal with in the realm of storage, but this adds a whole new layer of things that can break.

Importing the Stack Overflow dataset into RavenDB

by Oren Eini

posted on: August 15, 2022

Around 2017, we needed to test RavenDB with realistic datasets. That was the time that we were working hard on the 4.0 release, and we wanted to have some common dataset that was production quality (for all the benefits and complications that this brings) to play with. A serious issue was that we needed that dataset to also be public, because we wanted to discuss its details. The default dataset people usually talk about in such a scenario is the Enron emails, but that is around half a million documents and quite small, all things considered. Luckily for us, Stack Overflow has made their dataset publicly available in a machine-readable format. That means that we could take it, adapt it to RavenDB and use it to test various aspects of our behavior with realistic data. The data is distributed as a set of XML files, so I quickly wrote something that would convert the data to a JSON format and adapt the model to a more relational one. The end result was a dataset with 18 million documents and a hefty size of 52 GB. I remember that at the time, working with this data was a lengthy process. Importing the data took a long time and indexing even longer. A few years later, this is still our go-to dataset for anything involving a non-trivial amount of data, but we have gotten to the point where the full process of working with it has shrunk significantly. It used to take 45+ minutes to import the data; now it takes less than 10, for example. Basically, we made RavenDB good enough that it wasn’t that much of a challenge. Of course… Stack Overflow continues to publish their dataset, so I decided it was time to update our copy again. I no longer have the code that I used to do the initial import, but the entire process was fairly simple. You can look at the code that is used to do the import here. This is meant to be quick & dirty code, mind you. It is about 500 lines of code and handles a surprisingly large number of edge cases. You can find the actual data dump here, the explanation about the schema is here, and there is also a database diagram here. In case you missed the point, the idea is that I want to remember how I did it for the next time I'll want to refresh our dataset. So far, I imported a bunch of Stack Exchange communities:

World Building – just over 100K documents and 1 GB in size. Small enough to play with seamlessly.

Super User – 1.85 million documents, weighing in at 4 GB. I think we’ll use that as the default database for showing things off on the Raspberry Pi edition.

Stack Overflow – 40.5 million documents and exceeding 150 GB in size. This is a great scenario for working with a significant amount of data, and is likely to be our new default benchmarking database. The other advantage is that everyone is familiar with Stack Overflow. It makes for a great demo when we can pull up realistic data on the fly.

It already gave me some interesting details to explore. For example, enabling documents compression for the Super User community reduced the disk utilization to under 2 GB. That is a great space saving, and it means that we can easily fit the entire database on a small SD card and have a “RavenDB Server + Database in a box” as a Raspberry Pi. The Stack Overflow dataset is 150 GB without compression; with documents compression it drops to just 57 GB, which is all kinds of amazing. They make for great demos.
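For reference, here is a rough sketch of that kind of conversion (this is not the importer linked above; it copies row attributes generically instead of assuming a specific schema, since the Stack Exchange dumps store each record as a <row> element whose fields are attributes):

```csharp
using System;
using System.Text.Json;
using System.Xml;

// Stream one of the dump files (e.g. Posts.xml) and emit a JSON array of
// objects, one per <row> element, keeping every attribute as a string.
using var reader = XmlReader.Create("Posts.xml");
using var output = new Utf8JsonWriter(Console.OpenStandardOutput());

output.WriteStartArray();
while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element || reader.Name != "row")
        continue;

    output.WriteStartObject();
    for (int i = 0; i < reader.AttributeCount; i++)
    {
        reader.MoveToAttribute(i);
        output.WriteString(reader.Name, reader.Value);
    }
    reader.MoveToElement();
    output.WriteEndObject();
}
output.WriteEndArray();
output.Flush();
```

The real importer also reshapes the model rather than doing a straight conversion, which is presumably where most of those 500 lines and edge cases come from.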