My article about Data Management in Complex Systems was published in DZone as part of DZone's 2022 Database Systems Trend Report.
I would love your feedback on it.
When an application is not running well, it could be useful to generate a dump file to debug it. There are many ways to generate a dump file on Windows, Linux, or Azure. Table Of ContentsWindowsdotnet-dump (Windows)Windows Task ManagerSysInternals - Process ExplorerSysInternals - ProcDump (Windows)
Someone asked me why I set AllowUnsafeBlocks to true in the Meziantou.DotNet.CodingStandard package (source). In this post, I'll explain what this property enables, and why it is not more unsafe to set it to true.#The AllowUnsafeBlocks propertyThe AllowUnsafeBlocks compiler option allows code that
RavenDB has a really nice feature, it allows you to index data from related documents. Consider the following document structure:
We have tickets, vehicles, and users, and we want to issue a search on all the tickets issued to Joe. Leaving aside whether this is the proper way to handle this, here is what the index would look like:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
map("Tickets", t =>
{
var vehicle = load(t.car, "Vehicles");
var user = load(vehicle.owner, "Users");
return { t.amount, t.reason, user.name };
});
view raw
ticketsByUsername.js
hosted with ❤ by GitHub
What we are doing here is walk the reference graph and index data from related documents. So far, so good. The cool thing about this feature is that RavenDB is in charge of ensuring that if we update the owner of the vehicle or the name of the user, the Right Thing will happen.
Of course, I wouldn’t be writing this blog post if we didn’t run into a problem in this scenario.
The way it works, for each collection referenced by the index, RavenDB maintains a list of the last document that was chceked for changes in the collection. That way, on modification of a related document, we can tell that we need to re-index a particular document.
This looks something like this:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
{
"Referenced": {
"vehicles/200": ["tickets/100"],
"users/300": ["tickets/100"]
},
"Tickets": 123,
"Vehicles": 456
}
view raw
metadata.index.json
hosted with ❤ by GitHub
In other words, for each document that was loaded by another during indexing, we keep a list of the referencing documents.
Let’s say that we update document vehicles/200. That would be written to the storage with a new etag, and the index would wake up. It would ask to get all the documents in the Vehicles collection after etag 456, get vehicles/200 and then check the ReferencedBy and find that the document tickets/100 loaded it. At this point, it will re-index tickets/100 to ensure we have the latest values.
There is quite a bit more to this process, of course, I’m skipping on a lot of optimizations and detail work. For the purpose of this post, we don’t need any of that.
A customer reported that (very rarely), an index similar to the one above would “miss” on updates. That should not be possible. As much as I love this feature, conceptually, it is a very simple one, there isn’t much here that can fail. And yet, it did. Figuring out what was happening required us to look very deeply into the exact series of steps that were taken to produce this output. It turns out that our approach had a hole in it.
We assume that the writes would always happen in an orderly fashion. In other words, that the writes would be consistent. But there is no actual requirement for that.
Consider what happens if I write just the ticket document to the database:
RavenDB will index the ticket document
It will attempt to load the associated vehicle, figure out that there is no such document and move on
The related user document, of course, is not known at this point (since there is no vehicle document)
The end result is that we have the following data internally:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
{
"Referenced": {
"vehicles/200": ["tickets/100"],
"vehicles/19": ["tickets/20"],
"users/99": ["tickets/20"]
},
"Tickets": 112,
"Vehicles": 442
}
view raw
metadata.index-1.js
hosted with ❤ by GitHub
That is fine, when we’ll add the vehicle and the user, we’ll do the appropriate wiring, no?
In almost all cases, that is exactly what will happen. However, consider the metadata above. We are concerned here with tickets/100, but there is also tickets/20, whose references exist properly. So the structure we have right now in terms of reference tracking is:
It’s important to note that the references are always kept from the initial 'tickets' document. So even though the path from tickets/20 to users/99 goes through vehicles/19, the relationship is a direct association.
What will happen if we’ll insert just the users/300 document now? Well, there is no reference to this document, so we’ve no reason to do anything with it. But that isn’t a problem. When vehicles/200 is inserted, this will be fixed.
On the other hand, if we add just vehicles/200 to the database (with users/300 not being present), that is a change in a tracked document, which will cause us to index the referencing document (tickets/100) again and move us to this state:
When we will then add users/300, document tickets/100 will have the record of this reference and we’ll re-index it.
In other words, we are covered on both sides. Except, that there is still this pesky (and impossible) problem that the user is seeing.
Now, consider the following state of affairs, we are back in the initial state, both vehicles/200 and users/300 are missing in the database and tickets/20, vehicles/19 and users/99 are there.
We add vehicles/200 to the database, and there is a re-indexing process going on. At the same time that we re-index tickets/100 because of the new vehicles/200 document, we are adding the users/300 document in a separate transaction.
That means that during the indexing of tickers/100, we’ll see document vehicles/200 but not the users/300 document (even though it exists).
That is still not a problem, we’ll write the referencing record and on the next batch, detect that we have a user that we haven’t seen and re-index the document again.
Except… what if we didn’t update just the users/300 document in this case, what if we also updated users/99 at the same transaction (and after we insert document users/300).
Depending on the exact timings, we may end up missing document users/300 (because there was no reference to it at the time) but will notice that document users/99 was updated (we already had it referenced). Since users/99 was modified after users/300, we’ll record that we observed all the changes in the Users collection before users/99. That, crucially, also includes the users/300 that we never noticed.
This is confusing, I’ll freely admit. In order to reproduce this bug you need a non-standard pattern for creating references, a chain of at least two references, multiple independent references with different states, and an unlucky draw from Murphy with the exact timing of transactions, indexing and order of operations.
The root cause was that we recorded the newly added document reference in memory, and only updated them when the entire indexing batch was completed. During that time, there may have been multiple transactions that modified the documents. But because we didn’t sync the state until the end of the batch, we would end up missing this case. Solving the problem once we knew what was going on involved moving a single line of code from the outer loop to an inner one, basically.
Writing a reproducible test case was actually far harder, since so many things had to fall just so this would happen. I have to admit that I don’t have any strong conclusions about this bug. It isn’t something systematic or an issue that we missed. It is a sequence of unfortunate events with a very low probability of occurring that we never actually considered.
The really good thing about this issue is that it is the first one in this particular area of the code in quite some time. That means that this has been quite stable for many scenarios.
I mentioned earlier that B+Trees are a gnarly beast to implement properly. On the face of it, this is a really strange statement, because they are a pretty simple data structure. What is so complex about the implementation? You have a fixed size page, you add to it until it is full, then you split the page, and you are done. What’s the hassle?
Here is a simple scenario for page splits, the following page is completely full. We cannot fit another entry there:
Now, if we try to add another item to the tree, we’ll need to split the page, and the result will be something like this (we add an entry with a key: users/050):
How did we split the page? The code for that is really simple:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
def splitPage(source, dest):
# source has >= 2 entries
# dest is empty
mid = len(source.entries) / 2
# iterate over the later half of entries and
# move them in order to the new page
for pos in range(mid, len(source.entries)):
dest.put( source.get(pos) )
source.trimEntriesTo(mid)
view raw
split.py
hosted with ❤ by GitHub
As you can see, since the data is sorted, we can simply take the last half of the entries from the source, copy them to the new page and call it a day. This is simple, effective, and will usually work just fine. The key word here is usually.
Given a B+Tree that uses variable size keys, with a page size of 4KB and a maximum size of 1 KB for the keys. On the face of it, this looks like a pretty good setup. If we split the page, we can be sure that we’ll have enough space to accommodate any valid key, right? Well, just as long as the data distribution makes sense. It often does not. Let’s talk about a concrete scenario, shall we? We store in the B+Tree a list of public keys.
This looks like the image below, where we have a single page with 16 entries and 3,938 bytes in use, and 158 bytes that are free. Take a look at the data for a moment, and you’ll notice some interesting patterns.
The data is divided into two distinct types, EdDSA keys and RSA keys. Because they are prefixed with their type, all the EdDSA keys are first on the page, and the RSA keys are last. There is a big size difference between the two types of keys. And that turns out to be a real problem for us.
Consider what will happen when we want to insert a new key to this page. We still have room to a few more EdDSA keys, so that isn’t really that interesting, but what happens when we want to insert a new RSA key? There is not enough room here, so we split the page. Using the algorithm above, we get the following tree structure post split:
Remember, we need to add an RSA key, so we are now going to go to the bottom right page and try to add the value. But there is not enough room to add a bit more than 512 bytes to the page, is there?
What happens next depends on the exact implementation. It is possible that you’ll get an error, or another split, or the tree will attempt to proceed and do something completely bizarre.
The key here (pun intended) is that even though the situation looks relatively simple, a perfectly reasonable choice can hide a pretty subtle bug for a very long time. It is only when you hit the exact problematic circumstances that you’ll run into problems.
This has been a fairly simple problem, but there are many such edge cases that may be hiding in the weeds of B+Tree implementations. that is one of the reasons that working with production data is such a big issue. Real world data is messy, it has unpredictable patterns and stuff that you’ll likely never think of. It is also the best way I have found to smoke out those details.
We have a lot of tests for RavenDB, and we are running them on plenty of environments. We semi frequently get a build failure when running on the “macOS latest” runner on GitHub.
The problem is that the information that I have is self-contradicting. Here is the most relevant piece:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
byte* address = allocation.Address;
byte* current = _ptrCurrent;
Debug.Assert(address != current,
$"address != current ({new IntPtr(address)} != {new IntPtr(current)} [{nameof(_ptrCurrent)} = {new IntPtr(_ptrCurrent)}])");
// On MacOS, this sometimes fail with:
// Method Debug.Fail failed with 'address != current (140409565380608 != 140409565380608 [_ptrCurrent = 140409565380608])
view raw
wierd.cs
hosted with ❤ by GitHub
Here you can see the failure itself and what is causing it.
Note that the debug message is showing that all three variables here have the same numeric value. The address and the current variables are also held on the stack, so there is no option for race conditions, or something like that.
I can’t figure out any reason why this would be triggered, in this case. About the only thing that pops to mind is whether there is some weirdness going on with pointer comparisons on MacOS, but I don’t have a lead to follow.
We haven’t investigated it properly yet, I thought to throw this to the blog and see if you have any idea what may be going on here.
I love B+Trees, but they can be gnarly beasts, with the number of edge cases that you can run into. Today’s story is about a known difficult place, page splitting in the tree. Consider the following B+Tree, showing a three-level tree with 3 elements on each page.
Consider what will happen when we want to insert a new value to the tree, the value: 27. Given the current state of the tree, that should go on the page marked in red:
But there is no place for the new value on this page, so we have to split it. The tree will then look like so, we split the page and now we need to add the new page to the parent, but that one also doesn’t have room for it:
So we are now in a multi-level split process. Let’s see what this looks like when we go up the tree. This is the final state of the tree when we are done doing all the splits:
The reason for all of this is that we need to add 27 to the tree, and we haven’t done that yet. At this stage, we got the tree back in order and we can safely add the new value to the tree, since we made sure we have enough space.
However, note that the exact same process would apply if we were adding 27 or 29. The page that we’ll add them to, however, is different.
This can be quite complex to keep track of, because of the recursive nature of the process. In code, this looks something like this:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
def add(self, key, value):
page = self.findPageFor(key)
self.addToPage(key, value)
def addToPage(self, key, value):
if self.cursor[self.pos].add(key, value):
return
# the current page retains the bottom half of entries
# the newPage gets the top half of entries
if self.pos == 0: # root page
self.insertNewRootPage() # afterward, pos == 1
newPage = self.cursor[self.pos].split()
self.popPage() # now adding reference to child page
self.addToPage(newPage.entries[0], newPage.pageNum)
pos = self.searchPage(key) # find the relevant page for the entry
self.pushPage(self.cursor[self.pos].entries[pos]) # go into the child page
self.addToPage(key, value) # add in the right location
view raw
add.py
hosted with ❤ by GitHub
I am skipping on some details, but that is the gist of it. So we do the split (recursively if needed) and then after we wired the parent page properly, we find the right location for the new value.
An important aspect here is the cursor. We use that to mark our current location in the tree, so the cursor will always contain all the parent pages that we are currently searching upon. A lot of the work that we are doing in the tree is related to the cursor.
Now, look at the code and consider the behavior of this code when we insert the value 29. It will correctly generate this page:
However.. what happens if we’ll insert 27?
Well, when we split the page, we went up the tree. And then we had another split, and then we went down another branch. So as written, the result would be adding the 27 to the same page as we would the 29. This would look like this:
Look at the red markers. We put entry 27 on the wrong page.
Fixing this issue is actually pretty hard, because we need to keep track of the values as we go up and down the tree. For fun, imagine what happens in this exact scenario, but when you have 6 levels in the tree and you end up in a completely different location in the tree.
I spent a lot of time struggling with this issue, including getting help from some pretty talented people, and the final conclusion we got was “it’s complicated”.
I don’t want complications here, I need it to be as simple as possible, otherwise, we can’t make any sort of sense here. I kept spinning more and more complex systems to resolve this, when I realized that I just looked at the problem in the wrong manner all along.
The issue was that I was trying to add the new value to the tree after I sorted out the structure of the tree, but there was actually nothing that forced me to do that. Given that I already split the page at this stage, I know that I have sufficient space to add the key without doing anything else. I can first add the key to the right page, then write the split page back to the tree. In this case, I don’t need to do any sort of backtracking or state management .
Here is what this looks like:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
def add(self, key, value):
page = self.findPageFor(key)
self.addToPage(key, value)
def addToPage(self, key, value):
if self.cursor[self.pos].add(key, value):
return
# the current page retains the bottom half of entries
# the newPage gets the top half of entries
if self.pos == 0: # root page
self.insertNewRootPage() # afterward, pos == 1
newPage = self.cursor[self.pos].split()
if newPage.entries[0] >= key: # figure out what page this goes into
newPage.add(key, value)
else:
page.add(key, value)
self.popPage() # now adding reference to child page
self.addToPage(newPage.entries[0], newPage.pageNum)
view raw
add2.py
hosted with ❤ by GitHub
And with this change, the entire class of problems related to the tree structure just went away.
I’m very happy with this result, even if it is a bit ironic. Like the problem at hand, a lot of the complexity was there because I had to backtrack the implementation decisions and go on a new path to solve this.
Also, I just checked, the portion that controls page splits inside Voron has had roughly 1 change a year for the past 5 years. Given our scope and usage, that means that it has been incredibly stable in the face of everything that we could throw at it.
Double-writes are a common problem when building an application. This occurs when the same file is written at least twice during a build. This could be because 2 different projects are outputting the same files, so you are doing unneeded work. Thus, the build is slower than it needs to be. Also, it
Earlier this year I talked about the reasoning behind the RavenDB team spending so much time building self monitoring diagnostics. Today I’m happy to announce that RavenDB has another such feature, allowing you to see, in real time, exactly what is going on in your database in terms of I/O.
You can see what this looks like in this image, RavenDB is showing you key metrics in terms of I/O utilization.
You can get the same metrics from your disk directly, and you have similar dashboards on all cloud infrastructure. For that matter, RavenDB Cloud also gives you access to this information. So why build that directly into RavenDB?
The answer is simple, putting that directly into RavenDB means that it is accessible. We don’t assume that a user has access / ability to see such data. For example, you may have access to the RavenDB instance, but not to the cloud account that is running it. Or you may not have sufficient permissions to view metrics data.
In many cases, even if they have access, they don’t know (or it wouldn’t occur to them to look at that). By reducing the amount of hassle you have to go through, we can make those metrics more accessible.
Remember, you probably aren’t looking for those numbers for fun. You need to troubleshoot some issue, and being able to directly see what is going on is key to quickly resolving a problem. For example, if you are seeing the disk queue length spiking a lot, you know that you are spending all of your I/O budget.
Just knowing that will let you direct your investigation in the right direction a lot sooner.