There is a great article discussing how SQLite is handling transactions at fly.io. Which led me to the great documentation on the WAL mode for SQLite. And that led me to think about the differences between how SQLite does it and how Voron does it.
Both SQLite and Voron share asame behavior, they use Copy on Write and make the modifications for the pages in the database on a copy of the data. That means that readers can continue to operate with zero overhead, reading the stable snapshot of their data.
However, SQLite works by copying the data to the WAL file directly and modifying it there. Voron doesn’t use this approach. Instead, we have the notion of scratch space where this is done. Look at the figure below, which showcase the difference between the databases:
In SQLite, any modifications are written to the WAL file and operated on there. When you want to commit a transaction in SQLite, you’ll compute the checksum of all the pages modified in the transaction and write a commit record to the disk, at which point you’ll need to issue an fsync() call.
Voron, on the other hand, will copy the data that is modified in the transaction into scratch space (essentially, just some memory we allocated). On commit, it will not write the data to the WAL. Instead, it will take the following actions:
Compute a diff of the current state of the page compared to its initial state, writing only the modifications.
Compress the resulting output.
Compute a checksum of all the pages that were modified.
Write the compressed output and the checksum as a single write call.
Voron opens the file with O_DIRECT | O_DSYNC, the idea is that we don’t need to call fsync() at any stage, and we significantly reduce the number of system calls we have to make to commit a transaction.
Other transactions, at the same time, will use the data in the scratch space, not the WAL, to read the current state of pages. Voron also supports MVCC, so you may have multiple copies of the data in memory at once (for different transactions).
Voron is able to significantly reduce the total amount of I/O we have to use for writes, because we only write the changes in the data between page versions and even that is compressed. We typically can safely trade off the additional CPU work in favor of I/O costs and still come up ahead.
Another reason we went with this route is that we use memory mapped files, and on Windows, those aren’t coherent with file I/O calls. That means that mixing reading via mmap() and writing via file I/O (which is what we want to do to avoid fsync() calls) wouldn’t really work. Voron also benefits from not having to deal with multiple processes running at the same time, since it is typically deployed from within a single process.
Finally, the fact that we use scratch spaces separately from the WAL means that we put that somewhere else. You can have a fast empheral disk (common on the cloud) for scratch files, very fast (but small) disk for the WAL journal and standard disk for the actual data. This sort of configuration gives you more choices on how to deal with the physical layout of your data.
The official RavenDB Client for PHP is now out in beta. You can now make use of a rich client to consume RavenDB with all the usual features you would expect.
To start using RavenDB, run:
$ composer require ravendb/ravendb-php-client
And then you can start using RavenDB in your project. Here are some interesting code samples.
Setting up a document store:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
use RavenDB\Documents\DocumentStore;
$store = new DocumentStore(["http://live-test.ravendb.net" ], "Northwind")
$store->initialize();
view raw
init.php
hosted with ❤ by GitHub
Loading a document:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
$session = $store->openSession();
$company = $session->load(Company::class, $companyId);
echo $company->getName() . "\n";
echo "Located at: " . $company->getAddress()->getCountry() . "\n";
view raw
load.php
hosted with ❤ by GitHub
Querying:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
$session = $store->openSession();
$orders = $session->query(Order::class)
->whereEquals("customer", $customerId)
->toList();
echo "Orders for " . $customerId . "\n";
foreach($orders as $order) {
echo $order->getId() . " - " . $order->getTotal() . "$\n";
}
view raw
query.php
hosted with ❤ by GitHub
Pretty much all other capabilities are also available (unit of work, change tracking, automatic failover, and more).
Please give it a whirl, we’ll love to hear about your experience with RavenDB & PHP.
Before executing sensitive actions, such as purchasing something or showing sensitive data, you need to get user consent. Also, you need to ensure the actual user is the one who is executing the action. Windows provides an API to get user consent before executing sensitive actions.First, you need t
If you are using RavenDB for defense projects, we have got good news for you. RavenDB is now available on Iron Bank, making it that much easier to make use of RavenDB in defense or high security projects.
Iron Bank is the DoD repository of digitally signed, binary container images including both Free and Open-Source software (FOSS) and Commercial off-the-shelf (COTS). All artifacts are hardened according to the Container Hardening Guide. Containers accredited in Iron Bank have DoD-wide reciprocity across classifications.
RavenDB has a history of focusing on (usable) security and making sure that your systems are secured by default and in practice. Now it is even easier to make use of RavenDB in projects that are required to meet the DoD standards. Note that you get the exact same codebase and configuration that you’ll get from the usual RavenDB distribution, it has simply been audited and approved by Iron Bank.
A popular extension and later core feature of VS Code, rainbow bracket colorization is now available as a free extension for Visual Studio…Keep Reading →
I love trees. Not the ones that produce oxygen, I mean, I guess they are nice too, but I’m talking about trees in software. To be rather more exact, I’m fascinated by B+Trees and all their permutations. In a very real sense, most databases are all about B+Trees. To give you some context, Voron alone, at this time, has the following:
Tree – the core used to map an arbitrary byte string key to an arbitrary sized value. This supports many operations, including compressed values, overflow pages, etc. Typically used to implement storage indexes over non-numeric data.
Fixed Size Tree – a more compact implementation when more data is known, with int64 key and a fixed size value (which may be zero). Typically used to implement storage indexes over numeric data.
Multi Tree – a tree whose keys are arbitrary byte string keys and whose value is itself a tree (with no value) that can contain multiple values for a single entry. Used to implement non-unique storage indexes.
Compact Tree – a tree for arbitrary byte string keys to an int64 value. The keys are compressed and we can pack a lot of values in a single page. Used to handle the field terms in Corax.
Set – a tree used to efficiently store int64 values in an ordered manner without duplicates. The data is heavily compressed and is used to handle large posting lists in Corax.
And I’m probably forgetting a few. Those are all various permutations on a B+Tree. And together they create a really interesting set of capabilities for Voron. Today I want to talk to you about an issue we ran into when building Corax, in particular, this one:
Before we dive into exactly what is going on here, I feel that I need to give some background. Corax is an indexing engine, and at its core is an inverted index. When we need to delete an entry, we need to remove references to it from all the terms that it is a part of. The code producing the above is implemented roughly like this:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
def process_deletes(self):
for idToDel in self.deletes:
for term in self.getEntry(idToDel):
termTree = self.getTermTree(term)
termTree.delete(idToDel)
self.deleteEntry(idToDel)
view raw
delete.py
hosted with ❤ by GitHub
That particular piece accounts for almost 80% of the indexing commit time (and a significant fraction of the total indexing time). That is… just the way it is, I’m afraid. We have to process things in this manner. Lucene does things differently, it will mark the entry as deleted (and then ignore it afterward) eventually merging the results away. I really didn’t want to go in that route. One of the key design principles for Corax is that we want to avoid deferred payments, we want to pay everything in “cash” right now.
If we look deeper into the DeleteCommit() function, we can see some interesting details:
The DeleteIdFromExactTerm() is the piece that is doing the majority of the work, accounting for over 60% of the runtime for processing deletes. Let’s dig deeper here:
And we can see that this is actually really fast. The total cost for invoking this method is 5.5 μs. The problem is that we invoke it 4 million times.
And looking into the actual costs, what is the issue? It is the TryGetValue() and FindPageFor() calls, those scan through the tree structure to find the relevant place from which to remove the value.
As a reminder, here is a tree structure, (it isn’t a B+Tree, mind, because we don’t need that to understand what is going on). What we have is a lot of calls to Remove() on this tree, and we always need to start searching from the root. It turns out that this is quite expensive:
Consider what happens if I want to remove 15 and 20. I have to compare 25, 14, 19, 16 and15. Then I have to go back to the root and compare 25, 14, 19, 22 and 20. There are a lot of commonalities here that I can take advantage of. If I can operate locally, I can probably do better, no?
We rewrote the delete code to be more like this:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
def record_deletes(self):
delsByTerm = {}
for idToDel in self.deletes:
for term in self.getEntry(idToDel):
if term not in delsByTerm:
delsByTerm[term] = []
delsByTerm[term].append(idToDel)
self.dleteEntry(idToDel)
for term, idsToDel in delsByTerm.items():
idsToDel.sort()
termTree = self.getTermTree(term)
termTree.deleteBatch(idsToDel)
view raw
delete_batch.py
hosted with ❤ by GitHub
The idea is that we do the deletion of terms in two stages. First, we gather all the ids we need to delete for all the terms from all the entries that are being deleted. Then we sort those values, and then we invoke a batch delete method.
Unlike the individual RemoveValue() calls, we can now take advantage of the structure of the tree. In the case of wanting to remove [15,20], we can scan the tree (25,14, 19, 16, 15) to get to the first item that we remove. Then we proceed using the tree’s own structure. So deleting 20 means comparing (16, 19, 22, 20). In this case, we saved one operation, which isn’t that meaningful. But B+Tree’s most beautiful property is that they are dense. In this case, we are removing values from posting lists, which may contain millions of entries, and it isn’t uncommon to be able to pack thousands of entries per page. That means that the savings are huge.
This particular optimization was sufficient to ensure that we just don’t need to care about deletes any longer.
The ability to apply the operation in a batch fashion meant that we could amortize all of the setup costs, which lead to a far greater reaction than I actually expected, to be honest.
The code in question can make a lot of assumptions about what is going on (the input is sorted, values are present in the tree, etc). That makes things so much easier.
As a result, Corax can keep paying all costs upfront and still be so much faster than Lucene (who defers things to the background). To give some context, here are the relevant costs for Lucene:
Lucene spends more time processing deletes than Corax does for the whole commit process . Note that this is just a small part of the costs that deletes have for Lucene, there is an excellent blog post describing those from Elastic. The costs range between 20% – 46% slowdown in queries! That is an important reason to avoid deferring work, I think we can agree.
That actually happened over a month ago, but I’m afraid that July / August hasn’t been kind of either my attention span or the ability to actually keep track of what is happening. September is around the corner (at which point the summer vacation would be over and I’m already looking forward to some blessed routine).
The 5.4 release is a Long Term Support release, meaning that we encourage users to move their applications to use that version.
Note: support for RavenDB 5.3 will terminate on Jan 23 and the 5.2 release (also an LTS) will move to legacy support model a year after the 5.4 release.
This release brings to the table, aside from the usual minor fixes and improvements, a few major new features. The primary one is a favorite of mine, internally named etl2q. In other words, it is a native ETL support in RavenDB to push messages to Kafka and RabbitMQ. I did a whole webinar on the topic, you can watch it here.
We also added support for pulling data directly from RavenDB into Grafana, so you can create great dashboards and visualizations on your data, like so:
You can deploy RavenDB 5.4 instances in the cloud or on premise, and we have some additional goodies hidden under the engine that we are still working on.
Interpolated strings are very common in C#. For instance, you can write $"Hello {name}! You are {age} years old.". This expression is evaluated using the current culture. If you want to use an invariant culture, you can use FormattableString.Invariant($"..."). Starting with .NET 6 and C# 10, you ca
Modeling relationships and hierarchies is a fundamental topic taught in university – for relational databases. Document databases are quite different, and our approach to such modeling should be different as well. So how do we do it?