Relatively General .NET

Managing RavenDB indexes in production, a DevOps guide

by Oren Eini

posted on: April 04, 2022

RavenDB has the ability to analyze your queries and generate the appropriate indexes for you automatically. This isn’t a feature you need to enable or a toggle to switch; it is just the way it works by default. For more advanced scenarios, you have the ability to write your own indexes to process your data in all sorts of interesting ways. Indexes in RavenDB are used for aggregation (map-reduce), full text search, spatial queries, background computation and much more.

This post isn’t going to talk about what you can do with RavenDB’s indexes, however. I’m going to discuss how you’ll manage them.

There are several ways to create indexes in RavenDB. The one that we usually recommend is to create a class that inherits from AbstractIndexCreationTask. If you are using C# or TypeScript, you can create strongly typed indexes that will be checked by the compiler for you. If you are using other clients (or JS indexes), you will have the index definition as constant strings inside a dedicated class. Once you have the indexes defined as part of your codebase, you can then create them using a single command: IndexCreation.CreateIndexes();

What I described so far is the mechanics of working with indexes. You can read all about them in the documentation. I want to talk about the implications of this design approach:

Your indexes live in the same repository as your code. Whenever you check out a branch, the index definitions you’ll use will always match the code that queries them.

Your indexes are strongly typed and are checked by the compiler. I mentioned this earlier, but this is a huge advantage, worth mentioning twice.

You can track changes on your indexes using traditional source control tools. That makes reviewing index changes just a standard part of the job, instead of something you need to do in addition.

RavenDB has a lot of features around index management: side by side index deployment, rolling indexes, etc. The question now is: when do you deploy those indexes?
During development, it’s standard to deploy your indexes whenever the application starts. This way, you can change your indexes, hit F5 and you are immediately working on the latest index definition without having to take any other action.

For production, however, we don’t recommend taking this approach. Two versions of the application using different index definitions would “fight” to apply the “right” version of the index, causing version bounce, for example. RavenDB has features such as index locking, but those are to save you from a fall, not for day to day activity. You should have a dedicated endpoint / tool that you can invoke that would deploy your indexes from your code to your RavenDB instances. The question is, what should that look like?

Before I answer this question, I want to discuss another aspect of indexing in RavenDB: automatic indexing. So far, we discussed static indexes, ones that you define in your code manually. But RavenDB also allows you to run queries without specifying which index they will use. At this point, the query optimizer will generate the right indexes for your needs. This is an excellent feature, but how does that play in production?

If you deploy a new version of your application, it will likely have new ways of querying the database. If you just push that to production blindly, RavenDB will adjust quickly enough, but it will still need to learn all the new ways you query your data. That can take some time, and will likely cause a higher load on the system.

Instead of doing all the learning and adjusting in production, there are better ways to do so. Run the new version of your system on a QA / UAT instance and put it through its paces. The QA instance will have the newest static indexes and RavenDB will learn what sort of queries you are issuing and what indexes it needs to run. Once you have completed this work, you can export the indexes from the QA instance and import them into production.
Let the new indexes run and process all their data, then you can push the new version of your application out. The production database is already aware of the new behavior and has adjusted to it.

As a final note, RavenDB index deployment is idempotent. That means that you can deploy the same set of indexes twice, and the second deployment will not cause any re-indexing. That reduces the operational overhead that you have to worry about.
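
To make the idempotency point concrete, here is a toy in-memory model written in JavaScript. It is not how RavenDB implements deployment, and every name in it (FakeDatabase, deploy, the two index definitions) is a hypothetical stand-in. It only shows the behavior described above: deploying an unchanged set of definitions a second time starts no new indexing work.

```javascript
// Toy in-memory model of idempotent index deployment.
// This is NOT RavenDB's implementation; every name here is hypothetical.
const crypto = require("crypto");

class FakeDatabase {
  constructor() {
    this.indexes = new Map(); // index name -> hash of the deployed definition
    this.reindexCount = 0;    // how many times indexing work was started
  }

  // Deploys the definitions and returns the names that actually changed.
  deploy(definitions) {
    const changed = [];
    for (const { name, map } of definitions) {
      const hash = crypto.createHash("sha256").update(map).digest("hex");
      if (this.indexes.get(name) === hash) continue; // identical: nothing to do
      this.indexes.set(name, hash);
      this.reindexCount++;
      changed.push(name);
    }
    return changed;
  }
}

const definitions = [
  { name: "Orders/ByCompany", map: "from o in docs.Orders select new { o.Company }" },
  { name: "Cars/ByPlate", map: "from c in docs.Cars select new { c.LicensePlate }" },
];

const db = new FakeDatabase();
const first = db.deploy(definitions);  // both indexes are new
const second = db.deploy(definitions); // same definitions: no work at all
console.log(first, second, db.reindexCount);
// [ 'Orders/ByCompany', 'Cars/ByPlate' ] [] 2
```

A deployment tool built on this property can run on every release without first checking what changed, because unchanged definitions cost nothing.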

Forcing HttpClient to use IPv4 or IPv6 addresses

by Gérald Barré

posted on: April 04, 2022

Recently, I had a problem with an application where connecting to a server using IPv6 was much slower than IPv4 (more than 2 seconds to open the socket). I'm not sure why, but forcing the client to use IPv4 addresses solved the problem. I think the server (localhost in my case) is not listening on

Converting code to the new Regex Source Generator

by Gérald Barré

posted on: March 28, 2022

.NET 7 brings a new feature for Regex. Indeed, it allows generating the source code of a regular expression at compile time using a Roslyn Source Generator. Generating the source code at compile-time instead of runtime has multiple advantages: The first regex execution is faster. Indeed, you don't n

Using RavenDB for data aggregation from dynamic sources

by Oren Eini

posted on: March 24, 2022

I got an interesting question from a customer recently and thought that it would make for a fun blog post. The issue the customer is facing is that they are aggregating data from many sources, and they need to make sense of all the data in a nice manner. For some of the sources, they have some idea about the data, but in many cases, the format of the data they get is pretty arbitrary. Consider the image on the right: we have four different documents, from separate sources:

titles/123-45-678/2022-01-28 – The car title
tickets/0000000000000000009-A – A parking ticket that was issued for a car
orders/0000000000000000010-A – An order from a garage about fixes made for a customer (which includes some cars)
claims/0000000000000000011-A – Details of a rejected insurance claim for a car

We need to make sense of all of this information and provide some information to the user about a car from all those sources. The system in question is primarily interested in cars, so what I would like to do is show a “car file”: all the information at hand that we have for a particular car. The problem is that this is not trivial to do. In some cases, we have a field with the car’s license plate, but each data source named it differently. In the case of the Order document, the details about the specific service for the car are deep inside the document, in a free form text field.

I can, of course, just index the whole thing and try to do a full text search on the data. It would work, but can we do better than that? A license plate in the system has the following format: 123-45-768. Can we take advantage of that? If you said regex, you now have two problems :-). Let’s see what we can do about this…

One way to handle this is to create a multi map-reduce index inside of RavenDB, mapping the relevant items from each collection and then aggregating the values by the car’s license plate from all sources.
The problem with this approach is that you’ll need to specialize for each and every data source you have. Sometimes, you know what the data is going to look like and can get valuable insight from that, but in other cases, we are dealing with whatever the data providers will give us…

For that reason, I created the following index, which uses a couple of neat techniques all at once to give me insight into the data that I have in the system, without taking too much time or complexity.

    // map
    map("@all_docs", d => {
        var ids = [id(d)];
        var occurrence = {};
        occurrence[d["@metadata"]["@collection"]] = 1;
        var results = scanLicensePlates(d, {});
        return Object.keys(results).map(match => ({
            Collections: occurrence,
            Ids: ids,
            LicensePlate: match
        }));
    })

    const licensePlateRegex = /\d{3}-\d{2}-\d{3}/g;

    function scanLicensePlates(d, results){
        var vals = Object.values(d);
        for(var i = 0; i < vals.length; i++){
            var v = vals[i];
            switch(typeof v){
                case "string":
                    var it = v.matchAll(licensePlateRegex);
                    while(true){
                        var cur = it.next();
                        if(cur.done)
                            break;
                        results[cur.value[0]] = null;
                    }
                    break;
                case "object": // array / object, recursive
                    if(v !== null) // typeof null is also "object"
                        scanLicensePlates(v, results);
                    break;
            }
        }
        return results;
    }

    // reduce
    groupBy(x => x.LicensePlate)
        .aggregate(g => {
            // sum the per-collection counts
            var collections = g.values.reduce((acc, cur) => {
                for(var k in cur.Collections){
                    acc[k] = (acc[k] || 0) + cur.Collections[k];
                }
                return acc;
            }, {});
            // merge the document ids, dropping duplicates
            var ids = g.values.reduce((acc, cur) => {
                cur.Ids.forEach(k => acc[k] = null);
                return acc;
            }, {});
            return {
                LicensePlate: g.key,
                Collections: collections,
                Ids: Object.keys(ids)
            };
        });

This looks like a lot of code, I know, but the most complex part is in the scanLicensePlates() portion.
There we define a regex for the license plate and scan the documents recursively trying to find a proper match. The idea is we’ll find a license plate in either the field directly (such as Title.LicensePlate) or part of the field contents (such as the Orders.Lines.Task field). Regardless of where we find the data, in the map phase we’ll emit a separate value for each detected license plate in the document. We’ll then aggregate by the license plate in the reduce phase. Part of the complexity here is because we are building a smart summary. Here is the output of this index:

    {
        "LicensePlate": "123-45-678",
        "Collections": {
            "Titles": 1,
            "Tickets": 1,
            "Orders": 2,
            "Claims": 1
        },
        "Ids": [
            "titles/123-45-678/2022-01-28",
            "tickets/0000000000000000009-A",
            "orders/0000000000000000010-A",
            "claims/0000000000000000011-A",
            "Orders/0000000000000000012-A"
        ],
        "@metadata": {
            "@change-vector": null,
            "@index-score": 1
        }
    }

As you can see, the map-reduce index results will give us the following data items:

The license plate, obviously (which is how we’ll typically search this index).

The summary for all the data items that we have for this particular license plate. That will likely be something that we’ll want to show to the user.

The ids of all the documents related to this license plate, which we’ll typically want to show to the user.

The nice thing about this approach is that we are able to extract actionable information from the system with very little overhead. If we have new types of data sources that we get in the future, they’ll seamlessly merge into the final output for this index.
Of course, if you know more about the data you are working with, you can probably extract more interesting information. For example, we may want to show the current owner of the car, which we can extract from the latest title document we have. Or we may want to compute how many previous owners a particular vehicle has, etc. As the first step to aggregate information from dynamic data sources, that gives us a lot of power. You can apply this technique in a wide variety of scenarios. If you are finding yourself doing coarse grained searches and trying to regex your way to the right data, this sort of approach can drastically improve your performance and make it far easier to build a system that can be maintained over the long run.
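
To see how the map and reduce phases combine without a RavenDB server at hand, the same idea can be sketched as plain JavaScript that runs in Node.js. This is a simplified walk-through, not the index itself: the three sample documents, their field names and the helper functions map() and reduce() are made up for illustration, and the collection name is passed in directly where the index reads it from the document metadata.

```javascript
// In-memory walk-through of the map and reduce steps, runnable in Node.js.
// The sample documents and their field names are made up for illustration.
const licensePlateRegex = /\d{3}-\d{2}-\d{3}/g;

// Recursively collect every license plate found in any string value.
function scanLicensePlates(value, found) {
  if (typeof value === "string") {
    for (const match of value.matchAll(licensePlateRegex)) found.add(match[0]);
  } else if (value !== null && typeof value === "object") {
    for (const child of Object.values(value)) scanLicensePlates(child, found);
  }
  return found;
}

// Map: one entry per (document, license plate) pair.
function map(doc) {
  const plates = scanLicensePlates(doc.body, new Set());
  return [...plates].map(plate => ({
    LicensePlate: plate,
    Collections: { [doc.collection]: 1 },
    Ids: [doc.id],
  }));
}

// Reduce: group by plate, sum the per-collection counts, merge the ids.
function reduce(entries) {
  const byPlate = new Map();
  for (const e of entries) {
    const acc = byPlate.get(e.LicensePlate) ||
      { LicensePlate: e.LicensePlate, Collections: {}, Ids: [] };
    for (const [name, count] of Object.entries(e.Collections)) {
      acc.Collections[name] = (acc.Collections[name] || 0) + count;
    }
    for (const id of e.Ids) if (!acc.Ids.includes(id)) acc.Ids.push(id);
    byPlate.set(e.LicensePlate, acc);
  }
  return [...byPlate.values()];
}

const docs = [
  { id: "titles/1", collection: "Titles", body: { LicensePlate: "123-45-678" } },
  { id: "orders/1", collection: "Orders",
    body: { Lines: [{ Task: "Replaced brakes on 123-45-678 and 987-65-432" }] } },
  { id: "tickets/1", collection: "Tickets", body: { Car: "987-65-432", Fine: null } },
];

const results = reduce(docs.flatMap(map));
console.log(JSON.stringify(results, null, 2));
```

The order document mentions two cars in a free form text field, so it shows up in both car files, while the differently named fields in the title and the ticket need no special handling.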

A tale of eventually consistent ACID model

by Oren Eini

posted on: March 23, 2022

I recently had a conversation about ACID, and I don’t think it would surprise anyone that I’m a big proponent of ACID. After all, RavenDB was an ACID database from the first release. When working with distributed systems, on the other hand, it is far harder to get ACID guarantees at a reasonable cost. Pretty much all the 1st generation NoSQL databases left ACID on the sidelines, because it is a hard problem. That was one of the primary reasons why RavenDB even exists. I couldn’t imagine living without transactions. This is a post from 2011, talking about just that topic.

Consistency in a distributed system is a hard problem, mostly because it has an impact on the design and performance of the system. It is also common to think about ACID as a binary property, which is sort of true (A for Atomic). However, it turns out that the real world is a lot more nuanced than that. I want to discuss the consistency model for RavenDB as it applies to running in a distributed cluster. It is ACID with eventual consistency, which doesn’t sound like it makes sense, right?

I found a good example to explain the importance of ACID operations from your database even in the presence of eventual consistency. Consider the following scenario: we have a married couple with a shared bank account. Both husband and wife have a checkbook for the account and primarily use checks to pay for things in their day to day life. Checks are anachronistic for some people, who are used to instant payments and wire transfers. The great thing about checks is that they are (by definition) a way to work in a distributed system. You hand someone a check and at some future point in time they will deposit that and get the money from your account.

One of the most important aspects of using checks was managing that delay. The amount of money you had in the account didn’t necessarily represent how much money you had available.
If your rent check wasn’t deposited yet, you still had to consider the rent money “gone”, even if you could still see it in the bank statement. Because of checks’ eventual consistency, a really important part of using checks was to keep track of all the outstanding checks that weren’t deposited yet. You did that by filling in the stub of the check in the checkbook whenever you wrote a check. In other words, you never gave a check before you properly filled the stub for that.

That brings us back to ACID. The act of filling the stub and writing the check is a transaction, composed of two separate actions. That action isn’t a global transaction. The husband and wife in our example do not have to coordinate with one another whenever they write a check. But they do need to ensure that no check would be handed off without a proper stub (and vice versa, if we want to be exact). If the act of writing a check and filling the stub isn’t atomic, you may have a check unexpectedly hit your account, which is… exciting (in the Chinese proverb manner).

On the other side, the entity that you handed the check to also needs a transaction. They need to fill out an invoice for the check (even though it hasn’t been deposited yet). Having a check with no invoice or an invoice with no check is… bad (as in, IRS agents having shots and high fives during an audit).

The idea is that at the local level, you have to use transactions; otherwise, you cannot be sure about the consistency of your own data. If you don’t have transactions at the persistence layer, you’ll have to build them on top of that, which is… not ideal, really hard and usually not going to work in all cases. With local transactions, you can then start pushing consistent data out and resolve all the distributed states you have.
Going back to our husband and wife example, for the most part, they can act completely independently of one another, and they’ll reconcile their account status with each other at a later date (weekly budget meeting). At the same time, there are certain transactions (pun intended) where they won’t act independently. A great example is buying a car: that sort of amount requires that both be consulted on the purchase. That is a high value operation, so it is worth the additional cost of distributed consistency.

With RavenDB, we have the notion of local node transactions, which are then sent out to the rest of the nodes in the cluster in the background (async replication), as well as support for cluster wide transactions, which require the consent of a majority of the nodes in the cluster. You can choose for each scenario exactly what level of transactions and consistency you want to have, local or global.
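
The checkbook analogy can be modeled in a few lines of JavaScript. This is only a toy sketch of the reasoning above, not RavenDB code, and the Checkbook class, writeCheck() and availableBalance() are names invented for the illustration. Writing a check and filling in its stub happen as one local operation, each spouse works against their own checkbook without coordination, and reconciliation later subtracts every outstanding stub from the bank balance.

```javascript
// Toy model of the checkbook analogy; all names are illustrative.
class Checkbook {
  constructor(owner) {
    this.owner = owner;
    this.stubs = []; // local record of the checks this person has written
  }

  // Local transaction: the stub and the check are produced together or not at all.
  writeCheck(payee, amount) {
    if (!(amount > 0)) throw new Error("amount must be positive"); // nothing recorded
    const check = { from: this.owner, payee, amount };
    this.stubs.push({ payee, amount, deposited: false });
    return check;
  }
}

// Reconciliation: the money really available once every outstanding stub is counted.
function availableBalance(bankBalance, checkbooks) {
  const outstanding = checkbooks
    .flatMap(book => book.stubs)
    .filter(stub => !stub.deposited)
    .reduce((sum, stub) => sum + stub.amount, 0);
  return bankBalance - outstanding;
}

const husband = new Checkbook("husband");
const wife = new Checkbook("wife");
husband.writeCheck("landlord", 1200); // no coordination with the other checkbook
wife.writeCheck("grocer", 150);

const available = availableBalance(5000, [husband, wife]);
console.log(available); // 3650, although the bank statement still shows 5000
```

If a check could leave the checkbook without its stub, the reconciliation step would have no way to arrive at the right number, which is why the local operation has to be atomic even though the overall system is only eventually consistent.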

Using RavenDB from Serverless applications

by Oren Eini

posted on: March 22, 2022

I got a great question about using RavenDB from Serverless applications:

“DocumentStore should be created as a singleton. For Serverless applications, there are no long-running instances that would be able to satisfy this condition. DocumentStore is listed as being ‘heavy weight’, which would likely cause issues every time a new instance is created.”

RavenDB’s documentation explicitly calls out that the DocumentStore should be a singleton in your application:

“We recommend that your Document Store implement the Singleton Pattern as demonstrated in the example code below. Creating more than one Document Store may be resource intensive, and one instance is sufficient for most use cases.”

But the point of Serverless systems is that there is no such thing as a long running instance. As the question points out, that is likely going to cause problems, no? On the one hand we have RavenDB’s DocumentStore, which is optimized for long running processes, and on the other hand we have Serverless systems, which focus on minimal invocations. Is this really a problem?

The answer is that there is no real contradiction between those two desires, because while the Serverless model is all about a single function invocation, the actual manner in which it runs means that there exists a backing process that is reused between invocations. Taking AWS Lambda as an example, you can define a function that will be invoked for SQS (Simple Queue Service). The signature for the function will look something like this: async Task HandleSQSEvent(SQSEvent sqsEvent, ILambdaContext context);

The Serverless infrastructure will invoke this function for messages arriving on the SQS queue. Depending on its settings, the load and defined policies, the Serverless infrastructure may invoke many parallel instances of this function. What is important about Serverless infrastructure is that a single function instance will be reused to process multiple events.
It is the Serverless infrastructure's responsibility to decide how many instances it will spawn, but it will usually not spawn a separate instance per message in the queue. It will let an instance handle the messages and spawn more as they are needed to handle the ingest load. I’m using SQS as an example here, but the same applies for handling HTTP requests, S3 events, etc.

Note that this is relevant for AWS Lambda, Azure Functions, GCP Cloud Functions, etc. A single instance is reused across multiple invocations. This ensures far more efficient processing (you avoid startup costs) and can make use of caching patterns and common optimizations.

When it comes to RavenDB usage, the same thing applies. We need to make sure that we won’t be creating a separate DocumentStore for each invocation, but once per instance. Here is a simplified example of how you can do this:

    using Amazon.Lambda.Core;
    using Amazon.Lambda.RuntimeSupport;
    using Amazon.Lambda.Serialization.SystemTextJson;
    using Amazon.Lambda.SQSEvents;
    using Raven.Client.Documents;
    using System.Security.Cryptography.X509Certificates;

    // Created once per function instance and reused by every invocation
    using var documentStore = new DocumentStore
    {
        Urls = new string[] { /* urls */ },
        Certificate = new X509Certificate2(/*path*/)
    };
    documentStore.Initialize();

    var handler = async (SQSEvent sqsEvent, ILambdaContext context) =>
    {
        // Sessions are cheap, so open one per invocation
        using var session = documentStore.OpenAsyncSession();
        foreach (var record in sqsEvent.Records)
        {
            // use the session to process the records
        }
    };

    await LambdaBootstrapBuilder.Create(handler, new DefaultLambdaJsonSerializer())
        .Build()
        .RunAsync();

We define the DocumentStore when we initialize the instance, then we reuse the same DocumentStore for each invocation of the lambda (the handler code). We can now satisfy both RavenDB’s desire to use a singleton DocumentStore for best performance and the Serverless programming model that abstracts how we actually run the code, without really needing to think about it.
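
The same once-per-instance pattern can be shown in JavaScript with a stand-in store, which makes it possible to count how many stores get created. This is a sketch under stated assumptions: createStore() and the session object are hypothetical placeholders for building and initializing a real DocumentStore, and simulateInvocations() stands in for the Serverless infrastructure calling the handler repeatedly on one instance.

```javascript
// Sketch of the "one store per instance" pattern, using a stand-in store.
// createStore() is hypothetical; a real application would build and
// initialize its DocumentStore there.
let storesCreated = 0;

function createStore() {
  storesCreated++; // stands in for the expensive setup work
  return {
    openSession() {
      return { processed: [], store(record) { this.processed.push(record); } };
    },
  };
}

// Module scope: evaluated once per function instance, not once per invocation.
const documentStore = createStore();

// Handler: runs for every invocation and only opens a cheap session.
async function handler(event) {
  const session = documentStore.openSession();
  for (const record of event.Records) session.store(record);
  return session.processed.length;
}

// Stand-in for the infrastructure reusing one instance for several events.
async function simulateInvocations() {
  const counts = [];
  for (const event of [{ Records: ["a", "b"] }, { Records: ["c"] }, { Records: [] }]) {
    counts.push(await handler(event));
  }
  return counts;
}

const invocationCounts = simulateInvocations();
invocationCounts.then(counts => console.log(counts, storesCreated));
// [ 2, 1, 0 ] 1
```

Three invocations ran, but the store was created only once, which is the property the singleton recommendation is after.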