Relatively General .NET

Production postmortem

by Oren Eini

posted on: April 25, 2022

A customer called us about an elevated I/O load in their system after an upgrade from RavenDB 4.2 to RavenDB 5.3. We looked into that, and we saw a small (but very noticeable) rise that we just could not explain. Those sorts of issues are tough to crack, because there isn’t an error or a smoking gun to get you started. Instead, we just saw a higher average I/O rate, but what is the reason for that? Maybe it is a seasonal change for the customer, with a higher load during the springtime? Or maybe it is related to a new index that was deployed? We looked, but we could not see anything that should cause higher I/O stress for the system.

So we started diving deeper and deeper into the metrics. On Linux, you can check which files are being read or written to. All of those that we could see showed reasonable values for their load; nothing was unexpected. You can also pull the I/O stats by thread, and we could see that the cluster threads were quite busy in terms of I/O. But that is a big cluster, with plenty of databases and cluster operations to manage, so that seemed reasonable. What is going on? I just checked, and the timeline for this investigation is about four weeks. We tried a lot of things to figure it out, but we couldn’t find a smoking gun.

Separately, we got a few bug reports from the field about a cluster issue: sometimes the cluster connection between nodes would break for no reason. The connectivity was good, so there was no reason for the break. This is a transient (and expected) error, which RavenDB will gracefully recover from. But it was a new behavior, so we looked into that. It turns out that during some refactoring, we moved a piece of code in such a way that, under certain conditions, it would read too much from the buffer, but not consume all of it. Basically, this issue came back in some cases.

In order to trigger this problem, we had to have a very specific network configuration, with exact latencies relative to the CPU load on the server. When that behavior was triggered, we would discard some part of the message from the other side. In some cases, that just meant that we skipped an update (in a stream of them). No big deal, we’ll get the next one successfully. But depending on the size of the cluster in question and the latencies involved, we may get corrupted data (since part of the message is missing). We properly detect that and abort the connection in this case.

It turns out that when such a thing happens, RavenDB considers the other side to have failed, and the cluster takes the appropriate action to compensate. That means that it will re-assign the tasks across the cluster. A few seconds later, the connection would be resumed, the cluster would realize that the node is “up” again and move the tasks back to the node. Those tasks include things like subscriptions, ETL processes, external replication, etc. In other words, under a specific set of conditions, we’ll have a lot of jitter (for lack of a better term) in the cluster. Some of the nodes will be moved in and out of rehab (a status that means that they aren’t fully functional). That led, in turn, to a high churn of tasks (and each of those has its own I/O costs). There are other factors here, naturally, such as higher CPU and memory usage, but I/O is where we are typically most constrained, so that is mostly where it showed up.

The bug was fixed (and the fix is in the latest stable) and we have confirmation from the customer that this indeed resolved their issue. It just goes to show how complex systems are. A bug that occurs on node A when reading from the network under specific latency conditions cascaded into higher resource utilization on node C. Butterfly effect indeed.
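To make the "read too much from the buffer, but not consume all of it" failure concrete, here is a minimal sketch. It is not RavenDB's code: the 4-byte length-prefixed framing, the `read_messages` function and the `keep_leftover` flag are all assumptions made up for illustration. The read size stands in for how much data happens to have arrived when the reader runs, which is where network latency and CPU load come into play.

```python
import io
import struct

HEADER = struct.Struct(">I")   # assumed framing: 4-byte big-endian length prefix
MAX_MESSAGE = 1024             # assumed sanity limit used to detect a corrupted stream


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length."""
    return HEADER.pack(len(payload)) + payload


def read_messages(stream, chunk_size, keep_leftover):
    """Read length-prefixed messages from `stream` in reads of up to `chunk_size` bytes.

    With keep_leftover=False the reader mimics the bug: after handling one
    message it throws away whatever else it already pulled off the wire.
    """
    messages = []
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return messages
        buffer += chunk
        while len(buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(buffer)
            if length > MAX_MESSAGE:
                # Comparable to detecting corruption and aborting the connection.
                raise ValueError("corrupted stream")
            end = HEADER.size + length
            if len(buffer) < end:
                break  # message incomplete, wait for more data
            messages.append(buffer[HEADER.size:end])
            buffer = buffer[end:]
            if not keep_leftover:
                buffer = b""  # the bug: bytes read past this message are lost
                break
```

In this sketch, when every read happens to contain exactly one message, the buggy reader behaves correctly. When a read contains two whole messages, the second one is silently skipped. When a read ends in the middle of a message, the next read starts with payload bytes that get parsed as a length, and the stream is flagged as corrupted.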

Sharing coding style and Roslyn analyzers across projects

by Gérald Barré

posted on: April 25, 2022

This post is part of the series 'Coding style'. Be sure to check out the rest of the blog posts of the series!
How to enforce a consistent coding style in your projects
Enforce .NET code style in CI with dotnet format
Running GitHub Super-Linter in Azure Pipelines
Sharing coding style and Roslyn analyz

Production postmortem

by Oren Eini

posted on: April 22, 2022

A customer called the support hotline with a serious problem. They had a large database and wanted to add another replica node to it. This is a fairly standard thing to do, of course. The problem was that somewhere around the 70% mark, the replication process stalled. All the metrics were green, the mentor node and the new node had perfect connectivity, and there were no errors in the logs.

Typical reasons for replication to stall usually involve connectivity issues, but in this case, we could see no sign of that. In fact, the mentor node kept sending (empty) batches to the destination node. That shouldn’t be the case, however. If we have nothing to send, there shouldn’t be a batch sent over the wire. That was the only hint of something wrong.

We also looked into what information RavenDB could tell us about the system, and noticed that we had a performance hint about big documents. Some of them exceeded 32MB in size, which is… quite a lot. That doesn’t really relate so much to replication, however. It would surely slow it down, but it should work.

Looking into the logs, we could see that the mentor node was attempting to send a batch, but it was sending zero documents. Digging deeper, we saw an entry about skipping documents. That was… strange. Cross referencing the log statement with the source code revealed that RavenDB decided that it was sending too much in the batch and aborted it. But… it wasn’t sending anything in the batch.

What is actually going on is that the database in question is an encrypted one. Encrypted databases in RavenDB are encrypted both on disk and in memory. The only time that we decrypt a document is when there is an active transaction reading it. During that time, we hold it in locked memory (so it won’t be paged to disk). As a result of that, we try to limit the size of transactions in encrypted databases.

When we replicate data between nodes, we open a read transaction on the source node, read the documents that we need to replicate and send them to the other side. There is a small caveat here: each node in an encrypted database can use a different encryption key, so we aren’t sending the encrypted data, but the plain text. Of course, the communication itself is encrypted, so nothing can peek into the data in the middle.

By default, we’ll stop a replication batch in an encrypted database after we have locked more than 64 MB of memory. A replication batch of 64 MB is plenty big enough, after all. However… we didn’t take into account a scenario where a single document may cause us to consume more than 64 MB in locked memory. And we put the check to close the replication batch very early in the process. The sequence of operations was then:

1. Start a replication batch
2. Load the first document to send
3. Realize that we locked too much memory and close the batch
4. Send a zero length batch
5. Rinse and repeat, since we can’t make any forward progress

The actual solution was to set the “Replication.MaxSizeToSendInMb” configuration option to a higher value, enough to send even the biggest documents the customer has. At that point, there was forward progress again in the system and the replication was completed successfully. We still consider this a bug, and we’ll fix it so there won’t be a hang in the system, but I’m happy to see that we were able to do a configuration change and get everything up to speed so quickly.
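The stall described above can be reproduced with a toy model. This is only a sketch under assumptions, not RavenDB's replication code: documents are represented by their locked-memory size in MB, and `build_batch_early_check`, `build_batch_late_check` and `replicate` are hypothetical names. The "late check" variant is one possible way to guarantee forward progress, not necessarily the fix RavenDB adopted.

```python
def build_batch_early_check(docs, start, max_mb):
    """Mimics the stall: the limit is tested right after a document is
    loaded and before it is added, so one oversized document yields an
    empty batch every time."""
    batch, locked = [], 0
    for size in docs[start:]:
        locked += size            # loading the document locks its memory
        if locked > max_mb:       # check happens before the document is added
            break
        batch.append(size)
    return batch


def build_batch_late_check(docs, start, max_mb):
    """Hypothetical alternative: always accept at least one document,
    then apply the limit to everything after it."""
    batch, locked = [], 0
    for size in docs[start:]:
        if batch and locked + size > max_mb:
            break
        locked += size
        batch.append(size)
    return batch


def replicate(docs, build_batch, max_mb, max_rounds=10):
    """Send batches until every document is sent or we give up.
    Returns (documents sent, rounds used)."""
    sent = 0
    for rounds in range(1, max_rounds + 1):
        sent += len(build_batch(docs, sent, max_mb))
        if sent == len(docs):
            return sent, rounds
    return sent, max_rounds


docs = [10, 20, 70, 5]  # sizes in MB; the third document alone exceeds 64 MB
print(replicate(docs, build_batch_early_check, 64))   # stalls after two documents
print(replicate(docs, build_batch_early_check, 128))  # raising the limit unblocks it
print(replicate(docs, build_batch_late_check, 64))    # progresses with the default limit
```

Raising the limit in the second call plays the role of the configuration change: it works as long as the limit is above the largest document, while the late check would keep working whatever the document size.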

RavenDB Cloud: Metrics & Disk I/O enhancements

by Oren Eini

posted on: April 21, 2022

Our cloud team just finished pushing a big set of features to production. Some of them are user facing and add some nice capabilities that I wanted to talk about. The most important feature we have in this cycle is directly exposing your instances’ metrics to you. This is a significant quality of life improvement for both our users and the cloud support team, since it makes it much easier to understand what is going on from an operational perspective.

From experience, one of the most common issues that users run into is hitting the limits of their I/O. Disk I/O in the cloud is a… complicated beast. As a database, RavenDB is sensitive to the I/O platform that it is running on. We have now made it clear what exactly you are getting from the underlying system.

You can also raise those values, of course. In fact, you can now selectively raise your disk performance on Azure (you could always do that on AWS). You can change both the size of the disk (which is permanent) and the performance tier. On Azure, you may change the performance tier for the disk every 12 hours (6 hours on AWS), so that isn’t something that you enable instantly. It is a very useful feature if you are expecting a high load (such as a big import, deployment of new indexes on large databases, initial replication, etc). Once the load is complete, you can reduce the performance tier and use a cheaper disk for your needs.

The metrics and the ability to change the disk performance tier mean that you don’t need to contact support to either figure out what is wrong or what to do about it.

Production postmortem

by Oren Eini

posted on: April 20, 2022

A typical production postmortem story is a tale of daring dives deep into the guts of your system. It is a journey into the intricacies of dependencies between multiple components, the delicate balance of distributed processes that got just the wrong level of alignment to cause some havoc. A production postmortem is a toil of mystery that can last for weeks. This isn’t one of those tales, however. In this one, the entire thing was wrapped up within fifteen minutes. So what was the issue?

The initial premise was pretty straightforward. A customer was running RavenDB in production, but due to their topology, their RavenDB instances were not exposed to the outside world directly. Instead, they routed the connection through Azure Web Application Firewall and Azure Front Door. I have no comment on the actual decision to route through those firewalls. The problem the customer had was that Azure Front Door doesn’t support web-sockets. RavenDB Studio makes extensive use of them for a bunch of reasons, and there are certain features that are also dependent on them (such as aggressive caching, the Changes() API, etc). The customer wanted everything to work, and asked if RavenDB can support a long polling method, to avoid the issue entirely.

This is an XY Problem. There was much confusion to be had, between our support team, yours truly and the technical people of the customer. Here is the issue: the problem the customer experienced is simply not possible. There is absolutely no way that they can run into this issue. Here is the deal: RavenDB is a secured-by-default database, which assumes that it is always running in a hostile environment. For security, RavenDB uses TLS 1.2 or higher to safeguard the data in transit. For authentication, RavenDB uses mutual authentication for both client and server using X509 certificates.

Take those three together and you’ll realize that the very design of RavenDB forces you to do SSL termination (here I’m using TLS and SSL as interchangeable terms) at the RavenDB process directly. We have to do it in this manner, since otherwise we wouldn’t be able to validate the certificate from the client. The customer in this case was running in a secured mode, but was completely unable to use web sockets. Again, that is not possible. Let me explain why.

If RavenDB is the entity that does SSL termination (in this case, doing the cryptographic handshake, authentication, etc), then anything in the middle between RavenDB and the client is dealing with an encrypted stream of bytes that is indistinguishable from random noise. In other words, there shouldn’t be a way to not support web sockets, since any proxy in the middle shouldn’t be able to tell what the content of the request is. This design by RavenDB also prevents you from forwarding requests, since the SSL stream must reach RavenDB directly (as-is). Otherwise, RavenDB will not be able to authenticate the client certificate.

When we looked at the actual server in question, it quickly became apparent what the issue was. The customer was accessing RavenDB using HTTPS, as is proper. However, RavenDB itself was not configured to run in a secured manner. In other words, the client was accessing RavenDB using HTTPS, but the proxies in the middle would then connect to RavenDB itself using HTTP (no security). That means that RavenDB talks to the proxy with no encryption and the proxy is able to see into the requests. That leads, of course, to the situation where the supported feature set of the proxy impacts what capabilities RavenDB can utilize.

This is a broken setup, I want to point out. It is also a highly misleading setup, because RavenDB is running in unsecured mode, but you are using HTTPS to access it. We intend to make this configuration raise an alert and block it from deployments. RavenDB goes to great lengths to ensure that you won’t have those pitfalls to stumble into. I have to admit that we have never actually considered this sort of setup as a scenario. I am strongly reminded of this.

RavenDB is amenable to running behind a proxy, of course. The key to doing so successfully is that the proxy is responsible for TCP traffic only, never interfering with the (encrypted) content that goes over the wire. As a result of this requirement, we don’t need to worry about the capabilities of the various proxies. As long as a proxy is able to support TCP connections, all features of RavenDB will work.
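To illustrate what "responsible for TCP traffic only" means, here is a minimal sketch of a pass-through proxy using Python's asyncio. It is an illustration under assumptions, not a production proxy and not anything RavenDB ships: `pipe`, `start_passthrough` and `demo` are made-up names, and the echo server in `demo` merely stands in for the backend. The proxy copies bytes in both directions and never parses them, so it cannot tell a regular request from a web socket, and a TLS handshake would pass through it untouched.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, without inspecting them."""
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def start_passthrough(listen_host, listen_port, target_host, target_port):
    """Start a TCP-level forwarder: every accepted connection is relayed as-is."""
    async def handle(client_reader, client_writer):
        upstream_reader, upstream_writer = await asyncio.open_connection(
            target_host, target_port)
        await asyncio.gather(
            pipe(client_reader, upstream_writer),
            pipe(upstream_reader, client_writer),
        )
    return await asyncio.start_server(handle, listen_host, listen_port)


async def demo():
    """Send opaque bytes through the forwarder to a stand-in echo backend."""
    async def echo(reader, writer):
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
        writer.close()

    backend = await asyncio.start_server(echo, "127.0.0.1", 0)
    backend_port = backend.sockets[0].getsockname()[1]
    proxy = await start_passthrough("127.0.0.1", 0, "127.0.0.1", backend_port)
    proxy_port = proxy.sockets[0].getsockname()[1]

    reader, writer = await asyncio.open_connection("127.0.0.1", proxy_port)
    payload = bytes(range(256))  # arbitrary bytes; the proxy never interprets them
    writer.write(payload)
    await writer.drain()
    received = await reader.readexactly(len(payload))

    writer.close()
    await writer.wait_closed()
    proxy.close()
    backend.close()
    return received
```

Running `asyncio.run(demo())` returns the same bytes that were sent. A proxy that terminates TLS itself, as in the customer's setup, is a different thing: it decrypts the traffic, speaks HTTP to the backend, and its own feature set then limits what gets through.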

Aggregate Responsibility Design

by Ardalis

posted on: April 20, 2022

The Aggregate Pattern comes from Domain-Driven Design and provides a way to encapsulate business logic among several related objects. The…

Modeling Relationships in a DDD Way

by Vladimir Khorikov

posted on: April 19, 2022

Let’s talk about modeling of relationships, including the dreaded many-to-many relationships, in a DDD way.

RavenDB with the Java API

by Oren Eini

posted on: April 19, 2022

I usually talk about RavenDB in the context of C# and .NET, but we also have clients for many other platforms. On Thursday, the Copenhagen Javagruppen Meetup is doing an online event to show how you can use RavenDB from Java. The meetup is open to the public, and we would love to see you there.

How to list all routes in an ASP.NET Core application

by Gérald Barré

posted on: April 18, 2022

When your ASP.NET Core application is big enough, you may want to have a comprehensive view of all routes. There are multiple ways to declare routes. You can use Minimal API, Controllers, Razor Pages, gRPC, Health checks, etc. But all of them use the same routing system under the hood. The collectio