skip to content
Relatively General .NET

Postmortem

by Oren Eini

posted on: September 27, 2021

On Friday, Sep 24, 2021, our monitoring tool started emit alerts. Our customers’ nodes were losing connectivity to one another. Not all of them, but a significant percentage.  The issue impacted nodes in all availability zones, in all regions and across cloud providers. We are running isolated clusters, and aside from a total breakdown of the Internet, there shouldn’t be a scenario where this happens. Looking into the details, it soon became clear that the issue wasn’t with the infrastructure or the network. Actual connectivity between the nodes was fine, they were able to reach one another, but they weren’t trusting one another. Following a checklist, we verified that this isn’t an issue with a specific release or version, the issue was affecting (some) nodes across all our supported versions, including many that has been running in production for years without issue. TLDR; If you are using RavenDB and you run into SSL / TLS connection issues such as: System.Security.Authentication.AuthenticationException: The remote certificate is invalid because of errors in the certificate chain You can resolve the issue by running the following command on your instances: rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/DAC9024F54D8F6DF94935FB1732638CA6AD77C13.pfxrm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/48504E974C0DAC5B5CD476C8202274B24C8C7172.pfx RavenDB uses X509 certificates for authentication. One of the primary reasons that it is doing so is that reduce the dependency on external actions (like LDAP, for example). We were able to quickly figure out what the root cause was. Clients were not trusting the servers. Usually, for authentication, it is the other way around, but in this case, somehow, the client connection would detect that the remote server is using an invalid certificate. We are using Let’s Encrypt certificate, like the vast majority of the web, and a quick look on our own showed us nothing. Our browsers certainly thought that the certificate was valid, and locally everything worked. On the production systems (and that impacted a few dozen clusters), on the other hand, they were flat out rejected. That was not supposed to happen. The certificate were valid, and we would get a different error if they were expired. One of the good things about using X509 certificates is that it is easy to debug independently. Here is an example of what we saw: ayende@oren-pc:~$ curl https://a.free.ayende.ravendb.cloud curl: (60) SSL certificate problem: unable to get local issuer certificate More details here: https://curl.haxx.se/docs/sslcerts.html curl failed to verify the legitimacy of the server and therefore could not establish a secure connection to it. To learn more about this situation and how to fix it, please visit the web page mentioned above. Digging a bit deeper, we can use OpenSSL to see the real details: ayende@oren-pc:~$ openssl s_client -connect a.free.ayende.ravendb.cloud:443 CONNECTED(00000003) depth=1 C = US, O = Let's Encrypt, CN = R3 verify error:num=20:unable to get local issuer certificate verify return:1 depth=0 CN = *.free.ayende.ravendb.cloud verify return:1 --- Certificate chain 0 s:CN = *.free.ayende.ravendb.cloud i:C = US, O = Let's Encrypt, CN = R3 1 s:C = US, O = Let's Encrypt, CN = R3 i:O = Digital Signature Trust Co., CN = DST Root CA X3 --- So the issue is that is is now no longer trusting Let’s Encrypt? What the hell is going on. The start of the answer is here, the root certificate for Let’s Encrypt will expire at the end of the month. That was known well in advance, and there is a new root certificate in place that should be in pretty much all devices that you use (except very old Androids, which aren’t relevant here). The problem is that for some reason, which if I understand correctly has to do with old versions of OpenSSL, the ca-certificates package, which is a common way to maintain a list of trusted root certificates has removed the soon to be expiring (but not yet actually expired) from the list of trusted root certificates. That was pushed (as a security update), to many instances. It is at this point that problem started: Established connections are fine and worked just great, since they already authenticated and were working without issue. New connections, on the other hand, would fail, because the trusted root was now untrusted. As you can imagine, this caused somewhat of an uproar for our operations team. The killer thing for us, I have to say, is that we tested this scenario before hand. There has been plenty of time to get ready, and our ops team have made sure that we provisioned the certificates with the new root. This should have been a non event, given that we prepared specifically for that. However… what happened was that on some portion of our machines, the certificate that was used was using the old chain, the one that is no longer valid. At this point we were looking at having to redeploy certificates to hundreds of clusters and we split the team to work on things in parallel. While one part of the team prepped the emergency patch to all instances, the other half focused on how to properly generate the right certificate. That is a bit complex, we intentionally do not keep the private keys of the clusters on our management plane, they are located solely on customer instances, so we first had to fetch them from the store, generate a new Let’s Encrypt certificate and then distribute it back to the right location. Tricky bit of work, but we are used to working at this scale. Except… that didn’t work. To be rather more exact, the certificates that we got from Let’s Encrypt were always using the old root certificate, never the new one. There is a way to get the specific chain from Let’s Encrypt, but our client for interacting with Let’s Encrypt did not support it. It shouldn’t matter, however, all recent (6 months or so, at least) certificates from Let’s Encrypt are signed using the new chain, so how can this be? Here is how you’ll typically see a discussion of cross-signed certificates: Basically, you have a single certificate that is signed using both chains. Except, that isn’t how it works. A certificate can be signed by a single issuer. The key here is that the issuer is not a certificate. The issuer is the public key + name of the certificate that provided the signature. The key here is that you can get the same public key + name to be signed by multiple parties. Which means, of course, that when you serve the certificate to a client, you typically need to tell it what certificate signed this one. I’ll have a more detailed post on that topic, but for now, assume that when the server reply to a client, it sends either: Our Certificate + Issuer (green) Our Certificate + Issuer (red) Note that the client that receive the certificate is free to absolutely ignore any other certificate that the server it sending, but most of the time, it will use that to follow the chain of certificates and find a trusted root. In our case, we could see that we were serving the wrong certificate chain. We were serving the (valid) server certificate, and an intermediate that pointed to a removed root. That meant that the client would follow the chain we provided and then would find a missing root, leading to a trust violation and an aborted connection. That, in a word, sucks. Because we couldn’t figure out what was going on. Given the scope of the problem and the number of impacted customers, we needed to figure out a way to resolve that as soon as possible. Going back to the Let’s Encrypt certificates that we were generating, what we actually get back from Let’s Encrypt is actually a certificate that contains: Server Certificate (a.free.ayende.ravendb.cloud, for example) Intermediate certificate R3 signed by DST Root CA X3 – that one is the problematic one Intermediate certificate R3 signed by ISRG Root X1 – that is the one we want We verified that not only are we getting the right certificate, we have been getting the right certificate for quite some time now, including on all the impacted servers. A note on the state of the system at that frame. We have quite a few customers running on the cloud, and the problem only impacted a small percentage at the time. Given features such as connection pooling, actually creating new connection tends to be rare. But the longer we let it go, the more chance is that a connection would break and we can’t renew those. What is more, that impacts both server to server communication in the cluster and clients connecting to the server. That means that this also happened to customers on their application (although it was pretty rare), the major impact that we could see is that the cluster connections broke, which meant that we effectively had a network split, even though the network itself was fine. RavenDB is actually designed to operate in such an environment, and the nodes carried on their task. Reads and writes to the database worked as usual, but cluster wide tasks (such as deploying indexes) would fail. So the sky wasn’t falling yet, but there were some dark clouds in the horizon. At this point, we have managed to figure out why the certificate isn’t trusted anymore, because of the automatic update for the ca-certificates package. The test environment showed that reinstalling the removed certificate would restore proper behavior. After verifying that this indeed fixes the issue, we rolled out across our fleet. Problem solved? Not so fast. We plugged the hole in the ship, but in a but a few days, we are going to be looking into the exact same issue, because the certificate will actually expire. We had to figure out what happened and why RavenDB wasn’t serving the right certificate chain in this context. Cue putting of spelunking equipment and digging into pretty much every code base in site. We scoured OpenSSL, .NET framework and dug through documentation about the certificate infrastructure in Windows. We couldn’t figure out what was going on. This should work. In fact, all our tests showed us that it is working, except it didn’t. At this point, we believe that we figured it out, and this is pretty insane reason. Four years ago, we run into a serious problem inside of RavenDB when running in a secured environment. If the certificate chain wasn’t available on the host, it would download that so it could serve the appropriate chain when doing TLS handshake. To avoid this issue, RavenDB will build the certificate chain once on startup, and then store that in the CA X509Store for the RavenDB user. That allowed us to remove a few remote calls on connection startup and made us more resilient when we are facing network issues. The actual problem we were trying to solve is that occasionally, RavenDB would fail when (a completely random, as far as we are concerned) server (which is serving the missing intermediate certificate) is missing. Registering the certificate ourselves in the RavenDB’s user store was a way to avoid this issue entirely. It also falls in line with the It Just Works model that we have. The other alternative would be to ask users to setup the full certificate chains locally (for each platform and environment). We rather handle that ourselves and not have to have another item in the checklist. That said, this created an interesting issue. On Linux machines, there are now multiple certificate stores that are being used. There is the .NET X509 certificate store, which is stored at ~/.dotnet/corefx/cryptography/x509stores/ca/ and there is the system store, which is stored in /etc/ssl/certs (usually). The ca-certificates update, however, caused an interesting problem. The relevant certificate was removed from the system store, but it was still on the .NET store. If it would have expired, as we expected it to, the situation would be just fine. The stores would agree and serve the new chain without issue. However, given that it didn’t expire yet, we had a discrepancy. .NET will obviously favor its X509 store if it can find the certificate there, so it kept serving the old chain. The solution to that it to manually clear the old certificate from the cache as well, you can do that by issuing the following commands: rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/DAC9024F54D8F6DF94935FB1732638CA6AD77C13.pfxrm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/48504E974C0DAC5B5CD476C8202274B24C8C7172.pfx If you are running on windows, you can run the following commands to achieve the same goal: .\PsExec.exe -u "nt authority\Local Service" powershell.exe Get-ChildItem cert:/CurrentUser/CA/48504E974C0DAC5B5CD476C8202274B24C8C7172 | Remove-Item  That would remove the relevant chain from the store and make RavenDB use the right chain. There is no need to restart the service, it should automatically fix itself. For the future, we are going to take a long look into our policies in this matter. Given the advent of Docker & Cloud services, we feel that the actual management of the entire environment should be left to the environment, not to the RavenDB process itself.

GitHub Actions On Demand

by Ardalis

posted on: September 27, 2021

I use GitHub Actions a lot. They're a great tool for performing continuous integration (CI) and continuous deployment (CD) for your…Keep Reading →

I fixed a bug. What should I do now?

by Gérald Barré

posted on: September 27, 2021

The more code you write, the more bugs you create. Indeed, this is a known fact that almost all applications are buggy! This means that as a software developer, you must spend a part of your time debugging applications.There are a few common steps when debugging an application:Getting all the neede

Atomic reference counting (with Zig code samples)

by Oren Eini

posted on: September 20, 2021

Referencing counting is probably the oldest memory management technique in existence. It is widely used, easily to understand and explain and in most cases does the Right Thing. There are edge cases and nasty scenarios, but for the most part, it works. At least, as long as you are running as a single threaded program. Here is what a reference counting scheme looks like: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters struct item { // other fields; int refs; }; void add_ref(struct item* i) { i->refs++; } void release(struct item *i){ if(--i->refs == 0) free(i); } view raw ref_count.c hosted with ❤ by GitHub However, the moment that we have any form of concurrency, the simplicity goes right out the window. Consider the call to add_ref vs. release. Both of them need a reference to a valid object. However, what happens if we have the following sequence of events?We have an object whose reference count is set to 1.Thread  #1 gets the address of the object.Thread #2 now calls to release();There are no more references to the object and it is freed.Thread #1 now calls to add_ref();It is passing the address of the recently released object, meaning that we have undefined behavior.This is a hard problem, because we need to keep the reference count somewhere, and we also need to release the resource as soon as there are no more references to it. The more generic term for this issue is the ABA problem.In literature, there are a lot of attempts to solve this issue:Hazzard pointersGCEpochsThey are complex solution to the problem, but mostly because we have two distinct steps here. First we get a pointer to the ref counted value, then we need to change that, but we need to do that safely with concurrent releases. One way of ensuring that this works is to take a lock around the entire operation (acquire the pointer & add ref) and take the same lock for release. That is a somewhat heavy weight approach.It is also pretty much completely redundant on any modern system. Any x86/x64 CPU released in the past decade will have support this assembly instruction:The cmpxchg16b instruction allow us to do an atomic operation on a value that is 128 bits long (16 bytes). That is important, because that means that we can break apart the stages above. We can store both the pointer and its counter in a single location, and operate on them atomically. This is called the DCAS (double compare and swap), which greatly simplify the problem. To the point where there is really no reason to want to use anything else.Except… if you care about ARM systems. There is no comparable instruction to this on ARM, and given how common those machines are, that seem to point us right back to the complexities of hazard pointers and epochs. Of course, ARM has 64 bits atomic instructions, as you can see here, for example:But our pointer is also 64 bits, so that doesn’t really help us that much, does it? Interesting tidbit here, however, the pointers we use aren’t actually 64 bits in size. They are just 48 bits, in truth. That is for both x64 and for AArch64. That means that you can target only a maximum of 256TB of RAM, but there are no machines that big right now (the biggest that I’m aware of are 10% of this size and are expensive). Given that this is a CPU limit, we can probably assume that this isn’t going away soon and that when it does, ARM will likely have a 128 bits atomic instruction. That means that a 64 bits instruction can give us a 16 bits of free space to work with. But we can do better. Let’s assume that we get our pointers from malloc(), or a similar call. We know that malloc() is required to return the data aligned on max_align_t size. For 64 bits, that is 16. In other words, we have full 20 bits to play with that we can utilize for our needs, while preserving the original pointer value. If we are using page aligned pointers, however, we can use 28(!) bits out of the 64 of the pointer value, for that matter. Let’s see how we can take that assumption and turn that into something usable. I’m going to use Zig here, because I like the language and it gives us a succinct manner with which to work with native code. The first thing to do is to define how we are going to overall structure: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters pub fn RefCounted(comptime T: type) type { return struct { const Self = @This(); const InternalRef = packed union { item: packed struct { value: u44, is_error: bool, references: u19 }, raw: u64 }; data: InternalRef, version: std.atomic.Atomic(u32), comptime { if (@typeInfo(T) != .Pointer) { @compileError("RefCounted may only work on pointer values"); } if (@sizeOf(InternalRef) != @sizeOf(u64)) { @compileError("Wrong size for InternalRef?"); } } }; } view raw RefCounted.zig hosted with ❤ by GitHub The size of this structure is 12 bytes, and I’m using Zig’s arbitrary precision integers to help us pack the data into a single u64 value, without having to write all the bit shifts manually. All the other functions that I’m showing here are going to be inside the generic structure. I’m asserting the size and that the T that we are working on is a pointer of some kind. You can see that the structure is also ready to be marked as an error, so we have three possible states:Empty – no value is stored insideErrored – we’ll just retain the error codeValue – we have a value and we keep the reference count in the references field. Note that this is a 19 bits field, so we have a maximum of 512K outstanding references for the counter. For pretty much all needs, I believe that this will be sufficient. In addition to the data itself, we also have the notion of a version field. That one is needed because we want to allow the caller to wait for the value to become available. Let’s see how we can get a value from this? This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters pub fn tryAcquire(self: *Self) !?T { while (true) { var cur = self.data; var original = cur; if (cur.item.is_error) { return @intToError(@intCast(u16, cur.item.references)); } if (cur.item.value == 0) { return null; } cur.item.references += 1; if (cur.item.references == std.math.maxInt(u19)) { return error.ReferenceCounterWouldOverflow; } if (@cmpxchgWeak(u64, &self.data.raw, original.raw, cur.raw, .Monotonic, .Monotonic) == null) { var v = @intCast(u64, cur.item.value) << 4; return @intToPtr(T, v); } } } view raw tryAcquire.zig hosted with ❤ by GitHub What I’m doing here is to get the current value, do some basic checks (do we have a value, was an error registered, etc). Then we increment the reference as well as ensure that we don’t overflow it. Finally, we publish the new value using cmpxchng call. Note that the whole thing is in a while loop, to ensure that if the cmpxchng fails, we’ll retry. If we were able to update the reference count, we turn the compacted value into its original form and return that to the user. Because we get the pointer value and increment the reference count as a single step, we are ensured that you cannot end up with missing the release call. The tryAcquire() call also have a sibling, which will wait for the value to become available, it looks like this: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters pub fn acquire(self: *Self, timeout: ?u64) !T { while (true) { var curVer = self.version; if (try self.tryAcquire()) |v| { return v; } try Futex.wait(&self.version, curVer, timeout); } } view raw acquire.zig hosted with ❤ by GitHub If the value does not exists (if there is an error, we’ll return immediately), we’ll wait using the Futex.wait() call. This is why we need the version field. It allows us to properly wait without requiring to create kernel level objects. Let’s see how we set the new value, potentially concurrently with threads that want to get to it as well. This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters pub fn init(self: *Self, val: T) !void { return self.set(val, null); } pub fn errored(self: *Self, code: anyerror) !void { return self.set(null, code); } fn set(self: *Self, val: ?T, code: ?anyerror) !void { while (true) { var cur = self.data; if (cur.item.value != 0 or cur.item.is_error) { return error.ValueAlreadySet; } var update: InternalRef = undefined; if (code) |e| { update = InternalRef{ .item = .{ .value = 0, .references = @errorToInt(e), .is_error = true } }; } else if (val) |v| { var u44val = try pointerToU44(v); update = InternalRef{ .item = .{ .value = u44val, .references = 1, .is_error = false } }; } else { return error.BothValueAndCodeAreEmpty; } if (@cmpxchgWeak(u64, &self.data.raw, cur.raw, update.raw, .Monotonic, .Monotonic) == null) { _ = self.version.fetchAdd(1, .Release); Futex.wake(&self.version, std.math.maxInt(u32)); break; } } } fn pointerToU44(val: T) !u44 { var iVal = @ptrToInt(val); if (iVal & 0b1111 != 0) { return error.PointerValueNot16BytesAligned; } if (iVal >> 48 != 0) { return error.PointerValueHigh16BitsArentCleared; } return @intCast(u44, iVal >> 4); } view raw set.zig hosted with ❤ by GitHub We have some convenience functions to make it clear what it is that we are actually setting (an error or a value) and then we get to the meat of this structure, the set() call. There isn’t much here, to be honest. We check that the value isn’t already set, and then set it properly as either an error or a value with a single reference (which is for the caller). We again use cmpxchng() call to ensure that we are safe in regard to multiple threads (although the usage I have in mind for this calls for a single writer, not competing ones, it doesn’t cost us anything to make it safer to use). After setting the value, we increment the version field and wake any waiting threads. You can also see how we validate and pack the pointer value to 44 bits. All of this work, but we are still missing one part. The whole point of reference counting is to delete the value when it it no longer in use, where is that code at? Let’s take a look: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters pub fn release(self: *Self, v: T, destroy: fn (T) void) !void { var val = try pointerToU44(v); while (true) { var cur = self.data; var original = cur; if (cur.item.is_error) { return @intToError(@intCast(u16, cur.item.value)); } if (cur.item.value != val) { return error.RetunedValueDoesNotMatchStoredValue; } if (cur.item.references == 1) { cur.raw = 0; } else { cur.item.references -= 1; } if (@cmpxchgWeak(u64, &self.data.raw, original.raw, cur.raw, .Monotonic, .Monotonic) == null) { if (cur.raw == 0) { destroy(v); } return; } } } view raw release.zig hosted with ❤ by GitHub We are taking both the value that we are releasing and the destruction function. The value we want to free is there to ensure that it wasn’t modified meantime, and if there are no more references, we know that we can safely destroy it. Note that this whole scheme relies on the fact that we are managing the reference count externally to the object itself. In other words, we are assuming that the RefCount value is going to be kept alive. In my case, I”m actually intending to use that as a cache, and we’ll keep an array of those values around for a long time. Otherwise, you run the same risk as before. If you have a reference to the RefCount value, and you may release that, you may end up with a situation where you have a reference to a released memory.This technique of pointer packing is valid if you use manual memory management mode, you cannot use that in C#, for example, because object may move, and even if you pinned a value, the GC will not consider the value to be a valid pointer, so it will not work for you. For managed languages, there is actually a much better option. Just let go of the value and let the finalizer handle that. In other words, something else (an already existing component) will ensure that there are no live references remaining).

Generate PDF files using an html template and Playwright

by Gérald Barré

posted on: September 20, 2021

There are many ways to generate PDF files. You can use a PDF library, but generating a PDF manually is not trivial. You can use a report library, but they are often paid solution and not trivial to use for simple needs.In this post, we'll use an html template and Playwright to generate a PDF file.

Production postmortem

by Oren Eini

posted on: September 17, 2021

We got a pretty nasty bug at a customer site a few months ago. Every now and then, the server running RavenDB will go into a high CPU mode, use 100% CPU and stay there for extended period of time. After a while, it will just return back to normal. Looking at the details, there was nothing really that should cause this scenario. The amount of load of the system didn’t justify this load, we are talking about a maximum load of under 500 requests / second. RavenDB can handle that much on a Raspberry PI without straining itself, after all.The problem was that we couldn’t figure out what was going on… none of the usual metrics were relevant. Typically, when we see high CPU utilization, the fault is either our code or the GC is working hard. In this case, however, while the RavenDB process was responsible for the CPU usage, there was no indication that it was any of the usual suspects. Here is what the spike looked like:The customer has increased the size of the machine several time, trying to accommodate the load, but the situation was not getting better. In fact, the situation appeared to be getting worse. This is on a server running Windows 2016, and all the nodes in the cluster would experience this behavior, effectively taking the system down. The did not do that on an synchronized schedule, but one would go into a high CPU load, and the clients would fail to the other nodes, which will usually (but not always) trigger the situation. After a short while, it would get back down to normal rates, but that was obviously not a good situation.After a while, we found something that was absolutely crazy. Looking at the Task Manager, we added all the possible columns and looked at them. Take a look at the following screen shot:What you see here is the Page Fault Delta, basically, how many page faults happened in the system in the past second for this process. A high number on this column is when you see hundreds of page faults. Thousands of page faults usually means that you are swapping badly. Hundreds of thousands? I have never seen such a scenario and I couldn’t imagine what that actually is. What is crazier is that this number should be physically impossible. At a glance, we are looking at a lot of reads from the disk, but looking at the disk metrics, we could see that we had very little activity there. That is when we discovered that there is another important metric, Hard Page Faults / sec. That metric, on the other hand, typically ranged in the single digits or very low double digits, nothing close to what we were seeing. So what was going on here?In addition to hard page faults (reading the data from disk), there is also the concept of soft page faults. Those page faults can happen if the OS can find the data it needs in RAM (the page cache, for example). But if it is in RAM, why do we even have a page fault in the first place? The answer is that while the memory may be in RAM, it may not have mapped to the process in question.Consider the following image, we have two processed that mapped the same file to memory. On both processes, the first page is mapped to the same physical page. But the 3rd page on the second process (5678) is not mapped. What do you think will happen if the process will access this page?At this point, the CPU will trigger a page fault, which the OS needs to handle. How is it going to do that? It can fetch the data from disk, but it doesn’t actually need to do so. What it needs to do is just update the page mapping to point to the already loaded page in memory. That is what is called a soft page fault (with hard page fault requiring us to go to disk). Note that the CPU utilization above shows that the vast majority of the time is actually spent on system time, not user time. That means that the kernel is somehow doing a lot of work, but what is going on? The issue with page fault delta was the key to understanding what exactly is going on here. When we looked deeper using ETW, we were able to capture the following trace:What you can see in the image is that the vast majority of the time is spent in handling the page fault on the kernel side, as expected given the information that we have so far. However, the reason for this issue is that we are contending on an exclusive lock? What lock is that?We worked with Microsoft to figure out what exactly is going on and we found that in order to modify the process mapping table, Windows needs to take a lock. That make sense, since you need to avoid concurrent modification to the page table. However, on Windows 2016, that is a process wide lock. Consider the impact of that. If you have a scenario where a lot of threads want to access pages that aren’t mapped to the process, what will happen?On each thread, we’ll have a page fault and handle that. If the page fault is a hard page fault, we will issue a read and put the thread to sleep until this happen. But what if this is a soft page fault? Then we just need to take the process lock, update the mapping table and return. But what if I have a high degree of concurrency? Like 64 concurrent cores that all contend on the same exact lock? You are going to end up with the exact situation above. There is going to be a hotly contested lock and you’ll spend all your time at the kernel level.The question now was, why did this happen? The design of RavenDB relies heavily on memory mapped I/O and it is something that we have been using for over a decade. What can cause us to have so many soft page faults? The answer came from looking even more deeply into the ETW traces we took. Take a look at the following stack:When we call FlushFileBuffers, as we need to do to ensure that the data is consistent on disk, there is a lot that is actually going on. However, one of the key aspects that seems to be happening is that Windows will remove pages that were written by FlushFileBuffers from the working set of the process. That will lead to page faults (soft ones). We confirmed with Microsoft that this is the expected behavior, calling FlushFileBuffers (fsync) will trim the modified pages from the process mapping table. This is done to improve coherency between the memory mapped pages and the page cache, I believe.To reproduce this scenario, you’ll need to do something similar to:Map a large number of pages (in this case, hundreds of GB)Modify the data in those pages (in our case, write documents, indexes, etc)Call FlushFileBuffers on the dataFrom many threads, access the recently flushed data (each thread ideally accessing a different page)On Windows 2016, you’ll hit a spin lock contention issue and spend most of your time contending inside the kernel. The recommendation from Microsoft has been to move to Windows 2019, where the memory lock granularity has been increased, so they won’t all contend on the same lock. Indeed, testing on Windows 2019 we weren’t able to reproduce the problem.The really strange thing here is that we have using the exact same code and approach in RavenDB for many years, and only recently did we see a shift with most of our customers running on Linux. That particular behavior is how we are used to be running, and I would expect it to be triggered often.The annoying thing about this is that this is actually the case of too much of a good thing. Usually RavenDB will scale linearly with the number of cores for reads, the customer in question moved from RavenDB 3.5 to RavenDB 5.2, and they used the same size machine in both cases. RavenDB 5.2 is far more efficient, however. It was able to utilize the cores a lot better and trigger this behavior on a consistent basis. Using RavenDB 3.5, on the other hand, a lot of the CPU time was spent on doing other things, so we didn’t trigger this issue. Indeed, a workaround to improve performance was to reduce the number of cores on the system. That reduced the contention and made the whole system more stable. The actual solution, however, was to run on Windows 2019, but that was a hard problem to solve. We tested pretty much any scenario that we could think to see what can help us here. And yes, we tested this on  Linux, and didn’t see any indication of a similar problem.

Dependency Injection Book Reviewed

by Ardalis

posted on: September 16, 2021

I like to use an index card as a bookmark when I'm reading non-fiction books, so that I can come back to interesting topics later or review…Keep Reading →