skip to content
Relatively General .NET

Production postmortem

by Oren Eini

posted on: July 13, 2022

We got a call from a customer, a pretty serious one. RavenDB is used to compute billing charges for customers. The problem was that in one of their instances, the value for a particular customer was wrong. What was worse was that it was wrong on just one instance of the cluster. So the customer would see different values in different locations. We take such things very seriously, so we started an investigation. Let me walk you through reproducing this issue, we have three collections (Users, Credits and Charges): The user is performing actions in the system, which issue charges. This is balanced by the Credits in the system for the user (payment they made). There is no 1:1 mapping between charges and credits, usually. Here is an example of the data: And now, let’s look at the index in question: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters // maps from u in docs.Users select new { u.Id, u.Name, Amount = 0 } from c in docs.Charges select new { Id = c.User, Amount = -c.Amount, Name = default(string) } from c in docs.Credits select new { Id = c.User, c.Amount, Name = default(string) } // reduce from r in results group r by r.Id into g let user = g.FirstOrDefault(x=>x.Name != null) select new { user.Name, user.Id, Amount = g.Sum(x=>x.Amount) } view raw index.cs hosted with ❤ by GitHub This is a multi map-reduce index that aggregates data from all three collections. Now, let’s run a query: This is… wrong. The charges & credits should be more or less aligned. What is going on? RavenDB has a feature called Map Reduce Visualizer, to help work with such scenarios, let’s see what this tells us, shall we? What do we see in this image? You can see that we have two results for the index. Look at Page #854 (at the top), we have one result with –67,343 and another with +67,329. The second result also does not have an Id property or a Name property. What is going on? It is important to understand that the image that we have here represents the physical layout of the data on disk. We run the maps of the documents, and then we run the reduce on each page individually, and sum them up again. This approach allows us to handle even a vast amount of data with ease. Look at what we have in Page #540. We have two types of documents there, the users/ayende document and the charges documents. Indeed, at the top of Page #540 we can see the result of reducing all the results in the page. The data looks correct. However… Look at Page #865, what is going on there? Looks like we have most of the credits there. Most importantly, we don’t have the users/ayende document there. Let’s take a look at the reduce definition we have: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters from r in results group r by r.Id into g let user = g.FirstOrDefault(x=>x.Name != null) select new { user.Name, user.Id, Amount = g.Sum(x=>x.Amount) } view raw reduce.cs hosted with ❤ by GitHub What would happen when we execute it on the results in Page #865? Well, there is no entry with the Name property there. So there is no Name, but there is also no Id. But we project this out to the next stage. When we are going to reduce the data again among all the entries in Page #854 (the root one), we’ll group by the Id property, but the Id property from the different pages is different. So we get two separate results here. The issue is that the reduce function isn’t recursive, it assumes that in all invocations, it will have a document with the Name property. That isn’t valid, since RavenDB is free to shuffle the deck in the reduce process. The index should be robust to reducing the data multiple times. Indeed, that is why we had different outputs on different nodes, since we don’t guarantee that will process results in the same order, only that the output should be identical, if the reduce function is correct. Here is the fixed version: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters from r in results group r by r.Id into g let user = g.FirstOrDefault(x=>x.Name != null) select new { user.Name, - user.Id + Id = g.Key, Amount = g.Sum(x=>x.Amount) } view raw reduce.diff hosted with ❤ by GitHub And the query is now showing the correct results: That is much better

Production postmortem

by Oren Eini

posted on: July 12, 2022

A customer called us, complaining that RavenDB isn’t supporting internationalization. That was a big term to unpack. It boiled down to a simple issue. They were using Hebrew text in their system, consuming us from a node.js client, and they observed that sometimes, RavenDB would corrupt the data. They would get JSON similar to this: { “Status”: "�", “Logged: true } That… is not good. And also quite strange. I’m a native Hebrew speaker, so I threw a lot of such texts into RavenDB in the past. In fact, one of our employees built a library project for biblical texts, naturally all in Hebrew. Another employee maintained a set of Lucene analyzers for Hebrew. I think that I can safely say that RavenDB and Hebrew has been done. But the problem persisted. What was worse, it was not consistent. Every time that we tried to see what is going on, it worked. We added code inside of RavenDB to try to detect what is going on, and there was nothing there. Eventually we tried to look into the Node.js RavenDB client, because we exhausted everything else. It looked okay, and in our tests, it… worked. So we sat down and thought about what it could be. Let’s consider the actual scenario we have on hand: Hebrew characters in JSON are being corrupted. RavenDB uses UTF-8 encoding exclusively. That means that Hebrew characters are using multi byte characters That line of thinking led me to consider that the problem is related to chunking. We read from the network in chunks, and if the chunk happened to fall on a character boundary, we mess it up, maybe? Once I started looking into this, the fix was obvious: Here we go: ‍! This bug is a great example of how things can not show up in practice for a really long time. In order to hit this you need chunking to happen in just the wrong place, and if you are running locally (as we usually do when troubleshooting), the likelihood you’ll see this is far lower. Given that most JSON property names and values are in the ASCII set, you need a chunk of just the right size to see it. Once we know about it, reproducing it is easy, just create a single string that is full of multi byte chars (such as an emoji) and make it long enough that it must be chunked. The fix was already merged and released.

Adding content negotiation to minimal APIs with Carter

by Andrew Lock

posted on: July 12, 2022

In this post I show how to use the open source library Carter to add content negotiation to an ASP.NET Core minimal API app…

Production postmortem

by Oren Eini

posted on: July 11, 2022

A customer opened a support call telling us that they reached the scaling limits of RavenDB. Given that they had a pretty big machine specifically to handle the load they were expecting, they were (rightly) upset about that. A short back and forth caused us to realize that RavenDB started to fail shortly after they added a new customer to their system. And by fail I mean that it started throwing OutOfMemoryException in certain places. The system was not loaded and there weren’t any other indications of high load. The system had plenty of memory available, but critical functions inside RavenDB would fail because of out of memory errors. We looked at the actual error and found this log message: Raven.Client.Exceptions.Database.DatabaseLoadFailureException: Failed to start database orders-21 At /data/global/ravenData/Databases/orders-21 ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown. at System.Threading.Thread.StartInternal(ThreadHandle t, Int32 stackSize, Int32 priority, Char* pThreadName) at System.Threading.Thread.StartCore() at Raven.Server.Utils.PoolOfThreads.LongRunning(Action`1 action, Object state, String name) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Utils\PoolOfThreads.cs:line 91 at Raven.Server.Documents.TransactionOperationsMerger.Start() in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\TransactionOperationsMerger.cs:line 76 at Raven.Server.Documents.DocumentDatabase.Initialize(InitializeOptions options, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DocumentDatabase.cs:line 388 at Raven.Server.Documents.DatabasesLandlord.CreateDocumentsStorage(StringSegment databaseName, RavenConfiguration config, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DatabasesLandlord.cs:line 826 This is quite an interesting error. To start with, this is us failing to load a database, because we couldn’t spawn the relevant thread to handle transaction merging. That is bad, but why? It turns out that .NET will only consider a single failure scenario for a thread failing to start. If it fails, it must be because the system is out of memory. However, we are running on Linux, and there are other reasons why that can happen. In particular, there are various limits that you can set on your environment that would limit the number of threads that you can set. There are global knobs that you should look at first, such as those: /proc/sys/kernel/threads-max /proc/sys/kernel/pid_max /proc/sys/vm/max_map_count Any of those can serve as a limit. There are also ways to set those limits on a per process manner. There is also a per user setting, which is controlled via: /etc/systemd/logind.conf: UserTasksMax The easiest way to figure out what is going on is to look at the kernel log at that time, here is what we got in the log: a-orders-srv kernel: cgroup: fork rejected by pids controller in /system.slice/ravendb.service That made it obvious where the problem was, in the ravendb.service file, we didn’t have TasksMax set, which meant that it was set to 4915 (probably automatically set by the system depending on some heuristic). When the number of databases and operations on the database reached a particular size, we hit the limit and started failing. That is not a fun place to be in, but at least it is easy to fix. I created this post specifically so it will be easy to Google that in the future. I also created an issue to get a better error message in this scenario.

Replace characters in a string using Vectorization

by Gérald Barré

posted on: July 11, 2022

This post is part of the series 'SIMD'. Be sure to check out the rest of the blog posts of the series!Faster Guid comparisons using Vectors (SIMD) in .NETFinding the maximum value in an array using vectorizationReplace characters in a string using Vectorization (this post)I continue to look at code

The Art of Debugging

by Oren Eini

posted on: July 08, 2022

I often end up being pulled into various debugging sessions. I find that having a structured approach for debugging can help you figure out what is going on and fix even the nastiest of bugs. This is a story about a few different bugs that I run into. We have some new features coming in RavenDB 5.4 which touch the manner in which we manage some data. When we threw enough data at RavenDB, it died. Quite rudely, I have to say. I described the actual bug here, if you care. There wasn’t really much to finding this one. The process crashes, we don’t know why. The process will usually gives us good signs about what exactly is happening. In this case, attaching a debugger and running the code and seeing where it failed was sufficient. I’ll admit that I mostly pressed F10 and F11 until I found the offending function, and then I didn’t believe it for some time. Still in the same area, however, we began to run into a problem with impossible results. Invariants were broken, quite harshly. The problem was that this was something we could only reproduce after ~30 minutes or so of running a big benchmark. This does not make it easy to figure out what is going on. What is worse, there is a distinct gap between the time that the bug showed up and when it actually happened. That made it very hard to figure out what is going on. This is where I think the structured approach shines. Architecture – I hate concurrency debugging. I’m roughly 20+ years of experience in writing parallel programs, it is not sufficient to debug concurrency, I’m afraid. As a result, the design of RavenDB goes to great lengths to avoid having to do concurrency coordination. There are very well defined locations where we are applying concurrency (handling requests) and there are places where we are using serial manner (modifying data). The key is that we are using multiple serial processes at the same time. Each index is bound to a thread, for example, but they are independent of one another. The error in question was in an index, and I was able to narrow it down to a particular area of the code. Somewhere along the way, we messed things up. Reproducing – I had a way to reproduce the issue, it was just utterly unhelpful for debugging purposes. But since the process is a serial one, it meant that it is also fairly consistent. So I added a bunch of printf() that logged the interesting operations that we had: And here is what this looked like: Each line is a particular command that I’m executing. I wrote a trivial parser that would read from the file and perform the relevant operations. For ease of explanation, imagine that I’m writing a hash table, and this set of operations prime it to expose the issue. The file contained hundreds of millions of operations. Somewhere toward the end, we run into the corrupted state. That is too much to debug, but I don’t need to do that. I can write code to verify the state of the data structure. There is one problem here, the cost. Verifying the data structure is an operation that is O(N*logN + N) at least (probably O(N^2), to be honest, didn’t bother to run the proper analysis). I can’t run it on each command.  The good news is that I don’t need to. Here is what I did: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters int counter = 0; string line = null; while((line = reader.ReadLine()) { counter++; var parts = line.Split(); if(parts[0] == "put") { var id = dataStructure.Put(long.Parse(parts[1], parts[2]); mapping[parts[3]] = id; } else // del { dataStructure.Remove(mapping[parts[1]], out _); } if(counter % 100_000 == 0) { dataStructure.FullVerify(); } } view raw repro.cs hosted with ❤ by GitHub That did a full verification every 100,000 runs. This meant that I was confident that this was roughly the error way. Then I just repeated the process, raising the minimum start for doing the check and reducing the gaps in the frequencies. This is a pretty boring part, to be honest. You run the code, do some emails, check it again, update a couple of numbers, and move on. This bisection approach yields great results, however. It means that I could very quickly narrow down to the specific action that caused the error. That was in step 48,292,932, by the way. The actual error was discovered far later. Fixing – that part I’m going to skip, this is usually fairly obvious once you know what is going on. In the case above, there was some scenario where we were off by 4 bytes, and that caused… havoc. The key here is that for most of the debugging process, I don’t really need to do any sort of thinking. I figure out what information I need to gather, then I write the code that would do that for me. And then I basically keep running things until I narrowed the field down to the level where the problem is clear and I can fix that.

My interview at Productive C#

by Oren Eini

posted on: July 07, 2022

I had an interview with Andrea Angella from Product C# earlier this week. I think that it was quite an interesting discussion, you can watch it here (or listen in Podcast form):

Returning XML from minimal APIs in .NET 6

by Andrew Lock

posted on: July 05, 2022

In this post I show how to return an XML response from minimal APIs, how to create a helper extension method, and discuss performance considerations…

Finding the maximum value in an array using vectorization

by Gérald Barré

posted on: July 04, 2022

This post is part of the series 'SIMD'. Be sure to check out the rest of the blog posts of the series!Faster Guid comparisons using Vectors (SIMD) in .NETFinding the maximum value in an array using vectorization (this post)Replace characters in a string using VectorizationIn the previous post, I de

Challenge

by Oren Eini

posted on: July 01, 2022

Yesterday I presented a bug that killed the process in a particularly rude manner. This is a recursive function that guards against stack overflows using RuntimeHelpers.EnsureSufficientExecutionStack(). Because of how this function kills the process, it took some time to figure out what is going on. There was no StackOverflowException, just an exit code. Here is the relevant code: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters unsafe void WriteNulls(Dictionary<string, object?> o, BinaryWriter writer) { writer.Write7BitEncodedInt(o.Count); var sizeInBytes = o.Count / (sizeof(byte) * 8) + o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1; const int MaxStackAllocation = 256; if (sizeInBytes > MaxStackAllocation) { WriteNullsAllocating(o, writer); return; } byte* bits = stackalloc byte[sizeInBytes]; var index = -1; foreach (var kvp in o) { index++; if (kvp.Value is not null) continue; bits[index / (sizeof(byte) * 8)] |= (byte)(1 << (index % (sizeof(byte) * 8))); } writer.Write(new Span<byte>(bits, sizeInBytes)); } view raw opps.cs hosted with ❤ by GitHub This looks okay, we optimize for zero allocations on the common path (less than 2K items), but also handle the big one. The problem is that our math is wrong. More specifically, take a look at this line: var sizeInBytes = o.Count / (sizeof(byte) * 8) + o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1; Let’s assume that your count is 10, what do you think the value of this is going to be? Well, it looks like this should give us 2, right? 10 / 8 + 10%8 == 0 ? 0 :1 The problem is in the operator precedence. I read this as: (10 / 8) + (10 % 8 == 0 ? 0 : 1) And the C# compiler read it as: (10 / 8 + 10 % 8) == 0 ? 0 : 1 In other words, *#@*!*@!. The end result is that we overwrite past our allocated stack. Usually that doesn’t do anything bad, since there is enough stack space. But sometimes, if the stack is aligned just right, we cross into the stack guard page and kill the process. Opps, that was not expected.