skip to content
Relatively General .NET

Production postmortem

by Oren Eini

posted on: July 11, 2022

A customer opened a support call telling us that they reached the scaling limits of RavenDB. Given that they had a pretty big machine specifically to handle the load they were expecting, they were (rightly) upset about that. A short back and forth caused us to realize that RavenDB started to fail shortly after they added a new customer to their system. And by fail I mean that it started throwing OutOfMemoryException in certain places. The system was not loaded and there weren’t any other indications of high load. The system had plenty of memory available, but critical functions inside RavenDB would fail because of out of memory errors. We looked at the actual error and found this log message: Raven.Client.Exceptions.Database.DatabaseLoadFailureException: Failed to start database orders-21 At /data/global/ravenData/Databases/orders-21 ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown. at System.Threading.Thread.StartInternal(ThreadHandle t, Int32 stackSize, Int32 priority, Char* pThreadName) at System.Threading.Thread.StartCore() at Raven.Server.Utils.PoolOfThreads.LongRunning(Action`1 action, Object state, String name) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Utils\PoolOfThreads.cs:line 91 at Raven.Server.Documents.TransactionOperationsMerger.Start() in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\TransactionOperationsMerger.cs:line 76 at Raven.Server.Documents.DocumentDatabase.Initialize(InitializeOptions options, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DocumentDatabase.cs:line 388 at Raven.Server.Documents.DatabasesLandlord.CreateDocumentsStorage(StringSegment databaseName, RavenConfiguration config, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DatabasesLandlord.cs:line 826 This is quite an interesting error. To start with, this is us failing to load a database, because we couldn’t spawn the relevant thread to handle transaction merging. That is bad, but why? It turns out that .NET will only consider a single failure scenario for a thread failing to start. If it fails, it must be because the system is out of memory. However, we are running on Linux, and there are other reasons why that can happen. In particular, there are various limits that you can set on your environment that would limit the number of threads that you can set. There are global knobs that you should look at first, such as those: /proc/sys/kernel/threads-max /proc/sys/kernel/pid_max /proc/sys/vm/max_map_count Any of those can serve as a limit. There are also ways to set those limits on a per process manner. There is also a per user setting, which is controlled via: /etc/systemd/logind.conf: UserTasksMax The easiest way to figure out what is going on is to look at the kernel log at that time, here is what we got in the log: a-orders-srv kernel: cgroup: fork rejected by pids controller in /system.slice/ravendb.service That made it obvious where the problem was, in the ravendb.service file, we didn’t have TasksMax set, which meant that it was set to 4915 (probably automatically set by the system depending on some heuristic). When the number of databases and operations on the database reached a particular size, we hit the limit and started failing. That is not a fun place to be in, but at least it is easy to fix. I created this post specifically so it will be easy to Google that in the future. I also created an issue to get a better error message in this scenario.

Replace characters in a string using Vectorization

by Gérald Barré

posted on: July 11, 2022

This post is part of the series 'SIMD'. Be sure to check out the rest of the blog posts of the series!Faster Guid comparisons using Vectors (SIMD) in .NETFinding the maximum value in an array using vectorizationReplace characters in a string using Vectorization (this post)I continue to look at code

The Art of Debugging

by Oren Eini

posted on: July 08, 2022

I often end up being pulled into various debugging sessions. I find that having a structured approach for debugging can help you figure out what is going on and fix even the nastiest of bugs. This is a story about a few different bugs that I run into. We have some new features coming in RavenDB 5.4 which touch the manner in which we manage some data. When we threw enough data at RavenDB, it died. Quite rudely, I have to say. I described the actual bug here, if you care. There wasn’t really much to finding this one. The process crashes, we don’t know why. The process will usually gives us good signs about what exactly is happening. In this case, attaching a debugger and running the code and seeing where it failed was sufficient. I’ll admit that I mostly pressed F10 and F11 until I found the offending function, and then I didn’t believe it for some time. Still in the same area, however, we began to run into a problem with impossible results. Invariants were broken, quite harshly. The problem was that this was something we could only reproduce after ~30 minutes or so of running a big benchmark. This does not make it easy to figure out what is going on. What is worse, there is a distinct gap between the time that the bug showed up and when it actually happened. That made it very hard to figure out what is going on. This is where I think the structured approach shines. Architecture – I hate concurrency debugging. I’m roughly 20+ years of experience in writing parallel programs, it is not sufficient to debug concurrency, I’m afraid. As a result, the design of RavenDB goes to great lengths to avoid having to do concurrency coordination. There are very well defined locations where we are applying concurrency (handling requests) and there are places where we are using serial manner (modifying data). The key is that we are using multiple serial processes at the same time. Each index is bound to a thread, for example, but they are independent of one another. The error in question was in an index, and I was able to narrow it down to a particular area of the code. Somewhere along the way, we messed things up. Reproducing – I had a way to reproduce the issue, it was just utterly unhelpful for debugging purposes. But since the process is a serial one, it meant that it is also fairly consistent. So I added a bunch of printf() that logged the interesting operations that we had: And here is what this looked like: Each line is a particular command that I’m executing. I wrote a trivial parser that would read from the file and perform the relevant operations. For ease of explanation, imagine that I’m writing a hash table, and this set of operations prime it to expose the issue. The file contained hundreds of millions of operations. Somewhere toward the end, we run into the corrupted state. That is too much to debug, but I don’t need to do that. I can write code to verify the state of the data structure. There is one problem here, the cost. Verifying the data structure is an operation that is O(N*logN + N) at least (probably O(N^2), to be honest, didn’t bother to run the proper analysis). I can’t run it on each command.  The good news is that I don’t need to. Here is what I did: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters int counter = 0; string line = null; while((line = reader.ReadLine()) { counter++; var parts = line.Split(); if(parts[0] == "put") { var id = dataStructure.Put(long.Parse(parts[1], parts[2]); mapping[parts[3]] = id; } else // del { dataStructure.Remove(mapping[parts[1]], out _); } if(counter % 100_000 == 0) { dataStructure.FullVerify(); } } view raw repro.cs hosted with ❤ by GitHub That did a full verification every 100,000 runs. This meant that I was confident that this was roughly the error way. Then I just repeated the process, raising the minimum start for doing the check and reducing the gaps in the frequencies. This is a pretty boring part, to be honest. You run the code, do some emails, check it again, update a couple of numbers, and move on. This bisection approach yields great results, however. It means that I could very quickly narrow down to the specific action that caused the error. That was in step 48,292,932, by the way. The actual error was discovered far later. Fixing – that part I’m going to skip, this is usually fairly obvious once you know what is going on. In the case above, there was some scenario where we were off by 4 bytes, and that caused… havoc. The key here is that for most of the debugging process, I don’t really need to do any sort of thinking. I figure out what information I need to gather, then I write the code that would do that for me. And then I basically keep running things until I narrowed the field down to the level where the problem is clear and I can fix that.

My interview at Productive C#

by Oren Eini

posted on: July 07, 2022

I had an interview with Andrea Angella from Product C# earlier this week. I think that it was quite an interesting discussion, you can watch it here (or listen in Podcast form):

Returning XML from minimal APIs in .NET 6

by Andrew Lock

posted on: July 05, 2022

In this post I show how to return an XML response from minimal APIs, how to create a helper extension method, and discuss performance considerations…

Finding the maximum value in an array using vectorization

by Gérald Barré

posted on: July 04, 2022

This post is part of the series 'SIMD'. Be sure to check out the rest of the blog posts of the series!Faster Guid comparisons using Vectors (SIMD) in .NETFinding the maximum value in an array using vectorization (this post)Replace characters in a string using VectorizationIn the previous post, I de

Challenge

by Oren Eini

posted on: July 01, 2022

Yesterday I presented a bug that killed the process in a particularly rude manner. This is a recursive function that guards against stack overflows using RuntimeHelpers.EnsureSufficientExecutionStack(). Because of how this function kills the process, it took some time to figure out what is going on. There was no StackOverflowException, just an exit code. Here is the relevant code: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters unsafe void WriteNulls(Dictionary<string, object?> o, BinaryWriter writer) { writer.Write7BitEncodedInt(o.Count); var sizeInBytes = o.Count / (sizeof(byte) * 8) + o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1; const int MaxStackAllocation = 256; if (sizeInBytes > MaxStackAllocation) { WriteNullsAllocating(o, writer); return; } byte* bits = stackalloc byte[sizeInBytes]; var index = -1; foreach (var kvp in o) { index++; if (kvp.Value is not null) continue; bits[index / (sizeof(byte) * 8)] |= (byte)(1 << (index % (sizeof(byte) * 8))); } writer.Write(new Span<byte>(bits, sizeInBytes)); } view raw opps.cs hosted with ❤ by GitHub This looks okay, we optimize for zero allocations on the common path (less than 2K items), but also handle the big one. The problem is that our math is wrong. More specifically, take a look at this line: var sizeInBytes = o.Count / (sizeof(byte) * 8) + o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1; Let’s assume that your count is 10, what do you think the value of this is going to be? Well, it looks like this should give us 2, right? 10 / 8 + 10%8 == 0 ? 0 :1 The problem is in the operator precedence. I read this as: (10 / 8) + (10 % 8 == 0 ? 0 : 1) And the C# compiler read it as: (10 / 8 + 10 % 8) == 0 ? 0 : 1 In other words, *#@*!*@!. The end result is that we overwrite past our allocated stack. Usually that doesn’t do anything bad, since there is enough stack space. But sometimes, if the stack is aligned just right, we cross into the stack guard page and kill the process. Opps, that was not expected.

Challenge

by Oren Eini

posted on: June 30, 2022

The following code is something that we ran into yesterday, under some conditions, this code will fail with a stack overflow. More specifically, the process crashes and the return code is –1073740791 (or as it is usually presented: 0xC0000409. At this point in my career I can look at that error code and just recall that this is the Windows error code for a stack overflow, to be more precise, this is: STATUS_STACK_BUFFER_OVERRUN That… makes sense, I guess, this is a recursive code, after all. Let’s take a look: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters using System.Buffers; using System.Collections; using System.Linq; using System.Runtime.CompilerServices; using System.Runtime.InteropServices; var dic = Enumerable.Range(0, 100).ToArray() .ToDictionary(x => x.ToString(), x => (object?)null); Write(dic, new BinaryWriter(new MemoryStream())); void Write(object? item, BinaryWriter writer) { switch (item) { case string s: writer.Write((byte)1); writer.Write(s); break; case Dictionary<string, object?> o: RuntimeHelpers.EnsureSufficientExecutionStack(); writer.Write((byte)2); WriteNulls(o, writer); foreach (var kvp in o) { writer.Write(kvp.Key); Write(kvp.Value, writer); } break; default: throw new InvalidDataException("Cannot understand how to write " + item.GetType()); } } unsafe void WriteNulls(Dictionary<string, object?> o, BinaryWriter writer) { writer.Write7BitEncodedInt(o.Count); var sizeInBytes = o.Count / (sizeof(byte) * 8) + o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1; const int MaxStackAllocation = 256; if (sizeInBytes > MaxStackAllocation) { WriteNullsAllocating(o, writer); return; } byte* bits = stackalloc byte[sizeInBytes]; var index = -1; foreach (var kvp in o) { index++; if (kvp.Value is not null) continue; bits[index / (sizeof(byte) * 8)] |= (byte)(1 << (index % (sizeof(byte) * 8))); } writer.Write(new Span<byte>(bits, sizeInBytes)); } void WriteNullsAllocating(Dictionary<string, object?> o, BinaryWriter writer) { var bits = ArrayPool<byte>.Shared.Rent(o.Count / 8 + 1); var index = -1; foreach (var kvp in o) { index++; if (kvp.Value is not null) continue; bits[index / 8] |= (byte)(1 << (index % 8)); } writer.Write(bits); ArrayPool<byte>.Shared.Return(bits); } view raw opps.cs hosted with ❤ by GitHub Except, that this code explicitly protects against this. Note the call to: RuntimeHelpers.EnsureSufficientExecutionStack(); In other words, if we are about the run out of stack space, we ask the .NET framework to throw (just before we run out, basically). This code doesn’t fail often, and we tried to push deeply nested structure through that, and we got an InsufficientExecutionStackException thrown. Sometimes, however, when we run this code with a relatively flat structure (2 – 4 levels), it will just die with this error. Can you spot the bug?