A file pager is the component in a database system responsible for reading and writing pages (typically 8KB blocks) to and from the file system. It owns the I/O operations and is crucial for the overall performance of the system. Ideally, it should manage details such as caching pages in memory, reducing I/O costs, and continuously optimizing the overall behavior of the storage.
That can be a pretty big chunk of a storage system, and it can have a significant impact on the way the storage system behaves. Here is the most basic version that I can think of:
pub const Page = struct {
    page: u64,          // first page number covered by this instance
    numberOfPages: u32, // how many sequential pages the buffer spans
    buffer: []u8,       // the in-memory data for those pages
};

pub const Pager = struct {
    // Returns the page if it is already in memory; does not block.
    pub fn tryGet(self: *Pager, page: u64, count: u32) !?Page {}
    // Waits until the requested pages are loaded into memory.
    pub fn getBlocking(self: *Pager, page: u64, count: u32) !Page {}
    // Tells the Pager that the memory is no longer in active use.
    pub fn release(self: *Pager, page: Page) void {}
    // Writes the data from the in-memory page to storage.
    pub fn write(self: *Pager, page: Page) !void {}
    // Ensures that all previous writes are durable on disk.
    pub fn sync(self: *Pager) !void {}
};
The idea is that whenever you need a particular page, you'll request it using tryGet(), which will return the page if it is already in memory, but will not block. You can call getBlocking() to force the current thread to wait until the page is in memory. That allows the calling code to perform some really nice optimizations.
Once we have the page, the Pager is charged with keeping it in memory until we release it. Note that I'm talking about a Page, but that might actually span multiple sequential pages. The release() call tells the Pager that the memory is no longer in active use; the Pager may decide to do something about that.
Finally, we have the write() method, which will write the data from the in-memory page to storage, and the sync() method, which will ensure that all previous writes are durably stored on disk.
There aren't that many moving pieces, right? Note in particular that we don't have the notion of transactions here; this is lower level than that. This API has the following properties:
The same page will always be represented in memory by the same location. However, if we release and get the page again, it may move.
The methods tryGet(), getBlocking() and release() have no locking or threading limits. You may call them in any context and the Pager will deal with any concurrency internally.
The write() and sync() calls, on the other hand, require synchronization by the client. There can be no concurrency between the two.
With that in place, we can build quite a sophisticated storage system. But we’ll focus on how the pager works for now.
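To make the calling pattern concrete, here is a minimal usage sketch. It is written in C# against a hypothetical IPager interface that mirrors the Zig definition above; all of the names here are mine, purely for illustration, not an actual API. The idea is to grab everything that is already cached without blocking, and only wait for the pages that missed:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical C# mirror of the pager API above, for illustration only.
public interface IPager
{
    bool TryGet(long page, int count, out PageHandle handle); // non-blocking
    PageHandle GetBlocking(long page, int count);             // waits for the I/O
    void Release(PageHandle handle);
    void Write(PageHandle handle);
    void Sync();
}

public readonly struct PageHandle
{
    public long Page { get; init; }
    public int NumberOfPages { get; init; }
    public ReadOnlyMemory<byte> Buffer { get; init; }
}

public static class PagerUsage
{
    // Read a set of pages: take whatever is already cached without blocking,
    // then fall back to blocking reads only for the pages that missed.
    public static void ReadPages(IPager pager, IEnumerable<long> pageNumbers)
    {
        var handles = new List<PageHandle>();
        var misses = new List<long>();

        foreach (var pageNum in pageNumbers)
        {
            if (pager.TryGet(pageNum, 1, out var handle))
                handles.Add(handle); // already in memory, no waiting
            else
                misses.Add(pageNum); // deal with the cold pages later
        }

        foreach (var pageNum in misses)
            handles.Add(pager.GetBlocking(pageNum, 1)); // now we wait

        try
        {
            // ... use the buffers ...
        }
        finally
        {
            foreach (var handle in handles)
                pager.Release(handle); // the pager may now evict this memory
        }
    }
}
```

The split between the non-blocking and the blocking calls is exactly what lets the caller schedule the expensive waits, instead of stalling on the first cold page it encounters.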
There are a bunch of ways to implement this, so I’ll have at least a couple of posts on the topic. How would you approach implementing this?
I got into a good discussion about how RavenDB implements some optimizations with transaction handling. The details got big enough (and hopefully interesting enough) that they warrant their own post.
When we are talking about transactions, one of the key factors in the cost of a transaction is the amount of time that it takes to persist it. This is different for local and distributed transactions.
For a local transaction, we can consider the transaction committed if it is durably stored on the disk.
For a distributed transaction, we can consider the transaction committed if it is durably stored on a majority of the disks in the cluster.
That factor right there is the primary reason why a distributed transaction is more expensive. For a local transaction, you are waiting for the disk. For a distributed transaction, you are waiting for multiple disks and the network.
One of the core optimizations for speeding up transactions is the ability to batch things. The cost of writing to the disk is more or less the same, regardless of how much you write (within an order of magnitude or so). In other words, writing 8 KB and writing 32 KB have pretty much the same cost. Writing 1 MB and writing 100 MB do not, but writing 1 MB vs. 4 MB isn't meaningfully different (for a sequential durable write, which is what we care about in the case of transactions).
The point of this post is how this is actually handled. RavenDB utilizes a process called transaction merging to reduce the number of times that we have to go to the disk. Concurrent transactions will be bundled into the same write call, massively increasing our throughput. To give you some context, without transaction merging, you peak at a few hundred transactions per second. With transaction merging, you can jump to high thousands of transactions per second. Here is how this works.
RavenDB actually takes this further: in addition to transaction merging, we also apply something we call async commit. Consider the timeline of a single transaction.
A transaction is actually composed of two separate steps. First we need to execute whatever commands we have in the transaction, then we have to write the transaction changes to disk.
RavenDB is able to start processing the next transaction as soon as the previous one has started its write to the disk. The idea is to parallelize compute and I/O, and we benefit greatly as a result. Note that this is safe to do, since the next transaction won't be committed until the prior transaction has been durably stored.
How does this work in practice? Whenever we have a new transaction, we add it to a queue. A dedicated thread pulls those transactions from the queue and merges them, running the separate transactions as one big operation. When we run out of pending transactions or hit certain size / timing limits, we'll commit the merged transaction and start working on the next one while the commit is completing in the background.
There are certain algorithms, such as Nagle's algorithm, that try to maximize throughput by waiting for additional transactions to arrive before actually going to the disk. RavenDB doesn't use that approach. If the system is idle and we get a single transaction, we'll execute and commit it immediately.
But the fact that we don't explicitly do Nagle doesn't mean that it isn't useful. Because we have to wait for the disk, what ends up happening is that under load, we start getting more pending transactions in the queue, which will then be executed as a merged unit. In other words, RavenDB implements a dynamic batching approach, shaped by the actual I/O constraints and the load on the system. If we get the occasional lone transaction, we'll execute it immediately. As the load increases, we'll start creating larger merged transactions. This way we can keep a fairly consistent response time even when the load on the system grows by leaps and bounds.
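Here is a minimal sketch of that dynamic batching loop. This is not RavenDB's actual code, just the shape of the idea described above: a queue, a dedicated merger thread, no artificial delays, and the next batch being executed while the previous one is still flushing to disk. The PendingTx type and the WriteAndSyncAsync method are placeholders I made up for the example:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: PendingTx and WriteAndSyncAsync are stand-ins, not real RavenDB types.
public class TransactionMerger
{
    private readonly BlockingCollection<PendingTx> _queue = new();
    private const int MaxBatchSize = 1024; // illustrative limit

    public void Enqueue(PendingTx tx) => _queue.Add(tx);

    public void Run(CancellationToken token)
    {
        Task previousCommit = Task.CompletedTask;

        while (!token.IsCancellationRequested)
        {
            // Wait only for the first transaction - there is no Nagle-style delay.
            var first = _queue.Take(token);
            var batch = new List<PendingTx> { first };

            // Grab whatever piled up while we were idle or flushing,
            // without waiting for anything more to arrive.
            while (batch.Count < MaxBatchSize && _queue.TryTake(out var next))
                batch.Add(next);

            // Execute all the commands as one big merged transaction.
            // This overlaps with the previous batch's write to disk.
            foreach (var tx in batch)
                tx.ExecuteCommands();

            // The new batch may only commit after the prior one is durable.
            previousCommit.Wait();
            previousCommit = WriteAndSyncAsync(batch); // flush in the background
        }
    }

    private static Task WriteAndSyncAsync(List<PendingTx> batch)
    {
        // Placeholder: write the merged changes, fsync, then complete the callers.
        return Task.Run(() =>
        {
            foreach (var tx in batch)
                tx.NotifyCommitted();
        });
    }
}

public class PendingTx
{
    public void ExecuteCommands() { /* apply the transaction's commands in memory */ }
    public void NotifyCommitted() { /* signal the waiting caller that it is durable */ }
}
```

Because the loop never waits for additional input, a lone transaction on an idle system is committed immediately; under load, the queue fills up while the previous flush is in flight, and the batches grow on their own.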
The situation is similar when we are talking about distributed transactions. RavenDB uses the Raft protocol for managing its distributed behavior. I’m going to focus just on the batching aspect of the behavior. RavenDB will send an AppendEntries message to the other members in the cluster every 200 ms or so. However, if we have a new command to send to the cluster, it will go out immediately over the network. An important factor here is that we are using TCP, and we require acknowledgment from the other side before we send the next message. As a result of those behaviors, we have pretty much the same situation. Depending on the network latency and the processing time, we’ll send more entries in a single roundtrip.
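The same pattern can be sketched for the network side. Again, this is not RavenDB's Raft implementation, just an illustration of the single-message-in-flight discipline described above: if nothing is in flight, a new command goes out immediately; otherwise it accumulates and rides the next AppendEntries round trip. SendAppendEntriesAsync is a placeholder for the actual send-and-wait-for-acknowledgment step:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative only: SendAppendEntriesAsync stands in for the real message
// send plus acknowledgment round trip to a cluster member.
public class EntrySender
{
    private readonly object _lock = new();
    private readonly List<byte[]> _pending = new();
    private bool _inFlight;

    public void Submit(byte[] entry)
    {
        lock (_lock)
        {
            _pending.Add(entry);
            if (_inFlight)
                return; // a round trip is in progress; this entry rides the next one
            _inFlight = true;
        }
        _ = SendLoopAsync();
    }

    private async Task SendLoopAsync()
    {
        while (true)
        {
            List<byte[]> batch;
            lock (_lock)
            {
                if (_pending.Count == 0)
                {
                    _inFlight = false;
                    return; // nothing left to send, go idle
                }
                batch = new List<byte[]>(_pending);
                _pending.Clear();
            }

            // One message in flight at a time; everything submitted while we
            // wait for the acknowledgment is coalesced into the next message.
            await SendAppendEntriesAsync(batch);
        }
    }

    private Task SendAppendEntriesAsync(List<byte[]> batch) => Task.CompletedTask; // placeholder
}
```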
In short, the overall behavior of RavenDB is that we'll start the operation immediately on the first action (both for disk and network), and then batch anything that arrives while the first operation is in flight, sending it as the next unit of work.
After over a decade of working in this manner, I can say that this has proven to be a highly adaptable system that requires a minimal number of knobs to mess with. It favors latency over throughput when there isn't a lot of activity and shifts toward favoring throughput over latency as the load grows.
The phrase “work well under pressure” is something that I consider to be a red flag in a professional environment. My company builds a database that is used as the backend of business critical systems. If something breaks, there is a need to fix it. It costs money (sometimes a lot of money) for every minute of downtime.
Under such a scenario, I absolutely want the people handling the issue to remain calm, collected and analytical. In such a case, being able to work well under pressure is a huge benefit.
That is not how this term is typically used, however. You'll usually hear this phrase in reference to the everyday working environment, for example, working under time pressure to deliver certain functionality. That sort of pressure is toxic over time.
Excess stress is a well-known contributor to health issues (both mental and physical); it causes you to make mistakes and adds friction all around.
From my perspective, the ability to work well under pressure is an absolutely important quality, but it is one that should be hoarded. You may need to draw on it to deal with a blocking customer issue, but you should be careful not to spend it on non-critical stuff.
And by definition, most things are not critical. If everything is critical, you have a different problem.
That means that part of the task of the manager is to identify the places where pressure is applied and remove it. In the context of software, that may mean delaying a release date or removing features to reduce the amount of work.
When working with technology, the most valuable asset you have is the people and the knowledge they have. And one of the easiest ways to lose that is to burn the candle at both ends. You get more light, sure, but you also get no candle.
A RavenDB user has accidentally deleted a collection. They intended to do something else, probably, but… They have a backup, but as you can imagine, this is a bad place to be in.
They talked to us and mentioned that they want a feature where deletion in the studio can be locked by a password or even two factor authentication, to prevent such a scenario.
We are not going to implement such a feature. From a technical perspective, this is a pretty easy thing to do, of course. My issue is that it doesn’t make sense for such a feature to exist. Let me dig into the details and explain what the problem is.
Locking deletes behind a password or two-factor authentication is a security feature, and that has a major impact on all aspects of the design. However, what we are dealing with here is preventing mistakes on the part of the user, not another security capability (this user can do deletes, this one cannot).
As such, this isn’t a security feature, but a UX one. The delete is already asking for confirmation, but it is the sort of thing that you rarely read, as we all know.
The distinction between a security feature and a UX feature is important. If this is a security feature, that means that I need to prevent mass deletes everywhere: as the result of queries, when iterating over ids, in patch operations, and so on. If this is a UX issue, that is a whole different level.
Looking at other such destructive operations, where the user is allowed to do the operation but we want to prevent accidents, leads me to consider requiring the user to perform some explicit action when there is a major risk, along the lines of the sketch below. That shifts the burden to the user, but it means that we now need to consider how to apply this.
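As an illustration of the kind of friction I have in mind (my own sketch, not the studio's actual behavior), think of the familiar type-the-name-to-confirm prompt:

```csharp
using System;

public static class DestructiveOperationGuard
{
    // Illustration only: before a risky operation, make the user type the exact
    // name of the thing that is about to be destroyed.
    public static bool ConfirmByTypingName(string operation, string targetName)
    {
        Console.WriteLine($"You are about to {operation} '{targetName}'. This cannot be undone.");
        Console.Write("Type the name to confirm: ");
        var typed = Console.ReadLine();
        return string.Equals(typed?.Trim(), targetName, StringComparison.Ordinal);
    }
}

// Hypothetical usage:
// if (DestructiveOperationGuard.ConfirmByTypingName("delete collection", "Orders"))
//     DeleteCollection("Orders");
```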
Are we dealing with just mass deletes? What about update queries?
The purpose here isn't to prevent the user from performing the operation, but to have them stop and consider for a moment. The problem is that for common operations, this adds a significant amount of friction to your daily work.
When working on importing data, for example, it is common to delete the previous run each time you run (think development time, every 3 minutes). Adding hurdles along the way is a PITA.
When a user reports an error in a Roslyn Analyzer or a Source Generator, you may need to know the version of Roslyn and the language version that are currently being used. The compiler provides everything needed. Indeed, you simply need to add #error version to any source file and check the compilation output.
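For example, something along these lines (the exact diagnostic text varies by compiler version):

```csharp
// Add this line to any .cs file in the project and build:
#error version

// The build then fails with a diagnostic that reports the compiler version and
// the effective language version, roughly along the lines of:
//   error CS8304: Compiler version: '4.x.y-...'. Language version: 12.0.
// Remove the line once you have captured the information.
```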
I asked about a slow piece of code and why it is so slow. The answer is pretty simple: I messed up. Take a look at the code below. When the Value is an int, I'm creating a span from the address of a long variable (initialized to zero). That means that I have a lot of hash collisions, which means that adding to the dictionary turns into a sequential operation, which means that the actual cost here is O(N²).

Interestingly enough, you can't write this code without the use of unsafe. I actually didn't know that the scope of the variable was even valid here to have its address taken. This was very hard to debug, because the problem was hidden away somewhere very far from where we were looking.
The following code takes a long time to run. In fact, I’m writing this blog post while this is running, and I’m not sure how long that will take.
// Requires compiling with unsafe enabled and the Standart.Hash.xxHash NuGet package.
using System;
using System.Collections.Generic;
using System.Diagnostics;

unsafe class Program
{
    public class NumericValue
    {
        public string Name;
        public object Value;

        public override bool Equals(object obj)
        {
            return obj is NumericValue value &&
                   Name == value.Name &&
                   Equals(Value, value.Value);
        }

        public unsafe override int GetHashCode()
        {
            Span<byte> span;
            if (Value is long l)
            {
                span = new Span<byte>(&l, sizeof(long));
            }
            else if (Value is int i)
            {
                span = new Span<byte>(&l, sizeof(int));
            }
            else
            {
                throw new NotSupportedException("Unknown value");
            }
            return (int)Standart.Hash.xxHash.xxHash32.ComputeHash(span,
                span.Length, (uint)Name.GetHashCode());
        }
    }

    static void Main(string[] args)
    {
        var dic = new Dictionary<NumericValue, int>();
        var sp = Stopwatch.StartNew();
        for (int i = 0; i < 100_000; i++)
        {
            dic.Add(new NumericValue
            {
                Name = "test",
                Value = i
            }, i);
        }
        Console.WriteLine(sp.ElapsedMilliseconds);
    }
}
Update: This took over a minute to complete on my machine (which is pretty nice). The killer is that this code is correct: it does the right thing, but it can be slow. I stripped a much larger scenario down to ~50 lines of code. Can you see what is going on? And why?
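For completeness, here is one way the bug described in the answer above could be fixed: hash the value that was actually matched, instead of always taking the address of the long. This is my own sketch of a fix, not the code from the original scenario; a fully managed alternative would be to simply return HashCode.Combine(Name, Value) and skip the unsafe code entirely.

```csharp
// One possible fix, sketched here: hash the bytes of the value that was actually
// matched, so int values no longer all hash from the zero-initialized long.
public unsafe override int GetHashCode()
{
    if (Value is long l)
    {
        var span = new Span<byte>(&l, sizeof(long));
        return (int)Standart.Hash.xxHash.xxHash32.ComputeHash(span,
            span.Length, (uint)Name.GetHashCode());
    }
    if (Value is int i)
    {
        var span = new Span<byte>(&i, sizeof(int));
        return (int)Standart.Hash.xxHash.xxHash32.ComputeHash(span,
            span.Length, (uint)Name.GetHashCode());
    }
    throw new NotSupportedException("Unknown value");
}
```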