Relatively General .NET

Debugging native memory issues in a C# application

by Oren Eini

posted on: April 10, 2023

I’m working on improving the performance of Corax, RavenDB’s new search engine. Along the way, I introduced a bug, a fairly nasty one. At a random location, while indexing a corpus of ~50 million documents, we get an access violation exception. That means that I messed something up.

That makes sense, given that my changes were mostly about making things lower-level: working directly with pointers and avoiding length checks. At our speed, even the use of Span can be a killer for performance, and we want to be as close to the raw metal as possible. The particular changeset that I was working on improved the indexing speed from 90,000 per second to 120,000 per second. That is a change that I absolutely want to keep, so I started investigating it.

I mentioned that it is a fairly nasty problem. A truly nasty problem would be heap corruption that is discovered after the fact and is very hard to trace. In this case, the failure was not consistent, which is really strange. One of the important aspects of Corax is that it is single-threaded, which means that a lot of complexity is out the window. It means that for the same input, we always have the same behavior. If there is any variance, such as not crashing all the time, it means that there are external factors involved.

At any rate, given that it happened at least half the time, I was able to attach WinDbg to the process and wait for the exception to happen. This is what I got:

    (5e20.1468): Access violation - code c0000005 (first chance)
    First chance exceptions are reported before any exception handling.
    This exception may be expected and handled.
    Corax!Corax.IndexWriter.AddEntriesToTermResultViaSmallPostingList+0x953:
    00007ffa`24dcea53 c4e261902411 vpgatherdd xmm4,dword ptr [rcx+xmm2],xmm3 ds:0000026d`516514e7=????????

Now, look at the last line, which is the interesting one: we use the VPGATHERDD assembly instruction.
It gathers packed DWORD values. In C#, it is generated by the Avx2.GatherVector128() method. We are using that to do some bit packing in this case, so this makes a lot of sense.

Next, let’s see what we get from the exception:

    0:074> .exr -1
    ExceptionAddress: 00007ffafc2bfe7c (KERNELBASE!RaiseException+0x000000000000006c)
       ExceptionCode: c0000005 (Access violation)
      ExceptionFlags: 00000080
    NumberParameters: 2
       Parameter[0]: 0000000000000000
       Parameter[1]: 0000026d51650000
    Attempt to read from address 0000026d51650000

All of this points to an out-of-bounds read, but why is that? The call we have for GatherVector128() is used inside a method named ReadAvx2(). And this method is called like this:

    private unsafe static ulong Read(int stateBitPos, byte* inputBufferPtr, int bitsToRead, int inputBufferSize, out int outputStateBit)
    {
        if ((stateBitPos + bitsToRead) / 8 >= inputBufferSize)
            throw new ArgumentOutOfRangeException();
        if (Avx2.IsSupported)
        {
            return ReadAvx2(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
        }
        return ReadScalar(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
    }

It is an optimized approach to read some bits from a buffer. I’ll skip the details on exactly how this works. As you can see, we have a proper bounds check here, ensuring that we aren’t reading past the end of the buffer.

Except… that we aren’t actually checking this. What we are doing is checking that we can access the byte range, but consider the following scenario: we have a memory page and a buffer that is located toward the end of it. We are now trying to access the last bit in the buffer, using ReadAvx2(). If we check the actual byte range, it will pass, since we are trying to access the last byte. However, we are going to call GatherVector128(), which means that we’ll actually access 16 bytes(!). Only the first byte is in the valid memory range; the rest is going to be read from the next page, which isn’t mapped.
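To make the scenario concrete, here is a minimal Python sketch of the address arithmetic only. It is not the Corax code: the function names, the 4 KiB page size, and the simplification that the wide load touches 16 contiguous bytes starting at the byte that holds the requested bit are all assumptions made for illustration.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
LOAD_WIDTH = 16    # assumption: bytes touched by the wide load, per the post


def passes_byte_check(state_bit_pos, bits_to_read, buffer_size):
    """Mirror of the check in Read(): only the byte index is validated."""
    return (state_bit_pos + bits_to_read) // 8 < buffer_size


def crosses_into_next_page(buffer_addr, buffer_size, state_bit_pos):
    """True if the wide load ends on a later page than the buffer's last byte."""
    last_valid = buffer_addr + buffer_size - 1
    last_touched = buffer_addr + state_bit_pos // 8 + LOAD_WIDTH - 1
    return last_touched // PAGE_SIZE != last_valid // PAGE_SIZE


def passes_wide_check(state_bit_pos, buffer_size):
    """One possible stricter guard (hypothetical): the whole load must fit."""
    return state_bit_pos // 8 + LOAD_WIDTH <= buffer_size


# A 64-byte buffer whose last byte is also the last byte of a page.
at_page_end = 0x10000 - 64
# The same buffer placed in the middle of a page.
mid_page = 0x10100

bit_pos = 63 * 8  # a bit inside the final byte of the buffer

print(passes_byte_check(bit_pos, 7, 64))              # the existing check passes
print(crosses_into_next_page(at_page_end, 64, bit_pos))  # wide load spills into the next page
print(crosses_into_next_page(mid_page, 64, bit_pos))     # over-read stays on the same page
```

The same over-read happens in both placements. It only becomes a fault when the bytes past the buffer land on a page that is not mapped, which is why the byte-range check alone is not enough for a wide load.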
This also explains why we are not always failing. If the next page is valid (which is subject to the decisions of the operating system allocator), the read will pass. That is why we didn’t have 100% reproduction. In fact, this is the sort of bug that can easily stay hidden in the system for a very long time, given that it depends on the actual memory layout of the application.

Once we figured out what was going on, it was pretty easy to understand, but the fact that the AVX instructions will read past the end of the buffer was really confusing. Even when we used Span, its range checks would be completely ignored. That makes total sense, given that those aren’t really methods, but compiler intrinsics that are translated to direct machine instructions.

Amusingly enough, now that we have found the problem, it turns out that we ran into something very similar a long while ago. Back then, it was the wrong instruction being used (loading a word instead of a byte) that would fail, but the overall setup was the same. It would sometimes fail, depending on the state of the next page in memory.

We actually built some tooling around managing that, which we call electric fence memory. We allocate memory so that any out-of-bounds access always hits invalid memory, stopping us in our tracks. That means that I can get easy reproduction of those sorts of issues, and once we have that, the rest isn’t really that interesting, to be honest. It’s just a normal bug fix. It’s the hunt for the root cause that is both incredibly frustrating and quite rewarding.
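The post does not show how the electric fence allocator is built, so the following is only a sketch of the layout arithmetic such an allocator might use, under the assumption of 4 KiB pages and a single inaccessible guard page placed directly after the buffer. The `FencedLayout` type and `fenced_layout` function are hypothetical names. A real implementation would reserve the region and protect the guard page with OS calls such as VirtualAlloc/VirtualProtect on Windows or mmap/mprotect on POSIX systems, which this sketch does not do.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumption: 4 KiB pages


@dataclass(frozen=True)
class FencedLayout:
    total_size: int    # bytes to reserve, including the guard page
    user_offset: int   # where the caller's buffer starts within the region
    guard_offset: int  # where the inaccessible guard page starts


def fenced_layout(size, page_size=PAGE_SIZE):
    """Place a buffer so that its end touches the start of a guard page."""
    if size <= 0:
        raise ValueError("size must be positive")
    data_pages = -(-size // page_size)        # ceiling division
    guard_offset = data_pages * page_size     # guard page follows the data pages
    return FencedLayout(
        total_size=guard_offset + page_size,
        user_offset=guard_offset - size,      # buffer ends exactly at the guard
        guard_offset=guard_offset,
    )


layout = fenced_layout(100)
print(layout)
# The first byte past the buffer is the first byte of the guard page.
print(layout.user_offset + 100 == layout.guard_offset)
```

With this layout, reading even one byte past the end of the buffer lands on the guard page, so the failure is deterministic instead of depending on what happens to be mapped next. One trade-off of this placement is that the buffer start is generally not page-aligned.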

Investigating a crash in Enumerable.LastOrDefault with a custom collection

by Gérald Barré

posted on: April 10, 2023

This post is part of the series 'Crash investigations and code reviews'. Be sure to check out the rest of the blog posts of the series!
Investigating a performance issue with a regex
Investigating an infinite loop in Release configuration
Investigating a crash in Enumerable.LastOrDefault with a custom

Adding client-side validation to ASP.NET Core, without jQuery or unobtrusive validation

by Andrew Lock

posted on: April 04, 2023

In this article I describe how to use the aspnet-client-validation library to provide client-side validation instead of relying on jQuery.…

Listing all available ETW events in a .NET application

by Gérald Barré

posted on: April 03, 2023

When tracing an application, it can be useful to know which ETW events are available. .NET exposes many events to trace an application as shown in some previous posts (Getting telemetry data from inside or outside a .NET application, or Avoid DNS issues with HttpClient in .NET). It's hard to know w

Tricks of the trade: Figuring out progress of a large upload

by Oren Eini

posted on: March 31, 2023

I found myself today needing to upload a file to S3. The upload is a few hundred GB in size. I executed the appropriate command, like so:

    aws s3api put-object --bucket twitter-2020-rvn-dump --key mydb.backup --body ./mydb.backup

But then I realized that this is uploading a file of a few hundred GB to S3, which may take a while. The command doesn’t print any progress information, so I had no way to figure out where it was at. I decided to poke around and see what I could find. First, I ran this command:

    ps -aux | grep s3api

This gave me the PID of the upload process in question. Then I checked the file descriptors for this process, like so:

    $ ls -alh /proc/84957/fd
    total 0
    dr-x------ 2 ubuntu ubuntu  0 Mar 30 08:10 .
    dr-xr-xr-x 9 ubuntu ubuntu  0 Mar 30 08:00 ..
    lrwx------ 1 ubuntu ubuntu 64 Mar 30 08:10 0 -> /dev/pts/8
    lrwx------ 1 ubuntu ubuntu 64 Mar 30 08:10 1 -> /dev/pts/8
    lrwx------ 1 ubuntu ubuntu 64 Mar 30 08:10 2 -> /dev/pts/8
    lr-x------ 1 ubuntu ubuntu 64 Mar 30 08:10 3 -> /backups/mydb.backup

As you can see, file descriptor #3 is the one that we care about, so we can ask for more details:

    $ cat /proc/84957/fdinfo/3
    pos:    140551127040
    flags:  02400000
    mnt_id: 96
    ino:    57409538

The pos field is the current byte offset in the file. In other words, the process is currently at ~130 GiB into the file, or thereabouts. It’s not ideal, but it does give me some idea about where we are at. It is a nice demonstration of the ability to poke into the insides of a running system to figure out what is going on.
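The manual steps above can be wrapped in a small script. This is a minimal sketch, assuming a Linux /proc filesystem and permission to read the target process's entries; the function names are hypothetical and not part of any tool mentioned in the post.

```python
import os


def parse_fdinfo_pos(fdinfo_text):
    """Extract the 'pos' field (current byte offset) from fdinfo contents."""
    for line in fdinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "pos":
            return int(value.strip())
    raise ValueError("no 'pos' field found")


def upload_progress(pid, fd):
    """Return (offset, file size, fraction done) for an open file of a process."""
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        pos = parse_fdinfo_pos(f.read())
    # os.stat follows the /proc/<pid>/fd/<fd> symlink to the underlying file.
    size = os.stat(f"/proc/{pid}/fd/{fd}").st_size
    return pos, size, (pos / size if size else 0.0)


# The fdinfo output shown in the post, used here as sample input.
sample = "pos:\t140551127040\nflags:\t02400000\nmnt_id:\t96\nino:\t57409538\n"
offset = parse_fdinfo_pos(sample)
print(f"{offset / 2**30:.1f} GiB read so far")
```

Calling `upload_progress(84957, 3)` on the machine running the upload would return the offset together with the file size, which turns the raw position into a percentage. Note that the offset reflects how far the process has read the file, which is only a proxy for how much has actually reached S3.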

Storing information in its highest form

by Vladimir Khorikov

posted on: March 29, 2023

There’s an interesting guideline I’ve been meaning to write about for a long time. I call it Storing information in its highest form.

Understanding the .NET ecosystem: The introduction of .NET Standard

by Andrew Lock

posted on: March 28, 2023

In this article, previously part of my new book, we look at .NET Standard, look at why it was created, and discuss its future.…

Handling CancelKeyPress using a CancellationToken

by Gérald Barré

posted on: March 27, 2023

You sometimes need to detect when a console application is closing to perform some cleanup. Console.CancelKeyPress allows registering a callback when a user presses Ctrl+C or Ctrl+Break. This event can prevent the application from closing, so you can take a few seconds to perform the cleanup before a

RavenDB 6.0 live instance is now up & running: Come test it out!

by Oren Eini

posted on: March 21, 2023

RavenDB has a public live test instance, and we have recently upgraded it to version 6.0. That means that you can start playing around with RavenDB 6.0 directly, including giving us feedback on any issues that you find. Of particular interest, of course, is the sharding feature, it is right here:

And once enabled, you can see things in more detail:

If we did things properly, the only thing you’ll notice that indicates that you are running in sharded mode is:

Take a look, and let us know what you think. As a reminder, at the top right of the page, there is the submit feedback option:

Use it, we are waiting for your insights.

Understanding the .NET ecosystem: The evolution of .NET into .NET 7

by Andrew Lock

posted on: March 21, 2023

In this article, previously part of my new book, we look at the introduction of .NET Core, why it was created, and how it has evolved into .NET 7.…