Relatively General .NET

Configuration values & Escape hatches

by Oren Eini

posted on: January 27, 2025

RavenDB is meant to be a self-managing database, one that is able to take care of itself without constant hand-holding from the database administrator. That has been one of our core tenets from the get-go. Today I checked the current state of the codebase and we have roughly 500 configuration options that are available to control various aspects of RavenDB’s behavior.

These two statements are seemingly contradictory, because if we have so many configuration options, how can we even try to be self-managing? And how can a database administrator expect to juggle all of those options?

Database configuration is a really finicky topic. For example, RocksDB’s authors flat-out admit that out loud:

"Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments and benchmarking."

And indeed, efforts were made to tune RocksDB using deep-learning models because it is that complex.

RavenDB doesn’t take that approach. Tuning is something that should work out of the box, managed directly by RavenDB itself. Much of that is achieved by not doing things and carefully arranging that the environment will balance itself out in an optimal fashion. But I’ll talk about the Zen of RavenDB another time.

Today, I want to talk about why we have so many configuration options, the vast majority of which you, as a user, should neither use, care about, nor even know of.

The idea is very simple: deploying a database engine is a Big Deal, and as such, something that users are quite reluctant to do. When we hit a problem and a support call is raised, we need to provide some mechanism for the user to fix things until we can ensure that this behavior is accounted for in the default manner of RavenDB.

I treat the configuration options more as escape hatches that allow me to muddle through stuff than explicit options that an administrator is expected to monitor and manage.

Some of those configuration options control whether RavenDB will utilize vectored instructions, or which compression algorithm to use over the wire. If you need to touch them, it is amazing that they exist. If you have to deal with them on a regular basis, we need to go back to the drawing board.

Roslyn Annotations for Code Fix

by Gérald Barré

posted on: January 27, 2025

Roslyn analyzers can be used to detect patterns in your code and report them. They can also provide code fixes to automatically fix the issues. In this case the code fix takes the existing SyntaxTree and returns the new SyntaxTree with the issue fixed. Roslyn does provide the concept of Annotation to m

Partial writes, IO_Uring and safety

by Oren Eini

posted on: January 24, 2025

In my previous post, I discussed how Linux will silently truncate a big write (> 2 GB) for you. That is expected by the interface of write(). The problem is that this behavior also applies when you use IO_Uring. Take a look at the following code:

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        return 1;
    }
    io_uring_prep_write(sqe, fd, buffer, BUFFER_SIZE, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        return 2;
    }

If BUFFER_SIZE is 3 GB, then this will write about 2 GB to the file. The number of bytes written is correctly reported, but the complexity this generates is huge. Consider the following function:

    int32_t rvn_write_io_ring(
        void *handle,
        int32_t count,
        struct page_to_write *buffers,
        int32_t *detailed_error_code);

There is a set of buffers that I want to write, and the natural way to do that is:

    int32_t rvn_write_io_ring(
        void *handle,
        int32_t count,
        struct page_to_write *buffers,
        int32_t *detailed_error_code)
    {
        struct handle *handle_ptr = handle;
        for (size_t i = 0; i < count; i++)
        {
            struct io_uring_sqe *sqe = io_uring_get_sqe(
                &handle_ptr->global_state->ring);
            io_uring_prep_write(sqe,
                handle_ptr->file_fd,
                buffers[i].ptr,
                buffers[i].count_of_pages * VORON_PAGE_SIZE,
                buffers[i].page_num * VORON_PAGE_SIZE
            );
        }
        return _submit_and_wait(&handle_ptr->global_state->ring,
            count, detailed_error_code);
    }

    int32_t _submit_and_wait(
        struct io_uring* ring,
        int32_t count,
        int32_t* detailed_error_code)
    {
        int32_t rc = io_uring_submit_and_wait(ring, count);
        if(rc < 0)
        {
            *detailed_error_code = -rc;
            return FAIL_IO_RING_SUBMIT;
        }

        struct io_uring_cqe* cqe;
        for(int i = 0; i < count; i++)
        {
            rc = io_uring_wait_cqe(ring, &cqe);
            if (rc < 0)
            {
                *detailed_error_code = -rc;
                return FAIL_IO_RING_NO_RESULT;
            }
            if(cqe->res < 0)
            {
                *detailed_error_code = -cqe->res;
                return FAIL_IO_RING_WRITE_RESULT;
            }
            io_uring_cqe_seen(ring, cqe);
        }
        return SUCCESS;
    }

In other words, send all the data to the IO Ring, then wait for all those operations to complete. We verify complete success and can then move on. However, because we may have a write that is greater than 2 GB, and because the interface allows the IO Uring to write less than we thought it would, we need to handle that with retries.

After thinking about this for a while, I came up with the following implementation:

    int32_t _submit_writes_to_ring(
        struct handle *handle,
        int32_t count,
        struct page_to_write *buffers,
        int32_t* detailed_error_code)
    {
        struct io_uring *ring = &handle->global_state->ring;
        off_t *offsets = handle->global_state->offsets;
        memset(offsets, 0, count * sizeof(off_t));

        while(true)
        {
            int32_t submitted = 0;

            for (size_t i = 0; i < count; i++)
            {
                off_t offset = offsets[i];
                // this buffer was already written in full
                if(offset == buffers[i].count_of_pages * VORON_PAGE_SIZE)
                    continue;

                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                if (sqe == NULL) // the ring is full, flush it...
                    break;

                io_uring_prep_write(sqe, handle->file_fd,
                    buffers[i].ptr + offset,
                    buffers[i].count_of_pages * VORON_PAGE_SIZE - offset,
                    buffers[i].page_num * VORON_PAGE_SIZE + offset);
                // set the user data after the prep call, since older
                // liburing versions reset user_data in the prep helpers
                io_uring_sqe_set_data64(sqe, i);

                submitted++;
            }

            if(submitted == 0)
                return SUCCESS;

            int32_t rc = io_uring_submit_and_wait(ring, submitted);
            if(rc < 0)
            {
                *detailed_error_code = -rc;
                return FAIL_IO_RING_SUBMIT;
            }

            struct io_uring_cqe *cqe;
            uint32_t head = 0;
            uint32_t i = 0;
            bool has_errors = false;

            io_uring_for_each_cqe(ring, head, cqe) {
                i++;
                uint64_t index = io_uring_cqe_get_data64(cqe);
                int result = cqe->res;
                if(result < 0)
                {
                    has_errors = true;
                    *detailed_error_code = -result;
                }
                else
                {
                    offsets[index] += result;
                    if(result == 0)
                    {
                        // there shouldn't be a scenario where we return 0
                        // for a write operation, we may want to retry here
                        // but figuring out if this is a single happening, or if
                        // we need to retry this operation (_have_ retried it?) is
                        // complex enough to treat this as an error for now.
                        has_errors = true;
                        *detailed_error_code = EIO;
                    }
                }
            }
            io_uring_cq_advance(ring, i);

            if(has_errors)
                return FAIL_IO_RING_WRITE_RESULT;
        }
    }

That is a lot of code, but it is mostly because of how C works. What we do here is scan through the buffers we need to write, as well as scan through an array of offsets that store additional information for the operation.

If the offset to write doesn’t indicate that we’ve written the whole thing, we’ll submit it to the ring and keep going until we either fill the entire ring or run out of buffers to work with. The next step is to submit the work and wait for it to complete, then run through the results, check for errors, and update the offset that we wrote for the relevant buffer.

Then, we scan the buffers array again to find either partial writes that we have to complete (we didn’t write the whole buffer) or buffers that we didn’t write at all because we filled the ring. In either case, we submit the new batch of work to the ring and repeat until we run out of work.

This code assumes that we cannot have a non-error state where we write 0 bytes to the file and treats that as an error. We also assume that an error in writing to the disk is fatal, and the higher-level code will discard the entire IO_Uring if that happens.

The Windows version, by the way, is somewhat simpler. Windows explicitly limits the size of the buffer you can pass to the write() call (and its IO Ring equivalent). It also ensures that it will write the whole thing, so partial writes are not an issue there.

It is interesting to note that the code above will effectively stripe writes if you send very large buffers. Let’s assume that we send it two large buffers (4 GB and 6 GB), like so:

                Offset    Size
    Buffer 1    1 GB      4 GB
    Buffer 2    10 GB     6 GB

The patterns of writes that will actually be executed are:

    1 GB .. 3 GB, 10 GB .. 12 GB
    3 GB .. 5 GB, 12 GB .. 14 GB
    14 GB .. 16 GB

I can “fix” that by never issuing writes that are larger than 2 GB and issuing separate writes for each 2 GB range, but that leads to other complexities (e.g., tracking state if I split a write and hit the full ring status, etc.). At those sizes, it doesn’t actually matter in terms of efficiency or performance.

Partial writes are almost always a sign of either very large writes that were broken up or some underlying issue that is already problematic, so I don’t care that much about that scenario in general. For the vast majority of cases, this will issue exactly one write for each buffer.

What is really interesting from my point of view, however, is how even a pretty self-contained feature can get pretty complex internally. On the other hand, this behavior allows me to push a whole bunch of work directly to the OS and have it send all of that to the disk as fast as possible.

In our scenarios, under load, we may call that with thousands to tens of thousands of pages (each 8 KB in size) spread all over the file. The buffers are actually sorted, so ideally, the kernel will be able to take advantage of that, but even if not, just reducing the number of syscalls will result in performance savings.
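The striping pattern described above can be reproduced with a small simulation that never touches the disk. This is a minimal sketch, not the RavenDB code: `pending_write`, `simulate_round`, and `count_rounds` are hypothetical names, and the per-write cap is assumed to be exactly 2 GB so the numbers match the example (the real Linux limit is slightly lower, 0x7ffff000 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define GB (1024ULL * 1024 * 1024)

/* Hypothetical bookkeeping for one buffer, mirroring the offsets array idea. */
struct pending_write {
    uint64_t file_offset; /* where the buffer starts in the file */
    uint64_t size;        /* total number of bytes to write */
    uint64_t written;     /* bytes completed in earlier rounds */
};

/* Simulates one submit-and-wait round: every unfinished buffer gets exactly
 * one write, and the kernel is assumed to complete at most max_per_write
 * bytes of it. Returns the number of writes issued in this round. */
int simulate_round(struct pending_write *w, size_t count, uint64_t max_per_write)
{
    int issued = 0;
    for (size_t i = 0; i < count; i++) {
        assert(w[i].written <= w[i].size);
        uint64_t remaining = w[i].size - w[i].written;
        if (remaining == 0)
            continue; /* already written in full */
        uint64_t chunk = remaining < max_per_write ? remaining : max_per_write;
        uint64_t start = w[i].file_offset + w[i].written;
        printf("  %llu GB .. %llu GB\n",
               (unsigned long long)(start / GB),
               (unsigned long long)((start + chunk) / GB));
        w[i].written += chunk;
        issued++;
    }
    return issued;
}

/* Repeats rounds until nothing is left to write; returns the round count. */
int count_rounds(struct pending_write *w, size_t count, uint64_t max_per_write)
{
    int rounds = 0;
    while (simulate_round(w, count, max_per_write) > 0)
        rounds++;
    return rounds;
}
```

Calling `count_rounds` on the two buffers from the example (offset 1 GB with size 4 GB, offset 10 GB with size 6 GB) and a 2 GB cap takes three rounds, issuing two writes, two writes, and then a single write.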

Answer

by Oren Eini

posted on: January 22, 2025

I previously asked what the code below does, and mentioned that it should give interesting insight into the kind of mindset and knowledge a candidate has. Take a look at the code again:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <sys/stat.h>

    #define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes

    int main() {
        int fd;
        char *buffer;
        struct stat st;

        buffer = (char *)malloc(BUFFER_SIZE);
        if (buffer == NULL) {
            return 1;
        }

        fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
        if (fd == -1) {
            return 2;
        }

        if (write(fd, buffer, BUFFER_SIZE) == -1) {
            return 3;
        }

        if (fsync(fd) == -1) {
            return 4;
        }

        if (close(fd) == -1) {
            return 5;
        }

        if (stat("large_file.bin", &st) == -1) {
            return 6;
        }

        printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));

        free(buffer);
        return 0;
    }

This program will output:

    File size: 2.00 GB

And it will write 2 GB of zeros to the file:

    ~$ head large_file.bin | hexdump -C
    00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    7ffff000

The question is why? And the answer is quite simple. Linux has a limitation of about 2 GB for writes to the disk (0x7ffff000 bytes, as the hexdump above shows). Any write call that attempts to write more than that will only write that much, and you’ll have to call the system again. This is not an error, mind. The write call is free to write less than the size of the buffer you passed to it.

Windows has the same limit, but it is honest about it

In Windows, all write calls accept a 32-bit int as the size of the buffer, so this limitation is clearly communicated in the API. Windows will also ensure that for files, a WriteFile call that completes successfully writes the entire buffer to the disk.

And why am I writing 2 GB of zeros? In the code above, I’m using malloc(), not calloc(), so I wouldn’t expect the values to be zero. Because this is a large allocation, malloc() calls the OS to provide us with the buffer directly, and the OS is contractually obligated to provide us with zeroed pages.
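Since write() is free to write less than it was asked to, the usual way to deal with that is to loop until the whole buffer is out. The following is a minimal sketch of that pattern; `write_all` is a hypothetical helper name, not something from the original program, and treating a zero-byte result as an error is an assumption made here to avoid spinning forever.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Keeps calling write() until all len bytes are written or an error occurs.
 * Returns 0 on success and -1 on failure (errno is set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal, just retry */
            return -1;
        }
        if (n == 0) {
            /* assumption: no progress on a non-empty write is an error */
            errno = EIO;
            return -1;
        }
        /* partial write: advance past the bytes that were accepted */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Replacing the single write() call in the program above with `write_all(fd, buffer, BUFFER_SIZE)` would issue a second write for the remainder, so the file would end up at the full 3 GB.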

WinForms: Analyze This (Me in Visual Basic)

by Klaus Loeffelmann

posted on: January 21, 2025

Your WinForms code might have issues—maybe an Async call picked the wrong overload, or it’s leaking data into resource files. Time to call in a code-shrink! So, WinForms, Analyze This!

Challenge

by Oren Eini

posted on: January 20, 2025

Here is a pretty simple C program, running on Linux. Can you tell me what you expect its output to be?

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <sys/stat.h>

    #define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes

    int main() {
        int fd;
        char *buffer;
        struct stat st;

        buffer = (char *)malloc(BUFFER_SIZE);
        if (buffer == NULL) {
            return 1;
        }

        fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
        if (fd == -1) {
            return 2;
        }

        if (write(fd, buffer, BUFFER_SIZE) == -1) {
            return 3;
        }

        if (fsync(fd) == -1) {
            return 4;
        }

        if (close(fd) == -1) {
            return 5;
        }

        if (stat("large_file.bin", &st) == -1) {
            return 6;
        }

        printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));

        free(buffer);
        return 0;
    }

And what happens when I run:

    head large_file.bin | hexdump -C

This both shows surprising behavior and serves as a good opening for discussion on a whole bunch of issues. In an interview setting, that can give us a lot of insight into the sort of knowledge a candidate has.

Using Roslyn to analyze and rewrite code in a solution

by Gérald Barré

posted on: January 20, 2025

I've written a lot about Roslyn in the context of Roslyn Analyzers and Source Generators. You can also use Roslyn as a library to analyze and generate code. For instance, you can create a console application that loads a solution, finds patterns, and rewrites code. While Roslyn Analyzers are tied to

Production post-mortem

by Oren Eini

posted on: January 17, 2025

The scenario in question was performance degradation over time. The metric in question was the average request latency, and we could track a small but consistent rise in this number over the course of days and weeks. The load on the server remained pretty much constant, but the latency of the requests grew.

The problem was that it took a long time (many days or multiple weeks) for us to observe that. But we had the charts to prove that this was pretty consistent. If the RavenDB service was restarted (we did not have to restart the machine), the situation would instantly fix itself and then slowly degrade over time.

That the customer didn’t notice that is an interesting story on its own. RavenDB will automatically prioritize the fastest node in the cluster to be the “customer-facing” one, and it alleviated the issue to such an extent that the metrics the customer usually monitors were fine. The RavenDB Cloud team looks at the entire system, so we started the investigation long before the problem warranted users’ attention.

I hate these sorts of issues because they are really hard to figure out and subject to basically every caveat under the sun. In this case, we basically had exactly nothing to go on. The workload was pretty consistent, and I/O, memory, and CPU usage were acceptable. There was no starting point to look at.

Those are also big machines, with hundreds of GB of RAM and running heavy workloads. These machines have great disks and a lot of CPU power to spare. What is going on here?

After a long while, we got a good handle on what is actually going on. When RavenDB starts, it creates memory maps of the files it is working with. Over time, RavenDB will map, unmap, and remap them as needed.
A process that has been running for a long while, with many databases and indexes operating, will have a lot of work done in terms of memory mapping.

In Linux, you can inspect those details by running:

    $ cat /proc/22003/smaps

    600a33834000-600a3383b000 r--p 00000000 08:30 214585    /data/ravendb/Raven.Server
    Size:                 28 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Rss:                  28 kB
    Pss:                  26 kB
    Shared_Clean:          4 kB
    Shared_Dirty:          0 kB
    Private_Clean:        24 kB
    Private_Dirty:         0 kB
    Referenced:           28 kB
    Anonymous:             0 kB
    LazyFree:              0 kB
    AnonHugePages:         0 kB
    ShmemPmdMapped:        0 kB
    FilePmdMapped:         0 kB
    Shared_Hugetlb:        0 kB
    Private_Hugetlb:       0 kB
    Swap:                  0 kB
    SwapPss:               0 kB
    Locked:                0 kB
    THPeligible:    0
    VmFlags: rd mr mw me dw
    600a3383b000-600a33847000 r-xp 00006000 08:30 214585    /data/ravendb/Raven.Server
    Size:                 48 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Rss:                  48 kB
    Pss:                  46 kB
    Shared_Clean:          4 kB
    Shared_Dirty:          0 kB
    Private_Clean:        44 kB
    Private_Dirty:         0 kB
    Referenced:           48 kB
    Anonymous:             0 kB
    LazyFree:              0 kB
    AnonHugePages:         0 kB
    ShmemPmdMapped:        0 kB
    FilePmdMapped:         0 kB
    Shared_Hugetlb:        0 kB
    Private_Hugetlb:       0 kB
    Swap:                  0 kB
    SwapPss:               0 kB
    Locked:                0 kB
    THPeligible:    0
    VmFlags: rd ex mr mw me dw

Here you can see the first page of entries from this file. Just starting up RavenDB (with no databases created) will generate close to 2,000 entries. The smaps virtual file can be invaluable for figuring out certain types of problems. In the snippet above, you can see that we have some executable memory ranges mapped, for example.

The problem is that over time, memory becomes fragmented, and we may end up with an smaps file that contains tens of thousands (or even hundreds of thousands) of entries.

Here is the result of running perf top on the system. You can see that the top three items that hog most of the resources are related to smaps accounting.

This file provides such useful information that we monitor it on a regular basis. It turns out that this can have… interesting effects.
Consider that while we are running the scan through all the memory mappings, we may need to change the memory mapping for the process. That leads to contention on the kernel locks that protect the mapping, of course.

It’s expensive to generate the smaps file

Reading from /proc/[pid]/smaps is not a simple file read. It involves the kernel gathering detailed memory statistics (e.g., memory regions, page size, resident/anonymous/shared memory usage) for each virtual memory area (VMA) of the process. For large processes with many memory mappings, this can be computationally expensive as the kernel has to gather the required information every time /proc/[pid]/smaps is accessed.

When /proc/[pid]/smaps is read, the kernel needs to access memory-related structures. This may involve taking locks on certain parts of the process’s memory management system. If this is done too often or across many large processes, it could lead to contention or slow down the process itself, especially if other processes are accessing or modifying memory at the same time.

If the number of memory mappings is high, and the interval between monitoring runs is short… I hope you can see where this is going. We effectively spent so much time running over this file that we blocked other operations.

This wasn’t an issue when we just started the process, because the number of memory mappings was small, but as we worked on the system and the number of memory mappings grew… we eventually started hitting contention.

The solution was two-fold. First, we made sure that there is only ever a single thread that would read the information from smaps (previously it might have been triggered from multiple locations). Second, we added some throttling to ensure that we aren’t hammering the kernel with requests for this file too often (returning cached information if needed), and we switched from using smaps to using smaps_rollup instead. The rollup version provides much better performance, since it deals with summary data only.

With those changes in place, we deployed to production and waited. The result was flat latency numbers and another item that the Cloud team could strike off the board successfully.
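The throttling idea above can be sketched in a few lines of C. This is an illustration under stated assumptions, not RavenDB's implementation: `parse_rss_kb` and `cached_rss_kb` are hypothetical names, only the Rss line is read, and the static cache assumes the function is only ever called from one monitoring thread (matching the single-reader rule described above).

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Scans a smaps-style stream for the "Rss:" line and returns its value in kB,
 * or -1 if no such line is found. */
long parse_rss_kb(FILE *f)
{
    assert(f != NULL);
    char line[256];
    long kb;
    while (fgets(line, sizeof line, f) != NULL) {
        if (sscanf(line, "Rss: %ld kB", &kb) == 1)
            return kb;
    }
    return -1;
}

/* Returns the process Rss in kB, re-reading /proc/self/smaps_rollup at most
 * once per min_interval_sec seconds and serving the cached value otherwise.
 * Not thread-safe: assumed to be called from a single monitoring thread.
 * Returns -1 if the file could never be read (e.g. on a non-Linux system). */
long cached_rss_kb(int min_interval_sec)
{
    static long cached = -1;
    static time_t last_read = 0;

    time_t now = time(NULL);
    if (cached >= 0 && now - last_read < min_interval_sec)
        return cached; /* throttled: do not bother the kernel again */

    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    if (f == NULL)
        return cached;
    long value = parse_rss_kb(f);
    fclose(f);

    if (value >= 0) {
        cached = value;
        last_read = now;
    }
    return cached;
}
```

A monitoring loop could call `cached_rss_kb(5)` as often as it likes; the kernel would only be asked to walk the mappings once every five seconds at most, and the rollup file means the result is a single summary entry instead of one entry per mapping.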

Meet the .NET Team at NDC London 2025

by Mehul Harry

posted on: January 16, 2025

Meet the .NET team at NDC London 2025 to explore the latest in .NET 9, Azure, and AI-powered development through keynotes, sessions, and 1:1 meetups.