Relatively General .NET

Challenge

by Oren Eini

posted on: February 03, 2025

I’m trying to reason about the behavior of this code, and I can’t decide if this is a stroke of genius or if I’m suffering from a stroke. Take a look at the code, and then I’ll discuss what I’m trying to do below:

    HANDLE hFile = CreateFileA("R:/original_file.bin",
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("Error creating file: %d\n", GetLastError());
        exit(__LINE__);
    }

    HANDLE hMapFile = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
    if (hMapFile == NULL) {
        fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
        exit(__LINE__);
    }

    char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
    if (lpMapAddress == NULL) {
        fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
        exit(__LINE__);
    }

    for (size_t i = 2 * MB; i < 4 * MB; i++) {
        lpMapAddress[i]++;
    }

    HANDLE hDirect = CreateFileA("R:/original_file.bin",
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

    SetFilePointerEx(hDirect, (LARGE_INTEGER) { 6 * MB }, &fileSize, FILE_BEGIN);
    for (i = 6; i < 10; i++) {
        if (!WriteFile(hDirect, lpMapAddress + i * MB, MB, &bytesWritten, NULL)) {
            fprintf(stderr, "WriteFile direct failed on iteration %d: %x\n", i, GetLastError());
            exit(__LINE__);
        }
    }

The idea is pretty simple, I’m opening the same file twice. Once in buffered mode and mapping that memory for both reads & writes. The problem is that to flush the data to disk, I have to either wait for the OS, or call FlushViewOfFile() and FlushFileBuffers() to actually flush it to disk explicitly.

The problem with this approach is that FlushFileBuffers() has undesirable side effects. So I’m opening the file again, this time for unbuffered I/O. I’m writing to the memory map and then using the same mapping to write to the file itself. On Windows, that goes through a separate path (and may lose coherence with the memory map).

The idea here is that since I’m writing from the same location, I can’t lose coherence. I either get the value from the file or from the memory map, and they are both the same. At least, that is what I hope will happen.

For the purpose of discussion, I can ensure that there is no one else writing to this file while I’m abusing the system in this manner. What do you think Windows will do in this case?

I believe that when I’m writing using unbuffered I/O in this manner, I’m forcing the OS to drop the mapping and refresh from the disk. That is likely the reason why it may lose coherence, because there may be already reads that aren’t served from main memory, or something like that.

This isn’t an approach that I would actually take for production usage, but it is a damn interesting thing to speculate on. If you have any idea what will actually happen, I would love to have your input.
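
One practical detail behind the "unbuffered I/O" idea: the snippet above opens hDirect with FILE_ATTRIBUTE_NORMAL only, and a handle becomes unbuffered on Windows only when FILE_FLAG_NO_BUFFERING is passed. With that flag, the file offset and transfer length must be multiples of the volume sector size and the buffer address must be sector aligned. The sketch below is a portable check of those three rules. The helper name unbuffered_io_ok and the 4096-byte sector size are assumptions for illustration, not something taken from the post; a real program would query the sector size from the volume.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MB (1024u * 1024u)

/* Hypothetical helper: true when a transfer meets the alignment rules that
 * unbuffered (FILE_FLAG_NO_BUFFERING) I/O imposes. sector_size must be a
 * power of two; callers here pass an assumed 4096, not a queried value. */
static bool unbuffered_io_ok(const void *buf, uint64_t file_offset,
                             uint64_t length, uint32_t sector_size)
{
    assert(sector_size != 0 && (sector_size & (sector_size - 1)) == 0);
    uint64_t mask = (uint64_t)sector_size - 1;
    return ((uintptr_t)buf & mask) == 0   /* buffer address is aligned      */
        && (file_offset & mask) == 0      /* file offset is aligned         */
        && (length & mask) == 0           /* length is a sector multiple    */
        && length != 0;
}
```

Mapped views start on an allocation-granularity boundary (typically 64 KB), so addresses of the form lpMapAddress + i * MB with 1 MB transfers at i * MB offsets would pass this check, which is why writing straight from the mapping is at least mechanically possible.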

Migrate from MSTest to xUnit using a Roslyn analyzer

by Gérald Barré

posted on: February 03, 2025

Both MSTest and xUnit are great test frameworks. If you are curious about them, I've written many blog posts about them (the MSTest series and Quick introduction to xUnit.net). If you want to migrate to xUnit from MSTest, I've written a Roslyn Analyzer. This analyzer reports all MSTest attributes and assertions in you

NTFS has an emergency stash of disk space

by Oren Eini

posted on: January 31, 2025

I would really love to have a better understanding of what is going on here!

If you format a 32 MB disk using NTFS, you’ll get the following result:

So about 10 MB are taken for NTFS metadata. I guess that makes sense, and giving up 10 MB isn’t generally a big deal these days, so I wouldn’t worry about it.

I write a 20 MB file and punch a hole in it between 6 MB and 18 MB (12 MB in total), so we have:

And in terms of disk space, we have:

The numbers match, awesome! Let’s create a new 12 MB file, like so:

And the disk is:

And now I’m running the following code, which maps the first file (with the hole punched in it) and writes 4 MB to it using memory-mapped I/O:

    HANDLE hMapFile = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
    if (hMapFile == NULL) {
        fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
        exit(__LINE__);
    }

    char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
    if (lpMapAddress == NULL) {
        fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
        exit(__LINE__);
    }

    for (i = 6 * MB; i < 10 * MB; i++) {
        ((char*)lpMapAddress)[i]++;
    }

    if (!FlushViewOfFile(lpMapAddress, 0)) {
        fprintf(stderr, "Could not flush view of file: %x\n", GetLastError());
        exit(__LINE__);
    }

    if (!FlushFileBuffers(hFile)) {
        fprintf(stderr, "Could not flush file buffers: %x\n", GetLastError());
        exit(__LINE__);
    }

The end for this file is:

So with the other file, we have a total of 24 MB in use on a 32 MB disk. And here is the state of the disk itself:

The problem is that there used to be 9.78 MB that were busy when we had a newly formatted disk. And now we are using at least some of that disk space for storing file data somehow.

I’m getting the same behavior when I use normal file I/O:

    moveAmount.QuadPart = 6 * MB;
    SetFilePointerEx(hFile, moveAmount, NULL, FILE_BEGIN);
    for (i = 6; i < 10; i++) {
        if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
            fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
            exit(__LINE__);
        }
    }

So somehow in this sequence of operations, we get more disk space. On the other hand, if I try to write just 22 MB into a single file, it fails. See:

    hFile = CreateFileA("R:/original_file.bin",
        GENERIC_READ | GENERIC_WRITE, 0, NULL,
        CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("Error creating file: %d\n", GetLastError());
        exit(__LINE__);
    }

    for (int i = 0; i < 22; i++) {
        if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
            fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
            exit(__LINE__);
        }
    }

You can find the full source here. I would love to understand what exactly is happening and how we suddenly get more disk space usage in this scenario.
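
To make the accounting explicit, here is a small sketch of the arithmetic using only the figures quoted in this post. It assumes the volume is exactly 32.00 MB and that the 9.78 MB figure is the space in use right after formatting. The struct and function names are hypothetical, and all values are kept in hundredths of a MB so the integer math is exact.

```c
#include <assert.h>
#include <stddef.h>

/* All quantities are in hundredths of a MB (3200 == 32.00 MB). */
struct disk_accounting {
    long disk_size;          /* total volume size                      */
    long busy_at_format;     /* space already in use on a fresh format */
    long file_data;          /* space the files occupy afterwards      */
};

/* File data in the scenario from the post:
 * first file: 20 MB written, 12 MB hole punched, 4 MB written back;
 * second file: 12 MB. */
static long scenario_file_data(void)
{
    long first_file  = 2000 - 1200 + 400;
    long second_file = 1200;
    return first_file + second_file;
}

/* How much of the space that was busy at format time must now be holding
 * file data. Returns 0 when everything still fits side by side. */
static long taken_from_initial_reserve(const struct disk_accounting *d)
{
    assert(d != NULL);
    long needed = d->busy_at_format + d->file_data;
    return needed > d->disk_size ? needed - d->disk_size : 0;
}
```

Plugging in 3200, 978 and the 2400 returned by scenario_file_data() gives 178, so under these assumptions at least about 1.78 MB of the space that was busy after formatting ends up backing file data.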

What happens when a sparse file allocation fails?

by Oren Eini

posted on: January 29, 2025

Today I set out to figure out an answer to a very specific question. What happens at the OS level when you try to allocate disk space for a sparse file and there is no additional disk space?

Sparse files are a fairly advanced feature of file systems. They allow you to define a file whose size is 10GB, but that only takes 2GB of actual disk space. The rest is sparse (takes no disk space and on read will return just zeroes). The OS will automatically allocate additional disk space for you if you write to the sparse ranges.

This leads to an interesting question, what happens when you write to a sparse file if there is no additional disk space?

Let’s look at the problem on Linux first. We define a RAM disk with 32MB, like so:

    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=32M tmpfs /mnt/ramdisk

And then we write the following code, which does the following (on a disk with just 32MB):

- Create a file - write 32 MB to it
- Punch a hole of 8 MB in the file (range is 12MB - 20MB)
- Create another file - write 4 MB to it (there is now only 4MB available)
- Open the original file and try to write to the range with the hole in it (requiring additional disk space allocation)

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/random.h>

    #define MB (1024 * 1024)

    void write_all(int fd, const void *buf, size_t count)
    {
        size_t bytes_written = 0;
        const char *ptr = (const char *)buf;

        while (bytes_written < count)
        {
            ssize_t result = write(fd, ptr + bytes_written, count - bytes_written);
            if (result < 0)
            {
                if (errno == EINTR)
                    continue;
                fprintf(stderr, "Write error: errno = %d (%s)\n", errno, strerror(errno));
                exit(EXIT_FAILURE);
            }
            if (result == 0)
            {
                fprintf(stderr, "Zero len write is bad: errno = %d (%s)\n", errno, strerror(errno));
                exit(EXIT_FAILURE);
            }
            bytes_written += result;
        }
    }

    int main()
    {
        int fd;
        char buffer[MB];

        unlink("/mnt/ramdisk/fullfile");
        unlink("/mnt/ramdisk/anotherfile");

        getrandom(buffer, MB, 0);
        ssize_t bytes_written;

        fd = open("/mnt/ramdisk/fullfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd == -1)
        {
            fprintf(stderr, "open full file: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }
        for (int i = 0; i < 32; i++)
        {
            write_all(fd, buffer, MB);
        }
        close(fd);

        fd = open("/mnt/ramdisk/fullfile", O_RDWR);
        if (fd == -1)
        {
            fprintf(stderr, "reopen full file: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 12 * MB, 8 * MB) == -1)
        {
            fprintf(stderr, "fallocate failure: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }
        close(fd);

        fd = open("/mnt/ramdisk/anotherfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd == -1)
        {
            fprintf(stderr, "open another file: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }
        for (int i = 0; i < 4; i++)
        {
            write_all(fd, buffer, MB);
        }
        close(fd);

        // Write 8 MB to the hole in the first file
        fd = open("/mnt/ramdisk/fullfile", O_RDWR);
        if (fd == -1)
        {
            fprintf(stderr, "reopen full file 2: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }
        // Seek to the start of the hole
        if (lseek(fd, 12 * MB, SEEK_SET) == -1)
        {
            fprintf(stderr, "seek full file: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }
        for (int i = 0; i < 8; i++)
        {
            write_all(fd, buffer, MB);
        }
        close(fd);

        printf("Operations completed successfully.\n");
        return 0;
    }

As expected, this code will fail on the 5th write (since there is no disk space left to allocate on the disk). The error would be:

    Write error: errno = 28 (No space left on device)

Here is what the file system reports:

    $ du -h /mnt/ramdisk/*
    4.0M    /mnt/ramdisk/anotherfile
    28M     /mnt/ramdisk/fullfile

    $ ll -h /mnt/ramdisk/
    total 33M
    drwxrwxrwt 2 root   root     80 Jan  9 10:43 ./
    drwxr-xr-x 6 root   root   4.0K Jan  9 10:30 ../
    -rw-r--r-- 1 ayende ayende 4.0M Jan  9 10:43 anotherfile
    -rw-r--r-- 1 ayende ayende  32M Jan  9 10:43 fullfile

As you can see, we have a total of 32 MB of actual size reported, but ll is reporting that we actually have files bigger than that (because we have hole punching).

What would happen if we were to run this using memory-mapped I/O? Here is the code:

    // requires #include <sys/mman.h>
    fd = open("/mnt/ramdisk/fullfile", O_RDWR);

    char *mapped_memory = mmap(NULL, 32 * MB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mapped_memory == MAP_FAILED)
    {
        fprintf(stderr, "fail mmap: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }

    for (size_t i = (12 * MB); i < (20 * MB); i++)
    {
        mapped_memory[i] = 1;
    }
    munmap(mapped_memory, 32 * MB);
    close(fd);

This will lead to an interesting scenario. We need to allocate disk space for the memory, and we’ll do so (note that we are writing into the hole), and this code will fail with a segmentation fault.

It will fail in the loop, by the way, as part of the page fault to bring the memory in, the file system needs to allocate the disk space. If there is no such disk space, it will fail. The only way for the OS to behave in this case is to fail the write, which leads to a segmentation fault.

I also tried that on Windows. I defined a virtual disk like so:

    $ diskpart
    create vdisk file="D:\ramdisk.vhd" maximum=32
    select vdisk file="D:\ramdisk.vhd"
    attach vdisk
    create partition primary
    format fs=NTFS quick label=RAMDISK
    assign letter=R
    exit

This creates a 32MB disk and assigns it the letter R. Note that we are using NTFS, which has its own metadata, so we have roughly 21MB or so of usable disk space to play with here.

Here is the Windows code that simulates the same behavior as the Linux code above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <windows.h>

    #define MB (1024 * 1024)

    int main() {
        HANDLE hFile, hFile2;
        DWORD bytesWritten;
        LARGE_INTEGER fileSize, moveAmount;
        char* buffer = malloc(MB);
        int i;

        DeleteFileA("R:\\original_file.bin");
        DeleteFileA("R:\\another_file.bin");

        hFile = CreateFileA("R:/original_file.bin",
            GENERIC_READ | GENERIC_WRITE, 0, NULL,
            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE) {
            printf("Error creating file: %d\n", GetLastError());
            exit(__LINE__);
        }

        for (int i = 0; i < 20; i++) {
            if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
                fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
                exit(__LINE__);
            }
            if (bytesWritten != MB) {
                fprintf(stderr, "Failed to write full buffer on iteration %d\n", i);
                exit(__LINE__);
            }
        }

        FILE_ZERO_DATA_INFORMATION zeroDataInfo;
        zeroDataInfo.FileOffset.QuadPart = 6 * MB;
        zeroDataInfo.BeyondFinalZero.QuadPart = 18 * MB;

        if (!DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, NULL, NULL) ||
            !DeviceIoControl(hFile, FSCTL_SET_ZERO_DATA, &zeroDataInfo,
                sizeof(zeroDataInfo), NULL, 0, NULL, NULL)) {
            printf("Error setting zero data: %d\n", GetLastError());
            exit(__LINE__);
        }

        // Create another file of size 4 MB
        hFile2 = CreateFileA("R:/another_file.bin",
            GENERIC_READ | GENERIC_WRITE, 0, NULL,
            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile2 == INVALID_HANDLE_VALUE) {
            printf("Error creating second file: %d\n", GetLastError());
            exit(__LINE__);
        }

        for (int i = 0; i < 4; i++) {
            if (!WriteFile(hFile2, buffer, MB, &bytesWritten, NULL)) {
                fprintf(stderr, "WriteFile 2 failed on iteration %d: %x\n", i, GetLastError());
                exit(__LINE__);
            }
            if (bytesWritten != MB) {
                fprintf(stderr, "Failed to write full buffer 2 on iteration %d\n", i);
                exit(__LINE__);
            }
        }

        moveAmount.QuadPart = 12 * MB;
        SetFilePointerEx(hFile, moveAmount, NULL, FILE_BEGIN);
        for (i = 0; i < 8; i++) {
            if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
                printf("Error writing to file: %d\n", GetLastError());
                exit(__LINE__);
            }
        }

        return 0;
    }

And that gives us the exact same behavior as in Linux. One of these writes will fail because there is no more disk space for it. What about when we use memory-mapped I/O?

    HANDLE hMapFile = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
    if (hMapFile == NULL) {
        fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
        exit(__LINE__);
    }

    char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
    if (lpMapAddress == NULL) {
        fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
        exit(__LINE__);
    }

    for (i = 0; i < 20 * MB; i++) {
        ((char*)lpMapAddress)[i]++;
    }

That results in the expected access violation:

I didn’t bother checking Mac or BSD, but I’m assuming that they behave in the same manner. I can’t conceive of anything else that they could reasonably do.

You can find my full source here.
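
Because a memory-mapped write into a hole can only report "no space" by killing the process with a fatal signal, one way to get an ordinary error code on Linux is to ask the file system to back the range before touching the mapping. The sketch below does that with posix_fallocate(), which fills holes in the given range and returns the error number directly. The helper name reserve_range is hypothetical and this is not taken from the post's source; it is a minimal illustration of the idea.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make sure [offset, offset + len) is backed by real
 * disk space before the range is written through a shared mapping.
 * Returns 0 on success or an errno value such as ENOSPC. */
static int reserve_range(int fd, off_t offset, off_t len)
{
    assert(fd >= 0 && offset >= 0 && len > 0);
    int rc;
    do {
        /* posix_fallocate() returns the error number; it does not set errno */
        rc = posix_fallocate(fd, offset, len);
    } while (rc == EINTR);
    return rc;
}
```

In the scenario above, calling reserve_range(fd, 12 * MB, 8 * MB) before the mmap loop should surface ENOSPC as a return value that the program can handle, at the cost of giving up the lazy allocation that makes sparse files attractive in the first place.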

Configuration values & Escape hatches

by Oren Eini

posted on: January 27, 2025

RavenDB is meant to be a self-managing database, one that is able to take care of itself without constant hand-holding from the database administrator. That has been one of our core tenets from the get-go. Today I checked the current state of the codebase and we have roughly 500 configuration options that are available to control various aspects of RavenDB’s behavior.

These two statements are seemingly contradictory, because if we have so many configuration options, how can we even try to be self-managing? And how can a database administrator expect to juggle all of those options?

Database configuration is a really finicky topic. For example, RocksDB’s authors flat-out admit that out loud:

“Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments and benchmarking.”

And indeed, efforts were made to tune RocksDB using deep-learning models because it is that complex.

RavenDB doesn’t take that approach, tuning is something that should work out of the box, managed directly by RavenDB itself. Much of that is achieved by not doing things and carefully arranging that the environment will balance itself out in an optimal fashion. But I’ll talk about the Zen of RavenDB another time.

Today, I want to talk about why we have so many configuration options, the vast majority of which you, as a user, should neither use, care about, nor even know of.

The idea is very simple, deploying a database engine is a Big Deal, and as such, something that users are quite reluctant to do. When we hit a problem and a support call is raised, we need to provide some mechanism for the user to fix things until we can ensure that this behavior is accounted for in the default manner of RavenDB.

I treat the configuration options more as escape hatches that allow me to muddle through stuff than explicit options that an administrator is expected to monitor and manage. Some of those configuration options control whether RavenDB will utilize vectored instructions or the compression algorithm to use over the wire. If you need to touch them, it is amazing that they exist. If you have to deal with them on a regular basis, we need to go back to the drawing board.

Roslyn Annotations for Code Fix

by Gérald Barré

posted on: January 27, 2025

Roslyn analyzers can be used to detect patterns in your code and report them. They can also provide code fixes to automatically fix the issues. In this case, the code fix takes the existing SyntaxTree and returns the new SyntaxTree with the issue fixed. Roslyn does provide the concept of Annotation to m

Partial writes, IO_Uring and safety

by Oren Eini

posted on: January 24, 2025

In my previous post, I discussed how Linux will silently truncate a big write (> 2 GB) for you. That is expected by the interface of write(). The problem is that this behavior also applies when you use IO_Uring. Take a look at the following code:

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        return 1;
    }
    io_uring_prep_write(sqe, fd, buffer, BUFFER_SIZE, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        return 2;
    }

If BUFFER_SIZE is 3 GB, then this will write about 2 GB to the file. The number of bytes written is correctly reported, but the complexity this generates is huge. Consider the following function:

    int32_t rvn_write_io_ring(
        void *handle,
        int32_t count,
        struct page_to_write *buffers,
        int32_t *detailed_error_code);

There is a set of buffers that I want to write, and the natural way to do that is:

    int32_t rvn_write_io_ring(
        void *handle,
        int32_t count,
        struct page_to_write *buffers,
        int32_t *detailed_error_code)
    {
        struct handle *handle_ptr = handle;
        for (size_t i = 0; i < count; i++)
        {
            struct io_uring_sqe *sqe = io_uring_get_sqe(
                &handle_ptr->global_state->ring);
            io_uring_prep_write(sqe,
                handle_ptr->file_fd,
                buffers[i].ptr,
                buffers[i].count_of_pages * VORON_PAGE_SIZE,
                buffers[i].page_num * VORON_PAGE_SIZE
            );
        }
        return _submit_and_wait(&handle_ptr->global_state->ring,
            count, detailed_error_code);
    }

    int32_t _submit_and_wait(
        struct io_uring* ring,
        int32_t count,
        int32_t* detailed_error_code)
    {
        int32_t rc = io_uring_submit_and_wait(ring, count);
        if(rc < 0)
        {
            *detailed_error_code = -rc;
            return FAIL_IO_RING_SUBMIT;
        }

        struct io_uring_cqe* cqe;
        for(int i = 0; i < count; i++)
        {
            rc = io_uring_wait_cqe(ring, &cqe);
            if (rc < 0)
            {
                *detailed_error_code = -rc;
                return FAIL_IO_RING_NO_RESULT;
            }
            if(cqe->res < 0)
            {
                *detailed_error_code = -cqe->res;
                return FAIL_IO_RING_WRITE_RESULT;
            }
            io_uring_cqe_seen(ring, cqe);
        }
        return SUCCESS;
    }

In other words, send all the data to the IO Ring, then wait for all those operations to complete. We verify complete success and can then move on. However, because we may have a write that is greater than 2 GB, and because the interface allows the IO Uring to write less than we thought it would, we need to handle that with retries.

After thinking about this for a while, I came up with the following implementation:

    int32_t _submit_writes_to_ring(
        struct handle *handle,
        int32_t count,
        struct page_to_write *buffers,
        int32_t* detailed_error_code)
    {
        struct io_uring *ring = &handle->global_state->ring;
        off_t *offsets = handle->global_state->offsets;
        memset(offsets, 0, count * sizeof(off_t));

        while(true)
        {
            int32_t submitted = 0;

            for (size_t i = 0; i < count; i++)
            {
                off_t offset = offsets[i];
                if(offset == buffers[i].count_of_pages * VORON_PAGE_SIZE)
                    continue;

                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                if (sqe == NULL) // the ring is full, flush it...
                    break;

                io_uring_prep_write(sqe, handle->file_fd,
                    buffers[i].ptr + offset,
                    buffers[i].count_of_pages * VORON_PAGE_SIZE - offset,
                    buffers[i].page_num * VORON_PAGE_SIZE + offset);
                // tag the request after preparing it, using the 64-bit
                // setter that pairs with io_uring_cqe_get_data64() below
                io_uring_sqe_set_data64(sqe, i);

                submitted++;
            }

            if(submitted == 0)
                return SUCCESS;

            int32_t rc = io_uring_submit_and_wait(ring, submitted);
            if(rc < 0)
            {
                *detailed_error_code = -rc;
                return FAIL_IO_RING_SUBMIT;
            }

            struct io_uring_cqe *cqe;
            uint32_t head = 0;
            uint32_t i = 0;
            bool has_errors = false;

            io_uring_for_each_cqe(ring, head, cqe)
            {
                i++;
                uint64_t index = io_uring_cqe_get_data64(cqe);
                int result = cqe->res;
                if(result < 0)
                {
                    has_errors = true;
                    *detailed_error_code = -result;
                }
                else
                {
                    offsets[index] += result;
                    if(result == 0)
                    {
                        // there shouldn't be a scenario where we return 0
                        // for a write operation, we may want to retry here
                        // but figuring out if this is a single happening, or if
                        // we need to retry this operation (_have_ retried it?) is
                        // complex enough to treat this as an error for now.
                        has_errors = true;
                        *detailed_error_code = EIO;
                    }
                }
            }
            io_uring_cq_advance(ring, i);

            if(has_errors)
                return FAIL_IO_RING_WRITE_RESULT;
        }
    }

That is a lot of code, but it is mostly because of how C works. What we do here is scan through the buffers we need to write, as well as scan through an array of offsets that store additional information for the operation.

If the offset to write doesn’t indicate that we’ve written the whole thing, we’ll submit it to the ring and keep going until we either fill the entire ring or run out of buffers to work with. The next step is to submit the work and wait for it to complete, then run through the results, check for errors, and update the offset that we wrote for the relevant buffer.

Then, we scan the buffers array again to find either partial writes that we have to complete (we didn’t write the whole buffer) or buffers that we didn’t write at all because we filled the ring. In either case, we submit the new batch of work to the ring and repeat until we run out of work.

This code assumes that we cannot have a non-error state where we write 0 bytes to the file and treats that as an error. We also assume that an error in writing to the disk is fatal, and the higher-level code will discard the entire IO_Uring if that happens.

The Windows version, by the way, is somewhat simpler. Windows explicitly limits the size of the buffer you can pass to the write() call (and its IO Ring equivalent). It also ensures that it will write the whole thing, so partial writes are not an issue there.

It is interesting to note that the code above will effectively stripe writes if you send very large buffers. Let’s assume that we send it two large buffers (4 GB and 6 GB), like so:

                Offset    Size
    Buffer 1     1 GB     4 GB
    Buffer 2    10 GB     6 GB

The patterns of writes that will actually be executed are:

- 1 GB .. 3 GB, 10 GB .. 12 GB
- 3 GB .. 5 GB, 12 GB .. 14 GB
- 14 GB .. 16 GB

I can “fix” that by never issuing writes that are larger than 2 GB and issuing separate writes for each 2 GB range, but that leads to other complexities (e.g., tracking state if I split a write and hit the full ring status, etc.). At those sizes, it doesn’t actually matter in terms of efficiency or performance.

Partial writes are almost always a sign of either very large writes that were broken up or some underlying issue that is already problematic, so I don’t care that much about that scenario in general. For the vast majority of cases, this will always issue exactly one write for each buffer.

What is really interesting from my point of view, however, is how even a pretty self-contained feature can get pretty complex internally. On the other hand, this behavior allows me to push a whole bunch of work directly to the OS and have it send all of that to the disk as fast as possible.

In our scenarios, under load, we may call that with thousands to tens of thousands of pages (each 8 KB in size) spread all over the file. The buffers are actually sorted, so ideally, the kernel will be able to take advantage of that, but even if not, just reducing the number of syscalls will result in performance savings.
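
The striping pattern described earlier in this post can be reproduced without a ring at all. The sketch below is a plain C simulation of the retry loop: every round issues one write per unfinished buffer, and each write completes at most max_per_write bytes. The names (struct pending, struct issued, simulate_partial_writes) are hypothetical, the 2 GB cap is the post's round figure rather than the exact kernel limit, and the "ring is full" case is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GB (1024ull * 1024ull * 1024ull)

struct pending { uint64_t file_offset; uint64_t size; uint64_t done; };
struct issued  { int round; uint64_t start; uint64_t end; };

/* Simulates the retry loop: each round submits one write per unfinished
 * buffer, and each write completes at most max_per_write bytes.
 * Records up to log_cap writes in log; returns how many writes were issued. */
static size_t simulate_partial_writes(struct pending *bufs, size_t count,
                                      uint64_t max_per_write,
                                      struct issued *log, size_t log_cap)
{
    assert(max_per_write > 0);
    size_t issued = 0;
    for (int round = 1; ; round++) {
        size_t submitted = 0;
        for (size_t i = 0; i < count; i++) {
            uint64_t left = bufs[i].size - bufs[i].done;
            if (left == 0)
                continue;                  /* this buffer is fully written */
            uint64_t n = left < max_per_write ? left : max_per_write;
            uint64_t start = bufs[i].file_offset + bufs[i].done;
            if (issued < log_cap)
                log[issued] = (struct issued){ round, start, start + n };
            issued++;
            bufs[i].done += n;             /* the offsets[] update in the post */
            submitted++;
        }
        if (submitted == 0)
            return issued;                 /* nothing left to write */
    }
}
```

Feeding it the two buffers from the table (1 GB offset with 4 GB size, 10 GB offset with 6 GB size) and a 2 GB cap yields five writes over three rounds, in the same interleaved order as the list above.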

Answer

by Oren Eini

posted on: January 22, 2025

I previously asked what the code below does, and mentioned that it should give interesting insight into the kind of mindset and knowledge a candidate has. Take a look at the code again:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <sys/stat.h>

    #define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes

    int main() {
        int fd;
        char *buffer;
        struct stat st;

        buffer = (char *)malloc(BUFFER_SIZE);
        if (buffer == NULL) {
            return 1;
        }

        fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
        if (fd == -1) {
            return 2;
        }

        if (write(fd, buffer, BUFFER_SIZE) == -1) {
            return 3;
        }

        if (fsync(fd) == -1) {
            return 4;
        }

        if (close(fd) == -1) {
            return 5;
        }

        if (stat("large_file.bin", &st) == -1) {
            return 6;
        }

        printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));

        free(buffer);
        return 0;
    }

This program will output: File size: 2.00 GB

And it will write 2 GB of zeros to the file:

    ~$ head large_file.bin | hexdump -C
    00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    7ffff000

The question is why? And the answer is quite simple. Linux has a limitation of about 2 GB for writes to the disk (0x7ffff000 bytes, which is exactly where the hexdump above ends). Any write call that attempts to write more than that will only write that much, and you’ll have to call write() again for the rest. This is not an error, mind. The write call is free to write less than the size of the buffer you passed to it.

Windows has the same limit, but it is honest about it

In Windows, all write calls accept a 32-bit int as the size of the buffer, so this limitation is clearly communicated in the API. Windows will also ensure that for files, a WriteFile call that completes successfully writes the entire buffer to the disk.

And why am I writing 2 GB of zeros? In the code above, I’m using malloc(), not calloc(), so I wouldn’t expect the values to be zero. Because this is a large allocation, malloc() calls the OS to provide us with the buffer directly, and the OS is contractually obligated to provide us with zeroed pages.
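
Since write() is allowed to write less than it was asked to, the usual remedy is to loop until the buffer is exhausted. The following is a minimal sketch of such a loop; the name write_fully is hypothetical, and treating a zero-byte result as an error is a choice made here for simplicity rather than something the interface requires.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep calling write() until the whole buffer has been written.
 * Returns 0 on success, -1 on error (errno describes the failure). */
static int write_fully(int fd, const void *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    const char *p = buf;
    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing anything */
            return -1;
        }
        if (n == 0) {              /* no progress: treat as an error */
            errno = EIO;
            return -1;
        }
        p += n;                    /* skip what was already written */
        count -= (size_t)n;
    }
    return 0;
}
```

Replacing the single write(fd, buffer, BUFFER_SIZE) call in the program above with write_fully(fd, buffer, BUFFER_SIZE) would be expected to produce the full 3 GB file, with the kernel completing it in two calls.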