Relatively General .NET

Optimizing non obvious costs

by Oren Eini

posted on: April 07, 2021

One of the "fun" aspects of running in the cloud is that certain assumptions you take for granted are broken, sometimes seriously so. Today's post is about an issue a customer ran into in the cloud. They were seeing some cases of high latency on operations from RavenDB. In the cloud, the usual answer is to provision more resources, but we generally recommend that only when we can show that the load is much higher than what the hardware is expected to handle.

The customer was running on a cluster with disks provisioned for 1,000 IOPS and 120 MB/sec. That isn't a huge amount, but it is certainly respectable. Looking at the load, we could see fairly constant writes, and the number of indexes was around 30. Looking at the disk, we could see that we were stalling there: the queue length was very high and the disk latency had a user-visible impact. All told, we would expect to see a significant amount of I/O operations as a result of that, but the fact that we hit the limits of the provisioned IOPS was worth a second look.

We started pulling at the details, and it became clear that there was something we could do about it. During indexing, we create some temporary files to store the Lucene segments before we commit them to the index. Each indexing run can create between four and six such files. When we create them, we do so with the DeleteOnClose flag; this is a flag that exists on Windows, but not on Linux. On Linux, we are running on ext4 with journaling enabled, which means that each file system metadata modification requires a journal write at the file system level. Those temporary files live for a very short amount of time, however. We delete them on close, after all, and the indexing run is very short.

6 files per index times 30 indexes means 180 files. Each one of those will be created and destroyed (generating a journal event each time), and there is a constant low volume of writes. That means there are 360 IOPS at the file system level just because of this issue.

The fix for that was twofold. First, for small files, under 128KB, we never hit the disk; we can keep them completely in memory. For larger files, we want to avoid using too much memory, so we spill them to disk, but instead of creating new files each time, we'll reuse them between indexing runs.

The end result is that we are issuing fewer I/O operations, reducing the number of trivial IOPS we consume, and can get a lot more done with the same hardware. The actual fix is fairly small and targeted, but the impact is felt across the entire system.
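To make the shape of that fix concrete, here is a minimal sketch in C# of the idea; this is not RavenDB's actual code, and the type name, threshold handling, and file handling details are illustrative assumptions:

using System;
using System.IO;

// Keeps small temporary output entirely in memory and only spills to a single,
// pre-created file that is reused across runs, avoiding the create/delete
// metadata churn (and the journal writes it triggers) described above.
public sealed class ReusableTempBuffer : IDisposable
{
    private const int SpillThreshold = 128 * 1024; // 128KB, as in the post

    private readonly MemoryStream _memory = new MemoryStream();
    private readonly FileStream _spillFile; // opened once, reused between runs
    private bool _spilled;

    public ReusableTempBuffer(string reusableFilePath)
    {
        _spillFile = new FileStream(reusableFilePath, FileMode.OpenOrCreate,
                                    FileAccess.ReadWrite, FileShare.None);
    }

    public void Write(byte[] buffer, int offset, int count)
    {
        if (!_spilled && _memory.Length + count > SpillThreshold)
        {
            // Grew past the threshold: copy what we have so far to the reused file.
            _spillFile.SetLength(0);
            _memory.Position = 0;
            _memory.CopyTo(_spillFile);
            _memory.SetLength(0);
            _spilled = true;
        }

        if (_spilled)
            _spillFile.Write(buffer, offset, count);
        else
            _memory.Write(buffer, offset, count);
    }

    public void Reset()
    {
        // Ready for the next indexing run; nothing is created or deleted.
        _memory.SetLength(0);
        _spillFile.SetLength(0);
        _spilled = false;
    }

    public void Dispose() => _spillFile.Dispose();
}

The point is not the buffering itself but that Reset() truncates and reuses the same file instead of deleting it and creating a new one, so the file system journal is not touched on every indexing run.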

Speed Up Docker Compose with Parallel Builds

by Ardalis

posted on: April 07, 2021

I've been using docker-compose quite a bit lately for a distributed app that includes 3 front end apps, 2 databases, RabbitMQ, and PaperCut…

Using raw html with isolated CSS in Blazor

by Gérald Barré

posted on: April 05, 2021

In a project I had to create something similar to a syntax highlighter for a Blazor application. The code takes a string and highlights some parts of it. So, the idea is to create a component which outputs a raw html fragment that uses <span class="style-1">text</span> to colorize some…
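As a rough illustration of the raw html part, here is a minimal sketch, not Barré's actual component; the component and parameter names are made up. A Blazor component can emit a pre-built HTML string as raw markup rather than encoded text:

using Microsoft.AspNetCore.Components;
using Microsoft.AspNetCore.Components.Rendering;

// Renders an already-highlighted HTML string, e.g. "<span class=\"style-1\">text</span>",
// as raw markup instead of HTML-encoding it.
public class HighlightedText : ComponentBase
{
    [Parameter] public string Html { get; set; } = "";

    protected override void BuildRenderTree(RenderTreeBuilder builder)
    {
        builder.AddMarkupContent(0, Html);
    }
}

How that interacts with Blazor's CSS isolation, which scopes styles to the component's own markup, is the subject of the full post.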

Building a phone book

by Oren Eini

posted on: April 02, 2021

In the past two posts in the series, I talked about ways to store phone book records in a file. During the candidate review process, I noticed that many candidates failed to make their lives significantly easier by placing limits on themselves. For example: using variable-length records, using a single file, or choosing a simple algorithm to do the whole task.

If we force fixed-length records, either directly or via record splitting (if each record is 64 bytes, a record that is bigger than that would reside in some multiple of the record size), the task becomes much easier. I've mostly ignored that in my code so far because I'm using binary offsets, but it can really make the code a lot simpler. Using a single file leads to complications, because you have to do internal space management (where do the records live, where is the metadata?). It also makes it much harder to recover used space in many cases. The last one is probably the most interesting limitation, and not something that I would expect a junior developer to figure out. Using a single option typically limits you to whatever a particular algorithm provides, but you can extend on that significantly.

Let's see another approach to building a persistent phone book. I'm going to effectively build an LSM here. You can see the code here. I called it a pretty horrible LSM (Log Structured Merge), but all the relevant pieces are there. It is just horribly inefficient. The key problem, by the way, is the number of times it will open a file handle. That can be really slow on Windows and end up being a real issue at any significant size. There are also probably a lot of other bugs there, but there is enough to figure out how this is actually built. And with this post, I can say that I explicitly scratched this itch.

A fun task to take this further, by the way, is to try to implement a persistent trie for the phone book.
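Coming back to the fixed-length record point: here is a small sketch of why that limit simplifies things. It is not taken from the posts in the series, and the names and layout are illustrative. Record number i always lives at byte offset i * RecordSize, so a lookup or update is a seek plus a single read or write:

using System;
using System.IO;
using System.Text;

public static class FixedLengthRecords
{
    public const int RecordSize = 64;

    public static void Write(FileStream file, long recordNumber, string name, string phone)
    {
        var record = new byte[RecordSize];
        var text = Encoding.UTF8.GetBytes(name + "\0" + phone);
        if (text.Length > RecordSize)
            throw new ArgumentException("Too big for one slot; would need record splitting.");
        text.CopyTo(record, 0);

        // Fixed-size slots make the offset a simple multiplication.
        file.Seek(recordNumber * RecordSize, SeekOrigin.Begin);
        file.Write(record, 0, RecordSize);
    }

    public static (string Name, string Phone) Read(FileStream file, long recordNumber)
    {
        var record = new byte[RecordSize];
        file.Seek(recordNumber * RecordSize, SeekOrigin.Begin);
        file.ReadExactly(record); // .NET 7+; reads exactly RecordSize bytes

        var parts = Encoding.UTF8.GetString(record).TrimEnd('\0').Split('\0');
        return (parts[0], parts[1]);
    }
}

With variable-length records, the same lookup needs an index or a scan, which is exactly the extra work the limits above let you avoid.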

Static methods considered evil?

by Vladimir Khorikov

posted on: April 01, 2021

Are static methods good or bad? Over the course of my career, I have come full circle on this topic. In this article, I'll try to describe this evolution and the reasoning behind it.

Unveiling Gavran: RavenDB re-written in C

by Oren Eini

posted on: April 01, 2021

RavenDB is written in C# and .NET, unlike most of the database engines out there. The other databases are mostly written in C, C++ and sometimes Java. I credit the fact that I wrote RavenDB in C# as a major part of the reason I was able to drive it forward to the point it is at today. That wasn't easy, and there are a number of issues that we had to struggle with as a result of that decision. And, of course, all the other databases at the playground look at RavenDB strangely for being written in C#.

In RavenDB 4.0, we have made a lot of architectural changes. One of them was to replace some of the core functionality of RavenDB with a C library to handle core operations. Here is what this looks like (rvn.h):

EXPORT int32_t rvn_allocate_more_space(int64_t new_length_after_adjustment, void *handle, void **new_address, int32_t *detailed_error_code);

EXPORT int32_t rvn_open_journal_for_writes(const char *file_name, int32_t transaction_mode, int64_t initial_file_size, int32_t durability_support, void **handle, int64_t *actual_size, int32_t *detailed_error_code);

EXPORT int32_t rvn_close_journal(void* handle, int32_t* detailed_error_code);

EXPORT int32_t rvn_write_journal(void* handle, void* buffer, int64_t size, int64_t offset, int32_t* detailed_error_code);

EXPORT int32_t rvn_open_journal_for_reads(const char *file_name, void **handle, int32_t *detailed_error_code);

However, there is still a lot to be done in this regard, and that is just a small core. Due to the COVID restrictions, I found myself with some time on my hands and decided that I could spare a few weekends to rewrite RavenDB from scratch in C. I considered using Rust, but that seemed to be over the top. The results of that can be seen here. I kept meticulous records of the process of building this, which I may end up publishing at some point. Here is an example of what the code looks like (btree.c):

static result_t btree_defrag(txn_t* tx, page_t* p) {
    void* buffer;
    ensure(txn_alloc_temp(tx, PAGE_SIZE, &buffer));
    memcpy(buffer, p->address, PAGE_SIZE);
    memset(p->address + p->metadata->tree.floor, 0, PAGE_SIZE - p->metadata->tree.floor);
    p->metadata->tree.ceiling = PAGE_SIZE;
    uint16_t* positions = p->address;
    size_t max_pos = p->metadata->tree.floor / sizeof(uint16_t);
    for (size_t i = 0; i < max_pos; i++) {
        uint64_t size;
        uint16_t cur_pos = positions[i];
        void* end = varint_decode(varint_decode(buffer + cur_pos, &size) + size, &size);
        if (p->metadata->tree.page_flags == page_flags_tree_leaf) {
            end++; // flags
        }
        uint16_t entry_size = (uint16_t)(end - (buffer + cur_pos));
        p->metadata->tree.ceiling -= entry_size;
        positions[i] = p->metadata->tree.ceiling;
        memcpy(p->address + p->metadata->tree.ceiling, buffer + cur_pos, entry_size);
    }
    return success();
}

The end result is that I was able to take all the knowledge of building and running RavenDB for so long and create a compatible system in not that much code.
When reading the code, you'll note methods like defer() and ensure(). I'm using compiler extensions and some macro magic to get much nicer language support for RAII. That is pretty awesome to do in C, even if I say so myself, and it has dramatically reduced the cognitive load of writing with manual memory management. And, of course, following my naming convention, Gavran is Raven in Croatian. I'll probably take some time to finish the actual integration, but I have very high hopes for the future of Gavran and its capabilities. I'm currently running benchmarks; you can expect them by May 35th.
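As a side note on the C library mentioned above: from the .NET side, exported functions like those in rvn.h are consumed through P/Invoke. Here is a minimal, hypothetical sketch; the native library name and wrapper class are assumptions, not RavenDB's actual interop code:

using System;
using System.Runtime.InteropServices;

internal static class Pal
{
    // Mirrors the rvn_close_journal export shown above; "rvnpal" is an assumed
    // native library name.
    [DllImport("rvnpal", CallingConvention = CallingConvention.Cdecl)]
    public static extern int rvn_close_journal(IntPtr handle, out int detailed_error_code);
}

// Usage: int rc = Pal.rvn_close_journal(handle, out var err);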

Working with the Enron dataset

by Oren Eini

posted on: March 31, 2021

Every now and then I need to do some work with text, and the Enron data set is one of the most well-known corpuses. I ended up writing the parsing code for it so many times that it isn't even funny. Therefore, I decided to make my life easier and just post it somewhere I can refer back to. This code simply unpacks the Enron dataset into a .NET object, from which you can start processing the text in interesting ways (enron-reader.cs):

/*
 Use:
    <PackageReference Include="MimeKitLite" Version="2.11.0" />
    <PackageReference Include="Newtonsoft.Json" Version="13.0.1" />
    <PackageReference Include="SharpCompress" Version="0.28.1" />
*/
using MimeKit;
using Newtonsoft.Json;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace EnronReader
{
    class Program
    {
        static void Main(string[] args)
        {
            // download from https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz
            var path = @"enron_mail_20150507.tar.gz";
            var tar = SharpCompress.Readers.Tar.TarReader.Open(File.OpenRead(path));
            while (tar.MoveToNextEntry())
            {
                if (tar.Entry.IsDirectory)
                    continue;

                using var s = tar.OpenEntryStream();
                var msg = MimeMessage.Load(s);
                var my = new Message
                {
                    Bcc = msg.Bcc?.Select(x => x.ToString()).ToList(),
                    Cc = msg.Cc.Select(x => x.ToString()).ToList(),
                    Date = msg.Date,
                    From = msg.From?.Select(x => x.ToString()).ToList(),
                    To = msg.To?.Select(x => x.ToString()).ToList(),
                    References = msg.References?.Select(x => x).ToList(),
                    ReplyTo = msg.ReplyTo?.Select(x => x.ToString()).ToList(),
                    Importance = msg.Importance,
                    InReplyTo = msg.InReplyTo,
                    MessageId = msg.MessageId,
                    Headers = msg.Headers?.GroupBy(x => x.Id).ToDictionary(
                        g => g.Key.ToHeaderName(),
                        g => g.Select(x => x.Value).ToList()
                    ),
                    Priority = msg.Priority,
                    Sender = msg.Sender?.ToString(),
                    Subject = msg.Subject,
                    TextBody = msg.GetTextBody(MimeKit.Text.TextFormat.Plain),
                    XPriority = msg.XPriority
                };
                var js = JsonConvert.SerializeObject(my, Formatting.Indented);
                Console.WriteLine(js);
            }
        }
    }

    public class Message
    {
        public MessagePriority Priority { get; set; }
        public XMessagePriority XPriority { get; set; }
        public string Sender { get; set; }
        public List<string> From { get; set; }
        public List<string> ReplyTo { get; set; }
        public List<string> To { get; set; }
        public List<string> Cc { get; set; }
        public MessageImportance Importance { get; set; }
        public string Subject { get; set; }
        public DateTimeOffset Date { get; set; }
        public List<string> References { get; set; }
        public string InReplyTo { get; set; }
        public string MessageId { get; set; }
        public string TextBody { get; set; }
        public List<string> Bcc { get; set; }
        public Dictionary<string, List<string>> Headers { get; set; }
    }
}