Relatively General .NET

High performance .NET

by Oren Eini

posted on: June 08, 2022

I ran into this project, which aims to be a Redis clone with better performance and ease of use. I found it interesting because one of the main selling points there was that it is able to run in a multi-threaded mode (instead of Redis' single thread per process model). They use memtier_benchmark (part of Redis) to test their performance. I got curious about how much performance I could get out of the system if I built my own Redis clone in C#. The first version I built was done pretty naively. The idea is to write it in a high level manner, and see where that puts us.

To make things interesting, here are the test scenarios:

- The memtier_benchmark is going to run on a c6g.2xlarge instance, using 8 cores and 32 GB of memory.
- The tested instance is going to run on a c6g.4xlarge, using 16 cores and 64 GB of memory.
- Both of those instances are running in the same availability zone.

The command I'm going to run is:

memtier_benchmark -s $SERVER_IP -t 8 -c 16 --test-time=30 --distinct-client-seed -d 256 --pipeline=30

What this says is that we'll use 8 threads (the number of cores on the client instance) with 32 connections per thread, and a mix of 20% writes & 80% reads with a data size of 256 bytes. In total, we'll have 256 clients, and our tests are going to continuously push more data into the system. The server is being run using:

dotnet run -c Release

I chose 30 seconds for the test duration to balance doing enough work to feel what is going on (multiple GC cycles, etc.) while keeping the test duration short enough that I won't get bored. Here are the naïve version results:

============================================================================================================================
Type           Ops/sec     Hits/sec   Misses/sec   Avg. Latency   p50 Latency   p99 Latency   p99.9 Latency     KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets          86300.19          ---          ---        8.14044       0.92700      99.83900       196.60700   25610.97
Gets         862870.15     36255.57    826614.58        8.10119       0.91900      99.32700       196.60700   42782.42
Waits             0.00          ---          ---            ---           ---           ---             ---        ---
Totals       949170.34     36255.57    826614.58        8.10476       0.91900      99.32700       196.60700   68393.39

So the naïve version, using C#, doing almost nothing, is almost touching 1 million queries / second. The latency, on the other hand, isn't that good, with the p99 at almost 100ms. Now that I got your attention with the numbers and pretty graphs, let me show you the actual code that I'm running. This is a "Redis Clone" in under 100 lines of code (Redis.Naive.cs):

using System.Collections.Concurrent;
using System.Net.Sockets;

var listener = new TcpListener(System.Net.IPAddress.Any, 6379);
listener.Start();
var redisClone = new RedisClone();
while (true)
{
    var client = listener.AcceptTcpClient();
    var _ = redisClone.HandleConnection(client); // run async
}

public class RedisClone
{
    ConcurrentDictionary<string, string> _state = new();

    public async Task HandleConnection(TcpClient client)
    {
        using var _ = client;
        using var stream = client.GetStream();
        using var reader = new StreamReader(stream);
        using var writer = new StreamWriter(stream)
        {
            NewLine = "\r\n"
        };
        try
        {
            var args = new List<string>();
            while (true)
            {
                args.Clear();
                var line = await reader.ReadLineAsync();
                if (line == null)
                    break;
                if (line[0] != '*')
                    throw new InvalidDataException("Cannot understand arg batch: " + line);
                var argsv = int.Parse(line.Substring(1));
                for (int i = 0; i < argsv; i++)
                {
                    line = await reader.ReadLineAsync();
                    if (line == null || line[0] != '$')
                        throw new InvalidDataException("Cannot understand arg length: " + line);
                    var argLen = int.Parse(line.Substring(1));
                    line = await reader.ReadLineAsync();
                    if (line == null || line.Length != argLen)
                        throw new InvalidDataException("Wrong arg length expected " + argLen + " got: " + line);
                    args.Add(line);
                }
                var reply = ExecuteCommand(args);
                if (reply == null)
                {
                    await writer.WriteLineAsync("$-1");
                }
                else
                {
                    await writer.WriteLineAsync($"${reply.Length}\r\n{reply}");
                }
                await writer.FlushAsync();
            }
        }
        catch (Exception e)
        {
            try
            {
                string? line;
                var errReader = new StringReader(e.ToString());
                while ((line = errReader.ReadLine()) != null)
                {
                    await writer.WriteAsync("-");
                    await writer.WriteLineAsync(line);
                }
                await writer.FlushAsync();
            }
            catch (Exception)
            {
                // nothing we can do
            }
        }
    }

    string? ExecuteCommand(List<string> args)
    {
        switch (args[0])
        {
            case "GET":
                return _state.GetValueOrDefault(args[1]);
            case "SET":
                _state[args[1]] = args[2];
                return null;
            default:
                throw new ArgumentOutOfRangeException("Unknown command: " + args[0]);
        }
    }
}

Just a few notes on the implementation. I'm not actually doing much. Most of the code is there to parse the Redis protocol (see the short framing sketch at the end of this post). And the code is full of allocations. Each command parsing is done using multiple string splits and concats. Replies to the client require even more concats. The "store" for the system is actually just a simple ConcurrentDictionary, without anything to avoid contention or high costs. The manner in which we handle I/O is pretty horrible, and… I think you get where I'm going here, right?

My goal is to see how I can use this (pretty simple) example to get more performance without having to deal with a lot of extra fluff. Given my initial attempt is already at nearly 1M QPS, that is a pretty good start, even if I say so myself. The next step that I want to take is to handle the allocations that are going on here. We can probably do better, and I aim to try. But I'll do that in the next post.
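As referenced above, here is a minimal client-side sketch to make the protocol framing concrete (this is not part of the original code; the server address, key, and value are made-up examples). It shows the RESP format the parser above expects: an array header (*N), then one length-prefixed bulk string per argument, each line terminated by \r\n.

// Illustration only: a hand-rolled RESP request against the clone above.
// "SET hello world" is framed as an array of three bulk strings.
using System.Net.Sockets;
using System.Text;

using var client = new TcpClient("127.0.0.1", 6379);
using var stream = client.GetStream();

// *3 = three arguments follow, each prefixed with $<length>
stream.Write("*3\r\n$3\r\nSET\r\n$5\r\nhello\r\n$5\r\nworld\r\n"u8);
stream.Write("*2\r\n$3\r\nGET\r\n$5\r\nhello\r\n"u8);

var buffer = new byte[256];
var read = stream.Read(buffer);
// The clone replies "$-1" to the SET (ExecuteCommand returns null for it)
// and "$5\r\nworld" to the GET; both replies may arrive in a single read.
Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, read));

Note that real Redis answers +OK to a successful SET; the $-1 here is simply what the clone produces, since ExecuteCommand returns null for that command.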

Unit testing RavenDB

by Oren Eini

posted on: June 07, 2022

Steven has written a post about unit testing RavenDB, which shows how easy it is to just write the code you'd write anyway and seamlessly test it, without any hassle whatsoever.

Quickly Trimming Video Files

by Ardalis

posted on: June 07, 2022

I do a fair bit of video editing as part of producing content for Pluralsight, clients, and YouTube. Recently I took on the task of editing…

Using AV1 video codec to reduce web page size

by Gérald Barré

posted on: June 06, 2022

This post is part of the series 'Web Performance'. Be sure to check out the rest of the blog posts of the series!
- Website performance: Why and how to measure?
- Website performance: How I've improved the performance of this website?
- Using AV1 video codec to reduce web page size (this post)
- Using Avif cod…

Challenge

by Oren Eini

posted on: June 03, 2022

The following code does not output the right value, can you tell why?

record Item(string Name, dynamic Extra = null);

static void Main(string[] args)
{
    var obj = new Item("Milk")
    {
        Extra = new Dictionary<string, object>
        {
            ["Skim"] = false,
            ["Viscosity"] = 39m
        }
        ["Type"] = "Food",
    };

    Console.WriteLine(JsonConvert.SerializeObject(obj, Formatting.Indented));
}

This has been a real bug that we ran into (with only slightly more complicated code). The sort of thing that makes you just stare at the screen in disbelief when you realize what is going on…

re

by Oren Eini

posted on: June 02, 2022

I ran into this fascinating blog post discussing the performance of BonsaiDb. To summarize the post, the author built a database and benchmarked it, but wasn't actually calling fsync() to ensure durability. When they added the fsync() call at the right time, the performance didn't degrade as expected. It turned out that they were running the benchmarks on tmpfs, which runs in memory and where fsync() has no impact. Once they tested on a real system, there was a much higher cost, to the point where the author is questioning whether to continue writing the database library he has developed.

Forgetting to call fsync() is an understandable issue (they called the flush() method, but that didn't translate to an fsync() call). I recall that at one point the C# API had a bug where a Flush() would not call fsync() if you were making writes with 4KB alignment (that is… super bad). After figuring out the issue, the author set out to figure out exactly how to ensure durability. That is a journey that is fraught with peril, and he has some really interesting tidbits there, including sync_file_range(), fdatasync(), and how you cannot write a durable database for Mac or iOS. From my perspective, you cannot really trust anything beyond O_DIRECT | O_DSYNC or fdatasync() for durability.

Almost a decade ago I wrote about performance testing that I did for various databases. My code was the 2nd fastest around for the tested scenarios. It was able to achieve almost 23,000 writes, almost 25% of the next slowest database. However, the fastest database around was Esent, which clocked in at 786,782 writes. I dug deep into how this is done and I realized that there is a fundamental difference between how all other databases were working and how Esent was handling things. All other databases issued fsync() calls (or fdatasync()), while Esent skipped that altogether. Instead, it opened the file with FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH (the Unix version is O_DIRECT | O_DSYNC). That change alone was responsible for a major performance difference. When using O_DIRECT | O_DSYNC, the write is sent directly to the persistent medium, skipping all buffers. That means that you don't have to flush anything else that is waiting to be written. If you are interested, I wrote a whole chapter on the topic of durable writes. It is a big topic.

The other thing that has a huge impact on performance is whether you are doing transaction merging or not. If you have multiple operations running at roughly the same time, are you going to do a separate disk write for each one of them, or will you be able to do that in a single write? The best example that I can think of is the notion of taking the bus. If you send a whole bus for each passenger, you'll waste a lot of time and fuel. If you pack the bus as much as possible, for almost the same cost, you'll get a far better bang. In other words, your design has to include a way for the database to coalesce such operations into a single write.

Yesterday there was an update to this task, which more or less followed that direction. The blog post covers quite a lot of ground and is going in the right direction, in my opinion. However, there are a few things there that I want to comment upon.

First, pre-allocation of disk space can have a huge impact on the overall performance of the system. Voron does that by allocating up to 1GB of disk space at a time, which dramatically reduces the amount of I/O work you have to do.
Just to give some context: without pre-allocation, a single disk write turns into multiple fsyncs that you have to issue, on both your file and the parent directory, on each write. That is insanely expensive. The storage engine discussed here uses an append-only mode, which makes this a bit of a problem, but not that much. You can still preallocate the disk space. You have to scan the file from the end on startup anyway, and the only risk here is the latency of reading the whole pre-allocated size on startup if we shut down immediately after the preallocation happened. It's not ideal, but it is good enough.

Second, the way you manage writes shouldn't rely on fsync and friends. That is what we have the log for, and you can get away with a lot by letting just the log handle the durability issue. The log is pre-allocated to a certain size (Voron uses dynamic sizes, with a max of 256MB) and written to using O_DIRECT | O_DSYNC each time. But because this is expensive, we have something like this (Python code, no error handling, demo code, etc.):

class WriteAheadLog:
    def __init__(self) -> None:
        self.queue = []
        self.hasWrites = threading.Event()
        self.log = open("write-ahead.log", O_DIRECT | O_DSYNC)
        th = threading.Thread(target=self.backgroundThread)
        th.start()

    def writeToLog(self, pages):
        future = loop.create_future()
        self.queue.append((pages, future))
        self.hasWrites.set()  # wake the background writer
        return future

    def backgroundThread(self):
        while self.hasWrites.wait():
            self.hasWrites.clear()
            pending = self.queue
            self.queue = []
            pagesToWrite = []
            for (pages, _) in pending:
                pagesToWrite = pagesToWrite + pages
            dataToWrite = CompressPages(pagesToWrite)
            self.log.write(dataToWrite)
            for (_, future) in pending:
                future.set_result(True)

The idea is that you can call writeToLog() each time and you'll get a future on the write to the log file. You can continue with your transaction once the log holds the data. Note that in this model, if you have concurrent writes, they will be merged into a single disk write (a C# sketch of the same idea follows at the end of this post). You can also benefit significantly from writing less to disk by applying compression.

Third, something that has been raised as an option here is a new storage format. I'm not sure that I 100% get what the intention is, but… what I understand, I don't like. I think that looking at how LMDB does things would help a lot here. It is a COW model (which the append-only model is very similar to). The key difference is that the new model is going to store every header twice, probably with a CURRENT and NEXT model, where you can switch between the two configurations. That… works, but it is pretty complex. Given that you have a log involved, there is no need for any of this. You can just store the most recent changes in the log and use that to find what the current version of the data is. I don't like append-only models, since they require you to do compaction at some point. A better model is what LMDB does, where it re-uses the same pages (with copy on write). It requires you to manage a free list of pages, of course, but that isn't that big a task.
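Going back to the write-merging idea, here is a minimal C# sketch of the same approach, under the same "demo code" caveats and with made-up names (this is not Voron's actual implementation). Concurrent writers queue their data and await a task, while a single background loop coalesces everything queued into one write against a log opened with FileOptions.WriteThrough (the managed mapping of FILE_FLAG_WRITE_THROUGH; FileStream does not expose FILE_FLAG_NO_BUFFERING directly, so that part is omitted):

// Sketch only: batches concurrent writes into a single write-through disk write.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public class MergingWriteAheadLog : IDisposable
{
    private readonly FileStream _log;
    private readonly BlockingCollection<(byte[] Data, TaskCompletionSource Tcs)> _queue = new();
    private readonly Thread _writer;

    public MergingWriteAheadLog(string path)
    {
        // WriteThrough asks the OS to push the write to the device, not just its cache
        _log = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read,
            bufferSize: 1, FileOptions.WriteThrough);
        _writer = new Thread(BackgroundWrites) { IsBackground = true };
        _writer.Start();
    }

    // Callers await the returned task; it completes once their data has been written.
    public Task WriteToLog(byte[] data)
    {
        var tcs = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
        _queue.Add((data, tcs));
        return tcs.Task;
    }

    private void BackgroundWrites()
    {
        foreach (var first in _queue.GetConsumingEnumerable())
        {
            // Merge whatever else arrived while we were busy into the same disk write
            var batch = new List<(byte[] Data, TaskCompletionSource Tcs)> { first };
            while (_queue.TryTake(out var next))
                batch.Add(next);

            foreach (var (data, _) in batch)
                _log.Write(data);
            _log.Flush();

            foreach (var (_, tcs) in batch)
                tcs.SetResult();
        }
    }

    public void Dispose()
    {
        _queue.CompleteAdding();
        _writer.Join();
        _log.Dispose();
    }
}

A transaction would then do something like await log.WriteToLog(pages); before acknowledging the commit. Under load, many such calls end up sharing a single disk write, which is exactly the bus-packing effect described above.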

Managing Up and Getting Ahead

by Ardalis

posted on: June 01, 2022

If you're an employee, you probably have a boss, manager, supervisor, or similarly titled person to whom you report. Normally, you figure it…

Understanding PathBase in ASP.NET Core

by Andrew Lock

posted on: May 31, 2022

In this post I'll describe a lesser-known property on HttpRequest called `PathBase`. I describe what it does, when it's useful, and how to use it…