Relatively General .NET

High performance .NET

by Oren Eini

posted on: June 20, 2022

Now that I’m done with the low-hanging fruit, I decided to shift the Redis implementation to use System.IO.Pipelines. That is a high-performance I/O API meant specifically for servers that need to eke every bit of performance out of the system. The API is a bit different, but it follows a very logical pattern and makes a lot of sense. Here is the main loop of handling commands from a client:

    public async Task HandleConnection()
    {
        while (true)
        {
            var result = await _netReader.ReadAsync();
            var (consumed, examined) = ParseNetworkData(result);
            _netReader.AdvanceTo(consumed, examined);
            await _netWriter.FlushAsync();
        }
    }

The idea is that we get a buffer from the network, we read everything (including pipelined commands) and then we flush to the client. The more interesting things happen when we start processing the actual commands, because now we aren’t using StreamReader but PipeReader, so we are working at the level of bytes, not strings. Here is what this (roughly) looks like. I’m not showing the whole thing because I want to focus on the issue that I ran into:

    (SequencePosition Consumed, SequencePosition Examined) ParseNetworkData(ReadResult result)
    {
        var reader = new SequenceReader<byte>(result.Buffer);
        while (true)
        {
            _cmds.Clear();

            if (reader.TryReadTo(out ReadOnlySpan<byte> line, (byte)'\n') == false)
                return (reader.Consumed, reader.Position);

            if (line.Length == 0 || line[0] != '*' || line[line.Length - 1] != '\r')
                ThrowBadBuffer(result.Buffer);
            if (Utf8Parser.TryParse(line.Slice(1), out int argc, out int bytesConsumed) == false ||
                bytesConsumed + 2 != line.Length) // account for the * and \r
                ThrowBadBuffer(result.Buffer);

            for (int i = 0; i < argc; i++)
            {
                // **** redacted - reading cmd to _cmds buffer
            }

            ExecCommand(_cmds);
        }
    }

The code is reading from the buffer, parsing the Redis format and then executing the commands. It supports multiple commands in the same buffer (pipelining) and it has absolutely atrocious performance. Yes, the super speedy API that is significantly harder to get right (compared to the ease of working with strings) is far slower. And by far slower I mean the following, on my development machine:

The previous version clocks at around 126,017.72 operations per second.
This version clocks at less than 100 operations per second.

Yes, you read that right: less than one hundred operations per second, compared to over a hundred thousand for the unoptimized version. That was… surprising, as you can imagine. I actually wrote the implementation twice, using different approaches, trying to figure out what I was doing wrong. Surely, it can’t be that bad. I took a look at the profiler output, to try to figure out what is going on:

It says, quite clearly, that the implementation is super bad, no? Except that this is what you are supposed to be using. So what is going on? The underlying problem is actually fairly simple and relates to how the Pipelines API achieves its performance.
Instead of doing small calls, you are expected to get a buffer and process that. Once you are done processing the buffer, you indicate how much data you consumed, and then you can issue another call. However, there is a difference between consumed data and examined data. Consider the following data:

    *3
    $3
    SET
    $15
    memtier-2818567
    $256
    xxxxxxxxxx ... xxxxxx
    *2
    $3
    GET
    $15
    memtier-7689405
    *2
    $3
    GET
    $15
    memt

What you can see here is a pipelined command, with 335 bytes in the buffer. We’ll process all of those commands in a single hit, except… look at the highlighted portion (the final, incomplete command). What do we have there? We have a partial command. In other words, we are expected to execute a GET with a key size of 15 bytes, but we only have the first 4 bytes here. That is actually expected and fine. We consumed all the bytes until the highlighted portion (thus letting the PipeReader know that we are done with them).

The problem is that when we issue a call now, we’ll get the highlighted portion (which we didn’t consume), but we aren’t ready to process that. Data is missing. We indicate that to the PipeReader using the examined portion, so the PipeReader knows that it needs to read more from the network. However… my code has a subtle bug. It will report that it examined the yellow highlight, not the green one. In other words, we tell the PipeReader that we consumed some portion of the buffer, and examined some more, but there are still bytes in the buffer that are neither consumed nor examined. That means that when we issue the read call, expecting to get data from the network, we’ll actually get the same buffer again, to do the exact same processing. Eventually, we’ll have more data in the buffer from the other side, so the correctness of the solution isn’t impacted. But it will kill your performance.

The fix is really simple: we need to tell the PipeReader that we examined the entire buffer, so it will stop handing us the same data in a busy loop and will instead wait for more data from the network.
Here is the bug fix:

    9c9
    <                 return (reader.Consumed, reader.Position);
    ---
    >                 return (reader.Consumed, result.Buffer.End);

With that change in place, we can hit 187,104.21 operations per second! That is almost 50% better than the previous version, which is awesome. I haven’t properly profiled things yet, because I also want to address another issue: how are we going to deal with the data from the network? More on that in my next post.
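To make the consumed/examined distinction concrete, here is a minimal, self-contained sketch of the same pattern. It is not the server code from this post: `LineReaderSketch`, `ReadLinesAsync` and `ParseLines` are hypothetical names, and the sketch only splits the input on `\n` rather than parsing the Redis protocol. The part to look at is the `AdvanceTo` call, which reports the end of the last complete line as consumed and the end of the whole buffer as examined.

```csharp
using System.Buffers;
using System.Collections.Generic;
using System.IO.Pipelines;
using System.Text;
using System.Threading.Tasks;

// Hypothetical illustration, not the code from the post.
public static class LineReaderSketch
{
    // Reads '\n'-terminated lines from the pipe until the writer completes.
    public static async Task<List<string>> ReadLinesAsync(PipeReader reader)
    {
        var lines = new List<string>();
        while (true)
        {
            ReadResult result = await reader.ReadAsync();
            ReadOnlySequence<byte> buffer = result.Buffer;

            // Position just past the last complete line we managed to parse.
            SequencePosition consumed = ParseLines(buffer, lines);

            // consumed: everything up to the last complete line.
            // examined: the whole buffer, so the next ReadAsync waits for new
            // bytes instead of returning the same partial data immediately.
            reader.AdvanceTo(consumed, buffer.End);

            if (result.IsCompleted)
                break;
        }
        await reader.CompleteAsync();
        return lines;
    }

    // SequenceReader<T> is a ref struct, so it lives in a synchronous helper.
    private static SequencePosition ParseLines(ReadOnlySequence<byte> buffer, List<string> lines)
    {
        var sequenceReader = new SequenceReader<byte>(buffer);
        while (sequenceReader.TryReadTo(out ReadOnlySequence<byte> line, (byte)'\n'))
        {
            lines.Add(Encoding.UTF8.GetString(line.ToArray()));
        }
        // TryReadTo does not advance on failure, so this is the end of the last full line.
        return sequenceReader.Position;
    }
}
```

Passing `consumed` as the second argument as well would reproduce the behaviour described above whenever a partial line is left in the buffer: the next `ReadAsync` would return right away with the same bytes.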

Observing all http requests in a .NET application

by Gérald Barré

posted on: June 20, 2022

.NET provides multiple APIs to send http requests. You can use the HttpClient class and the obsolete HttpWebRequest and WebClient classes. Also, you may use libraries that send requests out of your control. So, you need to use the hooks provided by .NET to observe all http requests.

Creating, Inspecting and Decompiling the World’s (Nearly) Smallest C# Program

by Steve Gordon

posted on: June 15, 2022

In this post, I thought it might be fun to create the world’s (nearly) shortest C# program and then deep dive into some of the fine details of what happens behind the scenes. This post is not intended to solve a real-world problem but I hope it’s well worth your time spent reading it. By […]

The value of self contained diagnostics

by Oren Eini

posted on: June 14, 2022

I’m inordinately fond of the Fallacies of Distributed Computing, these are a set of common (false) assumptions that people make when building distributed systems, to their sorrow. Today I want to talk about one of those fallacies: There is one administrator. I like to add the term competent in there as well. A pretty significant amount of time in the development of RavenDB was dedicated to addressing that issue. For example, RavenDB has a lot of code and behavior around externalizing metrics. Both its own and the underlying system. That is a duplication of effort, surely. Let’s consider the simplest stuff, such as CPU, memory and I/O resource utilization. RavenDB makes sure to track those values, plot them in the user interface and expose that to external monitoring systems. All of those have better metrics sources. You can ask the OS directly about those details, and it will likely give you far better answers (with more details) than RavenDB can. There have been numerous times where detailed monitoring from the systems that RavenDB runs on was the thing that allowed us to figure out what is going on. Having the underlying hardware tell us in detail about its status is wonderful. Plug that into a monitoring system so you can see trends and I’m overjoyed. So why did we bother investing all this effort to add support for this to RavenDB? We would rather have the source data, not whatever we expose outside. RavenDB runs on a wide variety of hardware and software systems. By necessity, whatever we can provide is only a partial view. The answer to that is that we cannot assume that the administrator has set up such monitoring. Nor can we assume that they are able to. For example, the system may be running on a container in an environment where the people we talk to have no actual access to the host machine to pull production details. 
Having a significant investment in self-contained set of diagnostics means that we aren’t limited to whatever the admin has set up (and has the permissions to view) but have a consistent experience digging into issues. And since we have our own self contained diagnostics, we can push them out to create a debug package for offline analysis or even take active actions in response to the state of the system. If we were relying on external monitoring, we would need to integrate that, each and every time. The amount of work (and quality of the result) in such an endeavor is huge. We build RavenDB to last in production, and part of that is that it needs to be able to survive even outside of the hothouse environment.

A brief introduction to DiagnosticSource

by Andrew Lock

posted on: June 14, 2022

In this post I describe the DiagnosticSource infrastructure, how it compares to other logging APIs, and how to use it to listen to framework events…

High performance .NET

by Oren Eini

posted on: June 13, 2022

After achieving 1.25 million ops/sec, I decided to see what would happen if I changed the code to support pipelining. That ended up being quite involved, because I needed to both keep track of all the incoming work and send the work to multiple locations. The code itself is garbage, in my opinion. It is worth it only as far as it points me in the right direction in terms of the overall architecture. You can read it below, but it is a bit complex. We read from the client as much as we are able, then we send it to each of the dedicated threads to run it.

In terms of performance, it is actually slower than the previous iteration (by about 20%!), but it serves a very important purpose: it makes it easy to tell where the costs are. Take a look at the following profiler result:

You can see that we are spending a lot of time in I/O and in string processing. The GC time is also quite significant. Conversely, when we actually process the commands from the clients, we are spending most of the time simply idling. I want to tackle this in stages. The first part is to stop using strings all over the place. The next stage after that will likely be to change the I/O model. For now, here is where we stand:

    using System.Collections.Concurrent;
    using System.Net.Sockets;
    using System.Text;
    using System.Threading.Channels;

    var listener = new TcpListener(System.Net.IPAddress.Any, 6379);
    listener.Start();

    ShardedDictionary _state = new(Environment.ProcessorCount / 2);

    while (true)
    {
        var tcp = listener.AcceptTcpClient();
        var stream = tcp.GetStream();
        var client = new Client(tcp, new StreamReader(stream), new StreamWriter(stream)
        {
            AutoFlush = true
        }, _state);
        var _ = client.ReadAsync();
    }

    class Client
    {
        public readonly TcpClient Tcp;
        public readonly StreamReader Reader;
        public readonly StreamWriter Writer;
        public readonly ShardedDictionary Dic;

        public struct Command
        {
            public string Key;
            public string? Value;
            public bool Completed;
        }

        private List<string> _args = new();
        private Task<string?> _nextLine;
        private Command[] _commands = Array.Empty<Command>();
        private int _commandsLength = 0;
        private StringBuilder _buffer = new();
        private int _shardFactor;

        public Client(TcpClient tcp, StreamReader reader, StreamWriter writer, ShardedDictionary dic)
        {
            Tcp = tcp;
            Reader = reader;
            Writer = writer;
            Dic = dic;
            _shardFactor = dic.Factor;
        }

        public async Task ReadAsync()
        {
            try
            {
                while (true)
                {
                    // send any replies accumulated by the previous batch
                    if (_buffer.Length != 0)
                    {
                        await Writer.WriteAsync(_buffer);
                        _buffer.Length = 0;
                    }
                    var lineTask = _nextLine ?? Reader.ReadLineAsync();
                    if (lineTask.IsCompleted == false)
                    {
                        // no more input is immediately available:
                        // hand the pending commands over to the shard threads
                        if (_commandsLength != 0)
                        {
                            _nextLine = lineTask;
                            Dic.Enqueue(this, Math.Abs(_commands[0].Key.GetHashCode()) % _shardFactor);
                            return;
                        }
                    }
                    var line = await lineTask;
                    _nextLine = null;
                    if (line == null)
                    {
                        using (Tcp) // done reading...
                        {
                            return;
                        }
                    }

                    await ReadCommand(line);

                    AddCommand();
                }
            }
            catch (Exception e)
            {
                await HandleError(e);
            }
        }

        private async Task ReadCommand(string line)
        {
            _args.Clear();
            if (line[0] != '*')
                throw new InvalidDataException("Cannot understand arg batch: " + line);

            var argsv = int.Parse(line.Substring(1));
            for (int i = 0; i < argsv; i++)
            {
                line = await Reader.ReadLineAsync() ?? string.Empty;
                if (line[0] != '$')
                    throw new InvalidDataException("Cannot understand arg length: " + line);
                var argLen = int.Parse(line.Substring(1));
                line = await Reader.ReadLineAsync() ?? string.Empty;
                if (line.Length != argLen)
                    throw new InvalidDataException("Wrong arg length expected " + argLen + " got: " + line);

                _args.Add(line);
            }
        }

        private void AddCommand()
        {
            if (_commandsLength >= _commands.Length)
            {
                Array.Resize(ref _commands, _commands.Length + 8);
            }
            ref Command cmd = ref _commands[_commandsLength++];
            cmd.Completed = false;
            switch (_args[0])
            {
                case "GET":
                    cmd.Key = _args[1];
                    cmd.Value = null;
                    break;
                case "SET":
                    cmd.Key = _args[1];
                    cmd.Value = _args[2];
                    break;
                default:
                    throw new ArgumentOutOfRangeException("Unknown command: " + _args[0]);
            }
        }

        public async Task NextAsync()
        {
            try
            {
                WriteToBuffer();

                await ReadAsync();
            }
            catch (Exception e)
            {
                await HandleError(e);
            }
        }

        private void WriteToBuffer()
        {
            for (int i = 0; i < _commandsLength; i++)
            {
                ref Command cmd = ref _commands[i];
                if (cmd.Value == null)
                {
                    _buffer.Append("$-1\r\n");
                }
                else
                {
                    _buffer.Append($"${cmd.Value.Length}\r\n{cmd.Value}\r\n");
                }
            }
            _commandsLength = 0;
        }

        public async Task HandleError(Exception e)
        {
            using (Tcp)
            {
                try
                {
                    string? line;
                    var errReader = new StringReader(e.ToString());
                    while ((line = errReader.ReadLine()) != null)
                    {
                        await Writer.WriteAsync("-");
                        await Writer.WriteLineAsync(line);
                    }
                    await Writer.FlushAsync();
                }
                catch (Exception)
                {
                    // nothing we can do
                }
            }
        }

        // runs on the shard thread that owns localDic
        internal void Execute(Dictionary<string, string> localDic, int index)
        {
            int? next = null;
            for (int i = 0; i < _commandsLength; i++)
            {
                ref var cmd = ref _commands[i];
                var cur = Math.Abs(cmd.Key.GetHashCode()) % _shardFactor;
                if (cur == index) // match
                {
                    cmd.Completed = true;
                    if (cmd.Value != null)
                    {
                        localDic[cmd.Key] = cmd.Value;
                    }
                    else
                    {
                        localDic.TryGetValue(cmd.Key, out cmd.Value);
                    }
                }
                else if (cmd.Completed == false)
                {
                    next = cur;
                }
            }
            if (next != null)
            {
                // some commands belong to another shard, pass the client along
                Dic.Enqueue(this, next.Value);
            }
            else
            {
                _ = NextAsync();
            }
        }
    }

    class ShardedDictionary
    {
        Dictionary<string, string>[] _dics;
        BlockingCollection<Client>[] _workers;

        public int Factor => _dics.Length;

        public ShardedDictionary(int shardingFactor)
        {
            _dics = new Dictionary<string, string>[shardingFactor];
            _workers = new BlockingCollection<Client>[shardingFactor];

            for (int i = 0; i < shardingFactor; i++)
            {
                var dic = new Dictionary<string, string>();
                var worker = new BlockingCollection<Client>();
                _dics[i] = dic;
                _workers[i] = worker;
                var index = i;
                // readers
                new Thread(() =>
                {
                    ExecWorker(dic, index, worker);
                })
                {
                    IsBackground = true,
                }.Start();
            }
        }

        private static void ExecWorker(Dictionary<string, string> dic, int index, BlockingCollection<Client> worker)
        {
            while (true)
            {
                worker.Take().Execute(dic, index);
            }
        }

        public void Enqueue(Client c, int index)
        {
            _workers[index].Add(c);
        }
    }
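A side note on the routing expression used in the listing, `Math.Abs(key.GetHashCode()) % _shardFactor`: `Math.Abs(int.MinValue)` throws an `OverflowException`, so a key whose hash code happens to be `int.MinValue` would fault. The sketch below shows one alternative mapping that avoids the edge case by treating the hash as unsigned. It is an illustration only, not part of the code in this post; `ShardRouting` and `ShardFor` are hypothetical names.

```csharp
using System;

// Hypothetical helper, not part of the code in the post.
public static class ShardRouting
{
    // Maps any 32-bit hash code to a shard index in [0, shardCount).
    public static int ShardFor(int hashCode, int shardCount)
    {
        if (shardCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(shardCount));

        // Reinterpret the hash as unsigned so the remainder is never negative,
        // without calling Math.Abs (which throws for int.MinValue).
        return (int)(unchecked((uint)hashCode) % (uint)shardCount);
    }

    // Convenience overload for string keys.
    public static int ShardFor(string key, int shardCount)
        => ShardFor(key.GetHashCode(), shardCount);
}
```

This mapping assigns keys to different shards than the `Math.Abs` version does, so it would have to replace every routing call site (both in `ReadAsync` and in `Execute`) to keep a key on a single shard.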

Using Avif codec for images to reduce web page size

by Gérald Barré

posted on: June 13, 2022

This post is part of the series 'Web Performance'. Be sure to check out the rest of the blog posts of the series!
Website performance: Why and how to measure?
Website performance: How I've improved the performance of this website?
Using AV1 video codec to reduce web page size
Using Avif codec for image

High performance .NET

by Oren Eini

posted on: June 10, 2022

My previous attempts to write a Redis clone were done in about as straightforward a way as possible. Open a socket to listen on, have a separate Task for each client that reads from the network, parse the command and execute it. There are some smarts around supporting pipelining, but that is pretty much it.

Let’s take a step back and build ourselves a Redis clone that matches the actual Redis architecture more closely. In order to do that, I’ll need to do everything in a single thread. That is… surprisingly hard to do in C#. There are no APIs for doing the kind of work that Redis is doing. To be rather more exact, there is the Socket.Select() method, but that requires building everything on top of that (meaning that we have to handle buffering, string handling, etc.). Given that this is a way station to the final proposed architecture, I decided to skip this entirely. Instead, I’m going to focus first on removing the major bottleneck in the system, the ConcurrentDictionary.

The profiler results show that the biggest cost we have here is the scalability of the concurrent dictionary. Even when we tried to shard it across 1024 locks, it still took almost 50% of our runtime. The question is, can we do better? One good option that we can try is to shard things directly. Instead of using a single concurrent dictionary, we will split it into separate dictionaries, each of which is accessed without concurrency. The idea goes like this: we’ll have the usual read & write for the clients, but instead of processing the command inline, we’ll route it to a dedicated thread (with its own dictionary) to do the work. I set it so we’ll have 10 such threads (assuming they will reside on individual cores and that I’ll be able to process all I/O on the other 6 cores). Here are the results after the change:

    ============================================================================================================================
    Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
    ----------------------------------------------------------------------------------------------------------------------------
    Sets       113703.56          ---          ---         3.06261         0.95900        25.59900        39.93500     33743.38
    Gets      1137015.79     19211.78   1117804.01         3.06109         0.95900        25.59900        39.93500     49150.52
    Waits           0.00          ---          ---             ---             ---             ---             ---          ---
    Totals    1250719.35     19211.78   1117804.01         3.06122         0.95900        25.59900        39.93500     82893.90

Note that we are now at 1.25 million, almost 25% better than the previous run. Here are some profiler results of running this code:

So in this case, we are spending a lot of time doing string processing of various kinds and waiting for GC (almost 30%). The costs for collections went down a lot (but we’ll see that it shifted somewhat). There are some other things that come to mind, take a look here:

That is a surprising cost for a “simple” property lookup. The Substring calls are also expensive, over 6% of the overall runtime. When looking at other parts of the system, we have:

This is really interesting, because we spend a lot of time just waiting for items in the queue. We could probably do more things in there rather than just wait. I also tried various other concurrency values. With a single ExecWorker running, we have 404,187 ops/sec and with two of them we are at 715,157 ops/sec. When running with four threads dedicated to processing the requests, we are at 1,060,622.24 ops/sec. So it is obvious that we need to rethink this approach for concurrency. We aren’t able to properly scale to bigger values.

Note that this approach also does not take advantage of pipelining. We process each command separately from all else. My next move is to add support for pipelining with this approach and measure that impact. On the one hand, we are still at around the million mark. On the other hand, getting an extra 250,000 ops/second for very little time (and not a lot of complexity) is encouraging.
The profiler is also telling us that there are more things that we can do, but I want to focus on fixing the approach we take first. Here is the current state of the code, so you can compare it to the original one.

    using System.Collections.Concurrent;
    using System.Net.Sockets;
    using System.Threading.Channels;

    var listener = new TcpListener(System.Net.IPAddress.Any, 6379);
    listener.Start();

    var redisClone = new RedisClone();

    while (true)
    {
        var client = listener.AcceptTcpClient();
        var _ = redisClone.HandleConnection(client); // run async
    }

    public class RedisClone
    {
        ShardedDictionary _state = new(Environment.ProcessorCount / 2);

        public async Task HandleConnection(TcpClient tcp)
        {
            var _ = tcp;
            var stream = tcp.GetStream();
            var client = new Client
            {
                Tcp = tcp,
                Dic = _state,
                Reader = new StreamReader(stream),
                Writer = new StreamWriter(stream)
                {
                    NewLine = "\r\n"
                }
            };
            await client.ReadAsync();
        }
    }

    class Client
    {
        public TcpClient Tcp;
        public StreamReader Reader;
        public StreamWriter Writer;
        public string Key;
        public string? Value;

        public ShardedDictionary Dic;

        List<string> Args = new();

        public async Task ReadAsync()
        {
            try
            {
                Args.Clear();
                var lineTask = Reader.ReadLineAsync();
                // flush pending replies only when no further input is ready
                if (lineTask.IsCompleted == false)
                {
                    await Writer.FlushAsync();
                }
                var line = await lineTask;
                if (line == null)
                {
                    using (Tcp)
                    {
                        return;
                    }
                }
                if (line[0] != '*')
                    throw new InvalidDataException("Cannot understand arg batch: " + line);

                var argsv = int.Parse(line.Substring(1));
                for (int i = 0; i < argsv; i++)
                {
                    line = await Reader.ReadLineAsync();
                    if (line == null || line[0] != '$')
                        throw new InvalidDataException("Cannot understand arg length: " + line);
                    var argLen = int.Parse(line.Substring(1));
                    line = await Reader.ReadLineAsync();
                    if (line == null || line.Length != argLen)
                        throw new InvalidDataException("Wrong arg length expected " + argLen + " got: " + line);

                    Args.Add(line);
                }

                switch (Args[0])
                {
                    case "GET":
                        Key = Args[1];
                        Value = null;
                        break;
                    case "SET":
                        Key = Args[1];
                        Value = Args[2];
                        break;
                    default:
                        throw new ArgumentOutOfRangeException("Unknown command: " + Args[0]);
                }

                // hand the command over to a worker thread
                Dic.Run(this);
            }
            catch (Exception e)
            {
                await HandleError(e);
            }
        }

        public async Task NextAsync()
        {
            try
            {
                if (Value == null)
                {
                    await Writer.WriteLineAsync("$-1");
                }
                else
                {
                    await Writer.WriteLineAsync($"${Value.Length}\r\n{Value}");
                }

                await ReadAsync();
            }
            catch (Exception e)
            {
                await HandleError(e);
            }
        }

        public async Task HandleError(Exception e)
        {
            using (Tcp)
            {
                try
                {
                    string? line;
                    var errReader = new StringReader(e.ToString());
                    while ((line = errReader.ReadLine()) != null)
                    {
                        await Writer.WriteAsync("-");
                        await Writer.WriteLineAsync(line);
                    }
                    await Writer.FlushAsync();
                }
                catch (Exception)
                {
                    // nothing we can do
                }
            }
        }
    }

    class ShardedDictionary
    {
        Dictionary<string, string>[] _dics;
        BlockingCollection<Client>[] _workers;

        public ShardedDictionary(int shardingFactor)
        {
            _dics = new Dictionary<string, string>[shardingFactor];
            _workers = new BlockingCollection<Client>[shardingFactor];

            for (int i = 0; i < shardingFactor; i++)
            {
                var dic = new Dictionary<string, string>();
                var worker = new BlockingCollection<Client>();
                _dics[i] = dic;
                _workers[i] = worker;
                // readers
                new Thread(() =>
                {
                    ExecWorker(dic, worker);
                })
                {
                    IsBackground = true,
                }.Start();
            }
        }

        private static void ExecWorker(Dictionary<string, string> dic, BlockingCollection<Client> worker)
        {
            while (true)
            {
                var client = worker.Take();
                if (client.Value != null)
                {
                    dic[client.Key] = client.Value;
                    client.Value = null;
                }
                else
                {
                    dic.TryGetValue(client.Key, out client.Value);
                }
                var _ = client.NextAsync();
            }
        }

        public void Run(Client c)
        {
            // the worker is picked from the Client object's hash code, not from the key
            var reader = _workers[c.GetHashCode() % _workers.Length];
            reader.Add(c);
        }
    }
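To put the scaling figures quoted in this post in perspective, the short sketch below computes the speedup of each configuration relative to the single-worker run, and the speedup per worker. The throughput numbers are the ones reported above; `ScalingSketch` and its members are hypothetical names, and the 10-worker row assumes that the 1.25 million ops/sec result was measured with the 10 threads mentioned in the text.

```csharp
using System.Collections.Generic;

// Hypothetical helper for reasoning about the numbers in the post.
public static class ScalingSketch
{
    // (worker threads, measured ops/sec) as reported in the post.
    public static readonly (int Workers, double OpsPerSec)[] Runs =
    {
        (1, 404_187.0),
        (2, 715_157.0),
        (4, 1_060_622.24),
        (10, 1_250_719.35), // assumption: the 10-thread configuration
    };

    // Throughput relative to the single-worker run.
    public static double Speedup(double opsPerSec) => opsPerSec / Runs[0].OpsPerSec;

    // Speedup divided by worker count: 1.0 would be perfect linear scaling.
    public static double Efficiency(int workers, double opsPerSec) => Speedup(opsPerSec) / workers;

    public static IEnumerable<string> Report()
    {
        foreach (var (workers, ops) in Runs)
        {
            yield return $"{workers} worker(s): {Speedup(ops):F2}x speedup, " +
                         $"{Efficiency(workers, ops):F2} efficiency";
        }
    }
}
```

Under that assumption, two workers give roughly a 1.77x speedup over one, four give about 2.62x, and ten only about 3.09x, which is the diminishing return the text describes.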

High performance .NET

by Oren Eini

posted on: June 09, 2022

In the previous post, I wrote a small Redis clone in the most naïve manner. It was able to hit nearly 1M queries per second on our test instance (c6g.4xlarge, using 16 cores and 64 GB of memory). Before we get any deeper into optimization, it is worth understanding where the time is actually being spent. I ran the server under a profiler, to see the various costs.

I like using dotTrace as a profiler, in Tracing mode, since that gives me execution time as well as the number of calls. Often enough I can reason a lot about the system performance just from those details. Take a look at the following stats, this is the breakdown of costs in the actual processing of the connection:

And here it is when we break it up by

You can see that the cost of FlushAsync() dominates. I’m going to form a hypothesis here. When we call FlushAsync() on the StreamWriter, we’ll also flush to the underlying stream. Looking deeper into the call stack, it looks like we’ll need a separate packet per command at the TCP level. What will happen if we change the StreamWriter’s AutoFlush to true, which will cause it to write immediately to the underlying stream, but won’t call the flush on the TCP stream? That would allow the TCP stream to buffer writes more efficiently. The code change involved is removing the FlushAsync() calls and initializing the StreamWriter like so:

    using var writer = new StreamWriter(stream)
    {
        NewLine = "\r\n",
        AutoFlush = true,
    };

Let’s run the benchmark again, which will give us (on my development machine):

138,979.57 QPS – using AutoFlush = true
139,653.98 QPS – using FlushAsync

Either option is a wash, basically.
But here is why:

Basically, AutoFlush set to true will flush not just the current stream, but also the underlying stream, putting us in the same position. The problem is that we need to flush, otherwise we may buffer results in memory that won’t be sent to the client. Redis benchmarks rely heavily on pipelining (sending multiple commands at once), but it is entirely possible that you’ll get a bunch of commands, write them (to the buffer) and then not send anything to the client since the output buffer isn’t full. We can optimize this quite easily, using the following change:

    34c34,39
    <             var line = await reader.ReadLineAsync();
    ---
    >             var lineTask = reader.ReadLineAsync();
    >             if(lineTask.IsCompleted == false)
    >             {
    >                 await writer.FlushAsync();
    >             }
    >             var line = await lineTask;
    62d66
    <                 await writer.FlushAsync();

What I’m doing here is writing to the StreamWriter directly, and I’ll only flush the buffer if there is no more input waiting. That should reduce the number of packets we send significantly, and it does. Running the benchmark again gives us:

229,783.30 QPS – using delayed flushing

That is about 65% faster, which is impressive for such a small change. The idea is that we are able to buffer our writes far more, but not delay them too much. If we write enough to the StreamWriter buffer, it will flush itself automatically, and we’ll only actually flush the StreamWriter manually when we have nothing further to read, which we do in parallel with the reading itself. Here is the new cost structure:

And the actual methods called:

If we compare this to the first profiling results, we can find some really interesting numbers.
Before, we called FlushAsync once per command (see the ExecuteCommand & FlushAsync entries); now we call it a lot less often. You can see that most of the time is now in the “business logic” for this system, and from the subsystems breakdown, a lot of the cost is now in the collections. The GC costs here also went down significantly (~5%). I’m fairly certain that this is because we flush to the TCP stream, but I didn’t check too much. Note that string processing and GC take a lot of time, but the Collections / ExecuteCommand is taking the vast majority of the costs. If we look into that, we’ll see:

And that is… interesting. Mostly because the major costs are in TryAddInternal. We know that there is high contention in this scenario, but 92% of the time spent in the method directly? What is it doing? Looking at the code, it becomes obvious:

The ConcurrentDictionary is sharding the calls between the locks, and by default the number of locks is defined by the number of cores we have. The more concurrency we have, the more we can benefit from increasing that number. I tried setting this to 1024 and running it under the profiler, and this gave me a few percentage points of improvement, but not much more. Valuable, but not at the level we are playing with.

Even so, we managed to get some interesting details from this exploration. We know that we’ll have to deal with the dictionary implementation, since it takes roughly 50% of our time. I also want to pay some attention to these numbers:

Right now, we need to figure out how to make it faster in terms of collections, but we also have to consider overall GC costs as well as the string processing details. More on that in the next post.
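For reference, the lock count discussed above is controlled through the `ConcurrentDictionary` constructor that takes a `concurrencyLevel`. Here is a minimal sketch, assuming string keys and values as in the clone. `StoreFactory`, `Create` and the initial capacity of 1,024 entries are hypothetical choices for illustration, not values from the post.

```csharp
using System.Collections.Concurrent;

// Hypothetical factory, shown only to illustrate the constructor overload.
public static class StoreFactory
{
    // concurrencyLevel: the estimated number of threads updating the dictionary
    //                   concurrently, which determines how many locks it uses.
    // initialCapacity:  the initial number of elements (an assumed value).
    public static ConcurrentDictionary<string, string> Create(
        int concurrencyLevel = 1024,
        int initialCapacity = 1024)
    {
        return new ConcurrentDictionary<string, string>(concurrencyLevel, initialCapacity);
    }
}
```

Calling `StoreFactory.Create()` yields a dictionary striped across 1,024 locks, matching the experiment described above, whereas the parameterless constructor picks a default based on the number of processors.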