This post is part of the series 'Web Performance'. Be sure to check out the rest of the blog posts of the series!Website performance: Why and how to measure?Website performance: How I've improved the performance of this website?Using AV1 video codec to reduce web page sizeUsing Avif codec for image
My previous attempts to write a Redis clone were done in about as straightforward a way as possible. Open a socket to listen on, have a separate Task for each client that reads from the network, parse the command and execute it. There are some smarts around supporting pipelining, but that is pretty much it.
Let’s take a step back and build ourselves a Redis clone that matches the actual Redis architecture more closely. In order to do that, I’ll need to do everything in a single thread. That is… surprisingly hard to do in C#. There are no APIs for doing the kind of work that Redis is doing. To be rather more exact, there is the Socket.Select() method, but that requires building everything on top of that (meaning that we have to handle buffering, string handling, etc).
Given that this is a way station to the final proposed architecture, I decided to skip this entirely. Instead, I’m going to focus first on removing the major bottleneck in the system, the ConcurrentDictionary.
The profiler results show that the biggest cost we have here is the scalability of the concurrent dictionary. Even when we tried to shard it across 1024 locks, it still took almost 50% of our runtime. The question is, can we do better? One good option that we can try is to shard things directly. Instead of using a single concurrent dictionary, we will split it to separate dictionaries, each one of them would be accessed without concurrency.
The idea goes like this, we’ll have the usual read & write for the clients. But instead of processing the command inline, we’ll route it to a dedicated thread (with its own dictionary) to do the work. I set it so we’ll have 10 such threads (assuming they will reside on individual cores and that I’ll be able to process all I/O on the other 6 cores.
Here are the results after the change:
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 113703.56 --- --- 3.06261 0.95900 25.59900 39.93500 33743.38
Gets 1137015.79 19211.78 1117804.01 3.06109 0.95900 25.59900 39.93500 49150.52
Waits 0.00 --- --- --- --- --- --- ---
Totals 1250719.35 19211.78 1117804.01 3.06122 0.95900 25.59900 39.93500 82893.90
Note that we are now at 1.25 million, almost 25% better than the previous run.
Here are some profiler results of running this code:
So in this case, we are spending a lot of time doing string processing of various kinds, waiting for GC (almost 30%). The costs for collections went down a lot (but we’ll see that it shifted somewhat).
There are some other things that pop to mind, take a look here:
That is a surprising cost for a “simple” property lookup. The substrings calls are also expensive, over 6% of the overall runtime.
When looking at other parts of the system, we have:
This is really interesting, because we spend a lot of time just waiting for items in the queue. We could probably do more things in there rather than just wait.
I also tried various other concurrency values. With a single ExecWorker running, we have 404,187 ops/sec and with two of them we are at 715,157 ops/sec. When running with four threads dedicated to processing the requests, we are at 1,060,622.24 ops/sec.
So it is obvious that we need to rethink this approach for concurrency. We aren’t able to properly scale to bigger values.
Note that this approach also does not take advantage of pipelining. We process each command separately from all else. My next move is to add support for pipelining with this approach and measure that impact.
On the one hand, we are still at around the million mark, but given that I spent very little time (and not a lot of complexity) getting an extra 250,000 ops/second from that level of change is encouraging. The profiler is also telling us that there are more things that we can do, but I want to focus on fixing the approach we take first.
Here is the current state of the code, so you can compare it to the original one.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
using System.Collections.Concurrent;
using System.Net.Sockets;
using System.Threading.Channels;
var listener = new TcpListener(System.Net.IPAddress.Any, 6379);
listener.Start();
var redisClone = new RedisClone();
while (true)
{
var client = listener.AcceptTcpClient();
var _ = redisClone.HandleConnection(client); // run async
}
public class RedisClone
{
ShardedDictionary _state = new(Environment.ProcessorCount / 2);
public async Task HandleConnection(TcpClient tcp)
{
var _ = tcp;
var stream = tcp.GetStream();
var client = new Client
{
Tcp = tcp,
Dic = _state,
Reader = new StreamReader(stream),
Writer = new StreamWriter(stream)
{
NewLine = "\r\n"
}
};
await client.ReadAsync();
}
}
class Client
{
public TcpClient Tcp;
public StreamReader Reader;
public StreamWriter Writer;
public string Key;
public string? Value;
public ShardedDictionary Dic;
List<string> Args = new();
public async Task ReadAsync()
{
try
{
Args.Clear();
var lineTask = Reader.ReadLineAsync();
if (lineTask.IsCompleted == false)
{
await Writer.FlushAsync();
}
var line = await lineTask;
if (line == null)
{
using (Tcp)
{
return;
}
}
if (line[0] != '*')
throw new InvalidDataException("Cannot understand arg batch: " + line);
var argsv = int.Parse(line.Substring(1));
for (int i = 0; i < argsv; i++)
{
line = await Reader.ReadLineAsync();
if (line == null || line[0] != '$')
throw new InvalidDataException("Cannot understand arg length: " + line);
var argLen = int.Parse(line.Substring(1));
line = await Reader.ReadLineAsync();
if (line == null || line.Length != argLen)
throw new InvalidDataException("Wrong arg length expected " + argLen + " got: " + line);
Args.Add(line);
}
switch (Args[0])
{
case "GET":
Key = Args[1];
Value = null;
break;
case "SET":
Key = Args[1];
Value = Args[2];
break;
default:
throw new ArgumentOutOfRangeException("Unknown command: " + Args[0]);
}
Dic.Run(this);
}
catch (Exception e)
{
await HandleError(e);
}
}
public async Task NextAsync()
{
try
{
if (Value == null)
{
await Writer.WriteLineAsync("$-1");
}
else
{
await Writer.WriteLineAsync($"${Value.Length}\r\n{Value}");
}
await ReadAsync();
}
catch (Exception e)
{
await HandleError(e);
}
}
public async Task HandleError(Exception e)
{
using (Tcp)
{
try
{
string? line;
var errReader = new StringReader(e.ToString());
while ((line = errReader.ReadLine()) != null)
{
await Writer.WriteAsync("-");
await Writer.WriteLineAsync(line);
}
await Writer.FlushAsync();
}
catch (Exception)
{
// nothing we can do
}
}
}
}
class ShardedDictionary
{
Dictionary<string, string>[] _dics;
BlockingCollection<Client>[] _workers;
public ShardedDictionary(int shardingFactor)
{
_dics = new Dictionary<string, string>[shardingFactor];
_workers = new BlockingCollection<Client>[shardingFactor];
for (int i = 0; i < shardingFactor; i++)
{
var dic = new Dictionary<string, string>();
var worker = new BlockingCollection<Client>();
_dics[i] = dic;
_workers[i] = worker;
// readers
new Thread(() =>
{
ExecWorker(dic, worker);
})
{
IsBackground = true,
}.Start();
}
}
private static void ExecWorker(Dictionary<string, string> dic, BlockingCollection<Client> worker)
{
while (true)
{
var client = worker.Take();
if (client.Value != null)
{
dic[client.Key] = client.Value;
client.Value = null;
}
else
{
dic.TryGetValue(client.Key, out client.Value);
}
var _ = client.NextAsync();
}
}
public void Run(Client c)
{
var reader = _workers[c.GetHashCode() % _workers.Length];
reader.Add(c);
}
}
view raw
Redis.2.cs
hosted with ❤ by GitHub
In the previous post, I wrote a small Redis clone using the most naïve manner. It was able to hit nearly 1M queries per second on our test instance (c6g.4xlarge, using 16 cores and 64 GB of memory). Before we get any deeper into optimization, it is worth understanding where the time is actually being spent. I run the server under a profiler, to see the various costs.
I like using dotTrace as a profiler, while using the Tracing mode, since that gives me execution time as well as the number of calls. Often enough I can reason a lot about the system performance just from those details.
Take a look at the following stats, this is the breakdown of costs in the actual processing of the connection:
And here it is when we break it up by
You can see that the cost of FlushAsync() dominates. I’m going to form a hypothesis here. When we call FlushAsync() on the StreamWriter, we’ll also flush to the underlying stream. Looking deeper into the call stack that looks like we’ll need a separate packet per command at the TCP level.
What will happen if we’ll change the StreamWriter’s AutoFlush to true, which will cause it to write immediately to the underlying stream, but won’t call the flush on the TCP stream. That will allow the TCP stream to buffer writes more efficiently.
The code change involved is removing the FlushAsync() calls and initializing the StreamWiter like so:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
using var writer = new StreamWriter(stream)
{
NewLine = "\r\n",
AutoFlush = true,
};
view raw
AutoFlush.cs
hosted with ❤ by GitHub
Let’s run the benchmark again, which will give us (on my development machine):
138,979.57 QPS – using AutoFlush = true
139,653.98 QPS – using FlushAsync
Either option is a wash, basically. But here is why:
Basically, AutoFlush set to true will flush not just the current stream, but also the underlying stream, putting us in the same position.
The problem is that we need to flush, otherwise we may buffer results in memory that won’t be sent to the client. Redis benchmarks rely heavily on pipelining (sending multiple commands at once), but it is entirely possible that you’ll get a bunch of commands, write them (to the buffer) and then not send anything to the client since the output buffer isn’t full. We can optimize this quite easily, using the following change:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
34c34,39
< var line = await reader.ReadLineAsync();
---
> var lineTask = reader.ReadLineAsync();
> if(lineTask.IsCompleted == false)
> {
> await writer.FlushAsync();
> }
> var line = await lineTask;
62d66
< await writer.FlushAsync();
view raw
diff.patch
hosted with ❤ by GitHub
What I’m doing here is writing to the StreamWriter directly, and I’ll only flush the buffer if there is no more input waiting. That should reduce the number of packets we send significantly, and it does. Running the benchmark again gives us:
229,783.30 QPS – using delayed flushing
That is almost twice as fast, which is impressive, for such a small change. The idea is that we are able to buffer our writes far more, but not delay them too much. If we write enough to the StreamWriter buffer, it will flush itself automatically, and we’ll only actually flush the StreamWriter manually when we have nothing further to read, which we do in parallel with the reading itself.
Here is the new cost structure:
And the actual methods called:
If we’ll compare this to the first profiling results, we can find some really interesting numbers. Before, we have called FlushAsync per command (see the ExecuteCommand & FlushAsync), now we call this a lot less often).
You can see that most of the time is now in the “business logic” for this system, and from the subsystems breakdown, a lot of the cost is now in the collections.
The GC costs here also went down significantly (~5%). I’m fairly certain that this is because we flush to the TCP stream, but I didn’t check too much.
Note that string processing and GC take a lot of time, but the Collections / ExecuteCommand is taking the vast majority of the costs.
If we look into that, we’ll see:
And that is… interesting.
Mostly because the major costs are in TryAddInternal. We know that there is high contention in this scenario, but 92% of the time spent in the method directly? What is it doing? Looking at the code, it becomes obvious:
The ConcurrentDictionary is sharding the calls between the locks. And the number of locks is defined by the number of the cores we have by default. The more concurrency we have, the more we can benefit from increasing the amount. I tried setting this to 1024 and running it under the profiler, and this gave me a few percentage points improvements, but not much more. Valuable, but not at the level we are playing with.
Even so, we managed to get some interesting details from this exploration. We know that we’ll have to deal with the dictionary implementation, since it takes roughly 50% of our time. I also want to pay some attention to these numbers:
Right now, we need to figure out how to make it faster in terms of collections, but we also have to consider overall GC costs as well as the string processing details. More on that in the next post.
I run into this project, which aims to be a Redis clone with better performance and ease of use. I found it interesting because one of the main selling points there was that it is able to run in a multi threaded mode (instead of Redis’ single thread per process model). They use memtier_benchmark (part of Redis) to test their performance. I got curious about how much performance I could get out of the system if I built my own Redis clone in C#.
The first version I built was done pretty naively. The idea is to write it in a high level manner, and see where that puts us. To make things interesting, here are the test scenarios:
The memtier_benchmark is going to run on c6g.2xlarge instance, using 8 cores and 32 GB of memory.
The tested instance is going to run on c6g.4xlarge, using 16 cores and 64 GB of memory.
Both of those instances are running on the same availability zone.
The command I’m going to run is:
memtier_benchmark –s $SERVER_IP -t 8 -c 16 --test-time=30 --distinct-client-seed -d 256 --pipeline=30
What this says is that we’ll use 8 threads (number of cores on the client instance) with 32 connections per thread, we’ll use 20% writes & 80% reads with data size that is 256 bytes in size. In total, we’ll have 256 clients and out tests are going to continuously push more data into the system.
The server is being run using:
dotnet run –c Release
Here is an example of the server while under this test:
I chose 30 seconds for the test duration to balance doing enough work to feel what is going on (multiple GC cycles, etc) while keeping the test duration short enough that I won’t get bored.
Here are the naïve version results:
============================================================================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets 86300.19 --- --- 8.14044 0.92700 99.83900 196.60700 25610.97
Gets 862870.15 36255.57 826614.58 8.10119 0.91900 99.32700 196.60700 42782.42
Waits 0.00 --- --- --- --- --- --- ---
Totals 949170.34 36255.57 826614.58 8.10476 0.91900 99.32700 196.60700 68393.39
So the naïve version, using C#, doing almost nothing, is almost touching the 1 million queries / sec. The latency, on the other hand, isn’t that good. With the p99 at almost 100ms.
Now that I got your attention with the numbers and pretty graphs, let me show you the actual code that I'm running. This is a “Redis Clone” in under 100 lines of code.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
using System.Collections.Concurrent;
using System.Net.Sockets;
var listener = new TcpListener(System.Net.IPAddress.Any, 6379);
listener.Start();
var redisClone = new RedisClone();
while (true)
{
var client = listener.AcceptTcpClient();
var _ = redisClone.HandleConnection(client); // run async
}
public class RedisClone
{
ConcurrentDictionary<string, string> _state = new();
public async Task HandleConnection(TcpClient client)
{
using var _ = client;
using var stream = client.GetStream();
using var reader = new StreamReader(stream);
using var writer = new StreamWriter(stream)
{
NewLine = "\r\n"
};
try
{
var args = new List<string>();
while (true)
{
args.Clear();
var line = await reader.ReadLineAsync();
if (line == null) break;
if (line[0] != '*')
throw new InvalidDataException("Cannot understand arg batch: " + line);
var argsv = int.Parse(line.Substring(1));
for (int i = 0; i < argsv; i++)
{
line = await reader.ReadLineAsync();
if (line == null || line[0] != '$')
throw new InvalidDataException("Cannot understand arg length: " + line);
var argLen = int.Parse(line.Substring(1));
line = await reader.ReadLineAsync();
if (line == null || line.Length != argLen)
throw new InvalidDataException("Wrong arg length expected " + argLen + " got: " + line);
args.Add(line);
}
var reply = ExecuteCommand(args);
if(reply == null)
{
await writer.WriteLineAsync("$-1");
}
else
{
await writer.WriteLineAsync($"${reply.Length}\r\n{reply}");
}
await writer.FlushAsync();
}
}
catch (Exception e)
{
try
{
string? line;
var errReader = new StringReader(e.ToString());
while ((line = errReader.ReadLine()) != null)
{
await writer.WriteAsync("-");
await writer.WriteLineAsync(line);
}
await writer.FlushAsync();
}
catch (Exception)
{
// nothing we can do
}
}
}
string? ExecuteCommand(List<string> args)
{
switch (args[0])
{
case "GET":
return _state.GetValueOrDefault(args[1]);
case "SET":
_state[args[1]] = args[2];
return null;
default:
throw new ArgumentOutOfRangeException("Unknown command: " + args[0]);
}
}
}
view raw
Redis.Naive.cs
hosted with ❤ by GitHub
Just a few notes on the implementation. I’m not actually doing much. Most of the code is there to parse the Redis protocol. And the code is full of allocations. Each command parsing is done using multiple string splits and concats. Replies to the client require even more concats. The “store” for the system is actually just a simple ConcurrentDictionary, without anything to avoid contention or high costs.
The manner in which we handle I/O is pretty horrible, and… I think you get where I’m going here, right? My goal is to see how I can use this (pretty simple) example to get more performance without having to deal with a lot of extra fluff.
Given my initial attempt is already at nearly 1M QPS, that is a pretty good start, even if I say so myself.
The next step that I want to take it to handle the allocations that are going on here. We can probably do better here, and I aim to try. But I’ll do that in the next post.
Steven has written a post about unit testing RavenDB which shows how easy it is to just write the code you’ll do anyway, and be able to seamlessly and easily test that without any hassles whatsoever.
I do a fair bit of video editing as part of producing content for Pluralsight, clients, and YouTube. Recently I took on the task of editing…Keep Reading →
In this post I describe the difficulties of adding calls to UsePathBase with .NET 6 WebApplication programs, and describe two approaches to work around it…
CodingAfterWork has published a three parts series (more than 5 hours!) showing how you can build the Next Tech Event using Blazor and RavenDB. Take a look, it’s interesting.
This post is part of the series 'Web Performance'. Be sure to check out the rest of the blog posts of the series!Website performance: Why and how to measure?Website performance: How I've improved the performance of this website?Using AV1 video codec to reduce web page size (this post)Using Avif cod
The following code does not output the right value, can you tell why?
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
record Item(string Name, dynamic Extra = null);
static void Main(string[] args)
{
var obj = new Item("Milk")
{
Extra = new Dictionary<string, object>
{
["Skim"] = false,
["Viscosity"] = 39m
}
["Type"] = "Food",
};
Console.WriteLine(JsonConvert.SerializeObject(obj, Formatting.Indented));
}
view raw
strange.cs
hosted with ❤ by GitHub
This has been a real bug that we ran into (with only slightly more complicated code.
The sort of things that make you just stare at the screen in disbelief when you realize what is going on…