Boost your dev workflow with the C# Dev Kit August 2024 release, which brings enhanced Razor IntelliSense, updated Project Status Bar, and new Project Configuration Options!
As a software developer, you might spend most of your time immersed in code, solving problems, and building innovative solutions. But have…Keep Reading →
Automated tests are important to ensure the quality of your application. In the literature, you'll find many kinds of automated tests such as unit tests, integration tests, functional tests, smoke tests, and so on. I don't like this naming because some tests don't fit in a category and people give
In this post I discus the recent pollyfill.io supply-chain attack and describe how to protect against similar attacks using Subresource Integrity (SRI)…
RavenDB has a hidden feature, enabled by default and not something that you usually need to be aware of. It has built-in support for caching. Consider the following code:async Task<Dictionary<string, int>> HowMuchWorkToDo(string userId)
{
using var session = _documentStore.OpenAsyncSession();
var results = await session.Query<Item>()
.GroupBy(x =>new { x.Status, x.AssignedTo })
.Where(g => g.Key.AssignedTo == userId && g.Key.Status != "Closed")
.Select(g => new
{
Status = g.Key.Status,
Count = g.Count()
})
.ToListAsync();
return results.ToDictionary(x => x.Status, x => x.Count);
}What happens if I call it twice with the same user? The first time, RavenDB will send the query to the server, where it will be evaluated and executed. The server will also send an ETag header with the response. The client will remember the response and its ETag in its own memory.The next time this is called on the same user, the client will again send a request to the server. This time, however, it will also inform the server that it has a previous response to this query, with the specified ETag. The server, when realizing the client has a cached response, will do a (very cheap) check to see if the cached response matches the current state of the server. If so, it can inform the client (using 304 Not Modified) that it can use its cache.In this way, we benefit twice:First, on the server side, we avoid the need to compute the actual query.Second, on the network side, we aren’t sending a full response back, just a very small notification to use the cached version.You’ll note, however, that there is still an issue. We have to go to the server to check. That means that we still pay the network costs. So far, this feature is completely transparent to the user. It works behind the scenes to optimize server query costs and network bandwidth costs.We have a full-blown article on caching in RavenDB if you care to know more details instead of just “it makes things work faster for me”.Aggressive Caching in RavenDBThe next stage is to involve the user. Enter the AggressiveCache() feature (see the full documentation here), which allows the user to specify an additional aspect. Now, when the client has the value in the cache, it will skip going to the server entirely and serve the request directly from the cache. What about cache invalidation? Instead of having the client check on each request if things have changed, we invert the process. The client asks the server to notify it when things change, and until it gets notice from the server, it can serve responses completely from the local cache.I really love this feature, that was the Good part, now let’s talk about the other pieces:There are only two hard things in Computer Science: cache invalidation and naming things.-- Phil KarltonThe bad part of caching is that this introduces more complexity to the system. Consider a system with two clients that are using the same database. An update from one of them may show up at different times in each. Cache invalidation will not happen instantly, and it is possible to get into situations where the server fails to notify the client about the update, meaning that we didn’t clear the cache.We have a good set of solutions around all of those, I think. But it is important to understand that the problem space itself is a problem. In particular, let’s talk about dealing with the following query:var emps = session.Query<Employee>()
.Include(x => x.Department)
.Where(x => x.Location.City == "London")
.ToListAsync();When an employee is changed on the server, it will send a notice to the client, which can evict the item from the cache, right? But what about when a department is changed? For that matter, what happens if a new employee is added to London? How do we detect that we need to refresh this query?There are solutions to those problems, but they are super complicated and have various failure modes that often require more computing power than actually running the query. For that reason, RavenDB uses a much simpler model. If the server notifies us about any change, we’ll mark the entire cache as suspect. The next request will have to go to the server (again with an ETag, etc) to verify that the response hasn’t changed. Note that if the specific query results haven’t changed, we’ll get OK (304 Not Modified) from the server, and the client will use the cached response.Conservatively aggressive approachIn other words, even when using aggressive caching, RavenDB still has to go to the server sometimes. What is the impact of this approach when you have a system under load?We’ll still use aggressive caching, but you’ll see brief periods where we aren’t checking with the server (usually be able to cache for about a second or so), followed by queries to the server to check for any changes.In most cases, this is what you want. We still benefit from the cache while reducing the number of remote calls by about 50%, and we don’t have to worry about missing updates. The downside is that, as application developers, we know that this particular document and query are independent, so we want to cache them until we get notice about that particular document being changed.The default aggressive caching in RavenDB will not be of major help here, I’m afraid. But there are a few things you can do.You can use Aggressive Caching in the NoTracking mode. In that mode, the client will not ask the server for notifications on changes, and will cache the responses in memory until they expire (clock expiration or size expiration only).There is also a feature suggestion that calls for updating the aggressive cache in a background manner, I would love to hear more feedback on this proposal. Another option is to take this feature higher than RavenDB directly, but still use its capabilities. Since we have a scenario where we know that we want to cache a specific set of documents and refresh the cache only when those documents are updated, let’s write it.Here is the code:public class RecordCache<T>
{
private ConcurrentLru<string, T> _items =
new(256, StringComparer.OrdinalIgnoreCase);
private readonly IDocumentStore _documentStore;
public RecordCache(IDocumentStore documentStore)
{
const BindingFlags Flags = BindingFlags.Instance |
BindingFlags.NonPublic | BindingFlags.Public;
var violation = typeof(T).GetFields(Flags)
.FirstOrDefault(f => f.IsInitOnly is false);
if (violation != null)
{
throw new InvalidOperationException(
"You should cache *only* immutable records, but got: " +
typeof(T).FullName + " with " + violation.Name +
" which is not read only!");
}
var changes = documentStore.Changes();
changes.ConnectionStatusChanged += (_, args) =>
{
_items = new(256, StringComparer.OrdinalIgnoreCase);
};
changes.ForDocumentsInCollection<T>()
.Subscribe(e =>
{
_items.TryRemove(e.Id, out _);
})
;
_documentStore = documentStore;
}
public ValueTask<T> Get(string id)
{
if (_items.TryGetValue(id, out var result))
{
return ValueTask.FromResult(result);
}
return new ValueTask<T>(GetFromServer(id));
}
private async Task<T> GetFromServer(string id)
{
using var session = _documentStore.OpenAsyncSession();
var item = await session.LoadAsync<T>(id);
_items.Set(id, item);
return item;
}
}There are a few things to note about this code. We are holding live instances, so we ensure that the values we keep are immutable records. Otherwise, we may hand the same instance to two threads which can be… fun. Note that document IDs in RavenDB are case insensitive, so we pass the right string comparer.Finally, the magic happens in the constructor. We register for two important events. Whenever the connection status of the Changes() connection is modified, we clear the cache. This handles any lost updates scenarios that occurred while we were disconnected.In practice, the subscription to events on that particular collection is where we ensure that after the server notification, we can evict the document from the cache so that the next request will load a fresh version.Caching + Distributed Systems = 🤯🤯🤯I’m afraid this isn’t an easy topic once you dive into the specifics and constraints we operate under. As I mentioned, I would love your feedback on the background cache refresh feature, or maybe you have better insight into other ways to address the topic.
Raspberry Pi's are very useful, but I find the LEDs very annoying. In this post, I describe how to disable all LEDs on a Rasberry PI.NoteTested on Raspberry Pi 4B and Raspberry Pi 3B+Connect to the Raspberry Pi with SSHOpen the file /boot/config.txt as admin using sudo nano /boot/config.txtAdd the
RavenDB is a pretty old codebase, hitting 15+ years in production recently. In order to keep it alive & well, we make sure to follow the rule of always leaving the code in a better shape than we found it. Today’s tale is about the StreamBitArray class, deep in the guts of Voron, RavenDB’s storage engine. The class itself isn’t really that interesting, it is just an implementation of a Bit Array that we have for a bitmap. We wrote it (based on Mono’s code, it looks like) very early in the history of RavenDB and have never really touched it since.The last time anyone touched it was 5 years ago (fixing the namespace), 7 years ago we created an issue from a TODO comment, etc. Most of the code dates back to 2013, actually. And even then it was moved from a different branch, so we lost the really old history. To be clear, that class did a full tour of duty. For over a decade, it has served us very well. We never found a reason to change it, never got a trace of it in the profiler, etc. As we chip away at various hurdles inside RavenDB, I ran into this class and really looked at it with modern sensibilities. I think that this makes a great test case for code refactoring from the old style to our modern one.Here is what the class looks like:Already, we can see several things that really bug me. That class is only used in one context, to manage the free pages bitmap for Voron. That means we create it whenever Voron frees a page. That can happen a lot, as you might imagine. A single bitmap here covers 2048 pages, so when we create an instance of this class we also allocate an array with 64 ints. In other words, we need to allocate 312 bytes for each page we free. That isn’t fun, and it actually gets worse. Here is a typical example of using this class:using (freeSpaceTree.Read(section, out Slice result))
{
sba = !result.HasValue ?
new StreamBitArray() :
new StreamBitArray(result.CreateReader());
}
sba.Set((int)(pageNumber % NumberOfPagesInSection), true);
using (sba.ToSlice(tx.Allocator, out Slice val))
freeSpaceTree.Add(section, val);And inside the ToSlice() call, we have:public ByteStringContext.InternalScope ToSlice(ByteStringContext context,
ByteStringType type, out Slice str)
{
var buffer = ToBuffer();
var scope = context.From(buffer, 0, buffer.Length,
type, out ByteString byteString);
str = new Slice(byteString);
return scope;
}
private unsafe byte[] ToBuffer()
{
var tmpBuffer = new byte[(_inner.Length + 1)*sizeof (int)];
unsafe
{
fixed (int* src = _inner)
fixed (byte* dest = tmpBuffer)
{
*(int*) dest = SetCount;
Memory.Copy(dest + sizeof (int), (byte*) src,
tmpBuffer.Length - 1);
}
}
return tmpBuffer;
}In other words, ToSlice() calls ToBuffer(), which allocates an array of bytes (288 bytes are allocated here), copies the data from the inner buffer to a new one (using fixed on the two arrays, which is a performance issue all in itself) and then calls a method to do the actual copy. Then in ToSlice() itself we allocate it again in native memory, which we then write to Voron, and then discard the whole thing. In short, somehow it turns out that freeing a page in Voron costs us ~1KB of memory allocations. That sucks, I have to say. And the only reasoning I have for this code is that it is old. Here is the constructor for this class as well:public StreamBitArray(ValueReader reader)
{
SetCount = reader.ReadLittleEndianInt32();
unsafe
{
fixed (int* i = _inner)
{
int read = reader.Read((byte*)i, _inner.Length * sizeof(int));
if (read < _inner.Length * sizeof(int))
throw new EndOfStreamException();
}
}
}This accepts a reader to a piece of memory and does a bunch of things. It calls a few methods, uses fixed on the array, etc., all to get the data from the reader to the class. That is horribly inefficient. Let’s write it from scratch and see what we can do. The first thing to notice is that this is a very short-lived class, it is only used inside methods and never held for long. This usage pattern tells me that it is a good candidate to be made into a struct, and as long as we do that, we might as well fix the allocation of the array as well. Note that I have a hard constraint, I cannot change the structure of the data on disk for backward compatibility reasons. So only in-memory changes are allowed. Here is my first attempt at refactoring the code:public unsafe struct StreamBitArray
{
private fixed uint _inner[64];
public int SetCount;
public StreamBitArray()
{
SetCount = 0;
Vector256<uint>.Zero.StoreUnsafe(ref _inner[0]);
Vector256<uint>.Zero.StoreUnsafe(ref _inner[8]);
Vector256<uint>.Zero.StoreUnsafe(ref _inner[16]);
Vector256<uint>.Zero.StoreUnsafe(ref _inner[24]);
Vector256<uint>.Zero.StoreUnsafe(ref _inner[32]);
Vector256<uint>.Zero.StoreUnsafe(ref _inner[40]);
Vector256<uint>.Zero.StoreUnsafe(ref _inner[48]);
Vector256<uint>.Zero.StoreUnsafe(ref _inner[56]);
}
public StreamBitArray(byte* ptr)
{
var ints = (uint*)ptr;
SetCount = (int)*ints;
var a = Vector256.LoadUnsafe(ref ints[1]);
var b = Vector256.LoadUnsafe(ref ints[9]);
var c = Vector256.LoadUnsafe(ref ints[17]);
var d = Vector256.LoadUnsafe(ref ints[25]);
var e = Vector256.LoadUnsafe(ref ints[33]);
var f = Vector256.LoadUnsafe(ref ints[41]);
var g = Vector256.LoadUnsafe(ref ints[49]);
var h = Vector256.LoadUnsafe(ref ints[57]);
a.StoreUnsafe(ref _inner[0]);
b.StoreUnsafe(ref _inner[8]);
c.StoreUnsafe(ref _inner[16]);
d.StoreUnsafe(ref _inner[24]);
e.StoreUnsafe(ref _inner[32]);
f.StoreUnsafe(ref _inner[40]);
g.StoreUnsafe(ref _inner[48]);
h.StoreUnsafe(ref _inner[56]);
}
}That looks like a lot of code, but let’s see what changes I brought to bear here. Using a struct instead of a class saves us an allocation.Using a fixed array means that we don’t have a separate allocation for the buffer.Using [SkipLocalsInit] means that we ask the JIT not to zero the struct. We do that directly in the default constructor.We are loading the data from the ptr in the second constructor directly.The fact that this is a struct and using a fixed array means that we can create a new instance of this without any allocations, we just need 260 bytes of stack space (the 288 we previously allocated also included object headers).Let’s look at the actual machine code that these two constructors generate. Looking at the default constructor, we have:StreamBitArray..ctor()
L0000: push ebp
L0001: mov ebp, esp
L0003: vzeroupper
L0006: xor eax, eax
L0008: mov [ecx+0x100], eax
L000e: vxorps ymm0, ymm0, ymm0
L0012: vmovups [ecx], ymm0
L0016: vmovups [ecx+0x20], ymm0
L001b: vmovups [ecx+0x40], ymm0
L0020: vmovups [ecx+0x60], ymm0
L0025: vmovups [ecx+0x80], ymm0
L002d: vmovups [ecx+0xa0], ymm0
L0035: vmovups [ecx+0xc0], ymm0
L003d: vmovups [ecx+0xe0], ymm0
L0045: vzeroupper
L0048: pop ebp
L0049: retThere is the function prolog and epilog, but the code of this method uses 4 256-bit instructions to zero the buffer. If we were to let the JIT handle this, it would use 128-bit instructions and a loop to do it. In this case, our way is better, because we know more than the JIT.As for the constructor accepting an external pointer, here is what this translates into:StreamBitArray..ctor(Byte*)
L0000: push ebp
L0001: mov ebp, esp
L0003: vzeroupper
L0006: mov eax, [edx]
L0008: mov [ecx+0x100], eax
L000e: vmovups ymm0, [edx+4]
L0013: vmovups ymm1, [edx+0x24]
L0018: vmovups ymm2, [edx+0x44]
L001d: vmovups ymm3, [edx+0x64]
L0022: vmovups ymm4, [edx+0x84]
L002a: vmovups ymm5, [edx+0xa4]
L0032: vmovups ymm6, [edx+0xc4]
L003a: vmovups ymm7, [edx+0xe4]
L0042: vmovups [ecx], ymm0
L0046: vmovups [ecx+0x20], ymm1
L004b: vmovups [ecx+0x40], ymm2
L0050: vmovups [ecx+0x60], ymm3
L0055: vmovups [ecx+0x80], ymm4
L005d: vmovups [ecx+0xa0], ymm5
L0065: vmovups [ecx+0xc0], ymm6
L006d: vmovups [ecx+0xe0], ymm7
L0075: vzeroupper
L0078: pop ebp
L0079: retThis code is exciting to me because we are also allowing instruction-level parallelism. We effectively allow the CPU to execute all the operations of reading and writing in parallel. Next on the chopping block is this method:public int FirstSetBit()
{
for (int i = 0; i < _inner.Length; i++)
{
if (_inner[i] == 0)
continue;
return i << 5 | HighestBitSet(_inner[i]);
}
return -1;
}
private static int HighestBitSet(int v)
{
v |= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
return MultiplyDeBruijnBitPosition[(uint)(v * 0x07C4ACDDU) >> 27];
}We are using vector instructions to scan 8 ints at a time, trying to find the first one that is set. Then we find the right int and locate the first set bit there. Here is what the assembly looks like:StreamBitArray.FirstSetBit()
L0000: push ebp
L0001: mov ebp, esp
L0003: vzeroupper
L0006: xor edx, edx
L0008: cmp [ecx], cl
L000a: vmovups ymm0, [ecx+edx*4]
L000f: vxorps ymm1, ymm1, ymm1
L0013: vpcmpud k1, ymm0, ymm1, 6
L001a: vpmovm2d ymm0, k1
L0020: vptest ymm0, ymm0
L0025: jne short L0039
L0027: add edx, 8
L002a: cmp edx, 0x40
L002d: jl short L000a
L002f: mov eax, 0xffffffff
L0034: vzeroupper
L0037: pop ebp
L0038: ret
L0039: vmovmskps eax, ymm0
L003d: tzcnt eax, eax
L0041: add eax, edx
L0043: xor edx, edx
L0045: tzcnt edx, [ecx+eax*4]
L004a: shl eax, 5
L004d: add eax, edx
L004f: vzeroupper
L0052: pop ebp
L0053: retIn short, the code is simpler, shorter, and more explicit about what it is doing. The machine code that is running there is much tighter. And I don’t have allocations galore. This particular optimization isn’t about showing better numbers in a specific scenario that I can point to. I don’t think we ever delete enough pages to actually see this in a profiler output in such an obvious way. The goal is to reduce allocations and give the GC less work to do, which has a global impact on the performance of the system.