In my previous post, I discussed a bug that came up in code review and sent me into near panic mode. Here is the issue:
public byte[] ComputeHash(string file)
{
    return (byte[])_cache.Get(file) ?? ComputeHashAndPutInCache(file);
}

private byte[] ComputeHashAndPutInCache(string file)
{
    byte[] hash = ArrayPool<byte>.Shared.Rent(32);
    HashTheFile(file, hash); // actual work that we are caching
    _cache.Set(file, hash, new MemoryCacheEntryOptions
    {
        Size = 32,
        PostEvictionCallbacks =
        {
            new PostEvictionCallbackRegistration
            {
                EvictionCallback = EvictionCallback
            }
        }
    });
    return hash;
}

private void EvictionCallback(object key, object value, EvictionReason reason, object state)
{
    ArrayPool<byte>.Shared.Return((byte[])value);
}
In order to understand this bug, you have to consider multiple concurrent threads at the same time. Look at the ComputeHashAndPutInCache() method, where we register an eviction callback for the item in the cache. When the item is evicted, we return its buffer to the buffer pool.
We want to avoid allocating memory, so that is certainly desirable, no? However, consider what happens when a thread in ComputeHash() gets a value from the cache, but before it has a chance to look at the value, the cache decides to evict it. At that point, the eviction callback runs.
We return the buffer to the buffer pool, where it may be rented again by something else, while the caller of ComputeHash() is still using that very buffer for other things. This is a somewhat convoluted use-after-free issue, basically.
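The race is easy to reproduce deterministically with a toy buffer pool (sketched here in Python for illustration; ToyBufferPool is a stand-in for ArrayPool<byte>.Shared, not the real API):

```python
class ToyBufferPool:
    """Stand-in for ArrayPool<byte>.Shared: recycles returned buffers."""
    def __init__(self):
        self._free = []

    def rent(self, size):
        # Reuse a returned buffer if available, otherwise allocate.
        return self._free.pop() if self._free else bytearray(size)

    def give_back(self, buf):
        self._free.append(buf)


pool = ToyBufferPool()

# Thread A: fetches the cached hash value.
hash_buf = pool.rent(4)
hash_buf[:] = b"\x01\x02\x03\x04"  # the "cached" hash bytes

# Meanwhile, the cache evicts the entry; the eviction callback
# returns the very same buffer to the pool...
pool.give_back(hash_buf)

# ...and some unrelated code rents it and writes its own data into it.
other = pool.rent(4)
other[:] = b"\xff\xff\xff\xff"

# Thread A still holds a reference, but the bytes changed underneath it:
# the classic use-after-free symptom, with no crash to point a finger at.
print(bytes(hash_buf))
```

The key observation is that `other` and `hash_buf` are the same object, so the reader silently sees someone else's data.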
And I find this a super scary bug because of its effects:
Randomly, and rarely, some buffer will contain the wrong data, leading to wrong results (which are very hard to track down).
Trying to find such a bug after the fact, however, is nearly impossible.
Most debugging techniques (repeating the operation for a particular value) will make it go away (the cache will keep the value and not evict it).
In short, this is a recipe for a really nasty debug session and an impossible to resolve bug. All from code that looks very much like an innocent bystander.
Now, I can obviously fix it by not using the array pool, but that may cause me to allocate more memory than I should. How would you approach fixing this issue?
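One possible direction, sketched here in Python for illustration (the class, the list-based pool, and the method names are all hypothetical, not the ArrayPool API): reference count the pooled buffer, so it only goes back to the pool once the cache and every reader are done with it.

```python
import threading

class RefCountedBuffer:
    """Sketch of a fix: only recycle the buffer when nobody holds it.
    The cache entry holds one reference; each reader takes another."""

    def __init__(self, buf, pool):
        self._buf = buf
        self._pool = pool           # a plain list standing in for the pool
        self._refs = 1              # the cache entry itself holds one reference
        self._lock = threading.Lock()

    def try_acquire(self):
        # A reader takes a reference before touching the bytes.
        with self._lock:
            if self._refs == 0:
                return None         # already released; caller must recompute
            self._refs += 1
            return self._buf

    def release(self):
        # Called by each reader when done, and once by the eviction callback.
        with self._lock:
            self._refs -= 1
            if self._refs == 0:
                self._pool.append(self._buf)  # only now is recycling safe
```

With this shape, the eviction callback calls release() instead of returning the buffer directly, and a reader that loses the race to eviction (try_acquire() returns None) simply recomputes the hash. The even simpler alternative is to copy the 32 bytes out on every cache hit, trading one small allocation per call for complete safety.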
I was writing a fairly low level piece of code recently, deep in the guts of how RavenDB handles queries. I'm adding a cache for some processing inside of RavenDB, and the performance benefits are amazing (three orders of magnitude better). As you can imagine, this is the sort of thing that I would really like to get into the main codebase. So I paid all the usual taxes, created a pull request and moved on to other things. Part of our process is to have every pull request reviewed by another pair of eyes. And while there were some minor comments about the code, there was one comment, asking about a particular line, that had me break out in a cold sweat.
I created a simpler reproduction for discussion purposes, here is the code:
public class FileHasher
{
    private MemoryCache _cache;

    public FileHasher()
    {
        _cache = new MemoryCache(new MemoryCacheOptions
        {
            SizeLimit = 1024
        });
    }

    public byte[] ComputeHash(string file)
    {
        return (byte[])_cache.Get(file) ?? ComputeHashAndPutInCache(file);
    }

    private byte[] ComputeHashAndPutInCache(string file)
    {
        byte[] hash = ArrayPool<byte>.Shared.Rent(32);
        HashTheFile(file, hash); // actual work that we are caching
        _cache.Set(file, hash, new MemoryCacheEntryOptions
        {
            Size = 32,
            PostEvictionCallbacks =
            {
                new PostEvictionCallbackRegistration
                {
                    EvictionCallback = EvictionCallback
                }
            }
        });
        return hash;
    }

    private void EvictionCallback(object key, object value, EvictionReason reason, object state)
    {
        ArrayPool<byte>.Shared.Return((byte[])value);
    }

    private static void HashTheFile(string file, byte[] hash)
    {
        // let's pretend to hash the file here
        // this may take some time, but will be cached the 2nd time around
        // concurrent requests may cause duplicated work, don't care about that
        var _ = file;
        Random.Shared.NextBytes(hash);
    }
}
Take a look, and let me know if you see the bug and its consequences.
For most developers, Behavior Driven Development is a synonym for the Given-When-Then syntax and the Cucumber-like frameworks that support it. In this session, we will step back from DSLs like Gherkin and show that BDD in its essence is about a mental approach to writing software based on a specification that describes business scenarios. Additionally, we will use RavenDB to provide an easy and effortless way to implement integration tests based on the BDD approach. Finally, we will show how the BDD approach can be introduced into your projects without the need to learn any new frameworks.
Another code review comment, this time about an error message being raised. The comment I made here is that this error message lacks a very important detail: how do I fix this? In this case, the maximum number of connections is controlled by a configuration setting, so telling that to the user, as well as the actual configuration entry to use, is paramount. If we give them all the information, we save the support call that would follow. Even if the user doesn't read the error message, the support engineer likely would, and would be able to close the ticket within moments.
I just went over the following pull request, where I found this nugget. I have admittedly very firm views on the subject of error handling. The difference between good error handling and the merely mediocre can be ten times more lines of code, but also a hundred times less troubleshooting in production. And a flat out rule that I have is that whenever you use the Exception.Message property, you are tossing valuable information aside. This is a pretty horrible thing to do. You should never use the Message property; always use ToString(), so you'll get the full details. Yes, that is relevant even if you are showing the error to the user. Because if you are showing an exception message to the user, you have let it bubble up all the way, so you might as well give full details.
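The same principle applies in any language. Here is the Python analogue of the distinction (str(e) is roughly Exception.Message, while the formatted traceback is the ToString() equivalent; the function names are made up for the example):

```python
import traceback

def risky():
    data = {}
    return data["missing_key"]

try:
    risky()
except KeyError as e:
    message_only = str(e)                  # just "'missing_key'" -- where? why?
    full_details = traceback.format_exc()  # exception type, message, and stack

# The message alone drops the exception type, the stack trace, and any
# inner context -- exactly the information you need in production.
print(message_only)
print(full_details)
```

Logging `full_details` costs nothing extra and turns "some key was missing somewhere" into a pinpointed line of code.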
I'll start by saying that this is not something that is planned in any capacity. I ran into this topic recently and decided to dig a little deeper. This post is mostly about the results of my research.

If you run a file sharing system, you are going to run into a problem very early on. Quite a lot of the files that people are storing are shared, and that is a good thing. Instead of storing the same file multiple times, you can store it once and just keep a reference counter. That is how RavenDB internally deals with attachments, for example. Two documents that have the same attachment (same content, not the same file name, obviously) will only have a single entry inside of the RavenDB database.

When you run a file sharing system, one of the features you really want to offer is the ability to save space in this manner. Another is not being able to read the users' files. For many reasons, that is a highly desirable property. For the users, because they can be assured that you aren't reading their private files. For the file sharing system, because it simplifies operations significantly (if you can't look at the files' contents, there is a lot that you don't need to worry about).

In other words, we want to store the users' files, but we want to do that in an encrypted manner. It may sound surprising that the file sharing system doesn't want to know what it is storing, but it actually does simplify things. For example, if you can look into the data, you may be asked (or compelled) to do so. I'm not talking about something like a government agency doing that, but even feature requests such as "do a virus scan on my files", for example. If you literally cannot do that, and it is something that you present as an advantage to the user, that is much nicer to have.

The problem is that if you encrypt the files, you cannot know if they are duplicated. And then you cannot use the very important storage optimization technique of de-duplication.
What can you do then? This is where convergent encryption comes into play. The idea is that we'll use an encryption system that will give us the same output for the same input, even when using different keys. To start with, that sounds like a tall order, no? But it turns out that it is quite easy to do. Consider the following symmetric key operations:
public static byte[] Encrypt(byte[] message, byte[] nonce, byte[] key);
public static byte[] Decrypt(byte[] cipherText, byte[] nonce, byte[] key);
One of the key aspects of modern cryptography is the use of a nonce, which ensures that for the same message and key, we'll always get a different ciphertext. In this case, we need to go the other way around: to ensure that for different keys, the same content will give us the same output. Here is how we can utilize the above primitives for this purpose:
public (byte[] CipherText, byte[] UserKey) ConvergentEncryption(byte[] message, byte[] key)
{
    var hash = Hash(message);
    var emptyNonce = new byte[32]; // initialized to zero
    var ciphertext = Encrypt(message, emptyNonce, key: hash);
    var nonce = GenerateNonce();
    var encryptedKey = Encrypt(message: hash, nonce, key);
    return (ciphertext, nonce.Concat(encryptedKey).ToArray());
}

public byte[] ConvergentDecryption(byte[] ciphertext, byte[] userKey, byte[] key)
{
    var nonce = userKey[0..32];
    var encryptedKey = userKey[32..64];
    var hash = Decrypt(encryptedKey, nonce, key);
    var emptyNonce = new byte[32]; // zeroed
    return Decrypt(ciphertext, emptyNonce, hash);
}
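To make the two-step flow runnable, here is a sketch of the same scheme in Python. A toy SHA-256-based XOR keystream stands in for the Encrypt/Decrypt primitives (illustration only, not a secure cipher; all names here are my own, not part of any library):

```python
import hashlib
import os

def _keystream_xor(data: bytes, nonce: bytes, key: bytes) -> bytes:
    # Toy XOR stream "cipher" standing in for a real symmetric primitive
    # such as XChaCha20. For illustration only -- NOT secure.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(d ^ k for d, k in zip(data, out))

def convergent_encrypt(message: bytes, user_key: bytes):
    content_key = hashlib.sha256(message).digest()  # key derived from the content
    ciphertext = _keystream_xor(message, bytes(32), content_key)  # static zero nonce
    nonce = os.urandom(32)                          # fresh nonce for the key wrap
    encrypted_key = _keystream_xor(content_key, nonce, user_key)
    return ciphertext, nonce + encrypted_key        # (blob to de-duplicate, per-user key)

def convergent_decrypt(ciphertext: bytes, key_blob: bytes, user_key: bytes):
    nonce, encrypted_key = key_blob[:32], key_blob[32:]
    content_key = _keystream_xor(encrypted_key, nonce, user_key)  # XOR stream is its own inverse
    return _keystream_xor(ciphertext, bytes(32), content_key)

# Two users with different secret keys produce the identical ciphertext:
msg = b"the same meme, forwarded by two different users"
c1, k1 = convergent_encrypt(msg, b"alice's secret key")
c2, k2 = convergent_encrypt(msg, b"bob's secret key")
assert c1 == c2   # the server can de-duplicate the blob
assert k1 != k2   # but each user keeps their own wrapped key
```

The server stores one copy of the ciphertext and a tiny per-user key blob, which is exactly the property described below.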
In other words, we have a two step process. First, we compute the cryptographic hash of the message and use that as the key to encrypt the data, with a static nonce (more on that later). We then take the hash of the file and encrypt it normally, with the secret key of the user and a random nonce. We can then push both the ciphertext and the nonce + encrypted key to the file sharing service.

Two users, using different keys, will always generate the same output for the same input in this model. How is that? Both users will compute the same cryptographic hash for the same file, of course. Using a static nonce completes the cycle and ensures that we're basically running the same operation. The file sharing system can then de-duplicate the encrypted blob directly. Since this is the identical operation, we can safely do that. We do need to keep, for each user, the key to open that file, encrypted with the user's key. But that is a very small value, so it is likely not an issue.

Now, what about this static nonce? The whole point of a nonce is that it is a value that you use once. How can we use a static value here safely? The problem a nonce is meant to solve is that with most ciphers, if you XOR two cipher texts that were encrypted using the same key and nonce, you'll get the result of XOR between their plain texts. That can have a catastrophic impact on the security of the system. To get anywhere here, you would need to encrypt two different messages with the same key and nonce. In this case, however, that cannot happen. Since we use the cryptographic hash of the content as the key, any change in the message ensures that we have a different key. That means that we never reuse a key for different messages, so there is no real point in using a nonce at all.
Given that the cryptographic operation requires it, we can just pass a zeroed nonce and not think about it further.

This is a very attractive proposition, since it can amount to massive space savings. File sharing isn't the only scenario where this is attractive. Consider the case of a messaging application, where you want to forward messages from one user to another. Memes and other viral content are a common scenario here. You can avoid having to re-upload the file many times, because even in an end to end encryption model, we can still avoid sharing the file contents with the storage host.

However, this leads to one of the serious issues that we have to cover for convergent encryption. For the same input, you'll get the same output. That means that if an adversary knows the source file, it can tell if a user has that file. In the context of a messaging application, that can spell trouble. Consider the following image, which is banned by the Big Meat industry:

Even with end to end encryption, if you use convergent encryption for media files, you can tell that a particular piece of content is accessed by a user. If this is a unique file, the server can't really tell what is inside it. However, if this is a file that has a known source, we can detect that the user is looking at a picture of salads and immediately snitch to Big Meat. This is called a confirmation of file attack, and it is the most obvious problem for this scenario.

The other issue is that by using convergent encryption, you may allow an adversary to guess values in the face of known structure. Let's consider the following scenario. I have a system where users upload their data using convergent encryption. Given that many users may share the same file, we allow any user to download a file using:

GET /file?id=71e12496e9efe0c7e31332533cb53abf1a3677f2a802ef0c555b08ba6b8a0f9f

Now, let's assume that I know what typical files are stored in the service.
For example, something like this:

Looking at this form, there are quite a few variables that you can plug in here, right? However, we can generate options for all of those quite easily, and encryption is cheap, so we can speculate on the possible values. We can then run the convergent encryption on each possibility, then fetch that from the server. If we get a match, we have figured out the remaining information.

Consider this another aspect of password cracking: we are using a set of algorithms that are quite explicitly meant to be highly efficient, so they lend themselves to offline work.

That is enough on the topic, I believe. I don't have any plans of doing something with this, but it was interesting to figure out. I actually started this by looking at this feature in WhatsApp:

It seems that this is a client side enforced policy, rather than something that is handled via the protocol itself. I initially thought that this was done via convergent encryption, but it looks like it is just a counter in the message, and the client side shows the warning as well as applies limits to it.
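The guessing attack above is easy to sketch. Because the scheme is deterministic, the stored blob's identifier is a pure function of the plaintext, so an adversary who knows the document template can simply enumerate the blank (the double-hash id function and the salary form are hypothetical stand-ins for the real scheme):

```python
import hashlib

def convergent_file_id(message: bytes) -> str:
    # Stand-in for the id of the convergently encrypted blob: since the
    # whole pipeline is deterministic, the id is a pure function of the
    # plaintext. Here we model it as a double SHA-256.
    return hashlib.sha256(hashlib.sha256(message).digest()).hexdigest()

# The server's view: ids of blobs that users have uploaded.
stored_ids = {convergent_file_id(b"salary: 84000\nname: jane doe")}

# The adversary knows the form's structure and enumerates the one unknown.
for salary in range(50_000, 150_000, 1_000):
    candidate = f"salary: {salary}\nname: jane doe".encode()
    if convergent_file_id(candidate) in stored_ids:
        print("confirmed:", salary)  # -> confirmed: 84000
        break
```

A hundred guesses and a hundred cheap hash operations recover the "private" value, which is exactly why convergent encryption is dangerous for low-entropy, structured content.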