Relatively General .NET

Indexing only recent data - adventures with large datasets & archiving

by Oren Eini

posted on: July 26, 2024

We recently got a support request from a user with the following issue: We have an index that is using way too much disk space. We don't need to search the entire dataset, just the most recent documents. Can we do something like this?

from d in docs.Events
where d.CreationDate >= DateTime.UtcNow.AddMonths(-3)
select new { d.CreationDate, d.Content };

The idea is that only documents from the past 3 months would be indexed, while older documents would be purged from the index but still retained. The actual problem is that this is a full-text search index, and the data size required to perform a full-text search across the entire dataset is higher than just storing the documents (which can be easily compressed).

This is a great example of an XY problem. The request was to allow access to the current date during the indexing process so the index could filter out old documents. However, that is something we explicitly prevent. The problem is that the current date isn't meaningful when we talk about indexing. The indexing time isn't relevant for filtering or other operations, since it has no association with the actual data. The date of a document and the time it was indexed are completely unrelated. I might update a document (and thus re-index it) whose CreationDate is far in the past. That update would filter it out of the index. However, if we didn't update the document, it would be retained in the index indefinitely, since the filtering occurs only at indexing time.

Going back to the XY problem, what is the user trying to solve? They don't want to index all the data, but they do want to retain it forever. So how can we achieve this with RavenDB?

Data Archiving in RavenDB

One of the things we aim to do with RavenDB is ensure that we have a good fit for most common scenarios, and archiving is certainly one of them. In RavenDB 6.0 we added explicit support for Data Archiving. When you save a document, all you need to do is add a metadata element, @archive-at, and you are set. For example, take a look at the following document:

{
    "Name": "Wilman Kal",
    "Phone": "90-224 8888",
    "@metadata": {
        "@archive-at": "2024-11-01T12:00:00.000Z",
        "@collection": "Companies"
    }
}

This document is set to be archived on Nov 1st, 2024. What does that mean? From that day on, RavenDB will automatically mark it as an archived document, meaning it will be stored in a compressed format and excluded from indexing by default.

In fact, this exact scenario is detailed in the documentation. You can decide (on a per-index basis) whether to include archived documents in the index. This gives you a very high level of flexibility without requiring much manual effort. In short, for this scenario, you can simply tell RavenDB when to archive the document and let RavenDB handle the rest. RavenDB will do the right thing for you.
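For reference, here is a minimal sketch of how a client could set that metadata from C#. It assumes the RavenDB .NET client's session API (Store, Advanced.GetMetadataFor, SaveChanges); the Company class and the three-month offset are placeholders for this example, not something mandated by the feature.

```csharp
using System;
using Raven.Client.Documents;

// Placeholder entity used only for this example.
public class Company
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Phone { get; set; }
}

public static class ArchivingExample
{
    public static void SaveWithArchiveDate(IDocumentStore store)
    {
        using var session = store.OpenSession();

        var company = new Company { Name = "Wilman Kal", Phone = "90-224 8888" };
        session.Store(company);

        // Ask RavenDB to archive this document three months from now.
        // The value is a UTC timestamp in ISO 8601 format, as in the document above.
        var metadata = session.Advanced.GetMetadataFor(company);
        metadata["@archive-at"] = DateTime.UtcNow.AddMonths(3).ToString("O");

        session.SaveChanges();
    }
}
```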

Cryptographically impossible bug hunt

by Oren Eini

posted on: July 24, 2024

I'm currently deep in the process of modifying the internals of Voron, trying to eke more performance out of the system. I'm making great progress, but I'm also touching parts of the code that haven't even been looked at for a long time. In other words, I'm mucking about with the most stable and most critical portions of the storage engine. It's a lot of fun, and I'm actually seeing some great results, but it is also nerve-wracking. We have enough tests that I have great confidence I would catch any actual stability issues, but the drive back toward a fully green build has been a slog.

The process is straightforward:

- Change something.
- Verify that it works better than before.
- Run the entire test suite (upward of 30K tests) to see if there are any breaks.

The last part can be frustrating because it takes a while to run a test suite of that size. That would be bad enough, but some of the changes I made were things like marking a piece of memory that used to be read/write as read-only. Now any access to that memory results in an access violation. I fixed those in the code, of course, but we have a lot of tests, including some that intentionally corrupt data to verify that RavenDB behaves properly under those conditions. One such test writes garbage to the RavenDB file through what used to be read-write memory. The idea is to verify that the checksum check on read catches the corruption and we abort early. Because that test directly modifies what is now read-only memory, it generates a crash due to a memory access violation. That doesn't just result in a test failure, it takes the whole process down.

I've gotten pretty good at debugging those sorts of issues (--blame-crash is fantastic) and was able to knock quite a few of them down and get them fixed. And then there was this test, which uses encryption-at-rest. That test started to fail after my changes, and I was pretty confused about exactly what was going on. When trying to read data from disk, it would follow a pointer to an invalid location. That is not supposed to happen, obviously. It looked like I had a little data corruption issue on my hands.

The problem is that this shouldn't be possible. Remember how we validate the checksum on read? When using encryption-at-rest, we use a mechanism called AEAD (Authenticated Encryption with Associated Data). That means that in order to successfully decrypt a page of data from disk, it must have been cryptographically verified to be valid. My test results showed, pretty conclusively, that I was generating valid data and then encrypting it. The next stage was to decrypt the data (verifying that it was valid), at which point I ended up with complete garbage.

RavenDB trusts that since the data was properly decrypted, it is valid, and tries to use it. Because the data is garbage, that leads to... excitement. Once I realized what was going on, I was really confused. I'm pretty sure that I didn't break 256-bit encryption, but I had a very clear chain of steps that led to valid data being decrypted (successfully!) into garbage. It was also quite frustrating to track down because any small-scale test that I wrote would return the expected results. It was only when I ran the entire system and stressed it that I got this weird scenario.

I started practicing for my Fields Medal acceptance speech while digging deeper. Something here had to be wrong. It took me a while to figure out what was going on, but eventually I tracked it down to registering for the TransactionCommit event when we open a new file.

The idea is that when we commit the transaction, we'll encrypt all the data buffers and then write them to the file. We register for an event to handle that, and we used to do that on a per-file basis. My changes, among other things, moved that logic to apply globally. As long as we were writing to a single file, everything just worked. When we had enough workload to need a second file, we would encrypt the data twice and then write it to the file. Upon decryption, we would successfully decrypt the data but end up with still-encrypted data (which looks like random fluff). The fix was simply moving the event registration to the transaction level, not the file level. I committed my changes and went back to the unexciting life of bug-fixing, rather than encryption-breaking and math-defying hacks.
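To make the failure mode concrete, here is a simplified sketch of that bug pattern. This is not the actual Voron code; the Transaction and DataFile types and the handler are invented for illustration, and the "encryption" is just a counter, but it shows how per-file registration on a shared commit event makes the handler run twice once a second file is opened.

```csharp
using System;
using System.Collections.Generic;

// Illustration only: a shared "transaction" that raises an event when it commits.
class Transaction
{
    public event Action<List<byte[]>> OnCommit;

    public void Commit(List<byte[]> buffers) => OnCommit?.Invoke(buffers);
}

class DataFile
{
    static int _encryptionPasses;

    public DataFile(Transaction tx)
    {
        // The bug pattern: registration happens per file, so opening a second
        // file subscribes the same encryption step a second time.
        tx.OnCommit += EncryptBuffers;
    }

    static void EncryptBuffers(List<byte[]> buffers)
    {
        // Stand-in for encrypting the data buffers before they hit the disk.
        _encryptionPasses++;
        Console.WriteLine($"Encryption pass #{_encryptionPasses} over {buffers.Count} buffer(s)");
    }
}

class Program
{
    static void Main()
    {
        var tx = new Transaction();
        _ = new DataFile(tx);
        _ = new DataFile(tx); // the workload grew enough to need a second file

        tx.Commit(new List<byte[]> { new byte[] { 1, 2, 3 } });

        // Output shows two encryption passes for a single commit, so the bytes
        // written to disk are encrypted twice. Decryption then "succeeds" but
        // yields data that is still encrypted. Registering the handler once,
        // at the transaction level, removes the duplicate pass.
    }
}
```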

What’s new in .NET Aspire 8.1 for cloud native developers!

by Mitch Denny

posted on: July 23, 2024

Let's take a look at what is new with .NET Aspire 8.1 for building cloud native applications!

Add AI to Your .NET Apps Easily with Prompty

by Bruno Capuano

posted on: July 22, 2024

Learn how to integrate AI into your .NET applications with Prompty, a powerful Visual Studio Code extension.

Stop a script when an error occurs in PowerShell

posted on: July 22, 2024

By default, PowerShell doesn't stop the script when an error occurs. Instead, it writes the error to the error stream and continues executing the script. You can change this behavior by setting the $ErrorActionPreference variable to Stop. When you do this, PowerShell stops the script when an error occurs.

Introducing CoreWCF and WCF Client Azure Queue Storage bindings for .NET

by Subhrajit Saha

posted on: July 18, 2024

The initial beta release of the official Microsoft.CoreWCF.Azure.StorageQueues and Microsoft.WCF.Azure.StorageQueues.Client libraries for .NET is now available.

.NET 6 will reach End of Support on November 12, 2024

by Rahul Bhandari (MSFT)

posted on: July 18, 2024

.NET 6 will reach end of support on November 12, 2024. This blog breaks down the valuable information you need to know and how to update to .NET 8.

Disambiguating types with the same name with extern alias

by Andrew Lock

posted on: July 16, 2024

In this post I describe how to solve Error CS0433, where you have two types with the exact same name and namespace coming from two different packages…
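The post has the full walkthrough; as a quick reminder of the shape of the fix, here is a minimal sketch of the extern alias mechanism. The package, namespace, and type names below are placeholders, and the project needs the matching Aliases entry on the PackageReference for this to compile.

```csharp
// In the .csproj, give one of the conflicting packages an alias
// (placeholder package name):
//
//   <PackageReference Include="Vendor.PackageA" Version="1.0.0" Aliases="PackageA" />

// The extern alias directive must appear before any using directives.
extern alias PackageA;

// Disambiguate the identically named types via their root aliases.
using WidgetA = PackageA::Acme.Shared.Widget;  // type from the aliased package
using WidgetB = global::Acme.Shared.Widget;    // same-named type from the other package

public class Demo
{
    public void Use()
    {
        var a = new WidgetA();
        var b = new WidgetB();
    }
}
```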

Temporal cattle and other important jargon

by Oren Eini

posted on: July 15, 2024

I was talking to a colleague about a particular problem we are trying to solve. He suggested that we solve the problem using a particular data structure from a recently published paper. As we were talking, he explained how this data structure works and how it should handle our problem.

The solution was complex and it took me a while to understand what it was trying to achieve and how it would fit our scenario. And then something clicked in my head and I said something like: Oh, that is just an epoch-based, copy-on-write B+Tree with single-producer/concurrent-readers?

If this sounds like nonsense to you, that is fine. Those are very specific terms that we are using here. The point of such a discussion is that this sort of jargon serves a very important purpose. It allows us to talk with clarity and intent about fairly complex topics, knowing that both sides have the same understanding of what we are actually talking about.

The idea is that we can elevate the conversation and focus on the differences between what the jargon specifies and the topic at hand. This is abstraction at the logic level, where we can zoom out past a lot of details and still keep the intent accurate.

Being able to discuss something at this level is hugely important because we can convey complex ideas easily. Once I managed to put what he was suggesting in a context that I could understand, we were able to discuss the pros and cons of this data structure for the scenario. I do appreciate that the conversation basically stopped making sense to anyone who isn't already well-versed in the topic as soon as we were able to (from my perspective) clearly and effectively communicate.

"When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean — neither more nor less."

Clarity of communication is a really important aspect of software engineering. Being able to explain, hopefully in a coherent fashion, why the software is built the way it is and why the code is structured just so can be really complex. Leaning on existing knowledge and understanding can make that a lot simpler.

There is also another aspect. When using jargon like this, it is clear when you don't know something. You can go and research it. The mere fact that you can't understand the text tells you both that you are missing information and where you can find it.

For software, you need to consider two scenarios: writing code today and explaining how it works to your colleagues, and looking at code that you wrote ten years ago and trying to figure out what was going on there. In both cases, I think that this sort of approach is a really useful way to convey information.

.NET 9 Preview 6 is now available!

by .NET Team

posted on: July 15, 2024

Try out the latest features in .NET 9 Preview 6 across the .NET runtime, SDK, libraries, ASP.NET Core, Blazor, and more!