I had a long conversation with a dev team that is building a non-trivial business system. One of the chief problems they have to deal with is that the “business logic” they are asked to work with is extremely mutable, situation dependent, and changes frequently. That isn’t a new complaint, of course, but given that I have run into this in the past, I can deeply empathize. The key issue is that the business rules (I refuse to call them logic) are in a constant state of flux. Trying to encode them into the software itself leads to a great deal of mess in both the code and the architecture.

For example, consider the field of insurance. There are certain aspects of the insurance field that are pretty much fixed in stone (and codified into law). But there are (many) others that are much more flexible, because they relate to the business of selling insurance rather than the actual insurance itself. Because certain parts of the system cannot change (by law), all the modifications happen in other places, and those places see a lot of churn.

A marketing executive came up with a brilliant idea: let’s market a health insurance policy for young athletic people. This is the same as the usual young-person policy, but you get a 5% discount on the monthly premium if you have over 20 days in the month with over 10,000 steps recorded. Conversely, you get penalized with a 5% surcharge if you don’t have at least 15 days with over 10,000 steps recorded. Please note that this is a real service and not something that I just made up.

Consider what such a feature means. We have to build the integration with FitBit, the ability to pull the data in, etc. But what happens next? You can be sure that there are going to be a lot more plans and offers that will use those options. You can envision another offer for a policy that gives discounts to people 40+ who jog regularly, etc. What does this kind of thing look like in your code?
The typical answer is one of a few options:

- Just Say No – in some IT organizations, this is simply rejected. They don’t have the capacity or ability to implement such options, therefore the business won’t be able to offer them.
- Yes Man – whatever the business wants, the business gets. And if the code gets a little ugly, well, that is life, isn’t it?
- Structured – these organizations were able to figure out how to divide the static pieces from the frequently changing parts in such a way that they can ensure the long-term maintainability of the system.

In many cases, organizations start as the 2nd option and turn into the 1st. In the early 2000s, cellular phone plans in Israel were expensive. A family plan could cost as much as a mortgage payment. I’m not kidding, it was really that bad. One of the cellular companies had an inherent advantage, however. They were able to make new offers and plans so much faster than the other companies:

- Summer Vacation plan for your teenagers – speak free after midnight, with 50 texts a week.
- Off-hours dedicated phone discounts – you can talk to 5 phone numbers between 20:00 – 08:00 and on weekends for a fixed price.

All sorts of stuff like that, and it worked. Some people would switch plans on a regular basis, trying to find the optimal option. The reason this company was able to do that had to do with the manner in which they did billing.

What they did was quite amazing, even decades later. Their billing system aggregated all the usage for a particular plan and pushed that into a report. Then there was a directory filled with VBScript files that they would run over the report. The VBScripts were responsible for applying the logic for the plans. The fact that they wrote them in VBScript meant that there was a very well defined structure: all the backend work gathered the data, then the policies were applied in the scripts.
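The shape of that billing system can be sketched in a few lines of Python: the backend aggregates usage into a plain report, and each plan's pricing lives in its own small function with a shared signature, the way the VBScript files lived in a directory. Everything here (the `UsageReport` fields, the plan names, the per-text charge) is illustrative, not the actual system:

```python
from dataclasses import dataclass

@dataclass
class UsageReport:
    """The aggregated usage the backend produces; fields are made up."""
    minutes_after_midnight: int
    texts_this_week: int
    base_price: float

def summer_vacation_plan(report: UsageReport) -> float:
    # Speak free after midnight, 50 texts a week included.
    # 0.1 per extra text is a hypothetical charge for the sketch.
    extra_texts = max(0, report.texts_this_week - 50)
    return report.base_price + extra_texts * 0.1

# In the real system, this registry was a directory of script files.
PLANS = {"summer-vacation": summer_vacation_plan}

def bill(plan_name: str, report: UsageReport) -> float:
    # The backend does the heavy data gathering; the plan logic is
    # a thin, swappable script applied at the very end.
    return PLANS[plan_name](report)
```

The point of the split is that adding a new plan touches only the registry, never the data-gathering pipeline.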
Making those kinds of changes and introducing new policies was easy.

If the technique is familiar to you, that is because I talked about this in the past. In fact, I wrote a book about it. But this isn’t the time to talk about a book a dozen years old or a story from twenty years ago. Let’s talk about how we can apply this today, shall we?

For scripting, I’m going to use MoonSharp, which is a managed Lua implementation. Lua is a great scripting language; it is quite capable and at the same time usable by people without too much technical knowledge. Among other things, it also offers built-in debugging support, which can be a crucial feature for large scale systems.

At any rate, let’s consider the following logic:
```lua
-- flood_prone_places.lua
if
    policy.Type == PolicyType.House and
    policy.address.country == 'Germany' and
    policy.address.zip == '50374'
then
    policy.adjust_policy('flood risk', 50)
end
```
As you can see, this script raises the monthly rate for a house insurance policy in a particular location. To execute this code, you’ll need something like:
```csharp
// run_script.cs
UserData.RegisterType<Policy>();
UserData.RegisterType<Address>();
UserData.RegisterType<PolicyType>();

Script script = new Script();
script.Globals["PolicyType"] = UserData.CreateStatic<PolicyType>();
script.Globals["policy"] = policy;
script.DoString(scriptCode);
```
Let’s look at a slightly more complex example, implementing the FitBit discount:
```lua
-- fitbit_discount.lua
days_with_10K_plus_steps = 0
for i, day in ipairs(fitbit.period) do
    if day.number_of_steps > (10 * 1000) then
        days_with_10K_plus_steps = days_with_10K_plus_steps + 1
    end
end

if days_with_10K_plus_steps >= 20 then
    policy.apply_discount('Walked 10K or more for 20+ days this month', 0.05) -- 5% discount, yeah
elseif days_with_10K_plus_steps < 15 then
    policy.apply_discount('Walked 10K or more for less than 15 days this month', -0.05) -- 5% penalty, boo!
end
```
Those are the mechanics of how this works: using MoonSharp to apply arbitrary logic to a policy. As I mentioned, I literally wrote a book about this, discussing many of the details you need to create such a system. Right now, I want to focus on the architectural impact.

The kind of code we’ll write in those scripts is usually pretty boring. Those are business rules in all their glory: quite specific and narrow, carefully applied. They are simple to understand in isolation, and as long as we keep them this way, we can reduce the overall complexity of the system.

Let’s consider how we’ll actually use them, shall we? Scripts like the ones above are what the user will work with to customize the system. I’m saying user, but this is likely going to be used by integrators, rather than the end user. That data is then going to be stored directly in our Policy object, and we can apply it as needed. A more complex solution may have the ability to attach multiple scripts to various events in the system.

This changes the entire nature of your system, because you are focused on enabling the behavior of the system, rather than having to figure out how to apply the rules. The end result is that there is a very well defined structure (there has to be, for things to work) and, in general, a more maintainable system overall.
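The "attach multiple scripts to various events" idea can be sketched roughly as an event-to-scripts mapping stored on the policy. This is a toy Python analogue (using exec() where the post uses MoonSharp and Lua); the class shape, event names, and script snippets are all made up for illustration:

```python
class Policy:
    """Hypothetical policy that carries its own customization scripts."""
    def __init__(self):
        self.monthly_rate = 100.0
        self.scripts = {}  # event name -> list of script sources

    def attach(self, event, source):
        self.scripts.setdefault(event, []).append(source)

    def raise_event(self, event):
        for source in self.scripts.get(event, []):
            # Each script sees the policy as a global, just as the
            # MoonSharp examples expose `policy` to the Lua code.
            exec(source, {"policy": self})

policy = Policy()
policy.attach("renewal", "policy.monthly_rate *= 1.02  # 2% yearly adjustment")
policy.raise_event("renewal")
```

The core system only knows how to fire events; everything plan-specific lives in the attached scripts, which is what keeps the churn out of the codebase.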
About twenty years ago, I remember looking at a typical business application, and most of the code was basically about massaging data to and from the database. The situation has changed, but even the most sophisticated applications today spend an inordinate amount of time just shuffling data around. It may require a lot less code, but CRUD still makes up most of the application codebase. On the one hand, that is pretty boring code to write, but on the other hand, that is boring code to write. That is an excellent property. Boring code is predictable: it will work without issues, it is easy to understand and modify, and in general it lacks surprises. I love surprises when it comes to birthday parties and refunds from the IRS; I like them a lot less in my production code. Here is a typical example of such code:
```csharp
// InsurancePolicyPayment.cs
public record PolicyPaid(string PolicyId, string PolicyType, decimal Amount, bool PaidInFull, DateTime At);

public class PayPolicyHandler : BaseRequestHandler<PayPolicy>
{
    public async Task Handle(PayPolicy request, CancellationToken token)
    {
        var policy = await Session.LoadAsync<Policy>(request.PolicyId, token);
        var full = policy.ApplyPayment(request.Amount, request.PaymentDate);
        await Session.StoreAsync(new PolicyPaid(request.PolicyId, policy.Type, request.Amount, full, request.PaymentDate));
        await Session.SaveChangesAsync(token);
    }
}
```
As you can see, we are using a command handling pattern, and here we can choose one of a few architectural options:

- Write the code and behavior directly inside the command handlers.
- Keep the command handlers mostly about orchestration, and write the business logic inside the business entities (such as the example above).

There is another aspect to the code here that is interesting, however. Take a look at the first line of code. We define there a record, a data class, that we use to note that an event happened. You might be familiar with the notion of event sourcing, where we record the incoming events to the system so we’ll be able to replay them if our logic changes. This is the exact inverse of that: our code emits business events that can be processed by other pieces of the system.

The nice thing in the code above is that emitting the business event is simply writing the data record to the database. In this manner, we can participate in the overall transaction and seamlessly integrate into the overarching architecture. There isn’t much to do here, after all. You can utilize this pattern to emit events whenever something interesting, or potentially interesting, happens in your application.

Aside from taking up some disk space, why exactly would you want to do this?

Well, now that you have the business events in the database, you can start operating on them. For example, we can create a report based on the paid policies by policy type and month. Of far greater interest, however, is the ability to handle such events in code. You can do that using RavenDB subscriptions. That gives us a very important channel for extending the behavior of the system. Given the code above, let’s say that we want to add a feature that sends a note to the user if their policy isn’t paid in full. I can handle that by writing the following subscription:
```sql
-- subscription.sql
from PolicyPaid where PaidInFull == false
```
And then we can write a script to process that:
```python
#!/usr/bin/python3
# subscription.py
import os
# Import paths shown here follow the pyravendb client; adjust for your client version.
from pyravendb.store import document_store
from pyravendb.subscriptions.data import SubscriptionWorkerOptions

def process_partially_paid_policies(batch):
    for policy in batch.items:
        # send an email, etc...
        print("Pay up: " + policy.PolicyId)

with document_store.DocumentStore(urls=[os.getenv("RAVENDB_URL")], database=os.getenv("RAVENDB_DATABASE_NAME")) as store:
    store.initialize()
    connection_options = SubscriptionWorkerOptions("PartiallyPaidPolicies")
    with store.subscriptions.get_subscription_worker(connection_options) as subscription:
        subscription.run(process_partially_paid_policies)
```
I intentionally show the example here as a Python script, because it doesn’t have to be a core part of the system. It can be just something that you add, either directly or as part of the system customization by an integrator. The point is that this isn’t something that was thought of and envisioned by the core development team. Instead, we are emitting business events and using RavenDB subscriptions to respond to them and enhance the system with additional behavior.

One end result of this kind of system is that we are going to have two tiers. One, where most of the action happens, is focused almost solely on data management and the core pieces of the system. All the other behavior in the system is implemented elsewhere, in a far more dynamic manner. That gives you a lot of flexibility. It also means that there is a lot less hidden behavior; you can track and monitor all the parts much more easily, since everything is done in the open.
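As a rough illustration of the reporting side mentioned earlier (paid policies by policy type and month), here is what the aggregation amounts to, in plain Python over hypothetical PolicyPaid records. In RavenDB itself this would more likely be a map/reduce index; the sample data is made up:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical PolicyPaid business events, as they would sit in the database.
events = [
    {"PolicyType": "House",  "Amount": 250.0, "At": datetime(2021, 5, 3)},
    {"PolicyType": "House",  "Amount": 300.0, "At": datetime(2021, 5, 20)},
    {"PolicyType": "Health", "Amount": 80.0,  "At": datetime(2021, 6, 1)},
]

# Group the paid amounts by (policy type, year-month).
report = defaultdict(float)
for e in events:
    key = (e["PolicyType"], e["At"].strftime("%Y-%m"))
    report[key] += e["Amount"]
```

Because the events are plain records in the database, reports like this can be added long after the fact, without touching the payment handler.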
Tracking down a customer’s performance issue, we eventually narrowed things down to a single document modification that would grind the entire server to a halt. The actual save was working fine; it was when indexing time came around that we saw the issues. The entire system would spike in terms of memory usage and disk I/O, but CPU utilization wasn’t too bad.

We finally tracked it down to a fairly big document. Let’s assume that we have the following document:
```json
{
    "name": "Eat, Slay, Love: The Good Guys, Book 10",
    "author": "Eric Ugland",
    "rating": "Excellent",
    "reviews": [
        {
            "title": "Fantastic",
            "body": "I really like how the story moves forward",
            "author": "users/clark",
            "verified": true,
            "rating": 5
        },
        /// *******************
        /// *Many* such reviews
        /// *******************
        {
            "title": "Horrible",
            "body": "Author writes based on a template and is boring, I'll keep reading because I want to send him money, but will complain.",
            "author": "users/snark",
            "verified": true,
            "rating": 1
        }
    ]
}
```
Note that this can be big. As in, multiple megabytes in some cases, with thousands of reviews. In the case we looked at, the document was over 5MB in size and had over 1,500 reviews. That isn’t ideal, and RavenDB will issue a performance hint when dealing with such documents, but it is certainly workable. The problem was with the index, which looked like this:
```csharp
// opps.index.cs
from book in docs.Books
from review in book.reviews
select new
{
    review_title = review.title,
    review_rating = review.rating,
    review_verified = review.verified,
    review_author = review.author,
    book = book,
    book_rating = book.rating,
    book_name = book.name,
    book_author = book.author
}
```
This index is also set up to store all the fields being indexed. Take a look at the index, and read it a few times. Can you see what the problem is?

This is a fanout index, which I’m not a fan of, but that is certainly something we can live with. A fanout of 1,500 results from a single document isn’t even in the top order of magnitude that we have seen. And yet this index will cause RavenDB to consume a lot of resources, even if we have just a single document to index. What is going on here?

The faulty line is `book = book`. Give it a moment to sink in, please.

We are indexing the entire document here, once for each of the reviews in the document. When RavenDB encounters a complex value as part of the indexing process, it will index that as a JSON value. There are some scenarios that call for that, but in this case, what it meant is that for each of the reviews in the document, we would:

- Serialize the entire document to JSON
- Store that in the index

5MB times 1,500 reviews gives us a single document costing us nearly 8GB in storage space alone. And it will allocate close to 100GB (!) of memory during the operation (it won’t hold 100GB, just allocate it). Committing such an index to disk requires us to temporarily use about 22GB of RAM and forces us to do a single durable write that exceeds the 7GB mark. Then there is the background work to clean all of that up.

The customer probably meant to index book_id, but got this by mistake, and we ended up with extreme resource utilization every time that document was modified. Removing this line meant that indexing the document went from ~8GB to 2MB. That is three orders of magnitude of difference.

We are going to be adding some additional performance hints to make it clear that something is wrong in such a scenario. We had a few notices, but it was hard to figure out exactly what was going on there.
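To make the storage figure above concrete, here is the back-of-the-envelope arithmetic: the full document is serialized once per review.

```python
# Cost of the faulty fanout: the ~5MB document is stored once per review,
# for roughly 1,500 reviews, for a SINGLE document being indexed.
doc_size_mb = 5
reviews = 1500
stored_mb = doc_size_mb * reviews   # 7,500 MB stored in the index
stored_gb = stored_mb / 1024        # roughly 7.3 GB, i.e. "nearly 8GB"
print(stored_mb, round(stored_gb, 1))
```

And that is before the transient allocations during indexing, which were far larger still.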
Terraform is all the rage in the DevOps world now, so we decided to take a look. In general, Terraform is written in Go, but it uses a pretty nifty plugin system to work with additional providers. A Terraform provider is an executable that you run, but it communicates with the Terraform controller using gRPC. What happens is that Terraform will invoke the executable, and then communicate with it over gRPC, whose details are provided in the standard output. That means that you don’t need to write in Go; any language will do. Indeed, Samuel Fisher did all the hard work in making it possible. Please note that this post likely makes no sense if you didn’t read Samuel’s post first.

However, his work assumes that you are running on Linux, and there are a few minor issues that you have to deal with in order to get everything working properly on Windows.

For safety, Terraform uses TLS for communication and ensures that the data is safe in transit. The provider will usually generate a self signed certificate and provide the public key on the standard output. I would like to understand what threat model they are trying to protect against, but at any rate, the method should be fairly safe against anything that I can think of. The problem is that there is an issue with the way the C# Terraform provider from Samuel handles the certificate on Windows. I sent a pull request to resolve it; it’s just a few lines, and it is quite silly.

The next challenge is how to make Terraform actually execute your provider. The suggestion from Samuel is to copy the data to where Terraform will cache it, and use that. I don’t really like that approach, to be honest, and it looks like there are better options. You can save a %APPDATA%\terraform.rc file with the following content:
```hcl
# terraform.rc
provider_installation {
    filesystem_mirror {
        path    = "f:/terraform/providers"
        include = ["example.com/*/*"]
    }
    direct {
        exclude = ["example.com/*/*"]
    }
}
```
This will ensure that your provider will be loaded from the local directory, instead of fetched over the network. Finally, there is another challenge: Terraform expects the paths and names to match, which can be quite annoying for development. I had to run the following commands to get it working:

```shell
cp F:\TerraformPluginDotNet\samples\SampleProvider\bin\release\net5.0\win-x64\publish\* F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64
mv F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64\SampleProvider.exe F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64\terraform-provider-dotnetsample.exe
```

What this does is ensure that the files are in the right location, with the right name, for Terraform to execute. From there on, you can go on as usual developing your provider.
A RavenDB customer called us with an interesting issue. Every now and then, RavenDB would stop processing any and all requests. These pauses could last for as long as two to three minutes and occurred on a fairly random, if frequent, basis. A team of anteaters was dispatched to look at the issue (best bug hunters by far), but we couldn’t figure out what was going on. During these pauses, there was absolutely no activity on the machine. There was hardly any CPU utilization, there was no network or high I/O load, and RavenDB was not responding to requests; it was also not doing anything else. The logs just… stopped for that duration. That was something super strange.

We have seen similar pauses in the past, I’ll admit. Around 2014 / 2015 we had a spate of very similar issues, with RavenDB paused for a very long time. Those issues were all caused by the GC. At the time, RavenDB would do a lot of allocations, and it wasn’t uncommon to end up with the majority of the time spent on GC cleanup. The behavior back then, however, was very different. We could see high CPU utilization, and all metrics very clearly pointed out that the fault was the GC. In this case, absolutely nothing was going on. Here is what such a pause looked like when we gathered the ETW metrics. Curiouser and curiouser, as Alice said.

This was a big instance, with quite a bit of work going on, so we spent some time analyzing the process behavior. And absolutely nothing appeared to be wrong. We finally figured out that the root cause was the GC. The problem is that the GC was doing absolutely nothing here. For that matter, we spend an inordinate amount of time making sure that the GC won’t have much to do. I mentioned 2014/2015 earlier; as a direct result of those issues, we fixed this by completely re-architecting RavenDB. The database uses a lot less managed memory in general and is far faster. So what the hell is going on here? And why weren’t we able to see those metrics before?
It took a lot of time to find this issue, and the GC is one of the first things we check.

In order to explain the issue, I would like to refer you to the Book of the Runtime and its discussion of thread suspension. The .NET GC will eventually need to run a blocking collection; when that happens, it needs to ensure that the heap is in a known state. You can read the details in the book, but the short of it is that there are what are known as GC safe points. If the GC needs to run a blocking collection, all managed threads must be at a safe point. What happens if they aren’t, however? Well, the GC will let them run until they reach such a point. There is a whole machinery in place to make sure that this happens. I would also recommend reading the discussion here, and Konrad’s book is a great resource as well.

Coming back to the real issue: the GC cannot start until all the managed threads are at a safe point, so in order to suspend the threads, it will let them run to a safe point and suspend them there. What is a safe point? It is a bit complex, but the easiest way to think about it is that whenever there is a method call, the runtime ensures that the GC would have stable information. The distance between method calls is typically short, so that is great; the GC is not likely to wait long for a thread to come to a safe point. And if there are loops that may take a while, the JIT will do the right thing to ensure that we won’t wait too long.

In this scenario, however, that was very much not the case. What was going on? We got a page fault, which can happen anywhere, and until we return from the page fault, we cannot get to a GC safe point, so all the threads are suspended waiting for this page fault to complete. And in this particular case, we had a single page fault, reading 16KB of data, that took close to two minutes to complete.
So the actual fault is somewhere in the storage layer, which is out of scope for RavenDB, but a single slow I/O operation had a cascading effect that paused the whole server. The investigation continues, and I’ll probably have another post on the topic when I get the details.

For what it is worth, this is a “managed language” issue, but a similar scenario can happen when we are running native code. A page fault while holding the malloc lock would lead to the same scenario (although I think that one would be easier to figure out).

I wanted to see if I could reproduce the same issue on my side, but ran into a problem. We don’t know what caused the slow I/O, and there is no easy way to simulate it on Windows. On the other hand, Linux has userfaultfd(), so I decided to use that. userfaultfd() doesn’t have a managed API, so I wrote something that should suffice for my (very limited) needs. With that, I can write the following code:
```csharp
// Program.cs
using Microsoft.Win32.SafeHandles;
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Threading;
// from: https://gist.github.com/ayende/175e5764e3104196d962f77c050955f8
using static fault.LinuxSystem;

namespace fault
{
    class Program
    {
        static unsafe void Main()
        {
            (int fd, nint mem) = SetupMemoryRegionForUserFaultFd();
            ThreadPool.QueueUserWorkItem(AccessMemoryBackground, mem);
            var sfh = new SafeFileHandle((IntPtr)fd, ownsHandle: true);
            var file = new FileStream(sfh, FileAccess.Read);
            var buffer = new byte[sizeof(uffd_msg)];
            while (true)
            {
                var read = file.Read(buffer);
                if (read != sizeof(uffd_msg))
                {
                    throw new InvalidOperationException("Read failed");
                }
                var msg = MemoryMarshal.Cast<byte, uffd_msg>(buffer.AsSpan())[0];
                if (msg.@event != UFFD_EVENT_PAGEFAULT)
                {
                    throw new InvalidOperationException("Expected page fault");
                }
                Console.WriteLine("Got page fault at " + msg.pagefault.address);
                Console.WriteLine("Calling GC...");
                GC.Collect();
                Console.WriteLine("Done with GC...");
                return;
            }
        }

        private static unsafe (int fd, nint mem) SetupMemoryRegionForUserFaultFd()
        {
            int fd = userfaultfd(0);
            var api = new uffdio_api { api = UFFD_API };
            int rc = ioctl(fd, UFFDIO_API, ref api);
            if (api.api != UFFD_API || rc != 0)
            {
                throw new InvalidOperationException("Version mismatch!");
            }
            nint mem = mmap64(0, 16 * 1024, MmapProts.PROT_READ, MmapFlags.MAP_ANONYMOUS | MmapFlags.MAP_PRIVATE, 0, 0);
            if (mem < 0)
            {
                throw new InvalidOperationException("Failed to allocate mem!");
            }
            var reg = new uffdio_register
            {
                range = { start = (ulong)mem, len = 16 * 1024 },
                mode = UFFDIO_REGISTER_MODE_MISSING
            };
            rc = ioctl(fd, UFFDIO_REGISTER, ref reg);
            if (rc != 0 && reg.ioctls != UFFD_API_RANGE_IOCTLS)
            {
                throw new InvalidOperationException("Failed to register");
            }
            return (fd, mem);
        }

        private static unsafe void AccessMemoryBackground(object mem)
        {
            var ptr = (byte*)(nint)mem;
            var msg = mem + " accessed and got: ";
            Console.WriteLine(mem + " about to access");
            var v = *ptr;
            Console.WriteLine(msg + v);
        }
    }
}
```
If you run this with dotnet run -c release on a Linux system, you’ll get the following output:

```
139826051907584 about to access
Got page fault at 139826051907584
Calling GC...
```

And that would be… it. The system hangs. This confirms the theory, and it is quite an interesting use of Linux features to debug a problem that happened on Windows.