skip to content
Relatively General .NET

Will you pay the consistency costs?

by Oren Eini

posted on: February 24, 2021

I was talking to a developer recently and had a really interesting discussion around the notion of consistency. For simplicity’s sake, let’s imagine that we are talking about a game and we need to deal with awarding achievements. The story in question begins with a seemingly innocent business requirement:We want to announce a unique achievement in the game. The first player to kill 10,000 rabbits will get a unique class-appropriate item.Conceptually, that means that we need to write the following code: This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters Show hidden characters after_kill = function (state, character, mob) if mobtype != "rabbit" then return -- not applicable end if characters.kill_count["rabbit"] < (10 * 1000) then return -- didn't hit the required state end if state.world.achivements["10k_rabbit_killer"] then return -- already awarded the achivement end state.world.achivements["10k_rabbit_killer"] = character.id character.notify({ header = "World First!", body = "You were the first to kill 10,000 rabbits. You are a awarded the title Lapine Slayer.", changes = { title = { name = "Lapine Slayer" }, items = { state.generate_item_for(character, "Unique" ) } } }) end view raw 1st_10k_rabbit_killer.lua hosted with ❤ by GitHub The code is simple, easy and trivial. I wish we could end the post at this point, but as you can imagine, the situation is a bit more complex. Consider the typical architecture of a game, we have the game world, which is composed of multiple servers, often located in different parts of the world. In this case, we can assume that we have three data centers, in separate continents.Notice that we have a world first scenario here? That means that we need to synchronize the state across all of them in some manner, including when there are issues in the network, failures, etc.For that matter, when we are talking about an achievement for a character, we can at least be sure that there is a single chain of events that we can follow, but what happens if we were to apply this achievement to a guild? In this case, multiple players may be competing to complete this achievement. If we allow to to also run on different servers (and regions), the situation now become quite complex.Are there are technical ways to resolve this? Of course there are.We can use distributed transactions to handle this, at first glance. However, that introduce an utterly unworkable spanner in the works (See what I did here ). Games care a lot about performance and latency, forcing us to run a globally distributed transaction on every rabbit kill is not a possible solution. There are stiff performance costs associated with it. For that matter, in a partial failure scenario, you still want to allow players to kill rabbits, even if you lost some servers in another continent. That lead to several additional scenarios that you need to cover:What happened if a player killed rabbits but wasn’t able to record that using the world scope state? Do we also store that in a character / server level state? What happens if we already awarded the achievement and then find out that there are delayed updates in the mix?What happens when two players kill their 10,000th rabbit at times T1 and T2 (where T2 > T1), but this happens near enough to one another that the write about the 2nd player is committed first?Note that this can happen on the same server, on the same region or across different regions.What happens when two players kill their 10,000th rabbit at the same instant? Note that this can happen on the same server, on the same region or across different regions.Note that this can actually happen at different times, but clock divergence will say it happened at the same timeWhat happens when a player kill their 10,000 rabbit, but we had a network hiccup and couldn’t record the action in time? In fact, there are a multitude of issues that we have to deal with, if we accept the scenario as is. And the impact on the entire system because of the unstated requirements of a single achievement are huge. That is likely not what the intended result is. We can do some minimal changes to the system and get pretty much what we want at a much reduced cost.  We start by giving up the implicit assumption that we have to award the achievement for the 10,000th within the same tick as the actual kill.  If we give up this requirement, it means that we have a far more relaxed environment to deal with.We can say that we’ll process the achievements at the end of the next hour, for example. That gives us enough time to get updates from the rest of the system, settle the dust and avoid millisecond level decision making processes. In almost all businesses, there is no such thing as a race condition. The best example of that is “the check is in the mail”. The payment date for a check is not when you cashed it but when you posted it. My uncle used to go to the post office at 4:50 PM on a Friday to post checks. They would get the right time stamp, but would sit in the post office over the weekend before actually being delivered. That gave his 3 – 5 extra days before the check was cashed. In the case of our game, giving us the time for things to settle make sense. It also make sense from a marketing and community perspective. World wide announcements don’t just happen, they can be scheduled for a particular time frame to maximize impact. Delaying the announcement of such an achievement make it a lot easier from a technical perspective but also give us more business level benefits to work with.In many situation, we jump to assumptions about the kind of requirements that we have to meet, but usually there is a lot more flexibility. And discovering where you can get some slack can be of tremendous value.Possible reaction from the business in this scenario can be:We don’t actually care if this is exact. We want to give the achievement when this happens but it is okay if there is a little fudge factor and the “wrong” person won if there is a tiny amount of time difference.It is actually fine if two players get the achievement when it happens at the “same” time.Oh, I didn’t realize that this is so hard. Can we make it a server-wide achievement, then? Would that be easier?Any of those responses will translate to a great simplification of what we have to deliver. In the first case, we can track the state of bunny killing without the use of distributed transactions and only apply a distributed transaction when we get to the 10,000th rabbit. That is going to be a rare event, so it is fine to get to pay a little extra there. For that matter, a good question is to ask is how important the operation is. What happened when we have a failure in the distributed transaction when we record the 10,000th rabbit? I’m not talking about someone else getting there first, I”m talking bout a network hiccup or a faulty wire that cause the operation to fail. Do we retry? Do we output an error (and if so, to where?) or just ignore this? Do we need to have a secondary mechanism for checking for errors in the process? The answer is that it depends, you can get into effectively infinite loop of trying to solve ever more unlikely scenarios. The question is what is the impact on the system at large and is it worth the cost?For a game achievement, the answer is probably no (but then again, I’m not a gamer). People usually draw the line at money as the place where they’ll do their utmost to avoid issues.  This is a strange scenario, because money specifically has so many ways that you can run compensating actions as part of the normal workflow that you shouldn’t bend yourself into a pretzel to avoid that. Let’s say that we are offering three unique mounts in our game, selling them for 9.99$. We obviously have to deal with race conditions here. There are only three mounts, but there are more users who want it and are willing to pay for the privilege. We are dealing with money here, so we can assume that we want to be careful about that, the question is, how careful?We have three mounts, but more users. The payment process itself takes at least a few seconds, so how do you deal with it?Throw the purchase attempts into a queue.Pull the first three offers from the queue and attempt to charge them.If charging failed, we mark the purchase attempt as errored and move on to the next item in the queue.Once we successfully processed three purchases, we mark all the other purchase attempts as failed.Simple, right? What happens when a charge took longer than expected and a request timed out, but the charge actually happened? This can happen if you have SMS authentication in the process. So the card company will send a message to the card holder and they have to send an SMS back to approve the transaction. That can take long enough that the transaction will time out. But the user did authorize the transaction, funds were transferred, but you already sold the unique mount to someone else, with worse credit card security features.What happens then? You can refund the money, provide another unique mount, etc.The key here is that even for something that may consider critical, the number of failure modes is too high. Trying to handle them and ensuring a consistent world is usually too expensive. It is far better to have the hooks in place to handle failures and apply compensations. They are far rarer and much easier to deal with in isolation. When you start seeing that something happens on a frequent basis, that is when you want to figure out a way to automate that failure mode.

Didja know: Network failure due to the disk full error

by Oren Eini

posted on: February 23, 2021

Just had to trouble shoot a really strange problem in a production system. The situation is pretty simple. At one point in time, a machine had an issue and shortly afterward became utterly unable to reach the outside world. I was luckily able to SSH into the machine, but any outgoing connection would fail.Given that TCP worked (I was using it or SSH), how can that be? I can create a connection to the system and have bidirectional communication, but not initiate any calls to the outside world.Strange doesn’t begin to cover how weird this is. The reason for the failure, by the way, was that the disk was absolutely full. The reason for the network outage was a mystery.It took a while to figure out that the error was that the disk was full, which caused an error is systemd-resolved, which failed to setup DNS properly. I was also unable to make connections via IP address, which was strange, but the problem went away after the disk was cleared.I have to say, network failure due to disk errors was not how I expected to start the day.

Reading candidates’ GitHub profilers

by Oren Eini

posted on: February 22, 2021

We are hiring again (this time for Junior C# Dev positions in Israel). That means that I go through CVs (quite a few, actually).  I like going over the resumes directly, to get a feel for not just a particular candidate but what is, for lack of a better term, the state of the market. This time, I noticed a much higher percentage of resumes with a GitHub repository link. Anytime that I see such a link, I go and look at what they have there. That is often really interesting. Then again, you run into things like this:On the one hand, this is non production code, it is obviously a teaching project, which is awesome. On the other hand, I find such code painful to look at. In the past, I would rate highly anyone that would show a GitHub account in the CV, since I could expect to see some of their projects there, usually unique ones. This time? I’m seeing a lot of basically homework assignments, and those aren’t really that interesting to review or look at. Especially since a lot of the candidates apparently had the same courses, so I saw the same 5 projects repeated over and over again.In other words, just a GitHub account with some repositories are no longer that interesting or meaningful.Another thing that I noticed was that a lot of those candidates had profiles with profile pictures like:A small tip, if you expect people to visit your profile (and I assume you do, since you provided the link in the resume), it is worth it to put a (professional) picture of yourself there. The profiler readme on GitHub is also surprising attractive when looking at a candidate.Another tip, if you see a position for a C# Junior Developer, it is acceptable to apply if you don’t have all the requirements, or if you exceed them. But if you are trying to find a new job as a lawyer specializing in family law, maybe don’t try to apply to a tech company. And yes, I’m using this post as a want to vent while going over so many CVs. Most CVs are dry, but one candidate just got bumped to the next stage based solely on the fact that in they had a “Making awesome pancakes” in the CV, which made me laugh.

Multi-targeting a Roslyn analyzer

by Gérald Barré

posted on: February 22, 2021

This post is part of the series 'Roslyn Analyzers'. Be sure to check out the rest of the blog posts of the series!Writing a Roslyn analyzerWriting language-agnostic Roslyn Analyzers using IOperationWorking with types in a Roslyn analyzerReferencing an analyzer from a projectPackaging a Roslyn Analy

Modeling temporal data with RavenDB

by Oren Eini

posted on: February 18, 2021

We got an interesting modeling question from a customer: “What is the optimal manner to represent time sensitive information in RavenDB?” The initial thought was that they would use revisions and they asked about querying those. The issue is that this isn’t really the purpose for revisions, they are great if you want to see what the state of the system at a particular point in time, but not so good if your business logic has meaning over time.The best scenario for temporal data that I’m aware of is payroll. You have a lot of data that make sense only in the context of the time if was relevant for. For example, consider an employee that is hired at a given salary level, then given raises over time. The data in this case is divided into several layers.I’m going to use paper documents as the model here, because it makes it much easier to consider the modeling implications than when talking about JSON or class structures. On the most basic model, we have the Payslip document, which represent the amount (work, deductions, taxes, etc) that was paid to an employee at a particular point in time. This is similar to this:Once created, such a document will not change. It represent an action that happened in the past and is immutable. From this you can figure out taxes, overall payments, etc.The Payslip is computed from the Timesheet document, which is similar to this one:A Timehseet document is updated during the Payroll Period whenever an employee signs in or out. At the end of the Payroll Period, a manager will sign off on the Timesheet and approve it for payment. At this point, all the relevant business rules are run and the final Payslip is generated for each employee. Once the Timesheet is signed off and paid, it is no longer mutable and will not change. This make sense, since it represent something that already happened. In some cases, you’ll have new information, such as an employee that worked, but didn’t report their hours. They will need to do so in a new Timesheet and a new Payslip will be generated.Using the real world analogy, the Timesheet document is stored at the head office, and you cannot go and update that once it was submitted.So far, we haven’t seen anything related to things that change over time. In fact, the fact that we have separate documents for Payslips and Timesheets means that we can ignore a lot of the complexity that you’ll usually have to deal with in temporal databases.We can’t completely get away from it, however. We need to consider the employee’s Contract, however. Usually when we think about the employment contract we think about something like this:The contract specify details such as the hourly rate, overtime payment, vacation time, etc. In payroll systems, contracts are actually more complex than that, because you have to take into account that they change.For example, consider the following scenario:1996 – Hired as mailroom clerk – 4.75$ / hour1998 – Promoted to junior clerk – 5.25$ / hour1999 – Promoted to clerk – 5.40$ / hour2002 – Promoted to senior clerk – 6.20$ / hourHow do you handle something like that, in terms of modeling?The answer depends quite heavily on how your business logic handles this. One way to handle this is to create revisions. Using the real world logic, we are talking about signing a new contract and expiring the old one. But in reality, that isn’t how things are done. You’ll usually just update the payment terms.How does this looks like in terms of the data modeling when using RavenDB,however? Well, there are two options. We can represent the data as simple values, like so:When the data changes, we update those values (which generates revisions for the old data). However, that isn’t usually ideal, because business logic usually want to access the past values. For example, your contract may change mid payroll period, so your hourly rate is different depending if the hours worked on Monday or Thursday. In this case, you’ll want to represent the values changing in the model directly, like so:In most cases, this is the best option for modeling data where the temporal aspects of the data needs to be directly exposed to the business logic.

ASP.NET Core Dependency Injection: What is the IServiceCollection?

by Steve Gordon

posted on: February 16, 2021

If you’ve built applications using ASP.NET Core then you’ve most likely used the built-in dependency injection container from Microsoft.Extensions.DependencyInjection. This package provides an implementation of the corresponding abstractions found in Microsoft.Extensions.DependencyInjection.Abstractions. In this post, I wanted to take a deeper look at the first concept from the Microsoft Dependency Injection (DI) container, the IServiceCollection hopefully […]