Reproducible builds are important when building NuGet packages from public sources. Indeed, it gives your consumers confidence in your packages by allowing them to validate the package has actually been built using the public sources. To be able to reproduce a build, you need the source files, the
I needed to pay one of our suppliers. That supplier happens to be living in Europe, while Hibernating Rhinos is headquartered in Israel. That means that I have to send an international money transfer to get them paid.So far, that isn’t an issue, this is literally something that we have to do multiple times a week and have been doing for the past decade and a half. This time, however…We run into a problem, we initiated the payment round as usual and let the suppliers know that the money was transferred. And then I forgot about it, anything from this point on is on the accounting department. A few days later, we started getting calls, telling us that the money that we sent didn’t arrive. I called the bank and they checked, it appears that some of the transfers that we made hit an internal bank limit. Basically, there are rules in place for how much money you can send out of the country before you have to get the IRS involved. I believe that the issue is about moving money to off shore accounts in order to provide taxes on that, but that isn’t relevant for the story at hand. What is relevant here is that the bank didn’t process some of the payments in the run (those that hit the limit for the tax block). Some payment did went through, but some didn’t. The issue with such things isn’t so much the block (I can understand why it is there, although I wish there was some earlier notice). The major issue was that we try to have a good payment schedule for our suppliers, meaning that we’ll pay most invoices within a short amount of time. When something like that happens, it means that we have to wade into the bureaucracy of the (international) tax system. That takes time, and in that time, the suppliers aren’t getting paid. Technically, we are okay, we are usually paying far ahead of the invoice due date, but I strongly dislike this.We used a different source for the funds and paid all the suppliers immediately, then set to clear the tax hurdles involved in the usual manner in which we are paying. I also paid to expediate the transfer and they all had the money arrive faster than normal. In all, I would estimate that this meant that we had a delay of just a few days over when the money would arrive normally. But that isn’t where the story ends. For most of the suppliers, the original transfers never happened, because of the tax issue. For two of them, however, the money was gone from our account. One of them also confirmed that they received the money, so I expected the second one to get it as well.They didn’t. But the money was gone from our account. Talking to the bank, the money was in the bank’s account, waiting for the tax permit to go ahead with that. I asked the bank to cancel the order, then I transferred the money using the alternative way.Except… that supplier called me, confused. The money appeared in their account twice. Checking with the bank, it was indeed gone from the original and alternative accounts. Well, that wasn’t expected. Luckily, this is a supplier that I’m doing regular business with, we decided that the simplest option is just to consider the extra payment to be credit for future charges. The supplier sent me a(n already marked as) paid invoice, and aside from shaking my head in disbelief, the issue was done.Except.. that supplier called me, even more confused. Their bank called them, saying that the originating bank has cancelled the transfer, and they need to send the money back. Que: The key here was that their bank wanted them to transfer the money back to us. I had a very negative reaction to that, because this pinged all the hallmarks for a common scam: Overpayment Scam. I asked the supplier to do nothing with that, since if the bank need to transfer the money, they can do it directly.The fear was that the supplier would send the money back, then the bank will refund the money, resulting in the supplier having no money. I mentioned already: ?I talked to my bank and cancelled the cancellation, so hopefully things will stabilize now. This is an ongoing event and I don’t know if we hit peak kafkianess. As for what happens, I suspect that when I asked to cancel the original transfer, the was another logic branch that was followed. Since the money already left my account, they had to record that as a cancellation, but that apparently was sent to the destination bank, along with the money, I guess? At this point, I don’t even wanna know.
I wrote a post a couple of weeks ago called: Architecture foresight: Put a queue on that. I got an interesting comment from Mike Tomaras on the post that deserve its own post in reply.Even though the benefits of an async queue are indisputable, I will respectfully point out that you brush over or ignore the drawbacks. … redacted, see the real comment for details …I think we agree that your sync code example is much easier to reason about than your async one. "Well, it is a bit more complex to manage in the user interface", "And you can play games on the front end" hides a lot of complexity in the FE to accommodate async patterns. Your "At more advanced levels" section presents no benefits really, doing these things in a sync pattern is exactly the same as in async, the complexity is moved to the infrastructure instead of the code.This is a great discussion, and I agree with Mike that there are additional costs to using the async option compared to the synchronous one. There is a really good reason why pretty much all modern languages has something similar to async/await, after all. And anyone who did any work with Node.js and promises without that knows exactly what are the cost of trying to keep the state of the system through multiple levels of callbacks.It is important, however, that my recommendation had nothing to do with async directly, although that is the end result. My recommendation had a lot more to do with breaking apart the behavior of the system, so you aren’t expected to give immediate replies to the user. Consider this: ⏱. When you are processing a user’s request, you have a timer inherent to the operation. That timer can be a real one (how long until the request times out) or it can be a mental one (how long until the user gets bored). That means that you have a very short SLA to run the actual request.What is the impact of that on your system? You have to provision enough capacity in the system to handle the spikes within the small SLA that you have to work with. That is tough. Let’s assume that you are running a website that accepts comments, and you need to run spam detection on the comment before actually posting that. This seems like a pretty standard scenario, right? It doesn’t require specialized scenarios.However, the service you use has a rate limit of 10 comments / sec. That is also something that is pretty common and reasonable. How would you handle something like that if you have a post that suddenly gets a lot of comments? Well, you’ll have something that ensure that you don’t pass the limit, but then the user is sitting there, waiting and thinking that the request timed out. On the other hand, if you accept the request and place it into a queue, you can show it in the UI as accepted immediately and then process that at leisure. Yes, this is more complex than just making the call inline, it requires a higher degree of complexity, but it also ensure that you have proper separation in your system. The front end submit messages to the backend, which will reply when it is done. By having this separation upfront, as part of your overall design, you get options. You can change how you are processing things in the backend quickly. Your front end feel fast (which is usually much more important than being fast, mind you).As for the rate limits and the SLA? In the case of spam API or similar services, sure, this is obvious. But there are usually a lot of implicit SLAs like that. Your database disk is only able to serve so many writes a second, for example. That isn’t usually surfaced to you as X writes / sec limit, but it is true nevertheless. And a queue will smooth over any such issues easily. With making the request directly, you have to ensure that you have enough capacity to handle spikes, and that is usually far more expensive.What is more interesting, in my opinion, is that the queue gives you options that you wouldn’t have otherwise. For example, tracing of all operations (great for audits), retries if needed, easy model for scale out, smoothing out of spikes, etc. You cannot actually put everything into a queue, of course. The typical example is that you’ll want to handle a login page. You cannot really “let the user login immediately and process in the background”. Another example where you don’t want to use asynchronous processing is when you are making a query. There are patterns for async query completions, but they are pretty horrible to work with. In general, the idea is that whenever the is any operation in the system, you throw that to a queue. Reads and certain key aspects are things that you’ll need to run directly.
A user called us to ask about how they can manage to move a particular report from a legacy system to RavenDB. They need to be able to ask questions such as the following one:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
select
count(distinct(City)) as CitiesCount,
Employee,
Company
from Orders
group by Employee, Company
view raw
query.sql
hosted with ❤ by GitHub
This is an interesting issue, when you think about it from the point of view of a database engine. The distinct issue means that we have to keep state (all the unique values) while we evaluate the query, which can be expensive. One of the design principles of RavenDB was that we want to make it hard to accidently create expensive queries. Indeed, a query like that isn’t trivial to implement in RavenDB. We need to have a two stage approach for implementing this feature.First, we’ll introduce a Map/Reduce index, which will aggregate the data on Employee, Company and City. Along the way, it will run the distinct operation on the City, because it will group by it. That gives us a model in which we get the distinct amount for free, and in a highly efficient manner. Here is the index in question:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
/// map
from o in docs.Orders
select new
{
Key = new
{
o.Employee,
o.Company
},
o.ShipTo.City,
Count = 1
}
// reduce
from result in results
group result by new { result.Key, result.City} into g
select new
{
g.Key.Key,
g.Key.City,
Count = g.Sum(x => x.Count)
}
view raw
index.cs
hosted with ❤ by GitHub
The interesting thing about this index is that querying it will not give us the right results. We don’t want to get the details based on Employee, Company and City. We want just by Employee and Company. This is where the second stage comes into play. Instead of running a simple query on the index, we’ll use a faceted query. Here is what it will look like:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
from index 'Orders/Stats'
select facet(Key, sum(Count))
view raw
faceted.sql
hosted with ❤ by GitHub
What this does is to aggregate the results (which were already partially aggregated by the Map/Reduce) and give us the totals. And here are the results:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
{
"Name": "Key",
"Values": [
{
"Name": "Count",
"Sum": 3,
"Count": 2,
"Range": "{\"Employee\":\"employees/1-A\",\"Company\":\"companies/1-A\"}"
},
{
"Name": "Count",
"Sum": 2,
"Count": 1,
"Range": "{\"Employee\":\"employees/1-A\",\"Company\":\"companies/10-A\"}"
}
]
}
view raw
results.json
hosted with ❤ by GitHub
The end result is that we are able to do most of the work an indexing time, and the query time is left working on already aggregated data. That means that the queries should be much faster and that there is a lot less work for the database to do.It also isn’t RavenDB’s strong suit. Such queries are typically more inline with OLAP systems, to be honest. If you know what your query patterns looks like, you can use this technique to easily handle such queries, but if there is a wide range of dynamic queries, you may want to use RavenDB as the system of record and then use either SQL ETL or OLAP ETL to push that to a reporting system.
I create a lot of samples, demos, open source projects, etc. and I like to use the fairly standard repository layout of having a solution…Keep Reading →
In my daily work, I’m becoming quite familiar with the ins and outs of using System.Text.Json. For those unfamiliar with this library, it was released along with .NET Core 3.0 as an in-the-box JSON serialisation library. At its release, System.Text.Json was pretty basic in its feature set, designed primarily for ASP.NET Core scenarios to handle […]
Implementing a unit of work in Python can be an interesting challenge. Consider the following code:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
class Holder(object):
def __init__(self):
self.items = dict()
def try_set(self, key, name):
if key in self.items:
return
self.items[key] = name
def try_get(self, key):
if key in self.items:
return self.items[key]
return None
view raw
Holder.py
hosted with ❤ by GitHub
This is about as simple a code as possible, to associate a tag to an object, right?However, this code will fail for the following scenario:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
@dataclass
class Item:
name: str
holder = Holder()
cup = Item(name="Cup")
holder.try_set(cup, "cups/1")
view raw
cups.py
hosted with ❤ by GitHub
You’ll get a lovely: “TypeError: unhashable type: 'Item'” when you try this. This is because data classes in Python has a complicated relationship with __hash__().An obvious solution to the problem is to use:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
class Holder(object):
def __init__(self):
self.items = dict()
# this is bad
def try_set(self, key, name):
if id(key) in self.items:
return
self.items[id(key)] = name
# this is bad
def try_get(self, key):
if id(key) in self.items:
return self.items[id(key)]
return None
view raw
Holder2.py
hosted with ❤ by GitHub
However, the id() in Python is not guaranteed to be unique. Consider the following code:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
cup = Item(name="Cup")
print(id(cup))
cup = None
cup = Item(name="Cup")
print(id(cup)) # different instance
view raw
opps.py
hosted with ❤ by GitHub
On my machine, running this code gives me:124597181219840
124597181219840In other words, the id has been reused. This makes sense, since this is just the pointer to the value. We can fix that by holding on to the object reference, like so:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Show hidden characters
class RefEq(object):
def __init__(self, ref):
self.ref = ref
def __eq__(self, other):
if id(self.ref) == id(other):
return True
if not isinstance(other, RefEq):
return False
return id(self.ref) == id(other.ref)
def __hash__(self):
return id(self.ref)
class Holder(object):
def __init__(self):
self.items = dict()
def try_set(self, key, name):
if RefEq(key) in self.items:
return
self.items[RefEq(key)] = name
def try_get(self, key):
if RefEq(key) in self.items:
return self.items[RefEq(key)]
return None
view raw
Holder3.py
hosted with ❤ by GitHub
With this approach, we are able to implement proper reference equality and make sure that we aren’t mixing different values.
If you want to improve the performance of a Blazor application, you may want to reduce the number of time the UI is recomputed. In Blazor, this means reducing the number of time the StateHasChanged method is called. This method needs to be called only when the state has changed and the UI needs to
A common question that is raised by customers is how to determine what kind of hardware you need to run RavenDB on. I’m sorry, but the answer is it’s depend, because there are a lot of variables to juggle, in this post, I”m going to try to give some insights about what sort of things you should consider when sizing your instances.In general, you have three axis that you can work with. CPU, Memory and I/O. In terms of the best bang for the buck, optimizing I/O is usually the way to go and will return the most dividends. This is because most of the time, RavenDB will be bottlenecked on the I/O. This is especially true when you are running on the cloud, where 500 IOPS is a fairly common default (that is basically zilch to a database).To give more a concrete answer we’ll need more details. Let’s say that you have an application with a database per customer (common for multi tenant scenarios). The structure of the database is the same, but the databases contain data that is separated from each customer. The database has 20 indexes in total, 15 map / full text search as well as 5 for map-reduce / facets operations. There is also an few ETL tasks and a couple of subscriptions for background work.Let’s breakdown the load on single server in this mode, shall we?100 databases (meaning 100 tx merger threads for I/O).2,000 indexes - 20 indexes x 100 databases (meaning 2,000 indexing threads).Across the cluster, we also have:500 ETL tasks – 5 per database x 100200 subscriptions – you get the drillThe latest items are spread fairly among all the nodes that you have, but the first two are present in all nodes in the cluster. What does this mean? We have 2,100 threads active at any given point in time? Well, that is where things gets a bit complex. We need to know more than just the raw numbers, we need to understand usage.How many of those databases are active at any given point in time? In a multi tenant system, it is common to have many customers using the system sporadically, which can allow you to pack a lot more instances into the same hardware resources. Of more interest, however, is usually the rate of writes. Here we need to ask ourselves what is the write write as well. In general, for reads RavenDB will load all the relevant items into memory and serve directly from there. For writes, given it’s durable nature, RavenDB must hit the disk. And the question now becomes how many database are active at the same time?This is important, because 10 writes per second to a single database are far better than 10 writes / second across 10 databases. This is because RavenDB is able to batch I/O for a single database, but not across databases. Let’s consider the scenario where we have writes that would impact 5 indexes in the database, what is going to happen when we have 10 writes / sec in a single database?1 – 5 writes to the disk for the actual documents writes (depends on a lot of factors, and assuming that we are talking about concurrent requests here).5 – 10 index updates: 1 –2 index updates x 5 relevant indexes (in most cases, we are able to batch indexes even better than documents writes).Total number of writes to disk: 6 – 15 writes.However, if we take the same scenario, but now run it across 10 databases, each having a single write? There is no way for us to batch updates, so we’ll have:10 databases x (1 document writes + 5 index updates) = 50 writes to disk.If the number of relevant indexes is high, or if there are more databases involved, it is easy to hit the limits of I/O, especially on the cloud. I’m actually painting somewhat bleak picture, in most cases you don’t have to worry too much about those details, RavenDB will take care of that for you. However, when you need to consider the sizing, you want to be aware of the possible load that you’ll have. Ironically enough, if you have enough load, RavenDB is able to really optimize things, it is when you have sporadic operations, spread across many locations that we start putting a lot of load on the underlying system.So far, I was talking about I/O only, but there are other factors as well. Let’s assume that you are running 100 databases with 20 indexes each on a system with 4 cores. How is RavenDB going to split the load across the system?The first priority is going to be given to processing requests, and then we’ll start on running indexes. That is actually by design, to ensure that we won’t overwhelmed the underlying system by issuing too much work all at once. That means that we’ll round robin the work across all the indexes that want to run, while keeping enough capacity to process user requests. In this case, more cores will allow us higher degree of parallelism, but if you have an unbalanced system (a lot of CPU but slow I/O), you’re going to see stalls because we’ll wait a lot for I/O.In short, you need to have a fair idea about how your system is going to be used. If you don’t have at least a good guess on the topic, you are probably better off getting more I/O bandwidth than anything else. RavenDB continuously monitor itself and will alert you if there are resource issues. You are then able to shore up anything that is lacking to get the best system performance.
We use cookies to analyze our website traffic and provide a better browsing experience. By
continuing to use our site, you agree to our use of cookies.