Designing for Related Data without Foreign Keys
by Ardalis
posted on: March 03, 2021
A recent discussion on the devBetter.com private server spurred this article. One of the members was trying to work out a fairly complex…Keep Reading →
by Oren Eini
posted on: March 02, 2021
A few days ago I posted about looking at GitHub projects for junior developer candidates. One of the things that is very common in such scenarios is to see them use string concatenation for queries, and I hate that. I just went to a random candidate's GitHub profile right now and found this gem:

```csharp
// check if user exists in the system
static bool UserExists(string id)
{
    SqlConnection conn = new SqlConnection();
    conn.ConnectionString = ConfigurationManager.ConnectionStrings["Table"].ConnectionString;
    conn.Open();
    SqlCommand cmd = new SqlCommand(@"SELECT COUNT(*) FROM [Users_Table] WHERE ID ='" + id + "'", conn);
    if ((int)cmd.ExecuteScalar() > 0)
    {
        return true;
    }
    else
    {
        return false;
    }
}
```

The number of issues that I have with this code is legion: not closing the connection or disposing the command; the if can be dropped entirely; and, of course, the actual SQL INJECTION vulnerability in the code. There is a reason that I have such a reaction to this type of code, even when looking at junior developer candidates. For them, this is acceptable, I guess. They are learning and focusing mostly on what is going on, not the myriad of taxes that you have to pay in order to get something to production. This was never meant to be production code (I hope, at least), and I'm not judging it on that level. But I have to very consciously remind myself of this fact whenever I run into code like this (and as I said, this is all too common). The reason I have such a visceral reaction to this type of code is that I see it in production systems all too often.
And that leads to nasty stuff like this:

```ruby
pagination_max = ""
pagination_min = ""
pagination_max = "and s.id < #{max_id}" unless max_id.nil?
pagination_min = "and s.id > #{min_id}" unless min_id.nil?

Status.find_by_sql "
  select st.* from (
    select s.*
    from statuses s
    where s.created_at > NOW() - INTERVAL '7 days'
      and s.reply is false
      and ( s.account_id = #{@id}
        or s.account_id in (select target_account_id from follows where account_id = #{@id}) )
      and s.account_id not in (select target_account_id from mutes where account_id = #{@id})
      #{pagination_max}
      #{pagination_min}
    order by s.created_at desc
    limit #{limit}
  ) st
  left join custom_filters cf on cf.account_id = #{@id}
    and st.text not like '%' || cf.phrase || '%'
  where cf.id is null
"
```

And this code led to a 70GB data leak on Gab. The killer for me is that this code was written by someone with 23 years of experience. I actually had to triple check what I was seeing when I read the code the first time, because I wasn't sure that this was actually possible. I thought maybe this was some fancy processing done to avoid SQL injection, not that this was basically string interpolation. Some bugs are things that you can excuse. A memory leak or a double free will happen to anyone who writes C, regardless of experience and how carefully they write. They are often subtle and easy to miss, happening in corner cases of error handling. This sort of bug, though, is a big box of red flags. It is also on fire.
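Both snippets above have the same cure: bind parameters instead of splicing strings into the SQL text. A minimal sketch of the idea in Python's sqlite3 (a stand-in for the original C# and Ruby; the table and column names here are hypothetical):

```python
import sqlite3

def user_exists(conn: sqlite3.Connection, user_id: str) -> bool:
    # The ? placeholder lets the driver bind the value safely, so a
    # malicious id like "' OR '1'='1" is treated as data, not as SQL.
    cur = conn.execute(
        "SELECT COUNT(*) FROM users_table WHERE id = ?", (user_id,)
    )
    return cur.fetchone()[0] > 0

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users_table (id TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO users_table VALUES ('alice')")
    print(user_exists(conn, "alice"))        # True
    print(user_exists(conn, "' OR '1'='1"))  # False: the injection attempt is inert
```

Every mainstream data access layer has an equivalent mechanism, e.g. SqlCommand.Parameters in .NET and bound parameters in ActiveRecord.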
by Andrew Lock
posted on: March 02, 2021
In this post I describe an HTTPS issue when using Cloudflare in front of Netlify, and how to create a custom TLS certificate for full encryption.…
by Oren Eini
posted on: March 01, 2021
I ran into an interesting scenario the other day. As part of the sign in process, the application will generate a random token that will be used to authenticate the following user requests. Basically, an auth cookie. Here is the code that was used to generate it:

```typescript
async function signIn(userId: string, pass: string): User {
    var session = store.openSession();
    var user = session.load<User>(userId);
    if (Security.comparePassword(pass, user.hashPwd, user.hashPwdOptions) == false)
        return null;

    if ((new Date().valueOf() - user.lastTokenGenerated.valueOf()) > (3600 * 1000)) {
        // token expires after 1 hour
        user.securedToken = btoa(fs.open('/dev/urandom').slice(0, 32)
            .readAsBinaryString());
        await session.saveChanges();
    }
    return user;
}
```

This is a pretty nice mechanism. We use cryptographically secure random bytes as the token, so no one can guess what it will be. There was a serious issue with this implementation, however. The problem was that on application startup, two different components would race to complete the sign in. As you can see from the code, this is meant to be done in such a way that two such calls within the space of one hour will return the same result. In most cases, this is exactly what happens. In some cases, however, the two calls would race each other. Both would load the document from the database at the same time, get a(n obviously) different security token and write it out, then one of them would return the "wrong" security token. At that point, we got an authentication attempt that was successful, but gave us the wrong token back.
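The race is easy to reproduce in miniature. A simplified Python model (not the actual code; the session and the document load are elided, and each racing caller simply mints its own random token):

```python
import secrets

class UserRecord:
    """Stand-in for the user document; only the token field matters here."""
    def __init__(self):
        self.secured_token = None

def sign_in(user: UserRecord) -> str:
    # Non-idempotent: every caller generates a fresh random token and
    # writes it back, so concurrent callers overwrite each other.
    token = secrets.token_hex(32)
    user.secured_token = token
    return token

user = UserRecord()
t1 = sign_in(user)  # first racing request
t2 = sign_in(user)  # second racing request, overwrites the stored token
print(t1 == t2)                  # False: the two callers hold different tokens
print(user.secured_token == t1)  # False: t1 no longer matches what is stored
```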
The first proposed solution was to handle that using a cluster-wide transaction in RavenDB. That would allow us to ensure that in the case of racing operations, we'll fail one of the transactions and then have to repeat it. Another way to resolve this issue, without the need for distributed transactions, is to make the sign in operation idempotent. Concurrent calls at the same time will end up with the same result, like so:

```typescript
async function signIn(userId: string, pass: string): User {
    var session = store.openSession();
    var user = session.load<User>(userId);
    if (Security.comparePassword(pass, user.hashPwd, user.hashPwdOptions) == false)
        return null;

    var now = new Date();
    var nowStr = now.getFullYear() + "-" + now.getMonth() + "-" + now.getDate()
        + "T" + now.getHours();
    user.securedToken = btoa(
        Security.argon2i(user.hashPwd + "-" + nowStr + "-" + Security.getPepper(),
            user.hashPwdOptions));
    await session.saveChanges();
    return user;
}
```

In this case, we generate the security token using the current time, rounded to an hour basis. We use Argon2i (a password hashing algorithm) to generate the required security token from the user's own hashed password, the current time and some pepper, to make it impossible for outsiders to guess what the security token is even if they know what the password is. By making the output predictable, we make the rest of the system easier. Note that the code above is still not completely okay. If the two requests come within milliseconds of one another, but on different hours, we have the same problem. I'll leave the problem of fixing that to you, dear reader.
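The idempotent scheme can be sketched without Argon2i. The sketch below (Python, stdlib only) substitutes HMAC-SHA256 for the Argon2i call in the post so it stays runnable; the pepper value is hypothetical:

```python
import hashlib
import hmac
from datetime import datetime, timezone

PEPPER = b"server-side-secret"  # hypothetical stand-in for Security.getPepper()

def derive_token(hashed_password: str, now: datetime) -> str:
    # Round the timestamp down to the hour, so any two calls within the
    # same hour derive the identical token: the operation is idempotent.
    hour_str = now.strftime("%Y-%m-%dT%H")
    msg = f"{hashed_password}-{hour_str}".encode()
    # The post uses Argon2i here; HMAC-SHA256 stands in for illustration.
    return hmac.new(PEPPER, msg, hashlib.sha256).hexdigest()

t = datetime(2021, 3, 1, 10, 15, tzinfo=timezone.utc)
a = derive_token("hashed-pw", t)
b = derive_token("hashed-pw", t.replace(minute=59))  # same hour
c = derive_token("hashed-pw", t.replace(hour=11))    # next hour
print(a == b)  # True: racing calls within the same hour agree
print(a == c)  # False: a new hour yields a new token
```

Note that the boundary problem the post points out survives in this sketch too: two calls straddling the top of the hour still disagree.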
by Gérald Barré
posted on: March 01, 2021
It's common to use int, Guid, or string to represent entity ids because these types are well-supported by databases. The problems come when you have methods with multiple parameters of the same type. In this case, it's easy to mix up parameters when calling these methods: `Issue GetIssue(int pr`…
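The mix-up the teaser describes can be prevented with dedicated id types. A sketch of the idea in Python (rather than the C# of the post; the names here are invented):

```python
from dataclasses import dataclass

# One wrapper type per id kind; a type checker (or a code review) can then
# catch a project id being passed where an issue id belongs.
@dataclass(frozen=True)
class ProjectId:
    value: int

@dataclass(frozen=True)
class IssueId:
    value: int

def get_issue(project: ProjectId, issue: IssueId) -> str:
    return f"project {project.value}, issue {issue.value}"

print(get_issue(ProjectId(42), IssueId(7)))  # prints "project 42, issue 7"
# get_issue(IssueId(7), ProjectId(42)) would be rejected by a type checker,
# whereas with two plain ints the swap compiles and fails silently.
```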
by Oren Eini
posted on: February 26, 2021
We have been working on a big benchmark of RavenDB recently. The data size that we are working on is beyond the TB range and we are dealing with over a billion documents. Working with such data sizes can be frustrating, because it takes quite a bit of time for certain things to complete. Since I had downtime while I was waiting for the data to load, I reached for a new toy I just got, a Raspberry Pi 400. Basically, a Raspberry Pi 4 that is embedded inside a keyboard. It is a pretty cool machine and an awesome thing to play around with. Naturally, I had to try it out with RavenDB. We have had support for running on ARM devices for a long while now, and we have done some performance work on the Raspberry Pi 3. There are actually a whole bunch of customers that are using RavenDB in production on the Pi. These range from embedding RavenDB in industrial robots, to using RavenDB to store traffic analysis data on vehicles, to deploying Raspberry Pi servers to manage fleets of IoT sensors in remote locations. The Raspberry Pi 4 is supposedly much more powerful, and I got the model with 4GB of RAM to play around with. And since I already had a data set that hit the TB range lying around, I decided to see what we could do with both of those. I scrounged an external hard disk that we had lying around that had sufficient capacity and started the import process. This is where we are after a few minutes: the import process has been running for about two and a half minutes and has imported about 4 million documents. I want to emphasize that this is running on an HDD (and a fairly old one at that). Currently I can feel its vibrations on the table, so we are definitely I/O limited there. Once I'm done with the data load (which I expect to take a couple of days), we'll be testing this with queries. Should be quite fun to compare the costs of this to a cloud instance.
Given typical cloud machines, we can probably cover the costs of the PI in a few days.
by Ben Foster
posted on: February 26, 2021
Producing a good library goes beyond writing code. This post looks at the tools and practices you can use to level up your libraries, improving quality and security and increasing adoption and engagement.
by Gérald Barré
posted on: February 26, 2021
In a previous post, I explained how to compute code coverage for a .NET Core project with Azure DevOps and Coverlet. In that post, the code coverage visualization was done directly in Azure DevOps. As a developer, you want to see these results directly in the IDE. Download the code coverage f…
by Steve Gordon
posted on: February 25, 2021
If you’ve built applications using ASP.NET Core then you’ve most likely used the built-in dependency injection container from Microsoft.Extensions.DependencyInjection. This package provides an implementation of the corresponding abstractions found in Microsoft.Extensions.DependencyInjection.Abstractions. In the previous post, we learned about the IServiceCollection, including how service registrations are converted to ServiceDescriptors and added to the collection. We’ll continue learning about […]
by Oren Eini
posted on: February 25, 2021
I mentioned that we are currently hiring for a junior dev position and we have been absolutely swamped with candidates. Leaving aside the divorce lawyer that tried to apply to the position and the several accountants (I don't really get it either), we typically get people with very little experience. In fact, this position is explicitly open to people with no experience whatsoever. Given that most junior positions require a minimum of two years, I think that got us a lot of candidates. The fact that we don't require prior experience doesn't mean that we don't have prerequisites, of course. We are a database company and the fundamentals are important to us. A typical task in RavenDB involves a lot of taxes, from ACID compliance, distributed computing and strict performance requirements to visibility into the actions of the database, readability of the code, etc. I talked before about the cost of a bad hire, and in the nearly a decade that has passed since I wrote that post, I haven't changed my mind. I would rather end up with no one than hire someone that isn't a match for our needs. Our interview process is composed of a phone call, a few coding questions and then an in-person interview. At this point, given that I have been doing this for over a decade, I think that I have interviewed well over 5,000 people. A job interview stresses some people out a lot. Yesterday I had to listen to a candidate speak so fast that I could barely understand the words, and I had to stop another candidate and tell them that they were in the 95th percentile of people I had spoken to, so they wouldn't freeze because of a flubbed question. I tweeted (anonymously) about the ups and downs of the process and seem to have created quite a lot of noise. A typical phone call with a potential candidate takes about 15–30 minutes and is mostly there to serve as an explicit filter. If they don't meet the minimum requirements that we have, there is no point in wasting either of our time.
One of the questions that I ask is: build a phone book application that stores the data in memory and outputs the records in lexical order. This can stump some people, so we have an extra question to help. Instead of trying to output the data in lexical order, how would you ensure that you don't have a duplicate phone number in such a system? Scanning through the entire list of records each time is obviously not the way to go. If they still can't think of a way to do that, the next hint is to think about O(1) and what data structure would fit this requirement. On the Twitter thread, quite a few people were up in arms about that. Building a phone book is the kind of task that I remember doing in high school programming class as a teenager. Admittedly, that was in Pascal, but I just checked six different computer science degrees and for all of them, data structures was a compulsory course. Moreover, things like "what is the complexity of this operation" are things that we do multiple times a day here. We are building a database, so operations on data are literally our bread and butter. But even for "normal" operations, that is crucial. A common example: we need to display some information to the user about their database. The information actually comes from two sources internally. One is the database list, which contains various metadata, and one is the active database instance, which can give us the running stats, such as the number of requests for this database in the past minute. Let's take a look at this code:

```csharp
public List<DatabaseStats> GetDatabaseStats()
{
    List<DatabaseMetadata> databaseList = server.Metadata.GetAll();
    List<Database> running = server.Database.GetActive();
    var results = new List<DatabaseStats>();
    foreach (var metadata in databaseList)
    {
        var runningStats = running.Single(x => x.Name == metadata.Name);
        results.Add(new DatabaseStats
        {
            Name = metadata.Name,
            CreatedAt = metadata.CreationDate,
            SizeOfDisk = metadata.SizeOfDisk,
            RequestsPerSec = runningStats.RequestsPerSec
        });
    }
    return results;
}
```

The complexity of this code is O(N^2). In other words, for ten databases, it would cost us a hundred operations. But for 250 databases it would cost 62,500 and for 500 it would be 250,000. Almost the same code, but without the needless cost:

```csharp
public List<DatabaseStats> GetDatabaseStats()
{
    List<DatabaseMetadata> databaseList = server.Metadata.GetAll();
    List<Database> running = server.Database.GetActive();
    var results = new Dictionary<string, DatabaseStats>();
    foreach (var metadata in databaseList)
    {
        results.Add(metadata.Name, new DatabaseStats
        {
            Name = metadata.Name,
            CreatedAt = metadata.CreationDate,
            SizeOfDisk = metadata.SizeOfDisk,
        });
    }
    foreach (var runStats in running)
    {
        results[runStats.Name].RequestsPerSec = runStats.RequestsPerSec;
    }
    return results.Values.ToList();
}
```

This is neither theoretical nor rare, and it is part of every programming curriculum that you'll find. Furthermore, I'm not even asking about the theoretical properties of various algorithms, or asking a candidate to compute the complexity of a particular piece of code.
I'm presenting a fairly simple and common task (make sure that a piece of data is unique) and asking how to perform that efficiently. From a lot of the reactions, it seems that plenty of people believe that data structures aren't part of the fundamentals and shouldn't be something that is deeply embedded in the mindset of developers. To me, that is like saying that a carpenter shouldn't be aware of the difference between a nail and a screw. Rob has an interesting thread on the topic, which I wanted to address specifically:

"The first and most obvious problem is 'what language?'. JavaScript, Ruby, and Python have abstracted the idea of an array (and hash) so there are things you can do with them that might qualify as O(1) but then again, might not. I think what Oren was going for was ..." — Rob Conery (@robconery) February 25, 2021

It does not matter what language, actually. In JavaScript, you'll not use a "new Hashtable()", you'll use an object, sure, but that is a meaningless detail. In fact, arrays implemented as hashes in JavaScript maintain most of their important properties, O(1) access time by index being the key one. If you want to go deeper, then in practice the JS runtime will usually use an actual array if it detects that this is how you use the variable. And I'm sorry, but this is actually really important for many reasons. The underlying structure of our data has a huge impact on what you can do with that data and how you can operate on it. That is why you have ArrayBuffer and friends in JavaScript now: because it matters, a lot. And to the people whose 30 years of experience never included needing to know those details, I have two things to say. First, either your code is full of O(N^2) traps (which is sadly common), or you do in fact know those details, because list vs. hash is not something that you can really get away from. Second, in this company, implementing basic data structures is literally part of the day to day tasks.
Over the years, we have needed customized versions of arrays, lists, dictionaries, trees and many more. This isn't pie-in-the-sky stuff; that is Monday morning.
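The phone book question above reduces to hash-based membership. A minimal sketch in Python (illustrative only, not an answer key; a dict plus a set gives O(1) average-time duplicate detection):

```python
class PhoneBook:
    def __init__(self):
        self._entries = {}     # name -> number
        self._numbers = set()  # hash set: O(1) average membership checks

    def add(self, name: str, number: str) -> bool:
        # A set lookup is O(1) on average; scanning a list would be O(N)
        # per insert, i.e. O(N^2) to load N records.
        if number in self._numbers:
            return False  # duplicate phone number rejected
        self._entries[name] = number
        self._numbers.add(number)
        return True

    def records_in_lexical_order(self):
        return sorted(self._entries.items())

book = PhoneBook()
print(book.add("Bob", "555-0100"))    # True
print(book.add("Alice", "555-0100"))  # False: duplicate number
print(book.records_in_lexical_order())
```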