NewsBlur (a personal news aggregator) suffered a data breach / ransomware “attack”. I’m putting “attack” in quotes because this is the equivalent of having your car broken into after you left it in a bad part of town with the engine running and the keys inside.
As a result of the breach, users’ data was compromised, including personal RSS feeds, access tokens for social media, email addresses and other sundry items of various import. It looks like about 250GB of data was taken hostage, by the way.
The explanation of what exactly happened is really interesting, however.
NewsBlur moved their MongoDB instance from its own server to a container. Along the way, they accidentally (it looks like a Docker default configuration) opened the MongoDB port to the whole wide world. By default, MongoDB will only listen on localhost. In this case, I think that from the perspective of MongoDB, it was listening on a local port; it was the Docker infrastructure that did the port forwarding and tied the public port to the instance. From that point on, it was just a matter of time. It apparently took two hours or so for some automated script to run into the welcome mat and jump in, wreak havoc and move on.
I’m actually surprised that it took so long. In some cases, machines are attacked in under a minute from showing up on the public internet. I used the term bad part of town earlier, but it is more accurate to say that the entire internet is a hostile environment and should be treated as such.
That leads to the next problem. You should never assume that you are running anywhere else. In the case above, we have NewsBlur assuming that they are running on a private network that only the internal servers can access. About a year ago, Microsoft had a similar issue: they exposed an Elastic cluster that was supposed to be on an internal network only and lost 250 million customer support records. In both cases, the problem was lack of defense in depth.
Once the attacker was able to connect to the system, it was game over. There are monitoring solutions that you can use, but in general, the idea is that you don’t trust your network. You authenticate and encrypt all the traffic, regardless of where you are running it. The additional encryption cost is usually not meaningful, even for demanding workloads, given that most CPUs have dedicated encryption instructions.
When using RavenDB, we have taken steps to ensure that:
It is simple and easy to run in a secure mode, using X.509 client certificates for authentication, with all network communications encrypted.
It is hard and complex to run without security.
If you run the RavenDB setup wizard, it takes under two minutes to end up with a secured solution, one that you can expose to the outside world and not worry about your data taking a walk.
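As a generic illustration of "authenticate and encrypt all the traffic", here is a minimal Python sketch of a server-side TLS configuration that refuses any client that does not present a certificate. This is not how RavenDB is implemented; it only uses Python's standard `ssl` module, and the function name and its parameters are made up for this example. The file paths are optional purely so the sketch can be run without real certificates.

```python
import ssl


def make_server_context(cert_file=None, key_file=None, client_ca_file=None):
    """Build a TLS context that requires clients to present a certificate.

    All three file paths are hypothetical placeholders for this sketch.
    """
    # Start from the standard library's recommended server-side defaults.
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Refuse old protocol versions.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # The handshake fails unless the client presents a valid certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file is not None:
        # The server's own certificate and private key.
        ctx.load_cert_chain(cert_file, key_file)
    if client_ca_file is not None:
        # The CA whose signatures we accept on client certificates.
        ctx.load_verify_locations(cafile=client_ca_file)
    return ctx
```

A listening socket wrapped with such a context rejects unauthenticated connections during the handshake, so an accidentally published port does not by itself hand over the data.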
A RavenDB database can reside on multiple nodes in the cluster. RavenDB uses a multi-master protocol to handle writes. Any node holding the database will be able to accept writes. This is in contrast to other databases that use the leader/follower model. In such systems, only a single instance is able to accept writes at any given point in time.
The node that accepted the write is responsible for disseminating the write to the rest of the cluster. This should work even if there are some breaks in communication, mind, which makes things more interesting.
Consider the case of a write to node A. Node A will accept the write and then replicate that as part of its normal operations to the other nodes in the cluster.
In other words, we’ll have:
A –> B
A –> C
A –> D
A –> E
In a distributed system, we need to be prepared for all sorts of strange failures. Consider the case where node A cannot talk to node C, but the other nodes can. In this scenario, we still expect node C to have the full data, even if node A cannot send the data to it directly.
The simple solution would be to have each node replicate the data it gets from any source to all its siblings. However, consider the cost involved:
Write to node A (1KB document) will result in 4 replications (4KB).
Node B will replicate to 4 nodes (including A, mind), so that is another 4KB.
Node C will replicate to 4 nodes, so that is another 4KB.
Node D will replicate to 4 nodes, so that is another 4KB.
Node E will replicate to 4 nodes, so that is another 4KB.
In other words, in a 5-node cluster, a single 1KB write will generate 20KB of network traffic, the vast majority of it totally unnecessary.
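The arithmetic above can be sketched in a few lines of Python. This only models the naive scheme, under the assumption that every node forwards the document exactly once to all of its siblings. The function names are made up for this example.

```python
def naive_flood_traffic_kb(node_count: int, doc_size_kb: int) -> int:
    """Total traffic when every node forwards a write once to all siblings."""
    siblings = node_count - 1
    # The node that accepted the write sends it to each sibling...
    from_origin = siblings * doc_size_kb
    # ...and each of those siblings re-sends it to all of its own siblings.
    from_forwarders = siblings * siblings * doc_size_kb
    return from_origin + from_forwarders


def direct_only_traffic_kb(node_count: int, doc_size_kb: int) -> int:
    """Traffic when only the node that accepted the write replicates it."""
    return (node_count - 1) * doc_size_kb


print(naive_flood_traffic_kb(5, 1))   # 20
print(direct_only_traffic_kb(5, 1))   # 4
```

The gap grows quadratically with the cluster size, which is why forwarding everything to everyone is a poor fit for a continuous stream of writes.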
There are many gossip algorithms, and they are quite interesting, but they are usually not meant for a continuous stream of updates. They focus on robustness over efficiency.
RavenDB takes the following approach: when a node accepts a write from a client directly, it will send the new write to all its siblings immediately. However, if a node accepts a write from replication, the situation is different. We assume that the node that replicated the document to us will also replicate the document to the other nodes in the cluster. As such, we’ll not initiate replication immediately. What we’ll do, instead, is let all the nodes that replicate to us know that we got the new document.
If we don’t have any writes on the node, we’ll check every 15 seconds whether we have documents that aren’t present on our siblings. Remember that the siblings will report to us what documents they currently have, proactively. There is no need to chat over the network about that.
In other words, during normal operations, what we’ll have is node A replicating the document to all the other nodes. They’ll each inform the other nodes that they have this document and nothing further needs to be done. However, in the case of a break between node A and node C, the other nodes will realize that they have a document that isn’t on node C, in which case they’ll complete the cycle and send it to node C, healing the gap in the network.
I’m using the term “tell the other nodes what documents we have”, but that isn’t what is actually going on. We use change vectors to track the replication state across the cluster. We don’t need to send each individual document write to the other side. Instead, we can send a single change vector (a short string) that will tell the other side all the documents that we have in one shot. You can read more about change vectors here.
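To make the idea concrete, here is a deliberately simplified Python model of a change vector. It is an assumption for illustration only and not RavenDB's actual format or code: the vector is represented as a mapping from a node tag to the highest write number seen from that node, and both function names are hypothetical.

```python
# Simplified, hypothetical model: node tag -> highest write number seen from it.
ChangeVector = dict[str, int]


def merge(a: ChangeVector, b: ChangeVector) -> ChangeVector:
    """Combine two vectors by taking the highest number per node."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def is_missing_writes(theirs: ChangeVector, ours: ChangeVector) -> bool:
    """True if we hold at least one write that the other side has not seen."""
    return any(count > theirs.get(node, 0) for node, count in ours.items())


ours = {"A": 5, "B": 2}
theirs = {"A": 3, "B": 2}
print(is_missing_writes(theirs, ours))   # True: they lack writes 4 and 5 from A
print(is_missing_writes(ours, theirs))   # False: we already have all of theirs
```

One small value summarizes everything a node has, so a sibling can decide whether anything needs to be sent without enumerating documents.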
In short, the behavior on the part of the node is simple:
On document write to the node, replicate the document to all siblings immediately.
On document replication, notify all siblings about the new change vector.
Every 15 seconds, replicate to siblings the documents that they missed.
Just these rules allow us to have a sophisticated system in practice, because we’ll not have excessive writes over the network but we’ll bypass any errors in the network layer without issue.
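The three rules can be played out in a small in-memory simulation. This is a toy sketch under stated assumptions and not RavenDB's code: nodes call each other directly instead of using a network, "what a sibling has" is tracked as a set of document ids rather than a change vector, the 15 second timer is triggered by hand, and all class, method and attribute names are invented for this example.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.docs = {}             # doc id -> content stored on this node
        self.siblings = []         # the other nodes in the cluster
        self.known = {}            # sibling name -> doc ids we know it has
        self.unreachable = set()   # sibling names we cannot talk to
        self.sent = 0              # number of documents this node sent

    def _send(self, sibling, doc_id):
        if sibling.name in self.unreachable:
            return                 # simulated break in communication
        self.sent += 1
        sibling.on_replication(self, doc_id, self.docs[doc_id])

    def _note(self, sibling_name, doc_ids):
        self.known.setdefault(sibling_name, set()).update(doc_ids)

    def on_client_write(self, doc_id, content):
        # Rule 1: a direct write is replicated to all siblings immediately.
        self.docs[doc_id] = content
        for sibling in self.siblings:
            self._send(sibling, doc_id)

    def on_replication(self, sender, doc_id, content):
        # Rule 2: store the document, do not forward it, but tell the
        # siblings what we now have. The sender obviously has it too.
        self.docs[doc_id] = content
        self._note(sender.name, {doc_id})
        for sibling in self.siblings:
            if sibling.name not in self.unreachable:
                sibling._note(self.name, set(self.docs))

    def on_timer(self):
        # Rule 3: periodically send siblings whatever they are missing.
        for sibling in self.siblings:
            missing = set(self.docs) - self.known.get(sibling.name, set())
            for doc_id in sorted(missing):
                self._send(sibling, doc_id)


def make_cluster(names):
    nodes = {name: Node(name) for name in names}
    for node in nodes.values():
        node.siblings = [n for n in nodes.values() if n is not node]
    return nodes


cluster = make_cluster("ABCDE")
cluster["A"].unreachable.add("C")          # break the A <-> C link
cluster["C"].unreachable.add("A")
cluster["A"].on_client_write("users/1", "data")
print("users/1" in cluster["C"].docs)      # False: A could not reach C
cluster["B"].on_timer()                    # B notices that C is behind
print("users/1" in cluster["C"].docs)      # True: B healed the gap
print(sum(n.sent for n in cluster.values()))  # 4 sends in total
```

In this run the document crosses the network four times, the same as in the no-failure case, instead of the twenty sends of the naive scheme.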
Infinite scrolling is a way to automatically load data when you reach the end of the page. It allows you to continue scrolling indefinitely. The method is often used in social media feeds or blogs.
In this post, we'll create a Blazor component that you can use like the following:
<Infini
Earlier this week, we released RavenDB 5.2 to the world. This is an exciting release for a few reasons. We have a bunch of new features available and, as usual, the x.2 release is our LTS release. RavenDB 5.2 is compatible with all 4.x and 5.x releases; you can simply update your server binaries and you’ll be up and running in no time. RavenDB 4.x clients can talk to a RavenDB 5.2 server with no issue. Upgrading in a cluster (from 4.x or 5.x versions) can be done using rolling update mode, and mixed version clusters are supported (some features will not be available unless a majority of the cluster is running on 5.2, though).
Let’s start by talking about the new features; they are more interesting, I’ll admit.
OLAP ETL (see below for full details) – This is the flagship feature for RavenDB 5.2.
Rolling index deployment makes it easier for RavenDB to control resource usage when introducing new indexes.
Telegraf integration & Grafana template to aid monitoring.
Cluster wide dashboard to make it easy to track what is going on across the entire cluster.
Subscriptions tracking allows you to figure out exactly where your subscriptions are and what they are doing.
Read only certificates allow you to reduce the access level for certain users.
Spatial queries have gotten a lot of performance improvements as well as much nicer analysis and debugging capabilities.
Custom Analyzers make it easy to deploy your code for advanced full text scenarios.
Improved cluster wide transactions reduce the manual work you do to ensure the correct behavior of transactions across the cluster.
I’m going to be posting details about all those features, but I want to point out what is probably the most important aspect of this release, even beyond the OLAP ETL feature: RavenDB 5.2 is an LTS release.
Long Term Support release
LTS stands for Long Term Support. We support such releases for an extended period of time and they are recommended for production deployments and long term projects.
Our previous LTS release, RavenDB 4.2, was released in May 2019 and is still fully supported. Standard support for RavenDB 4.2 will lapse in July 2022 (a year from now); we’ll offer extended support for users who want to use that version afterward.
We encourage all RavenDB users to migrate to RavenDB 5.2 as soon as they are able.
OLAP ETL
This new feature deserves its own post (which will show up next week), but I wanted to say a few words on that. RavenDB is meant to serve as an application database, serving OLTP workloads. It has some features aimed at reporting, but that isn’t the primary focus. For almost a decade, RavenDB has supported a native ETL process that will push data on the fly from RavenDB to a relational database. The idea is that you can push the data into your reporting solution and continue using that infrastructure. Nowadays, people are working with much larger datasets and there is usually not a single reporting database to work with. There are data lakes (and presumably data seas and oceans, I guess) and the cloud has a much higher presence in the story. Furthermore, there is another interesting aspect for us. RavenDB is often deployed on the edge, but there is a strong desire to see what is going on across the entire system. That means pushing data from all the edge locations into the cloud and offering reports based on that.
To answer those needs, RavenDB 5.2 has the OLAP ETL feature. At its core, it is a simple concept. RavenDB allows you to define a script that will transform your data into a tabular format. So far, that is very much the same as the SQL ETL. The interesting bit happens afterward.
Instead of pushing the data into a relational database, we’ll generate a set of Parquet files (columnar data store) and push them to a cloud location.
On the cloud, you can use any data lake solution to process such files, issue reports, etc. For example, you can use Presto or AWS Athena to run queries over the uploaded files. You can define the ETL process in a single location or across your entire fleet of databases on the edge; they’ll push the data to the cloud automatically and transparently. All the how, when, failure management, retries and other details are handled for you. RavenDB is also capable of integrating with custom solutions, such as generating a one time token on each upload (no need to expose your credentials on the edge).
The end result is that you have a simple and native option to run large scale queries across your data, even if you are working in a widely distributed system. And even for those who run just a single cluster, you have a wider set of options on how to report and aggregate your data.
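To illustrate what "transform your data into a tabular format" and "columnar" mean in practice, here is a conceptual Python sketch. It is not the RavenDB ETL script API and it does not write real Parquet files. The document shape, field names and function names are all hypothetical, chosen only to show a nested document becoming flat rows and those rows being regrouped column by column.

```python
def to_rows(order):
    """Flatten one (hypothetical) order document into one row per line item."""
    return [
        {
            "order_id": order["id"],
            "company": order["company"],
            "product": line["product"],
            "quantity": line["quantity"],
        }
        for line in order["lines"]
    ]


def to_columns(rows):
    """Regroup rows so that each column's values are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


orders = [
    {"id": "orders/1", "company": "acme",
     "lines": [{"product": "milk", "quantity": 2}, {"product": "tea", "quantity": 1}]},
    {"id": "orders/2", "company": "initech",
     "lines": [{"product": "milk", "quantity": 5}]},
]

rows = [row for order in orders for row in to_rows(order)]
columns = to_columns(rows)
print(columns["quantity"])        # [2, 1, 5]
print(sum(columns["quantity"]))   # 8
```

Storing values per column is what lets a query engine read only the columns a report needs, which is the general appeal of columnar formats for large scale queries.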
I posted a few weeks ago about a performance regression in our metrics that we tracked down to the disk being exhausted. We replaced the hard disk with a new one. Can you see what the results were?
This is mostly because we were pretty sure that this is the problem, but couldn’t rule out that this was something else. Good to know that we were on track.
We recently added support for running RavenDB on an Ubuntu machine (or Debian) using DEB files. I thought that I would post a short walkthrough of how you can install RavenDB on such a machine. I’m running the entire process on a clean EC2 instance.
Steps beforehand, making sure that the firewall is set up appropriately:
Note that I’m opening up just the ports we need for actually running RavenDB.
Next is to go and fetch the relevant package. You can do that from the Download Page, where you can find the most up to date DEB file. SSH into the machine and then we’ll need to download and install the package:
$ sudo apt-get update && sudo apt-get install libc6-dev -y
$ wget --content-disposition https://hibernatingrhinos.com/downloads/RavenDB%20for%20Ubuntu%2020.04%20x64%20DEB/51027
$ sudo dpkg -i ravendb_5.1.8-0_amd64.deb
This will download and install the RavenDB package, after making sure that the environment is properly set up for it.
Here is what this will output:
### RavenDB Setup ###
#
# Please navigate to http://127.0.0.1:53700 in your web browser to complete setting up RavenDB.
# If you set up the server through SSH, you can tunnel RavenDB setup port and proceed with the setup on your local.
#
# For public address: ssh -N -L localhost:8080:localhost:53700 ubuntu@34.235.129.104
#
# For internal address: ssh -N -L localhost:8080:localhost:53700 ubuntu@ip-172-31-22-131
#
###
RavenDB is installed, but we now need to configure it. For security, RavenDB defaults to listening on localhost only; however, we are now running it on a remote server. That is why the installer output gives you the port forwarding command. We can exit SSH and run these commands, which lets us run the setup via secured port forwarding and set up a secured RavenDB instance in minutes.
(Originally sent to my weekly tips subscribers in March of 2019) When you look at a method or function, it should have a name that describes…