So what types of crashes can we handle in RonDB? At the beginning of NDB development we had to choose between optimising for fast restart times and surviving a maximum number of crashes. We decided to focus on surviving as many crashes as possible.
This led to the introduction of node groups. Each node group contains one part of the entire database; no data is shared between node groups (except, of course, fully replicated tables, which are replicated in all node groups).
Each node in a node group contains the full data set handled by that node group. The node group concept is more or less the same as the sharding concept used in Big Data applications. The main difference is that sharding is done at the application level, and each shard is a stand-alone DBMS that has no synchronisation with other shards.
In RonDB the shards (node groups) are all part of the same DBMS, and we support both transactions between shards and queries over all the shards in RonDB. Thereby RonDB supports cross-shard transactions, cross-shard queries, cross-shard foreign keys and cross-shard unique indexes, while still being able to operate very efficiently using partition pruned index scans that ensure that applications can scale.
RonDB supports Online Add Node as described earlier. This means that we can dynamically add new node groups (shards) and even reorganize the data in all tables as an online operation, while still allowing data to be updated in any manner. Thus RonDB supports online resharding.
An alternative could have been to spread partitions over all nodes in the cluster. In that case all nodes would assist a starting node, but at the same time such an architecture would never survive more than one crash at a time. As soon as a second node failed, the cluster could not continue to operate.
The node group concept makes it possible for the cluster to survive as long as at least one node per node group is still up.
Replication inside a cluster#
The most basic configuration variable in a RonDB installation is the NoOfReplicas variable. It can be set to 1, 2, 3 or 4. RonDB currently supports 1, 2 and 3 replicas; 4 replicas is still in beta phase.
1 means no replication at all and is only interesting in a one-node cluster, when RonDB is used as a storage engine for MySQL or as a very simple, fast hash table. It can also be interesting during application development to minimize the resources used.
2 replicas is the standard use case for RonDB. We have spent considerable effort to ensure that everything also works with 3 replicas, which is often a requirement in financial applications. Three replicas also turn out to be very useful in applications such as an Online Feature Store, where they ensure that we lose very little throughput on the loss of one replica.
The software can handle 4 replicas as well, but there are still some known bugs in this area.
Failures supported from a fully functional cluster#
Assume that we have a fully functional cluster where all nodes in all node groups are working. The question is how many failures we can survive.
The following rules apply in order: 1) If all nodes in one node group have failed, the cluster cannot survive (with 1 replica this obviously happens on any node failure).
Observation: If 1) is false, we have at least one node per node group that is still alive.
2) If at least one node group has all its nodes alive, the cluster will survive.
Proof: From the observation we know that no node group has completely failed, and at least one node group is fully alive. The failed nodes are thus missing all nodes of at least one node group, so by rule 1) they cannot survive on their own. Hence the part containing the alive nodes survives.
3) If all node groups have at least one alive node, but no node group has all nodes alive, the cluster will survive if a majority of the nodes are still alive.
Proof: Rules 1) and 2) establish that no node group is completely dead and that no node group is fully alive. We then have two sets of nodes that could each potentially survive. In this case the part with the most nodes alive will survive. The other cluster part will fail since it contains fewer than half of the nodes in the cluster and thus cannot form a majority.
4) If all node groups have at least one alive node, but no node group has all nodes alive, and exactly half the nodes are still alive, we use an arbitrator to decide which cluster part will survive. The cluster part that first asks the arbitrator for permission to continue operating will survive; the other part will get a refusal on its request if it manages to reach the arbitrator at all.
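The four rules above can be sketched as a small decision function. This is a hypothetical illustration of the logic, not RonDB source code; each node group is represented as a pair of alive and total node counts as seen by one cluster part.

```python
def cluster_part_survives(groups, total_nodes, arbitration_won=False):
    """Decide whether one cluster part survives, per the rules above.

    groups: list of (alive, total) pairs, one per node group, as seen
            by this cluster part.
    total_nodes: total number of data nodes in the whole cluster.
    arbitration_won: whether this part won arbitration (rule 4).
    """
    alive_nodes = sum(alive for alive, _ in groups)
    # Rule 1: a completely failed node group means no survival.
    if any(alive == 0 for alive, _ in groups):
        return False
    # Rule 2: a fully alive node group means this part survives.
    if any(alive == total for alive, total in groups):
        return True
    # Rule 3: otherwise a strict majority of all nodes survives.
    if 2 * alive_nodes > total_nodes:
        return True
    # Rule 4: exactly half the nodes alive -> the arbitrator decides.
    if 2 * alive_nodes == total_nodes:
        return arbitration_won
    return False
```

For example, with two node groups of two nodes each, losing one node per group leaves exactly half the cluster alive, so survival depends entirely on the arbitrator.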
Examples of failures#
The simplest example is a cluster with 2 data nodes in one node group. A failure of just one node here requires an arbitrator. Thus the arbitrator is required in the most common and simplest example of them all. In practice this means that we need to ensure that the arbitrator is placed on its own node and not colocated with either of the two data nodes. Otherwise a single failure would bring down the entire cluster in 50% of the cases.
This is why RonDB requires at least 3 computers to provide a highly available service. With only 2 nodes we can only survive software failures and 50% of the hardware failures (this covers a lot of cases and isn't that bad). The third computer only needs to run an ndb_mgmd program or a simple NDB API application. Thus it can be a much less resourceful computer than the data nodes.
If the cluster is executing on VMs, we still need to ensure that the 3 VMs are located on different servers, preferably without sharing too much hardware. Most clouds provide 3 failure zones in each availability zone/domain. Placing each data node and the management server in its own failure zone is thus a way to ensure that we provide a highly available service.
In a cluster of 2 replicas with multiple node groups we can survive a crash of one node per node group, but not crashes of two nodes in the same node group. This example shows an obvious reason why it isn't a good idea to place two nodes from the same node group on the same computer. By default node groups are formed from the order in the configuration file or by the order of node ids. The two data nodes with the lowest node ids are placed in the first node group (with 2 replicas), and so on moving forward.
It is possible to specify the node group id for a data node to arrange the nodes in node groups under user control. The configuration must contain as many nodes in each node group as there are replicas in the cluster (set by NoOfReplicas).
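As an illustration, a config.ini fragment for a 2-replica cluster with two explicitly arranged node groups could look as follows (the node ids and hostnames are hypothetical):

```ini
[ndbd default]
NoOfReplicas=2

# Node group 0: place the two replicas on different computers.
[ndbd]
NodeId=1
HostName=data-node-1
NodeGroup=0

[ndbd]
NodeId=2
HostName=data-node-2
NodeGroup=0

# Node group 1
[ndbd]
NodeId=3
HostName=data-node-3
NodeGroup=1

[ndbd]
NodeId=4
HostName=data-node-4
NodeGroup=1
```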
The arbitrator is used in one and only one case: when two viable cluster parts could be formed, each with nodes from all node groups. The arbitrator must have been set up before the crash happens, and all live nodes must agree on which node is the arbitrator before the crash happens.
When a node discovers the need to ask the arbitrator for permission to continue, it asks the arbitrator for its vote. The arbitrator will say yes to the first node that asks and no to all nodes coming after it. In this manner only one cluster part can survive.
Once the arbitrator has voted it cannot be used as an arbitrator any more. Immediately after using the arbitrator, the surviving nodes must select a new arbitrator (possibly the same one again).
The arbitrator is selected from the set of management servers, API nodes and MySQL Servers in the cluster. The configuration variable ArbitrationRank is set on those nodes to provide guidance on which of them can be selected as arbitrator. Setting it to 0 means the node will never be selected as arbitrator, and nodes with rank 2 will always be selected before nodes with rank 1.
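For example, ArbitrationRank could be set as follows to prefer the management server as arbitrator, use a MySQL Server as fallback, and exclude an API node entirely (node ids and hostnames are hypothetical):

```ini
[ndb_mgmd]
NodeId=49
HostName=mgmt-host
ArbitrationRank=2   # preferred arbitrator

[mysqld]
NodeId=51
HostName=sql-host
ArbitrationRank=1   # used only if no rank-2 node is available

[api]
NodeId=52
ArbitrationRank=0   # never selected as arbitrator
```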
It is also possible to write your own arbitrator and integrate it with RonDB. This could be useful e.g. if you absolutely cannot get hold of a third machine for the cluster and you feel safe that you can decide which machine is the one that should survive. It could also be useful when integrating RonDB with some Clusterware.
When a data node discovers that it has to ask an arbitrator which data nodes are to survive, it will log a message to the cluster log instead of sending a message to an arbitrator:
Continuing after wait for external arbitration, node: 1,2
where 1,2 is the list of node ids for the data nodes to arbitrate between.
The external clusterware should check for this message in the cluster log. When it discovers the message it should ensure that one of the data nodes in the list is properly killed. The surviving node will then assume that it won the arbitration, while the other node will have crashed. It is important to consider having a management server on both machines in this case.
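A clusterware script could watch for this message with something like the following sketch. The helper function is hypothetical; only the message format comes from the example above.

```python
import re

# Pattern for the external-arbitration message in the cluster log.
LOG_PATTERN = re.compile(
    r"Continuing after wait for external arbitration, node: ([\d,]+)")

def nodes_to_arbitrate(log_line):
    """Return the data node ids listed in the message, or [] if the
    line is not an external-arbitration message."""
    match = LOG_PATTERN.search(log_line)
    if not match:
        return []
    return [int(node_id) for node_id in match.group(1).split(",")]
```

The clusterware would run this over new cluster log lines, and on a non-empty result kill all but one of the listed data nodes.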
For this to work, the configuration variable Arbitration must be set to WaitExternal, and ArbitrationTimeout needs to be set to at least twice the interval at which the clusterware checks the cluster log.
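For example, if the clusterware checks the cluster log every 5 seconds, the configuration could look as follows (assuming milliseconds as the unit for ArbitrationTimeout, as in the NDB configuration):

```ini
[ndbd default]
# Log a message and wait for external clusterware to resolve
# arbitration instead of using an internal arbitrator.
Arbitration=WaitExternal
# At least twice the 5-second interval at which the clusterware
# polls the cluster log.
ArbitrationTimeout=10000
```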
Handling startup and network partitioning#
When starting after a crash we can provide information about how to proceed. There are a number of configuration variables and startup parameters to data nodes that specify how to handle startup.
StartPartialTimeout specifies how long we will wait before starting up a partial cluster. The cluster must not be partitioned: at least one node group must be fully started, but not all nodes in the cluster are needed to perform a partial start. This configuration parameter defaults to 30 seconds.
StartPartitionedTimeout specifies how long to wait before we decide to start a partitioned cluster. In this case we can end up with two network-partitioned clusters. By default this parameter is set to 0, meaning it will wait forever. It is highly recommended not to change it, which means that the cluster will never start up in a partitioned manner.
--nowait-nodes=list specifies a set of nodes that need not be waited for. This is a manual intervention where the DBA (DataBase Administrator) knows that these nodes are not up and running, and thus there is no risk of network partitioning from skipping them. It is ok to start a partial cluster as soon as at least one node group has all its nodes available, or all nodes except those specified in this list. At least one node per node group must still start for the cluster to start at all. This parameter is only a startup parameter for the RonDB data nodes.
Another special case is the handling of nodes belonging to node group 65536. Data nodes belonging to this node group are not part of any node group. They are put into the configuration such that at a later time we can run the command in the management client to add a new node group. These nodes are not vital to starting a cluster. The time to wait for those nodes to start is set with the configuration parameter StartNoNodeGroupTimeout, which defaults to 15 seconds.
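These startup timeouts could be set in the [ndbd default] section of the cluster configuration, for instance as follows (the values shown are simply the defaults described above, assuming milliseconds as the unit in the configuration file):

```ini
[ndbd default]
# Wait up to 30 seconds for all nodes before starting a partial
# (but not partitioned) cluster.
StartPartialTimeout=30000
# 0 = never start a partitioned cluster (recommended).
StartPartitionedTimeout=0
# Wait 15 seconds for nodes that belong to node group 65536.
StartNoNodeGroupTimeout=15000
```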
In RonDB we have the possibility to ensure that a subset of the nodes are not waited for when they are set to be inactive. This makes it possible to create a cluster with 3 replicas but still start up with only 1 replica. Thus one can start small and grow the cluster in terms of the number of replicas, the number of node groups and even the size of individual data nodes.
Handling network partitioning#
RonDB tries to avoid network partitioning of a cluster. The only possible way to get a partitioned cluster is to set the configuration parameter StartPartitionedTimeout to a non-zero value. We don't recommend using this option; RonDB has no way of getting back to a consistent state again once the cluster parts have been allowed to partition.
To further ensure that we cannot get into strange situations here, we record which nodes formed the cluster in this case. The nodes that were excluded are not allowed back into the cluster other than through an initial node restart.
A similar approach is used for the --nowait-nodes parameter.
RonDB definitely tries to avoid partitioned clusters and focuses on Consistency and Availability in the CAP theorem. However, RonDB supports global replication; here we can survive in a network-partitioned state where two clusters that replicate to each other both continue to operate after the network has been partitioned. Later, when the network heals, the normal replication will ensure that changes are replicated in both directions and that conflicts are properly handled. Thus a local RonDB cluster focuses on the C and A in the CAP theorem, whereas replication between local RonDB clusters handles the P part.
The most demanding users always combine a local RonDB cluster with replication to one or more other local RonDB clusters to provide the very highest availability.