Failure Protocols#
Node Registration Protocol#
All recovery protocols contain a master role. This master role must be quickly taken over by another node when a failure happens, and all nodes must select the same master node. To make this possible we have a node registration protocol: nodes enter the cluster in a specific order that all members are informed about. This order is then used to always select the oldest member of the cluster as the master.
Only one node at a time can be added to the cluster. This protocol is described in some more detail in the source file storage/ndb/src/kernel/blocks/qmgr/QmgrMain.cpp.
The nodes currently in the cluster must agree on the node to add next. The reason is that nodes have a dynamic order id, so we always know which node is the currently oldest node in the cluster, and that oldest node is always selected as the master.
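As a rough illustration of the idea (not the actual Qmgr code), selecting the master amounts to picking the node with the lowest dynamic order id among the alive nodes; the function name, types, and the map of dynamic order ids below are hypothetical.

```cpp
#include <cstdint>
#include <map>

using NodeId = uint32_t;

// Hypothetical sketch: each alive node is mapped to the dynamic order id it
// received when it registered. The oldest member (lowest dynamic order id)
// is always chosen as master, so every node arrives at the same answer.
NodeId select_master(const std::map<NodeId, uint32_t>& dynamic_order_of_alive_nodes)
{
  NodeId master = 0;
  uint32_t oldest = UINT32_MAX;
  for (const auto& [node_id, dynamic_order] : dynamic_order_of_alive_nodes)
  {
    if (dynamic_order < oldest)
    {
      oldest = dynamic_order;
      master = node_id;
    }
  }
  return master;  // 0 means no alive nodes in this sketch
}
```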
Heartbeat Protocol#
The heartbeat protocol uses the order of nodes established by the node registration protocol: each node sends heartbeats to the next node in the dynamic order and expects to receive heartbeats from the node before it in that order. The newest node sends heartbeats to the oldest node, and conversely the oldest node expects heartbeats from the newest node.
There is a dedicated heartbeat signal, but normally it isn't needed: every signal received from a node is treated as a heartbeat, so the transport layer assists in the heartbeat protocol. The reason is that under high load the dedicated heartbeat signal sometimes ended up so far back in the send buffer that the heartbeat timeout expired before the signal reached the other node.
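A minimal sketch of the ring layout described above, assuming we have the list of alive nodes sorted by dynamic order (the names below are illustrative, not the Qmgr implementation): each node sends heartbeats to its successor in that order and expects them from its predecessor, with the list wrapping around so the newest and oldest nodes are neighbours.

```cpp
#include <cstdint>
#include <vector>

using NodeId = uint32_t;

// Which nodes this node exchanges heartbeats with.
struct HeartbeatNeighbours
{
  NodeId send_to;      // next node in dynamic order
  NodeId expect_from;  // previous node in dynamic order
};

// nodes_in_dynamic_order is sorted oldest-first, i.e. index 0 is the current
// master and the last element is the newest member of the cluster.
HeartbeatNeighbours
heartbeat_neighbours(const std::vector<NodeId>& nodes_in_dynamic_order,
                     size_t my_index)
{
  const size_t n = nodes_in_dynamic_order.size();
  return {
    nodes_in_dynamic_order[(my_index + 1) % n],      // newest wraps to oldest
    nodes_in_dynamic_order[(my_index + n - 1) % n],  // oldest wraps to newest
  };
}
```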
Watchdog Handling#
Every thread in the RonDB data nodes is protected by a watchdog thread. Most of the time the watchdog does not show up at all. As soon as a thread is unresponsive for more than 100 milliseconds, RonDB prints a message about it. Watchdog printouts can be caused by bugs in the RonDB code, but they can also occur if the OS has problems delivering a real-time experience. During the early phases of a restart it is fairly normal to see a few watchdog printouts during memory allocation.
If the watchdog sees no progress in a thread for the watchdog timeout (about one minute), it will crash the data node. If this happens it is usually due to a RonDB bug where we have ended up in an infinite loop. The watchdog thread is there to ensure that we still end up with a crash (and thus a restart) even when the faulty code never reaches a crashing statement.
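The following is a hedged sketch of the watchdog idea, not the RonDB implementation: the protected thread bumps a counter whenever it makes progress, and a separate watchdog thread warns when the counter has been unchanged for more than 100 milliseconds and crashes the process when it has been unchanged for the full watchdog timeout. All names and the exact timings are illustrative.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>

// Illustrative only: one counter per protected thread. The protected thread
// calls report_progress() regularly; the watchdog thread only reads it.
std::atomic<uint64_t> progress_counter{0};

void report_progress() { progress_counter.fetch_add(1, std::memory_order_relaxed); }

void watchdog_thread()
{
  using namespace std::chrono;
  auto last_value  = progress_counter.load();
  auto last_change = steady_clock::now();
  const auto warn_after  = milliseconds(100);  // print a warning
  const auto crash_after = minutes(1);         // give up and crash the data node
  for (;;)
  {
    std::this_thread::sleep_for(milliseconds(10));
    const auto value = progress_counter.load();
    const auto now   = steady_clock::now();
    if (value != last_value)
    {
      last_value  = value;
      last_change = now;
    }
    else if (now - last_change > crash_after)
    {
      std::fprintf(stderr, "Watchdog: thread stuck, crashing data node\n");
      std::abort();
    }
    else if (now - last_change > warn_after)
    {
      std::fprintf(stderr, "Watchdog: thread unresponsive for >100 ms\n");
    }
  }
}
```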
Node Failure Protocol#
One important part of distributed transaction theory is that node failures have to be handled as transactions: every transaction has to be serialised in relation to the node failure. To achieve this we use the heartbeat protocol to quickly discover nodes that have failed, and we use the node failure protocol to collectively decide, using a two-phase protocol, on the set of nodes that failed together. We detect node failures in many other ways as well, so most node failures that are detected by the heartbeat mechanism are due to real hardware failures, some sort of overload, or an operating system that has stalled our process. The latter was fairly common in early versions of virtual machine implementations.
Node failures are discovered in numerous ways. If a data node closes the TCP/IP socket to another data node, the other node will treat this as a node failure. Missing heartbeats for more than the heartbeat period (four times the heartbeat timeout) will trigger a node failure handling protocol.
Sometimes a node behaves in a strange manner; in this case the node might be shut down by another node sending a shutdown signal to it.
When the node failure protocol starts, it sends a prepare message to all nodes with the list of failed nodes. A node receiving this message will be able to add nodes to this list, but not be able to remove any.
When all nodes have replied, we might have a new set of failed nodes. If so, we send another round of prepare messages. Eventually the nodes agree on the set of failed nodes, and after that we commit the node failures. Thus each node failure set is agreed upon by all surviving nodes in the cluster, and we can trust that all nodes see the same view of the list of alive nodes after the node failure transaction has completed.
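A simplified sketch of this agreement step, assuming a coordinator repeats the prepare round until no node adds anything to the failed-node set; the function names and the message plumbing are hypothetical, the real protocol lives in the Qmgr block.

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

using NodeId  = uint32_t;
using NodeSet = std::set<NodeId>;

// prepare_round(node, proposed) is a hypothetical hook that asks one surviving
// node which nodes it considers failed given our proposal; it may add nodes to
// the set but never remove any.
NodeSet agree_on_failed_nodes(
    const std::vector<NodeId>& surviving_nodes,
    NodeSet failed,
    const std::function<NodeSet(NodeId, const NodeSet&)>& prepare_round)
{
  // Repeat the prepare phase until the failed-node set stops growing.
  bool changed = true;
  while (changed)
  {
    changed = false;
    for (NodeId node : surviving_nodes)
    {
      for (NodeId extra : prepare_round(node, failed))
      {
        if (failed.insert(extra).second)
          changed = true;  // a node was added, run one more prepare round
      }
    }
  }
  // The set is now stable; the commit phase broadcasts it so all surviving
  // nodes share the same view of which nodes are alive.
  return failed;
}
```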
Now that we have decided on the set of failed nodes we will start handling the node failures. This includes handling each and every distributed protocol.
In particular, it involves handling the failure of the transaction coordinator in many transactions as described in the takeover protocol of the non-blocking 2PC protocol.
There is an extension available to the node failure protocol that starts by performing a full check of connectivity among the nodes in the cluster before initiating the node failure protocol. To activate it, set the configuration variable ConnectionCheckIntervalDelay to a non-zero value.
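For example, the cluster configuration file could enable the connectivity check like this; the 1500 millisecond value is only an illustration.

```ini
[NDBD DEFAULT]
# Check connectivity among the data nodes before starting the node
# failure protocol; 0 (the default) disables the check.
ConnectionCheckIntervalDelay=1500
```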
Graceful Shutdown Protocol#
A crash shutdown works just fine in many cases. However, we have implemented a graceful shutdown to ensure that no transactions are aborted during a planned shutdown of a node.
A graceful shutdown is initiated from the management client (the ndb_mgm binary) towards the management server. It can also be initiated by sending kill -TERM to the data node process; this is how managed RonDB stops a node.
A graceful shutdown will ensure that no more transactions are started in the node. It will allow already started transactions to complete before it shuts down the node. If transactions are still running after a few seconds, they will be aborted.
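As an example, a graceful shutdown of data node 2 could be triggered in either of these ways; the node id and process id are placeholders.

```bash
# From the management client: gracefully stop data node 2
ndb_mgm -e "2 STOP"

# Or send SIGTERM directly to the data node process
kill -TERM 12345
```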