Recovery Overview

RonDB employs several mechanisms to manage failures while preserving both availability and performance. It is engineered to recover from software failures in milliseconds and from hardware or operating system failures in seconds. The recovery architecture is central to reaching 99.9999% availability (Class 6 availability).
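To give a sense of scale, the downtime budget implied by an availability target can be worked out with simple arithmetic. The sketch below is illustrative only: the function name `downtime_seconds_per_year` is hypothetical, and a 365-day year is assumed.

```python
# Assumption: a 365-day year (leap years ignored for simplicity).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


if __name__ == "__main__":
    # Compare a few common availability classes.
    for target in (99.9, 99.99, 99.999, 99.9999):
        budget = downtime_seconds_per_year(target)
        print(f"{target}% availability allows {budget:.1f} s of downtime per year")
```

Under that assumption, Class 6 availability leaves roughly 31.5 seconds of downtime per year, which is why recovery times measured in milliseconds and seconds matter.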

Recovery Architecture

The recovery architecture is an integral component of RonDB’s internal design. It can be categorized into three sets of protocols:

Safety Protocols: These protocols are designed to persist data to disk to prevent loss in case of failures. They include:

- REDO Log

- Global Checkpoint Protocol

- Local Checkpoint Protocol

- Schema Transaction Protocol

Failure Protocols: These protocols are the backbone of the node life-cycle and are designed to maintain system stability during node failures:

- Node Registration Protocol (part of Heartbeat Protocol)

- Heartbeat Protocol

- Takeover Protocol in Non-blocking Two-Phase Commit (not discussed here)

- Watchdog Protocol

- Node Failure Protocol

- Graceful Node Shutdown Protocol

Startup Protocols: These are critical for restoring the system after a failure. They comprise:

- Node / Cluster Start Protocol

- Node / Cluster Restart Protocol

All protocols except the registration protocol need to handle the following cases:

- Multiple node failures, including the master node: When nodes fail while the master is still alive, a response signal from each failed node is normally faked so that the protocol does not stall. When the master itself fails, a new master takes over each active protocol and runs a discovery protocol that asks for information about the protocol state. Each interrupted protocol can therefore safely continue or abort.

- Failure of the new master during master takeover: The discovery process that runs during master takeover must itself handle the case where the new master fails.
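The two cases above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not RonDB's implementation: the names `ProtocolRound` and `master_takeover`, the state strings `"prepared"` and `"committing"`, the rule that the lowest surviving node id becomes the new master, and the continue/abort rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set, Tuple


@dataclass
class ProtocolRound:
    """One round of a master-driven protocol that waits for every participant."""
    participants: Set[int]
    replies: Dict[int, str] = field(default_factory=dict)

    def on_reply(self, node_id: int) -> None:
        self.replies[node_id] = "real"

    def on_node_failure(self, node_id: int) -> None:
        # Fake the reply the failed node would have sent, so the round
        # does not wait forever on a node that will never answer.
        if node_id in self.participants and node_id not in self.replies:
            self.replies[node_id] = "faked"

    def is_complete(self) -> bool:
        return set(self.replies) == self.participants


def master_takeover(
    states: Dict[int, str],
    fails_during_takeover: FrozenSet[int] = frozenset(),
) -> Tuple[int, str]:
    """Return (new_master, decision) for an interrupted protocol.

    `states` maps each surviving node id to its view of the protocol state.
    """
    survivors = dict(states)
    while survivors:
        candidate = min(survivors)  # assumption: lowest node id takes over
        if candidate in fails_during_takeover:
            # The new master failed mid-discovery: drop it and restart discovery.
            del survivors[candidate]
            continue
        # Discovery: the new master collects every survivor's protocol state.
        # Hypothetical rule: continue if any survivor already saw the commit phase.
        decision = "continue" if "committing" in survivors.values() else "abort"
        return candidate, decision
    raise RuntimeError("no surviving node can become master")
```

For example, a round over nodes 1, 2 and 3 completes once nodes 1 and 2 reply and node 3 is reported failed, and `master_takeover({2: "prepared", 3: "prepared"}, frozenset({2}))` falls through to node 3 after node 2 fails during its own takeover.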

All these protocols are discussed in detail in the following sections.