REDO Log & Global Checkpoint Protocol
RonDB optimizes for low-latency applications where hard drive writes during the commit process are infeasible. It achieves the D (Durability) in ACID (Atomicity, Consistency, Isolation, Durability) through Network Durability. During a transaction commit, the data is committed in memory across multiple machines, so that all reads after this commit will see the new data. This ensures resilience against multiple node failures, but not against a complete cluster failure.
To ensure transaction durability on persistent storage media, RonDB writes REDO logs to disk asynchronously as part of the Global Checkpoint Protocol. The protocol is a means of making the REDO log transaction-consistent.
Protocol
RonDB flushes its transaction records to disk every 1-2 seconds using the REDO log. However, in the event of a cluster failure, we need a transaction-consistent point to recover from. This is achieved by the Global Checkpoint Protocol (GCP). This protocol is a way of labeling the transaction records in the REDO log with a global checkpoint identifier (GCI) and making sure that we have always flushed all transactions up to a certain GCI to disk. Each GCP is therefore a consistent, recoverable point in the event of a cluster crash.
In the following, we describe in detail how the GCP protocol works; the figure below shows its general architecture. The protocol runs between the data nodes’ DBDIH blocks (every data node runs a single DBDIH block). The master data node / DBDIH block is the oldest living data node, and it is this node that initiates the protocol.
The DBDIH is also the block that holds the current GCI. Whenever a transaction coordinator (TC) decides to commit a transaction, it will ask its local DBDIH for the GCI and pass it to the participants via the commit message. Read more about the role of the TC in RonDB’s two-phase commit protocol here.
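As a rough illustration (not RonDB's actual internal API), a commit stamped with the GCI obtained from the local DBDIH could look like the following sketch; all type and function names are hypothetical:

```cpp
// Hypothetical sketch: a TC reads the current GCI from its local DBDIH
// and stamps the commit message with it, so every participant records
// the same GCI for this transaction.
#include <atomic>
#include <cstdint>
#include <cstdio>

struct Dbdih {
  std::atomic<uint64_t> current_gci{1};
  uint64_t get_gci() const { return current_gci.load(); }
};

struct CommitMsg {
  uint64_t transaction_id;
  uint64_t gci;   // all REDO log records of this transaction carry this GCI
};

CommitMsg build_commit(const Dbdih& local_dih, uint64_t transaction_id) {
  return CommitMsg{transaction_id, local_dih.get_gci()};
}

int main() {
  Dbdih dih;
  CommitMsg msg = build_commit(dih, 42);
  std::printf("txn %llu committed in GCI %llu\n",
              (unsigned long long)msg.transaction_id,
              (unsigned long long)msg.gci);
}
```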
First Phase of Global Checkpoint Protocol
The first part of the protocol is run every 10-100 ms; each such interval is called an epoch.
First, the master data node decides to create a new global checkpoint. It sends a message called GCP_PREPARE to all data nodes in the cluster (including itself). At reception of the GCP_PREPARE message, the DBDIHs will queue any TCs that ask for a GCI, which means that new commit messages briefly cannot be sent out. The nodes then respond to the master DBDIH with GCP_PREPARECONF.
Generally, this blocking of commits affects performance only minimally, since it gives prepare messages a larger share of the CPU resources. Furthermore, the resulting queue of commits allows a sort of batch commit that improves throughput once the commit block is released.
Next, the master data node sends GCP_COMMIT to all nodes. At reception of GCP_COMMIT, the DBDIHs will bump the GCI and release the queue of TCs waiting for the GCI. New commits can now be sent out again. The DBDIHs then respond to the master DBDIH with GCP_NODEFINISH.
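The following condensed sketch illustrates this first phase as just described: on GCP_PREPARE the DBDIH queues TCs asking for a GCI, and on GCP_COMMIT it bumps the GCI and releases the queue. The structure and names are illustrative only, not the actual DBDIH implementation:

```cpp
// Illustrative DBDIH state for the first phase of the GCP protocol.
#include <cstdint>
#include <cstdio>
#include <queue>

struct Tc { uint64_t id; };

struct Dbdih {
  uint64_t gci = 100;
  bool gcp_prepared = false;
  std::queue<Tc> waiting_tcs;   // TCs blocked while a GCP is being prepared

  // GCP_PREPARE: stop handing out GCIs; new commit requests are queued.
  void on_gcp_prepare() { gcp_prepared = true; }

  // GCP_COMMIT: bump the GCI, then let the queued TCs commit in the new GCI.
  void on_gcp_commit() {
    gci += 1;
    gcp_prepared = false;
    while (!waiting_tcs.empty()) {
      Tc tc = waiting_tcs.front();
      waiting_tcs.pop();
      std::printf("TC %llu commits in GCI %llu\n",
                  (unsigned long long)tc.id, (unsigned long long)gci);
    }
  }

  // A TC asking for the GCI either gets it immediately or is queued.
  void request_gci(Tc tc) {
    if (gcp_prepared) {
      waiting_tcs.push(tc);
    } else {
      std::printf("TC %llu commits in GCI %llu\n",
                  (unsigned long long)tc.id, (unsigned long long)gci);
    }
  }
};

int main() {
  Dbdih dih;
  dih.request_gci({1});    // commits in GCI 100
  dih.on_gcp_prepare();
  dih.request_gci({2});    // queued while the GCP is prepared
  dih.on_gcp_commit();     // GCI becomes 101, TC 2 commits in 101
}
```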
The reason why this part of the protocol is run so frequently is that we distinguish between micro GCPs and persistent GCPs. A GCI consists of a low and a high number. Only when the master data node decides to bump the high number do we move into the second part of the protocol in order to create a persistent GCP.
The micro GCPs are used for Global Replication, so that a backup cluster can receive a consistent set of transactions more frequently.
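The following small sketch shows the two-part GCI, assuming an NDB-style layout where the high 32 bits hold the persistent GCP number and the low 32 bits hold the micro GCP (epoch) number; the exact representation inside RonDB may differ:

```cpp
// Sketch of a two-part GCI: high part bumped for persistent GCPs,
// low part bumped for micro GCPs (epochs).
#include <cstdint>
#include <cstdio>

struct Gci {
  uint64_t value;
  uint32_t high() const { return (uint32_t)(value >> 32); }            // persistent GCP
  uint32_t low()  const { return (uint32_t)(value & 0xffffffffULL); }  // micro GCP
  // Every 10-100 ms: new micro GCP (epoch), high part unchanged.
  void bump_micro()      { value += 1; }
  // Every 1-2 s: new persistent GCP, micro counter restarts at 0.
  void bump_persistent() { value = ((uint64_t)(high() + 1)) << 32; }
};

int main() {
  Gci gci{((uint64_t)100 << 32) | 7};
  gci.bump_micro();
  std::printf("GCI %u/%u\n", gci.high(), gci.low());   // 100/8
  gci.bump_persistent();
  std::printf("GCI %u/%u\n", gci.high(), gci.low());   // 101/0
}
```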
Due to this two-phased approach, where we briefly block our TCs from committing, we are certain to get a proper serialisation point between the global checkpoints. This is a requirement in order to recover a consistent database state from a GCP. After 15 years of operating the cluster, we have still had no issues with this phase of the Global Checkpoint Protocol.
Second Phase of Global Checkpoint Protocol
The second phase makes the completed GCP durable on disk. It is usually run every 1-2 seconds, but this interval can be reduced or increased. The trade-off lies between the frequency of disk flushes and the volume of data potentially lost in a crash.
The first step is that each node needs to wait until its transaction coordinators have completed the processing of all transactions belonging to the completed GCI. Normally this is fast, but large transactions can slow it down. In particular, transactions changing many rows with disk columns can be a challenge here.
When all transactions in the GCP have completed in the node, we send the message GCP_NODEFINISH back to the master data node.
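A sketch of this wait, under assumed data structures: the node tracks how many transactions are still executing per GCI and only reports completion once the finished GCI has no outstanding work:

```cpp
// Illustrative tracker for "all transactions of this GCI are done".
#include <cstdint>
#include <cstdio>
#include <map>

struct GcpCompletionTracker {
  std::map<uint64_t, uint32_t> outstanding;   // GCI -> running transactions

  void transaction_started(uint64_t gci)  { outstanding[gci]++; }
  void transaction_finished(uint64_t gci) {
    if (--outstanding[gci] == 0) outstanding.erase(gci);
  }
  // True once every transaction belonging to this (or an older) GCI is done.
  bool gci_complete(uint64_t gci) const {
    return outstanding.empty() || outstanding.begin()->first > gci;
  }
};

int main() {
  GcpCompletionTracker t;
  t.transaction_started(100);
  t.transaction_started(101);          // already belongs to the next GCI
  std::printf("GCI 100 complete: %d\n", t.gci_complete(100));  // 0
  t.transaction_finished(100);
  std::printf("GCI 100 complete: %d\n", t.gci_complete(100));  // 1, send GCP_NODEFINISH
}
```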
When all nodes have finished the commit phase of the Global Checkpoint Protocol, the next phase is to flush the REDO logs in all nodes. The message GCP_SAVEREQ is sent to all nodes to ask them to flush the REDO log up to the GCI. The REDO log implementation will then flush all transactions up to the last one containing this GCI.
It is perfectly possible that some log records belonging to the next GCI are flushed to disk as part of this. There is no serial order of GCIs in one specific REDO log. This is not a problem, since we know at recovery which GCI we are recovering from and will skip any log records with a higher GCI.
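An illustrative sketch of this recovery rule: replay REDO log records only up to the recoverable GCI and skip any record tagged with a higher GCI. The record layout and names are assumptions, not RonDB's actual log format:

```cpp
// Replay a REDO log up to a recoverable GCI, skipping newer records.
#include <cstdint>
#include <cstdio>
#include <vector>

struct RedoRecord {
  uint64_t gci;
  uint64_t row_id;   // placeholder for the actual operation payload
};

void replay(const std::vector<RedoRecord>& log, uint64_t recoverable_gci) {
  for (const RedoRecord& rec : log) {
    if (rec.gci > recoverable_gci) {
      // Flushed as a side effect of flushing the recoverable GCI; ignore.
      continue;
    }
    std::printf("replay row %llu (GCI %llu)\n",
                (unsigned long long)rec.row_id, (unsigned long long)rec.gci);
  }
}

int main() {
  std::vector<RedoRecord> log = {{100, 1}, {100, 2}, {101, 3}, {100, 4}};
  replay(log, /*recoverable_gci=*/100);  // the record with GCI 101 is skipped
}
```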
When the flush of the REDO logs is complete, the nodes respond with GCP_SAVECONF.
At this point there is enough information to recover this GCI. To simplify the recovery, we use one more phase where we write a small system file that contains the recoverable GCI. This is accomplished by sending COPY_GCIREQ to all nodes. These will then write a special file called P0.sysfile containing the GCI. It is written in a safe manner in two places to ensure that it is always available at recovery. Writing this file allows us to start up without needing to scan all REDO logs for the GCI to recover.
When the file is written, the nodes will respond with COPY_GCICONF. The GCP is now completed and recoverable.
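To close, a minimal sketch of the idea behind the system file: persist the recoverable GCI in a small file written to two locations, so that at least one intact copy is available at recovery. The paths and layout are purely illustrative and do not reflect RonDB's actual file format:

```cpp
// Write the recoverable GCI redundantly to two small files.
#include <cstdint>
#include <cstdio>

bool write_sysfile(const char* path, uint64_t recoverable_gci) {
  std::FILE* f = std::fopen(path, "wb");
  if (!f) return false;
  bool ok = std::fwrite(&recoverable_gci, sizeof(recoverable_gci), 1, f) == 1;
  ok = (std::fflush(f) == 0) && ok;   // force the data out before reporting success
  std::fclose(f);
  return ok;
}

int main() {
  uint64_t gci = 100;
  // Two copies; recovery can fall back to the second if the first is damaged.
  bool ok = write_sysfile("P0.sysfile.copy1", gci) &&
            write_sysfile("P0.sysfile.copy2", gci);
  std::printf("sysfile written: %s\n", ok ? "yes" : "no");
}
```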