A popular protocol for replicating changes between nodes in a distributed database is the two-phase commit protocol. Recently many systems have implemented other, more complex protocols, most of which require three phases. The purpose of the third phase is to make the protocol non-blocking.
In the two-phase commit protocol a node is blocked in the prepare phase until the transaction coordinator has spread its decision on whether the transaction committed or aborted.
Being blocked here can mean that nodes have to hold locks for a long time while waiting for a node to recover. This is not a desirable feature of a replication protocol.
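The blocking window can be made concrete with a minimal sketch (not RonDB code; all names are illustrative): a participant that has voted yes in the prepare phase must hold its locks until it hears the coordinator's decision, so a failed coordinator leaves it stuck.

```python
# Minimal sketch of why classic two-phase commit blocks: after voting YES,
# a participant must hold its locks until the coordinator's decision arrives.

class Participant:
    def __init__(self, name):
        self.name = name
        self.locks_held = False
        self.decision = None   # None = still blocked in the prepare phase

    def prepare(self):
        # Vote YES: locks must now be held until the coordinator decides.
        self.locks_held = True
        return "YES"

    def decide(self, outcome):
        # Only the coordinator's decision releases the locks.
        self.decision = outcome
        self.locks_held = False

p = Participant("node-1")
p.prepare()
assert p.locks_held          # blocked: waiting for the coordinator
p.decide("COMMIT")           # if this message never comes, the locks stay
assert not p.locks_held
```

If the coordinator crashes between `prepare` and `decide`, nothing in the protocol itself lets the participant release its locks; this is the problem the third phase (and RonDB's take-over protocol below) is designed to remove.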
We have also added a third phase, the complete phase. Even so, it is possible to reply to the application already after the second phase. We discussed this in an earlier chapter where we described the Read Backup feature.
Idea behind new two-phase commit protocol
In RonDB we kept some of the ideas of the two-phase commit protocol, but we also changed a lot of important details in it.
Distributed commit combined with linear commit
One variant of the two-phase commit protocol is called the linear commit protocol. This protocol sends the prepare message from the first participant to the last participant in serial order. Instead of returning to the transaction coordinator to commit the transaction, the transaction is committed in the last participant. Next the commit message is sent in serial order back to the first participant.
The original two-phase commit protocol uses 4 * n messages to complete a transaction (where n is the number of participant nodes in the transaction). The linear two-phase commit protocol uses 2 * (n - 1) messages instead. Thus the linear commit protocol saves more than half of the messages. This is very important for throughput.
At the same time, the latency of the original two-phase commit protocol is shorter. Its latency is that of two rounds of parallel messages, whereas the latency of the linear commit protocol grows as the number of nodes increases.
Given that we want to achieve scalability AND predictable latency, it is beneficial to use a combination of the original protocol and the linear commit protocol.
Before presenting our new protocol, we mention that we use primary copy locking. Thus we need to lock the primary replica row before we lock the backup replica row. Failure to use this order can easily lead to deadlocks that could have been avoided.
Our idea is to use linear commit across the replicas of each row, but to use the original commit protocol across rows. Thus the number of messages we send is 2 * m * (r + 1) (where m is the number of rows in the transaction and r is the number of replicas). If we were to use the original commit protocol only, we would send 4 * m * r messages.
An example with the number of rows (m) set to 10 and the number of replicas (r) set to 2 gives 80 messages for the original protocol and 60 messages for our new protocol. The latency of our protocol is much better than that of the linear commit protocol and slightly worse than that of the original commit protocol. Given that we decreased the cost of communication, we expect to win this latency back at high loads. We also use a lot of piggybacking to ensure that the real number of messages is much lower.
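The message-count arithmetic above can be checked with a short sketch; the function names are invented for illustration, the formulas are the ones from the text.

```python
# Message counts for the commit protocol variants discussed in the text.

def original_2pc(n):
    # Coordinator exchanges prepare/ack and commit/ack with each of the
    # n participants: 2 messages per participant per phase.
    return 4 * n

def linear_2pc(n):
    # Prepare travels down the chain of n participants, commit travels
    # back: (n - 1) messages each way.
    return 2 * (n - 1)

def rondb_protocol(m, r):
    # Linear commit inside each row: r + 1 hops per phase (including the
    # transaction coordinator), rows handled in parallel across the
    # transaction as in the original protocol.
    return 2 * m * (r + 1)

def original_per_row(m, r):
    # Original protocol applied per row: 4 messages per replica per row.
    return 4 * m * r

# The example from the text: 10 rows, 2 replicas.
print(original_per_row(10, 2))  # 80
print(rondb_protocol(10, 2))    # 60
```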
Description of new protocol
The figure below shows the new transaction protocol.
The idea is to start the prepare phase by going to the primary replica; this ensures that we minimise the risk of deadlocks. The risk of deadlocks is high in the original two-phase commit protocol since we can lock the replicas in different order for different transactions. This can lead to a deadlock even when the application has ensured a transaction ordering that is free from deadlocks.
The use of linear commit is important not only to minimise the number of messages, but also to ensure that we avoid distributed deadlocks, which make it very difficult to scale an application even if it is properly designed.
Next we move to the backup replicas and from there back to the transaction coordinator.
Now when it is time to commit the transaction we use the reverse linear order. The idea with this order is that it makes it possible to release the locks on the primary replica already in the commit phase. This has the advantage that if we route reads to the primary replica we can report transaction commit to the application already after the commit phase and still see our own updates. If we want to read any replica we still have to wait for the complete phase to complete before reporting the commit to the application.
The complete phase releases the locks in the backup replica. It also releases the memory attached to the transaction in the form of copy rows and operation records used to contain transaction state. A certain level of parallelism is used in the transaction commit where several rows can be committed in parallel and likewise for the complete phase.
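The per-row message flow described above can be traced with a small sketch (illustrative only, not RonDB source): prepare runs from the transaction coordinator (TC) through the primary and backup replicas and back, while commit runs the chain in reverse so that the primary commits last and can release its locks already in the commit phase.

```python
# Trace the per-row message flow of the new protocol. Replica names and the
# tuple format are invented for illustration.

def row_message_flow(replicas):
    primary, backups = replicas[0], replicas[1:]
    flow = []
    # Prepare phase: TC -> primary -> backups -> TC, in linear order.
    chain = ["TC", primary] + backups + ["TC"]
    flow += [(src, dst, "PREPARE") for src, dst in zip(chain, chain[1:])]
    # Commit phase: reverse linear order, so the primary is committed last
    # and can release its locks without waiting for the complete phase.
    rev = ["TC"] + backups[::-1] + [primary, "TC"]
    flow += [(src, dst, "COMMIT") for src, dst in zip(rev, rev[1:])]
    return flow

for step in row_message_flow(["primary", "backup-1"]):
    print(step)
```

The complete phase, which releases the backup replicas' locks and the transaction memory, would follow the same forward order; it is omitted here for brevity.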
Achieving a non-blocking commit protocol
Before we move on to prove that the new protocol is non-blocking, we need to make a few observations.
First, the transaction coordinator is no longer the commit point in the transaction. Instead the commit point is when the first primary replica decides to commit the transaction. Before the commit message arrives at the primary replica it has passed through all backup replicas. Thus at the commit point all replicas have decided on the commit.
All replicas reside within one node group in RonDB; if all nodes within a node group fail, the cluster fails as well. Thus at a node failure that doesn't bring down the cluster, we are certain that the surviving nodes contain the information needed to decide whether the transaction should commit or abort.
Thus we are never blocked by any node failure that doesn't cause a crash of the cluster.
If the cluster crashes completely we will always need recovery before transactions can start again. We can never become blocked in this case. Thus our commit protocol is a non-blocking protocol if we can make use of the information in the transaction participants to complete transactions after node failures.
We implement a take-over protocol to rebuild the transaction state of transactions that lost their transaction coordinator to ensure that we can always decide on commit or abort of any ongoing transactions in the cluster. More on this protocol below.
We have shown that our new two-phase commit protocol is non-blocking. There are no node failures that can cause transactions to hang waiting for nodes to recover before the transaction can complete.
If the nodes required to complete a transaction are missing, this also means that there is an insufficient number of nodes to keep the cluster up and running. If the cluster crashes it will perform recovery and recover those transactions that are also durable on disk.
Take over protocol
The take-over protocol asks all nodes in the cluster to send information about any ongoing transaction where the failed node is transaction coordinator. The node asking is the master node that will take over the completion of these transactions.
After receiving all information about a set of transactions, the new transaction coordinator decides whether each transaction should be committed or aborted based on how far it got in the process. If any node heard a commit decision, the transaction is committed. The new transaction coordinator also informs the API of the transaction outcome.
This protocol ensures that the transactions ongoing in the failed node at the time of the crash are completed before we report the node failure handling as completed. When node failure handling is completed, none of those transactions still linger in the surviving data nodes; they are all committed or aborted.
Similarly the API nodes have been informed of the transaction outcome, with one exception. The exception is when using the Read Backup feature, in which case the transaction outcome is reported after the complete phase. In that case the only remaining part of the transaction may be the transaction coordinator that was about to send the commit message. This case is handled by the message reporting that node failure handling is complete. If the transaction coordinator has failed, node failure handling has completed, and we haven't heard anything about the transaction outcome, we know that the transaction committed, since this is the only possible reason for the transaction outcome message not to arrive.
We keep special commit ack information in the data nodes to ensure that the API can always be informed of the transaction outcome if the API itself survives.
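The decision rule of the take-over protocol can be sketched as follows; the report format and function name are invented for illustration, the rule itself ("if any node heard a commit decision, commit") is from the text.

```python
# Hedged sketch of the take-over decision: the new coordinator gathers
# per-participant states for a transaction that lost its coordinator and
# decides the outcome.

def decide_outcome(reports):
    """reports: list of per-node transaction states, e.g. 'PREPARED'
    or 'COMMITTED', collected from all surviving participants."""
    if any(state == "COMMITTED" for state in reports):
        # The commit point was passed on some replica, so every replica
        # must have prepared; the transaction can safely be committed.
        return "COMMIT"
    # No node heard a commit decision: the commit point was never reached.
    return "ABORT"

assert decide_outcome(["PREPARED", "COMMITTED"]) == "COMMIT"
assert decide_outcome(["PREPARED", "PREPARED"]) == "ABORT"
```

The safety of the rule rests on the commit-point observation above: the first primary replica only commits after the commit message has passed through all backup replicas, so a single `COMMITTED` report implies every replica agreed to commit.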
Global checkpoint protocol
RonDB is designed for applications that require latencies that cannot be achieved when writing to a hard drive as part of the commit protocol. SSDs weren't around at the time NDB was designed.
RonDB uses Network Durability to implement the D in the term ACID (Atomic, Consistent, Isolated, Durable). Thus at commit time the transaction is committed in memory of multiple computers, thus it can survive any single point of failure of an entire computer and in many cases even multiple node failures.
First phase of global checkpoint protocol
Some mechanism is also required to ensure that the cluster recovers to a consistent state after a complete cluster crash. It is not sufficient to simply write out the transaction logs to disk as soon as possible. To recover a consistent state of the database we need something more.
What we implemented here is a sort of group commit transaction. We call this group commit a global checkpoint (GCP). Each GCP is a consistent recoverable point at a cluster crash. We also use micro GCPs. These are not recoverable after a crash; they form epochs used to send batches of transactions over to a backup cluster.
We use global checkpoints for many things in our recovery mechanisms. Each time a transaction is about to commit it asks for a global checkpoint id (GCI). Normally it gets one of those immediately. If we are in the process of starting a new GCP there could be a small wait.
The process to start a new global checkpoint is the following. The master data node decides to create a new global checkpoint and sends a message called GCP_PREPARE to all nodes in the cluster (including itself). Each node responds with a GCP_PREPARECONF message; next the master data node sends GCP_COMMIT to all nodes.
At reception of the GCP_PREPARE message the nodes start blocking committers from proceeding. All prepare-phase activities are allowed to continue, which means that the impact on throughput and latency is minimal. The impact on throughput is low since there is more room for prepare messages, so we can still keep the CPUs busy. The impact on latency is low since it only introduces the extra delay of one two-phase message round, and given that committers are blocked, a little more CPU is available to handle this message faster. After 15 years of operating the cluster we have had no issues with this initial phase of the global checkpoint protocol. One more reason is that queueing the commits for a short time introduces a sort of batching of commits that improves throughput.
At reception of GCP_COMMIT the nodes will immediately resume the commit of the stalled transactions.
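How GCP_PREPARE/GCP_COMMIT create a serialisation point can be modelled with a small sketch (the class and method names are invented; only the message names are from the text): between the two messages, new commits are queued rather than executed, and GCP_COMMIT resumes them under the new GCI.

```python
# Illustrative model of a data node stalling committers between
# GCP_PREPARE and GCP_COMMIT.

class DataNode:
    def __init__(self):
        self.gci = 1
        self.commit_blocked = False
        self.queued_commits = []

    def gcp_prepare(self):
        # Stall new commits; prepare-phase work continues unaffected.
        self.commit_blocked = True

    def gcp_commit(self):
        # Start the new GCP and resume the queued commits under it.
        self.gci += 1
        self.commit_blocked = False
        resumed, self.queued_commits = self.queued_commits, []
        return [(tx, self.gci) for tx in resumed]

    def try_commit(self, tx):
        if self.commit_blocked:
            self.queued_commits.append(tx)   # queued, not rejected
            return None
        return (tx, self.gci)                # committed under current GCI

node = DataNode()
node.gcp_prepare()
assert node.try_commit("tx-1") is None       # stalled between the phases
assert node.gcp_commit() == [("tx-1", 2)]    # resumed under the new GCI
```

Every transaction thus commits under exactly one GCI, which is what makes each global checkpoint a consistent set of transactions.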
Now that we have created a new global checkpoint we have also at the same time completed the previous one. Given the two-phased approach we are certain that we get a proper serialisation point between the global checkpoints. We know that a global checkpoint contains a set of transactions and all these will either be recovered completely or not at all.
This means that when we recover a node, we use the GCI we recover to in order to know which transactions to restore.
Each COMMIT record in the transaction log (in our case only a REDO log) is tagged with the GCI of its transaction. We also use the GCI to make synchronisation of a starting node and a living node easy. We recover the starting node to a certain GCI, after that only the rows in the living node with a higher GCI need to be sent to the starting node to synchronize the rows in the nodes.
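The GCI-based synchronisation of a starting node can be sketched as a simple filter (not RonDB source; the row representation is invented): after the starting node is restored to a certain GCI, the living node only needs to ship rows whose last change carries a higher GCI.

```python
# Sketch of GCI-based node synchronisation: select the rows the living
# node must send to a starting node restored to restored_gci.

def rows_to_synchronize(living_rows, restored_gci):
    """living_rows: mapping row_key -> GCI of the row's last change."""
    return {key for key, gci in living_rows.items() if gci > restored_gci}

living = {"row-a": 95, "row-b": 100, "row-c": 103}
# The starting node was restored to GCI 100, so only row-c differs.
assert rows_to_synchronize(living, 100) == {"row-c"}
```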
Second phase of global checkpoint protocol
The second phase makes the completed GCI durable on disk. This phase starts already when receiving GCP_COMMIT.
The first step is that each node needs to wait until the transaction coordinators have completed the processing of all transactions belonging to the completed GCI. Normally this is fast since it mostly involves networking messages and memory operations. Large transactions can increase the time spent in this phase. Commits of transactions changing many rows with disk columns can represent a challenge.
When all transactions in the GCP in the node are completed we will send the message GCP_NODEFINISH back to the master data node.
When all nodes have finished the commit phase of the global checkpoint protocol, the next phase is to flush the REDO logs in all nodes. The message GCP_SAVEREQ is sent to all nodes to ask them to flush the REDO log up to the GCI. The REDO log implementation needs to keep track of where the last commit message for this GCI is stored, to know how much of the REDO log needs to be flushed.
When the flush of the REDO logs is complete the nodes respond with GCP_SAVECONF.
At this point there is enough information to recover this GCI. To simplify the recovery we use one more phase where we write a small system file that contains the recoverable GCI. This is accomplished by sending COPY_GCIREQ to all nodes that will respond with COPY_GCICONF when done.
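The second phase can be summarised as three sequential steps across all nodes; in this sketch the message names are from the text, while the `Node` class and its helpers are invented for illustration.

```python
# Hedged sketch of the GCP save phase: finish transactions, flush REDO
# logs, then record the recoverable GCI in a small system file.

class Node:
    def __init__(self):
        self.messages = []
        self.recoverable_gci = 0

    def finish_transactions(self, gci):
        # Wait until all transactions in the GCI complete (GCP_NODEFINISH).
        self.messages.append(("GCP_NODEFINISH", gci))

    def flush_redo_log(self, gci):
        # Flush the REDO log up to this GCI (GCP_SAVEREQ / GCP_SAVECONF).
        self.messages.append(("GCP_SAVECONF", gci))

    def write_recoverable_gci(self, gci):
        # Persist the recoverable GCI (COPY_GCIREQ / COPY_GCICONF).
        self.recoverable_gci = gci

def gcp_save_phase(nodes, gci):
    for node in nodes:                 # step 1: complete the GCI's commits
        node.finish_transactions(gci)
    for node in nodes:                 # step 2: flush REDO logs to disk
        node.flush_redo_log(gci)
    for node in nodes:                 # step 3: mark the GCI recoverable
        node.write_recoverable_gci(gci)

cluster = [Node(), Node()]
gcp_save_phase(cluster, 42)
assert all(n.recoverable_gci == 42 for n in cluster)
```

In the real protocol each step is driven by the master data node and runs in parallel across nodes; the serial loops here only capture the ordering of the three steps.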
Handling cluster crash
At a cluster crash we will use the information from the system file to discover which GCI to restore. To restore the data we will first install a local checkpoint. This local checkpoint will know how far back in the REDO log we need to start to ensure that we recover the data. More on the details of this in a later chapter.
Next we execute the REDO log, applying all log records that belong to restored tables and have a GCI smaller than or equal to the restored GCI.
After that we have restored a consistent database. Before the recovery is completed, we execute a local checkpoint to avoid problems with multiple node crashes occurring close to each other.
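The REDO replay rule stated above reduces to a simple predicate per log record; the record layout in this sketch is invented for illustration.

```python
# Minimal sketch of the REDO replay filter used during cluster crash
# recovery: apply a record only if its table was restored and its GCI is
# at most the restored GCI.

def should_apply(record, restored_tables, restored_gci):
    return (record["table"] in restored_tables
            and record["gci"] <= restored_gci)

redo_log = [
    {"table": "t1", "gci": 90,  "op": "UPDATE"},
    {"table": "t1", "gci": 101, "op": "UPDATE"},   # beyond the restored GCI
    {"table": "t2", "gci": 95,  "op": "INSERT"},   # table not restored
]
applied = [r for r in redo_log if should_apply(r, {"t1"}, 100)]
assert applied == [{"table": "t1", "gci": 90, "op": "UPDATE"}]
```

Filtering on the restored GCI is what guarantees that each global checkpoint is recovered completely or not at all.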
Impact of Read Backup feature
As mentioned, the Read Backup feature changes the protocol slightly in that the response to the API is sent after the complete phase instead of after the commit phase, making the protocol a true three-phase commit protocol. Only when reading primary replicas can we use the optimisation of responding already after the second phase of the commit protocol.
The RonDB transaction protocol is a bit involved since we have many different requirements on it. The requirements are:
- Avoid deadlocks by using primary copy locking
- The protocol must be non-blocking
- It must be able to handle multiple node failures in all protocols
- It must be able to handle multiple node failures during node failure handling
- Minimise the overhead of the transaction protocol by decreasing the number of messages
- Avoid writing to disk as part of the commit phase to achieve predictable latency
- Restore a consistent database after a cluster crash
- Handle node restarts
It is important to understand here that the transaction protocol and the global checkpoint protocol are actually both parts of the implementation of transactions.
What we have done is to first use a special two-phase commit protocol to ensure that we achieve Network Durability (the transaction is committed in the memory of several computers before the commit message is sent to the application).
Next we perform a group commit of a set of transactions and make those transactions durable.
Interestingly, although I invented these ideas, it is still really useful to describe the protocols to fully understand how they are interconnected. The ideas evolved over a period of 25 years, so even for me it is good to go back and revisit the basic requirements to understand why we have the set of protocols we have.
Most DBMSs integrate everything into one commit protocol and possibly do some group commit optimisations. For RonDB this wasn't an option since the telecom database requirements meant that we could not wait for disk during the commit processing. To provide the durability without using disk we made use of the Network Durability concept instead that commits in memory of several computers.
At the same time we still needed to be able to recover something after a complete cluster crash. We needed a protocol to make the transactions durable on disk as well. This disk-durable state also needed to be consistent, and thus the global checkpoint protocol was invented.