Optimising restart times#
As a DBA for a highly available RonDB installation you are likely to have requirements on restart times. In this chapter we will discuss the options available to influence them.
A restart goes through around 20 different phases; most of them are fairly quick. There is an ndbinfo table that gives information on restarts and the exact time spent in each of these phases.
Here we will focus on the most important phases, the ones that consume most of the time.
Early phases#
Most of the early phases are normally executed quickly. There are occasions when they can take some time.
The first phase is to allocate a node id. This phase can have to wait for node failure handling to complete. When a node fails, all ongoing transactions that the node was responsible for must be completed, and a number of protocols need to finish before the node failure handling is complete. As long as node failure handling is still ongoing in the cluster it won't be possible to allocate the node id.
The next phase is to be included into the heartbeat protocol. The registration of a new node takes 1-3 seconds. If many nodes start in parallel this phase might take some time.
Next we have to wait until all nodes required for the restart are up and running. For a node restart this is normally very short. For a cluster start or a system restart, enough nodes have to be available for the start to be able to proceed, so this phase might have to wait for nodes to start up.
Next there is a short phase waiting for permission to be included into the cluster, followed by waiting for an appropriate time to copy the metadata over to the starting node. An LCP that is starting up might block the start of a node for a short time; in RonDB this is normally at most a few seconds.
The next phase copies over the metadata. The length of this phase depends on the number of tables and the amount of other metadata in the cluster. It is normally a very short phase.
Next comes a phase that includes the node in a number of important protocols, followed by a phase that starts up the local recovery of the node.
There are no specific methods available to speed up those parts of the recovery.
After these phases we move into the phases where the actual database recovery happens. This is where most restarts spend the majority of their time and this is where there is real scope to improve recovery times.
Load data phase#
The first phase that can take considerable time is the load data phase, which reads the data back from the local checkpoint files. It executes in parallel on all LDM threads and also uses all query threads. With the default setting of AutomaticThreadConfig we additionally create a number of recover threads, such that the total number of LDM, query and recover threads equals the number of CPUs in the node. Thus this phase should use more or less all CPUs in the node.
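As an illustration, the thread setup for this phase is driven by AutomaticThreadConfig in the data node section of the cluster configuration file (config.ini). The snippet below is a minimal sketch that sets it explicitly, assuming the default behaviour described above:

```ini
[ndbd default]
# Let RonDB derive the number of LDM, query and recover threads
# from the CPUs available to the data node
AutomaticThreadConfig=1
```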
The time it takes for this phase is directly dependent on the amount of data to restore. With records of a few hundred bytes in size each thread is likely to be capable of sustaining a few hundred thousand records per second, and even up to a million records per second. Thus we can sustain several million records per second during this recovery phase, and even a database of 200 GByte should be loaded within about 3 minutes. This requires the file system to be able to sustain reading more than 1 GByte per second from disk into memory (200 GByte in roughly 200 seconds).
Thus for a node using block storage in the cloud it is often the bandwidth towards the block storage that limits recovery speed, since each VM has a limit on this bandwidth. Using instances with local NVMe drives thus increases recovery speed significantly.
The user considerations for this phase are:
- Ensure disks can sustain the speed used by the recovery threads
- Ensure that the number of recovery threads in the data node is sufficient
Most of this is handled automatically; the important consideration is to select the proper hardware, both regarding CPU resources and disk resources.
UNDO log execution#
The UNDO log execution is only used for disk data tables. The amount of UNDO log to execute is directly dependent on the amount of changes to the disk data content in RonDB.
The execution of the UNDO log starts by reading the UNDO log in the rep thread. Next the table and fragment identity are looked up in the affected page. This read of the page in the extra PGMAN block is necessary to be able to route the log record to the correct LDM thread.
The LDM threads execute the UNDO log in serial mode.
With the partial local checkpoints used in RonDB a local checkpoint normally takes only a few seconds, and stays below a minute even in a highly loaded cluster. Thus the UNDO log execution phase isn't expected to take very long; it should normally be significantly below a minute.
The UNDO log execution needs to read the pages from disk in order to change them. Thus the disks used for the UNDO log, and even more so for the data files of the tablespaces, should be high-performance disks.
The user considerations for this phase are:
- Ensure disks used for tablespaces can read data pages quickly
- Ensure disks used for the UNDO log can read and write sequentially at sufficient speed
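To act on these considerations it can help to place the disk data files on the fastest devices available. The snippet below is a sketch using the standard file system path parameters for disk data; the mount points are hypothetical examples:

```ini
[ndbd default]
# Hypothetical mount points on fast NVMe devices.
# UNDO log files for disk data tables:
FileSystemPathUndoFiles=/nvme1/rondb
# Tablespace data files:
FileSystemPathDataFiles=/nvme2/rondb
```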
REDO log execution#
After finishing loading data and executing the UNDO log we are ready to execute the REDO log.
The amount of REDO log to execute is completely dependent on the amount of writes performed before the crash, combined with how much time has passed since the last local checkpoint of the table partitions to restore.
Executing a REDO log record takes similar time to executing the operation in an environment with a single replica since no distributed commit is necessary. The code used to execute the REDO log operations is the same code used to execute normal user transactions.
To keep the time for executing the REDO log records low it is important to keep the time of executing a full local checkpoint short. This is now automated using an adaptive algorithm that ensures that we execute local checkpoints at a high pace, thus avoiding a long recovery through REDO log execution.
Rebuild ordered indexes#
Rebuilding ordered indexes is a completely CPU-bound activity. We will use all CPUs available in the configuration, or at least a high number of parallel threads that make use of most of the available CPU resources.
This is configurable through the configuration variable BuildIndexThreads, which defaults to 0 and can be set to 128 to fully parallelise ordered index builds. At the moment the maximum parallelisation in this phase is the number of partitions in a table, since one table at a time has its indexes rebuilt.
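As a sketch, building on the default described above, the parallel index build is enabled in the data node section of the cluster configuration file:

```ini
[ndbd default]
# Default is 0; 128 gives fully parallel ordered index builds during restart
BuildIndexThreads=128
```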
Rebuild unique indexes#
The unique indexes are implemented as normal tables that are maintained using internal triggers within the transaction protocol. There is no specific need to rebuild unique indexes during recovery; they are restored as any normal table. A unique index table doesn't have any ordered indexes on it, thus there is no specific ordered index build processing for it.
The hash index is built and maintained by the process that installs checkpoints, by the REDO log execution and by the copy fragment execution. All of these processes use exactly the same code path as any normal transaction.
Copy fragment phase#
After completing the rebuild of the ordered indexes we have restored an old version of the database. However, to get from here to an up-to-date version we need to synchronize each row with a live node. This is the aim of the copy fragment phase, where one table partition at a time is brought online.
The copy fragment phase knows the last global checkpoint that the restored data contains. Thus no rows with an older GCI (global checkpoint identifier) need to be copied from the live nodes to the starting node. Each row has a GCI field where we stamp the GCI that updated the row. The live node will scan all rows in the table partition and check for each row whether the GCI field contains a GCI higher than the GCI restored on the starting node. We know, for each table partition, the GCI up to which all rows have been restored.
This phase will scan all rows in the live node and copy over all changes to the starting node. This will happen in parallel in each LDM thread in the live node and the starting node.
Before the copying of a fragment starts up, the distribution handler will ensure that the starting node will participate in all transactions. This ensures that after copying a row it will be kept up-to-date.
During the copy fragment phase we will have to write UNDO logs for all changes to disk data parts. This could trigger a local checkpoint executed locally in the starting node, to avoid running out of UNDO log during the copy fragment phase.
After completing the copy fragment execution we will enable the REDO logs for each fragment.
The time for the copy fragment phase is again dependent on the number of LDM threads. It is also highly dependent on the load in the live node while this phase is running; if the live node is busy serving user transactions there will be less CPU time available for the restart of the starting node. There will always be progress, however, and this phase will eventually complete also in highly loaded clusters.
Local checkpoint phase#
After completing the copy fragment phase we have restored an up-to-date version of the data. However, we cannot yet recover the node without assistance from the live node(s). To make it recoverable on its own we need to perform a local checkpoint. This is a normal LCP in which all live nodes participate.
Since there could be a large amount of data needing to be checkpointed after a restart, we first perform a local checkpoint that is local to the starting node, in preparation for participating in the first distributed local checkpoint.
After completing this phase we have a data node that can recover on its own without the aid of other nodes, and we can handle the final restart phases.
Initial node restart#
An initial node restart is a bit special and needs separate consideration. In this case there is no data to start with; all the data comes from the live nodes through the copy fragment execution.
In this case we can use two passes by setting the configuration parameter TwoPassInitialNodeRecovery. The first pass copies over all rows of each table partition. This pass doesn't enable the starting node to participate in transactions. After copying all data over we start the rebuild index phase.
When the rebuild of ordered indexes is completed we are in the same state as after rebuilding the ordered indexes in a normal node restart. Thus we continue with a second pass of copy fragment, performed in the same fashion as the traditional one. We have kept track of the oldest GCI we have data for, so we need not copy over rows with a GCI older than this.
The two-pass approach for initial node restarts is thus enabled through this configuration variable.
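A minimal configuration sketch of the two-pass initial node restart, using the parameter name given above (verify the exact spelling against the configuration reference for your RonDB version):

```ini
[ndbd default]
# Copy all rows in a first pass, rebuild ordered indexes in parallel,
# then synchronize the remaining changes in a second copy fragment pass
TwoPassInitialNodeRecovery=1
```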
Handling MySQL Replication Servers#
The final phase of the restart handles the MySQL replication servers. The new node needs to be known by the MySQL servers that act as replication servers, so that they know they can go to this node to fetch event records. This phase requires the MySQL replication servers to be up and running; otherwise we wait for a configurable time, two minutes by default, before we proceed and announce the node as restarted.
Final words#
There is an ndbinfo table called restart_info that records, for each of the around 20 phases of a restart, how long it took. This can be used to understand which phase of the restart is your problem and to consider how to improve the speed of recovery in that part.
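The table can be queried from any MySQL server connected to the cluster, for example:

```sql
-- Restart status per data node and the time (in seconds) spent in each
-- restart phase; the exact set of columns depends on the RonDB version.
SELECT * FROM ndbinfo.restart_info;
```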
Having gone through all restart phases we see that enabling as many CPUs as possible to work on the restart in parallel is important to shorten restart times. This can be accomplished by using machines with more CPUs.
However, the most important factor impacting restart times is the bandwidth of the disks used. In the cloud this requires careful consideration of whether to use local NVMe drives or which block storage variant to use.
Obviously local NVMe drives give the best restart times, but in a cloud environment such a node will require an initial node restart if it is moved to a new VM. Thus for normal restarts local NVMe drives will be a lot faster, but occasionally we will see somewhat longer restart times when data has to be moved to a new VM size. A VM using block storage that is moved to a new size can still reuse the same block storage.