There are four types of RonDB startup. Initial start of the cluster, node restart after a node failed where the rest of the cluster is operational, system restart that restarts the cluster after the entire cluster failed or was shut down. The final variant is an initial node restart that is a node restart where the node starts from scratch.
This chapter will also discuss the start phases used when starting a data node.
Initial start is the very first start of a new cluster. Before starting we have prepared a configuration file and we have prepared machines, OS and file systems to house the database files for RonDB.
An initial start takes some time primarily to initialise the REDO logs, the system tablespace and the UNDO log files. The reason for those to take time is that we ensure that the entire file is written. Since those files can be very big it can take some time.
If we don't write them at initial start the file system doesn't guarantee that the necessary file space is allocated. In some cases we also need to initialise the pages in the files.
It is possible to use some configuration variables that avoids writing all of the files, but this doesn't guarantee that we have the file space. Thus if using those configuration variables we can run out of REDO log file space in the middle of writing to the REDO log. Thus this is not a desired setting in a highly available environment.
Other than that the initial start is fairly quick and completed in less than a minute. A very large node could consume some time allocating memory, normally it takes about one second per three gigabyte of memory allocated. This time is very dependent on the OS used.
If a node shut down, or failed, while the rest of the cluster is operational, the restart is normally using a node restart.
A node restart starts by a local recovery up to the point where the node failed. The data has continued changing after the node failed. After the local recovery has been completed we synchronize the data with a live node in the same node group.
This synchronization is done row id by row id. Each row in RonDB has at least a small fixed size memory header part. It can also have a variable sized memory part and a fixed size disk part. The row id of a row is the logical address of the fixed memory part of the row. Pages in a fragment starts at page 0 and continues upwards. Each page contains a set of rows, the number of rows per page is dependent on the amount of columns that use the fixed size part of the row. The fixed size part has faster access, but it is less flexible in memory operations.
Using in-memory fixed size columns is much faster since we know exactly which memory we need to use and can prefetch this memory. Thus with inflexibility comes high throughput. The decision which memory variant to use can be set when creating the table in the CREATE TABLE statement.
It is recommended to use primary keys that can use the fixed size part and not being overly large for best throughput.
When a node is restarting it synchronizes row id by row id. For this synchronisation to work it is essential that a row has the same row id in all replicas. There are extremely rare cases when an insert fails to allocate the row id on the other replicas, this is a temporary error 899 that we've worked very hard to remove as much as possible. It is now so rare that one should report it as a bug if it occurs.
There are a number of cases to handle at synchronisation, the row in the starting node and the live node could be the same and there is nothing more to do. We discover this by checking the GCI on the row. The starting node has all the updates up to a certain GCI, so any row in the live node that has the same GCI or a lower GCI have no need of synchronisation.
If the row has changed it is possible that a row id exists in the live node that doesn't exist in the starting node, in this case we have to insert the new row in the starting node. The opposite can happen where we need to delete the row in the starting node. The next case is that the row can have been updated, in this case we have two different cases. If both rows have the same primary key one simply updates the row with the new information in the row. It is possible that the primary key differs (this can happen when the original row is deleted followed by a new insert that uses the same row id as the old row). If the primary key differs we have to first delete the old row in the starting node followed by an insert of the new row.
The local recovery first reads in a checkpoint, applies the UNDO log on the disk data parts, synchronises these pages to disk after completing the UNDO log execution. Next the REDO log is executed from the local REDO log. Next any ordered indexes are rebuilt.
The synchronisation is performed with a live node. During the synchronisation phase (copy fragment phase) no REDO logs are generated. After the synchronisation phase is completed the started node has exactly the same data as the live node(s) in the node group. The data is not recoverable yet though. The final step is to write a checkpoint that makes the data recoverable.
A system restart is a restart that involves all nodes of the cluster. During a system restart the cluster is not available. The nodes perform the same actions as during a node restart except that there is no need to run the synchronisation phase. After executing the REDO log and rebuilding the ordered indexes and performing a local checkpoint the cluster is operational again.
If all nodes doesn't have the possibility to recover the GCI that is restarted, then those nodes that can will first perform a system restart followed by a node restart for all nodes that could not recover the GCI for the system restart locally.
Initial node restart#
The final restart variant is initial node restart, this is a node restart that starts from a node that has an empty file system.
This can be used in a number of cases. The first case is if the file system of the node was corrupted. In this case one can start the node from scratch again after repairing the file system and removing all old files of the node.
It can be used to move the node to a new computer with a completely indepdent file system such as that happens when moving to a VM with local NVMe drives in the cloud.
In some cases it is necessary to use this restart type for complex upgrades.
Downgrade support is an important feature in a system that is required to always be available. If there is something that is not working correctly during an upgrade or after the upgrade, this could be a problem in the new RonDB version, it could be application issues since one often upgrades the application software at the same time. Supporting downgrade is important for applications that really care about uptime of their applications.
One can use initial node restart to upgrade computers. One takes down the old node, replaces it with a new node and gives it the same IP address and restarts the node from scratch. In this manner it is possible to upgrade the node to use a more recent hardware.
RonDB start phases#
RonDB restart is controlled at a number of different levels. At first we need to start up the RonDB virtual machine that controls the execution of the actual database code.
Starting RonDB virtual machine#
The RonDB virtual machine consists of a set of threads, there is a set of threads called ldm threads. ldm stands for Local Data Manager. These threads take care of handling the actual data, the REDO log, the hash indexes, the ordered indexes. This is the heart of the data node process.
There is a set of query threads, these can handle Committed Reads instead of ldm threads based on an adaptive scheduling algorithm. There are also recover that can assist ldm threads during the restore of a local checkpoint.
The next thread type is the tc threads that deals with transaction control.
There is one main thread that handles meta data, takes care of a number of restart activities.
Next there is one rep thread, this thread takes care of asynchronous replication and event reporting of data changes. It is possible to subscribe to changes and get asynchronous event signals for every data change in the database. It also takes care of the UNDO log handler, some proxy functions of the various blocks (proxy functions are used to handle control functions involving multiple threads).
The above threads are called block threads. These threads almost entirely spend their time executing signals sent to blocks residing in their thread. A block is a set of data and code, only one thread is executing at a time in a block except for some actions by the tc threads that need to access some shared data structures that are updated by the main thread. We will go through the most common signal flows in RonDB in a later chapter.
There is a number of recv threads. These can also execute signals, but they also handle all the actions to receive data on the sockets connecting the data node to other nodes in the cluster. The recv threads communicate with other threads using the same mechanism used to communicate between any block threads. This mechanism is a memory buffer unique for communication between two distinct block threads. The only synchronisation mechanism needed for this communication is carefully placed memory barriers.
In RonDB 22.01 we made the placement of blocks in threads more flexible. This makes it possible to experiment with many more thread configurations. This is explained in some detail in the chapters Research using RonDB.
There is a set of send threads. These can't execute signals, they are only used to send signals to other nodes that the block threads didn't take care of by themselves. They communicate with other threads using mutexes and signals.
There is a large set of file threads, these threads makes it possible to access the file system using synchronous file operations by code written as asynchronous signals sent between blocks. They communicate with the NDBFS block residing in the rep thread using mutexes and signals.
There is a watchdog thread checking that threads continue to make progress.
Before the actual database start phases can start we need to start all those threads, this includes setting up CPU locking and thread priorities for these threads. We need to initialise the communication subsystem. We need to setup the communication subsystem to communicate between threads, we allocate most of the memory (some of it is allocated during the first phase of the database start as well) before the actual database start phases start.
At a high level one can say that we start the virtual machine first before we start the actual database engine.
A very important thing that happens even before this is to read the configuration from the RonDB management server. The configuration contains information about how many threads of various types to use, how much memory of various types to allocate and all sorts of other things.
Reading the configuration is done over the network by connecting to the RonDB management server. This is why the address information where to find the RonDB management server is normally supplied as a command parameter when starting the ndbmtd program.
One special variant of the RonDB data node is to use the ndbd program. This program has one and only one block thread and this thread take care of the activitity of the send threads and receive threads. This program is no longer built in RonDB releases.
Phases used to start the RonDB data node#
When the virtual machine has completed starting up it is time to start up the actual database engine. This is controlled by a block called Missra that is a sub-block of NDBCNTR. This block starts by sending a signal called STTOR with phase 1 to all blocks, one block at a time. The block takes care of the action to be performed in phase 1 and in addition it sends a signal STTORRY where it reports which phases the block is interested in. When a block exists in multiple threads it will execute the phase in each thread in a serial manner. Thus the execution order of a startup is completely deterministic internally in a node.
In addition a starting node communicates with other live nodes and with other starting nodes. This communication is necessary to synchronise the startup to ensure that recovery is done in the correct way.
In the block NDBCNTR, a large comment explains the various start phases and what they do. At a high level the first phase does some memory allocation and a lot of data initialisation. The second and third phase builds up a number of data structures, most of the work in those two phases have been removed in RonDB.
Phase four through six performs the actual database recovery. There is some additional synchronisation in phases between seven and one hundred followed by a phase that ensures that the new data nodes takes it share of the work for event handling. This phase requires the API nodes to be available to complete. If they are not available a wait of 2 minutes will happen before the event part of the restart is completed.