Startup Phases#

List of Phases#

For a cluster (re-)start, the management server is started first, then the data nodes.

For all kinds of (re-)starts, data nodes are started like this:

ndbmtd --ndb-connectstring=<management server address> \
    --ndb-nodeid=<node-id> \
    --initial  # In case of an initial start
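
For example, an initial start of data node 1 against a management server on mgmd.example.com (hostname and node id are illustrative; 1186 is the default management server port):

ndbmtd --ndb-connectstring=mgmd.example.com:1186 \
    --ndb-nodeid=1 \
    --initial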

Then the data node runs through the following phases:

Each of the following steps applies to some or all of the four start variants: cluster start, cluster restart, node restart, and initial node restart.

Complete Node Failure Protocol
  • A number of protocols must run to completion before node failure handling is finished

  • All ongoing transactions the failed node was responsible for must be completed

Receive Configuration from Management server
  • Acquire node Id

  • How many threads of various types to use

  • How much memory of various types to allocate
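
These values come from the cluster configuration file (config.ini) on the management server. A minimal sketch using standard parameter names; all values are illustrative:

[ndbd default]
DataMemory=16G           # in-memory row storage to allocate
DiskPageBufferMemory=4G  # page cache for disk-data columns
ThreadConfig=ldm={count=4},tc={count=2},main={count=1},rep={count=1}

[ndbd]
NodeId=1                 # the node id the data node acquires
HostName=datanode1.example.com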

Start Virtual Machine
  • Set up RonDB’s thread architecture

  • Set up thread priorities and CPU locking

  • Set up communication between threads

  • Allocate most memory

  • Allocating memory takes about 3 GiB / second (OS-dependent), so e.g. 90 GiB of configured memory adds roughly 30 seconds

Node Registration Protocol
  • Be included in Heartbeat Protocol

  • Takes 1-3 seconds; nodes join one at a time

Write System Files
  • Initialise the REDO logs, UNDO logs and the system tablespace

  • Entire files are written at startup (they can be large, so can take time)

  • In some cases, pages within the files are initialised as well

  • Writing the files can be configured away, but then the file space is not owned, which is risky (see the sketch below)

  • Usually takes less than a minute in total
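
The trade-off between writing the files in full and merely reserving them is controlled in config.ini, assuming the standard InitFragmentLogFiles parameter:

[ndbd default]
InitFragmentLogFiles=FULL    # write the REDO log files out in full (safe)
# InitFragmentLogFiles=SPARSE writes only a little metadata per file; faster,
# but the file space is not owned, which is the risky option described above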

Synchronisation Barrier
  • Starting nodes wait for other starting nodes to reach this point

  • We sync here because the next few steps are run in serial (only one node at a time)

  • There is a timeout here; when it expires, we can start a partial cluster (requires at least one node group with all nodes alive)
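
The timeout is configurable; a sketch assuming the standard StartPartialTimeout parameter (the value shown, in milliseconds, is the usual default):

[ndbd default]
StartPartialTimeout=30000    # wait 30s for further starting nodes before attempting a partial start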

Cluster Inclusion
  • The starting node is included at the DBDIH level

Copy MySQL Metadata
  • This is the schema.log (every node persists its own)

  • This is always run if another node has a newer schema.log

Join GCP & LCP Protocol
  • After this, the next node can join the DBDIH logic

Restore Local Checkpoint

Performance:

  • Run in parallel on all LDM and Query threads; one thread per CPU, using all CPUs

  • Each thread can load about 1m records / second (for records of a few hundred bytes in size)

  • A 200 GiB database (roughly a billion records at ~200 bytes each) should thus load in about 3 minutes with five to six such threads

Optimisations:

  • High performance disks

  • Many CPUs

Apply UNDO Log

General:

  • The amount is proportional to the disk-data changes made since the last local checkpoint

  • Normally takes significantly less than a minute

Internals:

  • Each log record is read by the rep thread and then routed to an LDM thread

  • The PGMAN block is used to identify the fragment id and thus the correct LDM thread

  • LDM threads apply the UNDO log records serially, in backwards order

Optimisations:

  • High performance disks for UNDO log & particularly tablespace data files
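
File placement can be configured so that these files land on the fastest disks; a sketch assuming the standard NDB path parameters (paths are illustrative):

[ndbd default]
FileSystemPathUndoFiles=/nvme0/rondb    # UNDO log files on a local NVMe drive
FileSystemPathDataFiles=/nvme1/rondb    # tablespace data files likewise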

Apply REDO Log

General:

  • Each commit record in the REDO log is tagged with the GCI of its transaction

  • The REDO log is recovered up to the GCI found in the system file P0.sysfile

  • The REDO log amount is dependent on the amount of writes between the last local checkpoint and the node crash

Performance:

  • Same code path as normal transactions (without distributed commit)

  • As fast as single-replica transactions

  • Applying an INSERT is as fast as restoring a local checkpoint record (similar code path)

  • Overall, though, it is slower than restoring a local checkpoint, because every UPDATE is applied as a separate operation

Optimisations:

  • Reduce the REDO log volume by increasing the frequency of local checkpoints (currently governed by an adaptive algorithm; see the sketch below)
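
The balance point of the adaptive algorithm can be shifted; a sketch assuming the standard RecoveryWork parameter (a percentage: lower values mean more checkpoint writing during normal operation, but less REDO log to apply at recovery):

[ndbd default]
RecoveryWork=40    # checkpoint more aggressively than the default (60) to shrink REDO recovery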

Copy Fragment Phase (Part I)

General:

  • Sync data with live nodes, copying row id by row id

  • No row locks are taken on live nodes

  • The starting node does not yet join live transactions, even after a row has been copied

  • Result is a fuzzy checkpoint; the rows are not transaction consistent

  • After this, the node's state is similar to that of a node running a normal node restart

  • Set TwoPassInitialNodeRestartCopy=0 to skip this step (see the sketch after this list)

    • Advantage of step: Allows parallel ordered index build (fast)

    • Disadvantage of step: Live node has to scan data twice (heavier load)

    • If skipped, all indexes are built during copy in Phase II

  • UNDO logs are written for all changes to disk data

  • An overloaded UNDO log may trigger non-distributed local checkpoints on the starting node

  • More documentation in source file storage/ndb/src/kernel/blocks/ndbcntr/NdbcntrMain.cpp

Performance:

  • Runs in parallel between LDM threads (each LDM thread owns a set of partitions)

  • The number of rows synchronised in parallel is capped by a hardcoded constant

Optimisations:

  • Use more in-memory fixed-size columns

  • More CPUs (more LDM threads)

  • Less system load
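
A config.ini sketch for the TwoPassInitialNodeRestartCopy toggle mentioned in the list above:

[ndbd default]
TwoPassInitialNodeRestartCopy=0    # skip Part I; all indexes are built during the copy in Part II
# TwoPassInitialNodeRestartCopy=1 keeps Part I and the fast parallel ordered index build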

Rebuild Ordered Indexes

General:

  • Unique indexes are not rebuilt here, since they are implemented as normal tables

Performance:

  • Entirely CPU bound; tries to use all CPUs

  • One table at a time

  • Default maximum parallelisation is the number of partitions in the table

Optimisations:

  • Many CPUs

  • Use BuildIndexThreads to increase parallelisation (default: 0, max: 128; see the sketch below)
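
A config.ini sketch for the parameter mentioned above:

[ndbd default]
BuildIndexThreads=16    # use up to 16 threads to rebuild ordered indexes (default: 0, max: 128)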

Copy Fragment Phase (Part II)
  • Same as Part I, but the starting node now joins all live transactions directly (without taking the coordinator role)

  • If rows are not yet in sync, the starting node fakes a successful transaction and discards the changes

  • Runs in cluster restart if nodes were not stopped at the same time

  • Each row & page stores the GCI at which it was last written

  • Rows & pages whose GCIs match on both nodes are skipped

  • Since row ids are not primary keys, deleted records are detected as empty rows with newer GCIs

  • If a row is re-used for a new record, we see a primary key change and also copy over the DELETE

  • Before copying, a shared row lock is taken on the live node and an exclusive row lock on the starting node

  • Locks are required to avoid race conditions with concurrent updates, which might otherwise be discarded on the starting node

  • The lock on the live node can be released directly after the read, due to signal ordering

  • REDO logs are activated on the starting node once all fragments are synchronised

Optimisations: Same as Part I

Wait for Local Checkpoint
  • Before this, all data is up-to-date, but the node could not yet recover from a crash on its own

  • Participate in distributed local checkpoint; sped up by previous non-distributed local checkpoints

Notes on Performance#

A detailed YCSB benchmark of RonDB’s recovery performance can be found in a blog post.

In order to achieve better performance during recovery, we generally recommend the following:

  • Machines with many CPUs

  • Use local NVMe drives / high-bandwidth block storage

The advantages and disadvantages for local NVMe are universal and also apply to RonDB:

  • + Very high bandwidth

  • + Normal node restarts are very fast

  • - Using new machines requires an initial node restart (extra load on the system)