Startup Phases
List of Phases
For a cluster (re-)start, the management server is started first, then
the data nodes.
For all kinds of (re-)starts, data nodes are started like this:
```bash
ndbmtd --ndb-connectstring=<management server address> \
       --ndb-nodeid=<node-id> \
       --initial   # In case of an initial start
```
Then the data node runs through the phases listed below. Each phase is marked with the start variants in which it runs: Initial Start, Cluster Restart, Node Restart and/or Initial Node Restart.
Complete Node Failure Protocol (Node Restart, Initial Node Restart)

Receive Configuration from Management Server (all start variants)

Start Virtual Machine (all start variants)

- Set up RonDB's thread architecture
- Set up thread priorities and CPU locking
- Set up communication between threads
- Allocate most memory
    - Allocating memory takes about 3 GiB/second (OS-dependent)

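As a rough illustration of the memory-allocation cost mentioned above, the sketch below turns the ~3 GiB/second figure into wall-clock estimates. The memory sizes are made-up examples, and the real rate depends on the operating system.

```python
# Rough estimate of the "Allocate most memory" step, assuming the
# ~3 GiB/second allocation rate quoted above (OS-dependent).
# The memory sizes below are made-up examples.

def allocation_seconds(total_memory_gib: float, gib_per_second: float = 3.0) -> float:
    """Estimated wall-clock time to allocate the node's memory."""
    return total_memory_gib / gib_per_second

for memory_gib in (64, 256, 512):
    print(f"{memory_gib:>3} GiB -> ~{allocation_seconds(memory_gib):.0f} s")
```
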
Node Registration Protocol (all start variants)

Write System Files (Initial Start, Initial Node Restart)

- Initialise the REDO logs, UNDO logs and the system tablespace
- Entire files are written at startup (they can be large, so this can take time)
- Sometimes we already initialise pages in the files as well
- It can be configured to avoid writing the files directly, but then the file space is not owned (risky)
- Usually takes less than a minute in total

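The sketch below gives a feel for why this step usually stays under a minute: it divides the total size of the REDO logs, UNDO logs and system tablespace by the disk's sequential write bandwidth. All sizes and the bandwidth are hypothetical example values, not RonDB defaults.

```python
# Estimate how long writing the system files takes during an initial
# (node) start, given the total size of the REDO logs, UNDO logs and
# system tablespace and the disk's sequential write bandwidth.
# All numbers are illustrative, not RonDB defaults.

def init_write_seconds(total_file_gib: float, disk_write_gib_per_s: float) -> float:
    return total_file_gib / disk_write_gib_per_s

redo_gib = 64        # e.g. total size of all fragment log files
undo_gib = 16        # e.g. total size of the UNDO log files
tablespace_gib = 32  # e.g. initial size of the system tablespace

total_gib = redo_gib + undo_gib + tablespace_gib
# A local NVMe drive can often sustain a couple of GiB/s of sequential writes.
print(f"~{init_write_seconds(total_gib, disk_write_gib_per_s=2.0):.0f} s "
      f"to write {total_gib} GiB of files")
```
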
Synchronisation Barrier (Cluster Restart)

- Starting nodes wait for other starting nodes to reach this point
- We synchronise here because the next few steps run serially (only one node at a time)
- There is a timeout here; we can start a partial cluster (at least one node group with all nodes alive)

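A minimal sketch of the decision taken at this barrier, encoding only the rule stated above: wait for all starting nodes until a timeout, then allow a partial cluster start if at least one node group has all of its nodes alive. The real protocol has more conditions and configurable timeouts; the function signature and node-group representation are assumptions made for illustration.

```python
# Simplified sketch of the synchronisation barrier: wait for starting
# nodes until a timeout, then decide whether a partial cluster start is
# allowed. Only the rule quoted above is encoded here.
import time

def wait_at_barrier(expected_nodes, arrived_nodes, node_groups, timeout_s):
    """expected_nodes: set of data node ids that should start.
    arrived_nodes:  callable returning the set of nodes seen so far.
    node_groups:    dict mapping node group id -> set of node ids."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if arrived_nodes() >= expected_nodes:   # every node reached the barrier
            return True
        time.sleep(1.0)
    alive = arrived_nodes()
    # After the timeout, allow a partial cluster start if at least one
    # node group has all of its nodes alive.
    return any(group <= alive for group in node_groups.values())
```
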
Cluster Inclusion (all start variants)

Copy MySQL Metadata (Cluster Restart, Node Restart, Initial Node Restart)

Join GCP & LCP Protocol (all start variants)

Restore Local Checkpoint (Cluster Restart, Node Restart)

- Performance:
    - Runs in parallel on all LDM and query threads; one thread per CPU, using all CPUs
    - Each thread can load about 1 million records per second (for records of a few hundred bytes in size)
    - A 200 GiB database should be loaded in 3 minutes
- Optimisations:
    - High-performance disks
    - Many CPUs

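The throughput figures above translate into restore-time estimates as follows. The record size and thread counts are illustrative; the 1 million records per second per thread rate is the one quoted above.

```python
# Translate the load rate quoted above (~1 million records per second
# per LDM/query thread, for records of a few hundred bytes) into a
# restore-time estimate. Record size and thread counts are examples.

GIB = 1024 ** 3

def restore_seconds(db_gib: float, record_bytes: int, threads: int,
                    recs_per_sec_per_thread: float = 1_000_000) -> float:
    records = db_gib * GIB / record_bytes
    return records / (threads * recs_per_sec_per_thread)

# A 200 GiB database of ~400-byte records:
for threads in (4, 8, 16):
    print(f"{threads:>2} threads -> ~{restore_seconds(200, 400, threads) / 60:.1f} min")
```
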
Apply UNDO Log (Cluster Restart, Node Restart)

- Internals:
    - Each log is read by the rep thread and then routed to an LDM thread
    - The PGMAN block is used to identify the fragment id and thus the correct LDM thread
    - LDM threads execute the UNDO log records serially in backwards order

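A schematic sketch of the routing described above: a single rep thread reads the UNDO log backwards, determines the fragment of each record, and hands it to the LDM thread owning that fragment, which undoes its records serially. The page-to-fragment lookup and the fragment-to-thread mapping are simple stand-ins for what PGMAN and the thread configuration do in RonDB.

```python
# Schematic of UNDO log application: the log is read backwards by one
# rep thread and routed to per-fragment LDM threads, which apply the
# records serially. The mappings below are stand-ins, not RonDB code.
from collections import defaultdict

def route_undo_log(undo_log, page_to_fragment, num_ldm_threads):
    per_ldm = defaultdict(list)
    for record in reversed(undo_log):            # newest record first
        fragment_id = page_to_fragment[record["page_id"]]
        ldm = fragment_id % num_ldm_threads      # stand-in fragment-to-thread map
        per_ldm[ldm].append(record)              # preserves backwards order
    return per_ldm

def apply_undo(per_ldm):
    for ldm, records in per_ldm.items():         # one LDM thread each
        for record in records:                   # applied serially, backwards in time
            print(f"LDM {ldm}: undo {record}")
```
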
Apply REDO Log (Cluster Restart, Node Restart)

- General:
    - Each commit record in the REDO log is tagged with the GCI of its transaction
    - The REDO log is recovered up to the GCI found in the system file P0.sysfile
    - The amount of REDO log to apply depends on the amount of writes between the last local checkpoint and the node crash
- Performance:
    - Same code path as normal transactions (without distributed commit)
    - As fast as single-replica transactions
    - Applying an INSERT is as fast as reading a local checkpoint (similar code path)
    - Generally slower than reading a local checkpoint, because every UPDATE is a separate read

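The sketch below illustrates the GCI bound described above: only transactions whose commit record carries a GCI at or below the GCI read from P0.sysfile are re-applied. The record layout and helper names are invented for illustration; the real apply step uses the normal transaction code path without the distributed commit.

```python
# Schematic REDO log replay bounded by the GCI stored in P0.sysfile.
# Record layout and helpers are illustrative only.

def replay_redo(redo_log, restore_gci):
    for record in redo_log:                    # replayed in log order
        if record["type"] == "commit" and record["gci"] <= restore_gci:
            for op in record["operations"]:    # INSERT/UPDATE/DELETE operations
                apply_operation(op)

def apply_operation(op):
    print("re-applying", op)
```
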
Copy Fragment Phase (Part I) (Initial Node Restart)

- Sync data with live nodes, copying row (id) by row (id)
- No row locks are taken on live nodes
- The starting node does not join live transactions after a row is copied
    - The result is a fuzzy checkpoint; the rows are not transaction-consistent
- After this, the state is similar to that of a node running a normal node restart
- Set TwoPassInitialNodeRestartCopy=0 to skip this step
    - Advantage of this step: allows a parallel ordered index build (fast)
    - Disadvantage of this step: the live node has to scan its data twice (heavier load)
    - If skipped, all indexes are built during the copy in Part II
- UNDO logs are written for all changes to disk data
- An overloaded UNDO log may trigger non-distributed local checkpoints on the starting node
- More documentation can be found in the source file storage/ndb/src/kernel/blocks/ndbcntr/NdbcntrMain.cpp

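A condensed sketch of how this phase fits together with "Rebuild Ordered Indexes" and Part II, and of the effect of the TwoPassInitialNodeRestartCopy setting described above. The functions are stubs standing in for RonDB internals.

```python
# Condensed control flow of the data copy during an initial node
# restart, as described above. The functions are stubs, not RonDB code.

def copy_rows_fuzzy():
    print("Part I: fuzzy row-by-row copy, no row locks, not joining live transactions")

def rebuild_indexes_parallel():
    print("Rebuild ordered indexes in parallel")

def copy_rows_and_join_txns():
    print("Part II: sync rows and join live transactions")

def initial_node_restart_copy(two_pass: bool):
    if two_pass:                    # TwoPassInitialNodeRestartCopy enabled
        copy_rows_fuzzy()           # first scan on the live node
        rebuild_indexes_parallel()  # fast, parallel index build
        copy_rows_and_join_txns()   # second scan on the live node
    else:                           # TwoPassInitialNodeRestartCopy=0
        copy_rows_and_join_txns()   # single scan; all indexes built during this copy
```
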
Rebuild Ordered Indexes (Cluster Restart, Node Restart, Initial Node Restart)

Copy Fragment Phase (Part II) (Cluster Restart, Node Restart, Initial Node Restart)

- Same as Part I, but the starting node directly joins all live transactions (without taking the coordinator role)
    - If rows are not yet in sync, the starting node fakes a successful transaction and discards the changes
- Runs in a cluster restart if the nodes were not stopped at the same time
- Each row & page has a GCI that specifies the last GCI at which it was written
    - Rows & pages with equivalent GCIs are skipped
- Since row ids != primary keys, deleted records are detected as empty rows with newer GCIs
    - If a row id is re-used for a new record, we see a primary key change and also copy over the DELETE
- Before copying, a shared row lock is taken on the live node and an exclusive row lock on the starting node
    - Locks are required to avoid race conditions with concurrent updates; updates might otherwise be discarded on the starting node
    - The lock on the live node can be released after the read due to signal ordering
- REDO logs are activated on the starting node once all fragments are synchronised
- Optimisations: same as Part I
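A simplified sketch of the per-row synchronisation described above: rows whose GCI on the live node is not newer than the starting node's copy are skipped, deletions appear as empty row slots with newer GCIs, and changed rows are copied under the locks mentioned above. The data structures are invented for illustration and ignore transactions, paging and batching.

```python
# Simplified per-fragment row synchronisation for Part II. Both sides
# are modelled as dicts: row_id -> (gci, row), where row is None for an
# empty (deleted) row slot. Locking is only noted in comments.

def sync_fragment(live, starting):
    for row_id, (live_gci, live_row) in live.items():
        starting_gci, _ = starting.get(row_id, (0, None))
        if live_gci <= starting_gci:
            continue                                  # already in sync: skip
        # Shared row lock on the live node, exclusive row lock on the
        # starting node; the live-node lock can be released right after
        # the read thanks to signal ordering.
        if live_row is None:
            starting[row_id] = (live_gci, None)       # propagate a DELETE
        else:
            starting[row_id] = (live_gci, live_row)   # copy the newer row
```
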
Wait for Local Checkpoint (all start variants)

- Before this, all data is up-to-date, but we would not be able to recover from anything
- Participate in a distributed local checkpoint; this is sped up by previous non-distributed local checkpoints

A detailed YCSB benchmark of RonDB’s recovery performance can be found
in a blog
post.
In order to achieve better performance during recovery, we generally
recommend the following:
The advantages and disadvantages for local NVMe are universal and also
apply to RonDB: