Startup Phases#

List of Phases#

For a cluster (re-)start, the management server is started first, then the data nodes.

For all kinds of (re-)starts, data nodes are started like this:

ndbmtd --ndb-connectstring=<management server address> \
    --ndb-nodeid=<node-id> \
    --initial  # In case of an initial start
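
For example, an initial start of data node 1 against a management server on mgmd.example.com (hostname and node id are illustrative; 1186 is the default management server port):

ndbmtd --ndb-connectstring=mgmd.example.com:1186 \
    --ndb-nodeid=1 \
    --initial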

Then the data node runs through the following phases:

Each of the following steps applies to some or all of the four start variants: cluster start, cluster restart, node restart, and initial node restart.

Complete Node Failure Protocol
  • A number of protocols must run to completion before node failure handling is finished

  • All ongoing transactions the failed node was responsible for must be completed

Receive Configuration from Management server
  • Acquire node Id

  • How many threads of various types to use

  • How much memory of various types to allocate
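
These values come from the cluster configuration file (config.ini) on the management server. A minimal sketch using standard parameter names; all values are illustrative:

[ndbd default]
DataMemory=16G           # in-memory row storage to allocate
DiskPageBufferMemory=4G  # page cache for disk-data columns
ThreadConfig=ldm={count=4},tc={count=2},main={count=1},rep={count=1}

[ndbd]
NodeId=1                 # the node id the data node acquires
HostName=datanode1.example.com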

Start Virtual Machine
  • Set up RonDB’s thread architecture

  • Set up thread priorities and CPU locking

  • Set up communication between threads

  • Allocate most memory

  • Allocating memory takes about 3 GiB / second (OS-dependent), so e.g. 90 GiB of configured memory adds roughly 30 seconds

Node Registration Protocol
  • Be included in Heartbeat Protocol

  • Takes 1-3 seconds; nodes join one at a time

Write System Files
  • Initialise the REDO logs, UNDO logs and the system tablespace

  • Entire files are written at startup (they can be large, so can take time)

  • In some cases, pages within the files are initialised as well

  • Writing the files can be configured away, but then the file space is not owned, which is risky (see the sketch below)

  • Usually takes less than a minute in total
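
The trade-off between writing the files in full and merely reserving them is controlled in config.ini, assuming the standard InitFragmentLogFiles parameter:

[ndbd default]
InitFragmentLogFiles=FULL    # write the REDO log files out in full (safe)
# InitFragmentLogFiles=SPARSE writes only a little metadata per file; faster,
# but the file space is not owned, which is the risky option described above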

Synchronisation Barrier
  • Starting nodes wait for other starting nodes to reach this point

  • We sync here because the next few steps are run in serial (only one node at a time)

  • There is a timeout here; when it expires, we can start a partial cluster (requires at least one node group with all nodes alive)
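
The timeout is configurable; a sketch assuming the standard StartPartialTimeout parameter (the value shown, in milliseconds, is the usual default):

[ndbd default]
StartPartialTimeout=30000    # wait 30s for further starting nodes before attempting a partial start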

Cluster Inclusion
  • The starting node is included at the DBDIH level

Copy MySQL Metadata
  • This is the schema.log (every node persists its own)

  • This is always run if another node has a newer schema.log

Join GCP & LCP Protocol
  • After this, the next node can join the DBDIH logic

Restore Local Checkpoint

Performance:

  • Run in parallel on all LDM and Query threads; one thread per CPU, using all CPUs

  • Each thread can load about 1m records / second (for records of a few hundred bytes in size)

  • A 200 GiB database (roughly a billion records at ~200 bytes each) should thus load in about 3 minutes with five to six such threads

Optimisations:

  • High performance disks

  • Many CPUs

Apply UNDO Log

General:

  • The amount is proportional to the disk-data changes made since the last local checkpoint

  • Normally takes significantly less than a minute

Internals:

  • Each log record is read by the rep thread and then routed to an LDM thread

  • The PGMAN block is used to identify the fragment id and thus the correct LDM thread

  • LDM threads apply the UNDO log records serially, in backwards order

Optimisations:

  • High performance disks for UNDO log & particularly tablespace data files
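
File placement can be configured so that these files land on the fastest disks; a sketch assuming the standard NDB path parameters (paths are illustrative):

[ndbd default]
FileSystemPathUndoFiles=/nvme0/rondb    # UNDO log files on a local NVMe drive
FileSystemPathDataFiles=/nvme1/rondb    # tablespace data files likewise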

Apply REDO Log

General:

  • Each commit record in the REDO log is tagged with the GCI of its transaction

  • The REDO log is recovered up to the GCI found in the system file P0.sysfile

  • The REDO log amount is dependent on the amount of writes between the last local checkpoint and the node crash

Performance:

  • Same code path as normal transactions (without distributed commit)

  • As fast as single-replica transactions

  • Applying an INSERT is as fast as restoring a local checkpoint record (similar code path)

  • Overall, though, it is slower than restoring a local checkpoint, because every UPDATE is applied as a separate operation

Optimisations:

  • Reduce the REDO log volume by increasing the frequency of local checkpoints (currently governed by an adaptive algorithm; see the sketch below)
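
The balance point of the adaptive algorithm can be shifted; a sketch assuming the standard RecoveryWork parameter (a percentage: lower values mean more checkpoint writing during normal operation, but less REDO log to apply at recovery):

[ndbd default]
RecoveryWork=40    # checkpoint more aggressively than the default (60) to shrink REDO recovery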

Copy Fragment Phase (Part I)

General:

  • Sync data with live nodes, copying row id by row id

  • No row locks are taken on live nodes

  • The starting node does not yet join live transactions, even after a row has been copied

  • Result is a fuzzy checkpoint; the rows are not transaction consistent

  • After this, the node's state is similar to that of a node running a normal node restart

  • Set TwoPassInitialNodeRestartCopy=0 to skip this step (see the sketch after this list)

    • Advantage of step: Allows parallel ordered index build (fast)

    • Disadvantage of step: Live node has to scan data twice (heavier load)

    • If skipped, all indexes are built during copy in Phase II

  • UNDO logs are written for all changes to disk data

  • An overloaded UNDO log may trigger non-distributed local checkpoints on the starting node

  • More documentation in source file storage/ndb/src/kernel/blocks/ndbcntr/NdbcntrMain.cpp

Performance:

  • Runs in parallel between LDM threads (each LDM thread owns a set of partitions)

  • The number of rows synchronised in parallel is capped by a hardcoded constant

Optimisations:

  • Use more in-memory fixed-size columns

  • More CPUs (more LDM threads)

  • Less system load
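
A config.ini sketch for the TwoPassInitialNodeRestartCopy toggle mentioned in the list above:

[ndbd default]
TwoPassInitialNodeRestartCopy=0    # skip Part I; all indexes are built during the copy in Part II
# TwoPassInitialNodeRestartCopy=1 keeps Part I and the fast parallel ordered index build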

Rebuild Ordered Indexes

General:

  • Unique indexes are not rebuilt here, since they are implemented as normal tables

Performance:

  • Entirely CPU bound; tries to use all CPUs

  • One table at a time

  • Default maximum parallelisation is the number of partitions in the table

Optimisations:

  • Many CPUs

  • Use BuildIndexThreads to increase parallelisation (default: 0, max: 128; see the sketch below)
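
A config.ini sketch for the parameter mentioned above:

[ndbd default]
BuildIndexThreads=16    # use up to 16 threads to rebuild ordered indexes (default: 0, max: 128)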

Copy Fragment Phase (Part II)
  • Same as Part I, but the starting node now joins all live transactions directly (without taking the coordinator role)

  • If rows are not yet in sync, the starting node fakes a successful transaction and discards the changes

  • Runs in cluster restart if nodes were not stopped at the same time

  • Each row & page stores the GCI at which it was last written

  • Rows & pages whose GCIs match on both nodes are skipped

  • Since row ids are not primary keys, deleted records are detected as empty rows with newer GCIs

  • If a row is re-used for a new record, we see a primary key change and also copy over the DELETE

  • Before copying, a shared row lock is taken on the live node and an exclusive row lock on the starting node

  • Locks are required to avoid race conditions with concurrent updates, which might otherwise be discarded on the starting node

  • The lock on the live node can be released directly after the read, due to signal ordering

  • REDO logs are activated on the starting node once all fragments are synchronised

Optimisations: Same as Part I

Wait for Local Checkpoint
  • Before this, all data is up-to-date, but the node could not yet recover from a crash on its own

  • Participate in distributed local checkpoint; sped up by previous non-distributed local checkpoints

Notes on Performance#

A detailed YCSB benchmark of RonDB’s recovery performance can be found in a blog post.

In order to achieve better performance during recovery, we generally recommend the following:

  • Machines with many CPUs

  • Use local NVMe drives / high-bandwidth block storage

The advantages and disadvantages for local NVMe are universal and also apply to RonDB:

  • + Very high bandwidth

  • + Normal node restarts are very fast

  • - Using new machines requires an initial node restart (extra load on the system)