Cluster log and node logs#
Most programs print a lot of information to make it possible to troubleshoot problems when they occur, and RonDB is no different in this respect. It is very easy to add new log messages. These log messages are not documented as such; they should be viewed as free-text explanations of events in the software.
The RonDB management server(s) print the cluster log. There are simple functions in the RonDB software to print a message in the cluster log. Look for infoEvent calls in the block code.
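As a rough illustration of what such a call looks like in block code (the message text below is made up for this example; only the infoEvent call and getOwnNodeId() are taken from the code base):

::: tcolorbox
// Hypothetical example of a cluster log message printed from a block
infoEvent("My new event happened in node %u", getOwnNodeId());
:::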
The RonDB data nodes print node logs. These are a bit more detailed in some aspects.
The MySQL Server has an error log that is also used to write log messages from the NDB storage engine.
The cluster log#
The cluster log is found in a file normally called node_65_cluster.log if the node id of the management server is 65.
It contains all sorts of messages about what is going on. To provide a flavour of what it contains we first show the printouts in the cluster log at the very first initial start.
The first segment of the log shows what node id the management server uses and what port it is using. It reports the version number and informs us that we start from an initial configuration. After this the first configuration generation is created with version number 1. If the management server is restarted it will reuse the configuration in the binary configuration file it created. How to change the configuration is covered in some detail in a later chapter.
We have broken the long lines to make the examples readable; some of the messages in the logs can use very long lines.
::: tcolorbox
2021-03-23 15:48:38 [MgmtSrvr] INFO -- Id: 65, Command port: *:1186
2021-03-23 15:48:38 [MgmtSrvr] INFO -- Node 65: Node 65 Connected
2021-03-23 15:48:38 [MgmtSrvr] INFO -- RonDB Management Server RonDB-21.04.1 started
2021-03-23 15:48:39 [MgmtSrvr] INFO -- Node 65 connected
2021-03-23 15:48:39 [MgmtSrvr] INFO -- Starting initial configuration change
2021-03-23 15:48:39 [MgmtSrvr] INFO -- Configuration 1 commited
2021-03-23 15:48:39 [MgmtSrvr] INFO -- Config change completed! New generation: 1
:::
A few seconds later the RonDB data nodes 1 and 2 start up. They allocate a node id before they can start up. After receiving their node ids they convert the connection to the management server into an NDB connection that can be used to send signals between the management server and the data nodes.
Each start phase completion is reported into the cluster log. A message that starts with Node 1: means that it is a message generated by node 1 and sent to the cluster log for logging. When it isn't preceded by this, it is a message coming from the management server itself.
We will skip showing the time and MgmtSrvr for the rest of the messages; they are there in each line. We will also skip INFO and -- for all lines that are INFO messages.
::: tcolorbox
Nodeid 1 allocated for NDB at 127.0.0.1
Nodeid 2 allocated for NDB at 127.0.0.1
Node 65: Node 1 Connected
Node 65: Node 2 Connected
Node 2: Start phase 0 completed
Node 1: Start phase 0 completed
Node 1: Initial start, waiting for 2 to connect,
  nodes [ all: 1 and 2 connected: 1 no-wait: ]
Node 2: Initial start, waiting for 1 to connect,
  nodes [ all: 1 and 2 connected: 2 no-wait: ]
Node 1: Communication to Node 2 opened
Node 2: Communication to Node 1 opened
Alloc node id 67 failed, no new president yet
Node 2: Node 1 Connected
Node 1: Node 2 Connected
Node 1: Initial start with nodes 1 and 2 [ missing: no-wait: ]
Node 1: CM_REGCONF president = 1, own Node = 1, our dynamic id = 0/1
WARNING -- Failed to allocate nodeid for API at 127.0.0.1.
  Returned error: 'Id 67 already allocated by another node.'
:::
The cluster log reports what type of start each node is performing, what nodes have already connected and which nodes are required for the restart to proceed. Thus first both node 1 and node 2 report that they want to perform an initial start, but they are waiting for both nodes to connect. Once they connect to each other they select node 1 as president (the lowest node id wins when no president has started yet) and continue immediately with the next phase of the startup.
Node 67 is a MySQL Server attempting to connect to the cluster. For an API node to be able to allocate a node id, the cluster must be up and running. Therefore the allocation will fail. The MySQL Server will continue to retry regularly, so the log will receive a fair number of messages about failed node id allocations from MySQL Servers that start before the cluster has started. This is not a problem, it is simply something we log.
::: tcolorbox
Node 2: CM_REGCONF president = 1, own Node = 2, our dynamic id = 0/2
Node 1: Node 2: API RonDB-21.04.1
Node 2: Node 1: API RonDB-21.04.1
Node 2: Start phase 1 completed
Node 1: Start phase 1 completed
Node 1: System Restart: master node: 1, num starting: 2, gci: 0
Node 1: CNTR_START_CONF: started: 0000000000000000
Node 1: CNTR_START_CONF: starting: 0000000000000006
:::
CM_REGCONF is received when we have completed entering the heartbeat protocol. When we reach this state we have completed memory allocations and memory initialisations in the nodes.
Next a new start phase is completed and we report a System Restart (all cluster nodes start, no cluster is up and running). Since gci is 0 here, it is a System Restart starting from scratch, which we call an initial start.
Next the nodes connect on the next layer, where we deduce that node 1 considers no previous nodes started and that nodes 1 and 2 are starting. The 0...06 is a hexadecimal number representing a bitmap of all starting nodes: bit 1 is 2 and bit 2 is 4, and 2 + 4 = 6.
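As a minimal standalone sketch (not RonDB code) of how such a bitmap is formed and printed:

::: tcolorbox
#include <cstdio>

int main() {
  // Nodes 1 and 2 are starting: set bit 1 and bit 2 in the node bitmap.
  unsigned long long starting = (1ULL << 1) | (1ULL << 2);  // 2 + 4 = 6
  printf("starting: %016llx\n", starting);  // prints 0000000000000006
  return 0;
}
:::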
::: tcolorbox
Node 1: Local redo log file initialization status:
  #Total files: 16, Completed: 0
  #Total MBytes: 192, Completed: 0
:::
Next we receive a report on each log part that is initialised in each node. This initialisation can take some time.
::: tcolorbox
Node 2: Start phase 4 completed (initial start)
Start phase 4 completed (initial start)
:::
We continue receiving reports on completed start phases from all starting nodes.
::: tcolorbox
Node 1: Make On-line Database recoverable by waiting for LCP Starting, LCP id = 2
Node 1: Local checkpoint 2 started. Keep GCI = 1 oldest restorable GCI = 1
Node 1: LDM(1): Completed LCP, #frags = 4 #records = 1037, #bytes = 33264
...
Node 1: Local checkpoint 2 completed
Node 1: Make On-line Database recoverable by waiting for LCP Completed, LCP id = 2
...
Node 1: President restarts arbitration thread [state=1]
...
Node 1: Start phase 101 completed (initial start)
Node 1: Started (RonDB-21.04.1)
Node 2: Started (RonDB-21.04.1)
Node 2: Prepare arbitrator node 3 [ticket=1cc90001aec87e2a]
Node 1: Started arbitrator node 3 [ticket=1cc90001aec87e2a]
Node 2: Communication to Node 67 opened
:::
After completing the recovery, we need to make the database recoverable by performing a local checkpoint. Execution of local checkpoints is reported in the cluster log regularly. The log will be filled with these in normal operation.
The cluster log will switch to a new file when the file is full (1 MByte in size). The number of cluster log files that are saved before they are overwritten is configurable. If you want to save the cluster log files for a longer time you need to copy them to another place.
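The rotation is controlled by the LogDestination parameter of the management server. A hedged example of how this could look in config.ini (file name and values are illustrative; they correspond to the usual defaults of 1 MByte files and 6 saved files):

::: tcolorbox
[ndb_mgmd]
NodeId=65
# Keep up to 6 cluster log files of at most 1 MByte each
LogDestination=FILE:filename=node_65_cluster.log,maxsize=1000000,maxfiles=6
:::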
For each local checkpoint we report the number of fragments checkpointed, the number of records written as part of this checkpoint and the number of bytes written in this checkpoint.
Phase 101 is the last restart phase and at the end of the restart we start up the arbitrator node to ensure that we can survive at least one node crash.
When the node has started, a lot of messages are logged about communication between nodes being opened, node ids being allocated and nodes being connected.
::: tcolorbox
Alloc node id 67 succeeded
Nodeid 67 allocated for API at 127.0.0.1
Node 67: mysqld --server-id=1
Node 2: Node 67 Connected
Node 1: Node 67: API RonDB-21.04.1
:::
For MySQL Servers that start up we can see the server id of the MySQL Server.
When a table is created we receive reports about the phases it passes through while creating the table.
When the cluster has entered a started state and tables have been created, the logging quiets down. Now it is mostly regular checkpoints that get logged.
This stays the case until we decide to restart a node or some other event happens.
The majority of the events logged in the cluster log are related to starts of the cluster, failed or disconnecting nodes, restarts of a node and local checkpoints.
Most of the reporting is intended to be readable messages that can be understood with some background in RonDB.
It is possible to get logs about even more events by changing the cluster log levels as discussed in the previous chapter.
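As a quick reminder of that mechanism, the cluster log level for a category can be raised from the ndb_mgm client; the categories and levels below are chosen only as an example:

::: tcolorbox
ndb_mgm -e "ALL CLUSTERLOG STARTUP=15"
ndb_mgm -e "ALL CLUSTERLOG CHECKPOINT=8"
:::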
Naturally the cluster log also has information about various error situations.
Personally I find the node logs more interesting to look at when analysing problems since they often contain a lot more detail about what happened. But the cluster log is a good place to start your investigation since it has a helicopter view of what goes on in the cluster.
Management server node log#
The management server also reports a bit of information in its node log. Most of this information is also recorded in the cluster log; there is rarely much of interest in this log except when there is a problem in starting up the management server.
Data node logs#
If for some reason there is an issue with starting a node, stopping a node, a node crash or some error situation, it is usually in the data node logs that most information can be retrieved about the problem.
The data node logs are mostly focused on restarts, node failures and error situations. In particular one can often see watchdog problems in the node log. These often occur when the OS misbehaves and doesn't schedule the process or one of the threads in the process. It is not uncommon for this to happen for seconds at a time in certain situations. Since RonDB is a real-time system, it will quickly detect those problems since each node has to send heartbeats regularly to other nodes. RonDB will not work very well if the OS runs like syrup and constantly fails to make progress.
The node logs are constantly written to, so it is important to regularly copy them away and empty the file. Otherwise there is a risk that the file system runs out of space for logging. A popular Linux tool to handle this is logrotate. It can be used together with cron to ensure that the log file is constantly rotated and that the log is backed up if required.
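A hedged sketch of what a logrotate configuration for a data node log could look like (the path and file name are placeholders for your data directory); copytruncate matters here since the data node keeps the log file open:

::: tcolorbox
/path/datadir/ndbd.1/node_1_out.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    copytruncate
}
:::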
We will now show examples of output in the node log; there is some information in there that is useful for various purposes. In all these examples we will skip any timestamps and INFO strings. Some of the messages have this header of time and type of message, some messages have no such header info.
::: tcolorbox
Angel connected to 'localhost:1186'
Angel allocated nodeid: 1
Angel pid: 71526 started child: 71530
Normal start of data node using checkpoint and log info if existing
Configuration fetched from 'localhost:1186', generation: 1
Changing directory to '/path/datadir/ndbd.1'
ThreadConfig: input: main={count=1},tc={count=3},ldm={count=4},
  io={count=1},rep={count=1},recv={count=2},send={count=2}
LockExecuteThreadToCPU: => parsed: main,ldm,ldm,ldm,ldm,
  recv,recv,rep,tc,tc,tc,send,send
:::
A data node starts up first with the angel process. This process connects to the management server and allocates a node id. Next the angel forks a data node process that runs the actual data node. The angel stays around waiting to start up the data node again if it fails.
As seen above the node log contains the host and port of the management server, the node id allocated and the PID of both the angel process and the data node process.
Next the data node process changes its current directory to the data directory of the data node. We always get a report of the threads that the data node will start up. In this case there are four ldm threads, three tc threads, two send threads, two recv threads, one main thread and one rep thread.
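This thread setup corresponds to a ThreadConfig setting along the following lines in config.ini (shown here only to illustrate the parameter, not as a recommended configuration):

::: tcolorbox
[ndbd default]
ThreadConfig=main={count=1},tc={count=3},ldm={count=4},io={count=1},rep={count=1},recv={count=2},send={count=2}
:::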
::: tcolorbox
....
A monotonic timer was not available on this platform.
Memory Allocation for global memory pools Starting
Adjusting system time manually, or otherwise (e.g. NTP), may cause
  false watchdog alarms, temporary freeze, or node shutdown.
numa_set_interleave_mask(numa_all_nodes) : no numa support
Ndbd_mem_manager::init(1) min: 248Mb initial: 268Mb
Touch Memory Starting, 6220 pages, page size = 32768
Touch Memory Completed
Adding 195Mb to ZONE_19 (1, 6219)
Memory Allocation for global memory pools Completed
:::
Next we receive some redundant information, skipped here, and we report the version number. In this case I start on Mac OS X where no monotonic timer is available, so I get appropriate information about what to consider to avoid problems in this case.
The global memory manager allocates most of the memory used in RonDB. After allocation the memory is touched to ensure that the memory pages are allocated. This ensures that we don't run out of memory space in normal operation. Touching the memory can take about 1 second per GByte, so it can be a fairly lengthy process.
::: tcolorbox
Started thread, index = 0, id = -1, type = NdbfsThread
Loading blocks for data node run-time environment
Starting Sending and Receiving services
Started thread, index = 1, id = -1, type = SocketClientThread
Started thread, index = 2, id = -1, type = WatchDogThread
Started thread, index = 3, id = -1, type = SocketServerThread
:::
Every thread started in the startup has some sort of printout attached to it. Above we see the first instance of the NDB file system threads; these handle the interaction with the file system and there can be quite a few of them, although they do not use the CPU very much. These threads are there to ensure that the file system can also be accessed using asynchronous signals.
We start a watchdog thread, a thread that connects as a client to other nodes through a socket, and a thread handling our socket server that waits for other nodes to connect to us.
::: tcolorbox
Starting the data node run-time environment
thr: 0 (main) DBTC(0) DBDIH(0) DBDICT(0) NDBCNTR(0) QMGR(0)
  NDBFS(0) CMVMI(0) TRIX(0) DBUTIL(0) DBSPJ(0) THRMAN(0)
  TRPMAN(0) THRMAN(1)
realtime=0, spintime=0, max_signals_before_send=300,
  max_signals_before_send_flush=110
First START_ORD executed to connect MGM servers
Received second START_ORD as part of normal start
Disconnect all non-MGM servers
Start excuting the start phases
Sending READ_CONFIG_REQ to index = 0, name = CMVMI
Start initiated (RonDB-21.04.1)
:::
For the block threads we receive information about which block objects they handle, the realtime setting of the thread and the signal scheduler settings (configured through the SchedulerResponsiveness parameter; a higher setting will increase responsiveness and thus decrease the maximum number of signals executed before flush and before send). The above printout comes from Mac OS X where thread ids are not available. On Linux a thread id is also reported for each thread; it can be used e.g. in the perf scripts that we will discuss later in this book.
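For reference, SchedulerResponsiveness is a plain data node parameter in config.ini; the value below is only an example (the range is 0 to 10):

::: tcolorbox
[ndbd default]
# Higher values favour latency over throughput; 8 is just an example
SchedulerResponsiveness=8
:::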
::: tcolorbox
Ndb kernel thread 2 is stuck in: Unknown place 0 elapsed=11
Watchdog: User time: 3 System time: 16
thr: 12 (send)
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 18, id = -1, type = NdbfsThread
:::
Above we have a typical warning message that can easily happen in an environment that cannot handle the real-time requirements. The watchdog here reports that the thread with id 2 is stuck in an unknown place, together with the time that has passed since it was last seen. The watchdog will report threads that don't show up within 100 milliseconds. It will not crash the node until a much longer time has passed, so log messages of this kind can show up every now and then.
A message about kernel threads being stuck is an indication of CPU overload and should be avoided if real-time responsiveness is desirable.
Very early in our development we had such a case where the user had a crash due to the watchdog firing at 2am every night. After some investigation it turned out that at that time the machine backup was started; it consumed almost all the memory resources in the machine and made the data node go into swapping mode. This is one reason why we have the configuration parameter LockPagesInMainMemory, which ensures that all our memory will stay in memory and never be swapped out, even in such cases where heavy programs are started at the same time. In the same fashion modern VM technology can be used to protect the data node programs from other programs getting access to their dedicated CPU resources. Docker has some protection against this as well, but VM protection is safer and can protect entire CPUs. Docker can ensure that we get a certain share of the CPUs.
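The parameter is set per data node in config.ini; a hedged example (the value 1 locks the memory after allocation is done):

::: tcolorbox
[ndbd default]
# 0 = no locking, 1 = lock after allocating memory, 2 = lock before allocating
LockPagesInMainMemory=1
:::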
::: tcolorbox
READ_CONFIG_REQ phase completed, this phase is used to read configuration
  and to calculate various sizes and allocate almost all memory needed by
  the data node in its lifetime
Not initial start
Start phase 0 completed
Phase 0 has made some file system initialisations
Starting QMGR phase 1
DIH reported initial start, now starting the Node Inclusion Protocol
:::
Phase 0 of the restart sends a signal READ_CONFIG_REQ to all blocks. This signal gives the blocks a chance to perform early initialisation of objects; it allocates any objects not allocated from the global memory space and calculates the sizes of various pools and various other values based on the configuration.
In phase 1 QMGR will insert us into the heartbeat protocol.
::: tcolorbox
Include node protocol completed, phase 1 in QMGR completed
Start phase 1 completed
Phase 1 initialised some variables and included node in cluster,
  locked memory if configured to do so
Asking master node to accept our start (we are master, GCI = 0)
We are performing an initial start of the cluster
System Restart: master node: 1, num starting: 2, gci: 0
CNTR_START_CONF: started: 0000000000000000
CNTR_START_CONF: starting: 0000000000000006
NDBCNTR master accepted us into cluster, start NDB start phase 1
:::
QMGR handles the lowest layer of the cluster node management. The next layer is handled by the NDBCNTR block. We are informed of which node became the master node, how many nodes are starting and their node ids in the form of a bitmap.
After being accepted by the master node to start we proceed on with the restart.
::: tcolorbox
LDM(1): Starting REDO log initialisation
LDM(1): Completed REDO log initialisation of logPart = 0
LDM(1): Completed REDO initialisation
Schema file initialisation Starting
Schema file initialisation Completed
NDB start phase 1 completed
Start phase 2 completed
Phase 2 did more initialisations, master accepted our start,
  we initialised the REDO log
:::
This is an initial start, so now it is time to initialise the REDO log files to ensure that we have access to the full space of the REDO log files. This can take considerable time in the first initial start; it is only done in an initial start and an initial node restart.
::: tcolorbox
Grant nodes to start phase: 3, nodes: 0000000000000006
Start NDB start phase 2
NDB start phase 2 completed
Start phase 3 completed
Phase 3 performed local connection setups
Grant nodes to start phase: 4, nodes: 0000000000000006
Start NDB start phase 3
LDM(1): Completed old Redo head invalidation on log part 0
LDM(1): All redo actions complete (apply, invalidate)
Start invalidating: Part 0, Head: file: 0, page: 1,
  Invalidation start: file: 0, page: 1,
  actual start invalidate: file: 0 page: 1
:::
As part of the restart we have to apply the REDO log and before that we might have to invalidate a part of the REDO log that was written, but didn't survive the crash. We get reports about where this happens.
::: tcolorbox
LDM(1): Completed old Redo head invalidation on log part 0
LDM(1): All redo actions complete (apply, invalidate)
NDB start phase 3 completed
Starting with m_restart_seq set to 1
Start phase 4 completed
Phase 4 continued preparations of the REDO log
Grant nodes to start phase: 5, nodes: 0000000000000006
NDB start phase 4 completed
Creating System Tables Starting as part of initial start
Node 2 have reached completion of NDB start phase 4
Creation of System Tables Completed
Start NDB start phase 5 (only to DBDIH)
:::
m_restart_seq is the number of the restart and ensures that we can separate this instance of the node from any earlier instance of this node that already failed. Other nodes might have signals received from the old instance. So we need some protection to ensure that we don't get confused by old signals still progressing through the cluster even after a node failure.
In an initial start we create some system tables and fill them with data.
::: tcolorbox
NDB start phase 5 completed
Start NDB start phase 6
NDB start phase 6 completed
Start phase 5 completed
Phase 5 Created the System Table
Phase 5 waited for local checkpoint to complete
Grant nodes to start phase: 6, nodes: 0000000000000006
Start phase 6 completed
Phase 6 updated blocks about that we've now reached the started state.
Grant nodes to start phase: 7, nodes: 0000000000000006
President restarts arbitration thread [state=1]
Activating bucket 0 in SUMA
Start phase 7 completed
Phase 7 mainly activated the asynchronous change events process,
  and some other background processes
:::
Here we might get confused since start phase 5 completes after NDB start phase 6 has completed. The reason is that the start phases are managed by the Missra subblock in NDBCNTR, which sends phased signals to each block in the data node. This is controlled by the signals STTOR and STTORRY.
When NDBCNTR receives such a start phase from Missra it starts off zero, one or more NDB start phases.
The reason for the separation of lower-layer cluster management and database cluster management, both in the heartbeat setup and early phases and in the later phases of the restart, is historical.
In the early days, before the NDB architecture had settled down, the idea was that NDB would be just one subsystem running in the virtual machine. Thus STTOR/STTORRY handle the start of the virtual machine whereas NDB_STTOR and NDB_STTORRY handle the start of the database machine.
This idea was completely abandoned almost 20 years ago and the distinction between the layers is not as clear anymore. But the design works fine, so there has been no specific need to change this part. It would also be quite hard to change, given the impact it would have on the NDB protocols and thus on upgrades of the cluster software.
In the late restart phases we start up the event API handling and the arbitration thread, ensuring that we can survive split-brain scenarios.
::: tcolorbox
Grant nodes to start phase: 8, nodes: 0000000000000006
Start NDB start phase 7
Foreign Key enabling Starting
Foreign key enabling Completed
NDB start phase 7 completed
Start phase 8 completed
Phase 8 enabled foreign keys and waited for all nodes to complete
  start up to this point
Grant nodes to start phase: 9, nodes: 0000000000000006
Start phase 9 completed
Phase 9 enabled APIs to start connecting
Grant nodes to start phase: 10, nodes: 0000000000000006
Start phase 49 completed
Grant nodes to start phase: 50, nodes: 0000000000000006
Restart complete, updated local sysfile
Start phase 50 completed
Grant nodes to start phase: 51, nodes: 0000000000000006
Start phase 101 completed
Phase 101 was used by SUMA to take over responsibility for sending
  some of the asynchronous change events
Grant nodes to start phase: 102, nodes: 0000000000000006
Node started
Started arbitrator node 65 [ticket=176a00019a4cb559]
Allocate event buffering page chunk in SUMA, 16 pages,..
:::
Here are the final phases of the initial start, where foreign key handling is started up, followed by a couple more phases that synchronise special parts of the data nodes. After starting the arbitrator on node 65 (the NDB management server) the restart is done.
Finding node log messages in code#
If you are interested in understanding a bit more about the NDB code you can do this by adding more messages to the node log. Most normal information messages are added through a call such as shown below:
::: tcolorbox
g_eventLogger->info("Insert normal printf logic here");
:::
By doing a grep as below you can quickly find all the examples of node log messages in our code base.
::: tcolorbox
cd $MYSQL_SOURCE_DIR
cd storage/ndb/src
grep -r 'g_eventLogger->info' *
:::
Replacing info with debug shows all debug messages, warning shows all warning messages and critical shows all critical messages.
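For example, from the same directory as above:

::: tcolorbox
grep -r 'g_eventLogger->debug' *
grep -r 'g_eventLogger->warning' *
grep -r 'g_eventLogger->critical' *
:::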
Writing to node logs#
Every write to the node log presents a potential problem: it can fail due to various problems with the disk. To avoid crashing the data nodes in those situations we implement the actual log writing in a special log writing thread. If the log write fails after a number of retries, the log message is thrown away and the data node can continue to operate. The same happens if the log writing buffer in this thread is full.
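To illustrate the general pattern described above, here is a minimal C++ sketch of a bounded log queue served by a dedicated writer thread that drops messages instead of blocking when the buffer is full. This is purely illustrative and is not RonDB's actual implementation.

::: tcolorbox
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>
#include <thread>

// Illustrative sketch only: a bounded log queue served by one writer thread.
class AsyncLogger {
public:
  AsyncLogger() : m_stop(false), m_writer(&AsyncLogger::run, this) {}
  ~AsyncLogger() {
    { std::lock_guard<std::mutex> lk(m_mutex); m_stop = true; }
    m_cond.notify_one();
    m_writer.join();
  }
  // Returns false when the buffer is full and the message is dropped.
  bool log(std::string msg) {
    std::lock_guard<std::mutex> lk(m_mutex);
    if (m_queue.size() >= MAX_QUEUED) return false;  // drop, never block
    m_queue.push_back(std::move(msg));
    m_cond.notify_one();
    return true;
  }
private:
  void run() {
    std::unique_lock<std::mutex> lk(m_mutex);
    while (!m_stop || !m_queue.empty()) {
      if (m_queue.empty()) { m_cond.wait(lk); continue; }
      std::string msg = std::move(m_queue.front());
      m_queue.pop_front();
      lk.unlock();
      // A real implementation would retry a failing write a few times and
      // then throw the message away so the data node can keep operating.
      std::fprintf(stderr, "%s\n", msg.c_str());
      lk.lock();
    }
  }
  static constexpr size_t MAX_QUEUED = 1024;
  bool m_stop;
  std::deque<std::string> m_queue;
  std::mutex m_mutex;
  std::condition_variable m_cond;
  std::thread m_writer;  // declared last so it starts after the other members
};

int main() {
  AsyncLogger logger;
  logger.log("Start phase 0 completed");
  return 0;
}
:::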
Summary of logs#
The data node logs and the cluster log contain lots of information. The above was presented mostly as an example. They present information about restarts and node failures. They are heavily used by developers printing debug information during analysis of bugs and during new feature development.
In MySQL Cluster 7.4 a major effort was spent on increasing the supportability of the product by improving the understandability of the restart messages. Many new messages were added to ensure that it is possible to follow the restart phases in some detail. A new ndbinfo table was developed that keeps track of node restarts in a live system, where one can see the progress of restarts.
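This table is available as ndbinfo.restart_info and can be queried from any MySQL Server connected to the cluster, for example:

::: tcolorbox
SELECT * FROM ndbinfo.restart_info;
:::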
There are many different DUMP commands that can be used to print detailed information on various topics to analyse error situations. These are mostly intended to aid MySQL support to assist users with various sorts of problems.
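DUMP commands are issued through the ndb_mgm client. For example, DUMP 1000 asks the data nodes to report their memory usage to the logs; use DUMP codes with care and preferably under guidance, since some of them are intrusive:

::: tcolorbox
ndb_mgm -e "ALL DUMP 1000"
:::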
The good news is that RonDB is open source and the code is available, so anyone can learn those detailed logging messages as well and become an expert in supporting a RonDB installation.
The cluster log messages are documented and all of them are described in the MySQL manual. This is not the case for messages in the node logs. These are not documented; rather, the aim is for them to be as self-describing as possible, and the number of messages in the node logs can at times be a bit higher.