Cluster log and node logs#
Most programs print a lot of information to make it possible to troubleshoot problems when they occur, and RonDB is no different in this respect. It is very easy to add new log messages. These log messages are not documented as such; they should be viewed as free-text explanations of events in the software.
The RonDB management server(s) print the cluster log. There are simple functions in the RonDB software to print a message in the cluster log. Look for infoEvent calls in the block code.
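As a rough illustration of what such a call looks like in block code (the message text below is made up for this example; only the infoEvent call and getOwnNodeId() are taken from the code base):

::: tcolorbox
// Hypothetical example of a cluster log message printed from a block
infoEvent("My new event happened in node %u", getOwnNodeId());
:::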
The RonDB data nodes print node logs. These are a bit more detailed in some aspects.
The MySQL Server has an error log that is also used to write log messages from the NDB storage engine.
The cluster log#
The cluster log is found in a file normally called node_65_cluster.log if the node id of the management server is 65.
It contains all sorts of messages about what is going on. To provide a flavour of what it contains we first show the printouts in the cluster log at the very first initial start.
The first segment of the log shows what node id the management server uses and what port it is using. It reports the version number and informs us that we start from an initial configuration. After this the first configuration generation is created with version number 1. If the management server is restarted it will reuse the configuration in the binary configuration file it created. How to change the configuration is covered in some detail in a later chapter.
We have broken the long lines to make the examples readable; some of the messages in the logs can use very long lines.
::: tcolorbox
2021-03-23 15:48:38 [MgmtSrvr] INFO -- Id: 65, Command port: *:1186
2021-03-23 15:48:38 [MgmtSrvr] INFO -- Node 65: Node 65 Connected
2021-03-23 15:48:38 [MgmtSrvr] INFO -- RonDB Management Server RonDB-21.04.1 started
2021-03-23 15:48:39 [MgmtSrvr] INFO -- Node 65 connected
2021-03-23 15:48:39 [MgmtSrvr] INFO -- Starting initial configuration change
2021-03-23 15:48:39 [MgmtSrvr] INFO -- Configuration 1 commited
2021-03-23 15:48:39 [MgmtSrvr] INFO -- Config change completed! New generation: 1
:::
A few seconds later the RonDB data nodes 1 and 2 start up. They allocate a node id before they can start up. After receiving their node ids they convert the connection to the management server into an NDB connection that can be used to send signals between the management server and the data nodes.
Each start phase completion is reported into the cluster log. A message that starts with Node 1: means that it is a message generated by node 1 and sent to the cluster log for logging. When it isn't preceded by this, it is a message coming from the management server itself.
We will skip showing the time and MgmtSrvr for the rest of the messages; they are there in each line. We will also skip INFO and -- for all lines that are INFO messages.
::: tcolorbox
Nodeid 1 allocated for NDB at 127.0.0.1
Nodeid 2 allocated for NDB at 127.0.0.1
Node 65: Node 1 Connected
Node 65: Node 2 Connected
Node 2: Start phase 0 completed
Node 1: Start phase 0 completed
Node 1: Initial start, waiting for 2 to connect,
  nodes [ all: 1 and 2 connected: 1 no-wait: ]
Node 2: Initial start, waiting for 1 to connect,
  nodes [ all: 1 and 2 connected: 2 no-wait: ]
Node 1: Communication to Node 2 opened
Node 2: Communication to Node 1 opened
Alloc node id 67 failed, no new president yet
Node 2: Node 1 Connected
Node 1: Node 2 Connected
Node 1: Initial start with nodes 1 and 2 [ missing: no-wait: ]
Node 1: CM_REGCONF president = 1, own Node = 1, our dynamic id = 0/1
WARNING -- Failed to allocate nodeid for API at 127.0.0.1.
  Returned error: 'Id 67 already allocated by another node.'
:::
The cluster log reports what type of start each node is performing, what nodes have already connected and which nodes are required for the restart to proceed. Thus first both node 1 and node 2 report that they want to perform an initial start, but they are waiting for both nodes to connect. Once they connect to each other they select node 1 as president (the lowest node id wins when no president has started yet) and continue immediately with the next phase of the startup.
Node 67 is a MySQL Server attempting to connect to the cluster. For an API node to be able to allocate a node id, the cluster must be up and running. Therefore the allocation will fail. The MySQL Server will continue to retry regularly, so the log will receive a fair number of messages about failed node id allocations from MySQL Servers that start before the cluster has started. This is not a problem, it is simply something we log.
::: tcolorbox
Node 2: CM_REGCONF president = 1, own Node = 2, our dynamic id = 0/2
Node 1: Node 2: API RonDB-21.04.1
Node 2: Node 1: API RonDB-21.04.1
Node 2: Start phase 1 completed
Node 1: Start phase 1 completed
Node 1: System Restart: master node: 1, num starting: 2, gci: 0
Node 1: CNTR_START_CONF: started: 0000000000000000
Node 1: CNTR_START_CONF: starting: 0000000000000006
:::
CM_REGCONF is received when we have completed entering the heartbeat protocol. When we reach this state we have completed memory allocations and memory initialisations in the nodes.
Next a new start phase is completed and we report a System Restart (all cluster nodes start, no cluster is up and running). Since gci is 0 here, it is a System Restart starting from scratch, which we call an initial start.
Next the nodes connect on the next layer, where we deduce that node 1 considers no previous nodes started and that nodes 1 and 2 are starting. The 0...06 is a hexadecimal number representing a bitmap of all starting nodes: bit 1 is 2 and bit 2 is 4, and 2 + 4 = 6.
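As a minimal standalone sketch (not RonDB code) of how such a bitmap is formed and printed:

::: tcolorbox
#include <cstdio>

int main() {
  // Nodes 1 and 2 are starting: set bit 1 and bit 2 in the node bitmap.
  unsigned long long starting = (1ULL << 1) | (1ULL << 2);  // 2 + 4 = 6
  printf("starting: %016llx\n", starting);  // prints 0000000000000006
  return 0;
}
:::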
::: tcolorbox
Node 1: Local redo log file initialization status:
  #Total files: 16, Completed: 0
  #Total MBytes: 192, Completed: 0
:::
Next we receive a report on each log part that is initialised in each node. This initialisation can take some time.
::: tcolorbox
Node 2: Start phase 4 completed (initial start)
Start phase 4 completed (initial start)
:::
We continue receiving reports on completed start phases from all starting nodes.
::: tcolorbox
Node 1: Make On-line Database recoverable by waiting for LCP Starting, LCP id = 2
Node 1: Local checkpoint 2 started. Keep GCI = 1 oldest restorable GCI = 1
Node 1: LDM(1): Completed LCP, #frags = 4 #records = 1037, #bytes = 33264
...
Node 1: Local checkpoint 2 completed
Node 1: Make On-line Database recoverable by waiting for LCP Completed, LCP id = 2
...
Node 1: President restarts arbitration thread [state=1]
...
Node 1: Start phase 101 completed (initial start)
Node 1: Started (RonDB-21.04.1)
Node 2: Started (RonDB-21.04.1)
Node 2: Prepare arbitrator node 3 [ticket=1cc90001aec87e2a]
Node 1: Started arbitrator node 3 [ticket=1cc90001aec87e2a]
Node 2: Communication to Node 67 opened
:::
After completing the recovery, we need to make the database recoverable by performing a local checkpoint. Execution of local checkpoints is reported in the cluster log regularly. The log will be filled with these in normal operation.
The cluster log will switch to a new file when the file is full (1 MByte in size). The number of cluster log files that are saved before they are overwritten is configurable. If you want to save the cluster log files for a longer time you need to copy them to another place.
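The rotation is controlled by the LogDestination parameter of the management server. A hedged example of how this could look in config.ini (file name and values are illustrative; they correspond to the usual defaults of 1 MByte files and 6 saved files):

::: tcolorbox
[ndb_mgmd]
NodeId=65
# Keep up to 6 cluster log files of at most 1 MByte each
LogDestination=FILE:filename=node_65_cluster.log,maxsize=1000000,maxfiles=6
:::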
For each local checkpoint we report the number of fragments checkpointed, the number of records written as part of this checkpoint and the number of bytes written in this checkpoint.
Phase 101 is the last restart phase and at the end of the restart we start up the arbitrator node to ensure that we can survive at least one node crash.
When the node has started, a lot of messages are logged about communication between nodes being opened, node ids being allocated and nodes being connected.
::: tcolorbox
Alloc node id 67 succeeded
Nodeid 67 allocated for API at 127.0.0.1
Node 67: mysqld --server-id=1
Node 2: Node 67 Connected
Node 1: Node 67: API RonDB-21.04.1
:::
For MySQL Servers that start up we can see the server id of the MySQL Server.
When a table is created we receive reports about the phases it passes through while creating the table.
When the cluster has entered a started state and tables have been created, the logging quiets down. Now it is mostly regular checkpoints that get logged.
This stays the case until we decide to restart a node or some other event happens.
The majority of the events logged in the cluster log are related to starts of the cluster, failed or disconnecting nodes, restarts of a node and local checkpoints.
Most of the reporting is intended to be readable messages that can be understood with some background in RonDB.
It is possible to get logs about even more events by changing the cluster log levels as discussed in the previous chapter.
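As a quick reminder of that mechanism, the cluster log level for a category can be raised from the ndb_mgm client; the categories and levels below are chosen only as an example:

::: tcolorbox
ndb_mgm -e "ALL CLUSTERLOG STARTUP=15"
ndb_mgm -e "ALL CLUSTERLOG CHECKPOINT=8"
:::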
Naturally the cluster log also has information about various error situations.
Personally I find the node logs more interesting to look at when analysing problems since they often contain a lot more detail about what happened. But the cluster log is a good place to start your investigation since it has a helicopter view of what goes on in the cluster.
Management server node log#
The management server also reports a bit of information in its node log. Most of this information is also recorded in the cluster log; there is rarely much of interest in this log except when there is a problem in starting up the management server.
Data node logs#
If for some reason there is an issue with starting a node, stopping a node, a node crash or some error situation, it is usually in the data node logs that most information can be retrieved about the problem.
The data node logs are mostly focused on restarts, node failures and error situations. In particular one can often see watchdog problems in the node log. These often occur when the OS misbehaves and doesn't schedule the process or one of the threads in the process. It is not uncommon for this to happen for seconds at a time in certain situations. Since RonDB is a real-time system, it will quickly detect those problems since each node has to send heartbeats regularly to other nodes. RonDB will not work very well if the OS runs like syrup and constantly fails to make progress.
The node logs are constantly written to, so it is important to regularly copy them away and empty the file. Otherwise there is a risk that the file system runs out of space for logging. A popular Linux tool to handle this is logrotate. It can be used together with cron to ensure that the log file is constantly rotated and that the log is backed up if required.
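A hedged sketch of what a logrotate configuration for a data node log could look like (the path and file name are placeholders for your data directory); copytruncate matters here since the data node keeps the log file open:

::: tcolorbox
/path/datadir/ndbd.1/node_1_out.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    copytruncate
}
:::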
We will now show examples of output in the node log; there is some information in there that is useful for various purposes. In all these examples we will skip any timestamps and INFO strings. Some of the messages have this header of time and type of message, some messages have no such header info.
::: tcolorbox
Angel connected to 'localhost:1186'
Angel allocated nodeid: 1
Angel pid: 71526 started child: 71530
Normal start of data node using checkpoint and log info if existing
Configuration fetched from 'localhost:1186', generation: 1
Changing directory to '/path/datadir/ndbd.1'
ThreadConfig: input: main={count=1},tc={count=3},ldm={count=4},
  io={count=1},rep={count=1},recv={count=2},send={count=2}
LockExecuteThreadToCPU: => parsed: main,ldm,ldm,ldm,ldm,
  recv,recv,rep,tc,tc,tc,send,send
:::
A data node starts up first with the angel process. This process connects to the management server and allocates a node id. Next the angel forks a data node process that runs the actual data node. The angel stays around waiting to start up the data node again if it fails.
As seen above the node log contains the host and port of the management server, the node id allocated and the PID of both the angel process and the data node process.
Next the data node process changes its current directory to the data directory of the data node. We always get a report of the threads that the data node will start up. In this case there are four ldm threads, three tc threads, two send threads, two recv threads, one main thread and one rep thread.
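This thread setup corresponds to a ThreadConfig setting along the following lines in config.ini (shown here only to illustrate the parameter, not as a recommended configuration):

::: tcolorbox
[ndbd default]
ThreadConfig=main={count=1},tc={count=3},ldm={count=4},io={count=1},rep={count=1},recv={count=2},send={count=2}
:::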
::: tcolorbox
....
A monotonic timer was not available on this platform.
Memory Allocation for global memory pools Starting
Adjusting system time manually, or otherwise (e.g. NTP), may cause
  false watchdog alarms, temporary freeze, or node shutdown.
numa_set_interleave_mask(numa_all_nodes) : no numa support
Ndbd_mem_manager::init(1) min: 248Mb initial: 268Mb
Touch Memory Starting, 6220 pages, page size = 32768
Touch Memory Completed
Adding 195Mb to ZONE_19 (1, 6219)
Memory Allocation for global memory pools Completed
:::
Next we receive some redundant information, skipped here, and we report the version number. In this case I start on Mac OS X where no monotonic timer is available, so I get appropriate information about what to consider to avoid problems in this case.
The global memory manager allocates most of the memory used in RonDB. After allocation the memory is touched to ensure that the memory pages are allocated. This ensures that we don't run out of memory space in normal operation. Touching the memory can take about 1 second per GByte, so it can be a fairly lengthy process.
::: tcolorbox
Started thread, index = 0, id = -1, type = NdbfsThread
Loading blocks for data node run-time environment
Starting Sending and Receiving services
Started thread, index = 1, id = -1, type = SocketClientThread
Started thread, index = 2, id = -1, type = WatchDogThread
Started thread, index = 3, id = -1, type = SocketServerThread
:::
Every thread started in the startup has some sort of printout attached to it. Above we see the first instance of the NDB file system threads; these handle the interaction with the file system and there can be quite a few of them, although they do not use the CPU very much. These threads are there to ensure that the file system can also be accessed using asynchronous signals.
We start a watchdog thread, a thread that connects as a client to other nodes through a socket, and a thread handling our socket server that waits for other nodes to connect to us.
::: tcolorbox
Starting the data node run-time environment
thr: 0 (main) DBTC(0) DBDIH(0) DBDICT(0) NDBCNTR(0) QMGR(0)
  NDBFS(0) CMVMI(0) TRIX(0) DBUTIL(0) DBSPJ(0) THRMAN(0)
  TRPMAN(0) THRMAN(1)
realtime=0, spintime=0, max_signals_before_send=300,
  max_signals_before_send_flush=110
First START_ORD executed to connect MGM servers
Received second START_ORD as part of normal start
Disconnect all non-MGM servers
Start excuting the start phases
Sending READ_CONFIG_REQ to index = 0, name = CMVMI
Start initiated (RonDB-21.04.1)
:::
For the block threads we receive information about which block objects they handle, the realtime setting of the thread and the signal scheduler settings (configured through the SchedulerResponsiveness parameter; a higher setting will increase responsiveness and thus decrease the maximum number of signals executed before flush and before send). The above printout comes from Mac OS X where thread ids are not available. On Linux a thread id is also reported for each thread; it can be used e.g. in the perf scripts that we will discuss later in this book.
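For reference, SchedulerResponsiveness is a plain data node parameter in config.ini; the value below is only an example (the range is 0 to 10):

::: tcolorbox
[ndbd default]
# Higher values favour latency over throughput; 8 is just an example
SchedulerResponsiveness=8
:::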
::: tcolorbox
Ndb kernel thread 2 is stuck in: Unknown place 0 elapsed=11
Watchdog: User time: 3 System time: 16
thr: 12 (send)
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 18, id = -1, type = NdbfsThread
:::
Above we have a typical warning message that can easily happen in an environment that cannot handle the real-time requirements. The watchdog here reports that the thread with id 2 is stuck in an unknown place, together with the time that has passed since it was last seen. The watchdog will report threads that don't show up within 100 milliseconds. It will not crash the node until a much longer time has passed, so log messages of this kind can show up every now and then.
A message about kernel threads being stuck is an indication of CPU overload and should be avoided if real-time responsiveness is desirable.
Very early in our development we had such a case where the user had a crash due to the watchdog firing at 2am every night. After some investigation it turned out that at that time the machine backup was started; it consumed almost all the memory resources in the machine and made the data node go into swapping mode. This is one reason why we have the configuration parameter LockPagesInMainMemory, which ensures that all our memory will stay in memory and never be swapped out, even in such cases where heavy programs are started at the same time. In the same fashion modern VM technology can be used to protect the data node programs from other programs getting access to their dedicated CPU resources. Docker has some protection against this as well, but VM protection is safer and can protect entire CPUs. Docker can ensure that we get a certain share of the CPUs.
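The parameter is set per data node in config.ini; a hedged example (the value 1 locks the memory after allocation is done):

::: tcolorbox
[ndbd default]
# 0 = no locking, 1 = lock after allocating memory, 2 = lock before allocating
LockPagesInMainMemory=1
:::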
::: tcolorbox
READ_CONFIG_REQ phase completed, this phase is used to read configuration
  and to calculate various sizes and allocate almost all memory needed by
  the data node in its lifetime
Not initial start
Start phase 0 completed
Phase 0 has made some file system initialisations
Starting QMGR phase 1
DIH reported initial start, now starting the Node Inclusion Protocol
:::
Phase 0 of the restart sends a signal READ_CONFIG_REQ to all blocks. This signal gives the blocks a chance to perform early initialisation of objects; it allocates any objects not allocated from the global memory space and calculates the sizes of various pools and various other values based on the configuration.
In phase 1 QMGR will insert us into the heartbeat protocol.
::: tcolorbox
Include node protocol completed, phase 1 in QMGR completed
Start phase 1 completed
Phase 1 initialised some variables and included node in cluster,
  locked memory if configured to do so
Asking master node to accept our start (we are master, GCI = 0)
We are performing an initial start of the cluster
System Restart: master node: 1, num starting: 2, gci: 0
CNTR_START_CONF: started: 0000000000000000
CNTR_START_CONF: starting: 0000000000000006
NDBCNTR master accepted us into cluster, start NDB start phase 1
:::
QMGR handles the lowest layer of the cluster node management. The next layer is handled by the NDBCNTR block. We are informed of which node became the master node, how many nodes are starting and their node ids in the form of a bitmap.
After being accepted by the master node to start we proceed on with the restart.
::: tcolorbox
LDM(1): Starting REDO log initialisation
LDM(1): Completed REDO log initialisation of logPart = 0
LDM(1): Completed REDO initialisation
Schema file initialisation Starting
Schema file initialisation Completed
NDB start phase 1 completed
Start phase 2 completed
Phase 2 did more initialisations, master accepted our start,
  we initialised the REDO log
:::
This is an initial start, so now it is time to initialise the REDO log files to ensure that we have access to the full space of the REDO log files. This can take considerable time in the first initial start; it is only done in an initial start and an initial node restart.
::: tcolorbox
Grant nodes to start phase: 3, nodes: 0000000000000006
Start NDB start phase 2
NDB start phase 2 completed
Start phase 3 completed
Phase 3 performed local connection setups
Grant nodes to start phase: 4, nodes: 0000000000000006
Start NDB start phase 3
LDM(1): Completed old Redo head invalidation on log part 0
LDM(1): All redo actions complete (apply, invalidate)
Start invalidating: Part 0, Head: file: 0, page: 1,
  Invalidation start: file: 0, page: 1,
  actual start invalidate: file: 0 page: 1
:::
As part of the restart we have to apply the REDO log and before that we might have to invalidate a part of the REDO log that was written, but didn't survive the crash. We get reports about where this happens.
::: tcolorbox
LDM(1): Completed old Redo head invalidation on log part 0
LDM(1): All redo actions complete (apply, invalidate)
NDB start phase 3 completed
Starting with m_restart_seq set to 1
Start phase 4 completed
Phase 4 continued preparations of the REDO log
Grant nodes to start phase: 5, nodes: 0000000000000006
NDB start phase 4 completed
Creating System Tables Starting as part of initial start
Node 2 have reached completion of NDB start phase 4
Creation of System Tables Completed
Start NDB start phase 5 (only to DBDIH)
:::
m_restart_seq is the number of the restart and ensures that we can separate this instance of the node from any earlier instance of this node that already failed. Other nodes might have signals received from the old instance. So we need some protection to ensure that we don't get confused by old signals still progressing through the cluster even after a node failure.
In an initial start we create some system tables and fill them with data.
::: tcolorbox
NDB start phase 5 completed
Start NDB start phase 6
NDB start phase 6 completed
Start phase 5 completed
Phase 5 Created the System Table
Phase 5 waited for local checkpoint to complete
Grant nodes to start phase: 6, nodes: 0000000000000006
Start phase 6 completed
Phase 6 updated blocks about that we've now reached the started state.
Grant nodes to start phase: 7, nodes: 0000000000000006
President restarts arbitration thread [state=1]
Activating bucket 0 in SUMA
Start phase 7 completed
Phase 7 mainly activated the asynchronous change events process,
  and some other background processes
:::
Here we might get confused since start phase 5 completes after NDB start phase 6 has completed. The reason is that the start phases are managed by the Missra subblock in NDBCNTR, which sends phased signals to each block in the data node. This is controlled by the signals STTOR and STTORRY.
When NDBCNTR receives such a start phase from Missra it starts off zero, one or more NDB start phases.
The reason for the separation of lower-layer cluster management and database cluster management, both in the heartbeat setup and early phases and in the later phases of the restart, is historical.
In the early days, before the NDB architecture had settled down, the idea was that NDB would be just one subsystem running in the virtual machine. Thus STTOR/STTORRY handle the start of the virtual machine whereas NDB_STTOR and NDB_STTORRY handle the start of the database machine.
This idea was completely abandoned almost 20 years ago and the distinction between the layers is not as clear anymore. But the design works fine, so there has been no specific need to change this part. It would also be quite hard to change, given the impact it would have on the NDB protocols and thus on upgrades of the cluster software.
In the late restart phases we start up the event API handling and the arbitration thread, ensuring that we can survive split-brain scenarios.
::: tcolorbox
Grant nodes to start phase: 8, nodes: 0000000000000006
Start NDB start phase 7
Foreign Key enabling Starting
Foreign key enabling Completed
NDB start phase 7 completed
Start phase 8 completed
Phase 8 enabled foreign keys and waited for all nodes to complete
  start up to this point
Grant nodes to start phase: 9, nodes: 0000000000000006
Start phase 9 completed
Phase 9 enabled APIs to start connecting
Grant nodes to start phase: 10, nodes: 0000000000000006
Start phase 49 completed
Grant nodes to start phase: 50, nodes: 0000000000000006
Restart complete, updated local sysfile
Start phase 50 completed
Grant nodes to start phase: 51, nodes: 0000000000000006
Start phase 101 completed
Phase 101 was used by SUMA to take over responsibility for sending
  some of the asynchronous change events
Grant nodes to start phase: 102, nodes: 0000000000000006
Node started
Started arbitrator node 65 [ticket=176a00019a4cb559]
Allocate event buffering page chunk in SUMA, 16 pages,..
:::
Here are the final phases of the initial start, where foreign key handling is started up, followed by a couple more phases that synchronise special parts of the data nodes. After starting the arbitrator on node 65 (the NDB management server) the restart is done.
Finding node log messages in code#
If you are interested in understanding a bit more about the NDB code you can do this by adding more messages to the node log. Most normal information messages are added through a call such as shown below:
::: tcolorbox
g_eventLogger->info("Insert normal printf logic here");
:::
By doing a grep as below you can quickly find all the examples of node log messages in our code base.
::: tcolorbox
cd $MYSQL_SOURCE_DIR
cd storage/ndb/src
grep -r 'g_eventLogger->info' *
:::
Replacing info with debug shows all debug messages, warning shows all warning messages and critical shows all critical messages.
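For example, from the same directory as above:

::: tcolorbox
grep -r 'g_eventLogger->debug' *
grep -r 'g_eventLogger->warning' *
grep -r 'g_eventLogger->critical' *
:::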
Writing to node logs#
Every write to the node log presents a potential problem: it can fail due to various problems with the disk. To avoid crashing the data nodes in those situations we implement the actual log writing in a special log writing thread. If the log write fails after a number of retries, the log message is thrown away and the data node can continue to operate. The same happens if the log writing buffer in this thread is full.
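To illustrate the general pattern described above, here is a minimal C++ sketch of a bounded log queue served by a dedicated writer thread that drops messages instead of blocking when the buffer is full. This is purely illustrative and is not RonDB's actual implementation.

::: tcolorbox
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>
#include <thread>

// Illustrative sketch only: a bounded log queue served by one writer thread.
class AsyncLogger {
public:
  AsyncLogger() : m_stop(false), m_writer(&AsyncLogger::run, this) {}
  ~AsyncLogger() {
    { std::lock_guard<std::mutex> lk(m_mutex); m_stop = true; }
    m_cond.notify_one();
    m_writer.join();
  }
  // Returns false when the buffer is full and the message is dropped.
  bool log(std::string msg) {
    std::lock_guard<std::mutex> lk(m_mutex);
    if (m_queue.size() >= MAX_QUEUED) return false;  // drop, never block
    m_queue.push_back(std::move(msg));
    m_cond.notify_one();
    return true;
  }
private:
  void run() {
    std::unique_lock<std::mutex> lk(m_mutex);
    while (!m_stop || !m_queue.empty()) {
      if (m_queue.empty()) { m_cond.wait(lk); continue; }
      std::string msg = std::move(m_queue.front());
      m_queue.pop_front();
      lk.unlock();
      // A real implementation would retry a failing write a few times and
      // then throw the message away so the data node can keep operating.
      std::fprintf(stderr, "%s\n", msg.c_str());
      lk.lock();
    }
  }
  static constexpr size_t MAX_QUEUED = 1024;
  bool m_stop;
  std::deque<std::string> m_queue;
  std::mutex m_mutex;
  std::condition_variable m_cond;
  std::thread m_writer;  // declared last so it starts after the other members
};

int main() {
  AsyncLogger logger;
  logger.log("Start phase 0 completed");
  return 0;
}
:::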
Summary of logs#
The data node logs and the cluster log contain lots of information. The above was presented mostly as an example. They present information about restarts and node failures. They are heavily used by developers printing debug information during analysis of bugs and during new feature development.
In MySQL Cluster 7.4 a major effort was spent on increasing the supportability of the product by improving the understandability of the restart messages. Many new messages were added to ensure that it is possible to follow the restart phases in some detail. A new ndbinfo table was developed that keeps track of node restarts in a live system, where one can see the progress of restarts.
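This table is available as ndbinfo.restart_info and can be queried from any MySQL Server connected to the cluster, for example:

::: tcolorbox
SELECT * FROM ndbinfo.restart_info;
:::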
There are many different DUMP commands that can be used to print detailed information on various topics to analyse error situations. These are mostly intended to aid MySQL support to assist users with various sorts of problems.
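DUMP commands are issued through the ndb_mgm client. For example, DUMP 1000 asks the data nodes to report their memory usage to the logs; use DUMP codes with care and preferably under guidance, since some of them are intrusive:

::: tcolorbox
ndb_mgm -e "ALL DUMP 1000"
:::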
The good news is that RonDB is open source and the code is available, so anyone can learn those detailed logging messages as well and become an expert in supporting a RonDB installation.
The cluster log messages are documented and all of them are described in the MySQL manual. This is not the case for messages in the node logs. These are not documented; rather, the aim is for them to be as self-describing as possible, and the number of messages in the node logs can at times be a bit higher.