Troubleshooting RonDB#

You have managed to get a cluster up and running, maybe you have even managed to get the cluster into production usage. All of a sudden a crash occurs unexpectedly. What to do about it?

RonDB is designed for continous operation even in the presence of bugs and failures. In most cases the crashed node comes up again and you are back to a replicated setup after a short time. After that you can analyse your log files to see what went wrong.

There has even been cases where a node has failed and the node failed to start up again. After working with the error logs and the remaining nodes it is possible at times to work around the problem to avoid system downtime. Oftentimes the cluster can stay up and running for weeks and months even with a node down. With two replicas it will stay up for years and years.

RonDB inherited extensive crash logs from its AXE inheritance. In AXE the CPUs generated writes into a jump address memory that showed the last few thousands jumps in the assembler code before any crash. We took this concept to the C++ world by writing a set of scripts called jam() with a few variants. Each such macro will write an entry into a memory area that shows which source code file and which line in this file that we visited the last microseconds before the crash. We record a few thousand such macro calls before the crash. This has an overhead of around 5-10%, but it is an overhead that is well worth the effort in quickly finding bugs.

In addition since RonDB is a message oriented architecture and we use memory for send buffers, it is possible to also generate signal logs for the last few thousand signals. This can take us back milliseconds before the crash and in some cases even many seconds back.

RonDB data nodes uses a multithreaded architecture. Here each thread is generating its own crash output. It is even possible to follow messages that pass between threads in many cases, in system crashes it is even possible to follow messages going between nodes in the cluster.

In addition a very important tool for analysing the behaviour of data nodes is the node logs and cluster logs. These are printouts that goes to special log files, the cluster logs gather all logs for the cluster and the node logs gather logs from all threads within one node (there is a node log also for the management server, and the MySQL Server have its error log).

The MySQL server and the NDB API doesn’t have the same facilities of crash logs. At the same time a crash in the MySQL Server using NDB is a crash of client software, thus it represents very little risk for you of losing your data. It is much more important to care for the risk of losing data in data nodes.

When debugging and developing RonDB we use a set of tools to printout debug messages in data nodes and in the MySQL Server we use the same tools that are used by the rest of the MySQL Server.

With this combination we have the upper hand on any bug caused by temporary source glitches. It is usually detected and found rather easy with those tools.

The crash principle in RonDB is a bit different to the crash principle of the MySQL server. The MySQL server goes a long ways to avoid crashing the MySQL server. The reason is that a normal MySQL server that fails means downtime for the user application.

An RonDB data node that goes down doesn’t represent any downtime in a standard NDB configuration. A standard configuration of RonDB always have 2 replicas of all the data in the data nodes. Thus a crash can be handled without any downtime seen by the user other than that a few transactions will have to be executed again.

In RonDB we employ the fail fast methodology. Thus our code is filled with asserts that ensure that the data structures in RonDB are always consistent. As soon as something is wrong we immediately crash. We see it as better to crash and come back to a stable state rather than to continue operating with data in a state that the code isn’t prepared to handle. Continuing in this state could jeopardize the database since it is easy to work with bad pointers when data integrity has been breached.

Data node crash output#

When a crash occurs we gather a number of logs together before the program is completed.

Each thread writes a trace file that contains a combined jump trace file and a message logger file. The node log is maintained continously and so is the cluster log.

Crashes can occur for a myriad of reasons. The crash could come through internal problems in our code. These cases are normally found through some assert in the code.

Other cases can be that we attempted to perform a configuration change that simply didn’t work. Often there is a clear error message in those cases explaining the problem.

We can crash in cases where we overextend the memory buffers in the node. This is rather uncommon that we crash for this reason, in this case we normally drop the connection to a node instead in this case. We employ various techniques to avoid crashing in this case. In an overload situation it rarely helps to crash, this usually leads us into an even bigger issue.

A crash is recorded in an error log, each threads jam() calls and signal log is recorded in one trace file per thread.

In order to display the use of those files I inserted a very easy crash case in a version I am currently developing. The below is just to show how the various log files looks, that you know where to look (we have cut some lines to make the output readable in the book format).

Error log#

The error log contains a short description of the error that occurred. This describes the time of the failure, the status of the failure (how large impact did it have), an explanation of the problem. In this case it was caused by an assert in the DblqhMain.cpp.

In the error log we can see the exact condition that was checked in the error check and at what line in the code the assert happened. Obviously we need to use source code from the same version to be able to analyse any reports about code lines.

Time: Thursday 19 March 2021 - 20:17:09
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire)
  (Internal error, programming error or
   missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 8084) 0x00000002 Check lqhRef == 0 failed
Program: ndbmtd
Pid: 2526 thr: 3
Version: RonDB-21.04.1
Trace file name: ndb_1_trace.log.1_t5
Trace file path: /path/datadir/ndbd.1/ndb_1_trace.log.1 [t1..t11]
***EOM***

In this case the output was generated by the ndbmtd program.

In the above case we have crashed in node 1, thread id 5 and there was threads t1 through t11. The first thread (the main thread is called ndb_1_trace.log.1. The second thread is the rep thread, this thread has its trace files written into (in this case) ndb_1_trace.log.1_t1. Next comes the ldm threads (4 in my case), these end with t2 -> t5. After that comes the tc threads. In our case we had three of those, thus t6 through t8. There is in our case two recv threads, these gets their input into t9 and t10.

In the above test we got a crash log in the last of the ldm thread. We get the pid, the thread count and the version number generating the crash output.

JAM#

The output of the jam() macros is listed in the above trace files. To make it easier to read the lines around the crash the last line touched before the crash is the last line listed before the signal logging starts. Thus signals are listed in the jam() output going backwards in the file. A good idea to read the trace file is to search for Signal and start from there.

Each new signal is listed in the jam() output and each time we jump into a new file we change into a new line with the name of the source file listed. Next we display the line numbers we have touched. In some cases we can list a 16-bit number instead in the jam(). An example could be that we want to list which node we are processing, in this case the number doesn’t refer to a code line.

Below is an example of one signal execution.

---> signal
DbtupMeta.cpp        00661 00718 00906 00908 00812
DblqhMain.cpp        04077

Signal log#

The signal log have a lot more information in it. Each signal carries a header with receiving block number and name, receiving node id, receiving signal id and the global signal number and the name of the signal and its priority. Here r.bn is short for receiving block number. 247/4 means block number 247 (DBLQH) and ldm thread instance 4 (starting at 1). r.proc means receiving processor id, that is our node id. Each signal carries a signal id, each new signal executed in a thread increments this number. Similarly s.bn, s.proc and s.sigId is the senders identity. Thus we can find the exact signal that sent this signal if we suspect that it did something wrong in sending signal.

The rest of the signal output depends on the signal. Many signals have special signal printers attached to them (LQHKEYREQ below has that). These print the various fields and flags in text and a bit better readability compared to a hexadecimal dump of the signal data that is the default if no special printer is available.

LQHKEYREQ is the signal sent from DBTC to read, update, delete, write or insert rows using a primary key. In this case we are performing an insert. We can see the transaction id, the table id, the back reference to DBTC and a number of other fields.

--------------- Signal ----------------
r.bn: 247/4 "DBLQH", r.proc: 1, r.sigId: 11937
    gsn: 316 "LQHKEYREQ" prio: 1
s.bn: 245/1 "DBTC", s.proc: 1, s.sigId: 81
   length: 13 trace: 0 #sec: 2 fragInf: 0
 ClientPtr = H'00002b20 hashValue = H'27680d04 tcBlockRef = H'02f50001
 transId1 = H'00000000 transId2 = H'00000001 savePointId = H'00000000
 Operation: Insert
 Op: 2 Lock: 0 Flags: CommitAckMarker
   ScanInfo/noFiredTriggers: H'3fffb
 AttrLen: 0 (0 in this) KeyLen: 0 TableId: 2 SchemaVer: 1
 FragId: 7 ReplicaNo: 0 LastReplica: 1 NextNodeId: 2
 ApiRef: H'00fb0001 ApiOpRef: H'00000000
 AttrInfo:

We have developed many prototypes of even more advanced signal loggers that can be used also when not crashing. These can be very helpful in learning about how signal flows for various functions are implemented. This is true even for me with 20 years of experience of developing RonDB. There are commands in the RonDB management client accessible using HELP DEBUG to provide such signal logs.

What we have found again and again is that the crash logs does more than 90% of the functions needed to understand what goes on. It is often simpler to insert a simple assert somewhere in the code and have it crash at some point where you want to see the workings exactly before the crash. Therefore none of those prototypes have been productified. We even had an advanced signal database tool implemented in Java already in the 1990s for this purpose and we have done some efforts in a similar fashion a few years ago as well.

Node and cluster log#

We covered the node and cluster log in the previous chapter.

Watchdog#

An important part of the execution environment is that we always ensure that threads do not stop from proceeding. We check this from a special watchdog thread. Every thread must register into shared data structures that this thread will see. If a block stays away for more than a minute or two the node will crash because of this.

Warnings are printed by the watchdog already after 100 milliseconds. These warnings should be taken seriously, they are usually an indication of the OS environment you are running in. Thus if the OS have a lot of CPU contention these messages can quickly start appearing.

Log writing#

A number of cases occurred at user sites where logging itself could cause the watchdog to fire off. This happened in cases with overloaded disks and other similar OS issues.

To prevent this from happening all log writes have been moved to a special log writing thread. The execution threads will never do any form of IO operations other than those that assist in sending to the network and the recv thread that will take care of the network reading.

ndbrequire#

A very commonly used macro in the RonDB data node code is ndbrequire. This is an assert function, it is executed even in a production environment. Its cousin ndbassert is only executed in a debug environment. For the most part we prefer ndbrequire since we want to fail fast if there is an error in any data structure. Thus ndbassert is more for testing stupid tests that sometimes can be good to have in a development phase to find some silly errors.

ndbrequire can only be used in block code, in other code we have another macro called require.

Debugging RonDB#

Debugging RonDB code is about using the above tools. I have used the above tools to solve many thousands of complex bugs over the years. In all other software I have written I have almost always been forced to write lots of special debug statements scattered all over the code, and as soon as the code is in production these printouts are gone.

The above mechanisms makes it possible for us to work efficiently with users that experiences problems in the field. Many times these logs can be sufficient to solve the issues.

There is still a need for printing debug macros however as a debugging tool and in the NDB API and in the NDB storage engine we use the MySQL server debugging tools that provides a good insight into how the server works.

For the RonDB data nodes there is many ways to write log messages. Most of them end up in the node log. There is also another mechanism that end up in a special log called signal log, this log can be used to combine signal logging and normal debug printouts.

Below is a preferred method I have used heavily the last few years.

Debugging using debug macros#

During my development projects I have used special debug macros a lot in my development.

The reason is that when testing restart code, it is often the case that the bug was in the production of checkpoints. The bug wasn’t discovered until the restart though. Thus we have no chance of discovering the error through signal logs.

In this case I used a wide range of debug macros that can be selectively activated during development.

Here is an example of the definition of such a debug macro. It is important to include the header file EventLogger.hpp and extern declare the event logger object g_eventLogger.

#define DEBUG_EXTRA_LCP 1
#ifdef DEBUG_EXTRA_LCP
#define DEB_EXTRA_LCP(arglist) \
  do { g_eventLogger->info arglist ; } while (0)
#else
#define DEB_EXTRA_LCP(arglist) do { } while (0)
#endif

Now we can use this macro in the code to print a line to the node log that contains a timestamp and also these logs are serialised to ensure that each line is printed in full without interruption from other log writes.

When the code is ready we can simply remove this code by commenting out the line above like this and the code is no longer executed, neither in debug nor in production builds.

//#define DEBUG_EXTRA_LCP 1

To use it in the code one uses a normal printf format. Important to note is that it requires double parenthesis around the statement.

DEB_EXTRA_LCP(("(%u)op_type: %u, newestGci: %u, "
               "tableId: %u, fragId: %u",
               instance(),
               tcPtrP->operation,
               regFragptr->newestGci,
               regFragptr->tabRef,
               regFragptr->fragId));