Main Programs#

We will now discuss the various programs and what they contain. This includes the various node types, but it also includes various management programs.

ndbmtd#

The heart of RonDB is the data node. ndbmtd is the multithreaded variant of this. The data node contains the code to handle transactions, it implements the various data structures to store and index data, and it implements all the recovery features. Most of the innovations in RonDB are located in the data nodes. The remainder of the programs is about interfacing the data node, managing the cluster and a few other things.

ndbmtd contains a fair amount of thread types. It has several threads to handle the actual data and its indexes and the logging. These threads are called ldm threads. There are also similar thread types that assist ldm threads for reads, these are called query threads. In 22.10 query threads were replaced by query workers. This means that any thread can act as a query thread. Therefore, from 22.10.2, any thread (except send threads) can handle read queries.

Several threads handle transaction coordination and interface with other nodes. These threads are called tc threads. There is a set of recv threads to handle receive from other nodes. Each thread handles receive from a set of nodes. Communication to another node is through one TCP/IP socket. There is a set of send threads that can handle send to any node. Other threads can also assist send threads. It is still possible to configure with no send threads, this means that sending is done by the threads that generated the messages.

There is a main thread, this is mainly used when doing metadata changes and it assists the send threads. The next thread type is the rep thread, this thread is used for various things, but it is mainly used for replication between clusters and when reorganizing the data after adding more data nodes to the cluster.

There is a set of io threads. These threads are used to communicate with the file system. Since the file system access in most OSs was difficult to make completely asynchronous, we decided to break out file accesses into separate threads, in this manner the file system threads can use the synchronous file IO calls and thus become much easier to handle from a portability point of view.

There is a watchdog thread that keeps track of the other threads to ensure that they continue making progress, there are two connection setup threads, one for connecting client and one that acts as a server in the connection setup.

The ndbmtd threads (ldm, tc, main, rep and recv) contains a small operating systems, they contain a scheduler and has memory buffers setup for communication with other threads efficiently. The programming of most threads is done using an asynchronous programming style where sending messages is a normal activity. There are coding rules for how long time a message execution is allowed to take (10 microseconds). To handle long-running activities one has to use various sorts of continue messages to ensure that other activities get access to CPU resources at predictable latency.

One nice feature of the data nodes is that the more load they get, the more efficient they become. When loading the cluster, the efficiency goes up and thus the nodes have an extra protection against overload problems.

The number of threads per type is flexible and will adapt to the HW the node runs on. It is also possible to control the thread setup in great detail using configuration parameters.

Each thread handles its own data. Thus in most cases, there is no need to use mutexes and similar constructs to handle data. This simplifies programming and gives predictable performance. In RonDB we have introduced query threads that have to use mutexes to control access to various data structures. This means that there is a bit of interaction between the threads to ensure that we make full use of all computing resources in the VMs that RonDB runs in.

Communication between threads is mostly done without mutexes since communication is mostly done using lock-free constructs with a single reader and single writer for each communication buffer. Mutexes and similar constructs are required to wake up sleeping threads, it is required for the send handling. There is special concurrency control to handle the data that represents the data distribution inside the cluster. This is using a mechanism called RCU (Read Copy Update) which means that any number of readers can concurrently read the data. If someone needs to be update the data it will be updated from the same thread always, there is a special memory write protocol used to communicate to readers that they need to retry their reads. This memory write protocol requires the use of memory barriers but requires no special locks.

One data node can scale up to more than 1000 threads. A data node of that size will be capable to handle many millions of reads and writes of rows per second.

The cluster design is intended for homogenous clusters, the number of threads per type should be the same on all data nodes. It is possible to use a heterogenous cluster configuration but it is mainly intended for configuration changes where one node at a time changes its configuration.

It is possible to run several data nodes on the same machine.

ndb_mgmd#

The RonDB Management Server is a special program in RonDB that contains a small distributed database in its own that maintains the RonDB configuration database with the configuration of all the nodes in the cluster. It stores the configuration data of the cluster. It supports changing the configuration data and the change will be done using a distributed transaction to ensure that all management servers get the same updates.

The configuration changes will only be applied to other nodes when the nodes restart. To perform a configuration change requires two steps, first update the configuration in the management server(s), second restart the affected nodes.

The management server is also used for logging purposes. This logging is intended for the management of the cluster and contains event reports of various sorts. There is also a set of commands that can be executed that generates output in the cluster log.

Since the management server is part of the cluster it can answer questions about current cluster state. There is a protocol called the NDB MGM protocol to communicate with the management server. It is a simple half-duplex protocol with carriage return to indicate a newline and two carriage returns to indicate the end of the message and that the sender is now expecting a reply. All the communication in this protocol is using the 7-bit ASCII character set.

This protocol has the ability to handle graceful shutdown of a node and startup of a stopped node.

It can only start nodes if the program is still running, the data nodes can be started through an angle process that will survive when the data node crashes, this angle process will restart the data node process when told to do so by the management server.

A capability that both API nodes and management server nodes can handle is arbitration. Arbitration is used when nodes fail in the cluster to ensure that we can’t get two clusters running after the failure. The simplest case is a two-node cluster (two data nodes that is), if one of them sees the other fail through missing heartbeats it will assume that the other node is dead. If the other node is alive, but the communication between the nodes is down, the other node might reason in the same way. In this case, they will both contact the arbitrator, the first one to contact the arbitrator gets to continue and the second is told to die.

The cluster needs to decide on which node the arbitrator is when nodes are up. A new arbitrator cannot be assigned at the time when a failure happens.

Thus the minimum configuration that can survive a failure of a data node is two working data nodes and one node that is currently arbitrator. If the arbitrator fails while the cluster is up, a new arbitrator is immediately assigned, normally there is some management server or API node available to handle this role.

ndb_mgm#

The NDB Management client is a program that implements the NDB MGM protocol using the NDB MGM API. It provides a number of simple text commands to manage the cluster.

The MySQL Server (mysqld)#

The MySQL Server acts as an API node towards the data nodes and the management server. It can even act as several API nodes to ensure that it can serve the required traffic. One API node is limited by what one socket can deliver and what one receive thread in the NDB API can process. It is possible for larger cluster setups that the number of possible API nodes is the limiting factor in cluster throughput.

The MySQL Server interacts with the data nodes through the NDB API. The NDB API implements the NDB protocol to the data nodes. This protocol is a binary full-duplex protocol where the base message is a signal that has a small header and a payload. There is a fairly large number of predefined signals that can be carried.

One socket can be used by thousands of threads using multiplexing. Many threads can pack their own signals into the send buffers for each socket. Each signal has an address header that describes the receiver thread and the receiver block and a few other items.

API nodes#

API nodes come in many flavors such as LDAP servers, DNS servers, DHCP servers, Hadoop File Server, Hopsworks Online Feature Store and the list can go on for a long time, it is limited to the imagination of application developers using RonDB.

MySQL clients#

Programs that use MySQL APIs to communicate with the cluster are not part of the cluster as such. They can read and write the data in the cluster, but the cluster has no notion of when these nodes start and stop, they connect to a MySQL Server and perform a set of queries and when done they shut down the connection. Only the MySQL server knows of these programs as connections to the MySQL Server.

Any program that uses the MySQL APIs can potentially be used against RonDB as well. Whether it will work is of course dependent on a lot of things. Changing to use a different database engine for an application is always a task that requires some work.