Now we move on to describing the architecture of RonDB. In the figure below we see a typical setup for a small cluster.
The web client contacts the Web Server, the web server contacts the MySQL Server to query data. The MySQL Server will parse the request and contact the data nodes to read or write the data. In addition we have a management server that contains the configuration of the cluster. Each node starting up contacts the management server (ndb_mgmd) to retrieve the configuration in the early phases of starting up the nodes.
The nodes that are part of the cluster can be divided into three types, the data nodes, the management servers and API nodes. All of these nodes have a node id in the cluster and communicate using the NDB protocol.
The data node program is called ndbmtd and is a modern multithreaded version.
There is one management server type ndb_mgmd, there can be one or two management servers, they are required to start up nodes, but as soon as nodes have started up they are only used for cluster logging. Thus if all management servers are down the data nodes and API nodes will continue to operate.
API nodes comes in many flavors. The most common one is of course a MySQL Server (mysqld). But we have also application specific API nodes that use some NDB API variant (will be described later).
A common environment is using RonDB with MySQL Servers. The figure below shows the setup in this case where a client calls the MySQL Server which in turn talks to the RonDB data nodes.
In this setup it is clear that RonDB is always a distributed solution. We have worked hard to optimise the communication between MySQL Server nodes and the RonDB data nodes.
Node group concept (Shards)#
A basic theme in RonDB is the use of node groups. The RonDB data nodes are formed in node groups. When starting the cluster one defines the configuration variable NoOfReplicas. This specifies the number of nodes per node group. When defining the nodes in the cluster the data nodes are listed in the NDBD sections. The data nodes will be put automatically into node groups unless specifically configured to be in a node group.
E.g. if we define 4 nodes with node ids 1,2,3 and 4 and we have NoOfReplicas set to 2, node 1 and 2 will form one node group and node 3 and 4 will form another node group.
Node groups works similar to shards. Thus all the nodes in a node group (shard) is fully replicated within the node group. Thus in the above example node 1 and node 2 will have exactly the same partitions and the same data. Thus all rows that exist in node 1 will also be present in node 2. The same holds true for node 3 and node 4. However none of the rows in node 3 and node 4 will be present in node 1 and node 2 (an exception is tables using the fully replicated feature).
The cluster continue to operate as long as we have at least one node in each node group up and running (there are some things to consider about network partitioning that we will discuss later). If all nodes in one node group have failed the cluster will also fail since we have lost a part of the data. This is one difference to sharding where each shard lives and dies independently of other shards.
RonDB supports foreign key relations between node groups and it supports distributed queries over all node groups, the loss of an entire node group will make it difficult to operate the cluster.
When RonDB was designed there was a choice between spreading partitions over all nodes and dividing into node groups. Spreading partitions means that we will always survive one node failure, but it will be hard to survive multiple node failures. Spreading partitions would mean faster restarts since all nodes can assist the starting node to come up again. Supporting multi-node failures was deemed more important than to optimise node restarts, therefore RonDB use the node group concept.
We will now discuss the various programs and what they contain. This includes the various node types, but it also includes various management programs.
The heart of RonDB is the data node. ndbmtd is the multithreaded variant of this. The data nodes contains the code to handle transactions, it implements the various data structures to store data, to index data. It implements all the recovery features. Thus most of the innovations in RonDB is located in the data nodes. The remainder of the programs is about interfacing the data node, managing the cluster and a few other things.
ndbmtd contains a fair amount of thread types. It has a number of threads to handle the actual data and its indexes and the logging. These threads are called ldm threads. There are also similar thread types that assist ldm threads for reads, these are called query threads.
There are a number of threads that handles transaction coordination and handles the interface towards other nodes to a great extent. These threads are called tc threads. There is a set of recv threads to handle receive from other nodes. Each thread handles receive from a set of nodes. Communication to another node is through one TCP/IP socket. There is a set of send threads that can handle send to any node. Other threads can also assist send threads. It is still possible to configure with no send threads, this means that sending is done by the threads that generated the messages.
There is a main thread, this is mainly used when doing metadata changes and it assists the send threads. The next thread type is the rep thread, this thread is used for various things, but it is mainly used for replication between clusters and when reorganising the data after adding more data nodes to the cluster.
There is a set of io threads. These threads are used to communicate with the file system. Since the file system access in most OSs was difficult to make completely asynchronous we decided to break out file accesses into separate threads, in this manner the file system threads can use the synchronous file IO calls and thus becomes much easier to handle from a portability point of view.
There is a watchdog thread that keeps track of the other threads to ensure that they continue making progress, there are two connection setup threads, one for connecting client and one that acts as a server in the connection setup.
The ndbmtd threads (ldm, query, tc, main, rep and recv) contains a small operating systems, they contain a scheduler and has memory buffers setup for communication with other threads in an efficient manner. The programming of most threads is done using an asynchronous programming style where sending messages is a normal activity. There are coding rules for how long time a message execution is allowed to take (10 microseconds). To handle long running activities one have to use various sorts of continue messages to ensure that other activities gets access to CPU resources at predictable latency.
One nice feature of the data nodes is that the more load they get, the more efficient they become. When loading the cluster, the efficiency goes up and thus the nodes have an extra protection against overload problems.
The number of threads per type is flexible and will adapt to the HW the node runs on. It is also possible to control the thread setup in great detail using configuration parameters.
Each thread handles its own data, thus in most cases there is no need to use mutexes and similar constructs to handle data. This simplifies programming and gives predictable performance. In RonDB we have introduced query threads that have to use mutexes to control access to various data structures. This means that there is a bit of interaction between the threads to ensure that we make full use of all computing resources in the VMs that RonDB runs in.
Communication between threads is mostly done without mutexes since communication is mostly done using lock-free constructs with a single reader and single writer for each communication buffer. Mutexes and similar constructs are required to wake up sleeping threads, it is required for the send handling. There is special concurrency control to handle the data that represents the data distribution inside the cluster. This is using a mechanism called RCU (Read Copy Update) which means that any number of readers can concurrently read the data. If someone needs to be update the data it will be updated from the same thread always, there is a special memory write protocol used to communicate to readers that they need to retry their reads. This memory write protocol requires the use of memory barriers but requires no special locks.
One data node can scale up to more than 1000 threads. A data node of that size will be capable to handle many millions of reads and writes of rows per second.
The cluster design is intended for homogenous clusters, the number of threads per type should be the same on all data nodes. It is possible to use a heterogenous cluster configuration but it is mainly intended for configuration changes where one node at a time changes its configuration.
It is possible to run several data nodes on the same machine.
The RonDB Management Server is a special program in RonDB which contains a small distributed database in its own that maintains the RonDB configuration database with the configuration of all the nodes in the cluster. It stores the configuration data of the cluster. It supports changing the configuration data and the change will be done using a distributed transaction to ensure that all management servers gets the same updates.
The configuration changes will only be applied to other nodes when the nodes restart. To perform a configuration change requires two steps, first update the configuration in the management server(s), second restart the affected nodes.
The management server is also used for logging purposes. This logging is intended for management of the cluster and contains event reports of various sorts. There is also a set of commands that can be executed that generates output in the cluster log.
Since the management server is part of the cluster it can answer questions about current cluster state. There is a protocol called the NDB MGM protocol to communicate with the management server. It is a simple half-duplex protocol with carriage return to indicate a newline and two carriage returns to indicate the end of the message and that the sender is now expecting a reply. All the communication in this protocol is using the 7-bit ASCII character set.
This protocol has the ability to handle graceful shutdown of a node and startup of a stopped node.
It can only startup nodes if the program is still running, the data nodes can be started through an angle process that will survive when the data node crashes, this angle process will restart the data node process when told to do so by the management server.
A capability that both API nodes and management server nodes can handle is arbitration. Arbitration is used when nodes fail in the cluster to ensure that we can't get two clusters running after the failure. The simplest case is a two-node cluster (two data nodes that is), if one of them sees the other fail through missing heartbeats it will assume that the other node is dead. If the other node is alive, but the communication between the nodes is down, the other node might reason in the same way. In this case they will both contact the arbitrator, the first one to contact the arbitrator gets to continue and the second is told to die.
The cluster needs to decide on which node is arbitrator when nodes are up, a new arbitrator cannot be assigned at the time when a failure happens.
Thus the minimum configuration that can survive a failure of a data node is two working data nodes and one node that is currently arbitrator. If the arbitrator fails while the cluster is up, a new arbitrator is immediately assigned, normally there is some management server or API node available to handle this role.
The NDB Management client is a program that implements the NDB MGM protocol using the NDB MGM API. It provides a number of simple text commands to manage the cluster.
The MySQL Server (mysqld)#
The MySQL Server acts as an API node towards the data nodes and the management server. It can even act as several API nodes to ensure that it can serve the required traffic. One API node is limited by what one socket can deliver and what one receive thread in the NDB API can process. It is possible for larger cluster setups that the number of possible API nodes is the limiting factor in cluster throughput.
The MySQL Server interacts with the data nodes through the NDB API. The NDB API implements the NDB protocol to the data nodes. This protocol is a binary full-duplex protocol where the base message is a signal that has a small header and a payload. There is a fairly large number of predefined signals that can be carried.
One socket can be used by thousands of threads using multiplexing. Many threads can pack their own signals into the send buffers for each socket. Each signal has an address header that describes the receiver thread and the receiver block and a few other items.
API nodes comes in many flavors such as LDAP servers, DNS servers, DHCP servers, Hadoop File Server, Hopsworks Online Feature Store and the list can go on for a long time, it is limited to the imagination of application developers using RonDB.
Programs that use MySQL APIs to communicate with the cluster is not part of the cluster as such. They are able to read and write the data in the cluster, but the cluster has no notion of when these nodes start and stop, they connect to a MySQL Server and perform a set of queries and when done they shut down the connection. Only the MySQL server knows of these programs as connections to the MySQL Server.
Any program that uses a MySQL APIs can potentially be used against RonDB as well. Whether it will work is of course dependent on a lot of things. Changing to use a different database engine for an application is always a task that requires some work.