Data Node Architecture#
The data node in RonDB is heavily inspired by the AXE architecture at Ericsson. It is a message-passing architecture in which messages (called signals) are transported from one module (called a block) to another.
Each message has a destination address: a 32-bit number that, similarly to an IP address, is used to route the message to its destination. The message also carries a source address and its data.
The address consists of a node ID, a thread ID and a block ID. This makes it possible to scale to 65535 nodes with 1024 threads in each node and up to 64 blocks in each thread.
Block and Signal Architecture#
The idea with a block is that it has its own data and code. In AXE the blocks were implemented in PLEX, a programming language that did not allow for any kind of code or data sharing. Our blocks are implemented as C++ classes, and they can share some common base classes to handle various common data structures such as hash tables, memory pools and so forth.
For the most part though, the blocks contain their own code and their own data. This has the advantage that errors are quite isolated from each other. Given that the interface to a block is intended to use messages all the time, a bug can only affect other blocks by sending erroneous data. This makes it easier to find bugs in RonDB.
Reading the code requires an understanding of how messages are routed. In the code, a signal is passed using a block reference, and very little in the code itself indicates where the signal goes. It is important to document signal flows to make the code easier to understand; at the same time, there are many ways of tracing signal flows that can assist new developers in understanding them.
When a signal is sent, we use one of several methods with names like sendSignal. A signal uses a 32-bit block reference as its destination address. This address is divided into a number of parts. The lowest 16 bits are the node ID. At the moment we can only use 8 of those bits, but the architecture is designed such that it can grow to clusters with 65535 nodes.
The 16 upper bits are a block address inside a node. They are interpreted slightly differently in data nodes and API nodes.
In an API node, the block address is used as an index into an array of Ndb objects; the indexes in this array start at 32768. There can be at most 4711 Ndb objects in one cluster connection. Since Ndb objects are designed to be used in one thread, this is the limit on the number of parallel threads per cluster connection. In practice, the limit is much smaller; there is very little reason to use thousands of threads per cluster connection.
There are three special API blocks: block number 2047 is used for sending packed signals, block number 4002 for sending signals to the cluster manager, and block number 4003 for configuration change signals between management servers.
In a data node, the lowest 9 bits of the block address are the block number. For example, DBLQH has block number 245. This allows for up to 512 blocks; currently we have 24, so there is plenty of space in this part of the address space.
The upper 7 bits constitute the thread ID of the block. Each thread can contain its own DBLQH instance, for example. Since we are unlikely ever to use more than 64 blocks, RonDB applies a mathematical transformation to block numbers and thread IDs such that only 64 blocks are supported and, in return, up to 1023 threads.
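As a sketch, the address arithmetic described above can be written out as follows. This shows the plain 9/7 split, not RonDB's actual transformed mapping, and the helper names are modeled on, but not identical to, those in the source.

```cpp
#include <cstdint>

// Illustrative layout of a 32-bit block reference as described above:
// low 16 bits node ID, upper 16 bits block address within the node.
// Within the block address, the plain split is 9 bits of block number
// and 7 bits of thread ID. (RonDB's real mapping is a transformation of
// this scheme supporting 64 blocks and up to 1023 threads.)
inline uint32_t makeBlockRef(uint32_t threadId, uint32_t blockNo,
                             uint32_t nodeId) {
  return (((threadId << 9) | blockNo) << 16) | nodeId;
}
inline uint32_t refToNode(uint32_t ref)    { return ref & 0xFFFF; }
inline uint32_t refToBlockNo(uint32_t ref) { return (ref >> 16) & 0x1FF; }
inline uint32_t refToThread(uint32_t ref)  { return ref >> 25; }
```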
When a signal is sent, the software looks at the destination node. If it is not our own node, we hand the signal over to the machinery for sending distributed signals. If it is our own node, the thread ID is combined with the block number to derive the destination thread number, and the signal is placed into a signal queue of the destination thread.
Most signals sent are asynchronous signals. These are not executed immediately but are placed into a signal queue of the thread. This signal queue is called the job buffer.
Asynchronous signals can be sent at two priority levels: A (high priority) and B (normal priority). The priority of a signal affects how its execution is scheduled in a data node. It does not affect how the signal is put into buffers for sending to other nodes, nor the scheduling on an API node.
We can also send synchronous signals. In practice, those are function calls that use the same protocols as signals. They are always executed in the same thread, but might in a few situations execute on blocks that belong to another thread (mostly true when looking up the distribution information from a transaction coordinator).
In some cases, we can also use direct function calls on other blocks in the same thread. This is mainly used for the most optimized code paths and mostly in the ldm thread.
A thread executing signals on blocks uses a scheduler to handle execution and communication with other threads. When a signal is executed in a block, it always enters through a method of a standard form.
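The entry points follow a common pattern: each block class implements one exec-method per signal type, taking the signal object as its argument, and the scheduler dispatches on the signal number in the header. A minimal illustrative sketch (class, signal layout and names here are simplified inventions; a real example is a method such as Dblqh::execLQHKEYREQ):

```cpp
#include <cstdint>

// Simplified signal object: real signals carry up to 25 words of data
// plus optional long-signal sections.
struct Signal {
  uint32_t theData[25];
  uint32_t length;
};

// Hypothetical block class. Each block implements one exec-method per
// signal type it handles; the scheduler calls it with the signal object.
class Dbexample {
public:
  uint32_t m_requestCount = 0;

  void execEXAMPLE_REQ(Signal* signal) {
    // Read signal->theData, update block-local state, and typically
    // send further signals via sendSignal(...).
    (void)signal;
    m_requestCount++;
  }
};

// Drive the block with n signals and return how many it processed.
inline uint32_t runExample(uint32_t n) {
  Dbexample block;
  Signal s{};
  s.length = 1;
  for (uint32_t i = 0; i < n; i++) block.execEXAMPLE_REQ(&s);
  return block.m_requestCount;
}
```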
The scheduler does not use a time-sharing model here; it is the responsibility of each block not to execute for too long. Our coding rules state that a signal should not execute for more than 10 microseconds, and almost all signals should complete within 1-2 microseconds.
Thus even if hundreds of signals are waiting in the job buffer, they will be executed within a few hundred microseconds.
This scheduling model has advantages when combining primary key queries with long-running scan operations. Scan operations handle a few rows at a time when being scheduled. After that, they will be put last in the queue again. This means that they will not block the execution of key lookups for a long time. This architecture is an important part of our solution for predictable latency in RonDB.
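The effect of this batching can be illustrated with a toy model of the job buffer: a scan job processes a few rows per execution and re-enqueues itself at the back of the queue, so a key lookup queued behind it is served after a single batch. (Real blocks achieve the re-enqueueing by sending a CONTINUEB signal to themselves; everything below is a simplified model.)

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Toy scheduler: a FIFO job buffer of callable "signals".
struct Scheduler {
  std::deque<std::function<void(Scheduler&)>> jobBuffer;
  void run() {
    while (!jobBuffer.empty()) {
      auto job = jobBuffer.front();
      jobBuffer.pop_front();
      job(*this);
    }
  }
};

// Returns the order of events: the key lookup runs after one scan batch,
// well before the scan has finished all its rows.
inline std::vector<std::string> demo() {
  std::vector<std::string> log;
  Scheduler sched;
  int rowsLeft = 6;
  std::function<void(Scheduler&)> scan = [&](Scheduler& s) {
    rowsLeft -= 3;                          // process a batch of 3 rows
    log.push_back("scan batch");
    if (rowsLeft > 0)
      s.jobBuffer.push_back(scan);          // back of the queue again
    else
      log.push_back("scan done");
  };
  sched.jobBuffer.push_back(scan);
  sched.jobBuffer.push_back([&](Scheduler&) { log.push_back("key lookup"); });
  sched.run();
  return log;
}
```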
To handle background activities that require a certain percentage of the CPU resources we have implemented methods to check job buffer levels and make use of high-priority signals in cases where it is necessary to run for more extended times. This is currently used by our local checkpoint algorithms to guarantee a certain progress even when the load from traffic queries is very high.
Parallelising signal execution#
Given that each signal starts a new thread of execution, it is extremely easy to parallelize activities in RonDB. If the execution of a signal sends one new signal, it continues a single thread of execution; if it sends two signals, it has in principle created an additional thread of execution.
Thus creating a new thread of execution has a cost measured in nanoseconds. We use this when executing a scan operation. If we have 4 partitions in a table, a scan will send off parallel scan operations towards all partitions from the transaction coordinator.
It is important not to start new threads of execution in an unbounded manner, since this would overload the job buffers. Our design principles state that a signal may not start more than 4 new signals in the same thread, and even signals to other threads must be used with care. For the scans above, we limit the number of scans that can execute in parallel through the MaxNoOfScans configuration parameter, which cannot be set higher than 500.
Signal data structure#
Signals come in several flavors. The simplest carry from 1 to 25 32-bit unsigned integer values and have a fixed, specified length; these are sometimes called short signals.
Many years ago we added long signals, which can come with up to 3 extra parts. The total size of a long signal can be up to around 32 KiB.
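A sketch of the two flavors, with sizes in 32-bit words. The struct layout is purely illustrative, not the actual in-memory representation:

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t MAX_SHORT_SIGNAL_WORDS = 25;  // fixed-size short signal data
constexpr size_t MAX_SECTIONS = 3;             // extra parts of a long signal

// Illustrative long-signal layout: a short-signal main part plus up to
// three variable-size sections.
struct LongSignal {
  uint32_t mainData[MAX_SHORT_SIGNAL_WORDS];
  uint32_t mainLength;                         // <= 25
  const uint32_t* sections[MAX_SECTIONS];
  uint32_t sectionLength[MAX_SECTIONS];        // lengths in words
  uint32_t sectionCount;                       // 0..3
};

// Total payload in words: main part plus all attached sections.
inline uint32_t totalWords(const LongSignal& s) {
  uint32_t total = s.mainLength;
  for (uint32_t i = 0; i < s.sectionCount; i++) total += s.sectionLength[i];
  return total;
}
```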
In some cases it is necessary to send even longer signals. For this we use a concept called fragmented signals: a signal that is sent as a series of long signals. Each long signal is received and its data stored in a buffer until all parts have arrived, at which point the full signal is executed.
These fragmented signals are fairly rare and mainly used in various metadata operations. Normal traffic signals are carried in long signals in almost all cases.
Receive handling in Data Node#
In a data node, we have one or more special threads that handle the sockets communicating with other nodes (more on single-threaded data nodes later). Signals are received on a socket; a special connection setup protocol is used when a connection is established. After that, each signal carries a fixed 12-byte header that ensures we can easily find where one signal ends and the next signal starts.
Any signal can be destined for any thread and any block in the data node.
Each thread has one job buffer for each thread that can communicate with it. The receiver thread has one job buffer to write its signals into for each thread handling signals in the data node. We call these threads block threads.
These job buffers can be written into without using any mutexes; they are mapped such that each job buffer has one thread writing into it and one thread reading from it. In this case it is possible to design buffer read and write algorithms that use only memory barriers to ensure safe communication between the threads. Only high-priority signals are sent to a shared job buffer protected by a mutex.
In very large data nodes with more than 32 block threads, we use 32 job buffers that other threads access using a mutex per job buffer.
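A minimal generic single-producer/single-consumer ring of this kind, using C++11 atomics to provide the memory barriers, could look as follows. This is a textbook sketch, not RonDB's job buffer implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Generic single-producer/single-consumer ring buffer: one writer thread,
// one reader thread, no mutex. Release/acquire ordering on the indices is
// what makes the handoff safe.
template <size_t N>                        // N must be a power of two
class SpscRing {
  uint32_t buf[N];
  std::atomic<size_t> head{0};             // advanced by the producer
  std::atomic<size_t> tail{0};             // advanced by the consumer
public:
  bool push(uint32_t v) {                  // call from the producer only
    size_t h = head.load(std::memory_order_relaxed);
    if (h - tail.load(std::memory_order_acquire) == N)
      return false;                        // full
    buf[h % N] = v;
    head.store(h + 1, std::memory_order_release);  // publish the write
    return true;
  }
  bool pop(uint32_t& v) {                  // call from the consumer only
    size_t t = tail.load(std::memory_order_relaxed);
    if (head.load(std::memory_order_acquire) == t)
      return false;                        // empty
    v = buf[t % N];
    tail.store(t + 1, std::memory_order_release);  // free the slot
    return true;
  }
};

// Single-threaded smoke test of the ring semantics.
inline bool spscDemo() {
  SpscRing<4> ring;
  uint32_t v = 0;
  if (!ring.push(7) || !ring.push(8)) return false;
  ring.pop(v);
  if (v != 7) return false;
  ring.pop(v);
  return v == 8 && !ring.pop(v);
}
```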
The recv threads sleep on an epoll_wait call on Linux and similar constructs on other OSs. Thus as soon as a message arrives from a node the recv thread will wake up and handle it.
It is possible to specify that the recv thread should spin for a while before going to sleep again; this can decrease latency in low-load scenarios. It is also possible to set the thread priority of the recv thread a bit higher, to ensure it is not forced to go to sleep while it still has work to do in high-load scenarios.
CPU spinning is the default in the receive threads and in the block threads. The CPU spinning uses an adaptive algorithm that ensures that we only spin when it is likely that we can gain from it.
Send handling in Data Node#
Sending signals to another node is handled through send buffers. These buffers are configurable in size. A block thread that sends a signal simply copies the signal into one of those send buffers and continues execution.
After executing a number of signals, the scheduler will ensure that the send buffers local to its thread are made available for sending. This includes signals sent to other threads in the same node. The number of signals before this happens is controlled by the configuration parameter SchedulerResponsiveness. Setting this value higher means that we will execute for a longer time before making the buffers available.
After executing even more signals we will return to the scheduler loop. The number of signals before this happens is controlled by the SchedulerResponsiveness parameter. If the job buffer is empty we will always return to the scheduler loop (this is the most common reason of course).
The scheduler loop checks for delayed signals (signals sent with a delay of a specific number of milliseconds). In addition, it handles packing of send buffers when necessary, and it handles the sending of packed signals: signals in which several small signals are packed into one. Packed signals avoid the cost of signal scheduling for a number of very common signals, and they avoid the 12-byte header overhead for each signal sent to another node over the network.
The scheduler of a block thread calls a method do_send to activate the sending of messages. The actual sending of a message is normally configured to be done by one or more send threads. If no send thread is configured, the sending will be handled by the block threads.
In MySQL NDB Cluster 7.5 we added the possibility for sending to be done by block threads even when a send thread is configured. This is handled by an adaptive algorithm. If a thread is very lightly loaded it will wake up once per millisecond to see if it can assist in sending messages. If a thread is in an overloaded state (roughly above 75% load), it will offload all sending to the send threads. If the thread is not overloaded, it will send to as many nodes as the thread itself was attempting to send to.
In practice, this algorithm usually leads to the tc thread heavily assisting send threads since they frequently run through the scheduler loop. The main and rep threads are usually lightly loaded and wake up now and then to ensure that we keep up with the load on the send threads. This ensures that the node can keep up with the send load even when configured with a single send thread and lots of block threads. The ldm threads usually assist the send threads a bit while not overloaded. If any threads are overloaded, it is usually the ldm threads. The recv threads do not assist the send threads.
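The decision logic can be summarized in a sketch like the one below. The roughly-75% overload threshold comes from the description above; the light-load threshold, the three-level model and the function itself are assumptions made for illustration:

```cpp
// Sketch of the adaptive send-assistance decision. The ~75% overload
// threshold is from the text; the 25% light-load threshold is assumed.
enum class SendAssist {
  All,           // lightly loaded: wake up periodically and assist broadly
  OwnNodesOnly,  // normal load: send only to the nodes it targets itself
  None           // overloaded: offload all sending to the send threads
};

inline SendAssist sendAssistLevel(double cpuLoad) {
  if (cpuLoad > 0.75) return SendAssist::None;
  if (cpuLoad < 0.25) return SendAssist::All;
  return SendAssist::OwnNodesOnly;
}
```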
When a block thread wants assistance from the send threads, it wakes a send thread up. Each connection has a responsible send thread, and each send thread has its own mutex, so that no single mutex becomes a bottleneck.
The send threads have as their sole purpose to send signals. A send thread grabs its send thread mutex to see if any nodes are scheduled for signal sending. If so, it takes a node from the list, releases the mutex and starts sending to that node. If no nodes are available to send to, the send thread goes to sleep.
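The send thread loop just described can be modeled as follows. The structure and names are illustrative; the real thread sleeps on a condition variable instead of returning when the list is empty:

```cpp
#include <deque>
#include <mutex>

// Toy model of a send thread: under its mutex it checks for nodes
// scheduled for sending, takes one, releases the mutex, and performs the
// actual send outside the lock.
struct SendThread {
  std::mutex m;
  std::deque<int> pendingNodes;             // node IDs scheduled for sending
  int sentCount = 0;

  void sendToNode(int /*nodeId*/) { sentCount++; }  // stand-in for real I/O

  void runOnce() {
    for (;;) {
      int node;
      {
        std::lock_guard<std::mutex> guard(m);
        if (pendingNodes.empty()) return;   // real code would go to sleep here
        node = pendingNodes.front();
        pendingNodes.pop_front();
      }
      sendToNode(node);                     // send outside the mutex
    }
  }
};

// Drain three scheduled nodes and report how many sends were performed.
inline int sendDemo() {
  SendThread t;
  t.pendingNodes = {1, 2, 3};
  t.runOnce();
  return t.sentCount;
}
```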
RonDB connection setup protocol#
After a socket has been set up between a data node and an API node (or another data node), the following connection setup protocol is used to ensure that the other side is a RonDB node as well.
First, the client side (the API node, or the data node with the highest node ID) sends an ndbd message and an ndbd passwd message and waits for a response from the server side. The server responds to these messages with an ok message. After this, the client side sends 2 1, assuming 2 is the node ID of the client side; the 1 means that a TCP/IP socket is used for communication. The server side responds with the same kind of message, carrying its own node ID and a 1.
If any side finds a problem with this setup protocol it will close the connection. If both sides are ok with the setup protocol they will now enter into a full duplex mode where both can send signals at full speed. In most cases the receive side of the socket is handled by some receive thread and the send side of the socket is handled by some send thread.
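For illustration, the message contents of this handshake could be built as below. Only the message words (ndbd, ndbd passwd, ok, and the node-ID line) come from the description above; the exact wire framing (terminators, spacing) is not specified here and is intentionally omitted:

```cpp
#include <string>

// Message words used in the connection setup protocol described above.
inline std::string clientHello() { return "ndbd"; }
inline std::string clientAuth()  { return "ndbd passwd"; }
inline std::string serverOk()    { return "ok"; }

// "<nodeId> 1": the node ID followed by a 1 meaning a TCP/IP socket.
inline std::string nodeIdMessage(int nodeId) {
  return std::to_string(nodeId) + " 1";
}
```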
RonDB signal header definition#
When the connection between two nodes has been set up, the only things carried on the socket are signals. These signals are sent using the following format.
The signal always starts with 3 words (a word is 32 bits) that form a header. If the configuration specifies that the connection should use signal IDs, the header is four words instead, where the fourth word is an ID of the signal sent; each signal sent to the other node increments this ID. The signal ID can be used to improve signal logging capabilities. It is not on by default in a cluster.
After the header, the normal signal data comes. This can be up to 25 words. The number of words is specified in the header part.
After the normal signal data, we have the lengths of the optional 3 segments that the signal can carry, each length is a word. The number of segments is specified in the header part. There can be zero segments.
Next comes the segment data, which is always a whole number of words.
After the segment data there is an optional checksum word. By default this is not on, but it can be activated through a configuration parameter in the communication section.
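Putting the layout together, the on-the-wire size of a signal in 32-bit words can be computed as in this sketch (the function name is illustrative):

```cpp
#include <cstdint>

// Size in 32-bit words of a signal on the wire, following the layout
// described above: a 3-word header (4 with signal IDs), up to 25 words of
// signal data, one length word per segment, the segment data itself, and
// an optional checksum word.
inline uint32_t signalWireWords(uint32_t mainWords,     // <= 25
                                uint32_t numSegments,   // 0..3
                                uint32_t segmentWords,  // total segment data
                                bool signalId,
                                bool checksum) {
  uint32_t words = 3 + (signalId ? 1 : 0);  // header
  words += mainWords;                       // normal signal data
  words += numSegments;                     // one length word per segment
  words += segmentWords;                    // segment data
  words += checksum ? 1 : 0;                // optional checksum word
  return words;
}
```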
Header word 1:
Bit 0, 7, 24, 31: Endian bits
Bit 1, 25: Fragmented message indicator
Bit 2: Signal ID used flag
Bit 3: Not used
Bit 4: Checksum used flag
Bit 5-6: Signal priority
Bit 8-23: Total signal size
Bit 26-30: Main signal size (max 25)
Bits 0, 7, 24 and 31 in the first header word are all set if the communication link uses big-endian byte order; otherwise they are 0. Currently, it is not possible to mix little-endian machines with big-endian machines in the cluster.
Fragmented messages are multiple messages that should be delivered as one message to the receiver. In this case, bits 1 and 25 together represent a number between 0 and 3.
0 means there is no fragmentation, or in terms of fragments, this fragment is the first and the last fragment.
1 means it is the first fragment in a train of fragments.
2 means it is not the first fragment, but also isn’t the last fragment.
3 means it isn’t the first, but it is the last fragment.
Header word 2:
Bit 0-19: Signal number (e.g. API_HBREQ)
Bit 20-25: Trace number
Bit 26-27: Number of segments used
The signal number is the ID of the signal sent. This will be used to ensure that the correct function is called in the data node and in the API node.
Trace number is a feature that can be used to trace signals for a specific train of signals. Each signal sent carries the trace number of the signal executed currently. It can be used to follow a special event like a scan operation through the data nodes. It has been used in special implementations, but there is no support for this type of tracing in the current source trees.
The number of segments is used to detect long signals, and also segments in fragmented signals. It gives the number of segment length words that follow the short signal data, and thereby how many segments of data to read at the end of the signal.
Header word 3:
Bit 0-15: Sender block ID
Bit 16-31: Receiver block ID
The receiver block ID is used to direct the signal to the proper Ndb object in the API nodes and to the proper thread and block in the data nodes. The sender block ID is occasionally used by blocks when executing a signal.
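Corresponding pack/unpack helpers for header words 2 and 3, following the listed bit layouts (function names are illustrative):

```cpp
#include <cstdint>

// Header word 2: bits 0-19 signal number, 20-25 trace number,
// 26-27 number of segments.
inline uint32_t packWord2(uint32_t signalNumber, uint32_t traceNumber,
                          uint32_t numSegments) {
  return (signalNumber & 0xFFFFFu)
       | ((traceNumber & 0x3Fu) << 20)
       | ((numSegments & 0x3u) << 26);
}
inline uint32_t signalNumber(uint32_t w2) { return w2 & 0xFFFFFu; }
inline uint32_t traceNumber(uint32_t w2)  { return (w2 >> 20) & 0x3Fu; }
inline uint32_t numSegments(uint32_t w2)  { return (w2 >> 26) & 0x3u; }

// Header word 3: bits 0-15 sender block ID, 16-31 receiver block ID.
inline uint32_t packWord3(uint32_t senderBlock, uint32_t receiverBlock) {
  return (senderBlock & 0xFFFFu) | ((receiverBlock & 0xFFFFu) << 16);
}
inline uint32_t senderBlock(uint32_t w3)   { return w3 & 0xFFFFu; }
inline uint32_t receiverBlock(uint32_t w3) { return w3 >> 16; }
```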