API Node Architecture#

In a previous chapter we went through the NDB API, the C++ API used to access RonDB. In this chapter we look at the implementation of this API, in particular at how we map the block and signal architecture onto a synchronous API. The NDB API also has an asynchronous part that fits very well with the block and signal model.

The C++ NDB API is a fairly normal object-oriented library. It uses blocks and signals to route messages to the appropriate thread and software unit in the data nodes.

The original NDB API used a structure where we had one mutex that protected the entire send and receive part of the NDB API. When we fixed this in MySQL NDB Cluster 7.3 we had a number of choices.

We needed to separate the send and receive logic and ensure that they could execute without interfering with each other. We also had to decide where to execute the actual signals. The destination of a signal is an Ndb object or some other object linked to an Ndb object. One possibility is to let user threads execute the signals.

We decided to let the signals be executed by the receive logic, because this gives better latency. The alternative approach, where user threads execute the signals, would increase the latency of the NDB API. On the other hand it scales better, so this is a trade-off.

Thus the current NDB API has two main scalability limitations. The first is that all arriving signals have to be received and executed by a single thread, which handles both receiving on the socket and executing the signals.

The second limitation is that one API node uses one TCP/IP socket. A socket has a fair amount of state that only one or two CPUs at a time can work on. Thus a single socket is limited in the number of packets it can receive per second and in the bandwidth it can sustain.

The solution to both problems is to use multiple API nodes in one program. For example, the MySQL Server can define any number of cluster connections that work independently of each other.

In the managed version of RonDB each MySQL Server is assigned 4 API nodes, which ensures that it scales to at least 32 VCPUs.
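As a minimal sketch of how a program might spread its user threads over several cluster connections, the following hands out connections round-robin. The names `FakeClusterConnection` and `ClusterConnectionPool` are hypothetical and not part of the NDB API; in a real program each slot would hold an `Ndb_cluster_connection`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one cluster connection (one API node, one
// socket, one receive thread). Not an NDB API type.
struct FakeClusterConnection {
  int slot;  // position in the pool, used only for illustration
};

// Hypothetical pool that hands out cluster connections round-robin.
class ClusterConnectionPool {
 public:
  explicit ClusterConnectionPool(std::size_t num_connections) {
    assert(num_connections > 0);
    for (std::size_t i = 0; i < num_connections; i++) {
      connections_.push_back(FakeClusterConnection{static_cast<int>(i)});
    }
  }

  // Pick the next connection. The atomic counter makes this safe to call
  // from many user threads at once.
  FakeClusterConnection& next() {
    std::size_t ticket = next_.fetch_add(1, std::memory_order_relaxed);
    return connections_[ticket % connections_.size()];
  }

  std::size_t size() const { return connections_.size(); }

 private:
  std::vector<FakeClusterConnection> connections_;
  std::atomic<std::size_t> next_{0};
};
```

With four connections, as in the managed RonDB setup, consecutive calls to `next()` cycle through slots 0, 1, 2, 3 and then start over, so no single socket or receive thread has to carry all traffic.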

Cluster connection#

A cluster connection maintains one API node and, as already mentioned, there can be multiple API nodes in one program.

One cluster connection contains all the structures required to communicate with each data node in the cluster and each management server.

Each cluster connection has one receive thread that handles signals arriving at the API node. It has one send thread that can take over send handling when the socket is overloaded at the time the user thread tries to send. Finally, it has a thread that handles cluster management activities, such as heartbeats.

User threads#

User threads are not under our control in most cases. The MySQL Server is an exception, where one thread is created per connection to the MySQL Server (except when the thread pool is used). These threads execute all the normal NDB API calls, and we wake them up when we have finished executing all the signals needed to complete those calls. User threads handle most of the send and receive work at low load. The higher the load becomes, the more the receive thread will jump in to assist in executing the signals received.

NDB API send threads#

Send threads in the NDB API only send when the socket cannot keep up with the amount of signals we attempt to send. Normally the send threads are sleeping, but under high load they can be quite busy sending the signals they were assigned to handle.

NDB API receive threads#

The receive thread is the heart of the NDB API implementation. Receive handling is a responsibility that is protected by a mutex. Any user thread can take on this responsibility if no other thread has already grabbed it. This improves latency in single-threaded use cases.
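The idea that receive handling is a responsibility any thread can grab can be sketched as follows. This is a simplified illustration, not the actual NDB API code; the class name `PollOwnership` and its methods are hypothetical.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical sketch: at most one thread at a time owns the receive
// responsibility for a cluster connection.
class PollOwnership {
 public:
  // Returns true if the calling thread became the receiver. Returns false
  // if another thread already holds the responsibility; the caller would
  // then sleep until the receiver wakes it up.
  bool try_become_receiver() {
    std::lock_guard<std::mutex> guard(mutex_);
    if (has_receiver_) return false;
    has_receiver_ = true;
    return true;
  }

  // Called by the current receiver when it has what it needs, so that
  // another thread can take over.
  void release() {
    std::lock_guard<std::mutex> guard(mutex_);
    assert(has_receiver_);
    has_receiver_ = false;
  }

 private:
  std::mutex mutex_;           // protects has_receiver_
  bool has_receiver_ = false;  // true while some thread is receiving
};
```

In the single-threaded case the one user thread always wins `try_become_receiver()`, receives its own replies directly and never needs to be woken by another thread, which is where the latency gain comes from.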

When many user threads are active at the same time, the receive thread becomes active. The threshold for this is set by the MySQL Server variable --ndb_recv_thread_activation_threshold, which defaults to 8. One problem with the receive thread is that it uses much more CPU than the other MySQL Server threads.

This means that the normal Linux scheduler will give it a lower priority compared to the rest of the threads. This is not beneficial to the other threads using this cluster connection since it will delay them getting woken up to serve the replies from the data nodes.

To ensure that the receive thread gets a higher priority we set the nice level of the receive thread to -20 if possible. As mentioned in the chapter on Installing RonDB, in the section on adding a new mysql user, this requires that the user is allowed to set the highest-priority nice level. For this the mysql user must have the CAP_SYS_NICE capability; that chapter shows how to set it.

Using a receive thread that is locked to a CPU and that is activated as soon as more than one user thread is active gives the best latency with the NDB API. But it requires that the mysql user can raise the priority through the nice level and that CPUs can be locked in a safe way without interfering with user threads or other processes.

The default manner, where the user threads take care of everything, has slightly worse latency, but it still scales very nicely.
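The activation threshold described above can be sketched as a simple decision function. The names `Poller` and `choose_poller` are hypothetical, and the inclusive comparison is an assumption; the text only states that the receive thread becomes active when many user threads are active at once.

```cpp
// Default of --ndb_recv_thread_activation_threshold, as stated in the text.
constexpr unsigned kDefaultActivationThreshold = 8;

// Who polls the socket of a cluster connection.
enum class Poller { UserThread, ReceiveThread };

// Assumption: the receive thread takes over once the number of concurrently
// active user threads reaches the threshold (inclusive comparison).
inline Poller choose_poller(unsigned active_user_threads,
                            unsigned threshold = kDefaultActivationThreshold) {
  return active_user_threads >= threshold ? Poller::ReceiveThread
                                          : Poller::UserThread;
}
```

Lowering the threshold makes the receive thread take over earlier, which is the latency-oriented setup with a CPU-locked receive thread. With the default of 8, user threads keep doing the receive work until the connection is fairly busy.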

NDB API cluster manager threads#

There is a thread that takes care of heartbeats and of registering as a new node with the data nodes. This thread wakes up every 100 milliseconds and sends a heartbeat signal if needed.
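The "wake up every 100 milliseconds and send if needed" pattern can be sketched as below. The class `HeartbeatTimer` is hypothetical, and so is the heartbeat interval passed to it; the text does not say how "if needed" is decided, so this sketch assumes a plain elapsed-time check.

```cpp
#include <chrono>

// Hypothetical sketch of the bookkeeping done on each 100 ms wakeup.
class HeartbeatTimer {
 public:
  using ms = std::chrono::milliseconds;

  // heartbeat_interval is an assumed configuration value.
  explicit HeartbeatTimer(ms heartbeat_interval)
      : interval_(heartbeat_interval), since_last_(0) {}

  // Called on every wakeup of the cluster manager thread. Returns true
  // when enough time has passed that a heartbeat should be sent now.
  bool on_wakeup(ms elapsed = ms(100)) {
    since_last_ += elapsed;
    if (since_last_ < interval_) return false;
    since_last_ = ms(0);  // heartbeat sent, start counting again
    return true;
  }

 private:
  ms interval_;    // time between heartbeats
  ms since_last_;  // time since the last heartbeat was sent
};
```

With an assumed interval of 300 ms the thread would wake three times per heartbeat and send on every third wakeup.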

NDB API wakeup thread#

In some situations the NDB API is bottlenecked by the need to wake up threads receiving data. It takes a few microseconds to wake a thread from sleep. To ensure that the current receiving thread isn’t consumed by wakeup activity it can offload the wakeup processing to a specialised wakeup thread, thus waking a single thread to wake possibly hundreds of user threads.
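The offloading described above can be sketched as a hand-off queue between the receiving thread and the wakeup thread. The class `WakeupQueue` and the use of integer thread ids are hypothetical simplifications; a real implementation would signal a condition variable per user thread.

```cpp
#include <mutex>
#include <vector>

// Hypothetical hand-off queue between the receiver and the wakeup thread.
class WakeupQueue {
 public:
  // Called by the receiving thread. Cheap: it only records which user
  // thread needs waking instead of performing the wakeup itself.
  void defer(int user_thread_id) {
    std::lock_guard<std::mutex> guard(mutex_);
    pending_.push_back(user_thread_id);
  }

  // Called by the wakeup thread once it has been woken. It takes the whole
  // batch in one step and then wakes each user thread outside the lock.
  std::vector<int> take_batch() {
    std::vector<int> batch;
    std::lock_guard<std::mutex> guard(mutex_);
    batch.swap(pending_);
    return batch;
  }

 private:
  std::mutex mutex_;          // protects pending_
  std::vector<int> pending_;  // user threads waiting to be woken
};
```

The receiver pays for one wakeup (of the wakeup thread) however many user threads are in the batch, so the few microseconds per wakeup are spent off the receive path.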

Blocks in API nodes#

A block in the API node is simply an Ndb object. A reference to a block refers to a pointer in an array, and that pointer in turn points to an Ndb object. Each Ndb object is handled by one thread at a time, so it is easy to handle signals to blocks in the NDB API.
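The mapping from block to Ndb object can be sketched as an array of pointers indexed by block number. `FakeNdb` and `BlockTable` are hypothetical stand-ins and not the real NDB API classes; a signal is reduced to a single integer for brevity.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an Ndb object that can execute signals.
struct FakeNdb {
  int signals_executed = 0;
  int last_signal = 0;
  void execute_signal(int signal_data) {
    last_signal = signal_data;
    ++signals_executed;
  }
};

// Hypothetical block table: the block number is an index into an array of
// pointers, and each pointer refers to one Ndb object.
class BlockTable {
 public:
  // Register an object and return the block number assigned to it.
  std::size_t register_block(FakeNdb* ndb) {
    blocks_.push_back(ndb);
    return blocks_.size() - 1;
  }

  // Route a signal to the object behind block_no. Returns false if the
  // block number does not refer to a registered object.
  bool deliver(std::size_t block_no, int signal_data) {
    if (block_no >= blocks_.size() || blocks_[block_no] == nullptr) {
      return false;
    }
    blocks_[block_no]->execute_signal(signal_data);
    return true;
  }

 private:
  std::vector<FakeNdb*> blocks_;
};
```

Because each object is handled by one thread at a time, `execute_signal` in this sketch needs no locking of its own.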