Virtual Machine in Data Nodes#

In this chapter we will discuss a few more things regarding the virtual machines and its threads in ndbmtd.

Each execution module is placed into one thread. It could be a thread that only handles one execution module, but we could also have threads that handles many execution modules. One execution module cannot be split up in different threads, but multiple execution modules can be joined together in a thread.

In the presentation below we will describe things as if each execution module had its own thread.

Thread types in ndbmtd#

ldm threads#

The ldm thread cannot communicate with any other ldm thread in the same node, it can communicate with all tc threads and with the main thread. It can communicate with itself as can every other thread. The first ldm thread was a bit special in earlier version when backup processing wasn’t parallelised. This is no longer the case.

The ldm threads have an important characteristic regarding its CPU usage. It is extremely efficient in using the CPUs. The instructions per cycle (IPC) can be as high as 1.5 when executing a standard Sysbench OLTP benchmark and even above 2 when executing only range scans that performs filtering of rows. This is one of the reasons why it doesn’t pay off so much to use two ldm threads per CPU core. There isn’t enough execution units in a CPU core to keep two threads moving in parallel. The thread has a branch prediction miss rate of less than 2% and also L1 cache misses are only about 2% of the accesses. Misses in the last level cache is however about 15%. This is where the actual accesses to the data parts come into play.

The ldm thread can benefit much if they stay at the same CPU all the time and their CPU caches are not influenced by other threads executed in the machine.

This is why CPU locking is an important method to improve performance of RonDB data nodes.

query threads#

Query threads execute key lookup and scan queries using the READ COMMITTED mode. This means that several threads can access the same data concurrently. The aim is that query threads execute on the same CPU core as a ldm thread.

With the introduction of query threads it was possible to decrease the number of table partitions. This means that in a large data node, not all ldm threads are used to a full extent. This provides the possibility for the query in this CPU core to be used to a greater extent, thus providing the capability to distribute load evenly.

tc threads#

tc threads communicate with all other threads. tc threads have very different characteristics compared to ldm threads. The signals executed are very short and there is a lot of those small messages. So the tc threads are up and running for a short time and goes to sleep for a very short time.

tc threads benefit a lot from hyperthreading, we get 40% extra performance for each CPU core by using hyperthreading. The IPC for tc is also a lot lower.

tc threads play an important role in handling sends to other nodes, both when using send threads and without specific send threads. Thus you can see a lot of CPU usage by the OS kernel from this thread. The send assistance by tc threads is important to decrease the response time for requests towards RonDB.

Having too many tc threads can easily lower total performance of the node. Most likely that we get to send too much small packets to the API node, this will have a negative impact on API node performance.

main threads#

main threads are normally not very busy at all. They are mainly used for file system signals during normal operation. They can be fairly busy during ALTER TABLE commands that reorganize data, build indexes and build foreign keys. Most of the CPU resources for this thread type can be colocated with the rep thread, but in larger configurations it can be useful to separate those threads on their own CPUs.

rep threads#

The rep thread is used to send event data to the NDB APIs that have subscribed to change events. It is also used for ALTER TABLE operations that build indexes, foreign keys and reorganise tables.

Other functions should have very small impact on CPU usage.

The main thing to consider for the main and rep thread is that they normally will use very little CPU resources, but at times they can spike when running a heavy ALTER TABLE operation.

Both main and rep threads can communicate with all other thread types.

recv threads#

recv threads benefit greatly from hyperthreading. One important thing to consider for recv threads is that they benefit from not using a full CPU. Locking recv threads to a CPU, one should strive to not use the CPU to more than 60%. The reason is that it is very important for the recv thread to quickly wake up and take care of any actions and similarly to ensure that other threads can quickly wake up to to handle the received signals.

If the recv thread uses more CPU than 60%, it is a good idea to add more recv threads. The ndb_top tool is a good tool to check how much CPU a thread uses if the threads are not locked to their own CPUs, in this case the top tool can be used as well.

send threads#

There are many other thread types that can assist send threads. Using a single send thread is mostly ok and having more than one might even have negative performance impact.

Still there are cases when more send threads are needed. Often performance can increase a bit by using all the way up to 4 send threads. HopsFS is a good example of an application that requires much CPU resources for sending.

send threads benefit from hyperthreading.

io threads#

When using RonDB as an in-memory engine without compressing backups and local checkpoints, the CPU usage in io threads is very small. Adding compression to backups and/or local checkpoints using the CompressedBackup and CompressedLCP configuration parameter quickly increases the CPU usage in the io threads significantly. Similarly using disk data means that io threads gets a lot more work.

The io threads can use many CPUs concurrently.

Thread setup#

Automatic Thread Configuration#

For almost all users we recommend setting the configuration parameter AutomaticThreadConfig to 1. This means that RonDB will automatically calculate the number of threads of various types.

ndbmtd will use OS calls to figure out how many CPUs that are available to the RonDB data node. By default this is all CPUs in the server/VM. However if ndbmtd was started using e.g. numactl one can limit the number of CPUs that are accessible to ndbmtd. ndbmtd will respect this limitation and not use all the CPUs in the server/VM.

Another manner is to set the configuration parameter NumCPUs, this will limit the number of CPUs to the number set in this parameter. However no CPU locking will occur if this is set since it is assumed that ndbmtd is executing in a shared environment with other programs on the same server/VM. This is heavily used for RonDB testing.

Thread setups using ThreadConfig#

Actually the ldm, query, tc, main, rep, send are all optional threads. At least one recv is always around. io are always there as well.

There are a few common possibilities to setup a RonDB data node.

Data node up to 4 CPUs#

In a small data node there is no room for send threads, main, query and rep threads.

In the very smallest data nodes it is also a good idea to combine the tc into the recv thread. Support for this was added in RonDB 21.10.0.

In earlier versions of RonDB it was possible to use a recv thread and a main thread that combined ldm, tc, main functionality. However most of the processing happens in the ldm, thus it is a good to have a separate ldm thread.

Already at 4 CPUs it becomes a good idea to add a tc thread, thus using 1 recv, 1 tc and 2 ldm threads.

Alternative setups#

RonDB data nodes also support running all threads in the recv threads. However comparing 2 recv threads with a setup using 1 recv thread and 1 ldm shows that the separated thread setup is more efficient in most cases. The only case when the use of 2 recv threads wins is when the recv and tc functionality uses very small amounts of CPU. In this case the 2 recv thread setup simply wins because it has a balanced use of CPUs although the use is still more inefficient.

Communication between threads in ndbmtd#

Communication between threads uses highly optimised code paths in RonDB. We use lock-free communication where each communication buffer is only used for communication between two threads. This means that we can use a single-reader and single-writer optimisations that avoid using any locks when communicating, only memory reads and writes combined with memory barrier operations is sufficient. This part of the code is the main portability hurdle for RonDB. This code is currently working on x86 servers and SPARC servers. Some attempts have been made to make it work on POWER servers, but it has not been completed. It currently works also on ARM servers although it isn’t tested at the same level as x86 is tested.

To wake up other threads we use futexes on Linux that are combined with special x86 instructions. On other OSs and other HW we use normal mutexes and condition variables to wake up other threads.

In a configuration with many threads in the data node, there will be significant amount of memory allocated to the communication between threads.

In RonDB we change to using a mutex on each job buffer when going beyond 64 threads. With more threads the cost of scanning all memory buffers becomes higher than the cost of using a mutex.

Scheduler in ndbmtd#

Each block thread (ldm, tc, main, rep and recv threads) have a scheduler that executes signals. It has settings that can be changed through the configuration parameters SchedulerResponsiveness. These settings define how often we will flush signals to other threads and nodes. Flushing often have a negative impact on throughput, but can have a positive impact on latency. Flushing seldomly means that latency increases and throughput can increase. The scheduler supports delayed signals through a special time queue.

What one always find when running various benchmarks is that RonDB as a distributed system is a highly coupled system. So to get best performance and latency one has to find the right balance of resources.

The MaxSendDelay is a parameter that works well in very large clusters where we need to constrain the nodes from sending to often. Sending too often will use too much resources in other nodes. To get good performance it is important that a node isn’t running so fast that it sends smaller packets that gives other nodes more work to do. This will have a negative impact on system performance.

In RonDB the MaxSendDelay configuration parameter have been removed, the functionality is now automated, thus the above delays on sending will happen automatically as the cluster load goes up.

Single-threaded Data Nodes, ndbd#

Single-threaded Data nodes is no longer supported by RonDB and isn’t even present in binary releases of RonDB. If compiled we have inserted asserts to ensure that it will not works. All the functionality of ndbd can be achieved using ndbmtd using a single-threaded configuration with a single recv thread.

The ndbd was the original data node program that was developed for NDB. In those days a large server had two CPUs and about 1 GByte of memory. Thus machines of today have a lot more CPUs. This means that a multithreaded approach was required.