Advanced Thread Configurations#
Quite a few of the products currently using NDB are building real-time applications. Examples are various types of network servers such as DNS servers, DHCP servers, HLR (used in 3G and 4G networks), LDAP servers and financial use cases with high requirements on response times.
To design a real-time service using RonDB, one needs to take special care in setting up the threads and how they are operated in the data node configuration. In RonDB we have automated this to ensure that the optimal performance is always achieved in all situations. This is based on more than 15 years of experience in running NDB Cluster.
However, it is still possible to use the advanced thread configuration through the ThreadConfig configuration variable. This can be useful for research into threading in a key-value store.
Setting up ndbmtd for real-time operations#
Setting up ndbmtd for real-time operations requires us to deal with a set of challenges: we need to control CPU locking, whether any spinning should be done while waiting for input, and the thread priority. In ndbmtd all of those things can be controlled by one complex configuration variable, ThreadConfig. This variable also defines the number of threads of each type that we use in ndbmtd.
The configuration of CPU spinning has been made adaptive. This is handled in the configuration variable SpinMethod. It defaults to LatencyOptimisedSpinning. StaticSpinning means that the spinning is static and based on the configuration variable SchedulerSpinTimer. CostBasedSpinning is the next level, where we spin only if there is CPU performance to gain from it. The default level, LatencyOptimisedSpinning, means we are willing to sacrifice a bit of CPU usage to get the best latency. The highest level, DatabaseMachineSpinning, means that we focus on getting the optimal latency and are ready to sacrifice a lot of CPU usage to achieve this.
Adaptive spinning means that we keep track of whether CPU spinning actually provides any benefit. If it doesn't, we stop spinning until we can see that there is a benefit in spinning on the CPU before going to sleep.
Most OSs provide a real-time scheduling mode. This means that the thread cannot be interrupted, not even by the OS itself and not even by interrupts. RonDB contains protective measures to ensure that we never execute for more than tens of milliseconds without going back to normal scheduler priority. Executing for a prolonged time on real-time priority can completely freeze the OS. Setting this mode is done through the configuration parameter RealtimeScheduler. The value in this variable is the default and can be overridden by values in the ThreadConfig variable.
LockPagesInMainMemory can be set to 0 indicating no locking and it can be set to 1 to indicate that we ask the OS to lock the memory before the memory is allocated. Setting it to 2 means that we first allocate memory and then we ask the OS to lock it in memory thus avoiding any swapping.
The ndbmtd program has one more configuration variable that will impact its real-time operation. This is the SchedulerResponsiveness variable. By default it is set to 5. It can be set between 0 and 10. Setting it higher means that the scheduler will more often sync with other threads and send messages to other nodes. Thus we improve response time at the expense of throughput: higher values mean lower response times and lower values mean higher throughput. Experience shows that this normally has very little impact on both performance and latency.
When a signal is executed it can generate new signals sent to the same thread, to other threads in the same node and to threads in other nodes. SchedulerResponsiveness controls the scheduler parameters that determine how many signals we execute before we perform the actual delivery to other threads, and the maximum number of signals executed before we deliver the signals to other nodes.
When a decision to send to a node has been made, we send to nodes in FIFO order (First In First Out). We always put a bit more focus on sending to other data nodes in the same node group (to ensure that we have low response times on write transactions).
Thread types to configure in ThreadConfig#
ThreadConfig is by far the most complex configuration variable in RonDB.
ThreadConfig makes it possible to configure the following thread types: ldm, query, tc, send, recv, main, rep, io and wd. We will describe each thread type and how to configure it.
ldm is the most important thread type. This is the thread where the local database is located. In order to read any data in NDB one has to access the ldm thread. LDM stands for Local Data Manager. ldm contains a hash index, a T-tree ordered index, the tuple storage, handling of checkpoints, backups and various local trigger functions.
The number of ldm threads is the first thing to decide when designing the configuration.
Query threads can read data stored in ldm threads. However, if the RonDB software can discover the setup of the L3 caches in the server, a query thread will only read data from ldm threads that use the same L3 cache as the query thread. The number of query threads must be a multiple of the number of ldm threads, preferably the same as the number of ldm threads, and the query threads should be locked to the same CPU core as their ldm thread or at least to neighbouring CPU cores.
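The multiple-of rule for query threads can be checked with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and the assumptions that there is at least one ldm thread and that zero query threads counts as valid are ours, not stated by RonDB.

```python
def query_thread_count_is_valid(num_ldm: int, num_query: int) -> bool:
    """Return True if num_query is a whole multiple of num_ldm.

    Assumption of this sketch: at least one ldm thread exists, and zero
    query threads is treated as a valid multiple.
    """
    if num_ldm < 1:
        raise ValueError("this sketch assumes at least one ldm thread")
    return num_query >= 0 and num_query % num_ldm == 0
```

For example, 4 ldm threads with 4 or 8 query threads pass the check, while 4 ldm threads with 6 query threads do not.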
The next thread type to consider is the tc threads. How many tc threads one needs is very much dependent on the query types one uses. For most applications it is sufficient to have around 1 tc thread per 4 ldm threads. However, if the application uses a high number of key operations, the number of tc threads needs to be higher.
As a rule of thumb that will work for most applications, one should have about half as many tc threads as ldm threads. tc threads assist the send threads a lot, so having a few extra tc threads is useful and means that we can have fewer send threads.
tc threads are used for all key operations, and these make heavy use of tc threads. Scan operations make less use of tc threads, and even less for scans that perform much filter processing. tc threads are also used for pushdown joins.
The rule of thumb is to use half as many tc threads as ldm threads. But if your application is consistent in its work profile, it makes sense to perform some kind of benchmark that simulates your application profile and see how much it makes use of the tc threads.
The ndbinfo table cpustat can be used to get CPU statistics about the different threads in your data nodes. The tool ndb_top provides a top-like interface to this information.
recv and send threads#
The next important thread types to consider are the recv threads and send threads. Each recv thread takes care of a subset of the sockets used to communicate with the data node. It is usually a good idea to avoid having the recv threads at overload levels since this will delay waking up the worker threads (mainly ldm and tc threads). Actually this is true for both recv and tc threads, since if these are not fully loaded they will make sure that the ldm threads always stay fully active.
send can work well at a very high load. Usually if the load on the send threads goes way up the traffic patterns will change such that larger packets will be sent which will automatically offload the send.
In addition, in RonDB all worker threads (ldm, query, tc, main and rep) will assist the send threads when the threads themselves aren't overloaded. The data node keeps track of the CPU load status of all the threads in the data node, including the send threads and recv threads.
If a thread is loaded below 30% it will assist the send threads until there is no more data to send before it goes to sleep again. When the load goes up and is between 30% and 75% we will help the send threads at moments when we are ready to go to sleep since the thread has no work to do. At higher loads we will offload all send activity to other threads and not assist any other thread in executing sends.
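The three load bands described above can be summarised as a small decision function. This is a minimal sketch, not RonDB's implementation: the function and mode names are hypothetical, and how the exact boundary values of 30% and 75% are treated is an assumption.

```python
def send_assist_mode(cpu_load_percent: float) -> str:
    """Classify how a worker thread assists the send threads at a given load.

    Assumption: exactly 30% and exactly 75% fall into the middle band.
    """
    if cpu_load_percent < 30:
        # Lightly loaded: keep sending until there is no more data to send.
        return "assist_until_no_data"
    if cpu_load_percent <= 75:
        # Moderately loaded: only help when about to go to sleep.
        return "assist_before_sleep"
    # Heavily loaded: leave all send activity to other threads.
    return "no_assist"
```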
A rule of thumb is to use 25% of the number of ldm threads both for receive and send threads.
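Taken together, the rules of thumb in this section give a starting point for sizing the thread types from the number of ldm threads. The sketch below is only an illustration of that arithmetic: the function name is hypothetical, and rounding up with a minimum of one thread per type is our assumption.

```python
import math


def suggest_thread_counts(num_ldm: int) -> dict:
    """Apply the rules of thumb: query = ldm, tc = half of ldm, recv = send = 25% of ldm.

    Assumption: fractions are rounded up and every type gets at least one thread.
    """
    if num_ldm < 1:
        raise ValueError("this sketch assumes at least one ldm thread")
    quarter = max(1, math.ceil(num_ldm * 0.25))
    return {
        "ldm": num_ldm,
        "query": num_ldm,
        "tc": max(1, math.ceil(num_ldm / 2)),
        "recv": quarter,
        "send": quarter,
    }
```

With 8 ldm threads this suggests 8 query, 4 tc, 2 recv and 2 send threads. A benchmark that simulates the application profile remains the better guide when the workload is consistent.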
If the application uses the asynchronous parts of the API, the recv threads can be quite loaded due to the high number of signals that need to be put in internal data structures.
main and rep threads#
main and rep threads are simple, there will always be one of each in an ndbmtd process. In smaller configurations one could skip one or both of those. It is even possible to run with a single thread which in that case has to be a recv thread.
The main thread is used mainly for schema transactions and various background activities like heartbeating. It is used quite heavily from time to time in the startup phases. In MySQL Cluster 7.4 it was heavily used in scan processing and could even, in exceptional cases, be a bottleneck where scan processing was very significant. The scan processing in RonDB still uses data in the main thread, but it accesses it directly from the tc thread using RCU principles and mutex protected access. In large clusters the main thread could get a lot of work to handle at certain states of local checkpoint processing, especially when the number of small tables is high.
The rep thread takes care of Global Replication: it delivers information about all updates in RonDB to the MySQL servers in the cluster configuration that are responsible for handling MySQL replication from one cluster to another cluster or to any other MySQL storage engine. It also handles some other activity related to multithreading and file activity. If there is a lot of updating and one uses Global Replication, then this thread can be heavily used.
Those threads are also heavily involved when we reorganize a table to a new table partitioning e.g. after adding a new set of nodes.
io threads can be configured, although the count is ignored here. The actual number of IO threads is runtime dependent and can be controlled by the configuration variables DiskIoThreadPool, InitialNoOfOpenFiles and MaxNoOfOpenFiles. However, as mentioned in the documentation, it is more or less considered a bug if one has to change the *NoOfOpenFiles variables. The interesting thing to control for io threads is their priority and the number of CPUs they are bound to.
Normally the io threads use very little CPU time. Most of it is simply starting up OS kernel activity, and this can just as well happen on any other kernel thread defined by the OS configuration. There is a clear case for controlling io threads when one has set CompressedLCP and/or CompressedBackup. Given that both local checkpoints and backups send a lot of data to disk, it makes a lot of sense to compress it if CPU capacity exists and disk storage is a scarce resource. It is likely that using this feature will compress the LCP and backup files to half their original size or even less. The compression uses the zlib library to compress the data.
As mentioned it comes at the expense of using a fair amount of CPU capacity both to write the data to disk as well as to restore it from disk in a restart situation.
If one decides to use this feature it becomes important to control the placement of io threads on the CPUs of your servers.
One can control the wd threads. These are the watchdog thread and a few other background threads used for connection setup and other background activities. Normally it is not essential to handle these threads in any special manner. However, if you run in a system with high CPU usage it can sometimes be useful to control CPU placement and thread priority of these threads. Obviously it isn't good if the watchdog is left completely without CPU capacity, which can potentially happen in a seriously overloaded server.
Thread settings in ThreadConfig#
For each thread there are 3 things that one can control.
Which CPUs to lock threads to
Which priority should threads execute at
Should the thread assist the send thread
Controlling CPU placement can be done using 4 different configuration variants: cpubind, cpubind_exclusive, cpuset and cpuset_exclusive. The exclusive variants perform the same action as the variants without exclusive, except that they ensure that no other thread from any other program can use these CPUs. We get exclusive access to those CPUs; even the OS does not get access to them. Although a nice feature, it can unfortunately only be used on Solaris, which is the only OS that has an API to control CPU placement at this level. We can use cpubind and cpuset on Linux and FreeBSD. None of the settings can be used on Mac OS X since Mac OS X has no API to control CPU placement.
When a thread or a set of threads is bound using cpubind, it means that each thread is bound to exactly one CPU. If several threads are in the group, they are assigned to CPUs in order. In this case it is necessary that the number of CPUs is at least the same as the number of threads in the group (io threads are a bit special since they perform round-robin on the CPUs when they are started if several CPUs exist in the thread group).
Two thread groups can use the same set of CPUs in cpubind. This can be useful to have several light-weight threads not having a dedicated CPU resource. E.g. the main, rep, io and wd threads can share one CPU in many configurations.
The number of CPUs should be equal to the number of threads and cannot be smaller than the number of threads.
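The in-order assignment performed by cpubind can be pictured as pairing each thread in a group with the next CPU in the list. The following is a minimal sketch under that reading; the function name and the thread labels are hypothetical, and the special round-robin handling of io threads is not modelled.

```python
def assign_cpubind(thread_names, cpus):
    """Bind each thread to exactly one CPU, in the order the CPUs are listed.

    Raises ValueError if the group has more threads than CPUs, mirroring the
    requirement that the CPU count cannot be smaller than the thread count.
    """
    if len(cpus) < len(thread_names):
        raise ValueError("cpubind needs at least one CPU per thread in the group")
    # zip stops at the shorter sequence, so any surplus CPUs stay unused here.
    return dict(zip(thread_names, cpus))
```

For example, binding the hypothetical threads `ldm_0` and `ldm_1` to the CPU list `[2, 3]` places `ldm_0` on CPU 2 and `ldm_1` on CPU 3.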
The cpuset means that each thread is allowed to be on any of the CPUs in the CPU set. Here the number of CPUs in the CPU set doesn't have to be the same as the number of threads although it is usually a good idea to not oversubscribe your HW since it will have a bad effect on response time and throughput of MySQL Cluster.
One can define several thread groups to use the same CPU set, but it has to be exactly the same CPU set, one cannot define overlapping CPU sets in the configuration.
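The rule that CPU sets must be either identical or completely separate can be checked before writing a configuration. This is an illustrative sketch only: the helper name is hypothetical and it is not part of any RonDB tool.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def find_conflicting_cpusets(groups: Dict[str, Set[int]]) -> List[Tuple[str, str]]:
    """Return pairs of thread groups whose CPU sets overlap without being identical."""
    conflicts = []
    for (name_a, set_a), (name_b, set_b) in combinations(groups.items(), 2):
        # Sharing is fine only when both groups use exactly the same CPU set.
        if set_a != set_b and set_a & set_b:
            conflicts.append((name_a, name_b))
    return conflicts
```

Here `main` and `rep` sharing CPUs `{0, 1}` would be accepted, while an `io` group on `{1, 2}` would be reported as conflicting with both of them.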
Thread Priority settings (thread_prio)#
We have the ability to control the thread priority for the various thread types. This is done by using the thread_prio parameter. We can set it to any value between 0 and 10. Higher value means higher priority and the normal priority is 5.
On Linux and FreeBSD this parameter sets the nice value of the thread going from -20 for 10 down to 19 for 0. This uses the setpriority system call. Setting it to 10 gives a nice level of -20, 9 gives -16, 8 gives -12, 7 gives -8, 6 gives -4, 5 gives 0, 4 gives +4, 3 gives +8, 2 gives +12, 1 gives +16 and 0 gives +19.
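The mapping listed above follows a simple pattern: each step away from 5 changes the nice value by 4, with the lowest priority capped at +19. A small sketch of that arithmetic (the function name is hypothetical; RonDB's own code may compute it differently):

```python
def thread_prio_to_nice(thread_prio: int) -> int:
    """Map a thread_prio value (0..10) to the nice value listed in the text."""
    if not 0 <= thread_prio <= 10:
        raise ValueError("thread_prio must be between 0 and 10")
    # (5 - prio) * 4 gives -20..+20; nice values stop at +19, so cap the top end.
    return min(19, (5 - thread_prio) * 4)
```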
Mac OS X provides no APIs to control thread priority, so the thread_prio parameter is ignored on Mac OS X.
Most OSs support some method of setting real-time priority of threads. On Linux we can use the call to sched_setscheduler to set real-time priority. On Mac OS X and FreeBSD, we can use the pthread_setschedparam call.
These calls are very dangerous: they can even put the entire OS into a freeze state. We provide a way to avoid those freezes by downgrading from real-time priority to a normal priority level if we execute for more than 50 milliseconds without going to sleep. This means that using real-time priority is most effective when we don't have massive loads to handle but rather want very quick and very predictable response times.
There are two ways to configure real-time settings. The first is to set the configuration variable RealtimeScheduler. It sets the default value for all threads. This default value can be overridden by the settings in the ThreadConfig parameter by setting realtime to 0 to not use real-time priority and setting it to 1 to enable real-time priority.
We can set nosend=1 on a thread group indicating that these threads should not assist the send thread in any way.
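To show how the per-thread settings fit together, the sketch below renders a ThreadConfig line from a list of thread groups. It assumes the `type={key=value,...}` syntax used for ThreadConfig in NDB configuration files; the helper names are hypothetical, and the thread counts and CPU numbers are purely illustrative, not a recommended setup.

```python
def render_thread_spec(thread_type, params):
    """Render one thread group, e.g. tc={count=2,cpubind=5,6,nosend=1}."""
    body = ",".join(f"{key}={value}" for key, value in params.items())
    return f"{thread_type}={{{body}}}"


def render_thread_config(specs):
    """Join all thread groups into a single ThreadConfig line."""
    return "ThreadConfig=" + ",".join(
        render_thread_spec(thread_type, params) for thread_type, params in specs
    )


# Illustrative values only: counts and CPU ids are assumptions for this sketch.
example = render_thread_config([
    ("ldm", {"count": 4, "cpubind": "1,2,3,4", "thread_prio": 9}),
    ("tc", {"count": 2, "cpubind": "5,6", "nosend": 1}),
    ("recv", {"count": 1, "cpubind": "7", "realtime": 1}),
])
```

In this sketch the tc group opts out of send assistance with nosend=1, the recv group overrides the RealtimeScheduler default with realtime=1, and the ldm group runs at a raised thread_prio.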
Special configurations in RonDB 22.01#
In RonDB 22.01 we introduced the option to move more functionality into the recv threads. This means that we can set up a configuration where we only use recv threads. These threads will then execute in all roles, even the sending if required. We can also have a setup where we have ldm, query and recv threads but no tc threads, which means that the transaction handling will be handled by the recv threads. This is currently a feature used to figure out if this could be useful also for an automated thread config.
See the blog here and here to learn more about those experiments.
The blog on automatic thread configuration provides a lot of details on the background of automatic thread configuration and more details on the background for thread configuration.