To get the optimal real-time experience it is necessary to set up the operating system infrastructure to ensure that MySQL Cluster continues to deliver under all circumstances.
The challenge comes primarily from two sources: the interrupt processing of the Linux network stack, and the handling of CPU resources used to run the operating system itself, which includes things such as error logging, handling IO interrupts, and various background maintenance activities. In addition we can be challenged by other applications that we install and run ourselves on the servers used to run MySQL Cluster nodes.
One example from early NDB history was a user who ran a nightly backup of his machine. This backup process consumed most of the CPU and memory resources of the machine and caused the NDB data node process to stop every night at around 2am. It is important to confine such processes so that they cannot consume all the CPU and memory resources the machine needs and thus endanger the health of the MySQL Cluster installation.
Running a MySQL Cluster installation means that there will be a lot of network traffic. Thus we need to understand how the OS handles interrupts from network devices and how it executes the network stack to ensure that it doesn't conflict with the NDB processing.
In addition the OS can contain many other processes that want access to CPU resources. These could be application processes, operating system activities such as logging, or various management processes.
Setting up this infrastructure is very dependent on the application and the environment MySQL Cluster is running in.
We will describe ways to control placement of various important CPU activities in different OSs.
We will start by describing the Linux infrastructure. Linux is a very capable OS, but configuring Linux for the best real-time experience is not so straightforward. We will describe a few different methods used to control the use of CPU resources in Linux. We will start with the network interrupt handling; next we will look at an option to isolate CPUs from being used for anything other than the processes that bind themselves to these CPUs. The difficulty of controlling CPU resource usage in Linux is one of the reasons why frameworks like Docker have been so successful: they essentially make the cgroup infrastructure easy for applications to use.
The most important thing to understand is how Linux handles network interrupts. Execution starts with a hardware interrupt. There are two common defaults for interrupt handling: one is that all interrupts run on one CPU, the other is that HW interrupts are spread over all CPUs. In both cases we want to ensure that interrupt processing isn't happening on any CPUs where we have significant CPU processing.
The ldm thread is normally the most important thread type to protect; it is usually a good idea to ensure that ldm is the bottleneck for the NDB data nodes. Sometimes the tc thread is a bottleneck. The main thread should never be a bottleneck, and rep threads are seldom a bottleneck, but in cases with a lot of NDB Event API handling it could happen (this API is used for MySQL Cluster Replication among other things, but can also be used for other application-defined services). io threads are normally not a bottleneck, but at the same time it is important to protect these threads from massive overloads, since if they are blocked from executing, the checkpointing of MySQL Cluster is not going to work properly.
Linux receive interrupt handling#
In Linux the first step is to control the HW interrupt for network receive. Some devices only support one interrupt model where they wake up a specific CPU. More commonly, a modern network device supports many interrupt models. Most modern network device drivers employ a technique called Receive Side Scaling (RSS). This means that the network card calculates a hash function based on local address, remote address, local port number and remote port number. Some devices support a configurable hash function and some devices support hash functions with fewer fields as input. The network device places each packet into a receive queue; there are several such queues and each is assigned to at least one CPU. This CPU is the one receiving the HW interrupt when a packet arrives.
Modern Linux systems use the NAPI (New API) mechanism. When a HW interrupt arrives, the NAPI scheduler issues a soft IRQ interrupt; once it has issued this soft IRQ it disables HW interrupts so that no more HW interrupts arrive until interrupts are enabled again. When the soft IRQ has executed to completion for the receive queue, it checks the receive queue for more messages that have arrived. If no more messages have arrived it enables interrupts again. This simple algorithm means that in a system with low load the interrupts ensure low-latency network operations. At high load, when interrupts arrive faster than they are processed, packets are quickly placed in the queue and a poll-based mechanism is used instead, which provides much higher throughput for a very busy networking infrastructure.
Receive Side Scaling (RSS)#
RSS is configured by default; the number of receive queues has a default setting which is often equal to the number of CPUs in the system. It is unlikely that we want to use all CPUs for interrupt processing, and it hasn't been shown to be of any value to have more receive queues than the number of CPU cores in the machine. In addition we want to protect some of our cores from interrupt processing, so we would benefit from an even smaller number of receive queues. Setting the number of receive queues is done as part of configuring the network device driver. Check the device driver documentation and set the number of receive queues appropriately.
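On many drivers the number of queues can be inspected and changed with ethtool. The sketch below assumes a device named eth0 and a driver that exposes "combined" channels; some drivers expose separate "rx" channels instead, so check your driver's documentation:

```shell
# Show the current and maximum number of queues (channels) for eth0
ethtool -l eth0

# Set the number of combined receive/transmit queues to 8
ethtool -L eth0 combined 8

# Inspect the RSS indirection table and hash key
ethtool -x eth0
```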
Even if the default setting for the number of receive queues is used, we can still ensure that the receive queues are handled by CPUs that don't disturb the NDB threads we want to protect from network processing.
The hardware interrupts for each of those receive queues are handled by one interrupt number per receive queue. Which CPU(s) this interrupt number will wake is controlled by the /proc/irq/70/smp_affinity file, where 70 should be replaced by the interrupt number. The interrupt numbers for the receive queues can be found by looking into the file /proc/interrupts. If we have a machine using eth0 as the ethernet device, we should search all lines in this file for the eth0 key. Each such line starts with the interrupt number followed by a colon.
Based on this information we use the following command to set the CPUs that are allowed to execute the HW interrupt for a certain receive queue, assuming that the interrupt number for the receive queue is 70 and that we want to use CPUs 4-11 for handling networking interrupts. This CPU mask defines both where the HW interrupt is executed and where the soft IRQs are handled.
echo ff0 > /proc/irq/70/smp_affinity
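The affinity value is a hexadecimal bitmap with one bit per CPU. As a sanity check, the mask for a CPU range can be computed with shell arithmetic; for CPUs 4-11 this yields ff0, matching the command above:

```shell
# Compute a CPU affinity bitmap for CPUs 4 through 11
mask=0
for cpu in $(seq 4 11); do
  mask=$(( mask | (1 << cpu) ))
done
printf '%x\n' "$mask"   # prints ff0
```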
Receive Packet Steering (RPS)#
In older Linux versions (before version 2.6.35) this would be sufficient to control network receive processing. In newer versions it is still sufficient if Receive Packet Steering (RPS) is either not compiled into the Linux kernel (using CONFIG_RPS) or not configured to be active in the Linux kernel. The default is that RPS is compiled in; whether it is used by default in the OS depends on the Linux distribution used.
If the network device supports RSS it isn't necessary to use RPS unless we also want to use Receive Flow Steering (RFS) which we will describe below.
When RPS is enabled, the soft IRQ has a very short execution: it simply issues a function call interrupt to a CPU selected for the bottom-half processing of the packets. This function call interrupt performs the same functionality the soft IRQ would otherwise have performed. As is fairly obvious, there is no benefit to RPS unless one of two things is present. The first is that the network device can only distribute HW interrupts to one CPU (this CPU might even be hard-wired to be CPU 0). In this case it is a good idea for the HW interrupt to do as little as possible and quickly start up another kernel thread to do the bottom-half processing. One popular example of such an environment is a normal Amazon EC2 environment; in this case it is important to use RPS to distribute the interrupt load over more than one CPU if you are using MySQL Cluster with high load.
The second case where RPS is beneficial is when RFS is used and the kernel can use some knowledge about where the recv system calls are processed. In this case it can issue the function call interrupt to execute on that particular CPU. If recv is often called from the same CPU, this is very beneficial to CPU caching and thus improves the throughput and latency of the system.
For NDB data nodes all recv processing is done by the recv thread for ndbmtd and by the execution thread for ndbd. Thus if the recv threads are bound to one or several CPUs, the recv execution will be very efficient for the NDB data nodes. The MySQL Server has a number of recv calls done by the NDB API. By setting ndb_recv_thread_cpu_mask to the desired CPU(s) AND setting ndb_recv_thread_activation_threshold to 0, all recv execution for the NDB API will be handled by the CPU(s) set in this variable. If the MySQL Server uses several connections to the NDB data nodes then several CPUs are normally used. Otherwise the recv processing is handled by the SQL threads in the MySQL Server.
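As a hedged illustration of the MySQL Server side, a my.cnf fragment along these lines could bind the NDB API recv processing (the server options are ndb_recv_thread_cpu_mask and ndb_recv_thread_activation_threshold; the mask value 8, binding to CPU 3, is an assumption for illustration):

```
[mysqld]
# Hexadecimal bitmask of CPUs for the NDB API recv threads;
# the value 8 binds them to CPU 3 (illustrative choice)
ndb_recv_thread_cpu_mask=8
# With the activation threshold set to 0 the recv thread
# handles all receive processing for the NDB API
ndb_recv_thread_activation_threshold=0
```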
For RPS to work well in combination with RFS it is important to control the CPU placement of the execution of the recv calls. For the absolute best real-time experience and the absolute best performance of MySQL Cluster, these options are worth considering and experimenting with. For the majority of users of MySQL Cluster this is likely not worth considering.
Disabling and enabling of RPS can be done per receive queue. Disabling is done by the following command (with eth0 replaced with the device name and rx-0 replaced by the name of the receive queue).
echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
To enable it one uses the same command and replaces the 0 by a comma-separated list of hexadecimal numbers (each hexadecimal number contains up to 64 bits; only in systems with more than 64 CPUs is it necessary to use a list rather than one hexadecimal number). E.g. to configure the rx-0 queue for eth0 to use CPUs 1, 2, 3 and 5 we set it to the hexadecimal number 2e.
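When every receive queue should use the same CPU set, a small loop avoids repeating the command per queue. This is a sketch assuming the device eth0 and the mask 2e from the example above; it requires root privileges:

```shell
# Enable RPS with CPUs 1, 2, 3 and 5 (mask 2e) on all
# receive queues of eth0 (requires root)
for q in /sys/class/net/eth0/queues/rx-*; do
  echo 2e > "$q/rps_cpus"
done
```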
When not using RPS the network driver will first interrupt the CPU with a HW interrupt; after processing the HW interrupt it will wake up ksoftirqd/5 (where 5 is the number of the CPU where the HW interrupt happened; this process always executes bound to CPU 5). The ksoftirqd process will do the rest of the processing and will then wake up any threads waiting for data on the socket.
When using RPS the same execution will happen, but the major part of the Soft IRQ processing is handled through a function call interrupt. This interrupt wakes up one of the CPUs that was part of the bitmask for the receive queue as set above.
To control interrupt processing when using RPS we need to configure both the CPUs used to take care of the HW interrupts as well as the CPUs used to handle the function call interrupts for the larger part of handling the network receive stack.
It isn't necessary to use the same CPUs for HW interrupts as for function call interrupt processing.
Receive Flow Steering (RFS)#
As mentioned above it is possible to use RFS in conjunction with RPS. RFS means that the function call interrupt part of RPS has data structures set up to ensure that we can send the function call interrupt to a CPU that is close to the CPU, or even the very CPU, that last called recv on the socket for which the packet arrived.
In order to set up those data structures Linux needs to know how many active sockets to work with; Linux will only keep track of active sockets and when they last used the recv system call. The intention of RFS is to ensure that the recv system call can start up immediately after the interrupt processing is done by the kernel thread. Since they execute on the same CPU, or one close by from a CPU cache point of view, the packet processing will be faster since we will have very good CPU cache behaviour.
For RFS to work well it is important that the recv calls on a socket are performed from CPUs that are not far away from each other; the best case is if they are always executed on the same CPU or the same CPU core. For NDB data nodes this means controlling the CPU placement of the recv thread such that this thread is either on the same CPU all the time or on a set of CPUs close to each other. For RFS to work it is necessary that the CPUs used by the RPS processing include the CPUs used by the recv thread, or at least CPUs very close to those.
RFS is not active by default; it requires the kernel to be compiled with CONFIG_RFS (which is on by default). To activate RFS one needs to set the number of active sockets to track. This is done by setting this value in the following manner:
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
The number 32768 has been found to be appropriate in medium-sized configurations. In a node that primarily runs an NDB data node and/or a MySQL Server this value should be sufficient.
In addition it is necessary to set the number of sockets tracked per receive queue. This is normally simply the total number divided by the number of receive queues. The example below shows how to set this for receive queue rx-0 on the eth0 device. Replace 0 by the receive queue id and eth0 by the device name in the general case.
echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
Both values are rounded up to the nearest power of two. Both values need to be set for RFS to be active on a receive queue.
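The per-queue value can be derived from the total and the number of receive queues. A small sketch, assuming 8 receive queues and the 32768 total from above (the eth0 device name in the commented loop is also an assumption):

```shell
total=32768
queues=8
per_queue=$(( total / queues ))
echo "$per_queue"   # prints 4096

# Apply the value to every receive queue of eth0 (requires root):
# for q in /sys/class/net/eth0/queues/rx-*; do
#   echo "$per_queue" > "$q/rps_flow_cnt"
# done
```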
If the network device driver supports Accelerated RFS (ARFS), this will be automatically enabled provided two things hold. First, the Linux kernel must have been compiled with CONFIG_RFS_ACCEL; second, ntuple filtering must be activated using ethtool. ARFS means that the network card will automatically send the interrupt to the correct CPU. This is implemented by a call to the device driver every time recv is called from a new CPU for a socket. It is desirable that this CPU doesn't change too often.
Transmit Packet Steering (XPS)#
In the same fashion that many network cards have multiple receive queues, they often have multiple transmit queues. In order to ensure that locking on those transmit queues doesn't hurt scalability, it is possible to specify which CPUs are able to use a certain transmit queue.
One complication in using the XPS feature is that when a stream of packets for one socket has selected a transmit queue, it isn't possible to change the transmit queue for this socket until an acknowledgement has arrived for the outstanding packets. This introduces unnecessary delays when sending is done from any CPU in the system.
With MySQL Cluster we can send packets on any socket from many different places. For example, an NDB data node needs to be able to use the same set of transmit queues from all CPUs that are used by the NDB data node. However, if we have several data nodes on the same machine, or if we have a MySQL Server on the same machine that uses different CPUs, then it is possible to use different sets of transmit queues for the different processes.
To configure XPS the Linux kernel must be compiled with CONFIG_XPS (default) and for each transmit queue one must set the CPUs that can use this transmit queue.
echo fff > /sys/class/net/eth0/queues/tx-0/xps_cpus
The above setting means that transmit queue tx-0 can only be used from CPUs 0-11 for the device eth0. Replace 0 by the transmit queue id and eth0 by the device name in the general case. The value is a hexadecimal number representing a bitmap, with a comma-separated list of bitmaps when more than 64 CPUs exist in the machine.
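As a sketch of splitting transmit queues between two processes (the device name, queue names and CPU ranges are assumptions for illustration), each queue is given its own CPU set so that different CPU groups map onto different queues:

```shell
# Give tx-0 to CPUs 0-11 (mask fff) and tx-1 to CPUs 12-23
# (mask fff000); requires root, and the number of tx queues
# depends on the device driver
echo fff    > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo fff000 > /sys/class/net/eth0/queues/tx-1/xps_cpus
```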
Linux CPU isolation#
There is a boot option in Linux called isolcpus. This option makes it possible to remove a set of CPUs from normal scheduling; these CPUs are not considered when scheduling a thread to execute. The only way to get a thread to execute on those CPUs is to explicitly bind it to them. This happens when using the ThreadConfig variable and setting cpubind or cpuset on a set of threads, and also when using the configuration variables LockExecuteThreadToCpu and LockMaintThreadsToCpu.
The isolcpus option could be used to isolate the NDB processes from other processes in the system. In this case one should ensure that all thread types have been locked to CPUs that are part of the isolcpus set, and when using the MySQL Server one should bind the MySQL Server using taskset or numactl. Another option is to use isolcpus to isolate the CPUs used for network interrupt handling, to ensure the most predictable latency for the user of MySQL Cluster.
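As a hedged example of where the option goes (the CPU range 4-15 is purely illustrative, and the exact file and regeneration command vary between distributions):

```shell
# Kernel boot command line fragment (illustrative CPU range):
#   isolcpus=4-15
# On many distributions this is appended to GRUB_CMDLINE_LINUX in
# /etc/default/grub, after which the grub configuration is
# regenerated and the machine rebooted.
```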
Setting up the operating system properly for use by MySQL Cluster means that you should ask yourself a couple of questions.
The first obvious question is whether you are willing to spend time optimising the installation of MySQL Cluster at this level. If you are trying out MySQL Cluster for a simple application it is probably not worthwhile. If you are developing an application or an infrastructure component in your company used by thousands of people, it is still fairly questionable whether you would do this. A normal Linux installation will get you far, and spending a few extra hours or days isn't likely to bring benefits that make this worthwhile.
If you are a student and you want to learn how MySQL Cluster works, how Linux works and build competence for this and other projects, for sure go ahead and try out all possible variants such that you have a good understanding of the Linux network stack and how MySQL Cluster operates in its context.
If you are developing an application that will be used by millions or even billions of users, and you are living in a competitive environment where your solution is compared to other similar solutions, it is likely to pay off to read this chapter and apply some of these techniques in your production setup. If you are deploying thousands of servers and this setup can improve performance by 10-20%, it is worthwhile to spend time optimising the installation.
Another time when you need to read this chapter and apply it to your installation is when you hit specific bottlenecks that must be overcome before your installation is considered complete.
One good example where you might hit specific bottlenecks is if you install MySQL Cluster in a cloud environment and get hit by the fact that the single CPU used to handle networking interrupts is limiting your throughput.
If you have decided to go ahead and apply the techniques provided in this chapter, the first technical question to ask is whether your network card(s) only have one receive queue to handle networking interrupts. In that case you are limited to only one CPU handling the receive part of the networking, and you should go ahead and use RPS.
If your network devices already support multiple receive queues, more questions are needed before you decide whether to use RPS. The next question is whether you have decided to set up a MySQL Cluster installation where you control the CPUs and are in full control over where processes execute. If this is the case you should go ahead and use RPS, and take one step further and also use RFS. Using RPS without RFS where RSS exists is simply extra overhead without any benefits: it will increase your response time to queries in MySQL Cluster without creating any benefit.
The next step is to decide whether to use XPS; this really only applies to large servers. It can provide benefits when you are colocating the MySQL Server and the NDB data node on the same machine or the same virtual machine. In this case it is a good idea to ensure that the data nodes are placed on one set of CPUs with their own set of transmit queues, and that the MySQL Server uses a different set of CPUs and transmit queues.
These are things that one should put into scripts and automate as part of the process of setting up MySQL Cluster machines. The above description works for bare metal machines, for virtual machines, and for machines in a cloud environment.
Interrupt processing of the network stack requires a fair amount of CPU resources, and this is an important part of the MySQL Cluster infrastructure to consider. One should expect to use around 20% of the available CPU resources for network interrupt processing in a distributed MySQL Cluster environment; counting both send and receive, the cost of communication can often be half of the available CPU resources.
Example CPU budget#
Let's consider setting up one MySQL Cluster data node using an OC5M instance on the Oracle Cloud. In this case we have a virtual machine that has 16 Intel Xeon v3 cores with 240 GByte of memory.
In this case we will allocate 8 cores to run 8 ldm threads, and 3 cores to handle the main and rep threads as well as the io threads.
This setup assumes that one will use compressed backups and compressed LCPs. This means that a few CPUs can be used for compressing writes to files in the worst case. These 3 cores are also intended for use by the operating system.
Next we will use 3 cores for the send and recv threads: 3 CPUs for send threads and 3 CPUs for the recv threads.
Finally the last 2 cores are intended for 4 tc threads.
The ThreadConfig setting for these would be.
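A ThreadConfig along the following lines would match this budget; the exact CPU numbers depend on how the virtual machine enumerates its hyperthreaded CPUs, and the numbers below are illustrative assumptions only (chosen so that the recv threads land on CPUs 3, 19 and 20, consistent with the interrupt mask discussed below):

```
ThreadConfig="ldm={count=8,cpubind=4-11},tc={count=4,cpubind=12,13,28,29},send={count=3,cpubind=1,17,18},recv={count=3,cpubind=3,19,20},main={cpubind=0},rep={cpubind=16},io={cpubind=2}"
```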
In this setup it would be a good idea to use all of RSS, RPS and RFS and ensure that all receive function call interrupts are handled by the same CPUs as the recv threads. It might be a good idea to move one CPU used by the send threads to the recv threads. In this manner the OS can always route the interrupt processing to the same CPU where the actual application receive processing is done.
In this case we would thus use CPUs 3 and 19-20 for network interrupt processing, especially the RPS part and possibly also the RSS part. Thus we ensure that the 3 CPUs used for the recv threads are part of the CPUs that can be used for the network interrupt function handlers.
This means that the CPU mask to use in this case would be 180008 (note that CPU masks are always hexadecimal numbers, and so is this one even though it contains only decimal digits). This mask can be used both to set the CPU mask for HW interrupts and for the RPS CPUs.
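The mask can be double-checked with shell arithmetic; CPUs 3, 19 and 20 give exactly 180008:

```shell
# Bitmap for CPUs 3, 19 and 20
printf '%x\n' $(( (1 << 3) | (1 << 19) | (1 << 20) ))   # prints 180008
```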
We could also isolate the transmit interrupts. First we disable sending from the ldm threads. Sending can then happen from the CPUs where the send threads are, from the core where the main and rep threads are, and from the two CPU cores used by the tc threads. Thus send interrupts will happen from 13 CPUs, and we can set up XPS to use those 13 CPUs only and thus completely isolate the ldm threads from both receive and send activities.
If we want to isolate the ldm threads from the rest of the activity in the OS as well, we can set isolcpus as follows (not possible to do on Cloud instances as far as I am aware):