# Analysis of hyperthreading performance
In this chapter we look specifically at the concept of hyperthreading. It is an introduction to the chapter on Advanced Thread Configuration. Thread configuration is handled automatically in RonDB, but it is possible to configure the thread setup in great detail, which can be useful for those who want to use RonDB as a research tool for distributed DBMSs.
It is important to understand when hyperthreading helps, how much it helps, and in which situations to use it and when not to. We have one section for x86 servers, which always have 2 CPUs per CPU core. We also analyse the SPARC M7 servers, which use 8 CPUs per CPU core. Understanding this is important for configuring RonDB properly, especially for advanced configurations where one wants to get the best performance out of RonDB. Some thread types benefit greatly from hyperthreading and some almost not at all.
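On Linux the CPU topology is easy to inspect. A quick check of cores versus hyperthreads might, for example, look as follows (the exact output shape varies between machines):

```bash
# On an x86 server with hyperthreading enabled, lscpu reports
# "Thread(s) per core: 2", i.e. 2 CPUs per CPU core.
lscpu | grep -E "Thread\(s\) per core|Core\(s\) per socket|^CPU\(s\)"

# List the hyperthread siblings sharing a core with CPU 0,
# e.g. "0,4" or "0-1" depending on the CPU numbering scheme.
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```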
The analysis in this chapter is a direct input into the next chapter on advanced thread configuration options for RonDB.
We will analyse the ldm, query, tc, send and recv (receive) threads in the data nodes. We will also analyse the normal MySQL Server threads used to execute SQL queries, as well as the send and receive threads of the NDB API cluster connections.
Most of the analysis was performed by running Sysbench OLTP RW. On Linux the behaviour was analysed using perf stat, a command in the perf tool suite that, among other things, calculates IPC (instructions per cycle). On SPARC/Solaris the method used was to run benchmarks while ensuring that the thread type being analysed was the bottleneck.
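As an illustration, the IPC of a particular data node thread can be derived with perf stat roughly as follows; the thread id below is a placeholder that would be taken from the actual process:

```bash
# List the threads (SPID column) of the ndbmtd data node process;
# the thread names shown depend on the version.
ps -T -p $(pidof ndbmtd)

# Count instructions and cycles for one thread over 60 seconds;
# perf stat reports the IPC as "insn per cycle".
perf stat -e instructions,cycles -t <thread-id> -- sleep 60
```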
## x86 servers
The ldm and query threads are the thread types that dominate the CPU usage in the data nodes. This is especially true for Sysbench OLTP RW, where a lot of scans of 100 rows are performed. tc threads get more work when the workload contains many small primary key lookups. send threads always have a fair amount of work to do and get more when the nodes to send to are on other machines.
## ldm and query threads
In a Sysbench OLTP RW execution the ldm and query threads use almost 75% of the CPU in the data nodes. Two things affect the value of hyperthreading for these threads. The first is the efficiency of hyperthreading for these thread types, and the second is the fact that increasing the number of these threads means more memory is needed to communicate between threads.
ldm and query threads are best configured on the same CPU core. This means that if the ldm thread is not used much, due to the partitioning, the CPU core can instead be heavily used by the query thread and thus assist other ldm threads.
However, large tables and tables that are heavily updated should be spread among the ldm thread instances as much as possible.
When a large table is used with unique key lookups and primary key lookups, the queries always go immediately to the correct partition, so the number of partitions doesn't matter for performance in this case. Thus the optimum for these queries is to have one primary and one backup replica fragment per ldm thread.
For scans that use the partition key properly, one can ensure that the scan only goes to one partition; for these scans the number of partitions doesn't matter either. An example of such a partition-pruned scan is sketched below.
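As a sketch, consider the hypothetical table below, distributed on user_id; a scan that fixes user_id in its condition is pruned to a single partition:

```sql
-- Hypothetical table distributed on user_id, so that all rows for a
-- given user end up in the same partition.
CREATE TABLE user_events (
  user_id INT NOT NULL,
  event_id INT NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (user_id, event_id)
) ENGINE=NDB
  PARTITION BY KEY (user_id);

-- This scan is pruned to the one partition that holds user_id = 17,
-- regardless of the number of partitions.
SELECT event_id, payload
FROM user_events
WHERE user_id = 17;
```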
However, all other scans, both full table scans and ordered index scans, are performed on all partitions. This means that if we are only looking for a small number of rows, the overhead of starting up a scan on each partition will be noticeable. At the same time the parallelism increases with more ldm threads, and thus the latency at low load decreases. For a generic application that hasn't been optimised for partition pruning, an increase in the number of partitions generally means decreased scalability of the application.
Thus the optimum for scans that are not partition pruned is to use 1 partition. In some situations there is still some gain, in particular when handling large join queries, in using a few more partitions.
To handle this we have set the default to 2 partitions per node. This gives a good balance between efficiency, scalability and parallelism. This has become possible with the introduction of query threads in RonDB.
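The default can also be overridden per table. As a sketch, NDB's PARTITION_BALANCE table option can give a large, heavily updated table one partition per ldm thread (the table itself is hypothetical):

```sql
-- Sketch: spread a large, heavily updated table over all ldm threads
-- by using one partition per ldm thread of the primary replicas.
CREATE TABLE hot_table (
  id BIGINT NOT NULL PRIMARY KEY,
  data VARBINARY(1024)
) ENGINE=NDB
  COMMENT="NDB_TABLE=PARTITION_BALANCE=FOR_RP_BY_LDM";
```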
The performance gain from using one ldm thread per CPU rather than one ldm thread per CPU core is only 15%. Thus for generic applications it isn't really beneficial to use hyperthreading for ldm threads. It is better to use only one ldm thread per CPU core.
As mentioned above, it is thus better to use one ldm and one query thread per CPU core when hyperthreading is available, to ensure that we make full use of all CPU cores.
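As a sketch, such a pairing can be expressed with the ThreadConfig parameter. The CPU numbers below assume a Linux numbering where CPUs n and n+8 are the two hyperthreads of core n on an 8-core machine; the real sibling layout must be verified on the machine (e.g. via thread_siblings_list as shown earlier):

```ini
# Sketch: 4 ldm/query pairs, each pair bound to the two hyperthreads
# of one CPU core; tc, send and recv use both hyperthreads of the
# remaining cores. CPU numbering is machine specific and assumed here.
# Other thread types are left at their defaults for brevity.
[ndbd default]
ThreadConfig=ldm={count=4,cpubind=0-3},query={count=4,cpubind=8-11},tc={count=4,cpubind=4-5,12-13},send={count=2,cpubind=6,14},recv={count=2,cpubind=7,15}
```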
There is also a third thing that affects latency and performance when the number of ldm threads increases. Doubling the number of ldm threads can have an impact on the number of application threads required to reach the top performance levels. This is mostly an effect of queueing mechanisms and of the fact that more ldm threads use more hardware paths. To some extent it is also an effect of internal mutexes whose impact grows as the number of ldm threads increases.
Exactly why hyperthreading isn't so beneficial for ldm threads we don't know; it is not uncommon for database engines to benefit little from hyperthreading. Our experiments show that ldm threads have an IPC of 1.27, so a likely cause is that with 2 threads per CPU core we run out of compute resources. The cache levels definitely have an impact on the behaviour as well.
The most likely reason is that we overload the instruction cache, which is shared between the hyperthreads.
## Other thread types
For the remaining thread types hyperthreading is beneficial, although it is good to know how large the benefit is for each of them.
For tc threads the benefit is a 40% performance increase. For send threads it likewise increases performance by 40%. For recv (receive) threads the benefit is a lot higher, all the way up to a 70% increase in performance. The reason is that the receive thread does a lot of work preparing protocol headers for the other threads; it does a fair amount of CPU processing on each byte it receives on the wire. ldm threads, by contrast, work on large memory segments and thus incur many cache misses that miss in all cache layers.
The MySQL Server threads that execute SQL queries benefit from a 60% performance increase when using hyperthreading, and we see a similar number for the receive threads of the NDB API cluster connections.
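As a hedged sketch, the receive thread of a MySQL Server's cluster connection can be locked to a CPU of its own with the ndb_recv_thread_cpu_mask option; the mask is a hexadecimal CPU bitmask, and CPU 15 below is an arbitrary choice for illustration:

```ini
# Sketch: lock the NDB API receive thread to CPU 15 (bit 15 = 8000 hex)
# and let it take over receive processing already at low load.
[mysqld]
ndb_recv_thread_cpu_mask=8000
ndb_recv_thread_activation_threshold=1
```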
## Conclusion for x86 servers
The simplest rule to follow is to use hyperthreading for all thread types except the ldm threads, where one instead uses one ldm and one query thread per CPU core. A more refined rule is to look in more detail at the application's scalability and at the number of CPU cores available when deciding whether to use hyperthreading for ldm threads.
## Analysis of MySQL Server threads
We ran a Sysbench OLTP RW benchmark where we kept the number of ldm threads constant at 8 and ensured that the other thread types weren't a bottleneck. We consistently used 4 SPARC cores and only varied the number of CPUs made available to the MySQL Server (using the pbind command).
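For reference, a minimal sketch of such binding on Solaris looks as follows (the process id is a placeholder):

```bash
# Bind the MySQL Server process (pid assumed to be 1234) to processor 0.
pbind -b 0 1234

# Query which processor the process is currently bound to.
pbind -q 1234
```

Making a whole set of CPUs available, as in the benchmark above, can alternatively be arranged with Solaris processor sets (psrset).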