New features in RonDB 22.10#
Support for variable-sized disk rows#
The flagship feature of RonDB 22.10 is that disk columns become a first-class feature of RonDB. Previously this was a feature mainly usable by expert users. As an example, HopsFS has built the capability to store small files in disk columns in RonDB.
Disk columns previously had the limitation that disk rows always had a fixed size. Thus declaring a VARCHAR(100) column using utf8mb4 meant that 400 bytes of space were used in every row, independent of what was actually stored in the column.
With RonDB 22.10, disk rows only use the space they require, so storage consumption depends on the actual data. The implementation uses the same data structures as in-memory rows, which means we will also be able to support online adding of new disk columns.
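As a back-of-the-envelope illustration of this (a minimal sketch; the example value and the 2-byte length prefix are assumptions, and real rows also carry header and page overhead that is ignored here):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
  // Old fixed-size disk row format: VARCHAR(100) in utf8mb4 always reserves
  // 100 * 4 = 400 bytes, regardless of the stored value.
  const std::size_t fixed_bytes = 100 * 4;

  // New variable-sized format: only the bytes actually stored plus a small
  // length prefix (assumed here to be 2 bytes).
  const char *value = "[0.12, 0.34, 0.56, 0.78]";   // example column value
  const std::size_t variable_bytes = std::strlen(value) + 2;

  std::printf("fixed-size disk column    : %zu bytes\n", fixed_bytes);
  std::printf("variable-sized disk column: %zu bytes\n", variable_bytes);
  std::printf("space saving              : %.1fx\n",
              static_cast<double>(fixed_bytes) /
              static_cast<double>(variable_bytes));
  return 0;
}
```

The actual saving depends entirely on the data stored; short values in wide VARCHAR columns benefit the most.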
Feature Store applications often store arrays of numbers and arrays of status variables that are variable in size, so this leads to significant savings in storage space, easily a factor of 10x.
The combination of this and the use of disk columns makes it possible to increase the storage capacity of a Feature Store by a factor of 10x at a similar price.
RonDB 22.10 thus takes full advantage of the development of modern SSDs and NVMe drives. There has been a major effort in recent years to improve the quality of the code handling disk columns. With RonDB 21.04 and RonDB 22.10, disk columns have the same quality as in-memory columns. With RonDB 22.10 they also have the same space efficiency as in-memory columns, and thus disk columns move from the expert domain to the normal users of RonDB.
Modern NVMe drives can handle very large loads; main memory is still more capable and delivers much better latency and throughput, but at a higher cost. With the RonDB 22.10 release, users can therefore decide based on their requirements whether to store features in main-memory columns or in disk columns. Given the rapid development of NVMe drives, we expect disk columns to become increasingly important for RonDB applications.
Using RonDB 22.10 it is now possible to store petabytes of data in a single RonDB cluster. On top of this, even more data can be stored in the HopsFS file system, which is also built on top of RonDB and uses HopsFS data nodes to store large files. On top of HopsFS, Hudi enables efficient SQL query execution over these many petabytes of data. Thus, with RonDB as the base platform for its data, Hopsworks enables applications to store huge amounts of data for machine learning, both online and offline.
Storing columns on disk was introduced into MySQL NDB Cluster as early as version 5.1. It has been constantly improved, and efforts have been made to make it much more stable and performant. Thus in RonDB 22.10 the quality of disk columns is on par with in-memory columns.
Benchmark experiments available on www.rondb.com show the latency and throughput of in-memory columns and disk columns. The section on Scalable Storage focuses on the performance and latency of disk columns using the YCSB benchmark.
On mikaelronstrom.blogspot.com there are blog posts from October 2020 about the performance of large insert loads into tables using disk columns. These experiments show how RonDB can handle well over 1 GByte per second of inserts into disk columns, given sufficient hardware to support it.
In RonDB 22.10 we add, on top of this performance and stability, a more compact representation of the data in disk columns. Previously each row had a fixed size; with RonDB 22.10 each disk row uses the same data structure as in-memory rows. This means that columns of variable size are stored in a variable-sized row. The disk rows also support storing dynamic columns, which means that we will also be able to support online adding of disk columns.
Improved placement of primary replicas#
The distribution of primary replicas in RonDB 21.04 isn't optimal for the new fragmentation variants. With e.g. 8 fragments per table, 2 nodes and a four-LDM setup, two of the LDM threads get two fragments each to act as primary for, whereas the other two LDM threads get no primary replicas to handle.
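The sketch below illustrates the imbalance with a hypothetical fragment-to-LDM mapping (the mapping functions are illustrative assumptions, not the actual code): the old-style mapping puts all primaries on two of the four LDM threads, while a balanced mapping spreads them evenly.

```cpp
#include <cstdio>

int main() {
  const int fragments = 8, nodes = 2, ldm_per_node = 4;
  int old_load[2][4] = {};   // primaries per (node, LDM), old placement
  int new_load[2][4] = {};   // primaries per (node, LDM), balanced placement

  for (int f = 0; f < fragments; f++) {
    const int primary_node = f % nodes;                    // primaries alternate between nodes
    old_load[primary_node][f % ldm_per_node]++;            // old: LDM = fragment % 4
    new_load[primary_node][(f / nodes) % ldm_per_node]++;  // balanced alternative
  }

  for (int n = 0; n < nodes; n++)
    for (int l = 0; l < ldm_per_node; l++)
      std::printf("node %d, LDM %d: old %d primaries, balanced %d primaries\n",
                  n, l, old_load[n][l], new_load[n][l]);
  return 0;
}
```

With the old mapping, LDM threads 0 and 2 on each node end up with two primaries each while threads 1 and 3 get none; the balanced mapping gives each LDM thread exactly one primary.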
This is handled by a better assignment at table creation. However, to handle inactive nodes we also need to redistribute the fragments at various events.
The redistribution is only allowed if all nodes in the cluster have been upgraded to at least RonDB 22.01. Older versions of RonDB will not redistribute, and we need to ensure that all data nodes use the same primary replicas; otherwise we would cause a multitude of constant deadlocks.
This change improves performance by about 30% for the DBT2 benchmark.
Index stat mutex bottleneck removed#
A major bottleneck in the MySQL Server is the index statistics mutex.
It is acquired three times per index lookup to gather index statistics and becomes a bottleneck when Sysbench OLTP RW reaches around 10,000 TPS with around 100 threads. Thus it is a severe limitation on the scalability of the MySQL Server using RonDB. In Sysbench benchmarks this improvement has provided at least 10% higher throughput.
To handle this we ensure that the hot path through the code doesn’t need to acquire the global mutex at all. This is solved by using the NDB_SHARE mutex a bit more and making the ref_count variable an atomic variable.
We also needed to handle some global statistics variables. This was fixed by accumulating them in a local object and occasionally transferring them to the global object.
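As a rough sketch of this pattern (hypothetical types and names, not the actual ha_ndbcluster code): the reference count becomes an atomic, and statistics are accumulated in a local object that is only occasionally folded into the global object under its mutex.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Global statistics object: its mutex still exists, but is off the hot path.
struct GlobalIndexStats {
  std::mutex mutex;
  uint64_t query_count = 0;
};

// Local statistics: updated on every index lookup without any mutex,
// and folded into the global object only occasionally.
struct LocalIndexStats {
  uint64_t query_count = 0;

  void on_index_lookup() { ++query_count; }       // hot path: no global mutex

  void flush_to(GlobalIndexStats &global) {       // cold path: rare
    std::lock_guard<std::mutex> guard(global.mutex);
    global.query_count += query_count;
    query_count = 0;
  }
};

// Share object: the reference count is atomic, so bumping it does not
// require taking any mutex on the hot path.
struct Share {
  std::atomic<uint32_t> ref_count{0};

  void acquire() { ref_count.fetch_add(1, std::memory_order_relaxed); }
  void release() { ref_count.fetch_sub(1, std::memory_order_acq_rel); }
};

int main() {
  GlobalIndexStats global;
  LocalIndexStats local;
  Share share;

  share.acquire();
  for (int i = 0; i < 1000; i++)
    local.on_index_lookup();
  local.flush_to(global);
  share.release();
  return static_cast<int>(global.query_count != 1000);
}
```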
Query thread improvement#
Query threads were introduced in MySQL Cluster 8.0.23 and could be used for READ COMMITTED queries. This feature extends them to also handle the PREPARE phase of locked reads using key-value lookups through LQHKEYREQ.
This means more concurrency and provides better scalability for applications that rely heavily on locked reads, such as the DBT2 benchmark.
Improved networking through send thread handling#
RonDB 22.10 improves the send thread handling, making it less aggressive and automatically activating send delays at high loads.
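A minimal sketch of the idea behind an automatically activated send delay (the thresholds and names are illustrative assumptions, not the actual send thread code): under low load we flush immediately to keep latency low, while under high load we delay the flush slightly so that more signals are packed into each send.

```cpp
#include <cstdint>
#include <cstdio>

struct SendState {
  uint32_t cpu_load_percent;  // measured load of the threads doing the sending
  uint32_t queued_bytes;      // bytes currently waiting in the send buffer
};

// Microseconds to wait before forcing a send of the buffered signals.
static uint32_t adaptive_send_delay(const SendState &s) {
  if (s.cpu_load_percent < 75 || s.queued_bytes > 32 * 1024)
    return 0;      // low load or already a large batch: flush immediately
  if (s.cpu_load_percent < 90)
    return 50;     // moderate overload: small delay improves packing
  return 200;      // heavy overload: larger delay, throughput over latency
}

int main() {
  const SendState low{40, 1024}, high{95, 1024};
  std::printf("delay at 40%% load: %u us\n", adaptive_send_delay(low));
  std::printf("delay at 95%% load: %u us\n", adaptive_send_delay(high));
  return 0;
}
```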
Move Schema Memory to global memory manager#
This is the final step in a long series of developments to move all major memory consumers to the global memory manager. From RonDB 22.01 onwards, all major memory consumers use the global memory manager.
This is mostly an internal change that ensures that all Schema Memory objects use the global memory manager. Memory configuration was already automatic in RonDB 21.04, so this change makes some memory more flexibly available.
Another major improvement is the addition of a malloc-like interface to the global memory manager. This is used in a few places, which e.g. means that a single table can now have any number of partitions, independent of the number of LDM threads in the data node.
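The sketch below shows roughly what a malloc-like interface on top of a page-based pool can look like (hypothetical classes; the real global memory manager has its own page pool and bookkeeping): callers request an arbitrary number of bytes that are carved out of pages taken from a shared pool, so an object such as a table's partition array can be sized exactly to its needs.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Stands in for the global memory manager's page pool.
class PagePool {
 public:
  static constexpr std::size_t kPageSize = 32 * 1024;
  void *alloc_page() { return std::malloc(kPageSize); }
  void release_page(void *page) { std::free(page); }
};

// malloc-like interface: bump-allocates variable-sized chunks out of pages
// fetched from the shared pool, and returns all pages on destruction.
class MallocLikeArena {
 public:
  explicit MallocLikeArena(PagePool &pool) : pool_(pool) {}
  ~MallocLikeArena() {
    for (void *page : pages_) pool_.release_page(page);
  }

  void *mem_alloc(std::size_t bytes) {
    bytes = (bytes + 7) & ~static_cast<std::size_t>(7);  // 8-byte alignment
    if (bytes > PagePool::kPageSize) return nullptr;     // keep the sketch simple
    if (pages_.empty() || used_ + bytes > PagePool::kPageSize) {
      void *page = pool_.alloc_page();
      if (page == nullptr) return nullptr;
      pages_.push_back(page);
      used_ = 0;
    }
    void *result = static_cast<char *>(pages_.back()) + used_;
    used_ += bytes;
    return result;
  }

 private:
  PagePool &pool_;
  std::vector<void *> pages_;
  std::size_t used_ = 0;
};

int main() {
  PagePool pool;
  MallocLikeArena arena(pool);
  // E.g. a partition array sized by the actual number of partitions,
  // not by a fixed schema constant.
  void *partitions = arena.mem_alloc(48 * 64);
  return partitions != nullptr ? 0 : 1;
}
```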
More flexibility in thread configuration#
This patch series was introduced mainly to make it possible to use RonDB to experiment with thread configurations that typically weren't supported in NDB. The main change is to enable the use of receive threads for all thread types.
With these changes it is possible to e.g. run with only a set of receive threads.
The long-term goal of this patch series is to find an even better automatic thread configuration.
Improve crashlog#
- Interleave Signal and Jam dumps. Signals are printed NEWEST first, and under each signal the corresponding Jam entries, OLDEST first.
- Let printPACKED_SIGNAL detect whether we're in a crashlog dump. If so, print the contained/packed signals NEWEST first and under each signal the corresponding Jam entries, OLDEST first. When not in a crashlog dump, print the contained signals NEWEST first without Jam entries.
- Better formatting and messages
    1. Cases with missing/unmatched signals and Jam entries are handled gracefully.
    2. Legend added.
    3. Print signal ids in hexadecimal form.
    4. Don't print the block number in packed signals.
- JamEvents can have five types
    1. LINE: As before, show the line number in the dump.
    2. DATA: Show the data in the dump with a "d" prefix to distinguish it from a line number. This type of entry is created by the *jam*Data* macros (or the deprecated *jam*Line* macros). The data is silently truncated to 16 bits.
    3. EMPTY: As before, do not show in the dump.
    4. STARTOFSIG: Used to mark the start of a signal and to save the signal id.
    5. STARTOFPACKEDSIG: Used to mark the start of a packed signal and to save both its signal id and pack index.
- Update Jam macros
    1. Deprecate the *jam*Line* macros and add *jam*Data* macros in their place: jamBlockData, jamData, jamDataDebug, thrjamData and thrjamDataDebug.
    2. Clean up, add documentation, and add an internal prefix to macros only used in other macro definitions.
- Static asserts to make sure that
    1. EMULATED_JAM_SIZE is valid.
    2. JAM_FILE_ID refers to the correct filename. This was previously tested occasionally, at run time in debug builds. With this change the test is always performed, at compile time (see the sketch after this list). The jamFileNames table and JamEvent::verifyId had to be moved from Emulator.cpp to Emulator.hpp in order to be available at compile time.
    3. The file id and line number fit in the number of bits used to store them.
- Refactoring, comments etc.
- Refactor signaldata
    1. Introduce a printHex function and use it to print Uint32 sequences.
- jamFileNames maintenance
    1. Add a test_jamFileNames.sh script to find problems in the jamFileNames[] table.
    2. Add a unit test for test_jamFileNames.sh.
    3. Correct the problems found.
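The compile-time JAM_FILE_ID check mentioned in the list above can be sketched roughly as follows (simplified, hypothetical code; in the real code the check uses __FILE__ and the block file's JAM_FILE_ID, and lives in Emulator.hpp):

```cpp
#include <cstddef>

// A tiny excerpt of a file-name table; in the real code this is the
// jamFileNames[] table indexed by JAM_FILE_ID.
constexpr const char *jamFileNames[] = {"Dblqh.cpp", "Dbacc.cpp", "Dbtup.cpp"};

// constexpr check that 'path' ends with the registered base name 'name',
// so it can be evaluated by static_assert at compile time.
constexpr bool ends_with(const char *path, const char *name) {
  std::size_t plen = 0, nlen = 0;
  while (path[plen] != '\0') plen++;
  while (name[nlen] != '\0') nlen++;
  if (nlen > plen) return false;
  for (std::size_t i = 0; i < nlen; i++)
    if (path[plen - nlen + i] != name[i]) return false;
  return true;
}

// In a real block file the first argument would be __FILE__ and the index
// would be that file's JAM_FILE_ID; a mismatch now breaks the build instead
// of being caught occasionally at run time in debug builds.
static_assert(ends_with("kernel/blocks/dblqh/Dblqh.cpp", jamFileNames[0]),
              "JAM_FILE_ID does not match this file's name");

int main() { return 0; }
```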
Make query threads the likely scenario#
An optimisation was made in the code such that the path using query threads is the one the compiler optimises for. It provides a minor performance improvement.
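A minimal sketch of this kind of change (illustrative names, using GCC/Clang's __builtin_expect; not the actual RonDB code): the branch taken when query threads are in use is marked as the expected one, so the compiler lays it out as the hot, fall-through path.

```cpp
#include <cstdio>

static bool use_query_threads() { return true; }   // assumed: the common case
static void execute_in_query_thread() { std::puts("query thread path"); }
static void execute_in_ldm_thread()   { std::puts("LDM thread path"); }

static void dispatch_read() {
  // Tell the compiler that the query-thread branch is the expected one.
  if (__builtin_expect(use_query_threads(), 1)) {
    execute_in_query_thread();   // hot, fall-through path
  } else {
    execute_in_ldm_thread();     // cold path
  }
}

int main() {
  dispatch_read();
  return 0;
}
```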
Upgrade to use OpenSSL 3.0#
Improved description of workings of the RonDB Memory manager#
Changed name of alloc_spare_page to alloc_emergency_page to make it easier to understand the difference between spare pages in DataMemory and allocation of emergency pages in rare situations.
Many comments were added to make it easier to understand the memory manager code.
Local optimisations#
1: Continue running if BOUNDED_DELAY jobs are around to execute
2: Local optimisation of Dblqh::execPACKED_SIGNAL
3: Inline Dbacc::initOpRec
4: Improve seize in CachedArrayPool
Replace md5 hash function with XX3_HASH64#
This change decreases the overhead for calculating the hash function by about 10x since the new hash function is better suited for SIMD instructions.
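As an illustration, assuming XX3_HASH64 corresponds to the xxHash library's 64-bit XXH3 variant (the key layout below is just an example, not the actual RonDB key format):

```cpp
#include <cstdint>
#include <cstdio>
#include <xxhash.h>   // xxHash library, provides XXH3_64bits()

int main() {
  // Example key: two 32-bit words, as a stand-in for a primary key buffer.
  const uint32_t key_words[2] = {12345, 67890};

  const XXH64_hash_t hash = XXH3_64bits(key_words, sizeof(key_words));
  std::printf("XXH3 64-bit hash of key: %016llx\n",
              static_cast<unsigned long long>(hash));
  return 0;
}
```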