Detailed description of new features in RonDB 21.04#

Integrated benchmarking tools in RonDB binary distribution#

In the RonDB binary tarball we have integrated a number of benchmark tools to assess the performance of RonDB.

Currently Sysbench, DBT2 are fully supported, and flexAsynch and DBT3 are included in the binary distribution.

Sysbench is a simple SQL benchmark that is very flexible and can be used to benchmark a combination of simple primary key lookup queries, scan queries and write queries.

DBT2 is an open source variant of TPC-C, a benchmark for OLTP applications. It uses ndb_import for data loading, so can also be used to see how RonDB handles data loading.

DBT3 is an open source variant of TPC-H, a benchmark for analytical queries.

flexAsynch is an NDB benchmark using the C++ NDB API. It tests the performance of key lookups to an extreme level.

Improved report of use of memory resources in ndbinfo.resources table#

A new row to the ndbinfo table resources was added which reports TOTAL_GLOBAL_MEMORY. This is the sum of memory resources managed by the shared global memory that contains

Data memory
Transaction Memory
Transporter Memory
Job buffers
Query Memory
Disk records
Redo buffers
Undo buffers
Schema transaction memory
Shared Global Memory

3x performance improvement of ClusterJ#

When running applications that create and release large amounts of new session instances using the session.newInstance and session.release interface, the scalability is hampered dramatically and a significant overhead is created for the application.

In a simple benchmark where one runs one batch of primary key lookups at a time, each batch containing 380 lookups, the microbenchmark using standard ClusterJ can only scale to 3 threads and then handle a bit more than 1000 batches per second.

Using the BPF tool it was clear that more than a quarter of the CPU time is spent in the call to session.newInstance. ClusterJ uses Java Reflection to handle dynamic class objects.

A very simple patch of the microbenchmark showed that this overhead and scalability hog could quite easily be removed by putting the objects into a cache inside the thread instead of calling the session.release. This simple change made the microbenchmark scale to 12 threads and reach 3700 batches per second.

The microbenchmark was executed on an AMD workstation using a cluster with 2 data nodes with 2 LDM threads in each and 1 ClusterJ application.

This feature moves this cache of objects into the Session object. The session object is by design single-threaded, so no extra mutex protections are required as using a Session object from multiple threads at the same time is an application bug.

It is required to maintain one cache per Class. A new configuration parameter com.mysql.clusterj.max.cached.instances was added where one can set the maximum number of cached objects per session object.

We maintain a global linked list of the age of objects in the cache independent of its Class. If the cache is full and we need to store an object in the cache we will only do so if the oldest object has not been used for at least a max number. Each put into the cache increases the age by 1. If the age of the oldest object is higher than 4 * com.mysql.clusterj.max.cached.instances, then this object will be replaced, otherwise we will simply release the object.

From benchmark runs it is clear that it’s necessary to use a full caching of hot objects to get the performance advantage.

The application controls the caching by calling releaseCache(Class\<?>) to cache the object. The release call will always release the object.

In addition the application can call session.dropCacheInstance(Class\<?>) to drop all cached objects of a certain Class. It can also call session.dropCacheInstance() to drop all cached objects.

In addition the application can also cache session objects if it adds and drops those at rapid rates by using the call session.closeCache(). If one wants to clear the cache before placing the session object into the cache one can use session.closeCache(true).

Changes of defaults in RonDB#

Changed defaults for

ClassicFragmentation: true -> false
SpinMethod: StaticSpinning -> LatencyOptimisedSpinning
MinWriteSpeed: 10M -> 2M
MaxDiskWriteSpeed: 20M -> 4M
AutomaticThreadConfig: false -> true

The ClassicFragmentation set to true and AutomaticThreadConfig set to false is for backwards compatability of NDB Cluster. RonDB focus on new users and thus use the most modern setup as the default configuration.

We set the default for spinning to LatencyOptimisedSpinning since RonDB is focused on Cloud installations. This will give the best latency when possible with very little extra use of power.

We decrease the setting of MinDiskWriteSpeed and MaxDiskWriteSpeed since EnablePartialLcp and EnableRedoControl will both ensure that we will always write enough LCP, so we decrease the minimum disk write speed to decrease the overhead in low usage scenarios.

Introduced new configuration variables:

AutomaticMemoryConfig
TotalMemoryConfig

AutomatedMemoryConfig is set by default. By default memory size is retrieved from the OS. Setting TotalMemoryConfig overrides this. AutomatedMemoryConfig requires a minimum of 8GB of memory available.

When this is set we change the defaults of SchemaMemory to the following:

MaxNoOfAttributes=500000
MaxNoOfTriggers=200000
MaxNoOfTables=8000
MaxNoOfOrderedIndexes=10000
MaxNoOfUniqueHashIndexes=2300

In this mode the default action is to get the memory from the OS using the NdbHW module. One can also set it using a new config variable TotalMemoryConfig. This is mainly intended for testing.

The default will work very well when starting an NDB data node in a cloud setting. In this case the data node will use the entire VMs memory and CPU resources since also AutomaticThreadConfig is default.

Thus using this setting it is not necessary to set anything since even DataMemory and DiskPageBufferMemory is set based on the total memory available.

The way it works is that the following parameters now have a 0 default. 0 means that it is automatically calculated.

RedoBuffer will get 32 MByte per LDM

UndoBuffer will get 32 MByte per LDM

LongMessageBuffer will get 12 MByte per thread used, but the first thread will get 32 MByte.

TransactionMemory will get 300 MByte + 60 MByte per thread

SharedGlobalMemory will get 700 MByte + 60 MByte per thread

We avoid using memory that is required by the OS. We avoid using 1800 MByte plus 100 MByte for each block thread.

SchemaMemory is around 1 GB in a small cluster and increases with increasing cluster size and increasing node sizes.

We will compute the use of:

Restore Memory
Schema Memory
Packed Signal Memory
OS memory usage
Send Buffer Memory
Transaction Memory
Shared Global Memory
NDBFS Memory
Static memory (in various objects)
Redo Buffer
Undo Buffer
Backup Page Memory
Job Buffers

Also Page_entry objects used by DiskPageBufferMemory is taken into account.

After computing these buffers, schema memory, various internal memory structures and the memory we want to avoid using such that the OS also has access to some memory, the remaining memory is for DataMemory and DiskPageBufferMemory. 90% of the remaining memory is given to the DataMemory and 10% to the DiskPageBufferMemory.

In addition we have changed the default of MinFreePct to 10%. This provides some possibility to increase sizes of buffers if necessary. It can also be decreased down to 5 and possibly even down to 3-4 to enable more memory for database rows.

One can also set DiskPageBufferMemory to increase or decrease its use. This will have a direct impact on the memory available for DataMemory.

Increased the hash size for foreign keys in Dbtc to 4096 from 16.

Ensured that UndoBuffer configuration parameter is used if defined or if using automatic memory configuration when defining the UNDO log. The memory is added to TransactionMemory and thus can be used for this purpose if no Undo log is created.

Fine tuned some memory parameters, added a bit more constant overhead to OS, but decreased overhead per thread.

Increase default size of REDO log buffer to be 64M per LDM thread to better handle intensive writes, particularly during load phases.

Improved automated configuration of threads after some feedback on benchmark executions. This feedback showed that as much as possible hard-working threads should not share CPU cores. Also more tc threads are needed.

Removed use of overhead CPUs up to 17 CPUs. This means that only when we have 18 CPUs we will remove CPUs for overhead handling. Overhead handling is mainly intended for the IO threads.

The previous automatic memory configuration provided around 800M free memory space in a 16 CPU machine with almost 128 GB of memory. This is a bit too aggressive, so we added 200M to free space + added 35M extra per thread, thus changing the scaling from 15M per thread to 50M per thread in OS overhead.

This is mostly a security precaution to avoid risking running out of memory and experiencing any type of swapping.

Needed to reserve even more space on large VMs, added 1% of memory reserved and another 50 MByte per thread.

ClusterJ lacked support for Primary keys using Date columns#

Removed this limitation by removing a check for that this is not supported. In addition required the addition of a new object to handle Date as Primary Key based on the handling of Date columns in non-key columns.

Required also a new test case in the Clusterj testsuite. This required fairly significant rework since the current testsuite was very much focused on testing all sorts of column types, but not very much focused on testing of different data types in primary key columns.

Support minimal memory configurations#

To support minimal configurations down to 4GB memory space we made the following changes:

Make it possible to set TotalMemoryConfig down to 3GB
Remove OS overhead when setting TotalMemoryConfig

Using the following settings:

NumCPUs=4
TotalMemoryConfig=3G
MaxNoOfTables=2000
MaxNoOfOrderedIndexes=1000
MaxNoOfUniqueHashIndexes=1000
MaxNoOfAttributes=20000
MaxNoOfTriggers=30000
TransactionMemory=200M
SharedGlobalMemory=400M

We will be able to fit nicely into 3GB memory space and out of this a bit more than half is used by DataMemory and DiskPageBufferMemory.

AutomaticMemoryConfig computation will be affected by the settings of MaxNoOfTables, MaxNoOfOrderedIndexes, MaxNoOfUniqueHashIndexes, MaxNoOfAttributes, MaxNoOfTriggers and also the setting of TransactionMemory and SharedGlobalMemory.

NumCPUs is used to ensure that we use a small configuration when it comes to the number of threads if the test is started on a large machine with lots of data nodes (can be useful for testing).

Increase SendBufferMemory default#

To handle extreme loads on transporting large data segments we increase the SendBufferMemory to 8M per transporter. In addition we also make it possible for Send memory to grow 50% above its calculated area and thus use more of the SharedGlobalMemory resources when required.

We will never use more than 25% of the SharedGlobalMemory resources for this purpose.

Decouple Send thread mutex and Send thread buffers mutex#

One mutex is used to protect the list of free send buffers. This mutex is acquired when a block thread is acquiring send buffers. Each block thread is connected to a send thread. It acquires send buffers from the pool of send buffers connected to this send thread. Whenever sending occurs the send buffer is returned to the pool owned by the send thread. This means that a block thread will always acquire the buffers from its send thread buffers. But it will return job buffers to the pool connected to the transporter the message is sent to.

This means that imbalances can occur. This is fixed when seizing from the global list - if no send buffers are available we will try to "steal" from neighbour pools.

The actual pool of send buffers are protected by a mutex. This is not the same mutex that is used to protect the send thread. This send thread mutex is used to protect the list of transporters waiting to be served with sending.

However in the current implementation we hold the send thread mutex when releasing to the send buffer pool. This means that we hold two hot mutexes at the same time. This clearly is not good for scaling RonDB to large instances.

To avoid this we decouple the release_global call in assist_send_thread as well as the release_chunk call in run_send_thread from the send thread mutex.

The release_global call can simply wait to the end of the assist_send_thread call. We will normally not spend so much time in the assist_send_thread call. The call to release_chunk in the send thread can be integrated in handle_send_trp to avoid that we need to release and seize the send thread mutex an extra time in the send thread loop.

Improved send thread handling#

Block threads are very eager to assist with sending. In situation where we have large data nodes with more than 32 CPUs this will lead to a very busy sending. This leads to sending chunks of data that is too small. This leads both to contention on send mutexes and it leads to sending small messages that hurts the receivers and makes them less efficient.

To avoid this a more balanced approach on sending is required where block threads assist sending, but as load goes up the assistance becomes less urgent.

In this patch we use the implementation of MaxSendDelay and turn it into an adaptive algorithm. To create this adaptive algorithm we add another CPU load level called HIGH_LOAD. This means that we have LIGHT_LOAD up to 30% load, MEDIUM load is between 30% and 60%, HIGH LOAD is between 60% and 75% and above that we have OVERLOAD_LOAD.

We implement two new mechanisms, the first is max_send_delay. This is activated when threads starts reaching HIGH_LOAD. This means that when a block thread puts a transporter into the send queue, it will have to wait a number of microseconds before the send is allowed to occur. This gives the send a chance to gather other sends before the send is actually performed. The max send delay is larger for higher loads and for larger data nodes. It is never set higher than 200 microseconds.

The second mechanism is the minimum send delay. Each time we send and the min send delay is nonzero, we set the transporter in a mode where it has to wait to send even if a send is scheduled. This setting normally isn’t affecting things when max_send_delay is set. So it acts to secure against overloading the receivers when our load is still not at HIGH_LOAD. This is at most set to 40 microseconds.

In the past up to 4 block threads could be choosen to assist the send threads. This mechanism had a bug in resetting the activation. This is now fixed. In addition the new mechanisms is only providing for main and rep threads to assist the send threads. In addition the main and rep threads only do when the send threads are at a high load. Once we have entered this mode, we will wait to remove the assistance using a hysteresis. We have two levels of send thread support, the first adds the main thread and the second level adds also the rep thread. Neither of those will assist when they are reaching up to MEDIUM_LOAD level. This should be fairly uncommon for those threads except in some special scenarios.

By default main and rep threads are now not assisting send threads at all.

Block threads will only send to at most one transporter when reaching MEDIUM_LOAD, as before they will not assist send threads at all after reaching the OVERLOAD_LOAD level.

The main and rep threads are activated using the WAKEUP_THREAD_ORD signal, when this is sent the main/rep thread is setting nosend=0 and once every 50ms it is restored to nosend=1. Thus it is necessary to constantly send WAKEUP_THREAD_ORD signals to those threads to keep them waking up to assist send threads.

To handle this swapping between nosend 0 and 1, we introduce a m_nosend_tmp that is the one used to decide whether to perform assistance or not, the configured value is in the m_nosend variable.

To decide whether send thread assistance is required we modify the method to calculate send thread load such that it uses a parameter with number of milliseconds back we want to track. We use 200 milliseconds to decide on send thread assistance.

The configuration parameter MaxSendDelay is deprecated and no longer used. All of its functionality is now automated.

As a first step to remove the recover thread from disturbing the block threads we make sure that it never performs any send thread assistance. This assistance will disturb things, the recover threads still wake up and use about 0.1-0.2% of a CPU, this should also be removed eventually or handled in some other manner.

Also ensured that recover threads are not performing any spinning.

Handling send thread assistance until done is now only handled by main and rep threads. Thus we changed the setting of the pending_send variable.

Removed incrementing the m_overload_counter when calling set_max_delay. This should only be incremented when calling the set_overload_delay function.

Changes to automatic thread configuration#

Changed automatic thread configuration to use all CPUs that are available. The OS will spread operating system traffic and interrupt traffic on all CPUs anyways, so no specific need to avoid a set of CPUs.

For optimal behaviour it is possible to separate interrupts and OS processing to specific CPUs, combining this with the use of numactl will improve performance a bit more.

Improved send thread handling by integrating MaxSendDelay and automating it. Also ensured that we don’t constantly increase MaxSendDelay when more jobs arrive after putting another into the queue. An additional minimum send delay also to ensure that we don’t get too busy sending in low load scenarios.

Increment TCP send/receive buffer sizes#

Increment TCP send buffer size and receive buffer size in OS kernel to ensure that we can sustain high bandwidth in challenging setups.

Place pid-files in configured location#

In a security context it is sometimes necessary to place pid-files in a special location.

Make this possible by enabling RonDB to configure the Pid-filedirectory. The new parameter is called FileSystemPathPidfile and is settable on RonDB data nodes and management servers.

Support inactive nodes in configuration#

Support ACTIVATE NODE and DEACTIVATE NODE and change of hostname in a running cluster.

Previously setting NoOfReplicas was something that could not change. This feature doesn’t change that. Instead it makes it possible to set up your cluster such that it can grow to more replicas without reconfiguring your cluster.

The idea is that you set NoOfReplicas to 3 already in the initial config.ini. Even 4 is ok if desirable. This means that 3 Data nodes need to be configured. However only 1 of them is required to be properly defined. The other data nodes simply set the node id and set NodeActive=0. This indicates that the node is deactivated. Thus the node cannot be included into the cluster. All nodes in the cluster are aware that this node exists, but they do not allow it to connect since the node is deactivated.

This has the advantage that we don’t expect a node to start when we start another data node, we don’t expect it to start when waiting in the NDB API until the cluster is started.

Its has the additional advantage that we can temporarily increase the replication from e.g. 2 to 3. This could be valuable if we want to retain replication even in the presence of restarts, software changes or other reasons to decrease the replication level.

A node can be activated and deactivated by a command in the MGM client. It is a simple command:

4 activate

This command will activate node 4.

Here is the output from show for a cluster with 3 replicas, 2 of them active, it has the possibility to run with 2 MGM servers, but only one is active, it has the possibility to run with 4 API nodes, but only one of them is active:

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=3    @127.0.0.1  (RonDB-21.04.0, Nodegroup: 0, *)
id=4    @127.0.0.1  (RonDB-21.04.0, Nodegroup: 0)
id=5 (not connected, node is deactivated)

[ndb_mgmd(MGM)] 2 node(s)
id=1    @127.0.0.1  (RonDB-21.04.0)
id=2 (not connected, node is deactivated)

[mysqld(API)]   4 node(s)
id=6 (not connected, node is deactivated)
id=7 (not connected, node is deactivated)
id=8 (not connected, node is deactivated)
id=9 (not connected, accepting connect from any host)

Another important feature that is brought here is the possibility to change the hostname of a node. This makes it possible to move a node from one VM to another VM.

For API nodes this also makes it possible to make RonDB a bit safer. It makes it possible to configure RonDB with all API nodes only able to connect from a specific hostname. It is possible to have many more API nodes configured, but those can not be used until someone activates them. Before activating them one should then set the hostname of those nodes such that only the VM that is selected can connect to the cluster.

Thus with this feature it becomes possible to start with a cluster running all nodes on one VM (ndb_mgmd, ndbmtd and mysqld). Later as the need arises to have HA one adds more VMs and more replicas. Growing to more node groups as an online operation is already supported by RonDB. With this feature it is no longer required that all node groups at all times have the same level of replication. This is possible by having some nodes deactivated in certain node groups.

The benefit of different replication within different node groups can be used when reorganising the cluster or when one node is completely dead for a while, or we simply want to spend less money temporarily on the cluster.

When moving a node to a new VM there are two possibilities for this. One possibility is that we use cloud storage as disks, this means that the new VM can reuse the file system from the old VM, in this case a normal restart can be performed. Thus the node is first deactivated (includes stopping it), the new hostname is set for the new VM, the node is activated and finally the node is started again in a new VM. The other option is where the node is moved to a VM without access to the file system of the old VM, in this case the node need to be started using --initial. This ensures that the node gets the current version of its data from other live nodes in the RonDB cluster. This works for both RonDB data nodes as well as for RonDB management nodes.

It is necessary to start a node that has been recently activated using the --initial flag. If not used we could get into a situation where the nodes don’t recognize that they are part of the same cluster. This is particularly true for MGM server where it is a requirement that all MGM servers are alive when a configuration change is performed. With this change the requirement is for all active nodes to be alive for a configuration change to be performed. In this work it is assumed that 2 MGM servers is the maximum number of MGM servers in a cluster.

A new test program directory for this feature is created in the directory mysql-test/activate_ndb.

Running the script test_activate.sh in this directory will execute the test case for this test run. To perform this test it is necessary to set the PATH such that it points to the binary directory where the build placed the binaries. So e.g. if we created a build directory called debug_build and the git tree is placed in /home/mikael/activate_2365 we would set the path in the following manner: export PATH=/home/mikael/activate_2365/debug_build/bin:_dollar_PATH and next we will run the test_activate.sh script that will tell us if the test passed or not.

Support larger transactions#

In NDB large transactions can easily cause job buffer explosion or send buffer explosion. By introducing batching of abort and commit and complete requests we ensure that no buffer explosions occur. Also introduced CONTINUEB handling of releasing records belonging to a large transaction to sustain low latency even in the context of large transactions.

Previously Take over processing used 1 node operation at a time. This was safe against overload, but obviously can make large transactions take ridiculously long time. A transaction with 1 million operations in a 3 replica cluster would have to do 3 million round trips of which 2 million would be to remote nodes at least. Thus could easily take a couple of hours.

In this patch we instead ensure that we send a batch and when half of the batch has returned we start sending the next batch. Using batch sizes of around 1k, we ensure that we can commit very large transactions at least within a few seconds.

Much of the take over processing also used loops over all operations, this is changed to handle a batch per real-time break. Thus ensuring that other transactions are not badly affected by the large transactions.

To simplify code we don’t handle timeouts while we are still sending batches, similarly while we are transforming the transaction from the normal transaction handling to the take over variant.

The take over variant is currently always used when a TC thread takes over transactions from a failed node. It is also used when we have a timeout in the commit and complete processing.

We can later introduce other reasons to use this code. It sends COMMIT and COMPLETE messages to all participants which increases load on CPUs and networks, but it also decreases latency of commit operations.

Standardised such that all handling of timeouts used the take over processing path.

Introduced a new configuration variable LowLatency that is a boolean that defaults to off. If not set we get the normal commit behaviour with linear commit. When set we use normal commit that sends the commit to all nodes in parallel which should decrease latency at the expense of higher networking costs, both CPU wise and bandwidth-wise.

This affects the COMMIT phase and COMPLETE phase. It doesn’t affect the prepare phase. Using Low latency means that we will always wait with releasing the lock until the complete phase.

It is mainly useful in clusters with high latency and clusters with 3-4 replicas.

Improved error reporting for cluster failures#

The error 4009 has a short text Cluster Failure. However the error could be due to many different causes. To extend the possibility to troubleshoot this we extended those error messages to have more elaborate error messages and more error variants.

Also added more logging to the MySQL error log when nodes connect, disconnect, goes alive and goes dead to see why they are available and unavailable.

ARM64 support#

In RonDB 21.04 we introduce binary tarballs for ARM64 platforms. These versions have been tested and verifed on both Linux using Oracle Linux and on Mac OS X. The support for ARM64 is still in beta state. We will continue to add more test and verification of our ARM64 binaries.

Mac OS X support#

RonDB is developed on a mix of Mac OS X and Linux platforms. Thus it is natural to extend the support for RonDB on Mac OS X. What is new is that now this has also been extended to the newest ARM64 Macs.

WSL 2 support#

RonDB doesn’t support running directly on Windows. However with the release of Windows 11 and the release of WSL 2 we feel that the best way of running RonDB on Windows is to execute it with WSL 2, the new Linux subsystem on Windows.

We have added testing using both MTR and Autotest to Windows. The testing on Windows using WSL 2 have been a great success and using Remote Desktop it is possible to have Windows running the full test suite without manual intervention.

Two new ndbinfo tables to check memory usage#

Two new ndbinfo tables are created, ndb$table_map and ndb$table_memory_usage. The ndb$table_memory_usage lists four properties for all table replicas, in_memory_bytes (the number of bytes used by a table fragment replica in DataMemory), free_in_memory_bytes (the number of bytes free of the previous, these bytes are always in the variable sized part), disk_memory_bytes (the number of bytes in the disk columns, essentially the number of extents allocated to the table fragment replica times the size of the extents in the tablespace), free_disk_memory_bytes (number of bytes free in the disk memory for disk columns).

Since each table fragment replica provides one row we will use a GROUP BY on table id and fragment id and the MAX of those columns to ensure we only have one row per table fragment.

We want to provide the memory usage in-memory and in disk memory per table or per database. However a table in RonDB is spread out in several tables. There are four places a table can use memory. First the table itself uses memory for rows and for a hash index, when disk columns are used this table also makes use of disk memory. Second there are ordered indexes that use memory for the index information. Thirdly there are unique indexes that use memory for rows in the unique index (a unique index is simply a table with unique key as primary key and primary key as columns) and the hash index for the unique index table. This table is not necessarily colocated with the table itself. Finally there is also BLOB tables that can contain hash index, row storage and even disk memory usage.

The user isn’t particularly interested in this level of detail, so we want to display information about memory usage for tables and databases that the user sees. Thus we have to gather data for this, the tool to gather the data is the new ndbinfo table ndb$table_map, this table lists the table name and database name provided the table id, the table id can be the table id of a table, an ordered index, a unique index or a BLOB table, but will always present the name of the actual table defined by the user, not the name of the index table or BLOB table.

Using those two tables we create two ndbinfo views, the table_memory_usage listing the database name and table name and the above 4 properties for each table in the cluster. The second view, database_memory_usage lists the database name and the 4 properties summed over all table fragments in all tables created by RonDB for the user based on the BLOBs and indexes.

To make things a bit more efficient we keep track of all ordered indexes attached to a table internally in RonDB. Thus ndb$table_memory_usage will list memory usage of tables plus the ordered indexes on the table, there will be no rows presenting memory usage of an ordered index.

These two tables makes it easy for users to see how much memory they are using in a certain table or database. This is useful in managing a RonDB cluster.

Make it possible to use IPv4 sockets between ndbmtd and API nodes#

In MySQL NDB Cluster all sockets have been converted to use IPv6 format even when IPv4 is used. This led to that MySQL NDB Cluster no longer could interact with device drivers that only works using IPv4 sockets. This is the case for Dolphin SuperSockets.

Dolphin SuperSockets makes it possible to use extreme low latency HW in connecting the nodes in a cluster to improve latency significantly. RonDB has been tested and benchmarked using Dolphin SuperSockets.

Use of realtime prio in NDB API receive threads#

Experiments show that it removes a lot of variance in benchmarks, decreasing variance by a factor of 3. In a Sysbench Point select benchmark it improved performance by 20-25% while at the same time improving latency by 20%. One could also get the same performance at almost 4x lower latency (0.94 ms round trip time for a PK read compared to 0.25 ms after change).

RONDB-167: ClusterJ supporting setting database when retrieving Session object#

ClusterJ has been limited to handle only one database per cluster connection. This severely limits the usability of ClusterJ in cases where there are many databases such as in a multi-tenant use case for RonDB.

At the same time the common case is to handle only one database per cluster connection. Thus it is important to maintain the performance characteristics for this case.

One new addition to the public ClusterJ API is the addition of a new getSession call with a String object representing the database name to be used by the Session object. Once a Session object has been created it cannot change to another database. A session object can have a cache of DTO objects, this would be fairly useless when used with many different databases. Thus this isn’t supported in this implementation. The limitation this brings about is that a transaction is bound to a specific database.

We can cache sessions in RonDB, we have one linked list of cached session objects for the default database. Other databases create a linked list at first use of a database in a SessionFactory object. The limit on the amount of cached Session objects is maintained globally. Currently we simply avoid putting it back on the list if the maximum has been reached. An improvement could be to have a global order of the latest use of a Session object, this hasn’t been implemented here.

A Session object has a Db object that represents the Ndb object used by the Session. This Ndb object is bound to a specific database. For simplicity we store the database name and a boolean if the database is the default database. The database name could have been retrieved from the Ndb object as well. This database name in the Db object is used when retrieving an NdbRecord for the table.

ClusterJ handles NdbRecord in an optimised manner that tries to reuse them as much as possible. Previously it created on Ndb object together with an NdbDictionary object to handle NdbRecord creation. Now this dictionary is renamed and used for only the default database. Each new database will create one more Ndb object together with an NdbDictionary object. This object will handle all NdbRecord’s for that database. For quick finding of this object we use a ConcurrentHashMap using database name to find this NdbDictionary object.

Previously there was a ConcurrentHashMap for all NdbRecord’s, both for tables and for indexes. These used a naming scheme that was tableName only or tableName+indexName.

This map is kept, but now the naming scheme is either databaseName+tableName or databaseName+tableName+indexName.

Thus more entries are likely to be in the hash map, but it should not affect performance very much.

These maps are used to iterate over when unload schemas and when removing cached tables.

With multiple databases in a cluster connection the LRU list handling becomes more important to ensure that hot databases are more often cached than cold databases. Implemented a specific LRU list of Session objects in addition to a queue per database.

Added a few more test cases for multiple databases in ClusterJ. Added also more tests to handle caching of dynamic objects and caching of session objects.

Added support for running MTR with multiple versions of mysqld#

RONDB-169: Allow newer versions from 21.04 series to create tables recognized by older versions at least 21.04.9#

RONDB-171: Support setLimits for query.deletePersistentAll()#

This feature adds support for limit when using the deletePersistentAll method in ClusterJ.

deletePersistentAll on a Query object gives the possibility to delete all rows in a range or through a search condition. However a range could contain millions of rows, thus a limit is a good idea to avoid huge transactions.

RONDB-174: Move log message to debug when connecting to wrong node id using a fake hostname#

RONDB-184: Docker build changes#

New docker build files to build experimental ARM64 builds and fixes to the x86 Docker build files.

Fixes of the Jenkins build files.

Added Dockerfile with base image ubuntu:22.04 for experimental ARM64 builds
Using caching of downloads and builds within Dockerfiles
Bumped sysbench and dbt2 versions to accommodate for ARM64 building (build_bench.sh)
Dynamic naming of tarballs depending on building architecture (create_rondb_tarball.sh)
Placed docker-build.sh logic into Dockerfiles
Formatting of scripts

Removed a few printouts during restart that generated loads of printouts with little info#

Update versions for Sysbench and DBT2 in release script#

Updated to ensure that Sysbench and DBT2 works also on ARM64 platforms.

RONDB-199: Ensured that pid file contains PID of data node, not of angel#

The data node uses two processes - one is an angel process which is the first process started. This process is daemonized after which it is forked into another process which is the real data node process.

When interacting with environments like systemd it is easier to handle this if the pid file contains the pid file of the real data node process.

Stopping the angel process doesn’t stop the data node process. Thus keeping track of this PID isn’t of any great value.

Thus in using RonDB data nodes in combination it is recommended to not set StopOnError to 1 since this means that the angel will restart the ndbmtd no matter how it stopped. Thus it is better to set StopOnError=0 and to use the pid file (ndb_NODEID.pid in the same directory as the log files) to find the data node pid to stop or kill.

RonDB REST API Server#

This new feature is a major new open source contribution by the RonDB team. It has been in the works for almost a year and is now used by some of our customers.

The REST API Server has two variants. The first variant provides read access to tables in RonDB using primary key access. The reads can either read one row per request or use the batch variant that can read multiple rows from multiple tables in one request. This variant supports both access using REST and gRPC. The REST API is by default available on port 4406 and the gRPC is by default using port 5406.

The second variant is to use the Feature Store REST API. This interface is used by Hopsworks applications that want direct read access to Feature Groups in Hopsworks.

The REST API server described above is just a first version. We aim to extend it both in terms of improved performance, more functionality and also adding advanced features.

The binary for the REST API server is called rdrs and is found in the RonDB binary tarballs.

The documentation of the REST API server is found here.

The documentation of the REST API for the Feature Store is found here

RONDB-282: Parallel copy fragment process#

Copy fragment process have been limited to one fragment copy per LDM thread previously. The copy fragment is also limited by the parallelism such that at most 6000 words are allowed to be outstanding (a row counts for 56 words plus the row size in words).

This means that in particular initial node restart is fairly slow. Thus we improved the parallelism while at the same time maintaining protection against overload of the live node which could be very busy serving readers and writers of the databases.

Actually this parallelism is already implemented in DBDIH. This means that we can already send up to 64 parallel copy fragment requests to a node. However DBLQH imposes a limit of one at a time and queues the other requests.

This patch ensures that we can run up to 8 parallel copy fragment processes per LDM thread. We will get data from THRMAN about CPU load due to this, we will use this both to limit the number of concurrent copy fragment processes and also to limit the parallelism per copy fragment process.

One problem when synchronising a node and there is a lot of disk columns, is that we might get overload on the UNDO log. This would cause failure of the node restart if not avoided. To avoid this we need to be able to halt the copy fragment process and later resume it again.

This functionality was already implemented. However the implementation isn’t very likely to work. The new implementation is very simple. It sends a HALT_COPY_FRAG_REQ to the live node(s) performing the copying. It doesn’t send anything back to the starting node. It is the responsibility of the live node to halt in a proper manner. Similarly the starting node will send RESUME_COPY_FRAG_REQ when it is ok to copy again. Currently the halt will halt all copy fragment processes even if they don’t have any disk columns. It is likely that the disk columns tables are anyways the bottleneck for the node restart, so special handling wouldn’t make any major difference.

To handle up to 8 parallel copy fragment processes per LDM thread requires that we have reserved 8 operation records, there is a scan record in DBTUP, there is a scan lock record in DBTUP, there is a stored procedure record in DBTUP, there is a scan record in DBLQH and there is a scan lock record in DBACC, all of them requires 8 reserved records such that we don’t risk that the copy fragment processes runs out of memory.

A few tweaks were required to send COPY_FRAGREQ for NON_TRANSACCTIONAL use. This is part of restore fragment and is mainly used in initial node restart. Some tweaks were required here to ensure that we could start more in parallel.

Both TRANSACTIONAL and NON_TRANSACTIONAL COPY_FRAGREQ required limitations when running against an older node starting up.

A very important part of the work is to also adapt the speed of node recovery based on the CPU usage in the live node. We need to slow down when the CPU gets heavily used.

The speed of node recovery was previously limited to having 24 kBytes outstanding where 0 bytes in a row was counted as 224 bytes. This means effectively less than 50 rows at a time per thread was outstanding. We increased this substantially to now go up to 192 kBytes instead. Thus around 400 small rows can be outstanding. This should provide a good balance between latency of operations towards the live node and the speed of node recovery.

The CPUs have become much faster and on some fast CPUs adding microseconds wasn’t good enough, we had to raise the level to handle nanoseconds. This had an impact on our measurements of CPU load which is a very important part of this patch for adaptive speed of node recovery.

As part of this patch we also disable ACC scans (previously done in 22.10 branch).

RONDB-364: Add service-name parameter to ndb_mgmd and ndbmtd#

This parameter adds a --service-name parameter to ndb_mgmd and ndbmtd. E.g. --service-name=ndbmtd sets the file name of the pid file to ndbmtd.pid and node log to ndbmtd_out.log and similarly for trace files and other log files and the error file. Also sets the directory name of the NDB file system to service_name_fs.