Release Notes RonDB 21.04.0#
RonDB 21.04.0 is the first RonDB release. It is based on MySQL NDB Cluster 8.0.23.
RonDB 21.04.0 is released as open source software with a binary tarball for Linux. It is developed on Linux and Mac OS X and is occasionally tested on FreeBSD.
There are four ways to use RonDB 21.04.0:
1) You can use the managed version available on hopsworks.ai. Given a few details about the HW resources to use, it sets up a RonDB cluster. The RonDB cluster is integrated with Hopsworks and can be used for both RonDB applications and Hopsworks applications. Currently AWS is supported; Azure support will be available soon.
2) You can use the cloud scripts that let you easily set up a cluster on Azure or GCP. This requires no previous knowledge of RonDB; the script only needs a description of the HW resources to use and the rest of the setup is automated.
3) You can use the open source version and use the binary tarball and set it up yourself.
4) You can use the open source version, build it and set it up yourself.
RonDB 21.04.0 will be maintained for at least 6 months, until RonDB 21.10.0 is released. Most likely 21.04.0 will be maintained a bit longer than 6 months. Maintaining 21.04.0 mainly means fixing critical bugs and handling minor change requests. It doesn't involve merging with any future release of MySQL NDB Cluster; this will be handled in newer RonDB releases.
Summary of changes in RonDB 21.04.0#
Automated Thread Configuration#
Thread configuration defaults to automated configuration of threads based on the number of CPUs. This feature was introduced in NDB 8.0.23, but in RonDB it is the default behaviour. A number of bugs in this area have been found and fixed. RonDB will make use of all CPUs it gets access to and also handles CPU locking automatically. It is possible to set NumCPUs to a specific number if one wants to override this default. It is also possible to use RonDB in a backwards compatible manner, although this isn't recommended.
Automated Memory Configuration#
NDB has historically required setting a large number of configuration parameters to set memory sizes of various memory pools. In RonDB 21.04.0 the default behaviour is that no configuration parameter is required. RonDB data nodes will use the entire memory available in the server/VM. One can limit the amount of memory it can use with a new configuration parameter TotalMemoryConfig.
Automated CPU spinning#
NDB 8.0.23 contained support for CPU spinning. RonDB 21.04.0 defaults to using LatencyOptimisedSpinning. Fixed a number of bugs in this area.
Improved networking through send thread handling#
RonDB 21.04.0 has made improvements to the send thread handling making it less aggressive and also automatically activating send delays at high loads.
Configurable number of replicas#
RonDB 21.04.0 solves another configuration problem from NDB 8.0.23. In NDB 8.0.23 and earlier releases of NDB one could only set NoOfReplicas at the initial start of the cluster. The only method to increase or decrease replication level was to perform a backup and restore.
In RonDB 21.04.0 one can set NoOfReplicas=3 even if one wants to run with a single replica. It is necessary to define the extra nodes in the configuration, but by setting the configuration parameter NodeActive=0 on those nodes, they are not really part of the cluster until they get activated.
To support this feature three new commands have been added in the management client.
2 ACTIVATE
2 HOSTNAME 192.168.1.113
2 DEACTIVATE
The ACTIVATE and DEACTIVATE commands above activate and deactivate node 2, and the HOSTNAME command changes the hostname of node 2 to 192.168.1.113. A node must be inactive for its hostname to be changed.
This feature also makes it very easy to move a node from one hostname to another hostname.
3x improvement in ClusterJ API#
A new addition to the ClusterJ API releases data objects and Session objects to a cache rather than releasing them fully. In addition, the garbage collection handling in ClusterJ was improved. These improvements led to a 3x improvement in a simple key lookup benchmark.
A total of 17 bugs were fixed in this release.
Improved performance of MySQL Server and Data nodes#
Integrated benchmarking tools in RonDB binary distribution#
In the RonDB binary tarball we have integrated a number of benchmark tools to assess performance of RonDB.
Currently Sysbench and DBT2 are fully supported, while flexAsynch and DBT3 are included in the binary distribution. We intend to add ClusterJ benchmarks later.
Sysbench is a simple SQL benchmark that is very flexible and can be used to benchmark a combination of simple primary key lookup queries, scan queries and write queries.
DBT2 is an open source variant of TPC-C, a benchmark for OLTP applications. It uses ndb_import for data loading, so can also be used to see how RonDB handles data loading.
DBT3 is an open source variant of TPC-H, a benchmark for analytical queries.
flexAsynch is an NDB benchmark using the C++ NDB API. It tests the performance of key lookups to an extreme level.
Integrated scripts to set up a RonDB cluster in Azure or GCP.
Access to an early version of the RonDB managed version which runs RonDB in AWS.
Detailed description of changes in RonDB 21.04.0#
Improved report of use of memory resources in ndbinfo.resources table#
Added a new row to the ndbinfo table resources which reports TOTAL_GLOBAL_MEMORY. This is the sum of memory resources managed by the shared global memory that contains
Schema transaction memory
Shared Global Memory
3x improvement of ClusterJ performance#
When running applications that create and release large numbers of new instances using the session.newInstance and session.release interface, scalability is hampered dramatically and a lot of overhead is created for the application.
In a simple benchmark where one runs one batch of primary key lookups at a time, each batch containing 380 lookups, the microbenchmark using standard ClusterJ can only scale to 3 threads and then handling a bit more than 1000 batches per second.
Using the BPF tool it was clear that more than a quarter of the CPU time was spent in the call to session.newInstance. ClusterJ uses Java Reflection to handle dynamic class objects.
A very simple patch of the microbenchmark showed that this overhead and scalability hog could quite easily be removed by putting the objects into a cache inside the thread instead of calling the session.release. This simple change made the microbenchmark scale to 12 threads and reach 3700 batches per second.
The microbenchmark was executed on an AMD workstation using a cluster with 2 data nodes with 2 LDM threads in each and 1 ClusterJ application.
This feature moves this cache of objects into the Session object. The session object is by design single-threaded, so no extra mutex protections are required as using a Session object from multiple threads at the same time is an application bug.
It is required to maintain one cache per Class. A new configuration parameter com.mysql.clusterj.max.cached.instances was added where one can set the maximum number of cached objects per session object.
We maintain a global linked list of the age of objects in the cache, independent of their Class. If the cache is full and we need to store an object in it, we will only do so if the oldest object has gone unused for long enough. Each put into the cache increases the age by 1. If the age of the oldest object is higher than 4 * com.mysql.clusterj.max.cached.instances, this object is replaced; otherwise we simply release the new object.
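The replacement rule above can be sketched as a small model. This is only an illustration written in Python, not ClusterJ code: the class name `InstanceCache`, its methods, and the choice to measure age as "puts since the entry was cached" are assumptions made for the sketch.

```python
from collections import OrderedDict


class InstanceCache:
    """Toy model of an age-limited instance cache (hypothetical, not ClusterJ code)."""

    def __init__(self, max_cached_instances):
        self.max_cached = max_cached_instances
        self.clock = 0                # advanced by one on every put attempt
        self.entries = OrderedDict()  # id(obj) -> (cls, obj, stamp), oldest first

    def put(self, obj):
        """Return True if obj was cached, False if the caller should release it."""
        if self.max_cached <= 0:
            return False
        self.clock += 1
        if len(self.entries) >= self.max_cached:
            _, _, oldest_stamp = next(iter(self.entries.values()))
            if self.clock - oldest_stamp <= 4 * self.max_cached:
                return False          # oldest entry is still young: release obj instead
            self.entries.popitem(last=False)  # evict the stale oldest entry
        self.entries[id(obj)] = (type(obj), obj, self.clock)
        return True

    def get(self, cls):
        """Return a cached instance of cls, or None on a cache miss."""
        for key, (cached_cls, obj, _) in self.entries.items():
            if cached_cls is cls:
                del self.entries[key]
                return obj
        return None
```

With a limit of 2, the first two puts are cached, the following puts are rejected while the oldest entry is young, and only once its age exceeds 8 does a new object replace it.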
From benchmark runs it is clear that it's necessary to use a full caching of hot objects to get the performance advantage.
The application controls the caching by calling releaseCache(Class<?>) to cache the object. The release call will always release the object.
In addition the application can call session.dropCacheInstance(Class<?>) to drop all cached objects of a certain Class. It can also call session.dropCacheInstance() to drop all cached objects.
In addition the application can also cache session objects if it adds and drops those at rapid rates by using the call session.closeCache(). If one wants to clear the cache before placing the session object into the cache one can use session.closeCache(true).
Improvements in ClusterJ to avoid single-threaded garbage collection#
Create RonDB tree and 21.04 branch#
Renamed various parts in the programs to use RonDB and its version number.
Changes of defaults in RonDB#
Changed defaults for
ClassicFragmentation: true -> false
SpinMethod: StaticSpinning -> LatencyOptimisedSpinning
MinDiskWriteSpeed: 10M -> 2M
MaxDiskWriteSpeed: 20M -> 4M
AutomaticThreadConfig: false -> true
Setting ClassicFragmentation to true and AutomaticThreadConfig to false provides backwards compatibility with NDB Cluster. RonDB focuses on new users and thus uses the most modern setup as the default configuration.
We set the default for spinning to LatencyOptimisedSpinning since RonDB is focused on Cloud installations. This will give the best latency when possible with very little extra use of power.
We decrease the setting of MinDiskWriteSpeed and MaxDiskWriteSpeed since EnablePartialLcp and EnableRedoControl will both ensure that we will always write enough LCP, so we decrease the minimum disk write speed to decrease the overhead in low usage scenarios.
Introduced new configuration variables:
AutomaticMemoryConfig
TotalMemoryConfig
AutomaticMemoryConfig is set by default. By default the memory size is retrieved from the OS. Setting TotalMemoryConfig overrides this. AutomaticMemoryConfig requires a minimum of 8GB of memory available.
When this is set we change the defaults of SchemaMemory to the following:
In this mode the default action is to get the memory from the OS using the NdbHW module. One can also set it using a new config variable TotalMemoryConfig. This is mainly intended for testing.
The default will work very well when starting an NDB data node in a cloud setting. In this case the data node will use the entire VM's memory and CPU resources, since AutomaticThreadConfig is also the default.
Thus with this setting it is not necessary to set anything, since even DataMemory and DiskPageBufferMemory are set based on the total memory available.
The way it works is that the following parameters now have a 0 default. 0 means that it is automatically calculated.
RedoBuffer will get 32 MByte per LDM
UndoBuffer will get 32 MByte per LDM
LongMessageBuffer will get 12 MByte per thread used, but the first thread will get 32 MByte.
TransactionMemory will get 300 MByte + 60 MByte per thread
SharedGlobalMemory will get 700 MByte + 60 MByte per thread
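As a rough illustration, the sizing rules in the list above can be written as a small calculation. This is a sketch, not RonDB code: the function name and the dictionary layout are hypothetical, the per-LDM buffers are shown as node totals, and it assumes LongMessageBuffer means 32 MByte for the first thread and 12 MByte for every additional one. Which threads count as "threads" is not stated in the notes and is left to the caller.

```python
def automatic_buffer_defaults_mb(num_ldm_threads, num_threads):
    """Return the automatically calculated sizes in MByte (illustrative only)."""
    if num_ldm_threads < 1 or num_threads < 1:
        raise ValueError("at least one LDM and one thread are assumed")
    return {
        "RedoBuffer": 32 * num_ldm_threads,           # 32 MByte per LDM
        "UndoBuffer": 32 * num_ldm_threads,           # 32 MByte per LDM
        "LongMessageBuffer": 32 + 12 * (num_threads - 1),
        "TransactionMemory": 300 + 60 * num_threads,
        "SharedGlobalMemory": 700 + 60 * num_threads,
    }


sizes = automatic_buffer_defaults_mb(num_ldm_threads=4, num_threads=10)
```

For 4 LDM threads and 10 threads in total this gives 128 MByte each for the REDO and UNDO buffers, 140 MByte for LongMessageBuffer, 900 MByte for TransactionMemory and 1300 MByte for SharedGlobalMemory.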
We avoid using memory that is required by the OS. We avoid using 1800 MByte plus 100 MByte for each block thread.
SchemaMemory is around 1 GB in a small cluster and increases with increasing cluster size and increasing node sizes.
We will compute the use of:
Packed Signal Memory
OS memory usage
Send Buffer Memory
Shared Global Memory
Static memory (in various objects)
Backup Page Memory
Also Page_entry objects used by DiskPageBufferMemory are taken into account.
After computing these buffers, the schema memory, various internal memory structures and the memory we want to leave unused so that the OS also has access to some memory, the remaining memory is for DataMemory and DiskPageBufferMemory. 90% of the remaining memory is given to DataMemory and 10% to DiskPageBufferMemory.
In addition we have changed the default of MinFreePct to 10%. This provides some possibility to increase the sizes of buffers if necessary. It can also be decreased to 5 and possibly even to 3-4 to make more memory available for database rows.
One can also set DiskPageBufferMemory to increase or decrease its use. This will have a direct impact on the memory available for DataMemory.
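The final step of the calculation can be sketched as follows. The function and parameter names are hypothetical, and `reserved_mb` stands for everything computed before the split (buffers, schema memory, internal structures and OS overhead). The sketch assumes that an explicitly set DiskPageBufferMemory is simply subtracted from the remainder.

```python
def split_remaining_memory_mb(total_mb, reserved_mb, disk_page_buffer_mb=None):
    """Return (DataMemory, DiskPageBufferMemory) in MByte (illustrative only)."""
    remaining = total_mb - reserved_mb
    if remaining <= 0:
        raise ValueError("nothing left for DataMemory and DiskPageBufferMemory")
    if disk_page_buffer_mb is None:
        disk_page_buffer_mb = remaining // 10  # default: 10% of the remainder
    if disk_page_buffer_mb >= remaining:
        raise ValueError("DiskPageBufferMemory leaves no room for DataMemory")
    return remaining - disk_page_buffer_mb, disk_page_buffer_mb


# Default split, then the same node with DiskPageBufferMemory set by hand.
default_split = split_remaining_memory_mb(10000, 4000)
manual_split = split_remaining_memory_mb(10000, 4000, disk_page_buffer_mb=1000)
```

With 6000 MByte remaining, the default split gives 5400 MByte to DataMemory and 600 MByte to DiskPageBufferMemory; raising DiskPageBufferMemory to 1000 MByte leaves 5000 MByte for DataMemory.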
Removed a serious bug in Dblqh where DEBUG_FRAGMENT_LOCK was set. This increased the size of each fragment record by 4kB.
Increased the hash size for foreign keys in Dbtc to 4096 from 16.
Ensured that UndoBuffer configuration parameter is used if defined or if using automatic memory configuration when defining the UNDO log. The memory is added to TransactionMemory and thus can be used for this purpose if no Undo log is created.
Fixed a problem with calculating hw_memory_size on Linux. Fine tuned some memory parameters, added a bit more constant overhead to OS, but decreased overhead per thread.
Fixed a problem with running with 1 thread for LongMessageBuffer.
Fixed some review comments.
Fixed compute_os_memory based on OS overhead in a cloud VM.
Increase default size of REDO log buffer to be 64M per LDM thread to better handle intensive writes, particularly during load phases.
Improved automated configuration of threads after some feedback on benchmark executions. This feedback showed that as much as possible hard-working threads should not share CPU cores. Also more tc threads are needed.
Removed use of overhead CPUs up to 17 CPUs. This means that only when we have 18 or more CPUs will we set CPUs aside for overhead handling. Overhead handling is mainly intended for the IO threads.
Fixed a compiler warning about too small buffer for version string.
The previous automatic memory configuration provided around 800M free memory space in a 16 CPU machine with almost 128 GB of memory. This is a bit too aggressive, so added 200M to free space + added 35M extra per thread, thus changing the scaling from 15M per thread to 50M per thread in OS overhead.
This is mostly a security precaution to avoid risking running out of memory and experiencing any type of swapping.
Needed to reserve even more space on large VMs, added 1% of memory reserved and another 50 MByte per thread.
ClusterJ lacked support for Primary keys using Date columns#
Removed this limitation by removing the check that rejected Date columns in primary keys. In addition this required adding a new object to handle Date as a primary key, based on the handling of Date in non-key columns.
Required also a new test case in the clusterj testsuite. This required fairly significant rework since the current testsuite was very much focused on testing all sorts of column types, but not very much focused on testing of different data types in primary key columns. This meant that most of the methods had to be changed a bit and the whole idea with reuse of code didn't work so well.
To support minimal configurations down to 2GB memory space we made the following changes:
Make it possible to set TotalMemoryConfig down to 2GB
Remove OS overhead when setting TotalMemoryConfig
Using the following settings:
We will be able to fit nicely into 3GB memory space and out of this a bit more than half is used by DataMemory and DiskPageBufferMemory.
AutomaticMemoryConfig computation will be affected by the settings of MaxNoOfTables, MaxNoOfOrderedIndexes, MaxNoOfUniqueHashIndexes, MaxNoOfAttributes, MaxNoOfTriggers and also the setting of TransactionMemory and SharedGlobalMemory.
NumCPUs is used to ensure that we use a small configuration when it comes to the number of threads if the test is started on a large machine with lots of data nodes (can be useful for testing).
To handle extreme loads on transporting large data segments we increase the SendBufferMemory to 8M per transporter. In addition we also make it possible for Send memory to grow 50% above its calculated area and thus use more of the SharedGlobalMemory resources when required.
We will never use more than 25% of the SharedGlobalMemory resources for this purpose.
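One way to read these limits is sketched below. The function name is hypothetical, and the interpretation (the 50% growth is borrowed from SharedGlobalMemory and capped at 25% of it) is an assumption, not a description of the actual RonDB code.

```python
def send_memory_limits_mb(num_transporters, shared_global_memory_mb):
    """Return (calculated, maximum) send memory in MByte (illustrative only)."""
    calculated = 8 * num_transporters              # 8 MByte per transporter
    growth = calculated // 2                       # may grow 50% above this
    borrow_cap = shared_global_memory_mb // 4      # at most 25% of SharedGlobalMemory
    return calculated, calculated + min(growth, borrow_cap)


small = send_memory_limits_mb(10, 1000)
large = send_memory_limits_mb(100, 1000)
```

With 10 transporters and 1000 MByte of SharedGlobalMemory the send memory can grow from 80 to 120 MByte. With 100 transporters the 25% cap applies, so it can grow from 800 to 1050 MByte rather than 1200 MByte.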
Decouple Send thread mutex and Send thread buffers mutex#
One mutex is used to protect the list of free send buffers. This mutex is acquired when a block thread is acquiring send buffers. Each block thread is connected to a send thread. It acquires send buffers from the pool of send buffers connected to this send thread. Whenever sending occurs the send buffer is returned to the pool owned by the send thread. This means that a block thread will always acquire the buffers from its send thread buffers. But it will return job buffers to the pool connected to the transporter the message is sent to.
This means that imbalances can occur. This imbalance is fixed when seizing from the global list: if no send buffers are available we will try to "steal" from neighbour pools.
The actual pool of send buffers is protected by a mutex. This is not the same mutex that is used to protect the send thread. The send thread mutex is used to protect the list of transporters waiting to be served with sending.
However in the current implementation we hold the send thread mutex when releasing to the send buffer pool. This means that we hold two hot mutexes at the same time. This clearly is not good for scaling RonDB to large instances. To avoid this we decouple the release_global call in assist_send_thread as well as the release_chunk call in run_send_thread from the send thread mutex.
The release_global call can simply wait to the end of the assist_send_thread call. We will normally not spend so much time in the assist_send_thread call. The call to release_chunk in the send thread can be integrated in handle_send_trp to avoid that we need to release and seize the send thread mutex an extra time in the send thread loop.
Improved send thread handling#
Block threads are very eager to assist with sending. In situations where we have large data nodes with more than 32 CPUs this leads to very busy sending, where the chunks of data sent are too small. This causes contention on send mutexes, and the small messages hurt the receivers and make them less efficient.
To avoid this a more balanced approach on sending is required where block threads assist sending, but as load goes up the assistance becomes less urgent.
In this patch we take the implementation of MaxSendDelay and turn it into an adaptive algorithm. To create this adaptive algorithm we add another CPU load level called HIGH_LOAD. This means that we have LIGHT_LOAD up to 30% load, MEDIUM_LOAD between 30% and 60%, HIGH_LOAD between 60% and 75%, and OVERLOAD_LOAD above that.
We implement two new mechanisms. The first is max_send_delay. This is activated when threads start reaching HIGH_LOAD. This means that when a block thread puts a transporter into the send queue, it will have to wait a number of microseconds before the send is allowed to occur. This gives the send a chance to gather other sends before the send is actually performed. The max send delay is larger for higher loads and for larger data nodes. It is never set higher than 200 microseconds.
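The four load levels can be illustrated with a small classifier. This is a sketch only: the function name is hypothetical, and the notes do not say which level the exact boundary values belong to, so the sketch assumes each boundary belongs to the lower level.

```python
def classify_cpu_load(load_percent):
    """Map a CPU load percentage to one of the four load levels (illustrative only)."""
    if not 0 <= load_percent <= 100:
        raise ValueError("load must be between 0 and 100")
    if load_percent <= 30:
        return "LIGHT_LOAD"
    if load_percent <= 60:
        return "MEDIUM_LOAD"
    if load_percent <= 75:
        return "HIGH_LOAD"
    return "OVERLOAD_LOAD"


levels = [classify_cpu_load(p) for p in (10, 45, 70, 90)]
```

The four sample loads fall into LIGHT_LOAD, MEDIUM_LOAD, HIGH_LOAD and OVERLOAD_LOAD respectively.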
The second mechanism is the minimum send delay. Each time we send and the min send delay is nonzero, we set the transporter in a mode where it has to wait to send even if a send is scheduled. This setting normally isn't affecting things when max_send_delay is set. So it acts to secure against overloading the receivers when our load is still not at HIGH_LOAD. This is at most set to 40 microseconds.
In the past up to 4 block threads could be chosen to assist the send threads. This mechanism had a bug in resetting the activation, which is now fixed. In addition, the new mechanism only allows the main and rep threads to assist the send threads, and they only do so when the send threads are at a high load. Once we have entered this mode, we remove the assistance only after a hysteresis. We have two levels of send thread support: the first adds the main thread and the second also adds the rep thread. Neither of those will assist once it reaches the MEDIUM_LOAD level itself. This should be fairly uncommon for those threads except in some special scenarios.
By default main and rep threads are now not assisting send threads at all.
Block threads will only send to at most one transporter when reaching MEDIUM_LOAD, as before they will not assist send threads at all after reaching the OVERLOAD_LOAD level.
The main and rep threads are activated using the WAKEUP_THREAD_ORD signal. When this is sent, the main/rep thread sets nosend=0, and once every 50ms it is restored to nosend=1. Thus it is necessary to constantly send WAKEUP_THREAD_ORD signals to those threads to keep them waking up to assist send threads.
To handle this swapping between nosend 0 and 1, we introduce a m_nosend_tmp that is the one used to decide whether to perform assistance or not, the configured value is in the m_nosend variable.
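The interplay between the configured value and the temporary value can be modelled in a few lines. This is a simplified sketch, not the RonDB implementation: the class and method names are hypothetical, time is passed in explicitly, and only the names m_nosend, m_nosend_tmp and WAKEUP_THREAD_ORD come from the text above.

```python
class SendAssistState:
    """Toy model of nosend switching for a main or rep thread (illustrative only)."""

    RESTORE_INTERVAL_MS = 50

    def __init__(self, configured_nosend=1):
        self.m_nosend = configured_nosend      # configured value, never toggled
        self.m_nosend_tmp = configured_nosend  # value consulted when deciding to assist
        self.last_restore_ms = 0

    def on_wakeup_thread_ord(self):
        # The signal temporarily enables send thread assistance.
        self.m_nosend_tmp = 0

    def on_timer(self, now_ms):
        # Every 50 ms the temporary value falls back to the configured one.
        if now_ms - self.last_restore_ms >= self.RESTORE_INTERVAL_MS:
            self.m_nosend_tmp = self.m_nosend
            self.last_restore_ms = now_ms

    def may_assist_send_threads(self):
        return self.m_nosend_tmp == 0
```

In this model a thread assists only between a wakeup signal and the next 50 ms restore, which is why the signal has to be resent for as long as assistance is wanted.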
To decide whether send thread assistance is required we modify the method to calculate send thread load such that it uses a parameter with number of milliseconds back we want to track. We use 200 milliseconds to decide on send thread assistance.
The configuration parameter MaxSendDelay is deprecated and no longer used. All of its functionality is now automated.
As a first step towards stopping the recover threads from disturbing the block threads, we make sure that they never perform any send thread assistance, since this assistance disturbs things. The recover threads still wake up and use about 0.1-0.2% of a CPU. This should also be removed eventually or handled in some other manner.
Also ensured that recover threads are not performing any spinning.
Handling send thread assistance until done is now only handled by main and rep threads. Thus we changed the setting of the pending_send variable.
Removed incrementing the m_overload_counter when calling set_max_delay. This should only be incremented when calling the set_overload_delay function.
Added some comments on this new functionality.
Fixed some problems with lagging timers. Important to perform do_send also when this happens.
Changed automatic thread configuration to use all CPUs that are available. The OS will spread operating system traffic and interrupt traffic on all CPUs anyways, so no specific need to avoid a set of CPUs.
For optimal behaviour it is possible to separate interrupts and OS processing to specific CPUs, combining this with the use of numactl will improve performance a bit more.
Improved send thread handling by integrating MaxSendDelay and automating it. Also ensured that we don't constantly increase MaxSendDelay when more jobs arrive after putting another into the queue. Added minimum send delay also to ensure that we don't get too busy sending in low load scenarios.
Increment TCP send buffer size and receive buffer size in OS kernel to ensure that we can sustain high bandwidth in challenging setups.
In a security context it is sometimes necessary to place pid-files in a special location.
Make this possible by enabling RonDB to configure the pid-file directory. The new parameter is called FileSystemPathPidfile and can be set on RonDB data nodes and management servers.
Support ACTIVATE NODE and DEACTIVATE NODE and change of hostname in a running cluster.
Previously the setting of NoOfReplicas could not be changed. This feature doesn't change that. Instead it makes it possible to set up your cluster such that it can grow to more replicas without reconfiguring your cluster.
The idea is that you set NoOfReplicas to 3 already in the initial config.ini. Even 4 is ok if desirable. This means that 3 Data nodes need to be configured. However only 1 of them is required to be properly defined. The other data nodes simply set the node id and set NodeActive=0. This indicates that the node is deactivated. Thus the node isn't possible to include into the cluster. All nodes in the cluster are aware that this node exists, but they do not allow it to connect since the node is deactivated.
This has the advantage that we don't expect a node to start when we start another data node, we don't expect it to start when waiting in the NDB API until the cluster is started.
This has the added advantage that we can temporarily increase the replication from e.g. 2 to 3. This could be valuable if we want to retain replication even in the presence of restarts, software changes or other reasons to decrease the replication level.
A node can be activated and deactivated by a command in the MGM client. It is a simple command:
This command will activate node 4.
Here is the output from show for a cluster with 3 replicas, 2 of them active, it has the possibility to run with 2 MGM servers, but only one is active, it has the possibility to run with 4 API nodes, but only one of them is active:
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 3 node(s)
id=3 @127.0.0.1 (RonDB-21.04.0, Nodegroup: 0, *)
id=4 @127.0.0.1 (RonDB-21.04.0, Nodegroup: 0)
id=5 (not connected, node is deactivated)
[ndb_mgmd(MGM)] 2 node(s)
id=1 @127.0.0.1 (RonDB-21.04.0)
id=2 (not connected, node is deactivated)
[mysqld(API)] 4 node(s)
id=6 (not connected, node is deactivated)
id=7 (not connected, node is deactivated)
id=8 (not connected, node is deactivated)
id=9 (not connected, accepting connect from any host)
Another important feature that is brought here is the possibility to change the hostname of a node. This makes it possible to move a node from one VM to another VM.
For API nodes this also makes it possible to make RonDB a bit safer. It makes it possible to configure RonDB with all API nodes only able to connect from a specific hostname. It is possible to have many more API nodes configured, but those can not be used until someone activates them. Before activating them one should then set the hostname of those nodes such that only the VM that is selected can connect to the cluster.
Thus with this feature it becomes possible to start with a cluster running all nodes on one VM (ndb_mgmd, ndbmtd and mysqld). Later as the need arises to have HA one adds more VMs and more replicas. Growing to more node groups as an online operation is already supported by RonDB. With this feature it is no longer required that all node groups at all times have the same level of replication. This is possible by having some nodes deactivated in certain node groups.
The benefit of different replication within different node groups can be used when reorganising the cluster or when one node is completely dead for a while, or we simply want to spend less money temporarily on the cluster.
When moving a node to a new VM there are two possibilities. One possibility is that we use cloud storage as disks. This means that the new VM can reuse the file system from the old VM, and in this case a normal restart can be performed. Thus the node is first deactivated (which includes stopping it), the new hostname is set for the new VM, the node is activated and finally the node is started again in the new VM. The other option is that the node is moved to a VM without access to the file system of the old VM. In this case the node needs to be started using --initial. This ensures that the node gets the current version of its data from other live nodes in the RonDB cluster. This works for both RonDB data nodes and RonDB management nodes.
It is necessary to start a node that has been recently activated using the --initial flag. If not used we could get into a situation where the nodes don't recognize that they are part of the same cluster. This is particularly true for MGM server where it is a requirement that all MGM servers are alive when a configuration change is performed. With this change the requirement is for all active nodes to be alive for a configuration change to be performed. In this work it is assumed that 2 MGM servers is the maximum number of MGM servers in a cluster.
A new test program directory for this feature is created in the directory mysql-test/activate_ndb.
Running the script test_activate.sh in this directory will execute the test case for this feature. To perform this test it is necessary to set the PATH such that it points to the binary directory where the build placed the binaries. For example, if we created a build directory called debug_build and the git tree is placed in /home/mikael/activate_2365, we would set the path in the following manner:
export PATH=/home/mikael/activate_2365/debug_build/bin:$PATH
Next we run the test_activate.sh script, which tells us whether the test passed or not.
Fix a bug where pgman.cpp missed initialising the variable m_time_track_get_page which is an array.
Fix a bug which meant that setting TransactionMemory prevented setting of MaxNoOfConcurrentOperations and MaxNoOfConcurrentScans. At the same time it didn't prevent setting MaxNoOfLocalScans and MaxNoOfLocalOperations.
Fix such that SpinMethod config parameter actually works. Used wrong check in native_strcasecmp and set m_configured_spintime from SchedulerSpintimer even when SpinMethod is set.
In the code to setup automatic thread config the code works to find the correct CPUs to bind to. However missing is the setting of m_bind_type which is what the init_thread code looks for to discover whether to actually perform the binding.
Needed this to be added to all thread types.
When a log queue entry is to be executed, it uses the method writePrepareLog. This method does however simply put it back in the log queue. This method is not designed to write from the REDO log queue.
This can be simply fixed by understanding that one arrives in this method with log part locked and that the logPartState is Active.
Currently there is an adaptive algorithm to decide how many objects to keep in cache. This algorithm will keep too many objects if there is occasional usage where a lot of operations are part of a transaction.
The solution is to never keep more than 128 objects in each free object list in the Ndb object.
Sent erroneous CONTINUEB after REDO log problem. Missed setting signal object before sendSignal and missed unlocking log part before returning.
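The effect of such a cap can be shown with a minimal free list model. This is an illustration in Python, not the NDB API implementation: the class and method names are hypothetical, and only the limit of 128 comes from the text above.

```python
class CappedFreeList:
    """Toy free object list that never keeps more than a fixed number of objects."""

    MAX_FREE_OBJECTS = 128

    def __init__(self, factory):
        self.factory = factory  # callable that creates a new object
        self.free = []

    def seize(self):
        # Reuse a cached object when possible, otherwise create a new one.
        return self.free.pop() if self.free else self.factory()

    def release(self, obj):
        # Keep the object for reuse only while the list is below the cap.
        if len(self.free) < self.MAX_FREE_OBJECTS:
            self.free.append(obj)
            return True
        return False  # caller lets the object go


free_list = CappedFreeList(dict)
big_transaction = [free_list.seize() for _ in range(1000)]
kept = sum(free_list.release(obj) for obj in big_transaction)
```

After one large transaction that used 1000 objects, only 128 stay cached, so an occasional burst no longer pins a large amount of memory.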
When executing a queued REDO log we call get_table_frag_record. This call is used to get the instance number and actually can fail if the fragment record doesn't belong to the owner of the REDO log. So the ndbrequire is not correct. Actually better to separate this call into a new function get_table_frag_instance that avoids getting table and fragment pointers.
When executing queued log records with 3 LDMs the wrong path in the code was chosen, which resulted in the signal being executed by the wrong block instance, which led to problems.
In addition a check that the logPartState was ACTIVE was not detailed enough, required a bit more details since we set the state to IDLE before executing the REDO log record and not after to ensure that we don't need to keep the mutex for too long time.
Changed the checks to use m_use_mutex_for_log_parts that is properly calculated to discover if we can have the operations for a log record in a different LDM instance.
Bug fixes in queued REDO log record handling + some name fixing
Currently the size of the UNDO log buffer is limited to 600 MB. To set up RonDB on a Metal Server with 96 CPUs requires around twice the amount using the automatic sizing of buffers and memories.
The problem is a small map in LGMAN. By increasing this from 40 to 1200 we can support UNDO log buffers of up to 16 GByte, which is sufficient to handle even a data node with 1024 CPUs.
ndb_import can hang when there are NDB errors in loading a large CSV file. In this case the execution threads stops progressing due to the failures. However the CSV input threads and the output workers are all stuck in trying to send rows to the next thread in the pipeline (output threads for CSV input threads and execution threads for output threads). They are stuck in this state since there are no checks for stop processing in this code path.
Added checks for stop processing in those paths, but also changed such that all errors will lead to stopping the entire ndb_import program and not just the current job being executed. It is better to have a program with a consistent error behaviour rather than trying to continue processing after a failure.
Changed some parameters for ndb_import to make it more responsive and quicker to change jobs. Some of those led to dramatic decrease of time to perform import jobs.
Fix padString for ByteBuffer longer than 255#
BLANK_PAD was a byte array of 255. If the number of bytes to pad was longer than 255 bytes, the buffer.put() operation throws an IndexOutOfBoundsException.
Replace the buffer.put() operation with the equivalent for loop operation which initializes the correct number of bytes.
We have code that ensures that the sum of all CPU usage is 100%. This prioritises the usage in the following order:
buffer full percentage
execution percentage
send percentage
spin percentage
sleep percentage
There was a bug in ensuring that the sum was always 100% in that we changed execution percentage instead of send and spin percentage, this led to a crash in rare situations.
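The notes do not spell out the normalisation algorithm, so the following is only one plausible reading, with a hypothetical function name: each value is granted in priority order from whatever is left of the 100%, so lower-priority values such as send and spin are trimmed before execution, and sleep is assumed to absorb the remainder.

```python
def normalise_cpu_percentages(buffer_full, execution, send, spin):
    """Return percentages that always sum to 100 (illustrative only)."""
    prioritised = [
        ("buffer_full", buffer_full),
        ("execution", execution),
        ("send", send),
        ("spin", spin),
    ]
    remaining = 100
    result = {}
    for name, value in prioritised:
        granted = max(0, min(value, remaining))  # trim once the budget is used up
        result[name] = granted
        remaining -= granted
    result["sleep"] = remaining  # assumption: sleep takes whatever is left
    return result


overcommitted = normalise_cpu_percentages(10, 60, 30, 20)
idle = normalise_cpu_percentages(0, 20, 5, 5)
```

In the overcommitted case the values add up to 120, so spin is trimmed to 0 while buffer full, execution and send are kept. In the mostly idle case sleep is set to 70.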
The LCP processing is based on CPU usage, this means that LCP will slow down in the presence of high load. However the load reported is including spin time. Thus the reported CPU usage is overestimated and this will slow down LCPs more than necessary.
Fix by reporting execution time - spin_time
Every time we report ACC locks to ndbinfo we print ACC_OP(...). This is debug information that has been left in the code. Remove it.
We have set MinDiskWriteSpeed and MaxDiskWriteSpeed very low in the default configuration. These values are intended as a minimum disk write speed for LCP processing even in the context of a lower requirement from the adaptive algorithm deciding the disk write speed.
However there is also an algorithm to decrease the LCP speed to its minimum. To ensure that this minimum does not fall below what is reasonable, even in the context of CPU overload or disk overload, we hardcode the parameters for minimum disk write speed, and do this per LDM thread rather than for the node in total.
The min disk write speed is set to 2 MByte, the max to 3 MByte, the max during another node's restart is set to 12 MByte and the max during our own restart is set to 50 MByte. All numbers are per LDM thread.
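Since the limits are per LDM thread, the totals for a data node scale with the number of LDM threads. A minimal sketch of that arithmetic, with hypothetical names for the function and dictionary keys:

```python
# Hardcoded per-LDM limits in MByte, taken from the text above.
PER_LDM_DISK_WRITE_SPEED_MB = {
    "min": 2,
    "max": 3,
    "max_other_node_restart": 12,
    "max_own_restart": 50,
}


def node_disk_write_speed_mb(num_ldm_threads):
    """Return the node-wide limits in MByte for a number of LDM threads."""
    if num_ldm_threads < 1:
        raise ValueError("at least one LDM thread is assumed")
    return {name: value * num_ldm_threads
            for name, value in PER_LDM_DISK_WRITE_SPEED_MB.items()}


limits = node_disk_write_speed_mb(4)
```

A data node with 4 LDM threads thus ends up with a minimum of 8 MByte, a maximum of 12 MByte, 48 MByte during another node's restart and 200 MByte during its own restart.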
In the logbuffers table the log_part is the instance number, it should be the log part number. The log_id is the log part number, but this should be 0 for REDO logs.
The code to create a CPU map to use in assigning CPU locking didn't take into account the case with 0 LDM instances. Easiest solution was to simply change to 1 LDM instance, the code above will get its list of CPUs that it desired also in this case. Added a number of unit tests to verify this feature as well.