Release Notes RonDB 21.04.3#

RonDB 21.04.3 is the fourth release of RonDB 21.04. It is based on MySQL NDB Cluster 8.0.23. It is a bug fix release based on RonDB 21.04.2.

RonDB 21.04.3 is released as open source software with binary tarballs for use on Linux and Mac OS X. It is developed on Linux and Mac OS X, and on Windows using WSL 2 (Linux on Windows).

RonDB 21.04.3 can be used on both x86_64 and ARM64 architectures, although ARM64 support is still in beta.

There are four ways to use RonDB 21.04.3:

You can use the managed version available on hopsworks.ai. It sets up a RonDB cluster given a few details on the HW resources to use. The RonDB cluster is integrated with Hopsworks and can be used both for RonDB applications and for Hopsworks applications. Currently AWS is supported; Azure support will be available soon.
You can use the cloud scripts, which make it easy to set up a cluster on Azure or GCP. This requires no previous knowledge of RonDB; the script only needs a description of the HW resources to use, and the rest of the setup is automated.
You can use the open source version with the binary tarball and set it up yourself.

You can use the open source version, build it and set it up yourself. These are the commands you can use:

# Download x86_64 on Linux
wget https://repo.hops.works/master/rondb-21.04.3-linux-glibc2.17-x86_64.tar.gz
# Download ARM64 on Linux
wget https://repo.hops.works/master/rondb-21.04.3-linux-glibc2.28-arm64_v8.tar.gz
# Download x86_64 on Mac OS X (at least Mac OS X 11.6)
wget https://repo.hops.works/master/rondb-21.04.3-macosx11.6-xcode-13.1-x86_64.tar.gz
# Download ARM64 on Mac OS X (at least Mac OS X 12.2)
wget https://repo.hops.works/master/rondb-21.04.3-macosx12.2-xcode-13.1-arm64_v8.tar.gz
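
When the download is scripted, choosing the right tarball can be reduced to a lookup. The following is a minimal sketch in Python; the tarball names are copied from the commands above, while the `tarball_url` helper and the `"linux"`/`"macos"`/`"x86_64"`/`"arm64"` keys are hypothetical names chosen for illustration.

```python
# Tarball names copied from the wget commands in these release notes.
BASE_URL = "https://repo.hops.works/master/"
TARBALLS = {
    ("linux", "x86_64"): "rondb-21.04.3-linux-glibc2.17-x86_64.tar.gz",
    ("linux", "arm64"): "rondb-21.04.3-linux-glibc2.28-arm64_v8.tar.gz",
    ("macos", "x86_64"): "rondb-21.04.3-macosx11.6-xcode-13.1-x86_64.tar.gz",
    ("macos", "arm64"): "rondb-21.04.3-macosx12.2-xcode-13.1-arm64_v8.tar.gz",
}


def tarball_url(os_name: str, arch: str) -> str:
    """Return the download URL for a supported (os, arch) pair.

    The key names are an assumption of this sketch, not a RonDB convention.
    """
    try:
        return BASE_URL + TARBALLS[(os_name, arch)]
    except KeyError:
        raise ValueError(
            f"no RonDB 21.04.3 tarball listed for {os_name}/{arch}"
        ) from None
```

An unsupported combination raises `ValueError` instead of silently producing a URL that does not exist.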

RonDB 21.04 is a Long Term Support version that will be maintained until at least 2024.

Maintaining 21.04 mainly means fixing critical bugs and handling minor change requests. It doesn’t involve merging with any future release of MySQL NDB Cluster; that will be handled in newer RonDB releases.

Backports of critical bug fixes from MySQL NDB Cluster will happen.

Summary of changes in RonDB 21.04.3#

RonDB 21.04.3 contains 19 bug fixes since RonDB 21.04.2. In total, RonDB 21.04 contains 15 new features on top of MySQL NDB Cluster 8.0.23 and a total of 87 bug fixes.

Test environment#

RonDB has a number of unit tests that are executed as part of the build process.

MTR testing#

RonDB has a functional test suite using MTR (MySQL Test Run) that executes more than 500 RonDB-specific test programs. In addition, there are thousands of test cases for the MySQL functionality. MTR is executed on both Mac OS X and Linux.

We also have a special mode of MTR testing where we can run with different versions of RonDB in the same cluster to verify our support of online software upgrade.

Autotest#

RonDB is very focused on high availability. This is tested using a test infrastructure we call Autotest. It contains many hundreds of test variants, and the full set takes around 36 hours to execute. One test run with Autotest uses a specific configuration of RonDB. We execute multiple such configurations, varying the number of data nodes, the replication factor, and the thread and memory setup.

An important part of this testing framework is that it uses error injection. This means that we can test exactly what happens if we crash in very specific situations or run out of memory at specific points in the code, and we can change the timing in various ways by inserting small sleeps in critical paths of the code.
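
The idea of error injection can be sketched in a few lines of Python. This is not RonDB's implementation: the `ErrorInjector` class, the `InjectedCrash` exception, the action names and the `before_commit` injection point are all hypothetical and only illustrate the three kinds of faults mentioned above (crash, out of memory, extra delay).

```python
import time


class InjectedCrash(Exception):
    """Stands in for a node crash in this sketch."""


class ErrorInjector:
    """Hypothetical registry of one-shot faults, keyed by injection point."""

    def __init__(self):
        self._armed = {}  # injection point name -> (action, value)

    def arm(self, point, action, value=None):
        self._armed[point] = (action, value)

    def check(self, point):
        """Called by the code under test when it reaches a named point."""
        # pop() makes every fault one-shot: it fires once, then disappears.
        action, value = self._armed.pop(point, (None, None))
        if action == "crash":
            raise InjectedCrash(point)
        if action == "oom":
            raise MemoryError(f"injected allocation failure at {point}")
        if action == "sleep":
            time.sleep(value)  # shifts timing on a critical path


injector = ErrorInjector()


def commit_row(row):
    injector.check("before_commit")  # hypothetical injection point
    return f"committed {row}"
```

A test arms a fault, runs the operation, and then verifies that the system recovers; with nothing armed, `check` does nothing.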

During one full test run of Autotest, RonDB nodes are restarted thousands of times in all sorts of critical situations.

Autotest currently runs on Linux with a large variety of CPUs and Linux distributions, and even on Windows using WSL 2 with Ubuntu.

Benchmark testing#

We test RonDB using the Sysbench test suite, DBT2 (an open source variant of TPC-C), flexAsynch (an internal key-value benchmark), DBT3 (an open source variant of TPC-H) and finally YCSB (Yahoo Cloud Serving Benchmark).

The focus is on testing RonDB's LATS capabilities (low Latency, high Availability, high Throughput and scalable Storage).

Hopsworks testing#

Finally, we also execute tests in Hopsworks to ensure that RonDB works with HopsFS, the distributed file system built on top of RonDB, with HSFS, the Feature Store designed on top of RonDB, and with all other use cases of RonDB in the Hopsworks framework.

BUG FIXES#

RONDB-110: Fix with CS_RELEASE state#

Proper handling of CS_RELEASE state introduced in RonDB 21.04.2.

RONDB-109: Autotest improvements#

Improvements in the execution of Autotest. It is now possible to run with any configuration listed in conf-autotest.cnf, found in storage/ndb/test/run-test.

RONDB-108: Network send issue#

Ensured that communication transporters are removed from the non-neighbour lists when they are moved to the neighbour transporters. This could cause issues with 3 and 4 replicas.

RONDB-107: Network receive issue#

Sometimes we read events from epoll even after the socket should have been closed. In this case the allTransporters entry is NULL. We now ignore any signals in this state and reset the m_recv_transporters and m_has_data_transporters bitmasks.
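
The guard described above can be sketched as follows. This is an illustration in Python rather than the actual C++ receive code: the function name and signature are hypothetical, the bitmasks are modelled as plain integers, and the sketch assumes that only the bit of the closed transporter is cleared.

```python
def handle_epoll_event(trp_id, all_transporters, recv_mask, has_data_mask):
    """Process one epoll event for transporter trp_id.

    Returns (recv_mask, has_data_mask, handled). The two masks model
    m_recv_transporters and m_has_data_transporters as integers.
    """
    bit = 1 << trp_id
    if all_transporters[trp_id] is None:
        # Late event for a socket that is already closed: ignore it and
        # make sure no stale bit keeps the transporter in the receive set.
        return recv_mask & ~bit, has_data_mask & ~bit, False
    # Normal case: mark the transporter as ready to receive.
    return recv_mask | bit, has_data_mask, True
```

With `all_transporters = [object(), None]`, an event for index 1 is dropped and its bits are cleared, while an event for index 0 is handled normally.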

RONDB-106: Fix execute flag#

The execute flag was never reset after being set in the first interaction of the transaction. Previously this caused no major issues; it only allowed TCKEYCONF to be sent slightly prematurely in some cases.

However, since the flag is now used to discover whether we are executing a single operation at a time, it is important to reset the execute flag every time we get a TCKEYREQ from the NDB API and to ensure that it is set again when the NDB API sends the last TCKEYREQ in a batch.

Introduced a new flag, TF_SINGLE_EXEC_FLAG, to indicate that the interaction is using a single TCKEYREQ from the NDB API. This single TCKEYREQ can still generate lots of other TCKEYREQs through Reorg triggers, FK triggers, Fully replicated triggers and Unique Key interactions. But those are all dependent on the single operation and are thus ok to execute in any order. The final Deferred trigger execution phase is also ok to execute in parallel, since no interactions with the NDB API happen at this point.
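
The flag handling can be modelled roughly as below. This is a simplified Python sketch, not the code in the transaction coordinator: the `ApiConnection` class, its attribute names and the `last_in_batch` parameter are hypothetical, and triggered TCKEYREQs are left out entirely.

```python
class ApiConnection:
    """Hypothetical per-transaction state for TCKEYREQs from the NDB API."""

    def __init__(self):
        self.exec_flag = False
        self.single_exec_flag = False  # models TF_SINGLE_EXEC_FLAG
        self._ops_in_interaction = 0

    def on_tckeyreq(self, last_in_batch: bool) -> None:
        # Reset on every TCKEYREQ so that a value set during the first
        # interaction cannot leak into later interactions.
        self.exec_flag = False
        self.single_exec_flag = False
        self._ops_in_interaction += 1
        if last_in_batch:
            # The last TCKEYREQ of the batch sets the execute flag again.
            self.exec_flag = True
            # Single-exec only if this batch consisted of one TCKEYREQ.
            self.single_exec_flag = self._ops_in_interaction == 1
            self._ops_in_interaction = 0
```

A batch of one operation ends with both flags set, whereas a batch of two ends with the execute flag set and the single-exec flag cleared.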

RONDB-106: Batch execution improvements#

This improvement makes it clear what the consistency of batch executions is. The aim is to ensure that we can always read our previous writes in the same transaction even if we are using batched executions.

If there is a risk that we issue several operations on the same row in overlapping batches, we need to ensure that all operations on this row are executed in the primary replica first, using the LDM thread.

It isn’t necessary for all operations in the batch to do this, only those that have multiple operations on the same row.

We provide no special guarantees about the order in which operations on different rows are executed.

This means that for batch executions to be able to use read backup and query threads, we need to set a BatchSafe flag on the TCKEYREQ signal. Otherwise only TCKEYREQ operations with the Exec flag will be considered for use by backup replicas and query threads.
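
The rule for which operations must stay on the primary replica can be illustrated with a small Python sketch. It assumes, purely for illustration, that an operation is marked BatchSafe exactly when its row is touched only once in the batch; the `mark_batch_safe` function and the string row keys are hypothetical.

```python
from collections import Counter


def mark_batch_safe(row_keys):
    """Return one BatchSafe flag per operation in a batch.

    row_keys lists the row touched by each operation, in batch order.
    Operations on a row that occurs more than once get False and must be
    executed on the primary replica by the LDM thread; the others may be
    served by backup replicas and query threads.
    """
    ops_per_row = Counter(row_keys)
    return [ops_per_row[row] == 1 for row in row_keys]
```

For the batch `["a", "b", "a", "c"]` only the operations on `b` and `c` are marked BatchSafe, so a read of `a` that follows a write of `a` in the same batch still sees that write.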

HOPSWORKS-2937: savepointId destroyed in Commit#

In a Commit in DBTUP we start by setting m_commit_disk_callback_page. This variable is in a union with savepointId, so setting it overwrites savepointId. savepointId is protected by the fragment mutex, and overwriting it can cause Dirty Reads processed while the commit is ongoing to read the before value, even for Dirty Reads that are part of the same transaction.

The solution is to remove the union of those 2 variables.
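
The effect of sharing storage through a union can be reproduced with Python's `ctypes` module. The field names below mirror the two variables, but the 32-bit types, the record classes and the `start_commit` helper are assumptions of this sketch and not the actual DBTUP definitions.

```python
import ctypes


class OperationRecWithUnion(ctypes.Union):
    # Both fields occupy the same 4 bytes.
    _fields_ = [
        ("savepoint_id", ctypes.c_uint32),
        ("commit_disk_callback_page", ctypes.c_uint32),
    ]


class OperationRecFixed(ctypes.Structure):
    # Separate storage for each field, as after the fix.
    _fields_ = [
        ("savepoint_id", ctypes.c_uint32),
        ("commit_disk_callback_page", ctypes.c_uint32),
    ]


def start_commit(rec, page):
    """Model the first step of a commit: remember the callback page."""
    rec.commit_disk_callback_page = page


before = OperationRecWithUnion()
before.savepoint_id = 7
start_commit(before, 1234)  # clobbers savepoint_id

after = OperationRecFixed()
after.savepoint_id = 7
start_commit(after, 1234)  # savepoint_id is untouched
```

After `start_commit`, `before.savepoint_id` reads back as 1234 while `after.savepoint_id` is still 7, which is the difference a concurrent reader would observe.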

Fix output from various test cases#

HOPSWORKS-2927: Problems with more than 64k extents in a fragment#

DBTUP stores the extent number in a 16-bit variable; however, there can be more than 64k extents in a data file. Thus DBTUP should use 32 bits to store the extent number.

This can cause issues in restart_setup_page and in debug printouts.
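
The truncation is easy to demonstrate with integer masking. The Python sketch below only models the width of the stored value; `stored_extent_no` is a hypothetical helper and not a DBTUP function.

```python
def stored_extent_no(extent_no: int, bits: int) -> int:
    """Return the value that survives when extent_no is kept in an
    unsigned field that is `bits` wide."""
    return extent_no & ((1 << bits) - 1)


# Extent 70000 does not fit in 16 bits and aliases a much lower extent.
truncated = stored_extent_no(70000, 16)
# With 32 bits the extent number is preserved.
preserved = stored_extent_no(70000, 32)
```

Here `truncated` is 4464 (70000 minus 65536), so two different extents become indistinguishable, while `preserved` remains 70000.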

HOPSWORKS-2928: Avoid looping over all extents when reporting extent space and free extent space#

HOPSWORKS-2914: Improve lc_scriptsconfig_git#

Some files have an empty line before the copyright message. This changes the commit hook to accept a copyright message in a paragraph that begins within the first 10 lines or ends within the last 10 lines.
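
The acceptance rule can be sketched as follows. This is not the actual commit hook: the sketch assumes that a paragraph is a run of non-blank lines and that a copyright message is any paragraph containing the word "Copyright"; both function names are hypothetical.

```python
def paragraphs(lines):
    """Yield (first, last) 0-based line indexes of each non-blank run."""
    start = None
    for i, line in enumerate(lines):
        if line.strip():
            if start is None:
                start = i
        elif start is not None:
            yield start, i - 1
            start = None
    if start is not None:
        yield start, len(lines) - 1


def has_accepted_copyright(text, window=10):
    """True if a copyright paragraph begins within the first `window`
    lines or ends within the last `window` lines of the file."""
    lines = text.splitlines()
    for first, last in paragraphs(lines):
        if not any("Copyright" in line for line in lines[first:last + 1]):
            continue
        if first < window or last >= len(lines) - window:
            return True
    return False
```

A file that starts with an empty line followed by the copyright paragraph is accepted, because the paragraph still begins within the first 10 lines.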

HOPSWORKS-2912: GCP list can be empty#

The GCP list can be empty when printing the GCP report in SUMA. In this case it is not a good idea to use a pointer to the last GCP record, since this is a null pointer.

Use GCI = 0 as a workaround in the event report when the pointer is null.

HOPSWORKS-2911: Lost signals on ARM#

Sending signals between threads on x86 uses a technique that relies on memory barriers in a number of places. This technique requires a fair bit of work to get working on a new HW platform such as ARM, Power or any other processor. Until this work has been done, we instead rely on mutexes for this communication.

The extra overhead this brings costs about 1.5% in performance.
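
The mutex-based alternative amounts to taking a lock around every access to the shared signal queue. The Python sketch below shows only that pattern; Python cannot express the memory-barrier variant, and `MutexSignalQueue` and `run_demo` are hypothetical names that do not correspond to RonDB's thread implementation.

```python
import threading
from collections import deque


class MutexSignalQueue:
    """Signal queue where every access is serialised by one mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._signals = deque()

    def send(self, signal):
        with self._lock:
            self._signals.append(signal)

    def receive_all(self):
        with self._lock:
            batch = list(self._signals)
            self._signals.clear()
        return batch


def run_demo(senders=4, per_sender=1000):
    """Let several threads send signals and return everything received."""
    queue = MutexSignalQueue()

    def worker(sender_id):
        for seq in range(per_sender):
            queue.send((sender_id, seq))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return queue.receive_all()
```

No signal is lost and each sender's signals arrive in the order they were sent, at the price of one lock acquisition per send.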

HOPSWORKS-2907: Test case fix#

At the first use of a thread in data nodes, some data memory pages are allocated and not returned. These pages led to failures in testIndex -n Bug56829 T1. It is probably a good idea to investigate this more deeply at some point.

HOPSWORKS-2906: Index stats improvements#

The index statistics auto-update feature suffered from a number of issues:
1) More than one concurrent update could be started.
2) If the auto-update suffered an error at the beginning of a schema transaction, we leaked a record of c_txHandlePool, of which there are only 2 records.

The first issue is fixed by ensuring that only 1 update is ongoing at a time. The second is fixed by handling the error at begin schema transaction properly. Also added extra insurance by not allowing index stats to allocate a Tx handle when 2 have already been allocated for schema transaction and handling of API failure.
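
Both fixes follow a common pattern: a guard against concurrent starts, and releasing a pooled record on the error path. The Python sketch below illustrates that pattern only; the `TxHandlePool` and `IndexStatAutoUpdate` classes, the result strings and the use of `RuntimeError` for a failed schema transaction are all hypothetical.

```python
class TxHandlePool:
    """Stand-in for a small fixed-size pool such as c_txHandlePool."""

    def __init__(self, size=2):
        self.free = size

    def seize(self):
        if self.free == 0:
            return False  # never hand out more records than exist
        self.free -= 1
        return True

    def release(self):
        self.free += 1


class IndexStatAutoUpdate:
    def __init__(self, pool):
        self.pool = pool
        self.ongoing = False

    def start(self, begin_schema_trans):
        if self.ongoing:
            return "busy"  # only one update at a time
        if not self.pool.seize():
            return "no-handle"
        self.ongoing = True
        try:
            begin_schema_trans()
        except RuntimeError:
            # Error path: give the record back instead of leaking it.
            self.pool.release()
            self.ongoing = False
            return "failed"
        return "started"

    def finish(self):
        self.pool.release()
        self.ongoing = False
```

Repeated failures at the beginning of the schema transaction leave the pool at its full size, and a second `start` while an update is running returns `"busy"` without seizing a record.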

HOPSWORKS-2905: Failed setup of m_nodes array in fix_nodegroup#

Missed initialising the m_nodes array before setting it up in fix_nodegroup. This can lead to high node ids being stored beyond the number of replicas used, which leads to a crash instead of the correct behaviour, which is to return no nodes.

HOPSWORKS-2903: Fix SignalLoggerManager::printSignalData#

Send the correct block number to print functions.

HOPSWORKS-2898: Fix compiler issue#

Error code 4009 was reported incorrectly due to a wrong data type in an extern declaration.

Backport of BUG33162059#

Backport of Bug33162059 from MySQL 8.0.28.