Advanced Configuration of RonDB file systems#

In this chapter we will discuss how to set up file system related configuration parameters for both in-memory data and disk columns.

How RonDB disk data works was covered in the chapter on disk data columns, where we also covered the available configuration parameters. Here we focus on recommended ways to use those parameters. The following decisions need to be made when configuring disks for RonDB:

Decision on which disks to use for tablespaces
Decision on which disks to use for UNDO logs
Decision on size of DiskPageBufferMemory
Decision on size of Undo log buffer
Decision on which disks to use for REDO logs and local checkpoints

Directory placement#

The default placement of files on disk is specified by the configuration parameter DataDir. This should always be set in the configuration file.

We can move all the disk usage except the node logs and trace files to another directory using the configuration parameter FileSystemPath.

Currently local checkpoint data and REDO logs cannot be separated on different disks. They are always written with sequential writes, so this should not present any issues.

A tough challenge is to provide predictable latency for RonDB using disk data. When NDB was designed this was simply not possible and thus early versions of MySQL Cluster had no support for disk data.

The main reason it is now possible to overcome these challenges is the hardware development of recent years. Nowadays we have access to SSD devices that can respond in hundreds of microseconds or less, NVMe devices that are even faster, and persistent memory devices that cut latency down to only a few microseconds.

An installation that uses properly configured hardware and has set up its file systems well can definitely use disk columns and still meet predictable latency requirements: transactions with tens of round trips can be executed in tens of milliseconds even when they involve disk data.

The throughput of RonDB disk data is not equal to the throughput of in-memory data. The main difference is in writes, where disk data columns use one tablespace manager, which manages the extents of tablespaces and the allocation of pages in extents, and one log manager, which handles the UNDO log. Thus there are a few bottlenecks in the mutexes around these managers that limit the write throughput for disk data tables. It is still possible to perform hundreds of thousands of disk data row writes per second in a cluster of NDB data nodes.

Reads on tables with disk columns are as scalable as reads on in-memory tables and are only limited by the number of reads from the file system that can be served. It is expected that millions of rows per second can be read per cluster even when the cache hit rate is low. With high cache hit rates, tens of millions and even hundreds of millions of reads per second are possible per cluster.

Most of the decision making here is about choosing the correct disk architecture for your application and the rest is about selecting proper buffer sizes.

It is recommended to avoid using the same disks for tablespaces as are used for local and global checkpointing and REDO logging. The reason is that if the disks containing tablespaces are overloaded by the application, checkpointing, REDO logging and so forth are affected as well. This is not desirable. It is preferable to ensure that logging and checkpointing always have dedicated disk resources, since this removes many overload problems.

As an example, if we want to set up RonDB in the Oracle Bare Metal Cloud using the IO-heavy servers, we can do the following. This machine comes equipped with 768 GBytes of memory and a total of 51 TByte of space using 8 NVMe devices.

One way to use this setup is to use two devices for local checkpoint files, REDO log files, the UNDO logs and the backups in RonDB. Let’s assume we set up this file system on /ndb_data.

The other six devices are used for tablespaces. This means setting up two file systems using a RAID 0 configuration on two devices in both cases. As an example we can use 1 TByte of disk space for the UNDO log and 30 TByte disk space for tablespaces.

The blog here provides a detailed example of how one can setup a complex disk data set up for RonDB.

The blog here provides a description of a heavy benchmark using disk columns.

Let’s assume that we set up this file system on /ndb_disk_data.

Thus we set the configuration variable DataDir to /ndb_data and the configuration variable FileSystemPathDD to /ndb_disk_data.
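As a minimal sketch, the placement described above could look as follows in config.ini. The [ndbd default] section is an assumption about where these parameters are set in your configuration; the two paths are the ones assumed in the text.

```ini
[ndbd default]
# Default location for everything that is not placed elsewhere
DataDir=/ndb_data
# Disk data files
FileSystemPathDD=/ndb_disk_data
```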

DataDir will also contain node logs, cluster logs, trace files and pid files generated by the data node. It could be useful to place these files on a separate device. Normally those files will not grow extensively, but if the data nodes hit some situation where information is written to the node logs a lot, it could be useful to also remove those from the same disks as checkpoint files and REDO log files. DataDir is always used for those files, but if FileSystemPath is also set then local checkpoint files, REDO logs and metadata and system files are moved to this directory.

It is possible to specifically set a directory for backups in BackupDataDir, for tablespaces in FileSystemPathDataFiles and a specific directory for UNDO log files in FileSystemPathUndoFiles.
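A sketch of the finer-grained placement follows. The mount points /ndb_logs, /ndb_backup, /ndb_ts and /ndb_undo are hypothetical and chosen only for illustration; /ndb_data is reused from the example above.

```ini
[ndbd default]
# Node logs, cluster logs, trace files and pid files always stay in DataDir
DataDir=/ndb_logs
# Local checkpoint files, REDO logs, metadata and system files
FileSystemPath=/ndb_data
# Backups
BackupDataDir=/ndb_backup
# Tablespace data files
FileSystemPathDataFiles=/ndb_ts
# UNDO log files
FileSystemPathUndoFiles=/ndb_undo
```

With this layout, heavy writing to the node logs cannot compete for IO with checkpoint and REDO log writes.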

Compressed files#

To ensure that we can fit at least one and possibly even two backups we should use compression on the backup files. This means setting CompressedBackup to 1 in the configuration.

It is possible to use compressed local checkpoints as well by setting CompressedLCP.

Compressing happens in io threads and uses zlib for compression. Expect this library to use one second of CPU time for around 100-200 MBytes of checkpoint files or backup files.
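The two settings above can be sketched as a config.ini fragment. The 100 GByte checkpoint size in the comment is a hypothetical figure used only to make the CPU estimate concrete.

```ini
[ndbd default]
# Compress backup files so that one or two backups fit on disk
CompressedBackup=1
# Optionally compress local checkpoint files as well
CompressedLCP=1
# Rough CPU cost, using the 100-200 MBytes per CPU second estimate:
# a hypothetical 100 GByte checkpoint costs about 500 to 1000 CPU seconds
```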

Encryption at rest#

RonDB supports encryption of on-disk data, thereby providing at-rest protection for data stored in RonDB. This includes LCP files, TS files, redo logs and undo logs but not backups. To use encryption, refer to the following two configuration variables:

EncryptedFileSystem: Set to 1 to enable on-disk encryption or to 0 to disable on-disk encryption. This variable should be set in config.ini on the RonDB management server. Example:
```
[ndbd default]
EncryptedFileSystem=1
```
filesystem-password-from-stdin: This flag should be set in the local data node configuration in my.cnf. Example:
```
[ndbd]
filesystem-password-from-stdin
```

When encryption is enabled with the two variables detailed above, the data node process ndbmtd will expect the passphrase on STDIN. You must start the process in such a way that STDIN is fed the passphrase, followed by a newline character (a.k.a. line feed, \n or 0x0a). For example:

If the passphrase followed by a newline character is stored in a file, you can execute ndbmtd < /path/to/passphrase/file in a shell script.
If the passphrase is stored in a KMS or otherwise needs a certain command to retrieve, you can execute ndbmtd < <(command-to-retrieve-passphrase; echo) in a shell script.
If you want to enter the passphrase manually, run ndbmtd in a shell. When prompted, type the passphrase and press return.

The passphrase length must be no longer than 256 characters. Allowed characters are A-Z, a-z, 0-9 and #&()*+,-./:;<=>?@[]_`{|}~.

Converting the on-disk data from unencrypted to encrypted, from encrypted to unencrypted, or from one passphrase to another is not supported. In order to turn on or off encryption or change the passphrase for the on-disk data, a rolling restart using --initial is required. Each node in turn will then delete all on-disk data, reinitialize the file system using the new configuration and populate it using data fetched from a replica node.

WARNING: When performing reconfiguration of encryption, care must be taken to not lose data. A data node process started with the --initial flag will delete all on-disk data on startup. To not lose this data it must be present on a replica (another data node in the same node group). It is recommended to perform reconfiguration of encryption only on online, redundant RonDB clusters.

Verifying at-rest encryption#

To verify the encryption configuration of the data nodes, run the following query in the SQL interface.

```
SELECT cv.node_id AS NodeId, cv.config_value AS EncryptedFileSystem
FROM ndbinfo.config_values cv JOIN ndbinfo.config_params cp
  ON cv.config_param = cp.param_number
WHERE cp.param_name = 'EncryptedFileSystem';
```

A value of 1 indicates that the file system on that data node is encrypted, while 0 indicates that the file system is unencrypted.

Encryption implementation details#

The data node generates a 256-bit passphrase key, using key derivation process PKCS #5 PBKDF2 with SHA-256, the provided passphrase, a 256-bit random salt and 100000 iterations. It also maintains a secrets file, XTS-encrypted with the passphrase key, containing the node master key. The salt for the passphrase key is stored in the secrets file header. The node master key is a 256-bit random value generated during initialization and will be different on each node even if the passphrase is the same. Files used to store data (including LCP files, TS files, redo logs and undo logs but not backups) are XTS-encrypted with one or more randomly generated 256-bit data keys. The data keys are wrapped using AESKW with the node master key and stored in the respective file’s header.

Configuring the REDO log files and local checkpoints#

Local checkpoints and REDO logs are an important part of the configuration setup for MySQL Cluster.

We write the update information of all inserts, updates and deletes into the REDO log. It is important to be able to cut the REDO log tail every now and then. This happens when a local checkpoint has been completed. At recovery time we first install a local checkpoint and then apply the REDO log to reach the state we want to restore.

Configuring REDO log#

It is important to set the correct size of the REDO log before starting up a production cluster. It is possible to change the size, but it requires that the nodes are restarted using an initial node restart.

Setting up the REDO log using NoOfFragmentLogParts, NoOfFragmentLogFiles and FragmentLogFileSize is mostly not required and was covered in a previous chapter on advanced configuration of RonDB.

InitFragmentLogFiles#

At initial start of a node, the REDO log files will be initialised. This ensures that the REDO log space on disk is really allocated. Otherwise the OS might fake the allocation and we might run out of REDO log although not all of it was used.

In a testing environment it might be useful to speed up initial starts by skipping this initialisation phase. For this purpose it is possible to set InitFragmentLogFiles to sparse. The price is a risk that we run out of REDO log on disk although the REDO log is supposed to have space left. This will cause the data node to crash since we can no longer write to the REDO log.
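A minimal sketch of the test-only setting described above. The value is spelled as in the text; the exact accepted spelling should be checked against your RonDB version.

```ini
[ndbd default]
# Test environments only: skip full initialisation of the REDO log files.
# The OS may then not really allocate the space, so the node can crash
# when the disk fills up before the REDO log does.
InitFragmentLogFiles=sparse
```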

RedoBuffer#

This parameter is handled automatically by the automatic memory handling in RonDB.

At commit time the REDO log messages have to be written to the REDO log buffer. This is an in-memory buffer that is written to disk at intervals prescribed by global checkpoints. If the disk cannot keep up, the RedoBuffer usage will grow. This parameter sets the size of this buffer. When we run out of REDO buffer we can abort transactions in the prepare phase. The commit messages must be written into the REDO log buffer, so we ensure that there is always some space left in the buffer for this purpose. Thus running large transactions requires a large RedoBuffer setting.

The setting of RedoBuffer is per ldm thread. The total memory used for Redo buffers is RedoBuffer multiplied by the number of ldm threads.

DefaultOperationRedoProblemAction#

When a transaction finds the REDO buffer full, it has two choices. It can either abort the transaction immediately, or it can queue up and hope that the REDO buffer situation improves.

The API node has a configuration option for this purpose called DefaultOperationRedoProblemAction. This configuration parameter can be set to ABORT or QUEUE. It can be overridden by an option in the NDB API for NDB API applications. The default is QUEUE, which means that we will queue while waiting for REDO buffer space to become available.
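As a sketch, an installation that prefers failing fast over queueing could set the parameter as follows. The [api default] section name is an assumption about how API nodes are declared in your config.ini.

```ini
[api default]
# Abort the transaction immediately when the REDO buffer is full,
# instead of queueing (the default is QUEUE)
DefaultOperationRedoProblemAction=ABORT
```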

RedoOverCommitLimit and RedoOverCommitCounter#

As long as we have enough IO bandwidth to write the REDO log to disk at the same rate as it is generated, there are no problems.

If we write more to the file than what is acknowledged, we know that the disks are not able to keep up.

If the disks cannot keep up, we can calculate the disk IO rate through the acks of file writes. By measuring the amount of outstanding IO we have at any point in time, we can measure the IO lag.

The IO lag is measured in seconds. If the IO lag in seconds exceeds RedoOverCommitLimit, we increment a counter by the lag divided by RedoOverCommitLimit. If the lag still exceeds RedoOverCommitLimit in the next second, we add even more to this counter. When the counter reaches RedoOverCommitCounter, we start aborting transactions to regain control of the REDO log writes. The default setting of RedoOverCommitLimit is 20 and of RedoOverCommitCounter is 3.

Every second that we have a lag of the REDO log writes, we will write a message to the node log describing how much lag we are seeing.
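The defaults, written out explicitly, with a worked example in the comments. The 40 second lag is a hypothetical figure, and how the division is rounded is not stated in the text, so the arithmetic is only approximate.

```ini
[ndbd default]
RedoOverCommitLimit=20
RedoOverCommitCounter=3
# Hypothetical case: the IO lag stays at about 40 seconds.
# Each second adds roughly 40 / 20 = 2 to the counter,
# so after two consecutive seconds the counter is about 4,
# which is above 3, and transactions start being aborted.
```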

Configuring Local checkpoints#

In MySQL Cluster 7.6 Partial Local Checkpoints were implemented. This includes automated handling of the speed of local checkpoints. Thus it is no longer necessary to configure this. RonDB will adjust the checkpoint speed based on how much checkpointing is required to keep the REDO log from filling up.

There are three reasons why transactions can be aborted due to the REDO log. The first, which we saw above, is that we run out of REDO buffer space: we simply cannot write to the REDO log quickly enough to move data from the buffer to the disk IO subsystem.

The second reason is that the disk IO subsystem cannot keep up with the REDO log writes as explained above.

The third reason is that we run out of REDO log space.

The only way to get REDO log space back is to ensure that the REDO logs are no longer needed. This happens through a local checkpoint.

Local checkpoint execution is controlled through a set of configuration parameters.

The first decision to make is when to start a local checkpoint again. One parameter directly controls this: TimeBetweenLocalCheckpoints. In principle NDB is designed to execute local checkpoints continuously. There is no real reason to change this parameter, and it is also a bit complex to understand.

The second configuration parameter controlling when to start an LCP is MaxLCPStartDelay. This parameter is effective during node restarts only. When the master decides that it is time to start an LCP, it checks whether there are restarting nodes that will soon need to take part in an LCP. The reason is that every node restart is finalised by participating in a local checkpoint. MaxLCPStartDelay sets the time we will wait for those nodes to be ready before we start the LCP.

TimeBetweenLocalCheckpoints#

The default setting of this parameter is 20. The parameter is logarithmic: 20 means 2 raised to the power of 20 four-byte words, which is 4 MBytes. We measure the amount of committed information since the last LCP was started. If it is bigger than 4 MBytes, we will start a new LCP.

Thus as soon as there is some level of write activity we will start an LCP. If NDB is used mainly for reading there will be no LCP activity.

The setting can at most be set to 31 which means that it will wait for 8 GBytes of write activity to occur in the cluster before an LCP is started.

It is recommended not to change this parameter, mainly because there is little value in changing it. It can sometimes be useful as a means of debugging RonDB, since setting it to 6 leads to a new local checkpoint being started immediately after the previous one completed.
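The logarithmic scale can be spelled out as a small worked example. The four-byte word size is inferred from the figures in the text (2 raised to 20 words giving 4 MBytes).

```ini
[ndbd default]
# Threshold = 2^value words, with 4 bytes per word:
#   20 -> 2^20 * 4 bytes = 4 MBytes  (default)
#   31 -> 2^31 * 4 bytes = 8 GBytes  (maximum)
#    6 -> 2^6  * 4 bytes = 256 bytes (effectively always checkpointing)
TimeBetweenLocalCheckpoints=20
```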

MaxLCPStartDelay#

This parameter is set in seconds and defaults to 25 seconds. If an LCP is about to start and we know that a restarting node is close to being ready to join an LCP, we will wait for at most this long before starting the LCP.

MinDiskWriteSpeed#

When an LCP is executing it will use an adaptive algorithm to set the desired disk write speed of the local checkpoint. By default this parameter is set to 2M (= 2 MBytes). Thus we will attempt to write 2 MByte per second in the node. We will however never set this below 2 MByte per ldm thread.

This parameter is also used as the speed of taking backups. The speed of backups isn’t as adaptive as the speed of local checkpoints.

MinDiskWriteSpeed is the floor for the adaptive algorithm: the disk write speed will never be set below it.

The adaptive algorithm normally chooses the currently active max value. There are two reasons to decrease the disk write speed of a local checkpoint. The first reason is that the CPU is heavily used in the ldm thread. We measure CPU usage in ldm threads both by asking the OS how much CPU we used and by measuring how much time we spent being awake. If the time awake is higher than the executed time, it means that the OS removed us from the CPU to execute other threads even when we were ready to execute on the CPU. This also has some impact on the adaptive algorithm.

We start slowing down the execution of LCPs when CPU usage becomes too high. At 97% we step down the disk write speed. We decrease it faster if we reach 99%, and faster still when reaching 100% CPU usage. If the CPU usage drops below 90% we start increasing the disk write speed again.

The second reason is IO lag on the REDO log. If any REDO log part has an IO lag of 2 seconds or more, as described above in the section on RedoOverCommitLimit, we will quickly decrease the disk write speed. Currently the REDO log and the local checkpoints are always placed on the same disk, so if the REDO log cannot keep up, it is a good idea to decrease the disk write speed of LCPs.

MaxDiskWriteSpeed#

In normal operation the maximum disk write speed is set by MaxDiskWriteSpeed, which defaults to 20 MBytes per second. It is the total for all ldm threads.

MaxDiskWriteSpeedOtherNodeRestart#

When another node is executing a node restart we increase the disk write speed since executing faster LCPs means that we can restart faster. In this case we have a higher max disk write speed. The same adaptive algorithm can still decrease the speed of the LCP.

This setting defaults to 50 MBytes per second.

MaxDiskWriteSpeedOwnRestart#

When we are restarting ourselves, and in cluster restarts, we use this setting as the max disk write speed. It defaults to 200 MBytes per second. During a cluster restart there is no risk of the adaptive algorithm stopping activity, since there are no transactions ongoing while running the LCP.
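The four disk write speed parameters, written out with the defaults stated in the sections above. This is only a sketch for orientation; since the speed is adaptive, these values normally do not need to be set.

```ini
[ndbd default]
# Floor for the adaptive algorithm
MinDiskWriteSpeed=2M
# Ceiling in normal operation, total for all ldm threads
MaxDiskWriteSpeed=20M
# Ceiling while another node is restarting
MaxDiskWriteSpeedOtherNodeRestart=50M
# Ceiling during our own restart and during cluster restarts
MaxDiskWriteSpeedOwnRestart=200M
```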

LCPScanProgressTimeout#

This sets the watchdog timer that ensures LCPs are making progress. It is set by default to 60 seconds. After a third of this time, that is after 20 seconds, we start writing warning messages to the node log.

Configuring backups#

To configure backups we need to configure an appropriately sized log buffer for backups and we need to handle the imbalance of backup execution.

Writing backups is controlled by the same parameters as the execution of LCPs. This means that if a backup is executing in parallel with an LCP, the LCP and the backup will share the disk write speed. The max speed is adjusted to be the sum of the checkpoint speed and the backup speed.

BackupLogBufferSize#

The size of the backup log buffer is handled automatically by the automatic memory handler.

A backup must write out all records (both in-memory data and disk columns). In addition it must write all changes made during the backup. These changes form a kind of UNDO log or REDO log that we must write during backups, in addition to all other logging and checkpointing going on in a data node.

By default it is set to 16 MByte. In a cluster with much write activity it is recommended to set the size of this buffer much larger, e.g. 256 MByte. This buffer is only allocated in the first ldm thread, thus only one instance of it is allocated.

Configuring global checkpoints#

TimeBetweenGlobalCheckpoints#

In NDB data durability at commit time is ensured by writing into the REDO buffer in several nodes as part of commit. Durability to survive cluster crashes happens when that data is committed to disk through a global checkpoint.

The default timer for global checkpoints is set to 2000 milliseconds.

If the data node runs on hard drives this number is quite ok, it might even in some cases be tough to sustain high loads at this number with hard drives.

With SSDs this value should be perfectly ok and should normally not be any issue unless the write load is very high.

With faster SSDs or NVMe drives one could even decrease this time since the IO bandwidth should be a lot higher than what NDB will be able to match with its updates.

Decreasing this value means that changes are made durable on disk faster. All transactions ensure that changes are durable in memory on all live replicas, thus they are network durable. This parameter specifies the delay before they are also disk durable.

TimeBetweenGlobalCheckpointsTimeout#

When the disks get severely overloaded, global checkpointing might slow down significantly. To ensure that we don’t continue running when we are in an unsafe state, we crash the node when the time between global checkpoints is more than TimeBetweenGlobalCheckpointsTimeout. This is also a watchdog ensuring that we don’t continue if global checkpoint execution has stopped due to a software bug.

It is set by default to 120000 milliseconds (120 seconds or 2 minutes).
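The two global checkpoint parameters with their stated defaults, as a sketch. The lower value mentioned in the comment is a hypothetical illustration for fast NVMe drives, not a recommendation from the text.

```ini
[ndbd default]
# Interval between global checkpoints, in milliseconds (default).
# On fast SSDs or NVMe drives a lower value, for example 1000,
# would make commits disk durable sooner.
TimeBetweenGlobalCheckpoints=2000
# Crash the node if no global checkpoint completes within 2 minutes
TimeBetweenGlobalCheckpointsTimeout=120000
```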

Memory Buffers for disk columns#

DiskPageBufferMemory is a very important parameter. By default it is set to 64 MBytes. This default is not meant for production usage, only for trying out MySQL Cluster on a small computer.

It is a good idea to consider this parameter specifically. If it is set too low it will have a dramatic impact on latency and throughput and if set too high it will remove the possibility to store more data in the DataMemory (in memory data).

The size of the UNDO log buffer is set either in InitialLogfileGroup or in the command where the UNDO log buffer is created. It cannot be changed after that, so it is important to set it sufficiently high from the start. There is one log buffer per data node.

The Undo log buffer size should be sufficient to ensure that transactions can proceed without interruptions as often as possible. The pressure on the log buffer increases with large transactions, when large amounts of change data are written quickly into the log buffer, and with a large number of parallel write transactions. If the buffer is overloaded, commits have to wait until buffer content has been written to disk. Thus transaction latency can be heavily impacted by using disk data with too small an Undo log buffer, and similarly by using a disk device for Undo logs that cannot keep up with the write speed.

New tablespace and undo log files#

New undo log files can be added as an online operation, as shown in the chapter on disk columns. The same applies to tablespace files. Tablespaces can also be created through the configuration parameter InitialTablespace.
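A sketch of creating the log file group and a tablespace from the configuration. The value syntax follows the form documented for MySQL NDB Cluster and is assumed to apply to RonDB as well; the names lg1 and ts1, the file names and all sizes are hypothetical.

```ini
[ndbd default]
# One log file group with a 128 MByte UNDO log buffer and one UNDO log file.
# The buffer size cannot be changed after creation.
InitialLogfileGroup=name=lg1; undo_buffer_size=128M; undo1.log:4G
# One tablespace with a single data file
InitialTablespace=name=ts1; extent_size=16M; data1.dat:32G
```

Both parameters only take effect at an initial start of the cluster; later additions are made with the online operations mentioned above.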

File system configurations#

DiskIOThreadPool#

When writing to tablespace files we can have several file system threads writing to the same file in parallel. These threads are in a special pool of threads for this purpose. The number of threads in this pool is set by the configuration parameter DiskIOThreadPool. It defaults to 8, which means that we have 8 dedicated threads available for writing to the tablespace files. If all of those threads are busy writing, any further writes have to wait for one of them to complete its file operation. The parameter can be set higher to support higher write rates to tablespace files.

In running the benchmark on disk data here we set this parameter to 128 to ensure that the file system could keep up with the load of writing many GBytes each second.
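The benchmark setting mentioned above corresponds to the following fragment. Whether 128 is appropriate depends on how many parallel writes the underlying devices can absorb.

```ini
[ndbd default]
# Default is 8; raised for a write load of many GBytes per second
DiskIOThreadPool=128
```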

MaxNoOfOpenFiles#

It is possible to set a maximum limit to the number of open files (and thus file threads) in a data node. It is set by default to 0 which means that no limit exists. It is strongly recommended to not change this parameter other than to get an understanding of the system and see how many files it opens. There are no problems associated with having many open files in parallel and the algorithms in NDB are designed to limit the amount of open files in the data node.

InitialNoOfOpenFiles#

InitialNoOfOpenFiles is the number of file threads that are started in the early phases of the restart. It is not necessary to change this parameter, but setting it higher moves some of the work of creating threads to an earlier point, before the node has started up and is executing transactions. It defaults to 27. The number of open files required is very much dependent on the data node configuration. Running RonDB with very many threads could lead to many hundreds of IO threads being started.

File threads consume very few resources: each mainly uses a 128 kByte stack. There are no real issues associated with ignoring this thread creation and the memory associated with it. Modern operating systems can handle hundreds of thousands of concurrent threads and we normally use less than one hundred threads for file IO.

This is an area where we rarely, if ever, see any issues.

ODirect#

ODirect is a special file system option that bypasses the page cache in the operating system. This option is always used on UNDO log files and tablespace files. For REDO log files, backup files and local checkpoint files it is configurable whether to use it or not.

It is recommended to set this parameter on Linux unless you are running on an ancient Linux kernel. It is set to false by default.

ODirectSyncFlag#

When this flag is set we treat writes using ODirect as synced writes. The flag is false by default.

DiskSyncSize#

This parameter ensures that we write out any buffers inside the OS frequently enough to avoid issues. It was added because Linux can buffer an amount of data equal to the entire memory of a machine before the writes start to happen. This led to very unpleasant behaviour when too much buffering happened in the file system. This parameter ensures that we write out buffers frequently and defaults to 4 MByte. Setting ODirect means that this parameter is ignored. Setting ODirect is the preferred method on Linux.
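A sketch of the recommended Linux setting described in the sections above:

```ini
[ndbd default]
# Bypass the OS page cache for REDO log, backup and local checkpoint files.
# UNDO log files and tablespace files always use it.
ODirect=1
# With ODirect set, DiskSyncSize (default 4M) is ignored
```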

Final words#

The above-mentioned machines in the Oracle Bare Metal Cloud are a very good example of high-end machines that fit very well with the requirements RonDB has on the disk subsystem when using disk columns.

The six devices used for tablespace data and UNDO log data will be able to sustain millions of IO operations and thus reading and writing several GBytes of tablespace pages and writing of UNDO log files per second.

Thus it can sustain several hundred thousand updates per second of disk rows which is more than what one node currently can handle for disk data rows (for main memory rows it can handle millions of updates per second in specific setups). This setup is a safe setup for RonDB where there is very little risk that the disk subsystem becomes a bottleneck and a problem for RonDB.