Procedure to define advanced configuration
The first choice in setting up MySQL Cluster is the number of replicas to use. Two replicas is the default and by far the most common setting. The code supports 1 to 4 replicas, but only 1 and 2 replicas are supported features in MySQL Cluster 7.5.
Next one needs to define the set of data nodes, the set of management servers, the set of MySQL Servers and any additional API nodes to use.
Next it is important to set NoOfFragmentLogParts. This parameter specifies the number of log parts per node. It is a very important parameter in defining the number of partitions of the tables and how many ldm threads we can have in each data node.
Most of the data in NDB resides in memory, so setting up the memory resources available for the database is the next important step.
All changes to data in NDB happen through transactions. Transaction resources can be configured.
Tables, indexes, foreign keys, disk data objects and so forth all require schema resources, and these schema resources can be configured.
NDB is a distributed DBMS, thus communication between nodes can be configured.
Recovery is an important part of NDB. It is highly linked to the partitioning scheme in NDB, which in turn has a great impact on the thread configuration.
Setting up the optimal thread configuration requires a fair bit of understanding about what affects it; this is covered in a few dedicated chapters. Configuration of disk data in NDB and configuration of MySQL Servers for use with MySQL Cluster are each covered in their own chapters.
Setting NoOfReplicas and NoOfFragmentLogParts
NoOfReplicas is a very important parameter. It currently cannot be changed in any other manner than by initialising the cluster and restoring a backup, so it cannot be changed without downtime. It is normally set to 2, and this is the only supported setting other than 1, which is not used for high availability setups.
The number of partitions in a table for a default partitioned table is number of replicas multiplied by number of node groups multiplied by number of ldm threads.
The number of ldm threads can never be set higher than NoOfFragmentLogParts. Thus it is a very important parameter influencing the number of ldm threads. In MySQL Cluster 7.5 it is expected that NoOfFragmentLogParts is equal to the number of ldm threads except that NoOfFragmentLogParts is always set to at least 4.
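The rules above can be sketched as a small calculator. This is an illustrative sketch; the function names are not NDB configuration parameters:

```python
def default_partitions(replicas, node_groups, ldm_threads):
    # Default table partition count = replicas * node groups * ldm threads.
    return replicas * node_groups * ldm_threads

def expected_fragment_log_parts(ldm_threads):
    # In 7.5, NoOfFragmentLogParts is expected to equal the number of
    # ldm threads, but is always set to at least 4.
    return max(ldm_threads, 4)
```

For example, with 2 replicas, 2 node groups and 4 ldm threads, a default table gets 16 partitions.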
The number of ldm threads can be set in two ways. By setting the configuration parameter MaxNoOfExecutionThreads we implicitly define the number of ldm threads.
The mapping from MaxNoOfExecutionThreads to the number of ldm threads is as follows (the parameter cannot be set higher than 72):

| MaxNoOfExecutionThreads | ldm threads |
|---|---|
| 0-3 | 1 |
| 4-6 | 2 |
| 7-11 | 4 |
| 12-15 | 6 |
| 16-19 | 8 |
| 20-23 | 10 |
| 24-31 | 12 |
| 32-39 | 16 |
| 40-47 | 20 |
| 48-63 | 24 |
| 64-72 | 32 |
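This mapping can be mirrored in a small helper. This is a sketch of the ranges described above; consult the reference manual for the authoritative table:

```python
def ldm_threads_for(max_no_of_execution_threads):
    """Return the implied number of ldm threads for a given
    MaxNoOfExecutionThreads setting (sketch of the mapping above)."""
    t = max_no_of_execution_threads
    if t > 72:
        raise ValueError("MaxNoOfExecutionThreads cannot exceed 72")
    # (upper bound of range, ldm threads)
    ranges = [
        (3, 1), (6, 2), (11, 4), (15, 6), (19, 8),
        (23, 10), (31, 12), (39, 16), (47, 20), (63, 24), (72, 32),
    ]
    for upper, ldm in ranges:
        if t <= upper:
            return ldm
```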
The second method to set the number of ldm threads is by using the ThreadConfig parameter. This parameter will be described in more detail in the coming chapters on configuration of the threads in the data nodes.
It is possible to set the NoOfFragmentLogParts higher than the number of ldm threads. But there is no real gain from doing this. The best is to decide on the number of ldm threads you want to have and set NoOfFragmentLogParts to the same value. Changing the number of log parts can only be done using an initial node restart and old tables will have to be reorganised to use the new number of log parts. Tables can only be reorganised to increase number of partitions, the number of partitions in an existing table cannot be decreased other than through backup and restore.
Thus deciding on number of log parts and number of ldm threads is very important. The number of ldm threads should be set according to the amount of CPU resources available for the data node.
The preferred method of configuring thread resources is by using ThreadConfig. For production purposes this gives the ultimate control over your environment and the best chance to have a stable environment.
Setting MaxNoOfExecutionThreads is easier and can still be combined with CPU locking using LockExecuteThreadToCPU.
Setting up nodes
The cluster configuration must contain all nodes that will be part of the cluster. We have to define the data nodes; these correspond to the ndbd or ndbmtd processes we start in the cluster. Similarly the number of management server nodes equals the number of ndb_mgmd processes we start.
This is not true for MySQL Servers and other API nodes. In this case each MySQL Server can use a number of API node ids.
We strongly recommend setting node ids on all data nodes, on all management servers and on all static servers using the NDB API, such as the MySQL Server or any other program that is expected to be up whenever the cluster is up.
One can also define a few API node slots for NDB tools; these need not have a node id.
Similarly we strongly recommend that data nodes, management servers and all static servers (MySQL Servers and others) set the hostname allowed to connect on each node id.
For security reasons it is a good idea to do the same also with the free API node slots to control from where API nodes can connect to the cluster.
Setting up data nodes
I recommend that data nodes are the only nodes set up to use node ids 1 through 48. This makes it possible to grow the number of data nodes to the maximum without any hassle of reconfiguring API nodes and other nodes.
As mentioned, we recommend setting the node id and host name of each data node.
These parameters should be set in the section for each unique data node.
More or less all other parameters should be defined in the default section for data nodes ([NDBD DEFAULT]).
In addition each data node should set the parameter ServerPort. Not setting it means that the management server assigns dynamic ports to use when connecting to a data node.
More or less every modern IT environment has firewalls that make it very hard to work in such a manner. Thus it is recommended to set ServerPort for all data nodes. There is no IANA-defined port for this purpose. I recommend using port 11860: it is easy to remember, since the management server by default uses 1186, and there is no conflicting IANA assignment for this port number.
By using this port there are 3 port numbers to consider for MySQL Cluster installations: 3306 for MySQL Servers, 1186 for NDB management servers and 11860 for NDB data nodes. Using the X plugin and MySQL Document Store adds 33060 to the set of ports to open in the firewalls when setting up MySQL Cluster.
It is important to define at least DataDir to ensure that we have defined where the various files for a data node are stored. Other configuration items for file setup are described in the next chapter.
Setting up management server nodes
As mentioned, one should set the node id and host name of the management server.
PortNumber normally uses the default port 1186 for NDB management servers; otherwise it needs to be set.
DataDir defaults to /var/lib/mysql-cluster; this variable should always be set.
Setting up MySQL Server nodes
Given that each MySQL Server requires one API node for each cluster connection it uses, it is important to set up the correct number of MySQL Server nodes in the cluster configuration. For example, if the MySQL Server uses 4 cluster connections it is necessary to set up 4 API nodes. As mentioned above, it is recommended to set the node id and host name of all those nodes. In this case all 4 API nodes must use the same host name but different node ids.
Setting up API nodes
Setting up API nodes for permanent servers follows the same principle as for MySQL Servers. Setting up API nodes for use by the various NDB tools usually doesn't require any fancy setup. For security reasons it can be a good idea to set the host name; there is no specific need to set the node id on those nodes.
Here is a basic configuration file for two data nodes and two MySQL servers that are colocated. It only sets up the number of replicas, the nodes themselves, the data directories, hostnames and node ids. More details will be covered in later chapters.
```
[ndbd default]
NoOfReplicas=2
NoOfFragmentLogParts=4
DataDir=/usr/local/ndb_data
ServerPort=11860

[ndb_mgmd default]
DataDir=/usr/local/ndb_data

[ndb_mgmd]
NodeId=49
Hostname=192.168.1.102

[ndbd]
NodeId=1
Hostname=192.168.1.100

[ndbd]
NodeId=2
Hostname=192.168.1.101

[mysqld]
NodeId=51
Hostname=192.168.1.100

[mysqld]
NodeId=52
Hostname=192.168.1.101

[api]
```
Advanced Configuration of Data nodes
Data nodes have quite a lot of configurable resources and parameters that affect various internal algorithms.
Configuring memory resources
DataMemory and IndexMemory
The most important parameter for a data node is the amount of memory that can be used for data storage. Rows are stored in memory configured through DataMemory. IndexMemory is the memory used by the primary key hash index. Each row has one entry in such an index, and this includes unique index rows. Each table uses around 15 bytes of space in IndexMemory for each row; similarly each unique index consumes about 15 bytes per row.
DataMemory is also used by copy rows. Each row change (except an insert) creates a copy row that contains the data as it was before the change.
DataMemory is also used by ordered indexes; expect each row to consume about 10 bytes of memory per ordered index. If a table has 1 million rows and three ordered indexes, the indexes will consume around 30 MBytes of DataMemory.
Personally I usually set IndexMemory to around 10% of DataMemory. This works fine with row sizes around 100-200 bytes. Larger rows mean relatively less IndexMemory is needed; smaller rows mean more might be required.
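These rules of thumb can be sketched as a rough estimator. The 10- and 15-byte constants are the approximations quoted above, not exact values:

```python
def ordered_index_data_memory(rows, ordered_indexes, bytes_per_entry=10):
    # ~10 bytes of DataMemory per row per ordered index
    return rows * ordered_indexes * bytes_per_entry

def hash_index_memory(rows, bytes_per_entry=15):
    # ~15 bytes of IndexMemory per row for the primary key hash index
    return rows * bytes_per_entry
```

For the example above, 1 million rows with three ordered indexes gives about 30 MBytes of DataMemory for the indexes.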
We will always leave a bit of free space in DataMemory. The reason is that temporarily the database can grow during restart. To avoid running out of memory during recovery we won't allocate any more rows when we reach the limit on minimum free pages (except during restart when all memory is available). The default is that we save 5% of DataMemory. The default setting can be changed through the configuration parameter MinFreePct.
Another important memory resource parameter is the size allocated for the page cache. This is set in the configuration parameter DiskPageBufferMemory. By default this is set to 64 MBytes which is too small for almost every realistic use of disk columns. If disk data columns are used, it is important to set this parameter to a proper value.
Extent pages (there are 4 bits per extent in the extent pages) are permanently stored in memory. The memory allocated for the page cache should be substantially bigger than the memory required for storing the extent pages.
DiskPageBufferEntries is explained in the chapter on disk columns.
SharedGlobalMemory is a common resource that can be used for various purposes. If we run out of send buffers, we can extend the send buffers by up to 25% using this memory.
If we haven't defined InitialLogfileGroup in our cluster configuration file we will take the memory for the UNDO log buffer from this memory.
Memory used by disk data file requests are taken from this memory pool as well.
All the operation resources that are specific to pushdown joins directly use the memory in SharedGlobalMemory. The operation records used for metadata operations allocate 2 MBytes of memory; if they need to extend beyond this they start to use memory from SharedGlobalMemory.
Configuring Transaction resources
This is another area where changes are often required for production usage or benchmarking. The default settings will work fine in many situations and even for production in some cases. This section describes the parameters and when it is necessary to change them, and it describes the memory impact of increasing or decreasing them.
Any record size provided in this chapter is from a specific 7.5 version and can change by a few bytes in either direction without any specific notice.
When we execute any read or write of data we always use a transaction record in a tc thread. A transaction record is also used for every scan we execute in the cluster.
The maximum number of transaction records is set through the config parameter MaxNoOfConcurrentTransactions. By default it is 4096. The setting is per tc thread. With more tc threads we get more transaction records.
When using the MySQL Server to access NDB the number of transaction records in parallel is fairly limited. Using the asynchronous NDB API it is possible to execute many millions of parallel transactions. In this case it is good to increase this setting. The default setting is good for a fair amount of cases, but with highly concurrent applications it should be increased.
The total size per transaction record is 896 bytes, although the base record size is 288 bytes. We need one extra record to handle the complete phase of each transaction and one further record to ensure that we can handle node failure handling. In addition we have 2 records of 16 bytes in DBDIH for each transaction, giving 3 × 288 + 2 × 16 = 896 bytes in total.
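The arithmetic works out as follows (record sizes as quoted for this 7.5 version):

```python
BASE_TRANSACTION_RECORD = 288   # base transaction record in the tc thread
DBDIH_RECORD = 16               # two such records per transaction in DBDIH

def transaction_record_bytes():
    # base record + complete-phase record + node-failure-handling record
    # + two 16-byte DBDIH records
    return 3 * BASE_TRANSACTION_RECORD + 2 * DBDIH_RECORD

def transaction_memory_per_tc_thread(max_transactions=4096):
    # memory allocated per tc thread at the default
    # MaxNoOfConcurrentTransactions setting
    return max_transactions * transaction_record_bytes()
```

At the default of 4096 transaction records this comes to about 3.7 MBytes per tc thread.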
Primary key operations
Every primary key operation uses one operation record in the transaction coordinator and one in the ldm thread. If it is a write operation there is one operation record for each replica. In addition each scan operation also uses one operation record in the transaction coordinator and one operation record in each of the scanned fragments in the ldm threads. The number of operation records is configured by setting MaxNoOfConcurrentOperations.
This parameter impacts allocation both in tc threads and in ldm threads. It defaults to 32768. This will allocate 32768 records in each tc thread. In addition we will allocate 32768 multiplied by 1.1 and divided by number of ldm threads to each ldm thread. With multiple tc threads it is a good idea to specifically set the number of operation records in the ldm threads.
The number of local operations in ldm threads can be set through the configuration parameter MaxNoOfLocalOperations. The above describes the default setting.
With 4 ldm threads it is a good idea to set this to around 150000 to ensure that full use of operation records in tc threads can happen without running out of operation records in ldm threads. If most of the operations are writes the number should be set higher, since more operation records are used in the ldm threads per operation in a tc thread. As will be described later, local operation records are also allocated while performing local scan operations.
The default setting of 32768 provides only limited parallelism. In most of my benchmarks and experiments it has been necessary to increase it to at least a few hundred thousand. Setting this a bit higher can avoid issues with large transactions that are sometimes used by various tools, especially tools that load data. It is reasonable to use around 5% of the available memory for operation records, given that they are such an essential part of any reads and writes issued in the database. Each GByte of memory assigned to operation records provides for a parallelism of about 1.25 million row operations.
The size of operation records in the tc threads is 304 bytes. The base record size is around 152 bytes, but we need two records per operation; the second is required to handle node failure handling when we take over failed transactions.
The size of the local operation records in the ldm threads is 496 bytes. The record is actually split in one part in DBLQH, one in DBACC and one in DBTUP.
For a read the total operation record sizes used are 800 bytes and for an update with 2 replicas the operation record sizes used are around 1300 bytes.
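These numbers can be combined into a quick sizing sketch (using the record sizes quoted above; note the 1.25 million figure assumes a decimal GByte of 10^9 bytes):

```python
TC_OPERATION_BYTES = 304    # operation record in the tc thread
LDM_OPERATION_BYTES = 496   # local operation record in an ldm thread

def read_operation_bytes():
    # a read uses one tc record and one ldm record
    return TC_OPERATION_BYTES + LDM_OPERATION_BYTES

def write_operation_bytes(replicas=2):
    # a write uses one ldm record per replica
    return TC_OPERATION_BYTES + replicas * LDM_OPERATION_BYTES

def read_parallelism_per_gbyte():
    # parallel row operations supported per GByte of operation records
    return 10**9 // read_operation_bytes()
```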
Scans use a transaction record and a scan record in the tc thread. The scan record size is 120 bytes. The maximum number of parallel scans is set per tc thread; it defaults to 256 and has a maximum of 500. The reason that the number of scans is limited is the high parallelism we employ in scan operations; it is necessary to limit the number of parallel scans in the cluster to avoid various overload cases. The default is changed through the parameter MaxNoOfConcurrentScans.
Each scan also allocates a set of scan fragment records in the tc thread (currently 64 bytes in size). The number of scan fragment records is one if the scan is a pruned scan; otherwise it is equal to the maximum scan parallelism. The maximum parallelism is the number of partitions in the table. It is possible to limit the scan parallelism by using the SO_PARALLEL option in the NDB API. There is no specific setting in the MySQL Server to change this; scans from the MySQL Server always use a parallelism equal to the number of partitions in the table. For a default table the number of partitions is the number of nodes multiplied by the number of ldm threads, e.g. with 4 data nodes and 4 ldm threads there would be 16 partitions.
All the scan resources are released as soon as the scan has completed. Only operation records from key operations are kept until commit time. A delete that is using a scan has to execute a primary key operation that uses a special scan takeover request to ensure that the lock and the update are kept until commit time.
Given that scan resources are released after execution, the amount of scan records is more based on the amount of parallelism rather than the amount of data in the cluster. The default setting of 256 parallel scans per tc thread should work fine in most cases.
The number of scan fragment records is the number of scan records multiplied by the number of tc threads, the number of ldm threads and the number of data nodes in the cluster. For example, with 2 tc threads, 4 ldm threads, 4 data nodes and a maximum of 256 parallel scans we will have 8192 scan fragment records in each tc thread.
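The example above works out as follows, following the text's formula literally:

```python
def scan_fragment_records(scans_per_tc, tc_threads, ldm_threads, data_nodes):
    # scan records * tc threads * ldm threads * data nodes
    return scans_per_tc * tc_threads * ldm_threads * data_nodes
```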
The same number of scan records exists in the ldm threads as there are scan fragment records in the tc threads.
For each scan we have a scan record that is in total 588 bytes; this record is divided into a 96 byte record in DBACC, a 112 byte record in DBTUP, a 136 byte record in DBTUX and a 232 byte record in DBLQH. We also allocate an operation record for each scan that is 496 bytes in size. In total we allocate records of 1084 bytes per local scan in an ldm thread.
In addition, for each scan record in the ldm threads we allocate a certain number of lock records. A lock record is 120 bytes in size (a 96 byte record in DBACC plus 12 byte records in DBTUP and DBTUX). We allocate as many lock records as our batch size per local scan; the default for this parameter is 256. Thus each scan record allocates 30 kBytes of lock records. In the above example we will have a total of 16384 scan records in the ldm threads, consuming around 500 MBytes of memory, most of which is the lock records.
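Putting these numbers together (record sizes from this 7.5 example):

```python
LOCK_RECORD_BYTES = 120         # 96 bytes in DBACC + 12 in DBTUP + 12 in DBTUX
LOCAL_SCAN_RECORD_BYTES = 1084  # 588-byte scan record + 496-byte operation record

def lock_bytes_per_scan(batch_size=256):
    # one lock record per row in the scan batch
    return batch_size * LOCK_RECORD_BYTES

def total_scan_memory(scan_records, batch_size=256):
    # scan records plus their preallocated lock records
    return scan_records * (LOCAL_SCAN_RECORD_BYTES + lock_bytes_per_scan(batch_size))
```

With 16384 scan records this gives roughly 520 MBytes, matching the approximately 500 MBytes quoted above.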
Most scans do not use locking, so the above allocation is extremely conservative. It means that scans can lock up to 4 million records, 1 million records in each ldm thread. Remember that a normal scan from SQL uses READ COMMITTED mode, which uses no locks at all. If we do use a lock on a scan, we only hold the lock from when the row is scanned until the NDB API has signalled back that it has completed using the row. This time period is normally less than a millisecond. The risk of using up all the lock records is negligible.
To decrease the allocation of lock records one can set BatchSizePerLocalScan to a lower value. Setting it to e.g. 64 should most likely suffice for almost all installations. Even lower values are plausible if the number of scans using locks is low; even a setting as low as 8 still provides the capability to handle 128k concurrent scan locks in the above example.
It is possible to set the number of scan fragment records in ldm threads through the configuration parameter MaxNoOfLocalScans if the above default setting isn't ok. To increase the operation records also used by local scans, one needs to increase MaxNoOfLocalOperations.
As can be seen, the default setup for scans is configured to handle the worst case, where no scans are pruned and every scan uses locks with a batch size of 256.
If the application is using a lot of pruned scans we can decrease the amount of local scan records by setting MaxNoOfLocalScans lower.
If the application never uses locking scans, or the scan batch size is smaller than 256, BatchSizePerLocalScan can be set significantly lower, saving a lot of memory.
It is important to have the same configuration of these values in all nodes. These changes should always be set in the default section for data nodes in the cluster configuration file. Lowering those values can cause temporary out-of-resource issues, since a node that still has high values will consume resources in the nodes with a lower configuration.
The only problem arising from misconfiguration of scans is temporary errors; all scans in internal algorithms use preallocated records.
Unique index operations
Access to a record through a unique index uses two steps, the first step is a read of the unique index, the second step is a normal operation using the primary key read from the unique index.
The first phase uses a special record in the tc thread. This record is only kept until the first read of the unique index has completed; immediately when the read has completed, the record is released. The record has a size of 160 bytes and the default is 8192 such records per tc thread.
To avoid concurrency issues one can set this to a higher value, or to save a bit of memory one can set it lower, using the configuration parameter MaxNoOfConcurrentIndexOperations.
It should be very rare that anyone needs to change this variable.
Similar to unique index operations, we keep a record for each triggered operation from when we receive it until we have executed it. This record is 80 bytes in size and we have 4000 records per tc thread. It is a bit more common to run out of this resource when committing large transactions with many foreign key checks. It should be very rare to need to change this parameter; if required, the parameter is MaxNoOfFiredTriggers.
When a trigger is fired, its values are sent to the tc thread executing the transaction. Until the resulting action is executed we keep the before value and after value sent in the trigger in a buffer. The size of this buffer is set through TransactionBufferMemory; by default it is 1 MByte per tc thread.
Limiting maximum transaction size
It is possible to limit the maximum size of a transaction. By default this option is off; to set it, use the parameter MaxDMLOperationsPerTransaction.
The configuration of transaction resources can admittedly be a bit daunting. Work is ongoing to improve this significantly in future versions of MySQL Cluster, to ensure that transaction resources can be configured with at most a single configuration variable. The MaxNoOfConcurrentScans parameter is likely to stay, since this configuration parameter also ensures that we don't overload the cluster.
Configuring Schema resources
This section needs to be considered when using MySQL Cluster with a large number of tables, attributes, foreign keys, indexes and so forth. It describes the parameters and when to consider changing them, and the memory impact of those parameters.
MaxNoOfAttributes defines the maximum number of columns that can be defined in the cluster.
There are two important records that are influenced by the number of columns. The first is in the block DBDICT, where we have a record of 160 bytes that describes the column and its features. The second memory area is in DBTUP, where we have a great number of pointers to the methods used to read and write the columns. This area is 120 bytes per column in each ldm thread. Thus in a node with 4 ldm threads we will use 160 + 4 × 120 = 640 bytes per column. The default setting of MaxNoOfAttributes is 1000. This is a very low value, so it should probably be increased for anything except very minimal cluster sizes. Even increasing it to 10000 should not be any concern. In large applications it is conceivable that the number of columns in the cluster is even counted in millions.
The maximum number of columns in the cluster will also affect the size of memory allocated for StringMemory as explained below. In the default setting it will use an additional 3533 bytes per column.
Number of tables
There are three types of objects that all use table objects. Normal tables are the obvious one; unique indexes are stored as tables, so these have to be counted as well. In most situations ordered indexes are also treated as table objects, although they are tightly connected to normal tables and their partitions.
When calculating memory used per table, the three values MaxNoOfTables, MaxNoOfOrderedIndexes and MaxNoOfUniqueHashIndexes are added. Their default values are 128, 128 and 64, giving a default of 320 table objects. The maximum number of table objects is limited by DBDICT, where at most 20320 table objects can be defined.
In most places the sum of those values is used; when calculating the number of triggers in the system, however, they are used individually. The number of ordered indexes has a special impact on the number of objects in DBTUX, which is natural since DBTUX is only used by ordered indexes.
There are many different records describing table objects. In DBDICT we have a table record of 280 bytes, a metadata object of 88 bytes and a global key descriptor of 520 bytes. In addition we have two hash tables on metadata object id and metadata object name; the hashes have at most an overhead of 8 bytes per table object. The table record in DBDIH is a bit bigger and consumes 3296 bytes. In addition we have around 24 bytes of memory per table object in each tc thread. In the ldm threads we have 76 bytes per table object in DBACC, 72 bytes in DBLQH, 144 bytes in DBTUX and 544 bytes in DBTUP, a total of 836 bytes per ldm thread.
In total, in the example with 2 tc threads and 4 ldm threads, we use about 7700 bytes of table records per table object.
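The per-table-object figures above sum up as follows (the quoted ~7700 bytes is an approximation; the exact sum of the listed record sizes is slightly lower):

```python
def table_object_bytes(tc_threads, ldm_threads):
    # table record + metadata object + key descriptor + hash overhead
    dbdict = 280 + 88 + 520 + 8
    dbdih = 3296                    # table record in DBDIH
    per_tc = 24                     # per tc thread
    per_ldm = 76 + 72 + 144 + 544   # DBACC + DBLQH + DBTUX + DBTUP = 836
    return dbdict + dbdih + tc_threads * per_tc + ldm_threads * per_ldm
```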
Table objects are not the only memory consumers for metadata objects. Memory consumption also comes from the fragment records in various blocks.
The default number of fragments per ldm thread is equal to the number of replicas in the cluster set by NoOfReplicas, thus normally 2. Each fragment record is 496 bytes in DBLQH, 304 bytes in DBACC, 472 bytes in DBTUP and 96 bytes in DBTUX. DBTUX only stores fragment records for ordered indexes, so this record can almost be ignored in the calculations. Assuming 2 replicas in the cluster, we have around 2600 bytes of fragment records per table object per ldm thread. With 4 ldm threads we consume about 10400 bytes per table object in fragment records.
We have both fragment records and replica records in DBDIH. DBDIH stores one fragment record for each fragment in the cluster; the number of fragments per table by default is the number of nodes multiplied by the number of ldm threads. The number of replica records is this number multiplied by the number of replicas in the cluster. The fragment records are 48 bytes and the replica records are 148 bytes. With 2 nodes, 4 ldm threads and 2 replicas we use 2752 bytes per table object in DBDIH.
In the example case with 2 data nodes, 4 ldm threads, 2 tc threads and 2 replicas we consume a total of 20852 bytes per table object. Thus the table and fragment records in a default setup (a total of 322 table objects) use around 6.5 MBytes of metadata storage.
Thus even with the maximum number of table objects in this setup (20320) we consume no more than around 400 MBytes of metadata for table, fragment and replica records.
Scaling to a higher number of ldm threads increases most sizes linearly, since more threads need table and fragment records while the number of fragments per ldm thread stays constant.
Increasing the number of nodes influences the storage in DBDIH, where more memory is needed since the number of fragments in a table increases linearly with the number of nodes. For example, with 48 data nodes and 4 ldm threads we get about 66 kBytes per table object in DBDIH.
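The DBDIH figures in both examples can be reproduced with:

```python
def dbdih_bytes_per_table(data_nodes, ldm_threads, replicas=2):
    # fragments per default table = nodes * ldm threads;
    # one replica record per fragment per replica
    fragments = data_nodes * ldm_threads
    replica_records = fragments * replicas
    return fragments * 48 + replica_records * 148
```

With 2 nodes and 4 ldm threads this gives 2752 bytes per table object; with 48 nodes it gives 66048 bytes, about 66 kBytes.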
The default setting of MaxNoOfTriggers is based on the number of table objects that the cluster should handle. The default setting is 768, but when the value set in the configuration (or the default) is smaller than the calculated value, the calculated value is used.
The calculated value assumes that each unique index requires 3 triggers (one each for insert, update and delete operations) and that each ordered index requires 1 trigger. In addition we allow for one set of 3 triggers for a metadata operation such as add partitions. Each fully replicated normal table will use 3 triggers as well, and we allow for 100 table events. This number is compared to the configured value and the higher value wins.
Thus the main reasons to increase MaxNoOfTriggers are to ensure that we can add foreign keys, that we can add triggers for altering the number of table partitions, and that users can add more events to the system.
The amount of memory used for triggers is fairly small. The DBDICT block consumes 116 bytes per trigger object, the DBTC block consumes 36 bytes per tc thread and the DBTUP block consumes 112 bytes per ldm thread. In our example with 2 tc threads and 4 ldm threads each trigger will use 636 bytes of memory.
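The per-trigger memory can be summed the same way:

```python
def trigger_bytes(tc_threads, ldm_threads):
    # DBDICT record + DBTC record per tc thread + DBTUP record per ldm thread
    return 116 + 36 * tc_threads + 112 * ldm_threads
```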
We will have extra trigger records in DBTUP; these extra records are there to accommodate backups, which do not use global triggers, only local triggers. We add 3 extra trigger records for each table defined by MaxNoOfTables.
Given that the memory consumption of triggers is low, it is a good idea to set this configuration parameter such that it is prepared for a set of foreign keys, add-partition operations and events on all tables.
StringMemory defines the amount of memory to allocate for column names, default values for columns, and names of tables, indexes and triggers; it is also used to store the FRM files used by MySQL Servers to store metadata.
In NDB the FRM file is stored inside NDB; when a MySQL Server connects to the cluster, it receives the tables and their metadata from NDB and stores them in the local file system of the MySQL Server.
The maximum size of a table name is 128 bytes, the maximum size of a column name is 192 bytes, and the maximum size of the default values is 14004 bytes (4 plus the maximum row size in NDB). The maximum size of an FRM file is 6000 words.
DBDICT will allocate memory to ensure that we can fit even very large table names, column names and so forth. The size is calculated based on the maximum sizes and the maximum numbers of columns, table objects and triggers.
This calculated size is an overestimate to use for allocation, so StringMemory is defined as a percentage. If the memory is calculated to be 20 MBytes and we use the default setting of 25, then the allocated size will be 25% × 20 MBytes = 5 MBytes.
Even 25 is a bit large; for example, most columns have a much smaller default value than the maximum size of around 14000 bytes. Thus going down to 5% is mostly ok.
Assuming that we set this value to 10%, each column will use about 1420 bytes in StringMemory and each table object will consume roughly 620 bytes. In reality tables would probably require a bit more, whereas columns require a lot less.
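The per-column figure can be approximated from the maximum sizes quoted above. Note that the exact constants used internally may differ slightly from version to version:

```python
MAX_COLUMN_NAME_BYTES = 192
MAX_DEFAULT_VALUE_BYTES = 14004   # 4 + maximum row size

def string_memory_per_column(string_memory_pct):
    # StringMemory is a percentage of the theoretical maximum per-column size
    return (MAX_COLUMN_NAME_BYTES + MAX_DEFAULT_VALUE_BYTES) * string_memory_pct / 100
```

At 10% this gives about 1420 bytes per column, matching the figure above.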
The actual default value used in normal query processing is stored in DBTUP, which uses DataMemory to store one such default row per table per ldm thread. The size of this default row is limited by the maximum row size, currently 14000 bytes.
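Based on the discussion above, StringMemory can usually be lowered well below its default of 25. A config.ini fragment under that assumption (the value 5 matches the recommendation later in this chapter, not a universal rule) might look like:

```ini
[ndbd default]
# StringMemory is a percentage of the theoretical maximum calculated by
# DBDICT; 5-10% is usually enough, since real column names and default
# values are far smaller than the allowed maxima.
StringMemory=5
```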
Configuring events is ensuring that sufficient amount of records exist in the block SUMA where events are handled.
A subscription is 128 bytes in size, thus it doesn't take up a lot of resources. The most common use case for subscriptions is MySQL Cluster Replication.
Normally in this case there is one subscription per table that is replicated. The default value of this parameter is 0; in this case the value used is the sum of MaxNoOfTables, MaxNoOfUniqueHashIndexes and MaxNoOfOrderedIndexes. Given that a subscription can only be set on normal tables, we get a few more subscription records than required for normal MySQL Cluster Replication.
This is good, since subscriptions are also used for online metadata operations such as reorganising partitions, building indexes and building foreign key relations.
Normally there is only one subscription used even if there are multiple MySQL servers doing replication. They will subscribe to the same subscription. The handler has a specific name on the subscription to ensure that it knows how to find the subscription.
If the replication setup is different for the MySQL replication servers they will require different subscriptions, in this case the name of the subscription will differ. One case where this will happen is if the MySQL replication servers use a different setting of the --ndb-log-updated-only configuration option.
NDB API programs that define events on tables are also likely to use their own subscriptions. It is possible to synchronise several NDB API nodes to use the same subscription by using the same event name.
The default setup will work just fine in most cases, but if the user has different MySQL replication server setups and/or is using the NDB Event API, it would be required to increase the setting of MaxNoOfSubscriptions.
A subscriber uses a subscription, so there need to be more subscriber records than subscriptions. Each NDB API node or MySQL Server node that listens to events from a table uses a subscriber record. The default number of subscriber records is twice the default number of subscriptions. Thus the design handles two MySQL replication servers in the cluster listening to all table events without any issues, and even a bit more than that.
The subscriber record is only 16 bytes, thus there is very little negative consequence in increasing this value to a higher setting.
Suboperation records are used during the creation phase of subscribers and subscriptions. The default of 256 should normally be sufficient. The record is only 32 bytes in size, so there is no problem in increasing this setting even substantially.
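The subscription-related resources discussed above can be set together in the [ndbd default] section. The fragment below shows the defaults as described in the text; the parameter name MaxNoOfConcurrentSubOperations for the suboperation records is the standard name and is assumed here, since the text does not spell it out:

```ini
[ndbd default]
# 0 means: derive from MaxNoOfTables + MaxNoOfUniqueHashIndexes
#          + MaxNoOfOrderedIndexes
MaxNoOfSubscriptions=0
# Defaults to twice the number of subscriptions; each record is 16 bytes,
# so raising it is cheap.
MaxNoOfSubscribers=0
# Suboperation records, 32 bytes each.
MaxNoOfConcurrentSubOperations=256
```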
Event operations were explained in some detail in the chapters on MySQL Cluster Replication and in the chapter on the C++ NDB API.
Events are gathered one epoch at a time. One epoch is a group of transactions that are serialised relative to other epochs: all transactions in epoch n committed before any transaction in epoch n+1 was committed.
The time between epochs defaults to 100 milliseconds and this value is configurable through the parameter TimeBetweenEpochs.
Now the SUMA block that maintains events grouped in epochs must keep track of epochs. For this it uses a special epoch record, only 24 bytes in size. By default we have 100 such records, so the memory used for these records is completely negligible. It is still important to set this parameter properly. The default setting of 100, together with the default setting of 100 milliseconds for TimeBetweenEpochs, means that we can track epochs for up to 10 seconds. If any part of the event handling cannot keep up with epochs rapidly enough, subscribers will start losing events, and this will lead to us closing down the subscriber. The parameter is set as a number of epochs.
Setting this a bit higher is not a problem. What can be problematic is whether we will have memory buffers enough to handle more epochs. This depends on the update rate in NDB and amount of memory buffers we have available for saving epochs.
The buffer space for epochs is controlled by the configuration parameter MaxBufferedEpochBytes. This value defaults to around 25 MByte. In a high-performance environment it is quite likely that this buffer should be significantly increased.
The parameter TimeBetweenEpochsTimeout can be set to a non-zero value; in this case we will crash the node if it takes longer than this value to complete the epoch.
The parameter RestartSubscriberConnectTimeout is an important timeout during node restart. If an API node or a MySQL server has subscribed to some table events, the starting node will wait for this API node or MySQL server to connect during the last phase (phase 101) of the node restart. Normally these nodes will connect very quickly and the restart will proceed. If a node subscribing to events is down, this is the time we will wait for that node to connect again.
By default this timeout is set to 2 minutes (set to 120000, set in milliseconds).
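The epoch-related parameters above can be summarised in a config.ini fragment. The values shown are the defaults as described in the text; the byte value for MaxBufferedEpochBytes is the standard default and is an assumption here since the text only says "around 25 MByte":

```ini
[ndbd default]
TimeBetweenEpochs=100                   # milliseconds between epochs
TimeBetweenEpochsTimeout=0              # 0 = never crash on slow epoch completion
MaxBufferedEpochs=100                   # 100 epochs * 100 ms = 10 seconds tracked
MaxBufferedEpochBytes=26214400          # ~25 MByte; raise in high-load setups
RestartSubscriberConnectTimeout=120000  # 2 minutes, in milliseconds
```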
Basic thread configurations#
The preferred manner of setting up the thread configuration is to use the ThreadConfig variable. This is explained in detail in a coming chapter. A quick manner of configuring the threads is to set the MaxNoOfExecutionThreads variable.
The MySQL manual describes exactly how many threads of various sorts we get for each setting of this variable from 0 to 72. As described in the previous chapter, it is important to keep track of how many ldm threads the setting generates and to set NoOfFragmentLogParts to the number of ldm threads.
It is possible to use a simple method to lock threads to CPUs by using the configuration variable LockExecuteThreadToCPU. This variable expects a comma separated list of CPU ids. The number of CPUs must be at least as many as the threads defined (ldm, tc, main, rep, send and recv threads).
LockMaintThreadsToCPU can be combined with LockExecuteThreadToCPU to set the CPU to use for io threads. Only one CPU can be given in this configuration variable; if it is necessary to lock io threads to more than one CPU, ThreadConfig must be used.
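A quick thread setup following the text might look like the fragment below. The CPU ids and the value 8 are illustrative assumptions for an example host; check the MySQL manual's table for the exact thread breakdown a given MaxNoOfExecutionThreads value produces, and match NoOfFragmentLogParts to its ldm thread count:

```ini
[ndbd default]
MaxNoOfExecutionThreads=8
# Must equal the number of ldm threads generated by the setting above
# (assumed to be 4 for this example value).
NoOfFragmentLogParts=4
# At least one CPU per thread (ldm, tc, main, rep, send, recv).
LockExecuteThreadToCPU=0,1,2,3,4,5,6,7
# Single CPU for io threads; use ThreadConfig for more than one.
LockMaintThreadsToCPU=8
```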
The remainder of the thread configurations is considered advanced and will be explained in a coming chapter.
There is a wide range of configuration items that affect restarts and the handling of node failures.
NDB is designed for automatic restart. By default a data node executes as two processes; the starting process is called the angel process and is responsible for restarting the node automatically after a crash. There are three configuration parameters controlling this behaviour.
Setting StopOnError to 1 means that the data node process will stop at failures. Default is 0, thus the default is to automatically restart.
If a node restart fails, we count the failure. MaxStartFailRetries is the number of attempts we make before we stop the data node completely. As soon as a restart is successful the fail counter is reset to 0. By default this is set to 3.
It is possible to invoke a delay before the node restart begins after a crash. This delay is set in StartFailRetryDelay, the unit is seconds. By default there is no delay.
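The three angel-process parameters can be set as follows; the values shown are the defaults described above:

```ini
[ndbd default]
StopOnError=0          # 0 = angel process restarts the node after a crash
MaxStartFailRetries=3  # give up after 3 failed restart attempts in a row
StartFailRetryDelay=0  # seconds to wait before a restart attempt; 0 = no delay
```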
Whenever a node starts up there is a set of timers that affect for how long we will wait for more nodes to join before we proceed with the restart.
StartPartialTimeout affects how long we will wait for more nodes to join when the cluster is starting up. If we discover during heartbeat registration that the cluster is already up and running, we will start immediately. During the wait we print in the node log which nodes we are waiting for, which nodes have connected, and how long we have waited.
In the case of an initial start of the cluster we will always wait for all nodes to come up, except when the user has provided a set of nodes not to wait for on the command line using the --nowait-nodes option.
This parameter defaults to 30000, meaning 30 seconds since the unit is milliseconds. Setting it to 0 means waiting forever until all nodes have started.
This parameter (StartPartitionedTimeout) provides the possibility to start a cluster even with a suspected network partitioning problem. Setting it to 0 means that we will never start in a partitioned state. By default it is set to 60000, thus 60 seconds.
This parameter should always be set to 0; the default value is not a good choice. It is strongly recommended not to start up if it is suspected that the cluster is in a network partitioned state.
This parameter is by default set to 0, which means that a restart will proceed until it is done.
If we know by experience that a restart should complete within a certain time period, we can set this parameter to a high value to ensure that we get a crash instead of an eternally hanging node.
One should be very careful in setting this parameter (the unit is milliseconds), given that restarts of large databases can at times take a substantial amount of time.
This parameter sets the time we will wait while starting for nodes that have no node group assigned. A node has no node group if it is assigned NodeGroup equal to 65536. Nodes without a node group contain no data, so they are not relevant for cluster operation; they are only relevant for extending the cluster size.
This parameter is set to 15000 (15 seconds) by default.
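The start-time waits discussed above can be collected in one fragment. The values follow the recommendations in the text (StartPartitionedTimeout set to 0 rather than its default); StartNoNodeGroupTimeout is the standard name for the last parameter and is assumed here:

```ini
[ndbd default]
StartPartialTimeout=30000      # wait up to 30s for more nodes at cluster start
StartPartitionedTimeout=0      # never start in a possibly partitioned state
StartNoNodeGroupTimeout=15000  # wait 15s for nodes without a node group
```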
Heartbeats are mainly interesting for handling the case when a computer dies completely. When a data node process fails, the OS will ensure that its TCP/IP sockets are closed and thus the neighbouring nodes will discover the failure. Heartbeats are also good for handling cases where a node is not making any progress.
For example, on Mac OS X I have noticed that accessing a zero pointer leads to a hanging process. The heartbeat will discover this type of failure.
This parameter specifies the heartbeat interval between data nodes. It was originally set to 1500 milliseconds; thus a node will be considered failed in at most 6 seconds, after 3 missed heartbeats. Every missed heartbeat is logged in the node log; if a node is often close to the limit, the node log and cluster log will show this.
To ensure that NDB works better for early users we changed the default to 5000 milliseconds, thus it can take up to 20 seconds before a node failure is discovered.
In a high availability setup with computers dedicated to NDB and its applications, it is possible to decrease this configuration parameter. How low it can be set depends on how much the OS can be trusted to provide continuous operation. Some OSs can leave a thread unscheduled for a second or two even when there is no real resource shortage.
In a production environment it is recommended to use the old default 1500 milliseconds or even lower than this.
It is important to understand that when changing this value, one should never increase it by more than 100% and never decrease it to less than half. Otherwise the sending of heartbeats will cause node failures with neighbouring nodes that haven't updated their heartbeat interval yet. The heartbeat intervals in different data nodes should only differ during a rolling restart that changes the configuration.
This parameter sets the heartbeat interval between an API node and a data node. The heartbeats are sent from the API node once every 100 milliseconds and it expects a return within three heartbeat periods. This parameter has a default of 1500 milliseconds.
The consequence of an API node losing its connection to a data node is smaller than a data node losing its connection. If an API node is declared down, it will normally reconnect again immediately.
The heartbeat protocol assigns a dynamic id to each node when it starts. This id gives the order in which the nodes started; thus the node with the lowest dynamic id is the oldest node and is chosen as the master after a failure of the master node.
Heartbeats are sent to the right neighbour and received from the left neighbour. The right and left neighbours are defined by the order in which the nodes started, thus by the dynamic id.
It is possible to specifically set the order of the nodes in the heartbeat protocol instead of using the default.
The manual contains some description of when it might be useful, but I am not convinced that it is a very useful feature.
This is a boolean variable that adds two more heartbeat intervals before a node is declared dead. After three failed heartbeat intervals the node enters a suspected state, and in this state the node is checked from more than one node during the last two heartbeat intervals.
This parameter makes it possible to improve the heartbeat handling in environments where the communication latency can vary.
This is a rather new feature and an improvement compared to only using heartbeats between neighbours. If it is used, one should adjust the heartbeat settings, since with this extra check a node will not be declared failed until after missing 5 heartbeats.
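A heartbeat setup for a dedicated production environment, following the recommendations above, could look like this (ConnectCheckIntervalDelay enables the extended connectivity check just described):

```ini
[ndbd default]
HeartbeatIntervalDbDb=1500   # the old default; lower than the 7.5 default of 5000
HeartbeatIntervalDbApi=1500  # API heartbeats sent every 100 ms, 1500 ms timeout
ConnectCheckIntervalDelay=1  # extra connectivity check before declaring a node dead
```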
Configuring arbitration in data nodes#
There are several ways to setup arbitration in the cluster. The default behaviour is that the arbitrator is asked for its opinion when we cannot be certain that no cluster split has occurred.
To ensure that it is possible to use clusterware to enable running the cluster even with only two computers, we have implemented a special mode where external clusterware makes the decision on arbitration.
This parameter sets the timeout waiting for a response from the arbitrator. If no response comes within this time period the node will crash. The default for this period is 7500 milliseconds.
By default this string variable is set to Default. It can also be set to Disabled which means that no arbitration is performed and thus the cluster crashes if there is a risk of network partitioning.
There is a mode WaitExternal that was explained in the chapter on handling node crashes in NDB. This ensures that a message is written into the cluster log ensuring that some external clusterware can decide which part of the cluster to kill.
Optimising restart execution#
This parameter enables the use of extra threads during restart to build the ordered indexes. The number of threads used is limited by the number of fragments of a table within one data node. Thus setting it to 128 is a simple method of ensuring that the maximum possible parallelism is used for ordered index building.
By default this is not enabled. It is recommended to enable this feature; it has been used successfully for many years by multiple users and has been used extensively in test suites for as long as I can remember. It is enabled by default in MySQL Cluster 7.6.4.
This parameter is a boolean that is not set by default. By setting it, an initial node restart is executed in two phases: the first phase loads the data from another node, after which a fully parallelised ordered index build is performed. Then a second phase of copying data from the live nodes is performed, where only the changes since the first phase are copied to the starting node.
It is recommended to set this parameter. It has been set in crash tests for many years. It was set to default in the MySQL Cluster 7.6.4 DMR.
This parameter sets the parallelism used when copying data from the live node to the starting node. By default the parallelism is equal to the number of ldm threads. There should normally never be any reason to change this variable.
This configuration was probably invented to handle some support case where indexes were corrupt and thus no restart was possible. It should never be set unless as a last option to get the cluster up again to restore its data.
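The two restart optimisations recommended above can be enabled as follows (both became defaults in MySQL Cluster 7.6.4):

```ini
[ndbd default]
BuildIndexThreads=128            # maximum parallelism for ordered index builds
TwoPassInitialNodeRestartCopy=1  # two-phase copy during initial node restart
```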
Configuring Deadlock Detection#
Deadlocks can occur in a database that uses row locks to implement concurrency control. Our method of resolving deadlocks is based on timeouts: when an operation doesn't return from a read or a write, we assume that it is stuck in a lock queue and involved in a deadlock.
NDB is designed for applications using short transactions that complete in a number of milliseconds; when an operation doesn't return within seconds, we assume that something is wrong. An operation would normally return in less than a millisecond, and even in a highly loaded cluster it would rarely take more than a few milliseconds.
This deadlock detection timeout is also used to ensure that transactions relying on a node that has died will be aborted. The timeout should not be materially larger than the time it takes to detect a node failure using heartbeats; normally it should be much smaller than this.
Short deadlock detection timeout means that we recover quickly from deadlocks. At the same time if it is too short we might get too many false indications of deadlocks in a highly concurrent application.
This parameter sets the deadlock detection timeout. The default deadlock detection timeout is 1200 milliseconds.
Deadlock detection timers are not constantly checked. A check of all running transactions is performed with some delay in between. This delay can be increased with this parameter, it defaults to once per second and should normally not need to be changed.
This sets the time that a transaction is allowed to be inactive from the API side. In this case we are not waiting for any nodes; rather, we are waiting for the API to decide to proceed. By default this wait is almost forever.
Given that the transaction can hold locks while waiting for the API, it is a good idea to set this to a much lower value of say 1500 milliseconds. This ensures that misbehaving APIs will not mess up the database.
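The two timeouts discussed above could be set as below. The parameter names TransactionDeadlockDetectionTimeout and TransactionInactiveTimeout are the standard names and are assumed here, since the text does not spell them out:

```ini
[ndbd default]
TransactionDeadlockDetectionTimeout=1200  # milliseconds; the default
TransactionInactiveTimeout=1500           # abort transactions idle at the API side
```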
Logging in NDB data nodes happens in a number of places. We have the cluster log: log events are generated in the data nodes and sent to the management server, which writes them into the cluster log. It is possible to also direct these messages to the node log.
We have the node log; various log messages that are only written locally end up here, as do normal printouts.
The cluster log levels are controlled by a number of configuration parameters. Each such parameter controls a specific part of the logging. Each message is active at a certain log level and not on lower log levels.
The log levels that are definable are the following:
LogLevelStartup, LogLevelShutdown, LogLevelStatistic, LogLevelCheckpoint, LogLevelNodeRestart, LogLevelConnection, LogLevelCongestion, LogLevelError, LogLevelInfo
These can be changed temporarily from the NDB management client.
Whenever a data node crashes we generate a set of trace files describing the crash. MaxNoOfSavedMessages controls the number of crashes that are saved. By default this is set to 25; when the 26th crash occurs, it overwrites the error log information and the trace files generated by the first crash.
This is the buffer used to store log events in the data node until they are sent to the cluster log in a management server. It is by default 8 kBytes.
The Diskless parameter is useful for testing when the servers don't have sufficient disk bandwidth to handle the load.
It can also be useful when the application has no need of recovery. For example, a stock market application has no use for database recovery, since the stock market data will have completely changed by the time the cluster has recovered from a failure. In this case one can set Diskless to 1 to ensure that all writes to the file system are thrown away. It is similar to mounting the file system on /dev/null.
A better method of doing this is to create the tables with the NOLOGGING feature. This ensures that the tables are still around after a restart although the tables are empty.
NDB data nodes are designed around a virtual machine that handles the execution of signals and ensures that we can communicate those signals within a thread, between threads and to other nodes.
The virtual machine is designed such that when a signal executes, it is up to the signal execution code to return control to the virtual machine layer. Thus, as an example, a runaway signal that enters an eternal loop cannot be detected by the virtual machine.
Signals are supposed to execute only for a few microseconds at a time. This is by design, and so of course a bug can cause a signal to execute for much longer.
Now if a thread executes for too long, it is likely to be detected by other nodes in various ways through timers. But a deadlock in one thread could go undetected if the other threads keep the other nodes happy.
To detect such runaway threads we use a watchdog mechanism. Each thread registers with a watchdog thread. This watchdog thread wakes up regularly and checks whether the threads have made progress since last time. When the watchdog timer expires without any progress being reported, the node will crash.
This is the time between watchdog checks. When four such intervals have expired and no progress has been reported, we declare the node dead. The default setting is 6000 milliseconds; thus we detect watchdog failures after at most 24 seconds.
During memory allocation it is more common to get stuck for a long time. Thus if 24 seconds isn't enough time to allocate memory, we can increase the time with this parameter. When working with very large memories this can easily happen; in a large data node it would be fairly normal to have to increase this variable.
Configuring index statistics#
Index statistics are maintained by the NDB data nodes, but are used by the MySQL server. The configuration of index statistics requires both a part in the data nodes and in the MySQL Server.
By default the data node will not generate any index statistics at all. It is still possible to generate index statistics by using the SQL command ANALYZE TABLE.
The actual index statistics are stored in a system NDB table. The statistics are based on only one fragment of the table. The reason is two-fold: first, the only ordering is within one fragment; there is no ordering between two fragments. Second, it is simply faster to work with one fragment instead of all. Given that the hash function randomises the placement of rows, there should be very small differences between fragments, and thus it should be sufficient to analyse one fragment.
When using NDB as an SQL engine, it is recommended to activate these options. Running complex queries without index statistics can cause a query to execute thousands of times slower compared to when index statistics are activated.
If this variable is set to on, index statistics are generated when the index is created. Defaults to off.
If this variable is set the data nodes will regularly update the index statistics automatically. Defaults to off.
This parameter defines the minimum time between two updates of the index statistics. It is measured in seconds and defaults to 60 seconds.
IndexStatSaveSize and IndexStatSaveScale#
These two parameters define how much memory will be used in the index statistics table to store the samples from the index. The calculation is performed using a logarithmic scale.
By default IndexStatSaveSize is set to 32768 and IndexStatSaveScale is set to 100.
We first calculate the sample size as the average key size plus (the number of key columns plus one) times 4 bytes. Assume that we have three key columns averaging 16 bytes in size; in this case the sample size is 32 bytes = 16 + (1 + 3) * 4.
We multiply the sample size by the number of entries in the ordered index. Assume we have 1 million entries in the index; in this case the total unfiltered index statistics would consume 32 MByte.
Instead we calculate the size by taking the base-2 logarithm of 32 MByte, which is 25. Now we multiply 25 by IndexStatSaveScale times 0.01 (thus treating IndexStatSaveScale as a percentage).
Next we add one to this number and multiply by IndexStatSaveSize. In this case we get 26 * 32768 = 851968 bytes. We then divide this by the sample size, giving 26624. This is the number of samples that we will try to get.
These numbers now apply to all indexes in the database. Using the logarithmic scale we ensure that very large indexes do not use too large memory sizes. The memory size must be small enough for us to store it and not only that, it has to be small enough to process the information as part of one query execution.
As can be seen here, even with 1 billion entries in the table the base-2 logarithm of 32 GByte only goes to 35. The index statistics size thus grows by only about 38% for a 1000-fold increase in the storage volume of the table. We can increase the scale by changing IndexStatSaveScale and we can move the starting point by changing IndexStatSaveSize.
IndexStatTriggerPct and IndexStatTriggerScale#
Similarly we have two configuration parameters that use logarithmic scale to decide when to start the next index statistics update.
We first get the number of entries in the index and take the base-2 logarithm of this number; with 1 million entries in the table this is 20. This number is multiplied by IndexStatTriggerScale times 0.01, so the scale-down factor becomes 1 + (IndexStatTriggerScale * 0.01 * log2(number of index entries)). We keep track of the number of index changes since the last time we calculated the statistics, and a new update starts when the percentage of changed rows exceeds IndexStatTriggerPct divided by the scale-down factor. With the defaults and a one million row table, the calculated scale-down factor is 21; thus we start a new index statistics update when 100 / 21 = 4.76% of the index entries have changed.
By setting IndexStatTriggerScale to 0 we change to a completely linear scale. In this case IndexStatTriggerPct is used directly to decide when to activate a new index statistics update.
If IndexStatTriggerPct is set to 0 it means that there will be no index statistics updates issued. In this case the only way to get an index statistics update is to call the SQL command ANALYZE TABLE on the table.
Both IndexStatTriggerScale and IndexStatTriggerPct are by default set to 100.
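A setup activating automatic index statistics, with the remaining parameters left at the defaults discussed above, could look like:

```ini
[ndbd default]
IndexStatAutoCreate=1    # build statistics when an index is created
IndexStatAutoUpdate=1    # refresh statistics automatically as data changes
IndexStatUpdateDelay=60  # minimum seconds between two updates (default)
IndexStatSaveSize=32768  # defaults for the logarithmic sample-size scale
IndexStatSaveScale=100
IndexStatTriggerPct=100  # defaults for the logarithmic trigger scale
IndexStatTriggerScale=100
```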
Specialized configuration options#
When a signal is sent with sections it uses a long message buffer to store the sections in. By default this buffer is 64 MBytes, this is sufficient for most scenarios, but in cases with large rows and many threads in the data node this might have to be increased.
Buffers are allocated from this pool when messages with sections arrive through the receive thread. These buffers are kept until the destination thread has consumed the signal.
Similarly, when a new signal is created with one or more sections, it allocates buffer area from this pool. This buffer is kept either until the signal destination within the same node has consumed the signal or, if the signal is sent to another node, until it has been copied over to a send buffer.
Thus the long message buffer stores signals that have been received but not yet consumed, and signals created but not yet sent or consumed.
Most of those signals will be waiting in a job buffer in one of the threads. There can be many thousands of signals in one of those job buffers and we can have up to a bit more than 50 such threads; we can theoretically have up to 40,000 signals in the queues. Not all of these signals have sections, but there can definitely be tens of thousands of such signals executing in parallel. The size of the sections depends mostly on how much data we read and write from the database.
The long message buffer size has a limitation and 64 MByte should be enough to handle all but the most demanding situations. In those extremely demanding situations it should be sufficient to increase to a few hundred megabytes.
In smaller setups this buffer can even be decreased to half its size without any real risk of running out of buffer space.
If we run out of this buffer the node will crash.
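Since running out of this buffer crashes the node, it is worth setting it explicitly in demanding setups. The value below is the 64 MByte default expressed in bytes:

```ini
[ndbd default]
LongMessageBuffer=67108864  # 64 MByte; raise to a few hundred MByte for
                            # large rows combined with many threads
```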
This configuration option was invented to handle a very large configuration with 30 data nodes and hundreds of API nodes running a heavy benchmark.
The logic behind this variable can be understood by considering traffic in a large city. If one highway in the city is very efficient and lots of cars can be delivered and those cars all come into a bottleneck, then a major queue happens and the queue will eventually reach the highway as well.
Now in a traffic situation this can be avoided by slowing down traffic on the highway. For example in some cities we have traffic lights when entering a highway, this limits the amount of cars that can pass. This limitation avoids the queues in the bottleneck since cars are flowing into this bottleneck at a sustainable rate.
Limiting the traffic in an area without any bottlenecks can actually increase the traffic throughput in a city.
Similarly, in a large cluster it is important for nodes not to send too often. We don't want nodes to send immediately whenever they have something to send; rather, we want the nodes to send when they have buffered data for a while.
The analogy is here that nodes that have no bottlenecks, slow down and wait with sending data. This ensures that bottleneck nodes do not slow down the entire cluster.
The above explains how this feature solves the problem in the large cluster. With this parameter set to 500 microseconds we saw 3x higher throughput at the cluster level, and the throughput was consistent.
Without the parameter performance in the cluster went up and down in an unpredictable fashion.
What the feature does is that when a node is ready to send to another node, it checks how many microseconds ago it last sent to this node. If the number of microseconds since the last send exceeds the configured value we send; if not, we wait with sending. If we have nothing more to do in the thread and want to go to sleep, we send anyway.
By default this feature isn't active, the unit is in microseconds. It can be set to a maximum of 11000 microseconds.
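The text does not name this option; it is assumed here to be MaxSendDelay, the standard name for the send-throttling parameter in the data node configuration. The 500 microsecond value is the setting used in the benchmark described above, not a general recommendation:

```ini
[ndbd default]
# Delay sends so that data is buffered for a while before transmission;
# 0 (the default) disables the feature, the maximum is 11000 microseconds.
MaxSendDelay=500
```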
The default memory allocation behaviour is defined by the OS. Most OSs use a memory allocation policy that prefers memory from the same CPU socket. When NDB is used on a multi-socket server where a data node spans more than one CPU socket, it makes sense to spread memory allocation over all CPU sockets.
A data node often allocates a majority of the memory in a machine. In this case it makes sense to interleave memory allocation on all CPU sockets. By setting Numa to 1 on Linux where libnuma is available, this policy is ensured.
One consequence of setting Numa to 0 is that we might overload the memory bus on the CPU socket where the allocations were made. In a multi-socket server it makes sense to set Numa to 1; however, if we have one data node per CPU socket it might work better to set Numa to 0. Numa is set to 1 by default.
When we allocate memory we touch each memory page to ensure that the memory is truly allocated to our process and not that we only allocated swap space. This parameter delays touching parts of the memory until we have received the second start order from the management server.
Delaying this touch of memory is the default.
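The two memory-allocation parameters above could be set as follows; LateAlloc is the standard name for the delayed-touch parameter and is assumed here, since the text does not spell it out. Both values shown are the defaults:

```ini
[ndbd default]
Numa=1       # interleave memory allocation across all CPU sockets (Linux
             # with libnuma); consider 0 with one data node per socket
LateAlloc=1  # touch memory only after the second start order from the
             # management server
```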
We protect the fixed part of tuples with a checksum. If this flag is true, we will crash when encountering a wrong checksum; if the flag isn't set, we will only report an error to the user trying to access the row.
This flag is only there to support downgrades and should normally not be set to anything other than the default value.
Various reports can be produced with some frequency in the cluster log. By default none of these reports are produced. The unit is seconds.
We can get a regular report on the usage of DataMemory and IndexMemory. Even without it, we will still get a report when we reach thresholds such as 80%, 90% and so forth.
During an initial restart we spend a long time initialising the REDO log. This report specifies how many REDO logs we have initialised and how much of the current file we have initialised.
This parameter ensures that we get backup status printed in the cluster log at regular intervals.
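As a sketch, the three report frequencies described above could be enabled like this; the parameter names MemReportFrequency, StartupStatusReportFrequency and BackupReportFrequency follow the MySQL Cluster documentation, while the values are only examples:

```ini
[ndbd default]
# Report DataMemory/IndexMemory usage every 30 seconds
MemReportFrequency=30
# Report REDO log initialisation progress every 30 seconds
StartupStatusReportFrequency=30
# Report backup status every 10 seconds
BackupReportFrequency=10
```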
Currently we have a limit of a maximum of 256 scans per fragment. Each such scan gets an index between 0 and 255. Full table scans from MySQL use scans through the hash index. There can be at most 12 such scans in parallel. These always use index 0 through 11.
Range scans use index 12 through 121, thus by default we can have at most 110 parallel range scans per fragment.
TUP scans are used by local checkpoints and some special operations. These use index 122 through 255, thus there can be up to 134 parallel TUP scans per fragment.
This parameter can be set to increase the number of parallel TUP scans.
This parameter should never ever need to be set.
In reality only locking scans require a scan index. Any scan using READ COMMITTED does not really require a scan index. At the same time the parallelism we support is quite sufficient. If more scans try to start they will queue up, thus no errors occur due to the limited scan parallelism on a specific fragment.
The most common scans are range scans and 110 such scans in parallel should suffice. It is possible to track the number of queued scans through an ndbinfo table.
Here is a new [ndbd default] section based on the recommendations in this chapter.
```ini
[ndbd default]
NoOfReplicas=2
NoOfFragmentLogParts=4
DataDir=/usr/local/ndb_data
ServerPort=11860
MaxNoOfAttributes=10000
TwoPassInitialNodeRestartCopy=1
BuildIndexThreads=128
StartPartitionedTimeout=0
BatchSizePerLocalScan=64
HeartbeatIntervalDbDb=1500
ConnectCheckIntervalDelay=1
IndexStatAutoCreate=1
IndexStatAutoUpdate=1
StringMemory=5
```
Advanced Configuration of API and MGM nodes#
Most of the configuration of MySQL Cluster is in the data nodes and in the MySQL Servers. There are a few things that are configured also through the API section in the cluster configuration. In the cluster configuration there is no difference between API and MYSQLD sections.
For MySQL Servers it is recommended to set the node id and hostname of each node used.
Configuring send buffers#
These apply to both API and management server nodes. They will be covered in the next chapter on communication configurations.
This applies to both API nodes and management server nodes.
ArbitrationRank provides the priority for a certain node to become arbitrator. 0 means that it will never become arbitrator. 1 means that it can become arbitrator. 2 means that it will be selected as arbitrator before all nodes with arbitration rank set to 1.
The aim is to ensure that the arbitrator is placed on machines that can have an independent vote on which side of the cluster should win. For example with 2 replicas and 3 computers we want to place the arbitrator on the third machine, which contains no data node.
We achieve this by setting ArbitrationRank to 2 on the management server on this third machine (or possibly on an API node that is always up and running).
Delays the arbitration response by a number of milliseconds. By default set to 0; no reason to change this that I can come up with.
Configuring scan batching#
These apply only to API and MYSQLD nodes.
When performing a scan the data nodes will scan many partitions in parallel unless the scan is a pruned scan that only scans one partition. The NDB API must be prepared for all partitions to send back data.
We control the sizes each data node can send back by setting the batch size that the NDB API can handle.
An NDB API user can override the configured batch size below by using the SO_BATCH option. The NDB API can also affect the execution of the scan by setting SO_PARALLEL to define the parallelism of a table scan (it cannot be used to influence the parallelism of sorted range scans, these always use full parallelism).
The maximum scan batch size is measured in bytes and defaults to 256 kBytes. This sets a limit on how much data will be transported from the data nodes in each batch. This is not an absolute limit; each partition scanned can send up to one extra row beyond this limit.
This setting is used to protect the API node from both send buffer overflow as well as using up too much memory. At the same time it is desirable to have the size not too small since this will have a negative impact on performance.
The send buffer sizes should take into account that multiple scans can happen in parallel from different threads, this limit only applies to one scan in one thread.
The BatchByteSize limits the amount of data sent from one partition scan. If the size of the data from one partition scan exceeds this limit, it will return to the NDB API with the rows scanned so far and wait until the NDB API has processed those rows before it proceeds with the next batch.
It defaults to 16 kBytes. The actual limit on the maximum scan batch byte size is the parallelism multiplied by this number. If this product is higher than MaxScanBatchSize we will decrease the batch byte size to ensure that the product is within the bounds set by MaxScanBatchSize.
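The interaction between BatchByteSize, parallelism and MaxScanBatchSize can be sketched as follows; this is a simplified model of the described behaviour, not the actual NDB code:

```python
def effective_batch_byte_size(batch_byte_size: int,
                              parallelism: int,
                              max_scan_batch_size: int) -> int:
    """Per-partition batch byte size after applying the global limit.

    If parallelism * BatchByteSize would exceed MaxScanBatchSize,
    the per-partition size is scaled down so that the product stays
    within the bound.
    """
    if parallelism * batch_byte_size > max_scan_batch_size:
        return max_scan_batch_size // parallelism
    return batch_byte_size

# With the defaults (16 kByte batches, 256 kByte max) a scan over 32
# partitions gets its per-partition batch byte size reduced to 8 kBytes.
print(effective_batch_byte_size(16 * 1024, 32, 256 * 1024))  # 8192
```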
We will never send more than BatchSize rows from a partition scan even if the rows are very small. This is an absolute limit, the NDB API must however be prepared to receive this amount of rows from each partition scan.
These apply only to API nodes.
There are a few configuration parameters controlling the connect phase when an API node is connecting to a data node.
These parameters control the behaviour after the connection was lost and the timing of connect attempts after a number of failed attempts.
The NDB API was designed to handle failures transparently. When the connection to a node goes down for a while, it will be reused immediately as soon as the connection is back up again. It is possible to disable this by setting AutoReconnect to 0.
When an API node has already connected to at least one data node in the cluster, this parameter specifies the maximum delay between new attempts to connect to another data node.
This parameter was introduced to avoid hundreds of API nodes constantly attempting to connect to a few data nodes. The default is to make a new attempt after 100 milliseconds. For small clusters this is not an issue, but in a large cluster this can cause serious overhead on the data nodes when too many nodes try to reconnect at the same time.
It defaults to 1500 milliseconds. The time between new attempts will increase with every failed attempt until it reaches this value. If set to 0 it will not back off at all, it will continue sending with a delay of 100 milliseconds.
Regardless of how big this is set to, we will never wait for more than 100 seconds and we will use an exponential backoff algorithm that will send first after 100 milliseconds, next after 200 milliseconds, next after 400 milliseconds and so forth until either we have reached 1024 intervals or we have reached the above set maximum delay between two attempts.
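The backoff behaviour can be sketched like this; again a simplified model of the described algorithm, not the actual NDB API code:

```python
def connect_delays(backoff_max_time_ms: int, hard_cap_ms: int = 100_000):
    """Generate successive reconnect delays in milliseconds.

    Starts at 100 ms and doubles on every failed attempt until the
    configured maximum (or the 100 second hard cap) is reached.
    A configured maximum of 0 means no backoff: retry every 100 ms.
    """
    delay = 100
    while True:
        if backoff_max_time_ms == 0:
            yield 100
        else:
            yield min(delay, backoff_max_time_ms, hard_cap_ms)
            delay = min(delay * 2, hard_cap_ms)
```

With a maximum of 1500 milliseconds the successive delays become 100, 200, 400, 800, 1500, 1500, ... milliseconds.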
When an API node isn't connected to the cluster at all, we can also use a backoff timer. It defaults to 0, which means it is not used.
After connecting to the first data node, this configuration will be replaced by the ConnectBackoffMaxTime.
By default the cluster manager thread and the arbitrator thread are perfectly normal threads. It is possible to configure them to run at a higher priority on Linux kernels. Only applies to API nodes.
This is done by setting HeartbeatThreadPriority to a string with either fifo or rr, where fifo means SCHED_FIFO and rr stands for SCHED_RR; 50 is the priority level that will be set. If 50 is not desired, one can configure e.g. fifo,60 to override the default priority level of 50 with 60.
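For example, in the API section of the cluster configuration (a sketch following the syntax described above):

```ini
[api]
# SCHED_FIFO at the default priority level 50
HeartbeatThreadPriority=fifo
# or SCHED_RR at priority level 60:
# HeartbeatThreadPriority=rr,60
```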
This parameter will be explained in the coming chapter on configuration of the REDO log. It controls what to do with operations that finds the REDO log buffer full. Only applies to API nodes.
Only to be used for downgrades to older MySQL Cluster versions.
Setting this to 2 or higher means that we get printouts when we perform meta data operations. Only to be used in debugging the NDB API. Only applies to API nodes.
Management server nodes#
There are a few configuration options that apply only to management servers.
Management servers have a special DataDir where the node log and cluster logs are placed. In addition we store the binary configuration files in this directory. It is recommended to always set this variable in the cluster configuration.
API nodes have no special node logs, the output is channeled to stdout. The MySQL server will redirect stderr to the error log and will skip output to stdout.
There are three values for LogDestination.
CONSOLE means that the cluster log is redirected to stdout from the management server. I've never used this and see very little need of ever using it.
FILE means that the cluster log is written to a set of files stored in DataDir. This is the default behaviour and personally I have never changed this default. The default name of the cluster log files is e.g. ndb_49_cluster.log where 49 is the node id of the management server.
The file is swapped to a file named ndb_49_cluster.log.1 when it reaches its maximum size of 1000000 bytes. A new empty file is created at once when the first one becomes full. A maximum of 6 files are kept by default.
To set any other file name, maximum size or maximum number of files, the following syntax is used.
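A hedged example, following the documented LogDestination syntax; the file name and limits here are only illustrative:

```ini
[ndb_mgmd]
LogDestination=FILE:filename=my_cluster.log,maxsize=2000000,maxfiles=10
```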
There is also an option to redirect the cluster log to the syslog in the OS.
To do this use the following syntax:
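A sketch following the documented syntax for redirecting the cluster log to the syslog:

```ini
[ndb_mgmd]
LogDestination=SYSLOG:facility=syslog
```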
There are numerous options for where to redirect the syslog, see the MySQL manual for details on possible options.
It is even possible to use multiple options, it is possible to redirect output to CONSOLE, a FILE and the SYSLOG at the same time.
The management server uses HeartbeatIntervalMgmdMgmd for heartbeats between two management servers. This heartbeat is used in the communication between management servers to keep track of when management servers come up and go down.
By default it is set to 1500 milliseconds.
Here is a part of a cluster configuration file focusing on how to set up arbitration in a correct manner. First one management server sets ArbitrationRank to 2, after that one more management server and a MySQL server set it to 1. The rest can set it to 0.
```ini
[ndb_mgmd]
NodeId=49
Hostname=192.168.1.100
ArbitrationRank=2

[ndb_mgmd]
NodeId=50
Hostname=192.168.1.101
ArbitrationRank=1

[mysqld]
NodeId=51
Hostname=192.168.1.102
ArbitrationRank=1
```
Advanced Communication configuration#
Configuring send buffers#
The most important configuration parameters for communication are the send buffer sizes. These are configured partly through communication parameters and partly through node parameters.
TotalSendBufferMemory and ExtraSendBufferMemory#
In the past in NDB, the memory per communication channel was allocated independently of the others. Now the memory is allocated globally and all transporters share it. The size of this global buffer can be calculated in two different ways. The first is to add up all the SendBufferMemory settings.
The default option allocates memory to ensure that we can handle the worst case when all transporters are overloaded at the same time. This is rarely the case. With nodes that have many transporters it makes sense to allocate less memory to the global pool.
For example if we have 4 data nodes and 32 API nodes and the SendBufferMemory is set to 2 MByte, the default setting will be to allocate 70 MByte of send buffer globally. Now there is a very slim risk of us using this much memory at a single point in time. It would make sense to set TotalSendBufferMemory to e.g. 35 MByte.
ExtraSendBufferMemory defaults to 0 and can stay that way. It is always the sum of TotalSendBufferMemory and ExtraSendBufferMemory that is used, so there is no need to use two parameters here.
The opposite could happen in rare cases if the user runs very large transactions. In this case the commit phase might generate signals that increase the send buffer beyond the SendBufferMemory limit although we abort both scans and key operations.
In this case we might decide to set TotalSendBufferMemory to more than 70 MByte instead.
In the API nodes the global memory allocated is all there is. In the data nodes we can overcommit up to 25% more memory than the global memory allocated. This memory will be taken from SharedGlobalMemory if available.
Setting TotalSendBufferMemory in a data node to 35 MByte means that the real limit is almost 44 MByte.
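Continuing the example above, the setting would look like this in the data node section (a sketch):

```ini
[ndbd default]
# Shared send buffer pool; a data node can overcommit this by up to
# 25%, taken from SharedGlobalMemory if available
TotalSendBufferMemory=35M
```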
SendBufferMemory is normally set in the TCP default section of the cluster configuration file. It defaults to 2 MByte. It can be set per communication channel, in which case both node ids must be set together with the send buffer memory size.
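A sketch of both variants; the node ids are only illustrative:

```ini
[tcp default]
SendBufferMemory=2M

# Override for one specific communication channel
[tcp]
NodeId1=1
NodeId2=2
SendBufferMemory=8M
```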
As described in this chapter, SendBufferMemory influences the global memory allocated unless TotalSendBufferMemory is set and influences the OverloadLimit unless this is set.
If both those parameters are set, the SendBufferMemory is ignored.
Group sets the priority of one connection. This is used when selecting the transaction coordinator from the API nodes. It is handled by other configuration parameters, it is not expected that the user changes this parameter.
This is mostly a debug option. By using this option each signal sent over the network is tagged with a signal id. This adds 4 bytes to each signal header. It makes the trace files a bit more descriptive since it becomes possible to connect on a distributed scale which signals that initiated signals executed in another node. Internally in a node all signals are tagged with a signal id, but by default we don't use this option for distributed signals.
This feature can be useful to ensure that no software or hardware bugs corrupt the data in transit from one node to another. It adds a 4-byte checksum on each signal sent. This is not used by default. If a faulty checksum is found the connection between the nodes is broken and the signal is dropped with some descriptive error messages.
We have a few ways to decrease the pressure on a communication channel. The first level is to declare the channel as overloaded. When this happens any scans using this channel will automatically set the batch size to 1. Thus each scan will use the minimum amount of send buffer possible and still handle the scans in a correct manner.
This happens when the send buffer has used 60% of the OverloadLimit setting.
The default for the OverloadLimit is 80% of the SendBufferMemory setting of the communication channel (called transporter in NDB software).
When the overload limit is reached we start aborting key operations in the prepare phase. Operations in the commit phase will continue to execute as before.
This parameter makes it possible to set the OverloadLimit independent of the setting of SendBufferMemory.
Each transporter has a receive buffer. This buffer is what we receive into from the OS. Thus we can never run out of this buffer; we ensure that we never receive more from the OS than there is buffer space to receive into.
Before we receive more, we need to handle the signals we received.
The size of this buffer is set to ensure that we can receive data from the socket rapidly enough to meet the sender's speed of writing. It is set to 2 MByte by default.
Configuring OS properties#
The main reason for adding the possibility to change the send and receive buffer options and to set TCP_MAXSEG on connections was to handle API nodes that access NDB through a WAN (Wide Area Network) connection.
The default settings of TCP_SND_BUF_SIZE and TCP_RCV_BUF_SIZE are set to fit LAN settings where nodes are in the same building or at most a few hundred meters apart.
The number of outstanding IP messages that a socket can handle is controlled by these parameters together with TCP_MAXSEG.
The reason is that the sender must know that there is memory space in the receiver to avoid useless sending.
To handle API nodes that are further apart, these three parameters were added to the list of configurable options on each connection.
All of these default to 0 which is interpreted as using the defaults of the OS.
The normal OS send buffer size and OS receive buffer size is around 100 to 200 kBytes.
For long-haul connections one might need up to a few MBytes to ensure the parallelism is kept up.
By setting the configuration parameter WAN to 1 on the communication channel, TCP_SND_BUF_SIZE and TCP_RCV_BUF_SIZE are set to 4 MBytes instead and TCP_MAXSEG_SIZE is set to 61440 bytes.
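A sketch of a WAN-connected channel between a data node and an API node; the node ids are only illustrative:

```ini
[tcp]
NodeId1=1
NodeId2=51
# Sets TCP_SND_BUF_SIZE and TCP_RCV_BUF_SIZE to 4 MBytes
# and TCP_MAXSEG_SIZE to 61440 bytes
WAN=1
```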
Sets the send buffer size allocated by the OS for this socket.
Sets the receive buffer size allocated by the OS for this socket.
Sets the TCP_MAXSEG parameter on the socket.
By default the server part of the connection (the data node, or the data node with the lowest node id) will bind its side of the socket to the hostname provided in the config. By setting TcpBind_INADDR_ANY to 1, we skip binding the socket to a specific host. Thus we can connect from any network interface.
Advanced Configuration of RonDB file systems#
In this chapter we will discuss how to setup file system related configuration parameters for both in-memory data and disk columns.
The following things need to be considered when configuring disks for MySQL Cluster. How MySQL Cluster disk data works was covered in an earlier chapter on disk data columns, where we covered the relevant configuration parameters. Here we will focus on recommended ways to use those parameters.
Decision on which disks to use for tablespaces
Decision on which disks to use for UNDO logs
Decision on size of DiskPageBufferMemory
Decision on size of Undo log buffer
Decision on which disks to use for REDO logs and local checkpoints
The default placement of files on disk is specified by the configuration parameter DataDir. This should always be set in the configuration file.
We can move all the disk usage except the node logs and trace files to another directory using the configuration parameter FileSystemPath.
Currently local checkpoint data and REDO logs cannot be separated on different disks. They are always written with sequential writes, so this should not present any issues.
A tough challenge is to provide predictable latency for MySQL Cluster using disk data. When NDB was designed this was simply not possible and thus early versions of MySQL Cluster had no support for disk data.
The main reason why it is now possible to overcome these challenges is the hardware development that has happened over the last few years. Nowadays we have access to SSD devices that can respond in a few hundred microseconds, NVMe devices that are even faster and even persistent memory devices that cut the latency down to only a few microseconds.
An installation that uses properly configured hardware and has set up the file systems in a good way can definitely use disk columns and still maintain predictable latency, where transactions with tens of round trips can be executed in tens of milliseconds even involving disk data.
The throughput of MySQL Cluster disk data is not equal to the throughput of in-memory data. The main difference is in writes, where disk data columns use one tablespace manager that manages extents of tablespaces and allocation of pages in extents, and one log manager that handles the UNDO log. Thus there are a few bottlenecks in mutexes around these managers that limit the write throughput for disk data tables. It is still possible to perform hundreds of thousands of row writes per second of disk data rows for a cluster of NDB data nodes.
Reads on tables with disk columns are as scalable as in-memory tables and are only limited by the amount of reads from the file system that can be served. It is expected that millions of rows per second can be read per cluster even when the cache hit rate is low and with high cache hit rates tens of millions of reads per second and even hundreds of millions of reads per second is possible per cluster.
Most of the decision making here is about choosing the correct disk architecture for your application and the rest is about selecting proper buffer sizes.
It is recommended to avoid using the same disks for tablespaces as are used for local and global checkpoints and REDO logging. The reason is that if the disks containing tablespaces are overloaded by the application, it will affect the operation of checkpointing, REDO logging and so forth. This is not desirable. It is preferable to ensure that logging and checkpointing always have dedicated disk resources; this removes many overload problems.
As an example if we want to setup MySQL Cluster in the Oracle Bare Metal Cloud using the HighIO option we can do the following. This machine comes equipped with 512 GBytes of memory and a total of 12.8 TByte of space using 4 NVMe devices.
One way to use this setup is to use two devices for local checkpoint files, REDO log files and the backups in MySQL Cluster. Let's assume we set up this file system on /ndb_data.
The other two devices are used for tablespaces and UNDO logs. This means setting up two file systems using a RAID 0 configuration on two devices in both cases. As an example we can use 1 TByte of disk space for the UNDO log and 5 TByte disk space for tablespaces. Let's assume that we set up this file system on /ndb_disk_data.
Thus we will set the configuration variable DataDir to /ndb_data and we set the configuration variable FileSystemPathDD to /ndb_disk_data.
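The resulting settings in the data node section, using the paths from the example above:

```ini
[ndbd default]
# Checkpoints, REDO logs and backups
DataDir=/ndb_data
# Tablespaces and UNDO logs
FileSystemPathDD=/ndb_disk_data
```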
DataDir will also contain node logs, cluster logs, trace files and pid files generated by the data node. It could be useful to place these files on a separate device. Normally those files will not grow extensively, but if the data nodes hit some situation where information is written to the node logs a lot, it could be useful to also remove those from the same disks as checkpoint files and REDO log files. DataDir is always used for those files, but if FileSystemPath is also set then local checkpoint files, REDO logs and metadata and system files are moved to this directory.
It is possible to specifically set a directory for backups in BackupDataDir, for tablespaces in FileSystemPathDataFiles and a specific directory for UNDO log files in FileSystemPathUndoFiles.
To ensure that we can fit at least one and possibly even two backups we should use compression on the backup files. This means setting CompressedBackup to 1 in the configuration.
It is possible to use compressed local checkpoints as well by setting CompressedLCP.
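A sketch of the compression settings:

```ini
[ndbd default]
CompressedBackup=1
CompressedLCP=1
```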
Compression happens in the io threads and uses zlib. Expect this library to use one second of CPU time for around 100-200 MBytes of checkpoint or backup files.
Configuring the REDO log files and local checkpoints#
Local checkpoints and REDO logs are an important part of the configuration setup for MySQL Cluster.
We write the update information of all inserts, updates and deletes into the REDO log. It is important to be able to cut the REDO log tail every now and then. This happens when a local checkpoint has been completed. At recovery time we first install a local checkpoint and then we apply the REDO log to get the state we want to restore.
Configuring REDO log#
It is important to set the correct size of the REDO log before starting up a production cluster. It is possible to change the size, but it requires that the nodes are restarted using an initial node restart.
We already described the reasoning behind the number of log parts. This should be set according to the number of ldm threads we want to use in the nodes or a minimum of 4.
NoOfFragmentLogFiles and FragmentLogFileSize#
The size of each log part is defined by the number of log files (NoOfFragmentLogFiles) multiplied by the size of each log file (FragmentLogFileSize).
The log file size isn't that important, but the default is too small. It is advisable to set the log file size to at least 256 MByte. Otherwise we might run into issues when writing to the REDO log at high speed, if the file system cannot keep up with creating files at the required speed. The number of log files isn't very important either, but three files is the minimum and 4 is advisable as a minimum.
It is a good idea to set the total size of the REDO log files to a size that is larger than the DataMemory. This will ensure that MySQL Cluster can handle the data load phase without running out of REDO log. After the data load phase the local checkpoints will move the data out to checkpoints and the REDO log can shrink back to a small size again.
For example with 8 Gbyte of DataMemory and 4 ldm threads we could use the following settings.
```ini
NoOfFragmentLogParts=4
NoOfFragmentLogFiles=8
FragmentLogFileSize=512M
```
At initial start of a node, the REDO log files will be initialised. This ensures that the REDO log space on disk is really allocated. Otherwise the OS might fake the allocation and we might run out of REDO log although not all of it was used.
In a testing environment it might be useful to speed up initial starts by skipping this initialisation phase. For this purpose it is possible to set InitFragmentLogFiles to sparse. There is instead a risk that we run out of REDO log even though space is supposed to exist in the REDO log. This will cause the data node to crash since we can no longer write to the REDO log.
At commit time the REDO log messages have to be written to the REDO log buffer. This is an in-memory buffer that is written to disk at intervals prescribed by global checkpoints. If the disk cannot keep up, the RedoBuffer usage will grow. This parameter sets the size of this buffer. When we run out of REDO buffer we can abort transactions in the prepare phase. The commit messages must be written into the REDO log buffer, so we ensure that there is always some space left in the buffer for this purpose. Thus running large transactions requires a large RedoBuffer setting.
The setting of RedoBuffer is per ldm thread. The total memory used for Redo buffers is RedoBuffer multiplied by the number of ldm threads.
When a transaction finds the REDO buffer full, it has two choices. It can either abort the transaction immediately, or it can queue up and hope that the REDO buffer situation improves.
The API node has a configuration option for this purpose called DefaultOperationRedoProblemAction. This configuration parameter can be set to ABORT or QUEUE. It can be influenced by an option in the NDB API for NDB API applications.
RedoOverCommitLimit and RedoOverCommitCounter#
As long as we have IO bandwidth to write out the REDO log at the same rate as we produce it, there are no problems.
If we write more to the file than what is acknowledged, we know that the disks are not able to keep up.
If the disks cannot keep up, we can calculate the disk IO rate through the acks of file writes. By measuring the amount of outstanding IO we have at any point in time, we can measure the IO lag.
The IO lag is measured in seconds. If the IO lag in seconds exceeds RedoOverCommitLimit, we increment a counter by the lag divided by RedoOverCommitLimit. If the lag also exceeds RedoOverCommitLimit the next second, we add even more to this counter. When this counter reaches RedoOverCommitCounter we start aborting transactions to regain control of the REDO log writes. The default setting of RedoOverCommitLimit is 20 and of RedoOverCommitCounter is 3.
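The counting scheme can be sketched as follows. This is a simplified model of the described behaviour, not the actual NDB code; in particular, resetting the counter when the lag recovers is an assumption on my part:

```python
def should_abort(io_lags_seconds, limit=20, counter_threshold=3):
    """Return True once accumulated REDO IO lag triggers aborts.

    Each second the observed IO lag is compared with
    RedoOverCommitLimit (limit); when it exceeds the limit, the
    counter is incremented by lag // limit.  When the counter reaches
    RedoOverCommitCounter (counter_threshold), transactions start
    being aborted.
    """
    counter = 0
    for lag in io_lags_seconds:
        if lag > limit:
            counter += lag // limit
        else:
            counter = 0  # assumption: counter resets when the lag recovers
        if counter >= counter_threshold:
            return True
    return False
```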
Every second that we have a lag of the REDO log writes, we will write a message to the node log describing how much lag we are seeing.
Configuring Local checkpoints#
There are three reasons why transactions can be aborted due to the REDO log. The first we saw above was that we run out of REDO buffer space, we simply can't write to the REDO log quick enough to move data from the buffer to the disk IO subsystem.
The second reason is that the disk IO subsystem cannot keep up with the REDO log writes as explained above.
The third reason is that we run out of REDO log space.
The only way to get REDO log space back is to ensure that the REDO logs are no longer needed. This happens through a local checkpoint.
Local checkpoint execution is controlled through a set of configuration parameters.
The first decision to make is when to start a local checkpoint again. We have one parameter that directly controls this, the TimeBetweenLocalCheckpoints parameter. In principle NDB is designed to execute local checkpoints continuously. There is no real reason to change this parameter; it is also a bit complex to understand.
The second configuration parameter controlling when to start an LCP is MaxLCPStartDelay. This parameter is effective during node restarts only. When the master decides that it is time to start an LCP, it checks if there are nodes that soon will be ready to participate in the LCP. The reason is that any node restart is finalised by participating in a local checkpoint. MaxLCPStartDelay sets the time we will wait for those nodes to be ready for an LCP before we start the LCP.
The next set of important parameters controlling LCP execution is a set of parameters controlling the speed at which we perform local checkpoints. This speed is controlled by an adaptive algorithm that has min and max values that depend a bit on the situation.
The execution of LCPs is controlled also by the buffer space available for local checkpoints and how much we write to disk at a time for an LCP.
We have a parameter that oversees that the LCP doesn't hang due to some software bug. This is a watchdog that watches over the LCP execution. If the LCP stops making progress for too long, this will cause a node crash with some log messages written describing the state of the LCP at crash time.
The default setting of this parameter is 20. The parameter is logarithmic: 20 means 2 to the power of 20 words, thus 4 MBytes (a word is 4 bytes). We measure the amount of committed information since the last LCP was started. If it is bigger than 4 MBytes, we start a new LCP.
Thus as soon as there is some level of write activity we will start an LCP. If NDB is used mainly for reading there will be no LCP activity.
The setting can at most be 31, which means that we wait for 8 GBytes of write activity in the cluster before an LCP is started.
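The logarithmic interpretation of the parameter can be expressed as a small helper; a sketch of the arithmetic, not NDB code:

```python
def lcp_trigger_bytes(time_between_lcps: int) -> int:
    """Committed bytes that trigger a new LCP.

    TimeBetweenLocalCheckpoints is logarithmic: the threshold is
    2^value words, where a word is 4 bytes.
    """
    return 4 * (2 ** time_between_lcps)

print(lcp_trigger_bytes(20))  # 4194304 bytes = 4 MBytes
```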
It is recommended to not change this parameter, mainly because there is little value in changing it.
This parameter is set in seconds and defaults to 25 seconds. If an LCP is starting and we know that a node is close to being ready to join the LCP, we wait this long before starting the LCP.
When an LCP is executing it uses an adaptive algorithm to set the desired disk write speed of the local checkpoint. By default this parameter is set to 10M (= 10 MBytes). Thus we attempt to write 10 MBytes per second in the node. If there are 4 ldm threads, each of those threads attempts to write 2.5 MBytes per second.
MinDiskWriteSpeed is the minimum speed that the adaptive algorithm will never go below.
The adaptive algorithm will normally choose the currently active max value. There are two reasons to decrease the disk write speed of a local checkpoint. The first reason is that the CPU is used quite a lot in the ldm thread. We measure CPU usage in ldm threads both by asking the OS how much CPU we used and by measuring how much time we spent awake. If the time awake is higher than the executed time, it means that the OS removed us from the CPU to execute other threads even when we were ready to execute. This also impacts the adaptive algorithm a bit.
We start slowing down execution of LCPs when CPU usage becomes too high. At 97% we step down the disk write speed; we decrease it faster if we reach 99% and even faster at 100% CPU usage. If CPU usage drops below 90% we start increasing the disk write speed again.
If any REDO log part has an IO lag of 2 seconds or more as described above in the section on RedoOverCommitLimit, we will quickly decrease the disk write speed. Currently the REDO log and the local checkpoints are always placed on the same disk, if the REDO log cannot keep up, it can be a good idea to decrease the disk write speed of LCPs.
In normal operation the maximum disk write speed is set by MaxDiskWriteSpeed, this defaults to 20 MBytes per second. It is the total for all ldm threads.
When another node is executing a node restart we increase the disk write speed since executing faster LCPs means that we can restart faster. In this case we have a higher max disk write speed. The same adaptive algorithm can still decrease the speed of the LCP.
This maximum, set by MaxDiskWriteSpeedOtherNodeRestart, defaults to 50 MBytes per second.
When we are restarting ourselves, and during cluster restarts, we use MaxDiskWriteSpeedOwnRestart as the maximum disk write speed. This defaults to 200 MBytes per second. During a cluster restart there is no risk of the adaptive algorithm throttling activity since there are no transactions ongoing while running the LCP.
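The four disk write speed limits discussed above can be collected in the data node defaults section of config.ini. A sketch using the default values from the text:

```ini
[ndbd default]
# Floor for the adaptive LCP disk write speed algorithm
MinDiskWriteSpeed=10M
# Ceiling in normal operation (total over all ldm threads)
MaxDiskWriteSpeed=20M
# Ceiling while another node is restarting
MaxDiskWriteSpeedOtherNodeRestart=50M
# Ceiling during our own node restart and during cluster restarts
MaxDiskWriteSpeedOwnRestart=200M
```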
BackupWriteSize and BackupMaxWriteSize#
As soon as we have gathered BackupWriteSize of data destined for the checkpoint files we will initiate a write to the checkpoint disk. We will never write more than BackupMaxWriteSize in a single write.
BackupWriteSize defaults to 256 kBytes and BackupMaxWriteSize defaults to 1 MByte.
The BackupDataBufferSize is the size of the data buffer where we gather records that are ready to be written to the checkpoint files.
The default of this parameter is 16 MBytes, but there are very good reasons to set it much lower. The reason is that a smaller buffer improves the predictable latency that we are aiming for. With the buffer size at 16 MBytes we spend a lot of CPU resources filling up the buffer, and after filling it we do not do much for a while until we are ready to fill the buffer again.
The recommended settings for those settings are rather:
BackupWriteSize=256k BackupMaxWriteSize=512k BackupDataBufferSize=512k
These settings will improve the predictable latency and still give a good disk throughput. One can increase the settings to improve the disk throughput and decrease it a bit more to improve the predictable latency even more.
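As a config.ini sketch, the recommended settings above would be placed in the data node defaults section:

```ini
[ndbd default]
BackupWriteSize=256k
BackupMaxWriteSize=512k
BackupDataBufferSize=512k
```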
This sets the watchdog timer ensuring that there is progress of LCPs. It is set by default to 60 seconds; after a third of this time (20 seconds) we start writing warning messages to the node log.
To configure backups we need to configure an appropriately sized log buffer for backups and we need to handle the imbalance of backup execution.
Writing backups is controlled by the same parameters as the execution of LCPs. This means that if a backup is executing in parallel with an LCP, the LCP and the backup will share the disk write speed.
There is one problem with this approach, the backup is only written by the first ldm thread. There is an imbalance in the use of the disk write speed resources.
To accommodate this imbalanced behaviour the first ldm thread gets a percentage of the disk write speed of the node in addition to its fair share for executing LCPs.
The BackupDiskWriteSpeedPct sets the percentage reserved for the ldm thread executing the backup. It defaults to 50. For example, if the maximum speed is 20 MBytes per second, 50% of this (10 MBytes) is allocated for the backup. Assuming we have 4 ldm threads, the remaining 10 MBytes is shared among them, so each ldm thread gets 2.5 MBytes. Thus the first ldm thread gets a total of 12.5 MBytes per second of disk write speed and the remaining three get 2.5 MBytes each.
This only affects disk write speed while writing a backup.
A backup must write out all records (in-memory data and disk columns). In addition it must write all changes made during the backup. These changes are a form of UNDO log or REDO log that we must write during backups, in addition to all other logging and checkpointing going on in a data node.
This log buffer is configured through BackupLogBufferSize and is by default set to 16 MBytes. In a cluster with much write activity it is recommended to set the size of this buffer much larger, e.g. 256 MBytes. This buffer is only allocated in the first ldm thread, thus only one instance of it is allocated.
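A sketch of these backup parameters in config.ini; the 256M log buffer is the example value suggested for write-heavy clusters, not a default:

```ini
[ndbd default]
# Share of the node's disk write speed reserved for the backup
# in the first ldm thread (default 50)
BackupDiskWriteSpeedPct=50
# Buffer for changes logged during the backup, allocated only
# in the first ldm thread (default 16M)
BackupLogBufferSize=256M
```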
Configuring global checkpoints#
In NDB, data durability at commit time is ensured by writing into the REDO buffer in several nodes as part of commit. Durability that survives cluster crashes is achieved when that data is made durable on disk through a global checkpoint.
The default timer for global checkpoints is set to 2000 milliseconds.
If the data node runs on hard drives this number is quite reasonable; in some cases it might even be tough to sustain high loads at this number with hard drives.
With SSDs this value should be perfectly ok and should normally not be any issue unless the write load is very high.
With faster SSDs or NVMe drives one could even decrease this time since the IO bandwidth should be a lot higher than what NDB will be able to match with its updates.
When the disks get severely overloaded, global checkpointing might slow down significantly. To ensure that we don't continue running in an unsafe state, we crash the node when the time between global checkpoints exceeds TimeBetweenGlobalCheckpointsTimeout. This also acts as a watchdog ensuring that we don't continue if global checkpoint execution has stopped due to a software bug.
It is set by default to 120000 milliseconds (120 seconds or 2 minutes).
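Both global checkpoint timers shown with their default values, as a config.ini sketch:

```ini
[ndbd default]
# Time between global checkpoints (milliseconds)
TimeBetweenGlobalCheckpoints=2000
# Crash the node if global checkpointing stalls longer than this (milliseconds)
TimeBetweenGlobalCheckpointsTimeout=120000
```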
Memory Buffers for disk columns#
The DiskPageBufferMemory is a very important parameter. By default it is set to 64 MBytes; this is not a production setting, it is only intended for trying out MySQL Cluster on a small computer.
It is a good idea to consider this parameter specifically. If it is set too low it will have a dramatic impact on latency and throughput and if set too high it will remove the possibility to store more data in the DataMemory (in memory data).
The size of the UNDO log buffer is set either in InitialLogfileGroup or in the command where the UNDO log buffer is created. It is not changeable after that, it is important to set it sufficiently high from the start. There is one log buffer per data node.
The UNDO log buffer size should be sufficient to ensure that transactions can proceed without interruption as often as possible. The pressure on the log buffer increases with large transactions, when large amounts of change data are written quickly into the log buffer, and it increases with a large number of parallel write transactions. If the buffer is overloaded, commits will have to wait until buffer content has been written to disk. Thus transaction latency can be heavily impacted by using disk data with a too small UNDO log buffer, and similarly by using a disk device for UNDO logs that cannot keep up with the write speed.
New tablespace and undo log files#
New undo log files can be added as an online operation as shown in the chapter on disk columns. Similarly for tablespaces. Tablespaces can also be added through the configuration parameter InitialTablespace.
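As a sketch, an UNDO log file group and a tablespace can also be created through the configuration at the initial start of the cluster; the names and sizes below are made-up example values:

```ini
[ndbd default]
# Creates the UNDO log file group at initial start; the undo_buffer_size
# cannot be changed afterwards, so size it generously from the start
InitialLogfileGroup=name=lg_1; undo_buffer_size=64M; undo_1.log:1G
# Creates a tablespace with one data file at initial start
InitialTablespace=name=ts_1; extent_size=1M; data_1.dat:4G
```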
File system configurations#
When writing to tablespace files we can have several file system threads writing to the same file in parallel. These threads come from a special pool of threads dedicated to this purpose. The size of the pool is set through the configuration parameter DiskIOThreadPool; it defaults to 2, meaning that 2 dedicated threads are available for writing to the tablespace files. If both of those threads are busy writing, any further writes have to wait for one of them to complete its file operation. For higher write rates to tablespace files the pool can be set larger.
It is possible to set a maximum limit on the number of open files (and thus file threads) in a data node. It is set by default to 0, which means that no limit exists. It is strongly recommended not to change this parameter other than to get an understanding of the system and see how many files it opens. There are no problems associated with having many open files in parallel, and the algorithms in NDB are designed to limit the number of open files in the data node.
InitialNoOfOpenFiles is the number of file threads that are started in the early phases of the restart. It is not necessary to change this parameter, but setting it higher moves some thread creation work to an earlier point, before the node has started up and is executing transactions. It defaults to 27. The number of open files required is very much dependent on the data node configuration.
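A config.ini sketch of the file system thread parameters; the DiskIOThreadPool value of 8 is a made-up example for a higher tablespace write rate, the others are the defaults:

```ini
[ndbd default]
# Threads available for parallel writes to tablespace files (default 2)
DiskIOThreadPool=8
# 0 means no limit on open files (recommended)
MaxNoOfOpenFiles=0
# File threads started early in the restart (default 27)
InitialNoOfOpenFiles=27
```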
File threads consume very small resources, mainly a 128 kByte stack each. There are no real issues associated with ignoring this thread creation and the memory associated with it. Modern operating systems can handle hundreds of thousands of concurrent threads and we normally use less than one hundred threads for file IO.
This is an area where we rarely, if ever, see any issues.
ODirect is a special file system option that bypasses the page cache in the operating system. This option is always used on UNDO log files and tablespace files. For REDO log files, backup files and local checkpoint files it is configurable whether to use it or not.
It is recommended to set this parameter on Linux unless you are running on an ancient Linux kernel.
This parameter ensures that we write out any buffers inside the OS frequently enough to avoid issues. It was added because on Linux the entire memory of a machine can be used to buffer writes before they start to happen, which led to very unpleasant behaviour when too much buffering happened in the file system. The parameter ensures that we write out buffers frequently and defaults to 4 MBytes. When ODirect is set this parameter is ignored; setting ODirect is the preferred method on Linux.
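On a reasonably modern Linux kernel the recommendation above amounts to this config.ini sketch:

```ini
[ndbd default]
# Bypass the OS page cache for REDO log, backup and LCP files
ODirect=1
# Only relevant when ODirect is not set (default 4M)
#DiskSyncSize=4M
```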
The machines mentioned above in the Oracle Bare Metal Cloud represent a very good example of high-end machines that fit very well with the requirements MySQL Cluster has on the disk subsystem when using disk columns.
The two devices used for tablespace data and UNDO log data will be able to sustain millions of IO operations and thus reading and writing several GBytes of tablespace pages and writing of UNDO log files per second.
Thus it can sustain several hundred thousand updates per second of disk rows which is more than what one node currently can handle for disk data rows (for main memory rows it can handle millions of updates per second in specific setups). This setup is a safe setup for MySQL Cluster where there is very little risk that the disk subsystem becomes a bottleneck and a problem for MySQL Cluster.
Advanced Configuration of the MySQL Server#
Now that we have covered everything except advanced thread configurations for the cluster configuration, we will take a deeper look at which configuration options are useful when setting up a MySQL Server to use NDB as a storage engine.
Most of the configuration options to consider are NDB specific, but I will mention a few general MySQL options as well. I will not discuss any configuration options for InnoDB.
Basic MySQL server options#
Basic options are required whenever you set up a MySQL server independent of what storage engine you use. These options specify where to find the files, which port to use, where to place socket files, where to log errors, how many connections we will be able to handle in parallel and so forth. Most of those options can be set either at command line or in the MySQL server configuration file. I will not discuss details on exactly where they can be placed, this information is available in the MySQL manual.
I will not discuss any performance schema options. The performance schema has few advantages when using the MySQL server with NDB as the storage engine.
To use the NDB storage engine with the MySQL server it is required to use the --ndbcluster option when starting the MySQL Server. It is important to use a MySQL Server from a MySQL Cluster installation when using the NDB storage engine.
--datadir specifies the data directory of the MySQL Server. This is where most files related to the MySQL Server will be stored. This includes error logs, data files, pid files and so forth.
In MySQL Cluster you will normally have many MySQL servers connected to the cluster. To make it easy to know which MySQL server you are examining it is a good idea to have the node id (or one of the node ids if there are multiple ones) as part of the directory name. This makes it easier to find your way about in the various logs produced by the nodes in the cluster.
--basedir points to the installation directory of MySQL Cluster. It is used among other things to find the error messages for different languages.
The --log-error option specifies that you want errors logged and where you want those errors to be written. A filename can be specified, the location of this file will be in the datadir. The default name of the error log is the hostname with .err as file extension.
A pid file contains the process id of the MySQL server process. This is a file located in the data directory. A filename is provided in the option.
This parameter defaults to 3306. If another port is desired it should be provided in this option.
If the MySQL client is located on the same machine it can communicate with the MySQL server using Unix domain sockets. This is normally a bit faster than using a TCP/IP socket on the local machine. The default setting for this option is /tmp/mysql.sock; it can be pointed to a different socket file through this option.
It is possible to specify the network interface that the MySQL server will accept connections from. This can be useful to avoid accepting connections from a public network and only accept connections from a specific private network.
One manner to specify the network interface is to provide the IP address of the desired network interface.
The default behaviour is to accept connections from any network interface.
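A minimal my.cnf sketch of the basic options above; the paths, the address and the node id embedded in the names are made-up examples:

```ini
[mysqld]
# Node id 67 embedded in names to tell MySQL servers apart
datadir=/var/lib/mysql-cluster/mysqld.67
basedir=/usr/local/mysql-cluster
log-error=mysqld.67.err
pid-file=mysqld.67.pid
port=3306
socket=/tmp/mysql.sock
# Only accept connections from a private network interface
bind-address=192.168.1.10
```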
The MySQL server can log queries. The log output destination is specified using the --log-output option; by setting --general-log all queries are logged, and with --log-slow-queries one can log only queries that are slower than a certain time.
The NDB handler will log more or less dependent on the setting of the --ndb-extra-logging option. This defaults to 1. Setting it to 0 means that most logging is disabled, setting it to 20 means that all possible messages are printed in the error log.
Query cache settings for NDB#
The query cache works with NDB, but it will limit the scalability of the MySQL server; it is not recommended to use it unless one is really sure that it pays off. The query cache is removed in MySQL 8.0.
It is highly recommended to ensure that the query cache is completely shut off when starting the MySQL server. If not properly shut off, it will limit the scalability of the MySQL server on systems with many CPUs.
To shut it off completely set --query-cache-size=0 and --query-cache-type=0.
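In my.cnf form, shutting the query cache off completely looks like this:

```ini
[mysqld]
query_cache_size=0
query_cache_type=0
```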
If the query cache is used with NDB one should consider setting the --ndb-cache-check-time option to a proper value.
There are a few parameters that are important to configure properly when starting up a MySQL server to be part of an NDB Cluster. It is important to point the MySQL server to an NDB management server (or a set of them) such that it can get the cluster configuration. There are parameter settings to ensure that we can decide how to handle cases where it takes time to connect to the cluster (it might be starting up for example).
It is important to decide how many API node slots this MySQL server will use and it is important to set the node ids of those API node slots.
This option points to the management server from where the cluster configuration is fetched. It is possible to specify several management servers in which case the configuration is fetched from any of those management servers.
The default setting for this parameter is localhost:1186. If only a hostname is provided the port is 1186 by default.
By default one cluster connection is used. In this case one can set the node id of the MySQL server using the option --ndb-nodeid. With more than one cluster connection it is still possible to set the node ids of those cluster connections (and recommended to do so in both cases); in this case one uses --ndb-cluster-connection-pool-nodeids to set the node ids.
This option specifies the number of cluster connections.
Normally one cluster connection scales to at least 8 CPUs and in many cases it can scale to around 16 CPUs. To scale to a higher number of CPUs for the MySQL server requires using more cluster connections.
In my experiments I have been able to scale the MySQL server to at least around 60 CPUs using 4 cluster connections.
With one cluster connection this option can be used to set the node id of the MySQL Server. It is recommended that MySQL server have specific node ids that are set in the cluster configuration with a node id and a host name.
This option accepts a comma separated list of node ids when using multiple cluster connections.
When a MySQL server is started it will first connect to the NDB management server. Then it will perform various synchronisations with the existing cluster to ensure that the MySQL server knows about the tables present in the cluster and that it gets informed of any tables being added, dropped or altered.
This parameter specifies the time that the MySQL server will wait for NDB to connect before it proceeds assuming that the NDB storage engine is unavailable. Defaults to 30 seconds.
This option specifies the amount of time the MySQL server will wait for the NDB storage engine to perform its startup. Defaults to 30 seconds.
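A my.cnf sketch tying these connection options together; the hostnames and node ids are made-up examples:

```ini
[mysqld]
ndbcluster
# One or more management servers to fetch the configuration from
ndb-connectstring=mgm_host_1:1186,mgm_host_2:1186
# Four cluster connections, each with an explicit node id
ndb-cluster-connection-pool=4
ndb-cluster-connection-pool-nodeids=67,68,69,70
# Seconds to wait for cluster connection and setup (both default to 30)
ndb-wait-connected=30
ndb-wait-setup=30
```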
Defaults for table creation#
Your MySQL server is up and running and you want to start creating the tables in your NDB Cluster. There is a number of options that have an effect on the tables that are created.
Before MySQL Cluster 7.5 NDB always read the primary replica. In 7.5 the ability to create tables where any replica can be read was introduced. See the chapter on table options for more on this.
By default the tables are still created without the read backup feature. However if you are planning to use NDB mainly for SQL and use a lot of SELECT queries and in particular if you are planning to colocate the MySQL server with NDB data nodes, we recommend that you create all tables to use the read backup feature.
A read backup table will have a slightly higher latency when updating the table to ensure that we can read our own updates.
If you are using NDB to build a cluster with read scaling, you want to connect up to hundreds of MySQL servers and many data nodes that all read the same data and still have the ability to update from anywhere such that all MySQL servers see your updates.
In this case it might make sense to make all tables use the fully replicated feature explained in the chapter on table options. In this case all data is replicated in all data nodes.
This option set to 1 will ensure that all tables are created as fully replicated tables.
Setting this variable has no effect.
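As a sketch, the table creation defaults discussed here can be set in my.cnf; enable only one of the two depending on the intended use:

```ini
[mysqld]
# Create all new tables with the read backup feature
ndb-read-backup=1
# For read-scaling setups, create all new tables fully replicated instead
#ndb-fully-replicated=1
```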
Setting this option to 1 means that any table created will use the NOLOGGING feature explained in the chapter on table options.
NOLOGGING tables will be restored as empty tables after a cluster crash.
By default the column format used is FIXED, which means that fixed size columns use the fixed size part of the row and variable sized columns use the variable sized part of the row.
If set to DYNAMIC all columns will use the dynamic part of the row. Columns that are stored in the DYNAMIC section will use no memory to store a NULL value, but will use a bit more memory when not NULL.
All new columns added through an ALTER TABLE with algorithm set to inplace will use the DYNAMIC part of the row.
This option should no longer be used.
Alter table options#
There are options to provide control over ALTER TABLE as well.
If this option is set to 0, copying ALTER TABLE isn't allowed. It is recommended to set it to 0, since copying ALTER TABLE should be avoided and not used by mistake; it must then be explicitly enabled before a copying ALTER TABLE can be performed.
This defaults to 0 and is recommended to stay that way.
With some tables added into NDB, it is now time to start querying those tables. There is a fair amount of options that control execution behaviour.
The MySQL server caps the number of connections it will accept. By default it is set to 151 connections. In many applications it is desirable to use a lot more connections. In my benchmark runs I always set this variable to an appropriate value.
The server caps the number of prepared statements it will accept, by default 16382. A prepared statement is bound to a connection, so with many connections preparing similar statements the server can need more than 16382 prepared statements.
As an example running sysbench will prepare around 20 prepared statements per connection, running a benchmark with 1024 connections requires this parameter to be changed.
It is set a bit lower for security reasons. It cannot be set higher than 1048576.
The MySQL server limits the amount of memory a connection can use for sorting to 256 kBytes. To use more than that requires changing this configuration option.
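A my.cnf sketch for a benchmark-style setup; the values are made-up examples, not recommendations for every workload:

```ini
[mysqld]
# Default is 151 connections
max_connections=1024
# Sysbench-style loads prepare around 20 statements per connection,
# so 1024 connections need well over the default 16382
max_prepared_stmt_count=65536
# Per-connection sort memory (default 256k)
sort_buffer_size=1M
```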
autocommit is set to 1 by default. This means that every statement is a separate transaction and will be committed as part of the query execution. It is still possible to start a longer transaction using a BEGIN statement and end it with either COMMIT or ROLLBACK.
When this option is set to 0 every statement after a COMMIT or ROLLBACK will start a new transaction that have to be completed with a COMMIT or ROLLBACK.
There is not many reasons to change the default setting.
Meta data options#
The MySQL server caches table metadata to speed up queries. Each connection uses one table object from the table open cache for each table it queries; in addition it uses a table object from the table definition cache. To retrieve such objects there are a number of parallel hash tables, ensuring that many connections can access these caches in parallel.
Each query needs one table object from the table open cache for each table it uses. This option sets the amount of table objects that can be in the cache before we start deleting table objects. There can be many table objects referring to the same table since every connection needs its own table objects.
The default is to allow for up to 2000 table objects. Increasing this also increases the number of file descriptors that the MySQL server uses.
Each table object is read from an frm file normally stored in the datadir. The table definition cache specifies how many frm files are stored in memory such that we don't have to open the frm file to get the meta data of the table.
The default value is calculated based on the setting of table open cache. If this is set to its default it will be set to 1400.
This specifies how many hash tables are used to access table objects. It defaults to 16, this number makes the MySQL Server scale to a high number of CPUs. To improve scaling on very large servers one can increase this number, there is very little reason to decrease this number other than to save a bit of memory.
By setting this option to 1 we ensure that the MySQL server locks all memory it uses; allocation will fail when there is no real memory available, rather than swapping to disk.
--lc-messages and --lc-messages-dir is used to set the error messages in a proper language. The default is english and a directory found through basedir.
--ndb-use-exact-count defaults to 0 and should always stay that way. The alternative is to use a scan to get the exact number of rows, which is far more expensive than the default method; thus there are no benefits in changing this option.
When a connection is ready to communicate with the data nodes it can either use force send or adaptive send. Force send is the default and works well in most situations. In a highly concurrent environment adaptive send might have some benefit, but the difference between the two methods is usually very small; this setting should only be changed after testing with your application.
Should never be changed, was introduced for backwards compatibility.
In most cases the MySQL server will be able to deduce which data node is the closest neighbour by comparing the hostname of the MySQL server and the hostname of the data nodes.
If they are running on the same machine and using different hostnames one can tell the MySQL server explicitly which data node is the closest neighbour.
The proximity of a data node will be used to select transaction coordinator when no specific hint has been provided.
This option has a weird name. It is set by default to 1, which means that large insert statements will be split into several transactions. If set to 0, large insert statements will also be executed using normal transaction boundaries.
When NDB discovers a large insert, delete or a large update, it will normally use batching. Batching here means that multiple row writes will be sent in parallel to the data node. This is the default setting and should normally be kept.
The only reason I can think of when it could be interesting to not set it is when it is required to avoid overloading the data nodes.
During large write operations we keep a memory buffer for retrieved values. By default this buffer is 32 kBytes in size. Increasing this can lead to a higher batch size for large write operations and thus higher throughput at the expense of using more memory per connection.
The default value of this parameter is 1. This means that auto increments will use steps of 1 between rows inserted in using an auto increment column.
Normally an autoincrement column should use a BIGINT; then there is no problem with the autoincrement numbers having gaps or not being inserted in perfect order.
In this case the prefetch size should be set much higher; setting it to 1024 makes it fairly safe that the autoincrement column will not cause scaling issues. The default setting will certainly cause scalability issues with many parallel inserts into a table with an autoincrement column.
The maximum it can be set to is 65536.
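In my.cnf form, the safer prefetch size suggested above:

```ini
[mysqld]
# Fetch autoincrement values in blocks of 1024 to avoid a scaling bottleneck
ndb-autoincrement-prefetch-sz=1024
```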
--ndb-blob-read-batch-bytes and --ndb-blob-write-batch-bytes#
These two options control the behaviour of accesses to BLOB tables. When set to a non-zero value, the MySQL server will send reads and writes to the BLOB tables in batches of this size. Both default to 64 kBytes.
By increasing this size we will perform more work in parallel when reading and updating BLOB tables. At the same time it increases the risk of overloading the cluster. The default setting is intended for a balanced setting.
The maximum size of temporary tables created during query execution is limited by the minimum of these two options. Both are set to 16 MByte by default.
This option is not intended to be changed. It delays the check of constraints to the end of the transaction.
By default the MySQL client only shows the documented ndbinfo tables. The base tables are not documented since they can change from release to release. If one wants to also see the base tables one should set the option ndbinfo-show-hidden to 1.
There are a number of options that directly impact the query optimisation.
Pushdown of joins is a key feature of NDB. It means that many complex queries can execute various row conditions in parallel. It uses a form of linked joins.
It is possible to set this option to not be used. This might be useful in some cases when testing have shown that a certain query is more efficient when executed using a different algorithm present in the MySQL server.
There should be no reason to reset it when starting the MySQL server.
Index statistics is enabled by default for NDB. Enabling index statistics can at times provide major improvements to query execution time through a much better query plan.
There is a great number of options that can be influenced through this option. It is beyond the scope of this book to go through all of those. The MySQL manual contains a list of the options and how to set them.
When executing OPTIMIZE TABLE we will have a delay between each batch of rows sent for optimisation. This delay defaults to 10 milliseconds.
This variable is a massive switch that can be used to enable or disable 19 different algorithms.
There is a set of options available to trace the workings of the optimizer. Many applications have problems in developing SQL queries that work well. The optimizer trace feature will give detailed information about what makes the query optimizer choose certain paths and avoid other paths.
During query development and query debugging this feature can be very useful. This option is only available in debug builds of the MySQL server.
Receive thread configuration#
Each cluster connection has one receive thread. This thread, or some execution thread that takes on the receiver role, executes all signals received from the data nodes. The behaviour of the receive logic is important to get optimal performance when running the MySQL server with NDB.
By default the user threads will take care of the receiver role of the cluster connection until at least 8 threads are active in waiting for response from the data node.
When more than 8 threads are waiting for response from the data nodes, the receive thread will take care of the receive role.
For optimal performance it is usually best to ensure that the receive thread takes care of the receive role in all situations except when there is a single thread executing.
To get optimal performance the receive thread must be able to execute on its own CPU.
For the receive thread to use its own CPU, the MySQL server (and other processes running on the same machine) must be locked to other CPUs, ensuring that no other thread uses the CPU assigned to the receive thread.
To lock the MySQL server to certain CPUs I use the taskset program or the numactl program.
For optimal behaviour the --ndb-recv-thread-activation-threshold should be set to 1.
By default this is not set and thus the receiver thread(s) can execute anywhere. When set, it should be a comma separated list of CPU ids. The number of CPUs in this list should be one for each cluster connection we use.
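A my.cnf sketch for a setup with one cluster connection, using the comma-separated CPU list form described above; CPU id 15 is a made-up example and must be kept free of other threads through taskset or numactl:

```ini
[mysqld]
# Let the receive thread take over the receiver role almost always
ndb-recv-thread-activation-threshold=1
# One CPU id per cluster connection
ndb-recv-thread-cpu-mask=15
```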
MySQL Cluster replication setup#
There is a wide range of options to use for MySQL replication servers. Remember that in NDB these are normally specialized MySQL servers used primarily to be replication masters or replication slaves.
Most of these options were explained in their context in the chapters on MySQL Cluster Replication.
--log-bin and --ndb-log-bin#
In a replication master the --log-bin must be set to the basename of the binlog files. By default it is OFF. In addition the --ndb-log-bin must also be set (is ON by default).
In addition the --ndb-log-bin-index should always be set when setting up a MySQL replication server for NDB. This is ON by default, no action is required for this to happen.
Each MySQL replication server must have a unique server id.
This option defines how many bits of the 32 bit server id to actually use. It defaults to 32, but can be set as low as 7. The idea with this is to provide room for NDB API applications to use special NDB API options to insert changes into the binlog using special server ids.
By default the binlog is synced once per transaction. NDB writes one epoch as a transaction into the binlog, and syncing each epoch is what we want, so sync-binlog should be set to 1 when replicating from an NDB Cluster. About 10 epochs are generated per second, so there will be no performance impact from this.
When the MySQL replication server is executing as a slave it will use relay logs to store the binlog information retrieved from the replication master before the logs are applied to the slave cluster. This option can be used to set a non-standard name of the relay logs.
To achieve the desired speed of the slave, the --slave-allow-batching option should be set to 1. It ensures that the slave applier uses batching to improve the speed of applying the replication logs in the slave cluster.
--ndb-eventbuffer-max-alloc and --ndb-eventbuffer-free-percent#
The --ndb-eventbuffer-max-alloc option can be used to set a maximum on the memory available to use for event buffers in the NDB API. By default there is no limit. If this option is set and we reach the maximum allocated size, the free percent option indicates how much memory must be free before we start receiving new events again. By default that is set to 20%.
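A sketch of these two options in my.cnf, assuming the maximum is given in bytes (the numbers are illustrative):

```ini
[mysqld]
ndb-eventbuffer-max-alloc=1073741824  # cap event buffers at 1 GiB (0 = unlimited, the default)
ndb-eventbuffer-free-percent=30       # resume buffering events when 30% is free (default 20)
```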
The --ndb-report-thresh-binlog-epoch-slip option defaults to 10. It sets the number of epochs that the injector binlog thread may lag behind before the MySQL server logs this lag to the error log.
The epochs are first received by the NDB API into a memory buffer; next the injector thread injects them into the binlog. This report is written when the injector thread cannot keep up with the NDB API.
The --ndb-report-thresh-binlog-mem-usage option sets the level at which reports about memory usage are written to the error log.
The --ndb-log-updated-only option is ON by default. Thus only columns that are changed are logged; columns that are not changed are not part of the log records. This makes replication to other NDB Clusters more efficient.
It must also be set to OFF when using the conflict detection functions.
The --ndb-log-update-minimal option is used to minimize the amount of binlogging required when using the conflict detection functions NDB$EPOCH2 and NDB$EPOCH2_TRANS. It should never be set in any other situation. It is OFF by default.
By default NDB logs updates as writes (--ndb-log-update-as-write is ON); this ensures that no before-value image is required and thus gives higher efficiency. When replicating to another storage engine and when using any of the conflict detection functions this option should be set to OFF.
In a circular setup where the slave cluster will replicate to yet another cluster (or another storage engine) we must enable the --log-slave-updates in the MySQL replication servers that act as binlog servers.
The --ndb-log-orig option is off by default; it needs to be set to enable circular replication. It is necessary to know the origin cluster of the log records so that we can stop replicating when we are back at the originating cluster. It should be set on all binlog servers in the circular replication setup.
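A hypothetical my.cnf fragment for a binlog server taking part in circular replication could thus look like:

```ini
[mysqld]
log-slave-updates=ON  # re-log changes applied from the master so they propagate onwards
ndb-log-orig=ON       # record the originating server of each epoch, so the
                      # cycle can be stopped at the originating cluster
```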
It is possible to log empty epochs as well; by default this is disabled. In the chapter on global failover architecture we discussed how this can be used as part of a failover script. With this feature we can watch updates to the ndb_apply_status table to see if the replication channel is up and running.
By default the RESET SLAVE command clears all entries from the ndb_apply_status table. By setting the --ndb-clear-apply-status option to OFF the table is not cleared when the RESET SLAVE command is executed.
The --ndb-log-apply-status option is OFF by default. When ON it ensures that updates to the ndb_apply_status table are replicated to the slave cluster.
It is used in chain replication and circular replication to ensure that the replication chain can be discovered.
This option must be set to ON when using the NDB$EPOCH2_TRANS method for conflict resolution. By default it is OFF.
This option can be used with the NDB$EPOCH2_TRANS method for conflict resolution. By default it is OFF. If it is set, the option --ndb-log-empty-update must also be set to ON.
See the chapter on MySQL Cluster replication for a description of this. It is only related to replication with conflict detection functions.
The --ndb-show-foreign-key-mock-tables option shows the mock tables that are used by NDB to handle foreign_key_checks set to 0. It also adds a bit more descriptive logging in the error log.
It is possible to display information about the NDB version, the ndbinfo version and the NDB version in text format by using the following options --ndb-version, --ndbinfo-version, and --ndb-version-string.
If it is desirable to get a core file when the MySQL server crashes, the --core-file option should be set.
MySQL Server Status variables#
There are many different status variables available in the MySQL Server, both for the NDB storage engine and for general MySQL Server usage. We will not go through them at all in this book, we will instead refer to the MySQL manual that contains a description of each and every one of those status variables. They can be helpful in examining a MySQL server in production.
Analysis of hyperthreading performance#
In this chapter we will look specifically at the concept of hyperthreading. It is important to understand when hyperthreading helps, how much it helps, and in which situations to use it or not. We have one section for x86 servers, which always have 2 CPUs per CPU core, and we will also analyse the SPARC M7 servers that use 8 CPUs per CPU core. Understanding this is important to configure MySQL Cluster properly, especially for advanced configurations where one wants the best performance out of MySQL Cluster. Some thread types benefit very much from hyperthreading and some almost not at all.
The analysis in this chapter is a direct input into the next chapter on advanced thread configuration options for MySQL Cluster.
We will analyse ldm threads, tc threads, send threads, recv (receive) threads in the data nodes. We will also analyse normal MySQL Server threads used to execute SQL queries and we will analyse the send and receive threads of the NDB API cluster connections.
Most of the analysis was performed running Sysbench OLTP RW. On Linux the behaviour was analysed using perf stat, a command in the perf program suite that is used, among other things, to calculate IPC (instructions per cycle executed). On SPARC/Solaris the method used was to run benchmarks and ensure that the thread type being analysed becomes the bottleneck.
The ldm thread is the thread type that dominates CPU usage in the data nodes. This is especially true for Sysbench OLTP RW where a lot of 100-row scans are performed. tc threads get more work when there are many small primary key lookups. send threads always have a fair amount of work to do and get more when the nodes to send to are on other machines.
In a Sysbench OLTP RW execution the ldm threads use almost 75% of the CPU in the data nodes. Two things affect the use of hyperthreading for ldm threads. The first is the efficiency of hyperthreading for the ldm threads themselves, and the second is the fact that increasing the number of ldm threads means more memory traffic to communicate between threads and more partitions to handle.
The number of ldm threads is tied to the number of REDO log parts in the data nodes. Thus doubling the number of ldm threads means that the number of partitions in the tables doubles.
The consequence of doubling the number of partitions is the following. A bit more memory is used for storing information about the partitions and their fragment replicas. This doesn't have any immediate impact on performance. We can declare small tables to use only one partition per node or node group if we want to remove this problem.
However large tables and tables that are heavily used should be spread amongst the ldm thread instances as much as possible.
When a large table is used with unique key lookups and primary key lookups, the queries always go immediately to the correct partition; the number of partitions doesn't matter for performance in this case.
For scans that use the partition key properly one can ensure that the scans go to only one partition; for these scans the number of partitions doesn't matter either.
However, all other scans, both full table scans and ordered index scans, are performed on all partitions. This means that if we are only looking for a small number of rows, the overhead of starting up a scan on each partition will be noticeable. At the same time the parallelism increases with more ldm threads and thus latency at low load decreases. For a generic application that hasn't been optimised for partition pruning, an increase in the number of partitions generally means decreased scalability.
The gain in performance from using one ldm thread per CPU rather than one ldm thread per CPU core is only 15%. Thus for generic applications it isn't really beneficial to use hyperthreading for ldm threads; it is better to use only one ldm thread per CPU core.
There is also a third thing that affects latency and performance when the number of ldm threads increases. Doubling the number of ldm threads can have an impact on the number of application threads required to reach the top performance levels. This is more or less an effect of queueing mechanisms and of the fact that with more ldm threads more hardware paths are used. It is also to some extent the effect of some internal mutexes that matter more as the number of ldm threads increases.
Our advice for ldm threads is the following. If your application is perfectly scalable (all scan queries are directed towards only one partition) it might be ok to always use hyperthreading for ldm threads as well. When running with a very small number of ldm threads there is no specific reason to avoid it.
However, for example when running on a computer or VM that has 16 CPU cores, we would definitely argue against using 20 or 24 ldm threads on 10 or 12 CPU cores; instead use 10 or 12 ldm threads on the same CPU cores. Increasing the number of ldm threads beyond 16 gives a small performance loss due to mutex contention.
Up to 4 ldm threads it is ok to use hyperthreading more or less always, but from 6 ldm threads and up it is often a good idea to use one ldm thread per CPU core.
The maximum number of ldm threads is 32; thus one data node can at most scale to around 40-48 CPU cores. One can still run multiple data nodes on one computer if more cores need to be used for data nodes.
Exactly why hyperthreading isn't so beneficial for ldm threads we don't know; it is not uncommon for database engines not to benefit much from hyperthreading. Our experiments show that ldm threads have an IPC of 1.27; a likely cause is that with 2 threads per CPU core we run out of computing resources. The cache levels definitely have an impact on the behaviour as well.
Other thread types#
For the remaining threads hyperthreading is beneficial although it is good to know to what level it is giving benefits.
For tc threads the benefit is a 40% increase in performance; for send threads it likewise increases performance by 40%. For recv (receive) threads the benefit is a lot higher, all the way up to a 70% increase in performance. The reason is that the receive thread does a lot of work preparing protocol headers for the other threads; it performs a fair amount of CPU processing on each byte it receives on the wire. ldm threads, in contrast, work a lot on large memory segments and thus have many cache misses that miss in all cache layers.
The MySQL Server threads that execute SQL queries gain 60% in performance from hyperthreading, and a similar number applies to the receive threads of the NDB API cluster connections.
Conclusion for x86 servers#
The simplest rule to follow is to use hyperthreading for all thread types except ldm threads. A more complex rule is to look in more detail at the application's scalability and the number of CPU cores available to decide whether to use hyperthreading for ldm threads. However, using hyperthreading for ldm threads is not likely to bring more than 15% more performance.
Analysis for SPARC servers#
The analysis of SPARC servers is based on the SPARC M7 processor. This CPU has 32 cores per CPU socket and each CPU core has 8 CPUs. Thus a single CPU socket provides up to 256 CPUs to play around with.
Analysis of MySQL Server threads#
We ran a Sysbench OLTP RW benchmark where we kept the number of ldm threads constant at 8 and ensured that the other thread types weren't a bottleneck. We used 4 SPARC cores throughout and only varied the number of CPUs we made available to the MySQL Server (using the pbind command).
Using 1 CPU per core we got 1363 TPS; with 2 CPUs per core the result was 2183 TPS (+60%). Using 4 CPUs per core the performance increased to 3315 TPS (+143% compared to 1 CPU per core), and when making use of all 8 CPUs per core we got 3888 TPS (+185% compared to 1 CPU per core). The SPARC M7 was able to crunch out almost 1000 TPS per CPU core on the MySQL Server side. The SPARC CPU is very well suited to the thread types used in a MySQL Server. The MySQL Server requires a few more cluster connections compared to when used with x86 servers.
For MySQL Servers where often we have many threads running independent tasks that have a fairly large working set the SPARC CPU is very well equipped to achieve good use of the CPU resources and it pays off to use all 8 CPU threads in each CPU core.
Analysis of ldm threads#
We made a similar experiment where we ensured that the MySQL Server wasn't the bottleneck. We kept the number of ldm threads stable here as well at 8. We tested using 1 ldm thread per core, 2 per core, 4 per core and 8 per core.
The result was that 1 thread per core delivered 740 TPS per core. With 2 threads per core the result was 1000 TPS per core (+35%). With 4 threads per core we scaled all the way to 1640 TPS per core (+122% compared to 1 thread per core). With 8 threads per core we scaled to almost 1800 TPS per core.
Here it is clear that we get a major advantage from using up to 4 CPUs per core; each additional CPU provides almost 30% extra performance up to that point. However, using more than 4 CPUs per core isn't a good idea. Doubling the number of partitions in the tables for less than a 10% increase in performance is certainly not worth the extra overhead that the extra partitions bring, for the same reasons as on x86 CPUs.
Conclusion for SPARC servers#
Using SPARC to execute MySQL Cluster seems like a good idea as a method of handling many smaller clusters. Given that even a very capable data node with 16 ldm threads can fit in 6 CPU cores and that a very capable MySQL Server can be handled within 8 CPU cores it means that a virtual machine using half of a CPU socket will be a very capable node in a cluster.
Advanced Thread Configurations#
Quite a few of the products currently using MySQL Cluster are building real-time applications. Examples are various types of network servers such as DNS servers, DHCP servers, HLR (used in 3G and 4G networks), LDAP servers and financial use cases with high requirements on response times.
To design a real-time service using MySQL Cluster one needs to take special care in setting up the threads and how they operate in the data node configuration. We cover this here.
We will start by describing how to set up the single-threaded ndbd for real-time operation and next move on to how to set up the multi-threaded ndbmtd for real-time operation. The reason that we start with ndbd is that the solution for ndbmtd has evolved from the solutions used for ndbd. It is good to see how we solved the problem historically for ndbd before we move on to showing how we solved it in ndbmtd. Most of the solutions in ndbmtd have a direct mapping to how it works in ndbd. Sometimes the parameters used for ndbd still apply to ndbmtd.
Also the ndbd setup is much easier to understand, by explaining this first we build the foundation to make it possible to fully understand the thread configuration options for ndbmtd.
Setting up ndbd for real-time operation#
When setting up a real-time service, the optimal configuration is to use the old ndbd binary for the data nodes. This binary has only a single thread that reads from the network, writes to the network and performs the database services. We have kept this configuration possibility since there are cases where response time is the overall target and the performance requirements are smaller.
In most environments one will set up data nodes to use ndbmtd, the multi-threaded data node given that the throughput of ndbd is very limited, ndbmtd can scale to more than 10x the performance of ndbd.
We will describe how to setup both of those environments for real-time use cases.
This setup is still used in production all over the world and it helps to understand the design ideas from here and how they later translate into similar concepts in the ndbmtd. ndbd is a very popular use case when you simply want a highly reliable data service and don't care so much about performance and you want a very cheap IT infrastructure.
One important feature to be able to control the execution environment is to lock the threads into specific CPUs. It is vital to ensure that these CPUs are shielded from both OS activity and interrupt activity.
The only OS that supports this shielding natively is Solaris, which has the ability to exclusively lock CPUs to threads. On Linux, Windows and FreeBSD it is possible to lock threads to CPUs, but not to prevent other threads from executing there as well. On modern Linux one can use the cgroup concept to make the locking exclusive. Systems such as Docker make extensive use of the cgroup concept to control resource usage.
Linux provides some means to control execution of OS threads and there are means to control interrupt execution.
In ndbd we have two important configuration parameters that control CPU locking. The first one is LockExecuteThreadToCPU. One sets this configuration parameter to the CPU id one wants the execution thread to be locked to. The other part is to set the LockMaintThreadsToCPU. This controls the locking of the file system threads to a CPU.
Thus ndbd will execute on one CPU core, or for the optimal case on 2 CPU cores: one core for the execution thread and the other core for the file system threads.
It is important to remember here that when using the feature of locking to CPUs it is important to control other programs running on the same machine to not interfere by using the same CPUs. This can be done by using taskset or numactl to control placement of programs to CPUs. With the aid of those two parameters we have now gained control of the execution environment for ndbd.
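A sketch of the two CPU locking parameters in config.ini; the CPU ids are illustrative and must match the host:

```ini
[ndbd]
LockExecuteThreadToCPU=2  # pin the single execution thread to CPU 2
LockMaintThreadsToCPU=3   # pin the file system (maintenance) threads to CPU 3
```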
This is still not sufficient to provide a real-time service environment making full use of the ndbd environment.
To achieve extremely low response times requires busy waiting. Thus the execution thread cannot go to sleep after completing its work. We control this through the SchedulerSpinTimer configuration parameter. This specifies for how many microseconds the execution thread will spin in the main loop waiting for external input before going to sleep.
In ndbd we have one more configuration parameter that can be used to control the execution thread. This is the SchedulerExecutionTimer. This controls the number of microseconds we will execute before we perform the send operation. For extreme real-time services we need to set this to a low value possibly even to 0. By default this is set to 50 to improve throughput.
Most OSs provide a real-time scheduling mode. This means that the thread cannot be interrupted, not even by the OS itself and not even by interrupts. MySQL Cluster contains protective measures to ensure that we never execute for more than tens of milliseconds without going back to normal scheduler priority; executing for a prolonged time at real-time priority would completely freeze the OS. Setting this mode is done through the configuration parameter RealtimeScheduler.
There is a protective setting, LockPagesInMainMemory. By setting this parameter, memory is locked such that no swapping can occur. In the normal case this has no effect, but if there is an issue with insufficient memory then swapping can occur, and this can have a very significant effect on response times. This feature is useful if the data node operates in a shared environment where, for example, a MySQL Server executes on the same machine.
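Combining the real-time parameters discussed above, a hypothetical config.ini fragment for ndbd could look like this (the spin time is illustrative):

```ini
[ndbd]
SchedulerSpinTimer=400     # spin up to 400 microseconds before going to sleep
SchedulerExecutionTimer=0  # send as soon as possible for the lowest response times
                           # (default 50, which favours throughput)
RealtimeScheduler=1        # run the threads at real-time priority
LockPagesInMainMemory=1    # lock memory so no swapping can occur
```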
There is one more option that affects response times and that is the choice of transporter. MySQL Cluster has an architecture that supports different transporter types; we currently only use TCP/IP sockets.
Historically we've had experimental support for a shared memory transporter. This transporter was useful when running a benchmark on a 72-CPU SPARC machine in 2002: we managed to execute 1.5M reads per second already at that time, using a 32-node cluster where each node used one of the 72 CPUs available and the communication used the shared memory transporter.
There was a SCI transporter based on an API that Dolphin developed for the SCI communication card. It hasn't been used in many years since Dolphin developed a driver that makes it possible to use SCI through a normal socket protocol. I worked a lot with Dolphin a number of years ago to ensure that the Dolphin socket driver had the optimal configuration and had minimal response time issues. The Dolphin socket driver and Dolphin cards are still an interesting option to use with MySQL Cluster.
Dolphin still delivers cards; today they are based on PCI Express. The driver software is still the same, Dolphin SuperSockets. Using this driver one gets a very low-latency network driver. To use it one uses normal TCP/IP sockets and ensures that the IP addresses used by Dolphin SuperSockets are used.
The other low-latency approach is to use Infiniband. Currently this requires using IP over Infiniband (IPoIB). This still uses the normal TCP/IP transporter; we're only changing the underlying transport network. We have run benchmarks at Intel sites where we connected hundreds of computers using IPoIB and managed to perform 205M reads per second of around 120 bytes of data per read, thus transporting 25 GByte per second (200 Gb/s) of user data between the hosts.
In order to configure ndbd for a real-time environment it is necessary to consider the following options:
Use a low-latency transport network (Dolphin or Infiniband)
This concludes the type of settings required to set up a real-time service using ndbd. It is required to handle locking threads to CPUs, it is required to have the option for threads to spin instead of immediately sleeping, it is required to run the thread as threads operating with real-time priority, it is required to ensure that memory is locked such that swapping cannot occur and it is required to control send processing.
Setting up ndbmtd for real-time operations#
Setting up ndbmtd for real-time operations requires us to deal with the same set of challenges as with ndbd. We need to control CPU locking, we need to control if any spinning should be done while waiting for input. We need to control the thread priority. In ndbmtd all of those things are controlled by one complex configuration variable ThreadConfig. This variable defines the number of threads of various sorts that we use in ndbmtd.
The configuration parameter SchedulerSpinTimer is still used in ndbmtd, it is the default value for the spintime of threads. This value can be overridden by a value in ThreadConfig. RealtimeScheduler is handled in the same manner, thus the value in this variable is the default and can be overridden by values in the ThreadConfig variable.
LockPagesInMainMemory has the same meaning in ndbmtd as it has in ndbd. It can be set to 0, indicating no locking; to 1, indicating that we ask the OS to lock the memory before the memory is allocated; or to 2, meaning that we first allocate memory and then ask the OS to lock it, thus avoiding any swapping.
The ndbmtd program has one more configuration variable that impacts its real-time operation: the SchedulerResponsiveness variable. By default it is set to 5 and it can be set between 0 and 10. Setting it higher means that the scheduler more often synchs with other threads and sends messages to other nodes. Thus a higher value means lower response times at the expense of throughput, and a lower value means higher throughput.
There is an undocumented variable __scheduler_scan_priority. This variable can be used to decrease the priority of scan operations and thus raise the priority of key operations. We don't expect this to be used other than in very special cases with very high requirements on response times. Setting it will decrease throughput significantly (up to 10%), so it is an option for extreme real-time applications, but for most applications it should not be used. Setting it to 1 means that 1 row per real-time break will be executed in scan operations. The default is 6 rows per real-time break (except for the first real-time break, which executes only 1 row independent of configuration).
There is no correspondence to SchedulerExecutionTimer in ndbmtd. However the environment is very different in ndbmtd, some of the aspects on how often we send is handled by configuring SchedulerResponsiveness. We still have the option to hold onto sending to a node for some time. This is defined by the config variable MaxSendDelay.
When a signal is executed it can generate new signals sent to the same thread, to other threads in the same node and to threads in other nodes. SchedulerResponsiveness controls scheduler parameters that determine how many signals we execute before we deliver the generated signals to other threads, and the maximum number of signals executed before we deliver the signals to other nodes.
When a decision to send to a node has been made, we send to nodes in FIFO order (First In, First Out). We always put a bit more focus on sending to other data nodes in the same node group (to ensure low response times on write transactions).
To get the highest throughput, especially in large clusters, we might need to delay the sending of signals to ensure that send buffers are as full as possible. MaxSendDelay sets the maximum time to delay signals in the ready-to-send state. It is provided in microseconds.
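A sketch of these two scheduler parameters in config.ini; the values are illustrative and must be tuned per workload:

```ini
[ndbd default]
SchedulerResponsiveness=8  # 0-10; higher favours response time over throughput (default 5)
MaxSendDelay=200           # delay sends up to 200 microseconds to build larger send buffers
```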
Thread types to configure in ThreadConfig#
ThreadConfig is by far the most complex configuration variable in MySQL Cluster. One can use a simple version of this by using MaxNoOfExecutionThreads. One can even specify the CPUs to lock those threads to by providing those in a CPU bitmap in the configuration variable LockExecuteThreadToCPU and IO threads can be specified in LockMaintThreadsToCPU as in ndbd. This possibility is mostly for simplicity, it doesn't necessarily provide a good configuration.
Using ThreadConfig is definitely the preferred manner of configuring the thread configuration and the characteristics of each thread in terms of CPU locking, thread priority and spin time to provide an optimal real-time service using ndbmtd.
ThreadConfig makes it possible to configure the following thread types, ldm, tc, send, recv, main, rep, io, wd. We will describe each thread type and how to configure it.
ldm is the most important thread type. This is the thread where the local database is located. In order to read any data in NDB one has to access the ldm thread. LDM stands for Local Data Manager. ldm contains a hash index, a T-tree ordered index, the tuple storage, handling of checkpoints, backups and various local trigger functions.
The number of ldm threads is very important since it defines how many partitions default tables will have. By increasing the number of ldm threads we increase the capacity for reading and writing. Unfortunately, increasing the number of ldm threads also adds to the cost of range scans, since they need to scan all partitions to find their data.
There is a very important correlation between the number of ldm threads and the number of fragment log parts; it is highly advisable to set these to the same number. The latter is set by the config variable NoOfFragmentLogParts. Setting this variable lower than the number of ldm threads means that a number of ldm threads are unused, and setting the number of log parts higher than the number of ldm threads means that we can get unbalanced load on the ldm threads.
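As a minimal sketch, the pairing described above could look like this in config.ini (the count of 8 is illustrative; thread types not listed in ThreadConfig get their defaults):

```ini
[ndbd default]
# Keep the number of log parts equal to the number of ldm threads:
NoOfFragmentLogParts=8
ThreadConfig=ldm={count=8}
```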
Setting the number of ldm threads is the first thread type one starts with when designing the configuration.
The next thread type to consider is the tc threads. How many tc threads one needs depends very much on the query types in use. For most applications it is sufficient to have around 1 tc thread per 4 ldm threads. However, if the application performs a high number of key operations, the number of tc threads needs to be higher.
As a rule of thumb that works for most applications, one should have about half as many tc threads as ldm threads. tc threads also assist the send threads a lot; having a few extra tc threads is useful and means that we can have fewer send threads.
TC is used for all key operations, which make heavy use of tc threads. Scan operations make less use of tc threads, even less so for scans that perform much filter processing. tc threads are also used for pushdown joins.
The rule of thumb is to use half as many tc threads as ldm threads. But if your application has a consistent work profile, it makes sense to run some kind of benchmark that simulates your application profile and see how much it makes use of the tc threads.
The ndbinfo table cpustat can be used to get CPU statistics about the different threads in your data nodes.
recv and send threads#
The next important thread types to consider are the recv threads and send threads. The recv threads take care of a subset of the sockets used to communicate with the data node. It is usually a good idea to avoid having the recv threads at overload levels since this delays waking up the worker threads (mainly ldm and tc threads). This is actually true for both recv and tc threads: if these are not fully loaded, they ensure that the ldm threads always stay fully active.
send threads can work well at very high load. Usually, if the load on the send threads goes way up, the traffic patterns change such that larger packets are sent, which automatically offloads the send threads.
In addition, in MySQL Cluster 7.5 all worker threads (ldm, tc, main and rep) assist the send threads when the threads themselves aren't overloaded. The data node keeps track of the CPU load status of all the threads in the data node, including the send threads and recv threads.
If a thread is loaded below 30% it assists the send threads until there is no more data to send before it goes to sleep again. When the load is between 30% and 75% we help the send threads only at moments when we are ready to go to sleep because the thread has no work to do. At higher loads we offload all send activity to other threads and do not assist any other thread in executing sends.
It is usually sufficient to configure the data nodes with a single recv thread. For larger nodes, with around 8 ldm threads and up, it can be useful to increase the number of recv threads. In particular, if the application uses the asynchronous NDB API, the recv threads can be quite loaded due to the high number of signals that need to be put into internal data structures. The recv thread can handle a couple of million signals per second, so this is quite unusual; but it does occur, e.g. when I ran flexAsynch benchmarks where data nodes handle more than 10M key lookups per second.
There is usually a need of a few more send threads. However the design in MySQL Cluster 7.5 where worker threads share the burden of sending with the send threads means that fewer send threads are needed. It is in most smaller configurations quite ok to run with a single send thread. Using about 1 send thread for each 4 ldm threads is a conservative number.
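Putting the thread-count guidelines together, a hypothetical ThreadConfig for a data node with 8 ldm threads could look as follows. The CPU ids, spin time and counts are illustrative and must match the server at hand:

```ini
[ndbd]
NoOfFragmentLogParts=8
ThreadConfig=ldm={count=8,cpubind=1-8,spintime=400,realtime=1},tc={count=4,cpubind=9-12},send={count=2,cpubind=13-14},recv={count=2,cpubind=15-16},main={count=1,cpubind=17},rep={count=1,cpubind=17}
```

Note that cpubind locks each thread to one of the listed CPUs, and that the per-thread spintime and realtime settings override the global SchedulerSpinTimer and RealtimeScheduler defaults.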
main and rep threads#
main and rep threads are simple, there will always be one of each in an ndbmtd process.
The main thread is used mainly for schema transactions and various background activities like heartbeating. It is used quite heavily in the startup phases from time to time. In MySQL Cluster 7.4 it was heavily used in scan processing and could, in exceptional cases, even be a bottleneck where scan processing was very significant. The scan processing in MySQL Cluster 7.5 still uses data owned by the main thread, but it accesses it directly from the tc thread using RCU principles and mutex-protected access. In large clusters the main thread can get a lot of work to handle in certain states of local checkpoint processing, especially when the number of small tables is high.
The rep thread takes care of MySQL Cluster Replication: it forwards information about all updates in the cluster to the MySQL servers in the cluster configuration that are responsible for replicating from one cluster to another cluster or to any other MySQL storage engine. It also handles some other activity related to multithreading and file operations. If there is much updating and MySQL Cluster Replication is in use, this thread can be heavily used.
Those threads are also heavily involved when we reorganize a table to a new table partitioning e.g. after adding a new set of nodes.
The io threads can be configured, although the count setting is ignored for them. The actual number of io threads is determined at runtime and can be controlled by the configuration variables DiskIOThreadPool, InitialNoOfOpenFiles and MaxNoOfOpenFiles. However, as mentioned in the documentation, it is more or less considered a bug if one has to change the *NoOfOpenFiles variables. The interesting things to control for io threads are their priority and the CPUs they are bound to.
Normally the io threads use very little CPU time; most of it simply starts up OS kernel activity, which could as well happen on any other kernel thread defined by the OS configuration. There is a clear case for controlling the io threads when one has set CompressedLCP and/or CompressedBackup. Given that both local checkpoints and backups send a lot of data to disk, it makes a lot of sense to compress this data if spare CPU capacity exists and disk storage is a scarce resource. Using this feature is likely to compress the LCP and backup files to half their size or even less. The compression uses the zlib library.
As mentioned it comes at the expense of using a fair amount of CPU capacity both to write the data to disk as well as to restore it from disk in a restart situation.
If one decides to use this feature it becomes important to control the placement of io threads on the CPUs of your servers.
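As a sketch, enabling compression and pinning the io threads could look like the following. The CPU ids are arbitrary examples:

```ini
# Hypothetical fragment: compressed checkpoints and backups with the io
# threads bound to their own CPU, so that compression work stays off
# the CPUs used by the ldm threads. CPU numbers are examples only.
[ndbd default]
CompressedLCP=1
CompressedBackup=1
ThreadConfig=ldm={count=4,cpubind=1,2,3,4},io={cpubind=5}
```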
One can also control the wd threads. These are the watchdog thread and a few other background threads used for connection setup and other background activities. Normally it is not essential to handle these threads in any special manner. However, if you run in a system with high CPU usage it can sometimes be useful to control the CPU placement and thread priority of these threads. Obviously it isn't good if the watchdog is left completely without CPU capacity, which can happen in a seriously overloaded server.
Thread settings in ThreadConfig#
For each thread type there are three things that one can control:

- which CPUs to lock threads to
- which priority the threads should execute at
- whether the threads should spin when idle
Controlling CPU placement can be done using four different configuration variants: cpubind, cpubind_exclusive, cpuset and cpuset_exclusive. The exclusive variants perform the same action as the plain ones, except that they ensure that no other thread from any other program can use these CPUs; we get exclusive access to them, and not even the OS gets to use them. Although a nice feature, it can unfortunately only be used on Solaris, which is the only OS with an API to control CPU placement at this level. We can use cpubind and cpuset on Linux, Windows, Solaris and FreeBSD. None of the settings can be used on Mac OS X, since Mac OS X has no API to control CPU placement.
When a thread or a set of threads is bound using cpubind, each thread is bound to exactly one CPU. If there are several threads in the group, they are assigned to the CPUs in order. In this case the number of CPUs must be at least the number of threads in the group (io threads are a bit special since they round-robin over the CPUs as they are started, if several CPUs exist in the thread group).
Two thread groups can use the same set of CPUs in cpubind. This can be useful to have several light-weight threads not having a dedicated CPU resource. E.g. the main, rep, io and wd threads can share one CPU in many configurations.
With cpubind the number of CPUs should normally equal the number of threads, and it can never be smaller.
With cpuset, each thread is allowed to run on any of the CPUs in the CPU set. Here the number of CPUs in the set doesn't have to match the number of threads, although it is usually a good idea not to oversubscribe the hardware, since that has a bad effect on both response time and throughput of MySQL Cluster.
One can define several thread groups to use the same CPU set, but it has to be exactly the same CPU set; one cannot define overlapping CPU sets in the configuration.
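To sketch the difference between the two variants, the following hypothetical ThreadConfig binds each ldm thread to its own CPU with cpubind, while main, rep and recv float within one identical shared CPU set (CPU ids are examples):

```ini
# Hypothetical fragment: cpubind gives each ldm thread a dedicated CPU;
# the lighter threads share the same, identical cpuset as allowed above.
[ndbd default]
ThreadConfig=ldm={count=4,cpubind=1,2,3,4},main={cpuset=5,6},rep={cpuset=5,6},recv={count=1,cpuset=5,6}
```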
Thread Priority settings (thread_prio)#
We have the ability to control the thread priority for the various thread types. This is done by using the thread_prio parameter. We can set it to any value between 0 and 10. Higher value means higher priority and the normal priority is 5.
On Linux and FreeBSD this parameter sets the nice value of the thread, going from -20 for thread_prio 10 down to +19 for 0, using the setpriority system call. Setting it to 10 gives nice level -20, 9 gives -16, 8 gives -12, 7 gives -8, 6 gives -4, 5 gives 0, 4 gives +4, 3 gives +8, 2 gives +12, 1 gives +16 and 0 gives +19.
On Windows we make use of the Windows API function SetThreadPriority. We use priorities from THREAD_PRIORITY_LOWEST at 0 and 1, THREAD_PRIORITY_BELOW_NORMAL at 2 and 3, THREAD_PRIORITY_NORMAL at 4 and 5, THREAD_PRIORITY_ABOVE_NORMAL at 6 and 7, THREAD_PRIORITY_HIGHEST at 8, 9 and 10.
Solaris has a wide variety of settings for thread priority. For a database engine whose threads often execute for a long time we selected the FX priority class, which gives a fixed priority level that doesn't change with execution time. The FX priority increases with the thread_prio setting, starting at 15 for 0 and going up in 5-step increments to 55 for 8; 9 gives 58.
Using priority 10 on Solaris 11.3 and higher has a special meaning: it sets FX priority 60, which means that Solaris will attempt to ensure that the thread can execute alone on the CPU core it is assigned to. The thread is thus undisturbed by any other threads running in the machine.
Mac OS X provides no APIs to control thread priority, so the thread_prio parameter cannot be set on Mac OS X.
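As a hypothetical example, raising the priority of the ldm threads slightly above normal while lowering the wd threads could be expressed as follows (the chosen values are illustrative):

```ini
# Hypothetical fragment: on Linux/FreeBSD, thread_prio=7 maps to nice -8
# and thread_prio=4 maps to nice +4 according to the mapping above.
[ndbd default]
ThreadConfig=ldm={count=4,thread_prio=7},wd={thread_prio=4}
```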
Most OSs support some method of setting real-time priority of threads. On Windows we can use the SetThreadPriority call using the parameter THREAD_PRIORITY_TIME_CRITICAL. On Linux we can use the call to sched_setscheduler to set real-time priority. On Mac OS X, FreeBSD, Solaris we can use the pthread_setschedparam call.
These calls are very dangerous; they can even put the entire OS into a frozen state. We provide a way to avoid such freezes by downgrading from real-time priority to normal priority if a thread executes for more than 50 milliseconds without going to sleep. This means that real-time priority is most effective when we don't have massive loads to handle, but rather want very quick and very predictable response times in those situations.
There are two ways to configure real-time settings. The first is to set the configuration variable RealtimeScheduler. This parameter also works for ndbd processes. For ndbmtd processes it sets the default value for all threads. This default value can be overridden by the settings in the ThreadConfig parameter by setting realtime to 0 to not use real-time priority and setting it to 1 to enable real-time priority.
We can set the threads to spin for a while before going to sleep. The thread is then ready to go to sleep and has no work to do, but stays awake in order to quickly service requests when they come in. If this spinning can be done without affecting other parts of the system negatively, it can be an important feature for improving response times. We've seen as much as 20% better response times at low loads through using the spintime parameter.
This feature is available on all OSs since it doesn't involve any OS calls. It simply means that we avoid calling the condition wait call until we've spun for the time configured. This time is configured in microseconds.
Setting spintime works the same way as setting realtime. One can set a configuration parameter called SchedulerSpinTimer, which also works for ndbd. For ndbmtd it sets the default value for all threads in the process and can be overridden by settings per thread type in ThreadConfig.
spintime can be set to at most 500 microseconds; if set higher, it is capped at 500.
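Combining the two mechanisms, a hypothetical configuration could set a process-wide default spin time and then override realtime and spintime per thread type (all values are examples):

```ini
# Hypothetical fragment: 200 us default spin for all threads, a longer
# spin for ldm, no spinning for tc, and real-time priority only for recv.
[ndbd default]
SchedulerSpinTimer=200
RealtimeScheduler=0
ThreadConfig=ldm={count=4,spintime=400},tc={count=2,spintime=0},recv={count=1,realtime=1}
```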
Basic configuration setup#
Judging by the forums, it is quite obvious that the vast majority of NDB users are looking for a very basic high availability setup with one VM per node in the cluster.
As an exercise we will now set up the configuration for such a basic cluster. The previous chapters have given us a lot of input on which parameters matter the most. We want a memory-efficient setup that can handle some load, but not the very highest load. We are not aiming for the absolute highest performance here; it is enough to set things up for a basic user that wants a simple HA setup.
The most basic HA setup has 5 VMs, one for the NDB management server, two VMs for the data nodes and two VMs for the MySQL servers. In addition we will add one API node for use by NDB tools.
We assume that the management server uses IP address 192.168.1.100, that the data nodes are set up on 192.168.1.101 and 192.168.1.102, and that the MySQL servers use 192.168.1.103 and 192.168.1.104.
As preparation for setting up the cluster, we assume that the VMs have opened port 1186 for communication to and from the NDB management server, port 3306 for communication to and from the MySQL servers, and port 11860 for communication to and from the data nodes. This is particularly important to consider when the OS closes all ports by default.
When setting things up in the cloud it is very common that one can define which ports to open. In a cloud setup almost all VMs open port 22 for SSH access, but in addition we need ports 1186, 3306 and 11860 opened for these VMs.
In /etc/my.cnf we store the configuration used to startup the MySQL server nodes. In this file we need to ensure that the MySQL server can use the NDB storage engine and we need to ensure that it can connect to the management server.
In the NDB configuration file we need to create six nodes: the management server with node id 49, the two data nodes with node ids 1 and 2, and the two MySQL servers with node ids 51 and 52. In addition we provide an API node using node id 231. This node is not bound to any specific host.
In this case we assume that the cluster is used to store a fairly small database of up to 200 MByte of data. It is enough to configure 256 MByte of DataMemory. In addition we need a bit of IndexMemory to handle the hash index; 40 MByte is sufficient to store more than 2 million rows, which should be enough for a database size of 200 MByte.
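The sizing reasoning above can be sketched as a small calculation. The roughly 20 bytes per row for the hash index is an assumption derived from the figures in this chapter (40 MByte covering a bit more than 2 million rows), not an exact per-version NDB constant:

```python
# Rough IndexMemory sizing sketch. BYTES_PER_HASH_ENTRY is an assumed
# average cost per row in the hash index, not an official NDB figure.
BYTES_PER_HASH_ENTRY = 20

def rows_covered(index_memory_bytes, bytes_per_entry=BYTES_PER_HASH_ENTRY):
    """Approximate number of rows a given IndexMemory setting can index."""
    return index_memory_bytes // bytes_per_entry

# 40 MByte of IndexMemory covers a bit more than 2 million rows.
print(rows_covered(40 * 1024 * 1024))  # → 2097152
```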
```ini
[ndbd default]
NoOfReplicas=2
NoOfFragmentLogParts=4
NoOfFragmentLogFiles=4
FragmentLogFileSize=64M
DataDir=/usr/local/ndb_data
DataMemory=256M
IndexMemory=40M
DiskPageBufferMemory=0
#DiskPageBufferMemory=256M
ServerPort=11860
MaxNoOfAttributes=10000
BatchSizePerLocalScan=64
StartPartitionedTimeout=0
TwoPassInitialNodeRestartCopy=1
BuildIndexThreads=128
ConnectCheckIntervalDelay=1
HeartbeatIntervalDbDb=1000
MinDiskWriteSpeed=1M
MaxDiskWriteSpeed=2M
BackupDataBufferSize=512k
BackupMaxWriteSize=512k
BackupWriteSize=256k
ODirect=1
IndexStatAutoCreate=1
IndexStatAutoUpdate=1
StringMemory=5

[ndb_mgmd default]
DataDir=/usr/local/ndb_data

[ndb_mgmd]
NodeId=49
Hostname=192.168.1.100

[ndbd]
NodeId=1
Hostname=192.168.1.101

[ndbd]
NodeId=2
Hostname=192.168.1.102

[mysqld]
NodeId=51
Hostname=192.168.1.103

[mysqld]
NodeId=52
Hostname=192.168.1.104

[api]
NodeId=231
```
We have commented out the higher value for DiskPageBufferMemory and instead set it to 0, which means that no disk columns are allowed in any table. If the user wants to use disk columns, we create a properly sized page cache so that the user can have up to about 2 GByte of data in disk columns as well. If so, swap which of the two DiskPageBufferMemory lines is commented out.
This configuration allows us to create 320 table and index objects with up to 1000 columns per table. This should be sufficient for a small database.
We have changed the configuration parameters where it makes sense to deviate from the default settings, as discussed in previous chapters: BatchSizePerLocalScan is lowered to decrease memory usage for lock records; StartPartitionedTimeout is set to 0 to avoid ever starting up in a partitioned state; and NoOfFragmentLogParts, NoOfFragmentLogFiles and FragmentLogFileSize are set to get a properly sized REDO log. In this case we have 4 log parts with 4 files each, where each file is 64 MByte, in total 1 GByte.
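The REDO log arithmetic can be written out as a small helper (a sketch; the function name is made up for illustration):

```python
# Total REDO log size = log parts x files per part x file size.
def redo_log_bytes(parts, files_per_part, file_size_bytes):
    return parts * files_per_part * file_size_bytes

# 4 parts x 4 files x 64 MByte = 1 GByte, matching the configuration.
total = redo_log_bytes(4, 4, 64 * 1024 * 1024)
print(total == 1024 ** 3)  # → True
```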
We have changed the defaults of the local checkpoint buffers to avoid large buffers that are not needed and that would even increase the latency of queries towards NDB. Given that the database is very small, it is a good idea to decrease the checkpoint write speed to avoid wasting disk bandwidth on writing checkpoints very fast.
Setting NoOfReplicas to 2 isn't absolutely necessary since it is default, but it is good to be clear since this is a very basic configuration parameter.
We need to set DataDir to ensure that the placement of NDB files is set.
We use the ConnectCheckIntervalDelay parameter, which provides safer handling of heartbeat failures, and combine it with a shorter heartbeat interval that should still be long enough.
We use the most efficient algorithm to perform initial node restart by setting BuildIndexThreads to 128 and TwoPassInitialNodeRestartCopy to 1.
Many of these defaults have changed in MySQL Cluster 7.6; it is a good idea to use those values in MySQL Cluster 7.5 as well.
Using ODirect is preferable on most file systems, so we set it here.
```ini
[mysqld]
ndbcluster
ndb-cluster-connection-pool-nodeids=51
ndb-connectstring=192.168.1.100
query-cache-size=0
query-cache-type=0
ndb-read-backup=1
ndb-allow-copying-alter-table=0
ndb-autoincrement-prefetch-sz=1024

[ndbd]
connect-string=192.168.1.100

[ndb_mgmd]
connect-string=192.168.1.100
config-file=/usr/local/ndb_data/config.ini
nodeid=49

[ndb_mgm]
connect-string=192.168.1.100
```
In the my.cnf file it is important to ensure that the query cache is entirely disabled, so that it doesn't prevent the MySQL server from scaling to a large number of CPUs.
Using NDB for SQL applications mostly means that the focus is on read scaling. We set the flag that makes all tables use the Read Backup feature. To ensure good scalability for tables with autoincrement keys, we set the autoincrement prefetch size to 1024, avoiding scalability issues when inserting rows with autoincrement keys.
To ensure that the user doesn't use a copying ALTER TABLE by mistake, we set the flag that disallows using a copying ALTER TABLE.
Whether to set the node ids of the MySQL servers, data nodes and management server in my.cnf is a matter of choice. Here I have done so, but this means that the file must be changed in each VM. Otherwise one can provide the node id in the startup command.