Procedure to define advanced configuration#
Number of replicas#
The first choice when setting up an advanced RonDB configuration is the number of replicas to use. The default, and by far the most common choice, is two replicas. The code supports 1 to 4 replicas. Since RonDB 21.04.0 we also support the notion of inactive nodes. This means that we can prepare for the maximum number of replicas from the start. Changing the number of replicas later requires a backup and restore process, thus we should define the number of replicas we might ever need in the cluster from the start.
Most clusters should be fine with a maximum of 3 replicas, thus setting NoOfReplicas to 3 is a good start. This means that we need to add at least 3 data nodes to the configuration. If we only want 1 replica to start with, we add 2 data nodes where we set NodeActive to 0. This means that we do not expect those nodes to start (actually they are not even allowed to start).
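As an illustrative sketch (the hostnames and node ids are assumptions, not values from the original), a cluster prepared for 3 replicas but starting with a single active replica could declare its data nodes like this:

```ini
[NDBD DEFAULT]
NoOfReplicas=3

# The only data node expected to start from the beginning
[NDBD]
NodeId=1
Hostname=192.168.0.1

# Nodes prepared for future replicas, not allowed to start yet
[NDBD]
NodeId=2
Hostname=192.168.0.1
NodeActive=0

[NDBD]
NodeId=3
Hostname=192.168.0.1
NodeActive=0
```

Activating one of the prepared nodes later is then a pure configuration change rather than a backup and restore process.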
Define nodes in the cluster#
One needs to define the set of data nodes, the set of management servers, the set of MySQL Servers and any additional API nodes to use.
Number of log parts#
One can consider setting NoOfFragmentLogParts. This parameter specifies the number of log parts per node, which defines the number of REDO log parts and the number of mutexes used to protect writes to the REDO log.
In NDB the number of log parts and the number of LDM threads had to be the same. This is not required in RonDB; we can use any number of LDM threads with any number of log parts. A mutex is used to protect a log part if the same log part is written to from several LDM threads.
The default setting is to use 4 log parts and the FragmentLogFileSize defaults to 1 GB and NoOfFragmentLogFiles defaults to 16. This means that we have a total of 64 GByte of REDO logs. Experiments have shown that this can handle almost all loads. Thus the main reason to change those parameters would be to decrease them to save space in the file system.
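The 64 GByte figure follows directly from multiplying the three defaults; a small sketch of the arithmetic:

```python
# Default REDO log settings
no_of_fragment_log_parts = 4   # NoOfFragmentLogParts
no_of_fragment_log_files = 16  # NoOfFragmentLogFiles per log part
fragment_log_file_size_gb = 1  # FragmentLogFileSize in GBytes

# Total REDO log space per data node
total_redo_gb = (no_of_fragment_log_parts *
                 no_of_fragment_log_files *
                 fragment_log_file_size_gb)
print(total_redo_gb)  # 64
```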
NoOfFragmentLogParts cannot be set smaller than 4. During execution of the REDO log we execute 1 operation at a time per REDO log. Thus more log parts can lead to a bit quicker recovery if we have many LDM threads. However with the changes in MySQL Cluster 7.6 where partial local checkpoints were introduced this should not be an issue since REDO log execution is normally a very small part of the recovery process. Most of the time is spent in reading back the checkpoints from disk and synchronising with the live nodes.
Most of the data in RonDB resides in memory, so setting up the memory resources available for the database is the next important step.
All changes of the databases in RonDB happen through transactions. Transaction resources can be configured. The most important parameters here, as mentioned in the previous chapter, are TransactionMemory and SharedGlobalMemory.
Tables, indexes, foreign keys, disk data objects and so forth all require schema resources. These schema resources can be configured. As mentioned in the previous chapter, the main parameters to adapt here are MaxNoOfTables, MaxNoOfAttributes and MaxNoOfTriggers.
RonDB is a distributed DBMS, thus communication between nodes can be configured.
Setting up nodes#
The cluster configuration must contain all nodes that will be part of the cluster. We have to define the data nodes; these are equal to the ndbmtd processes we start in the cluster. Similarly, the number of management server nodes is equal to the number of ndb_mgmd processes we start.
This is not true for MySQL Servers and other API nodes. In this case each MySQL Server can use a number of API node ids.
We strongly recommend setting node ids on all data nodes, on all management servers and on all static servers using the NDB API, such as the MySQL Server or any other program that is expected to be up whenever the cluster is up.
One can also define a few API node slots for NDB tools, these need not have a node id.
Similarly, we strongly recommend that data nodes, management servers and all static servers (MySQL Servers and others) set the hostname allowed to connect on each node id.
For security reasons it is a good idea to do the same also with the free API node slots to control from where API nodes can connect to the cluster.
Setting up data nodes#
We recommend that data nodes are the only nodes set up to use node ids 1 through 64. This provides the possibility to grow the number of data nodes to the maximum size without any hassle with reconfiguration of API nodes and other nodes. Actually one can even have 144 data nodes, but since the total number of nodes is at most 255, this is currently not so useful.
As mentioned we recommend to set node id and host name of each data node.
These parameters should be set in the section for each unique data node.
More or less all other parameters should be defined in the default section for data nodes ([NDBD DEFAULT]).
In addition each data node should set the parameter ServerPort. Not setting this means that the management server assigns dynamic ports to use in connecting to a data node.
More or less every modern IT environment has firewalls that make it very hard to work in such a manner. Thus it is recommended to set ServerPort for all data nodes. There is no IANA-defined port for this purpose. We recommend using port 11860; it is easy to remember since the management server by default uses 1186 and there is no conflicting IANA port for this port number.
By using this port there are 4 port numbers to consider for RonDB installations: 3306 for MySQL Servers, 1186 for NDB management servers and 11860 for RonDB data nodes. Using the X plugin and MySQL Document Store means adding 33060 to the set of ports to open in the firewalls.
It is important to define at least DataDir to ensure that we have defined where the various files for a data node are stored; more configuration items for file setup are described in the next chapter. If all VMs have the same file setup we can set DataDir in the [NDBD DEFAULT] section.
Nodes not meant to be used yet should set NodeActive=0.
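A hedged sketch of the data node sections following these recommendations (the hostnames and data directory are placeholder assumptions):

```ini
[NDBD DEFAULT]
# Fixed port so firewalls can be configured once
ServerPort=11860
# Same file system layout on all VMs
DataDir=/var/lib/rondb

[NDBD]
NodeId=1
Hostname=192.168.0.1

[NDBD]
NodeId=2
Hostname=192.168.0.2
```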
Setting up management server nodes#
As mentioned one should set node id and host name of the management server.
Normally PortNumber uses the default port number 1186 for NDB management servers; otherwise it needs to be set.
The management servers should use node ids 65 and 66.
DataDir defaults to /var/lib/mysql-cluster; this variable should always be set.
A RonDB management node not meant to be used yet should set NodeActive=0. One should never use more than 2 RonDB management servers since not all protocols support handling more than 2 management servers.
Setting up MySQL Server nodes#
Given that each MySQL Server requires one API node for each cluster connection it uses, it is important to set up the correct amount of MySQL Server nodes in the cluster configuration. For example if the MySQL Server uses 4 cluster connections it is necessary to set up 4 API nodes. As mentioned above it is recommended to set node id and host name of all those nodes. In this case all 4 of the API nodes must use the same host name but different node ids.
When starting the MySQL Server one sets the parameter ndb-cluster-connection-pool-nodeids to define the node ids used by the MySQL Server. This is a comma-separated list of node ids.
The MySQL Server node ids should start at 67.
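As an example sketch, a MySQL Server using 4 cluster connections could be pointed at node ids 67 through 70 with options like these (the node ids are assumptions following the recommendation above):

```ini
[mysqld]
ndbcluster
ndb-cluster-connection-pool=4
ndb-cluster-connection-pool-nodeids=67,68,69,70
```

The cluster configuration must then contain four API node sections with these node ids, all using the host name of this MySQL Server.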
Setting up API nodes#
Setting up API nodes for permanent servers follows the same principle as for MySQL Servers. Setting up API nodes for use by the various NDB tools usually doesn't require any fancy setup. For security reasons it can be a good idea to set the host name. There is no specific need to set the node id on those nodes.
Here is a basic configuration file for two data nodes and two MySQL servers that are colocated. It only sets up the number of replicas, the nodes themselves, the data directories, hostnames and node ids. More details will be covered in later chapters.
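A minimal sketch of such a configuration file; the IP addresses, directories and node ids are placeholder assumptions:

```ini
[NDBD DEFAULT]
NoOfReplicas=3
ServerPort=11860
DataDir=/var/lib/rondb

[NDBD]
NodeId=1
Hostname=192.168.0.1

[NDBD]
NodeId=2
Hostname=192.168.0.2

# Extra node prepared for a third replica
[NDBD]
NodeId=3
Hostname=192.168.0.1
NodeActive=0

[NDB_MGMD]
NodeId=65
Hostname=192.168.0.1
DataDir=/var/lib/rondb

[MYSQLD]
NodeId=67
Hostname=192.168.0.1

[MYSQLD]
NodeId=68
Hostname=192.168.0.2
```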
The extra nodes use the same IP as one of the other nodes, this might generate a warning when starting the management server, but this warning can be ignored. This practice is used to ensure that the IP address used as hostname actually exists.
Advanced Configuration of Data nodes#
Data nodes have quite a lot of configurable resources and parameters that affect various internal algorithms.
Configuring memory resources#
The most important parameter for a data node is the amount of memory that can be used for data storage. Rows are stored in memory configured through DataMemory.
This parameter is calculated automatically by default. The automated memory configuration uses 90% of the memory not used by other parts of the data node. The automated memory configuration will fail unless there is at least 1 GByte of memory available for DataMemory and DiskPageBufferMemory.
This memory is also used by the primary key hash index and the ordered indexes. Each row has one entry in the primary key hash index; this includes unique index rows. Each row uses around 15 bytes of space in the hash index, and similarly each row in a unique index consumes about 15 bytes.
DataMemory is not used by copy rows. Each row change (except insert row) creates a copy row that contains the data before the row change happened. This copy row uses TransactionMemory.
DataMemory is used by ordered indexes, each row consumes about 10 bytes of memory per row per ordered index. If a table has 1 million rows and three ordered indexes, the indexes will consume around 30 MBytes of DataMemory space.
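Using the per-row figures above (about 15 bytes per row in the hash index and about 10 bytes per row per ordered index), a rough estimator can be sketched as:

```python
def index_memory_bytes(rows, ordered_indexes,
                       hash_bytes_per_row=15,
                       ordered_bytes_per_row=10):
    """Rough DataMemory consumed by the indexes of one table."""
    hash_index = rows * hash_bytes_per_row
    ordered = rows * ordered_indexes * ordered_bytes_per_row
    return hash_index + ordered

# 1 million rows with three ordered indexes: about 30 MBytes for the
# ordered indexes plus about 15 MBytes for the hash index
print(index_memory_bytes(1_000_000, 3))  # 45000000
```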
We always leave a bit of free space in DataMemory, since the database can temporarily grow during a restart. To avoid running out of memory during recovery, we won't allocate any more rows when we reach the limit on minimum free pages (except during restart when all memory is available). The default is to save 5% of DataMemory. This can be changed through the configuration parameter MinFreePct.
DiskPageBufferMemory is used by disk columns. This is the page cache that contains disk pages when they are in memory. 10% of the free memory is allocated to the DiskPageBufferMemory by default. Both DataMemory and DiskPageBufferMemory can be set even with automated memory configuration.
Extent pages (there are 4 bits per extent in the extent pages) are permanently stored in memory. The memory allocated for the page cache should be substantially bigger than the memory required for storing the extent pages. The size of the extent pages depends on the size of the tablespaces created.
DiskPageBufferEntries is explained in the chapter on disk columns.
SharedGlobalMemory#
SharedGlobalMemory is a common resource that can be used for various purposes. If we run out of send buffers we can extend our send buffers by up to 25% by using this memory.
If we haven't defined InitialLogfileGroup in our cluster configuration file we will take the memory for the UNDO log buffer from this memory.
Memory used by disk data file requests are taken from this memory pool as well.
All the operation resources that are specific to pushdown joins use the memory in SharedGlobalMemory directly. The operation records used for metadata operations allocate 2 MBytes of memory, and if they need to extend beyond this they start to use memory in SharedGlobalMemory.
Configuring Transaction resources#
This is another area that often needs to be changed for production usage or benchmarking. The default settings will work fine in many situations and even for production in some cases. This section describes the parameters and when it is required to change them. It describes the memory impact of increasing versus decreasing them.
Any record size provided in this chapter is from a specific MySQL Cluster 7.5 version and can change a few bytes up and down without any specific notice.
When we execute any read or write of data we always use a transaction record in a tc thread. A transaction record is also used for every scan we execute in the cluster.
These records use TransactionMemory. The more parallel transactions we can execute and the longer the transactions we execute, the more TransactionMemory we need.
The size of the transaction record is 896 bytes, while the real record size is 288 bytes. But we need one extra record to handle the complete phase for each transaction and additionally one record to ensure that we can handle node failure handling. In addition we have 2 records of 16 bytes in DBDIH for each transaction, giving 3 x 288 + 2 x 16 = 896 bytes in total.
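Assuming the two extra records have the same 288-byte size as the real record (the 896-byte total only adds up under this assumption), the arithmetic is:

```python
real_record = 288        # transaction record in the tc thread
complete_phase = 288     # extra record for the complete phase
node_failure = 288       # extra record for node failure handling
dih_records = 2 * 16     # two 16-byte records in DBDIH

total = real_record + complete_phase + node_failure + dih_records
print(total)  # 896
```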
Primary key operations#
These records are using TransactionMemory.
Every primary key operation uses one operation record in the tc thread and one in the ldm thread. If it is a write operation there is one operation record for each replica. In addition each scan operation also uses one operation record in the tc thread and one operation record in each of the scanned fragments in the ldm threads.
The size of operation records in the tc threads is 304 bytes. The actual record size is around 152 bytes, but we need two records per operation; the extra one is required to handle node failure handling when we take over failed transactions.
The size of the local operation records in the ldm threads is 496 bytes. The record is actually split in one part in DBLQH, one in DBACC and one in DBTUP.
For a read the total operation record sizes used are 800 bytes and for an update with 2 replicas the operation record sizes used are around 1300 bytes.
Each GByte of memory assigned to operation records thus provides for a parallelism of about 1.25 million read operations.
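The per-operation figures can be sketched as follows (1 GByte is taken as 10^9 bytes here, an assumption that matches the 1.25 million figure in the text):

```python
tc_op_bytes = 304    # operation record in the tc thread
ldm_op_bytes = 496   # local operation record in an ldm thread

read_total = tc_op_bytes + ldm_op_bytes        # read, one replica touched
update_total = tc_op_bytes + 2 * ldm_op_bytes  # update with 2 replicas

# Parallel read operations supported per GByte of operation records
reads_per_gbyte = 10**9 // read_total
print(read_total, update_total, reads_per_gbyte)  # 800 1296 1250000
```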
Scan operations#
These records are using TransactionMemory.
Scans use a transaction record and a scan record in the tc thread. The scan record size is 120 bytes. The maximum number of parallel scans is set per tc thread.
It is set by default to 256 and has a maximum of 500. The reason that the number of scans is limited is the high parallelism we employ in scan operations. It is necessary to limit the number of parallel scans in the cluster to avoid various overload cases. The default is changed through the parameter MaxNoOfConcurrentScans. Thus this parameter does not specify the memory allocated for scans; it limits the parallelism of scans.
Each scan also allocates a set of scan fragment records in the tc thread (currently 64 bytes in size). The number of scan fragment records is one if the scan is a pruned scan, otherwise it is equal to the maximum scan parallelism. The maximum parallelism is the number of partitions in the table. It is possible to limit the scan parallelism by using the SO_PARALLEL option in the NDB API. There is no specific setting in the MySQL Server to change this. Scans from the MySQL Server always use a parallelism equal to the number of partitions in the table. For a default table the number of partitions is the number of nodes multiplied by 2.
All the scan resources are released as soon as the scan has completed. Only operation records from key operations are kept until commit time. A delete that uses a scan has to execute a primary key operation that uses a special scan takeover request to ensure that the lock and the update are kept until commit time.
Given that scan resources are released after execution, the amount of scan records is more based on the amount of parallelism rather than the amount of data in the cluster. The default setting of 256 parallel scans per tc thread should work fine in most cases.
The number of scan frag records is the number of scan records multiplied by the number of tc threads, the number of ldm threads and the number of data nodes in the cluster. For example with 2 tc threads, 4 ldm threads, 4 data nodes and a maximum of 256 scans in parallel we will have 8192 scan fragment records in total in the tc threads.
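The example numbers can be sketched as follows, using the formula stated above:

```python
max_scans = 256   # MaxNoOfConcurrentScans per tc thread
tc_threads = 2
ldm_threads = 4
data_nodes = 4

scan_frag_records = max_scans * tc_threads * ldm_threads * data_nodes
print(scan_frag_records)  # 8192
```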
There are as many scan records in the ldm threads as there are scan fragment records in the tc threads.
For each scan we have a scan record that is in total 588 bytes, this record is divided into a 96 byte record in DBACC, a 112 byte record in DBTUP, a 136 byte record in DBTUX, a 232 byte record in DBLQH. Next we also allocate an operation record for each scan that is 496 bytes in size. In total we allocate records of size 1084 bytes per local scan in an ldm thread.
Unique index operations#
These records are using TransactionMemory.
Access to a record through a unique index uses two steps, the first step is a read of the unique index, the second step is a normal operation using the primary key read from the unique index.
The first phase uses a special record in the tc thread. This record is only kept until the first read of the unique index has completed. Immediately when the read has completed, the record is released. The record has a size of 160 bytes.
Trigger operations#
These records are using TransactionMemory.
Similar to unique index operations, we keep a record for each triggered operation from the time we receive it until we have executed it. This record is 80 bytes in size.
When a trigger is fired, its values are sent to the tc thread executing the transaction. Until the resulting action is executed we keep the before value and after value sent in the trigger in a buffer.
Limiting maximum transaction size#
The maximum size of a transaction is limited. In RonDB 21.04 this is limited by the configuration parameter MaxNoOfConcurrentOperations. This limits both the number of operations involved in one transaction and the number of operations handled by one tc thread.
We understand that the configuration of transaction resources can be a bit daunting. This is why RonDB since version 21.04.0 has automated all of this such that it is only required to set TransactionMemory, and even the setting of this parameter is automated by default.
The MaxNoOfConcurrentScans parameter is still used since this configuration parameter is also ensuring that we don't overload the cluster.
Configuring Schema resources#
This section needs to be considered when using MySQL Cluster with a large number of tables, attributes, foreign keys, indexes and so forth. It describes the parameters, when to consider changing them, and the memory impact of those parameters.
When using automated memory configuration it is possible to define 20300 tables in RonDB; this includes tables, unique indexes and ordered indexes. One can define 500,000 columns and 200,000 triggers. If it is required to change any of those, they can be changed using MaxNoOfTables, MaxNoOfAttributes and MaxNoOfTriggers.
There is work ongoing in the branch schema_mem_21102 to move this memory to the SchemaMemory which will be defined in a similar fashion to the TransactionMemory in MBytes.
Number of columns#
MaxNoOfAttributes defines the maximum number of columns that can be defined in the cluster.
There are two important records that are influenced by the number of columns. The first is in the block DBDICT where we have a record of 160 bytes that describes the column and its features. The second memory area is in DBTUP where we have a memory area with a great deal of pointers to methods used to read and write the columns. This area is 120 bytes per column in each ldm thread. Thus if we have a node with 4 ldm threads we will use 640 bytes per column (160 + 4 x 120).
The maximum number of columns in the cluster will also affect the size of memory allocated for StringMemory as explained below.
Number of tables#
There are three types of objects that all use table objects. Normal tables are the obvious one. Unique indexes are stored as tables, so these have to be counted as well. In most situations the ordered indexes are also treated as table objects, although they are tightly connected to normal tables and their partitions.
When calculating memory used per table, the three values MaxNoOfTables, MaxNoOfOrderedIndexes and MaxNoOfUniqueHashIndexes are added. The maximum number of table objects is limited by DBDICT where at most 20320 table objects can be defined.
In most places the sum of those values is used. When calculating the number of triggers in the system they are used individually. The number of ordered indexes has a special impact on the number of objects in DBTUX, which is natural since DBTUX is only used by ordered indexes.
There are many different records that describe table objects. In DBDICT we have a table record of 280 bytes, a metadata object of 88 bytes and a global key descriptor of 520 bytes. In addition we have two hash tables on metadata object id and metadata object name; the hash tables have at most an overhead of 8 bytes per table object. The table record in DBDIH is a bit bigger and consumes 3296 bytes. In addition we have around 24 bytes of memory per table object in each tc thread. In the ldm threads we have 76 bytes per table object in DBACC, 72 bytes in DBLQH, 144 bytes in DBTUX and 544 bytes in DBTUP, a total of 836 bytes per ldm thread.
In total for table records, in the example with 2 tc threads and 4 ldm threads, we use about 7700 bytes per table object.
Table objects are not the only memory consumers for metadata objects. Memory consumption also comes from the fragment records in various blocks.
The default number of fragments per ldm thread is equal to the number of replicas in the cluster set by NoOfReplicas, thus normally 2. Each fragment record is 496 bytes in DBLQH, 304 bytes in DBACC, 472 bytes in DBTUP and 96 bytes in DBTUX. DBTUX only stores fragments for ordered indexes, so this record can almost be ignored in the calculations. Assuming that we have 2 replicas in the cluster, we have around 2600 bytes of fragment records per table object per ldm thread. With 4 ldm threads we consume about 10400 bytes per table object in fragment records.
We have both fragment records and replica records in DBDIH. DBDIH stores one fragment record for each fragment in the cluster; the number of fragments per table by default is the number of nodes multiplied by the number of ldm threads. The number of replica records is this number multiplied by the number of replicas in the cluster. The fragment records are 48 bytes and the replica records are 148 bytes. With 2 nodes, 4 ldm threads and 2 replicas we use 2752 bytes per table object in DBDIH.
In the example case with 2 data nodes, 4 ldm threads, 2 tc threads and 2 replicas we consume a total of 20852 bytes per table object.
Thus even with the maximum amount of tables in this setup (20320 table objects) we consume no more than around 400 MBytes of metadata for table, fragment and replica records.
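Summing the three contributions gives the per-table figure and the total; a sketch of the arithmetic:

```python
table_records = 7700      # table records, 2 tc and 4 ldm threads
fragment_records = 10400  # fragment records, 4 ldm threads, 2 replicas
dih_records = 2752        # fragment and replica records in DBDIH

per_table_object = table_records + fragment_records + dih_records
print(per_table_object)  # 20852

# With the maximum number of table objects
max_table_objects = 20320
total_mbytes = per_table_object * max_table_objects / (1024 * 1024)
print(round(total_mbytes))  # 404
```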
Scaling to a higher number of ldm increases most sizes linearly since more threads need table and fragment records while the number of fragments per ldm thread stays constant.
Increasing the number of nodes influences the storage in DBDIH, where more memory is needed since the number of fragments in a table increases linearly with the number of nodes. For example with 48 data nodes and 4 ldm threads we get about 66 kBytes per table object in DBDIH.
Number of triggers#
Each unique index requires 3 triggers (one each for insert, update and delete operations). 1 trigger is required for each ordered index. In addition we reserve one set of 3 triggers for a metadata operation such as add partitions. Each normal table that is fully replicated uses 3 triggers as well.
Thus the main reason to increase MaxNoOfTriggers is to ensure that we can add foreign keys, triggers for altering the number of table partitions, and more user-defined events to the system.
The amount of memory used for triggers is fairly small. The DBDICT block consumes 116 bytes per trigger object, the DBTC block consumes 36 bytes per tc thread and the DBTUP block consumes 112 bytes per ldm thread. In our example with 2 tc threads and 4 ldm threads each trigger uses 636 bytes of memory.
We have extra trigger records in DBTUP. These extra records are there to accommodate backups; backups do not use global triggers, they only use local triggers. We add 3 extra trigger records for the number of tables defined by MaxNoOfTables.
Given that the memory consumption of triggers is low, it is a good idea to set this configuration parameter such that it is prepared for a set of foreign keys, add-partition operations and events on all tables.
StringMemory#
StringMemory defines the amount of memory to allocate for column names, default values for columns, names of tables, indexes and triggers, and it is used to store the FRM files used by MySQL Servers to store metadata.
In RonDB the FRM file is stored inside RonDB, thus when a MySQL Server connects to the cluster, it will receive the tables and their metadata from RonDB and store it in the local file system of the MySQL Server.
The maximum size of a table name is 128 bytes, the maximum size of a column name is 192, the maximum size of the default values is equal to 30004 bytes (4 plus the maximum row size in RonDB).
DBDICT allocates memory to ensure that even very large table names, column names and so forth will fit. The size is calculated from the maximum sizes and the maximum number of columns, table objects and triggers.
This calculated size overestimates what is actually needed, so StringMemory is defined as a percentage. The default setting is 6%. If the memory is calculated to be 20 MBytes and we use the default setting of 6, then the allocated size will be 6% * 20 MBytes = 1.2 MBytes.
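The percentage calculation is straightforward; as a sketch:

```python
calculated_mbytes = 20  # size calculated from the maximum sizes
string_memory_pct = 6   # StringMemory default, interpreted as a percentage

allocated_mbytes = calculated_mbytes * string_memory_pct / 100
print(allocated_mbytes)  # 1.2
```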
Even 6% is a bit large; for example, most columns have a much smaller default value compared to the maximum size of 30000 bytes. Thus going down to smaller values is mostly ok.
The actual default value used in normal query processing is stored in DBTUP; it uses DataMemory to store one such row per table per ldm thread. The size of this default row is limited by the maximum row size, which is currently 30000 bytes.
Configuring Event resources#
Configuring events ensures that a sufficient number of records exist in the block SUMA where events are handled.
A subscription is 128 bytes in size, thus it doesn't take up a lot of resources. The most common use case for subscriptions is RonDB replication.
Normally in this case there is one subscription per table that is replicated. The default value of this parameter is 0; in this case the value used is the sum of MaxNoOfTables, MaxNoOfUniqueHashIndexes and MaxNoOfOrderedIndexes. Given that a subscription can only be set on normal tables, we get a few more subscription records than what is required for normal Global Replication.
This is good since subscriptions are also used for online metadata operations such as reorganising partitions, building indexes and building foreign key relations.
Normally there is only one subscription used even if there are multiple MySQL servers doing replication. They will subscribe to the same subscription. The handler has a specific name on the subscription to ensure that it knows how to find the subscription.
If the replication setup is different for the MySQL replication servers they will require different subscriptions, in this case the name of the subscription will differ. One case where this will happen is if the MySQL replication servers use a different setting of the --ndb-log-updated-only configuration option.
NDB API programs that define events on tables are also likely to use their own subscriptions. It is possible to synchronize several NDB API nodes to use the same subscription by using the same name for the event.
The default setup will work just fine in most cases, but if the user has different MySQL replication server setups and/or is using the NDB Event API, it would be required to increase the setting of MaxNoOfSubscriptions.
The default settings are MaxNoOfSubscriptions equal to the maximum number of tables and MaxNoOfSubscribers twice that.
A subscriber uses a subscription, so there need to be more subscriber records than subscriptions. Each NDB API node or MySQL Server node that listens to events from a table uses a subscriber record. The default number of subscriber records is twice the default number of subscriptions. Thus it is designed for two MySQL replication servers in the cluster that listen to all table events without any issues, and can even handle a bit more than that.
The subscriber record is only 16 bytes, thus there is very little negative consequence in increasing this value to a higher setting.
Suboperation records are used during the creation phase of subscribers and subscriptions. The default is 256 and should normally be ok. The record is only 32 bytes in size, so there is no problem in increasing this setting even substantially.
Event operations were explained in some detail in the chapters on Global Replication and in the chapter on the C++ NDB API.
Events are gathered one epoch at a time. One epoch is a group of transactions that is serialised towards other epochs. Thus all transactions in epoch n committed before any transaction in epoch n+1 was committed.
The time between epochs defaults to 100 milliseconds and this value is configurable through the parameter TimeBetweenEpochs.
The SUMA block that maintains events grouped in epochs must keep track of the epochs. For this it uses a special epoch record that is only 24 bytes in size. By default we have 100 such records, so the memory used for these records is completely negligible. It is still important to set this parameter properly. The default setting of 100, together with the default setting of 100 milliseconds for TimeBetweenEpochs, means that we can track epochs for up to 10 seconds. If any part of the event handling cannot keep up with the epochs rapidly enough, subscribers will start losing events, which leads to the subscriber being closed down. TimeBetweenEpochs is set in milliseconds.
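The 10 second tracking window follows from the two defaults; as a sketch:

```python
buffered_epoch_records = 100  # default number of epoch records
time_between_epochs_ms = 100  # TimeBetweenEpochs default

tracking_window_s = buffered_epoch_records * time_between_epochs_ms / 1000
print(tracking_window_s)  # 10.0
```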
Setting this a bit higher is not a problem. What can be problematic is whether we have enough memory buffers to handle more epochs. This depends on the update rate in NDB and the amount of memory buffers available for saving epochs.
The buffer space for epochs is controlled by the configuration parameter MaxBufferedEpochBytes. This value defaults to 500 MByte. In a high-performance environment it is quite likely that this buffer should be significantly increased.
The parameter TimeBetweenEpochsTimeout can be set to a non-zero value; in this case we will crash the node if it takes longer than this value to complete the epoch.
The parameter RestartSubscriberConnectTimeout is an important timeout during node restarts. If an API node or a MySQL Server has subscribed to some table events, the starting node will wait for this API node or MySQL Server to connect during the last phase (phase 101) of the node restart. Normally these nodes will connect very quickly and the restart will proceed. If a node subscribing to events is down, this is the time that we will wait for that node to connect again.
By default this timeout is set to 2 minutes (a setting of 120000, in milliseconds).
Basic thread configurations#
The preferred manner of setting up the thread configuration is to use the ThreadConfig variable, which is explained in detail in a coming chapter. It is only intended for very advanced users. Most of the advantages of using ThreadConfig can be achieved with automated thread configuration, which is the default.
Configuring restarts#
There is a wide range of configuration items that affect restarts and node failure handling.
NDB is designed for automatic restart. By default a data node executes as two processes; the first process started is called the angel process and is responsible for restarting the node automatically after a crash. There are three configuration parameters controlling this behaviour.
Setting StopOnError to 1 means that the data node process will stop at failures. The default is 0, thus the default behaviour is to restart automatically.
If a node restart fails we count the failure. MaxStartFailRetries is the number of attempts we make before we stop the data node completely. As soon as a restart is successful the failure counter is reset to 0. By default it is set to 3.
It is possible to invoke a delay before the node restart begins after a crash. This delay is set in StartFailRetryDelay, the unit is seconds. By default there is no delay.
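The three angel process settings above can be sketched as follows, using the default values from the text:

```ini
[ndbd default]
# 0 = angel process restarts the data node automatically after a crash
StopOnError=0
# Give up after three consecutive failed restart attempts
MaxStartFailRetries=3
# Seconds to wait before a restart attempt begins (0 = no delay)
StartFailRetryDelay=0
```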
Whenever a node starts up there is a set of timers that affect how long we will wait for more nodes to join before we proceed with the restart.
The StartPartialTimeout affects how long we will wait for more nodes to join when the cluster is starting up. If we discover during the heartbeat registration that the cluster is up and running we will start immediately. During the wait we will print in the node log which nodes we are waiting for, which have connected, and for how long we have waited.
In the case of an initial start of the cluster we will always wait for all nodes to come up, except when the user has provided a set of nodes to not wait for on the command line using the --no-wait-nodes option.
This parameter defaults to 30000, meaning 30 seconds since the unit is milliseconds. Setting it to 0 means waiting forever until all nodes have started up.
This parameter provides a possibility to start a cluster even with a suspected network partitioning problem. Setting this parameter to 0 means that we will never start in a partitioned state. By default this is set to 60000, thus 60 seconds.
This parameter should always be set to 0, and this is the default value. It is strongly recommended to not start up if it is suspected that the cluster is in a network partitioned state.
This parameter is by default set to 0, which means that restarts will proceed until they are done.
If we know by experience that a restart should be completed in a certain time period, then we can set this parameter to a high value to ensure that we get a crash instead of an eternally hanging state of the node.
One should be very careful in setting this parameter (unit is milliseconds) given that restarts of large databases can take a substantial amount of time at times.
This parameter sets the time we will wait for nodes with no node group assigned while starting. A node has no node group if it is assigned NodeGroup equal to 65536. Nodes without a node group contain no data, so they are not relevant for cluster operation; they are only relevant when extending the cluster size.
This parameter is set to 15000 (15 seconds) by default.
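The start timeouts above can be sketched as follows. Only StartPartialTimeout is named in the text; the paragraphs on partitioned start, restart timeout and nodes without a node group appear to describe StartPartitionedTimeout, StartFailureTimeout and StartNoNodeGroupTimeout, names assumed from the standard NDB parameter set:

```ini
[ndbd default]
# Wait up to 30 seconds for more nodes before a partial cluster start
StartPartialTimeout=30000
# 0 = never start in a suspected network-partitioned state (recommended)
StartPartitionedTimeout=0
# 0 = let restarts proceed until they are done, never time out
StartFailureTimeout=0
# Wait 15 seconds for nodes without a node group
StartNoNodeGroupTimeout=15000
```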
Heartbeats are mainly useful to handle the case when a computer dies completely. When the data node process fails, the OS will ensure that its TCP/IP sockets are closed and thus the neighbouring nodes will discover the failure. Heartbeats are also good for handling cases where a node is not making any progress.
For example on Mac OS X I have noticed that accessing a zero pointer leads to a hanging process. The heartbeat will discover this type of failure.
This parameter specifies the heartbeat interval between data nodes. It was originally set to 1500 milliseconds; thus the node will be considered failed in at most 6 seconds, after 3 missed heartbeats. Every missed heartbeat is logged into the node log; if a node is often close to the limit, the node log and cluster log will show this.
To ensure that RonDB works better for early users we changed the default to 5000 milliseconds, thus it can take up to 20 seconds before a node failure is discovered.
In a high availability setup with computers that are dedicated to RonDB and its applications, it is possible to decrease this configuration parameter. How low it can be set depends on how much the OS can be trusted to provide continuous operation. Some OSs can leave a thread unscheduled for a second or two even in cases when there is no real resource shortage.
In a production environment it is recommended to use the old default 1500 milliseconds or even lower than this.
It is important to understand that when changing this value, one should never increase it by more than 100% and never decrease it to less than half. Otherwise the sending of heartbeats could cause node failures involving neighbouring nodes that haven't updated their heartbeat interval yet. The heartbeat interval in different data nodes should only differ during a rolling restart to change the configuration.
This parameter sets the heartbeat interval between an API node and a data node. The heartbeats are sent from the API node once every 100 milliseconds and it expects a return within three heartbeat periods. This parameter has a default of 1500 milliseconds.
The consequence of an API node losing its connection to a data node is smaller than for a data node losing its connection. If the API node is declared down, it will normally reconnect again immediately.
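The two heartbeat intervals discussed above can be sketched as follows, assuming the standard parameter names HeartbeatIntervalDbDb and HeartbeatIntervalDbApi (the names are not given in the text):

```ini
[ndbd default]
# RonDB default is 5000 ms; 1500 ms or lower is recommended in production
HeartbeatIntervalDbDb=1500
# Heartbeat between API nodes and data nodes
HeartbeatIntervalDbApi=1500
```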
The heartbeat protocol assigns a dynamic id to each node when it starts. This id provides the order of starting nodes. Thus the node with the lowest id is the oldest node. The node with the lowest dynamic id is chosen as the master after a failure of the master node.
Heartbeats are sent to its right neighbour and received from its left neighbour. The definition of who is the right and left neighbour is defined by the order the nodes are starting up, thus through the dynamic id.
It is possible to specifically set the order of the nodes in the heartbeat protocol instead of using the default.
The MySQL manual contains some description of when it might be useful, but I am not convinced that it is a very useful feature.
This is a boolean variable that adds two more heartbeat intervals before a node is declared dead. After three failed heartbeat intervals the node will be in a suspected state and in this state the node will be checked from more than one node during the last two heartbeat intervals.
This parameter makes it possible to improve the heartbeat handling in environments where the communication latency can vary.
This is a rather new feature that is an improvement compared to only using heartbeats between neighbours. If used, one should adjust the heartbeat settings, since with this extra check we will not fail a node until five heartbeats have been missed.
Configuring arbitration in data nodes#
There are several ways to set up arbitration in the cluster. The default behaviour is that the arbitrator is asked for its opinion when we cannot be certain that no cluster split has occurred.
To ensure that it is possible to use clusterware to enable running the cluster even with only two computers, we have implemented a special mode where external clusterware makes the decision on arbitration.
This parameter sets the timeout waiting for a response from the arbitrator. If no response comes within this time period the node will crash. The default for this period is 7500 milliseconds.
By default this string variable is set to Default. It can also be set to Disabled which means that no arbitration is performed and thus the cluster crashes if there is a risk of network partitioning.
There is a mode WaitExternal that is explained in the chapter on handling node crashes in NDB. This ensures that a message is written into the cluster log ensuring that some external clusterware can decide which part of the cluster to kill.
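A sketch of the arbitration settings in the data node section, assuming the standard parameter names Arbitration and ArbitrationTimeout (the timeout's name is not given in the text):

```ini
[ndbd default]
# Default | Disabled | WaitExternal
Arbitration=Default
# Crash the node if the arbitrator hasn't answered within 7.5 seconds
ArbitrationTimeout=7500
```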
Configuring Deadlock Detection#
Deadlock can occur in a database using row locks to implement concurrency control. Our method of resolving deadlocks is based on timeouts. When an operation doesn't return from a read or a write, we assume that it is waiting in a lock queue, and since it doesn't come back we assume that we are involved in a deadlock.
RonDB is designed for applications using short transactions that complete in a number of milliseconds, so when an operation doesn't return within seconds we assume that something is wrong. An operation would normally return in less than a millisecond, and even in a highly loaded cluster it would rarely take more than a few milliseconds for an operation to return.
This deadlock detection timeout is also used to ensure that transactions that rely on a node that has died will be aborted. The timeout here should not be materially larger than the time that it takes to detect a node failure using heartbeats. Normally it should be much smaller than this.
Short deadlock detection timeout means that we recover quickly from deadlocks. At the same time if it is too short we might get too many false indications of deadlocks in a highly concurrent application.
This parameter sets the deadlock detection timeout. The default deadlock detection timeout is 1200 milliseconds.
Deadlock detection timers are not constantly checked. A check of all running transactions is performed with some delay in between. This delay can be increased with this parameter, it defaults to once per second and should normally not need to be changed.
This sets the time that a transaction is allowed to be inactive from the API. In this case we're not waiting for any nodes, rather we are waiting for the API to decide to proceed. By default this wait is almost forever.
Given that the transaction can hold locks while waiting for the API, it is a good idea to set this to a much lower value of, say, 1500 milliseconds. This ensures that misbehaving APIs will not mess up the database.
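A sketch of the transaction timeout settings discussed above, assuming the standard parameter names TransactionDeadlockDetectionTimeout, TimeBetweenInactiveTransactionAbortCheck and TransactionInactiveTimeout (the names are not all given in the text):

```ini
[ndbd default]
# Abort an operation presumed deadlocked after 1.2 seconds
TransactionDeadlockDetectionTimeout=1200
# Scan running transactions for timeouts once per second
TimeBetweenInactiveTransactionAbortCheck=1000
# Abort transactions whose API has been inactive for 1.5 seconds
TransactionInactiveTimeout=1500
```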
Logging in NDB for data nodes happens in a number of places. We have the cluster log: log events are generated in the data nodes and sent to the management server, which places them into the cluster log. It is possible to also direct these messages to the node log.
We have the node log, where various log messages are written locally; normal printouts end up here as well.
The cluster log levels are controlled by a number of configuration parameters. Each such parameter controls a specific part of the logging. Each message is active at a certain log level and not on lower log levels.
The log levels that are definable are the following:
LogLevelStartup, LogLevelShutdown, LogLevelStatistic, LogLevelCheckpoint, LogLevelNodeRestart, LogLevelConnection, LogLevelCongestion, LogLevelError, LogLevelInfo
These can be changed temporarily from the NDB management client.
Whenever a data node crashes we generate a set of trace files to describe the crash. The MaxNoOfSavedMessages controls the number of crashes that will be saved. By default this is set to 25. When the 26th crash occurs it will overwrite the information in the error log and the trace files generated by the first crash.
This is the buffer used to store log events in the data node until they are sent to the cluster log in a management server. It is by default 8 kBytes.
The Diskless parameter is useful for testing when the servers don't have sufficient disk bandwidth to handle the load. Obviously it cannot be combined with disk columns; it can only be used when RonDB is entirely memory bound.
It can be useful when the application has no need of any recovery. For example a stock market application has no use for database recovery, since the stock market data has completely changed by the time the cluster recovers from a failure. In this case one could set Diskless to 1 to ensure that all writes to the file system are thrown away. It is similar to mounting the file system on /dev/null.
A better method of doing this is to create the tables with the NOLOGGING feature. This ensures that the tables are still around after a restart although the tables are empty.
RonDB data nodes are designed around a virtual machine that handles execution of signals and ensures that we can communicate those signals within a thread, between threads and to other nodes.
The virtual machine is designed such that when a signal executes, it is up to the signal execution code to return control to the virtual machine layer. Thus, as an example, a runaway signal that enters an eternal loop cannot be detected by the virtual machine.
Signals are supposed to execute only for a few microseconds at a time. This is by design, but of course a bug can cause a signal to execute for much longer.
Now if a thread executes for too long it is likely to be detected by other nodes in various ways through timers. But a deadlock in one thread could go undetected if the other threads keep the other nodes happy.
To detect such run away threads we use a watchdog mechanism. Each thread registers with a watchdog thread. This watchdog thread wakes up every now and then and checks if the threads have moved since last time. If no movement has happened it will continue waiting. When the watchdog timer expires the node will crash.
This is the time between watchdog checks. When four such intervals have expired and no progress has been reported we will declare the node as dead. The default setting is 6000 milliseconds and thus we will detect watchdog failures after at most 24 seconds.
During memory allocation it is more common to get stuck for a long time. If 24 seconds isn't enough time to allocate memory, we can increase this time. When working with very large memories this can easily happen; in a large data node it would be fairly normal to have to increase this variable.
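A sketch of the watchdog settings, assuming the standard parameter names TimeBetweenWatchDogCheck and, for the memory allocation phase, TimeBetweenWatchDogCheckInitial (the second name and its value here are assumptions, not stated in the text):

```ini
[ndbd default]
# Watchdog check interval; node is declared dead after four missed checks
TimeBetweenWatchDogCheck=6000
# Example value for very large memories where allocation takes long
TimeBetweenWatchDogCheckInitial=60000
```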
Configuring index statistics#
Index statistics are maintained by the NDB data nodes, but are used by the MySQL server. The configuration of index statistics requires both a part in the data nodes and in the MySQL Server.
By default the data node will not generate any index statistics at all. It is still possible to generate index statistics by using the SQL command ANALYZE TABLE.
The actual index statistics are stored in a system RonDB table. The index statistics are based on only one fragment of the table. The reason is two-fold: first, the only ordering is within one fragment; there is no ordering between two fragments. Second, it is simply faster to work with one fragment instead of all. Given that the hash function randomises the placement of rows, there should be very small differences between fragments and thus it should be sufficient to analyse one fragment.
When using RonDB as an SQL engine, it is recommended to activate these options. Running complex queries without index statistics can cause a query to execute thousands of times slower than with index statistics activated.
If this variable is set to on it will generate index statistics at create time of the index. Defaults to off.
If this variable is set the data nodes will regularly update the index statistics automatically. Defaults to off.
This parameter defines the minimum time between two updates of the index statistics. It is measured in seconds and defaults to 60 seconds.
IndexStatSaveSize and IndexStatSaveScale#
These two parameters define how much memory will be used in the index statistics table to store the samples from the index. The calculation is performed using a logarithmic scale.
By default IndexStatSaveSize is set to 32768 and IndexStatSaveScale is set to 100.
We first calculate the sample size as the average key size plus four times the number of key columns plus one. Assume that we have three columns that average 16 bytes in size. In this case the sample size is 32 bytes (16 + (3 + 1) * 4).
We multiply the sample size by the number of entries in the ordered index. Assume we have 1 million entries in the index. In this case the total unfiltered index statistics would consume 32 MByte.
Instead we take the two-logarithm of 32 MByte, which equals 25. Now we multiply 25 by IndexStatSaveScale times 0.01 (thus treating IndexStatSaveScale as a percentage).
Next we add one to this number and multiply by IndexStatSaveSize. In this case this means that we get 26 * 32768 = 851968 bytes. Next we divide this by the sample size, which gives 26624. This is the number of samples that we will try to get.
These numbers now apply to all indexes in the database. Using the logarithmic scale we ensure that very large indexes do not use too large memory sizes. The memory size must be small enough for us to store it and not only that, it has to be small enough to process the information as part of one query execution.
As can be seen here, even with 1 billion entries in the table the two-logarithm of 32 GByte only goes to 35. The index statistics size thus grows by only about 38% for a 1000-fold increase in the storage volume of the table. We can increase the scale by changing IndexStatSaveScale and we can move the starting point by changing IndexStatSaveSize.
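The calculation above can be expressed in Python. This is a sketch of the described arithmetic, not the actual server code; in particular, the rounding of the logarithm is an assumption:

```python
import math

def index_stat_samples(avg_key_size, num_key_columns, index_entries,
                       save_size=32768, save_scale=100):
    # Sample size: average key size plus (number of key columns + 1) * 4 bytes
    sample_size = avg_key_size + (num_key_columns + 1) * 4
    # Unfiltered statistics size if we sampled every index entry
    unfiltered_bytes = sample_size * index_entries
    # Dampen with the two-logarithm, scaled by IndexStatSaveScale as a percentage
    log2_size = round(math.log2(unfiltered_bytes))
    save_bytes = (1 + save_scale * 0.01 * log2_size) * save_size
    # Number of samples we try to collect
    return int(save_bytes // sample_size)

# 3 key columns averaging 16 bytes, 1 million index entries
print(index_stat_samples(16, 3, 1_000_000))        # 26624, as in the text
# 1 billion entries: the logarithm only grows from 25 to 35
print(index_stat_samples(16, 3, 1_000_000_000))    # 36864
```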
IndexStatTriggerPct and IndexStatTriggerScale#
Similarly we have two configuration parameters that use logarithmic scale to decide when to start the next index statistics update.
We first take the number of entries in the index and compute its two-logarithm. With 1 million entries in the table, this number is 20. It is then multiplied by IndexStatTriggerScale times 0.01; by default the scale is 100, so the scale-down factor becomes 1 + (scale * 0.01 * 2log(number of index entries)). We keep track of the number of index changes since the last time we calculated the statistics, and a new update starts when the percentage of changed index entries exceeds IndexStatTriggerPct divided by the scale-down factor. For example, with the defaults on a one million row table the calculated scale-down factor is 21; thus we start a new index statistics update when 100 / 21 = 4.76% of the index entries have changed.
By setting IndexStatTriggerScale to 0 we change to a completely linear scale. In this case IndexStatTriggerPct is used directly to decide when to activate a new index statistics update.
If IndexStatTriggerPct is set to 0 it means that there will be no index statistics updates issued. In this case the only way to get an index statistics update is to call the SQL command ANALYZE TABLE on the table.
Both IndexStatTriggerScale and IndexStatTriggerPct are by default set to 100.
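The trigger calculation can likewise be sketched in Python (again an illustration of the described arithmetic, with the logarithm rounding assumed):

```python
import math

def index_stat_trigger_pct(index_entries, trigger_pct=100, trigger_scale=100):
    # IndexStatTriggerScale=0 switches to a purely linear threshold
    if trigger_scale == 0:
        return float(trigger_pct)
    # Scale-down factor: 1 + scale% * log2(number of index entries)
    scale_down = 1 + trigger_scale * 0.01 * round(math.log2(index_entries))
    # Percentage of changed index entries that triggers a new statistics update
    return trigger_pct / scale_down

# Defaults on a one million row table: 100 / 21, roughly 4.76%
print(round(index_stat_trigger_pct(1_000_000), 2))   # 4.76
```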
Specialized configuration options#
When a signal is sent with sections it uses a long message buffer to store the sections in. By default this buffer is calculated as part of automated memory configuration and is increased with more threads in the data node.
Buffers are allocated from this pool when messages with sections arrive through the receive thread. These buffers are kept until the destination thread has consumed the signal.
Similarly, when a new signal is created with one or more sections, it will allocate buffer area from this pool. This buffer will be kept either until the signal destination within the same node has consumed the signal, or, if the signal is sent to another node, until the signal is copied over to a send buffer.
Thus the long message buffer stores signals that have been received, but not yet consumed and signal created, but not yet sent or consumed.
Most of those signals will be waiting in a job buffer in one of the threads. There can be many thousands of signals in one of those job buffers and we can have up to a bit more than 50 such threads. We can theoretically have up to 40,000 signals in the queues. Not all of these signals have sections, but there can definitely be tens of thousands of such signals executing in parallel. The size of the sections depends mostly on how much data we read and write from the database.
The long message buffer size has a limitation and the default size should be enough to handle all but the most demanding situations. In those extremely demanding situations it should be sufficient to increase to a few hundred megabytes.
In smaller setups this buffer can even be decreased to half without any real risk of running out of buffer space.
If we run out of this buffer the node will crash.
This configuration option was invented to handle a very large configuration with 30 data nodes and hundreds of API nodes running a heavy benchmark. It is now deprecated and is automated in RonDB.
The logic behind this variable can be understood by considering traffic in a large city. If one highway in the city is very efficient and lots of cars can be delivered and those cars all come into a bottleneck, then a major queue happens and the queue will eventually reach the highway as well.
Now in a traffic situation this can be avoided by slowing down traffic on the highway. For example in some cities we have traffic lights when entering a highway, this limits the amount of cars that can pass. This limitation avoids the queues in the bottleneck since cars are flowing into this bottleneck at a sustainable rate.
Limiting the traffic in an area without any bottlenecks can actually increase the traffic throughput in a city.
Similarly in a large cluster it is important for nodes to not send too often. We want nodes to not send immediately when they have something to send. Rather we want the nodes to send when they have buffered data for a while.
The analogy is here that nodes that have no bottlenecks, slow down and wait with sending data. This ensures that bottleneck nodes do not slow down the entire cluster.
The above is the explanation of how this feature solves the problem in the large cluster. We saw 3x higher throughput on a cluster level and consistent throughput with this parameter set to 500 microseconds.
Without the parameter performance in the cluster went up and down in an unpredictable fashion.
What the feature does is that when a node is ready to send to another node, it checks how many microseconds have passed since it last sent to that node. If the time since the last send exceeds the configured value we send; if not, we wait with sending. If we have nothing more to do in the thread and want to go to sleep, we send anyway.
The default memory allocation behaviour is defined by the OS. Most OSs use a memory allocation that prefers memory from the same CPU socket. Now when RonDB is used on a multi-socket server and a data node spans more than one CPU socket, it makes sense to spread memory allocation over all CPU sockets.
A data node often allocates a majority of the memory in a machine. In this case it makes sense to interleave memory allocation on all CPU sockets. By setting Numa to 1 on Linux where libnuma is available, this policy is ensured.
One consequence of setting Numa to 0 is that we might overload the memory bus on the CPU socket where the allocations were made. In a multi-socket server it makes sense to set Numa to 1. However if we have one data node per CPU socket it might work better to set Numa to 0. Numa is set to 1 by default.
When we allocate memory we touch each memory page to ensure that the memory is truly allocated to our process and not that we only allocated swap space. This parameter delays touching parts of the memory until we have received the second start order from the management server.
Delaying this touch of memory is the default.
When using Dolphin SuperSockets to communicate internally in RonDB it is necessary to ensure that sockets use only IPv4 addresses. Dolphin SuperSockets doesn't support sockets that can use both IPv4 and IPv6. This could also be useful if nodes use an OS that can only handle IPv4. The option is available for data nodes and MySQL Servers, not for management server nodes. Thus mainly intended for the Dolphin SuperSockets use case.
We protect the fixed part of tuples with a checksum, if this flag is true we will crash when encountering a checksum that is wrong. If the flag isn't set we will only report an error to the user trying to request this row.
Various reports can be produced with some frequency in the cluster log. By default none of these reports are produced. The unit is seconds.
We can get a regular report on the usage of DataMemory. If not we will still get a report when we reach a number of thresholds like 80%, 90% and so forth.
During an initial restart we spend a long time initialising the REDO log. This report specifies how many REDO logs we have initialised and how much of the current file we have initialised.
This parameter ensures that we get backup status printed in the cluster log at regular intervals.
Here is a new [ndbd default] section based on the recommendations in this chapter.
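A sketch of what such a section might look like, collecting the defaults and recommendations discussed in this chapter and the chapter introduction (values should be adapted to your own deployment):

```ini
[ndbd default]
NoOfReplicas=3
NoOfFragmentLogParts=4
FragmentLogFileSize=1G
NoOfFragmentLogFiles=16
StopOnError=0
HeartbeatIntervalDbDb=1500
TransactionInactiveTimeout=1500
Numa=1
```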
Advanced Configuration of API and MGM nodes#
Most of the configuration of RonDB is in the data nodes and in the MySQL Servers. There are a few things that are configured also through the API section in the cluster configuration. In the cluster configuration there is no difference between API and MYSQLD sections.
For MySQL Servers it is recommended to set the node id and hostname of each node used.
Configuring send buffers#
These apply to both API and management server nodes. They will be covered in the next chapter on communication configurations.
This applies to both API nodes and management server nodes.
ArbitrationRank provides the priority for a certain node to become the arbitrator. 0 means that it will never become the arbitrator. 1 means that it can become the arbitrator. 2 means that it will be selected as the arbitrator before all nodes with arbitration rank 1.
The aim is to ensure that the arbitrator is placed on machines that can have an independent vote on which side of the cluster that should win. For example with 2 replicas and 3 computers we want to place the arbitrator on the third machine containing no data node.
We achieve this by setting ArbitrationRank to 2 on the management server on this third machine (or possibly on an API node that is always up and running).
Delays the arbitration response by some amount of milliseconds. By default set to 0, no reason to change this that I can come up with.
Configuring scan batching#
These apply only to API and MYSQLD nodes.
When performing a scan the data nodes will scan many partitions in parallel unless the scan is a pruned scan that only scans one partition. The NDB API must be prepared for all partitions to send back data.
We control the sizes each data node can send back by setting the batch size that the NDB API can handle.
An NDB API user can override the configuration setting of batch size below by using the SO_BATCH option. The NDB API can also affect the execution of the scan by setting SO_PARALLEL to define the parallelism of a table scan (it cannot be used to influence the parallelism of sorted range scans; these will always scan all partitions).
The maximum scan batch size is measured in bytes and defaults to 256 kBytes. This sets a limit to how much data will be transported from the data nodes in each batch. This is not an absolute limit; each partition scanned can send up to one extra row beyond this limit.
This setting is used to protect the API node from both send buffer overflow as well as using up too much memory. At the same time it is desirable to have the size not too small since this will have a negative impact on performance.
The send buffer sizes should take into account that multiple scans can happen in parallel from different threads, this limit only applies to one scan in one thread.
The BatchByteSize limits the amount of data sent from one partition scan. If the size of the data from one partition scan exceeds this limit, it will return to the NDB API with the rows scanned so far and wait until the NDB API has processed those rows before it proceeds with the next batch.
It defaults to 16 kBytes. The actual limit on the maximum scan batch byte size is the parallelism multiplied by this number. If this product is higher than MaxScanBatchSize we will decrease the batch byte size to ensure that the product is within the bounds set by MaxScanBatchSize.
We will never send more than BatchSize rows from a partition scan, even if the rows are very small. This is an absolute limit; the NDB API must however be prepared to receive this number of rows from each partition scan.
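A sketch of the scan batching parameters in the API section. The byte sizes are the defaults quoted above; the BatchSize row value shown is an illustrative example, as its default is not given in the text:

```ini
[api]
# Upper bound on bytes in flight for one scan across all partitions
MaxScanBatchSize=256K
# Bytes per partition scan batch
BatchByteSize=16K
# Absolute row limit per partition scan batch (example value)
BatchSize=256
```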
These apply only to API nodes.
There are a few configuration parameters controlling the connect phase when an API node is connecting to a data node.
These parameters control the behaviour after a connection is lost and the timing of connect attempts after a number of failed attempts.
The NDB API was designed to handle failures transparently. When the connection to a node goes down for a while, it is immediately reused as soon as the connection is back up again. It is possible to disable this by setting AutoReconnect to 0.
When an API node has already connected to at least one data node in the cluster, this parameter specifies the maximum delay between new attempts to connect to another data node.
This parameter was introduced to avoid hundreds of API nodes constantly attempting to connect to a few data nodes. The default is to make a new attempt after 100 milliseconds. For small clusters this is not an issue, but in a large cluster this can cause serious overhead on the data nodes when too many nodes try to reconnect at the same time.
It defaults to 1500 milliseconds. The time between new attempts will increase with every failed attempt until it reaches this value. If set to 0 it will not back off at all, it will continue sending with a delay of 100 milliseconds.
Regardless of how big this is set to, we will never wait for more than 100 seconds. We use an exponential backoff algorithm that sends first after 100 milliseconds, next after 200 milliseconds, next after 400 milliseconds and so forth, until we have either reached 1024 intervals or the maximum delay between two attempts set above.
When an API node isn't connected to the cluster at all, we can also use a back off timer. This defaults to not being used, it defaults to 0.
After connecting to the first data node, this configuration will be replaced by the ConnectBackoffMaxTime.
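A sketch of the connect phase settings, assuming the standard parameter names ConnectBackoffMaxTime and StartConnectBackoffMaxTime for the two backoff caps (only AutoReconnect is named in the text):

```ini
[api]
# 0 disables transparent reconnect of the NDB API
AutoReconnect=1
# Cap on the backoff between connect attempts once one data node is connected
ConnectBackoffMaxTime=1500
# Backoff cap before any data node is connected (0 = no backoff)
StartConnectBackoffMaxTime=0
```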
By default the cluster manager thread and the arbitrator thread are perfectly normal threads. It is possible to configure them to run at a higher priority on Linux kernels. This only applies to API nodes.
This is done by setting HeartbeatThreadPriority to a string containing either fifo or rr, where fifo means SCHED_FIFO and rr stands for SCHED_RR; 50 is the priority level that will be set. If 50 is not desired, one can instead configure e.g. fifo,60 to override the default priority level 50 with 60.
This parameter will be explained in the coming chapter on configuration of the REDO log. It controls what to do with operations that finds the REDO log buffer full. Only applies to API nodes.
Setting this to 2 or higher means that we get printouts when we perform meta data operations. Only to be used in debugging the NDB API. Only applies to API nodes.
See section on data nodes for more details on this configuration parameter.
Management server nodes#
There are a few configuration options that apply only to management servers.
Management servers have a special DataDir where the node log and cluster logs are placed. In addition we store the binary configuration files in this directory. It is recommended to always set this variable in the cluster configuration.
API nodes have no special node logs, the output is channeled to stdout. The MySQL server will redirect stderr to the error log and will skip output to stdout.
There are three values for LogDestination.
CONSOLE means that the cluster log is redirected to stdout from the management server. I've never used this and see very little need of ever using it.
FILE means that the cluster log is piped to a set of files stored in DataDir. This is the default behaviour and personally I never changed this default. The default name of the cluster log files are e.g. ndb_65_cluster.log where 65 is the node id of the management server.
The file is swapped to a file named ndb_65_cluster.log.1 when it reaches its maximum size of 1000000 bytes. A new empty file is created at once when the first one becomes full. A maximum of 6 files is kept by default.
To set another file name, maximum size or maximum number of files, the syntax used is the following:
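For example (the file name and sizes here are illustrative):

```ini
LogDestination=FILE:filename=my_cluster.log,maxsize=2000000,maxfiles=8
```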
There is also an option to redirect the cluster log to the syslog in the OS.
To do this use the following syntax:
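A minimal sketch of the syslog variant (the facility value is one example among the options documented in the MySQL manual):

```ini
[ndb_mgmd]
# Redirect the cluster log to the OS syslog
LogDestination=SYSLOG:facility=syslog
```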
There are numerous options for where to redirect the syslog, see the MySQL manual for details on possible options.
It is even possible to use multiple options: output can be redirected to CONSOLE, a FILE and the SYSLOG at the same time.
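Combining destinations is done by separating them with semicolons; a sketch (file name chosen here only as an example):

```ini
[ndb_mgmd]
# Cluster log goes to stdout, a file and the syslog simultaneously
LogDestination=CONSOLE;FILE:filename=my_cluster.log;SYSLOG:facility=syslog
```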
The management server uses HeartbeatIntervalMgmdMgmd for heartbeats between two management servers. This heartbeat is used in the communication between management servers to keep track of when management servers come up and go down.
By default it is set to 1500 milliseconds.
Here is a part of a cluster configuration file focusing on how to set up arbitration in a correct manner. First set ArbitrationRank to 2 in one of the management servers, then set it to 1 in one more management server and one MySQL Server. The rest can set it to 0.
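A sketch of what such a fragment could look like (the node ids 65 to 68 are assumptions for illustration, not prescribed values):

```ini
[ndb_mgmd]
NodeId=65
ArbitrationRank=2

[ndb_mgmd]
NodeId=66
ArbitrationRank=1

[mysqld]
NodeId=67
ArbitrationRank=1

[mysqld]
NodeId=68
ArbitrationRank=0
```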
Advanced Communication configuration#
Configuring send buffers#
The most important configuration parameters for communication are the send buffer sizes. These are configured partly through communication parameters and partly through node parameters.
TotalSendBufferMemory and ExtraSendBufferMemory#
In the past in NDB, the memory per communication channel was allocated independently of the others. Now the memory is allocated globally and all transporters share this memory. The size of this global buffer can be calculated in two different ways. The first is to add up all the SendBufferMemory settings.
In RonDB the size of send buffers are handled as part of automatic memory configuration. We will allocate 8 MByte per send buffer and an additional 2 MByte per thread in the data node.
The default option allocates memory to ensure that we can handle the worst case when all transporters are overloaded at the same time. This is rarely the case. With nodes that have many transporters it makes sense to allocate less memory to the global pool.
ExtraSendBufferMemory is 0 and can stay that way. It is always the sum of TotalSendBufferMemory and ExtraSendBufferMemory that is used, so there is no need to use two parameters here.
The opposite could happen in rare cases if the user runs very large transactions. In this case the commit phase might generate signals that grow the send buffer beyond the SendBufferMemory limit, even though we abort both scans and key operations.
In this case we might decide to set TotalSendBufferMemory.
In the API nodes the global memory allocated is all there is. In the data nodes we can overcommit up to 25% more memory than the global memory allocated. This memory will be taken from SharedGlobalMemory if available.
The SendBufferMemory is normally set in TCP default section in the cluster configuration file. It defaults to 8 MByte. It can be set per communication channel where both node ids must be set and the send buffer memory size is provided.
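A sketch of both variants, setting the default and overriding it for one specific channel (the node ids and the 16M override are assumptions for illustration):

```ini
[tcp default]
# Applies to every communication channel unless overridden
SendBufferMemory=8M

[tcp]
# Override for the channel between nodes 1 and 67
NodeId1=1
NodeId2=67
SendBufferMemory=16M
```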
As described in this chapter, SendBufferMemory influences the global memory allocated unless TotalSendBufferMemory is set and influences the OverloadLimit unless this is set.
If both those parameters are set, the SendBufferMemory is ignored.
Group sets the priority of a connection. This is used when selecting the transaction coordinator from the API nodes. It is managed through other configuration parameters; the user is not expected to change this parameter.
This is mostly a debug option. By using this option each signal sent over the network is tagged with a signal id. This adds 4 bytes to each signal header. It makes the trace files a bit more descriptive since it becomes possible to see, on a distributed scale, which signals initiated the signals executed in another node. Internally in a node all signals are tagged with a signal id, but by default we do not use this option for distributed signals.
This feature can be useful to ensure that no software or hardware bugs corrupt the data in transit from one node to another. It adds a 4-byte checksum on each signal sent. This is not used by default. If a faulty checksum is found, the connection between the nodes is broken and the signal is dropped with descriptive error messages.
We have a few ways to decrease the pressure on a communication channel. The first level is to declare the channel as overloaded. When this happens any scans using this channel will automatically set the batch size to 1. Thus each scan will use the minimum amount of send buffer possible and still handle the scans in a correct manner.
This happens when the send buffer has used 60% of the OverloadLimit setting.
The default for the OverloadLimit is 80% of the SendBufferMemory setting of the communication channel (called transporter in RonDB software).
When the overload limit is reached we start aborting key operations in the prepare phase. Operations in the commit phase will continue to execute as before.
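The two thresholds described above can be sketched as a small calculation (a sketch under the stated defaults, not RonDB source code; the function name and the integer rounding are my own assumptions):

```python
def overload_thresholds(send_buffer_memory, overload_limit=None):
    """Compute the two transporter pressure thresholds.

    Returns (slowdown_limit, overload_limit):
    - at slowdown_limit the channel is declared overloaded and
      scan batch sizes drop to 1,
    - at overload_limit key operations in the prepare phase are aborted.
    """
    # OverloadLimit defaults to 80% of SendBufferMemory
    if overload_limit is None:
        overload_limit = int(send_buffer_memory * 0.80)
    # The channel is declared overloaded at 60% of the OverloadLimit
    slowdown_limit = int(overload_limit * 0.60)
    return slowdown_limit, overload_limit

# With the default SendBufferMemory of 8 MByte:
slowdown, overload = overload_thresholds(8 * 1024 * 1024)
```

With the 8 MByte default this places the overload limit at roughly 6.4 MByte and the scan slowdown threshold at roughly 3.8 MByte of used send buffer.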
This parameter makes it possible to set the OverloadLimit independent of the setting of SendBufferMemory.
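A sketch of setting it explicitly in the cluster configuration file (the 4M value is an arbitrary example, not a recommendation):

```ini
[tcp default]
SendBufferMemory=8M
# Override the derived 80% default with an explicit limit
OverloadLimit=4M
```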
Each transporter has a receive buffer. This buffer is used to receive data into from the OS. Thus we can never run out of this buffer; we ensure that we never receive more from the OS than there is buffer space to receive into.
Before we receive more, we need to handle the signals already received.
The size of this buffer is set to ensure that we can receive data from the socket rapidly enough to keep up with the sender's speed of writing. It is set to 2 MByte by default.
Configuring OS properties#
The main reason for making the send and receive buffer options and TCP_MAXSEG configurable per connection was to handle API nodes that access NDB through a WAN (Wide Area Network) connection.
The default settings of TCP_SND_BUF_SIZE and TCP_RCV_BUF_SIZE are set to fit LAN settings where nodes are in the same building or at most a few hundred meters apart.
The number of outstanding IP messages that a socket can handle is controlled by these parameters together with TCP_MAXSEG.
The reason is that the sender must know that there is memory space in the receiver to avoid useless sending.
To handle API nodes that are farther apart, these three parameters were added to the list of configurable options on each connection.
TCP_SND_BUF_SIZE and TCP_RCV_BUF_SIZE defaults to 1 MByte. The normal OS send buffer size and OS receive buffer size is around 100 to 200 kBytes. TCP_MAXSEG is set to 0 which means that the OS default is used.
For long-haul connections one might need up to a few MBytes to ensure the parallelism is kept up.
By setting the configuration parameter WAN to 1 on the communication channel, TCP_SND_BUF_SIZE and TCP_RCV_BUF_SIZE are set to 4 MBytes instead and TCP_MAXSEG_SIZE is set to 61440 bytes.
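A sketch of enabling this on one channel (the node ids are assumptions for illustration):

```ini
[tcp]
# Channel between an API node far away and a data node
NodeId1=1
NodeId2=67
# WAN=1 raises the OS socket buffers to 4 MByte and sets TCP_MAXSEG
WAN=1
```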
Sets the send buffer size allocated by the OS for this socket.
Sets the receive buffer size allocated by the OS for this socket.
Sets the TCP_MAXSEG parameter on the socket.
By default the server part of the connection (the data node, or the data node with the lowest node id) will bind its socket to the hostname provided in the config. By setting the TcpBind_INADDR_ANY parameter to 1, we skip binding the socket to a specific host. Thus we can connect from any network interface. This can be useful when setting up RonDB in Kubernetes.
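A minimal sketch of enabling this for all channels:

```ini
[tcp default]
# Do not bind the listening side to a specific hostname
TcpBind_INADDR_ANY=1
```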