Online changes of Configuration#
NDB was designed for telecom applications to start with. A telco application is extremely well tested before it is put into production. In those situations it was expected that the configuration had already been tried out before starting it up in a production environment. In a test environment one can stop the cluster and change the configuration and run a new test.
In a telco production environment configuration changes will normally happen in conjunction with software changes and other changes. So a rolling restart of the cluster to change the configuration is ok.
The need to change the configuration while up and running hasn't been a focus earlier. Rather we have used the fact that any node can be restarted without impact on the cluster operation. Thus we can reconfigure through various node restarts.
Now we have made major steps forward to make the product easier to configure. We have also removed much of the need of configuration by writing algorithms that automatically tune various configuration parameters. This is still work in progress, but every new version will have more and more focus on ease of changing while the cluster is up and running.
We have also added a couple of commands that can change some of the configuration parameters in a running cluster.
The cluster configuration contains the full configuration of all data nodes in the cluster. It describes how the nodes in the cluster communicate with each other. It defines the set of nodes of various types that we have in the cluster. A few parameters are defined for the NDB API part in the API nodes. The API nodes often have their own set of configuration parameters that are application specific. We will describe the MySQL Server configurations that are specific to RonDB.
Each node that starts in the cluster retrieves the current cluster configuration from one of the management servers. The startup of the node uses this configuration. Changing anything in the configuration normally requires a restart of the node. Given that the cluster can operate even in the presence of node restarts, this isn't critical for the availability of the cluster. It does mean that it takes more time to reconfigure the cluster.
Lately we have started to add commands to the management client that make it possible to change the configuration in a node dynamically. This only applies to the data nodes.
The cluster configuration is stored in a configuration database that is maintained by the management server. This configuration database is updated using the procedure described here and in a previous chapter on handling of the ndb_mgmd process. There are also two commands that can be sent to the RonDB management server to set a new configuration or to reload a new configuration file. We have made a few of the configuration variables changeable through this interface using the management client.
Any change to the configuration through the management client can either be applied to the configuration database or be applied only to the current settings in the data node. In the latter case the setting will be set back to the value in the configuration database at the next restart.
Adding nodes to the cluster#
Every node sets up communication to the other nodes as part of its startup. It is currently not possible to add new nodes for an already started node to communicate with. One reason we are a bit careful about this is that adding a new communication link requires new memory buffers and we try to avoid memory allocations in a live node since it can easily disturb the operations in the running node.
To add a node (or several ones), it must first be added to the configuration through a distributed configuration change. Next all nodes must be restarted such that they can communicate with the new node.
Managing the configuration database#
The configuration database is stored in the set of management servers. The easiest manner to explain how this database is managed is to describe the workings of the management server at startup.
The workings of the management servers are not so well described in the MySQL documentation, so I provide a fairly detailed description of both how it works and how it is intended to be used. Problems in changing the configuration are probably the most common question on the various forums for NDB.
When a management server starts up it will start by retrieving the command line parameters. The next step is to read the configuration.
If the startup option --skip-config-cache has been provided we will always read the user provided config.ini (or whatever name of the configuration file that was specified in the startup parameters). This mode disables the configuration database. The management servers will no longer work together to check the configuration, and the configuration will not be checked against any previous configuration.
Personally I would never use this mode for any production cluster, I would only use it for setups with a single management server. I use it when running my benchmark scripts. The reason is that I always start my benchmarks from a completely new cluster, so I skip all the complexity of managing configuration versions. For the rest of the description here we will assume this setting has not been used.
Next step is to check that the configuration directory is created. This was specified in the startup parameter --configdir. The default setting is dependent on the build of RonDB. For public builds the default placement is /var/lib/mysql-cluster. It is recommended to always set this parameter when starting the RonDB management server.
Get node id#
Next we will get our node id. First we will check the startup parameters, this is the preferred method of setting the node id.
Next we will check the configuration binary files. These files have a node id in their name, as long as there is only one node id in all the files stored in this directory we can proceed after verifying that another management server isn't already started using this particular hostname.
If no binary files exist we will check in the user provided configuration file. It will look for a management server executing on the host we are starting from.
If we have multiple management servers running on the same host it is mandatory to set the node id when starting the management servers.
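The lookup order described above can be sketched as follows. This is an illustrative Python model and not the management server's code: the function name, its arguments and the assumed file name pattern `ndb_<nodeid>_config.bin.<seq>` are assumptions made for the example.

```python
import re

# Assumed (not stated in the text) file name pattern: ndb_<nodeid>_config.bin.<seq>
BIN_FILE = re.compile(r"^ndb_(\d+)_config\.bin\.\d+$")

def resolve_node_id(cli_node_id, configdir_files, config_mgmd_hosts, my_host):
    """Return the node id a starting management server would use.

    cli_node_id: node id from the startup parameters, or None
    configdir_files: file names found in the configuration directory
    config_mgmd_hosts: {node_id: hostname} for management servers in config.ini
    my_host: hostname we are starting on
    """
    # 1. A node id given at startup always wins.
    if cli_node_id is not None:
        return cli_node_id
    # 2. Otherwise look at the node ids in the binary configuration file names.
    ids = {int(m.group(1)) for f in configdir_files if (m := BIN_FILE.match(f))}
    if len(ids) == 1:
        return ids.pop()
    if len(ids) > 1:
        raise ValueError("several node ids in configdir, set the node id explicitly")
    # 3. Finally search the user provided configuration file for our host.
    matches = [nid for nid, host in config_mgmd_hosts.items() if host == my_host]
    if len(matches) == 1:
        return matches[0]
    # No match, or several management servers on this host.
    raise ValueError("cannot determine node id, set it explicitly")
```

With two management servers configured on the same host the last step finds two candidates, which is why the node id must then be given at startup.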
Initial Configuration Change#
Next we will check if the --initial option was provided. In this case we will verify the new configuration and if ok we will delete all saved configuration binary files, if not ok the management server won't start.
The intention of the --initial flag is to use it when installing a new cluster in the same place as a previous one. Thus the old configuration is removed and a new one installed without checking that the change from one configuration to another is ok. This flag only matters if --config-file was set. This option is not intended for any form of upgrade procedure. It is intended for use when creating a new cluster where we already had a previous one.
As an example the old cluster could be based on a configuration with NoOfReplicas set to 1. Now we have completed the testing with just one replica and now we want to extend the cluster to using two replicas. In this case we need to take a backup of the cluster and restart the cluster from initial state. In this case one can use the --initial flag to overwrite the old configuration without any checks of upgradability at all.
Next we attempt to read the latest binary configuration file. If none exists we will read the user provided config.ini and create the first version of the binary configuration file. In this case we must perform a distributed initial configuration change transaction. This will not be performed until all management servers have started up. It is possible to specifically exclude one or more management servers using the --nowait-nodes parameter.
The initial configuration change is a bit different to other configuration changes in that we apply some rules for how to set default parameters.
As mentioned this default logic will only be applied to an initial configuration change. It will not be applied to any later configuration change. In later configuration changes we will only change the variables defined by the user. If the user wants to skip the default logic one can start the management server with the option --skip-default-logic.
If there is no binary configuration file and we didn't specify a configuration file, we assume that another management server has started up with a configuration. We fetch it from there. If this is not the initial version we will simply write it down to our configdir and the management server is started.
If it was fetched from a remote management server and it was the initial version we will wait to complete the startup until the initial distributed configuration change transaction has completed.
For an initial configuration change to be executed all management servers must be up and running.
If one node attempts to perform an initial configuration change and another node is not attempting to perform an initial configuration change the node attempting to start an initial configuration change will stop.
To start a new configuration when an existing one was around it is necessary to start up with --initial in all management servers AND all management servers must provide the same new initial configuration. Otherwise the change to a new initial configuration will fail.
When starting up the first time all management servers must be in an initial state, but only one of them is required to specify a configuration file.
When all nodes have started and it is verified that they can start an initial configuration we proceed with a change configuration request.
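The preconditions for an initial configuration change can be summarised in a small check. This is a simplified sketch under stated assumptions: the dictionary keys are hypothetical, management servers excluded at startup are simply left out of the list, and the rule "all supplied configurations must be identical" is used for both a first start and a restart with --initial.

```python
def initial_change_can_proceed(servers):
    """servers: list of dicts with keys 'started', 'initial_state' and 'config'.

    'config' is the configuration text a server was given, or None.
    """
    if not all(s["started"] for s in servers):
        return False  # every management server must be up and running
    if not all(s["initial_state"] for s in servers):
        return False  # a server not in initial state makes the attempt stop
    provided = {s["config"] for s in servers if s["config"] is not None}
    # At least one server must supply a configuration, and all supplied ones must agree.
    return len(provided) == 1
```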
Normal management server startup#
If we find a binary configuration file and we didn't specify --reload the startup is done. No configuration change was requested. The management server is immediately ready to accept new requests for the configuration from starting nodes.
Now if --reload was specified we must also specify --config-file=/path/to/config, if not the --reload parameter will be ignored.
Change Configuration Request#
When a management server has been started with --reload and a new configuration file was provided through the --config-file option we will attempt to change the configuration. We will verify that the configuration actually has changed. We will write the diff of the new configuration and the old configuration to the node log of the management server. Next we will start a change configuration request.
There are two other methods available to start a change configuration request. Both are available through the NDB MGM API. The first one is called set config and sends an entire new configuration to change into. The second is called reload config and sends the file name of the new configuration. The file must be present on the host of the management server this request is sent to.
A few test programs use this feature to test things related to configuration changes.
For a DBA to change the configuration the below procedure can be used to change any configuration parameter. We have also implemented a set of commands in the management client to change the configuration using commands. We will cover those in the next section.
To use the method based on restarting the management server using --reload we must first write the new configuration file. The first step here is always to print a config.ini that would create the same configuration database as is currently installed. This is performed by the following command, where one points to a running management server.
ndb_config --ndb-connectstring host[:port] \
  --print_config_ini > config.ini
After that one edits this file to change the parameters as desired and starts the procedure by restarting a management server with --reload and providing the --config-file parameter pointing to the new config.ini file.
Now back to what the management server does when performing the startup with those parameters.
A transaction is used to change the configuration. First a diff is built from the old to the new configuration. It is not allowed to change the type of a node. E.g. we cannot have node id 2 as a data node and then change it to become an API node. This is an illegal change.
The new generation of the configuration change must be one more than the previous. This protects against multiple changers of the configuration trying to change it at the same time. This can happen if someone performed a change of the configuration while the management server was down.
In addition the cluster name cannot change and neither are we allowed to change the primary management server. The primary management server is the last management server that performed a change of the configuration by using --reload, it is cleared if a change is performed through the NDB MGM API. Thus when using this procedure one should always use the same management server to change the config.
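The checks described above can be expressed compactly. The following is a minimal sketch that only models the rules named in the text; the `Config` class, its field names and the use of `None` for "no primary management server recorded" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Config:
    generation: int
    cluster_name: str
    primary_mgmd: Optional[int]   # node id of the primary management server, if any
    node_types: Dict[int, str]    # node id -> "ndbd", "mgmd" or "api"

def check_change(old: Config, new: Config) -> List[str]:
    """Return the list of rule violations; an empty list means the change is legal."""
    errors = []
    if new.generation != old.generation + 1:
        errors.append("generation must be exactly one more than the previous")
    if new.cluster_name != old.cluster_name:
        errors.append("cluster name cannot change")
    if old.primary_mgmd is not None and new.primary_mgmd != old.primary_mgmd:
        errors.append("primary management server cannot change")
    for node_id, old_type in old.node_types.items():
        # Nodes missing from the new configuration are not treated as a type change here.
        new_type = new.node_types.get(node_id, old_type)
        if new_type != old_type:
            errors.append(f"node {node_id} cannot change type from {old_type} to {new_type}")
    return errors
```

Adding a new API node with generation 4 on top of generation 3 passes, while turning data node 2 into an API node is reported as an illegal change.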
When changing the configuration it is important to consider a few configuration parameters that have special rules for their change. We will go through those parameters in the next section.
After these checks the management servers together execute a configuration change and if all goes well they decide jointly on a new configuration database version and each management server writes this new version into a binary configuration file using the new version number as part of the file name.
The end result of the startup is one of the following:
The management server started and is ready to receive requests
The management server has become part of an initial configuration change
The management server has started a configuration change transaction
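The startup flow that leads to these three outcomes can be condensed into one decision function. This is a sketch of the description in this chapter and not the actual implementation. The argument names are hypothetical, and startup with --skip-config-cache is not modelled.

```python
def startup_outcome(has_binary_config, initial=False, reload=False,
                    config_file=False, config_differs=False,
                    remote_is_initial=False):
    """Return which of the three end results a management server startup reaches."""
    # --initial only matters together with --config-file: saved binary files are removed.
    if initial and config_file:
        has_binary_config = False
    if not has_binary_config:
        if config_file:
            return "initial configuration change"
        # No local configuration at all: it is fetched from another management server.
        return "initial configuration change" if remote_is_initial else "ready"
    # A binary configuration exists; --reload is ignored without --config-file,
    # and nothing happens if the configuration did not actually change.
    if reload and config_file and config_differs:
        return "configuration change transaction"
    return "ready"
```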
Special configuration parameters#
The types of special rules for changes are the following. Some configuration parameters require a restart of the node to take effect. This is currently the case for every configuration parameter.
A set of configuration parameters requires an initial node restart to be changed.
There is a set of configuration parameters that require a full cluster restart to be changed correctly and there are a few configuration parameters that cannot be changed other than through a completely new cluster installation.
NoOfReplicas is a parameter that cannot change once the cluster has been started up. The placement of the data nodes into node groups depends on this parameter and it is required that all node groups have the same number of nodes as the number of replicas.
NoOfFragmentLogParts is possible to change, but it requires an initial node restart to change it. The number of partitions in a table is defined by the number of log parts in the master node. Normally all nodes in the cluster share the number of log parts.
The number of log parts can be changed and the change will take effect. However it won't change the number of partitions in already created tables. NDB supports increasing the number of partitions in a table, but it doesn't support decreasing the number of partitions in a table. In addition the fragments are mapped to a log part, and this log part will not change even if the number of log parts changes. Thus we cannot decrease the number of log parts since that would mean that some fragments have no place to go.
In practice changing the number of log parts can be done if it is changed to a higher number of log parts. But the change will only affect the new tables created after changing the number of log parts in all nodes through a rolling initial node restart.
A data node will fail to start if the number of log parts has been changed in an incorrect manner.
Number of fragment log files can be changed, but it requires an initial node restart to be successful since we would forget about the old REDO log files and in addition we might fail the recovery if we change this without a proper initial node restart. The data node will stop the restart if it discovers that a change has been performed of the number of fragment log files without performing an initial node restart.
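The rules for changing the number of log parts can be written down as a small pre-check that a DBA could run before editing the configuration. This is a hypothetical helper that only mirrors the rules in the text, not a check taken from the data node code.

```python
def check_log_parts_change(old_parts, new_parts, initial_node_restart):
    """Return None if the change is acceptable, else a short reason string."""
    if new_parts == old_parts:
        return None
    if new_parts < old_parts:
        # Existing fragments stay mapped to their log part, so none can be removed.
        return "cannot decrease the number of log parts"
    if not initial_node_restart:
        return "an increase requires an initial node restart"
    return None
```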
Same as for NoOfFragmentLogFiles it can be done, but requires an initial node restart.
This parameter only affects if files are to be written entirely as part of an initial node restart or not. Thus it can be changed at any time, but will only change the behaviour of initial node restarts and initial cluster starts.
The following parameters also only affect behaviour of initial node restarts and can thus be changed at any time, but the effect will only be seen on the next initial node restart/start.
This is mostly a test parameter and for use cases where we will never bother with restoring from disk. It is only possible to change through installation of a new cluster. In principle it will work also with an initial node restart.
Moving any directories around means that we will no longer find any of our files for recovery. Thus this can only be done through an initial node restart. This holds true also for the following similar parameters.
A problem with those parameters is that changing those means that we have no way of checking that the change is correct when the node starts up. Any information about what we started with in previous starts is only accessible in the recovery files and if we moved those to a new location we will not have the capability to verify that they have changed.
Here we will notice that it won't work since the node recovery will fail in one way or the other.
The port used to connect to a data node cannot be changed. The ServerPort of a node can only change when that node starts up, and since that requires the node to be down, the other nodes cannot at the same time be restarted to update their config. Changing it would thus require the parameter to be changeable dynamically.
Only when this is changeable dynamically can we change this parameter.
PortNumber of a management server can be changed, but it requires the use of multiple management servers. The management server can move entirely to a new location while the other management servers are up and running. While the management server is down the remaining management servers can change the configuration for the moving management server.
The node id of a data node cannot change. Similarly, the Nodegroup of a data node cannot change.
Most of the parameters that specify the maximum number of objects of various kinds can be changed at any time. It is always safe to change them to a higher value. Changing them to a lower value might fail if not enough memory is available to restart the node. This mainly applies to configuration parameters relating to meta data objects.
Most of the things about ThreadConfig can be changed in a node restart. One has to pay special attention to the number of ldm threads however. The number of ldm threads cannot be set higher than the number of log parts. Normally the number of ldm threads should be equal to the number of log parts.
This applies also to HeartbeatIntervalDbApi.
This can be changed dynamically. All the nodes must have very similar numbers. The heartbeats are sent with the interval that defaults to 1500 milliseconds. After four missed heartbeats, that is after around 6 seconds, we will fail due to missed heartbeats.
This means that doubling or cutting in half or smaller changes of the heartbeat interval should be safe. One should still change in all nodes through a rolling restart such that they have the same value.
Through multiple steps it is possible to change the heartbeat interval to any value.
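The stepping idea can be made concrete with a small helper that lists the intermediate values, where each step changes the interval by at most a factor of two. This is a sketch under the assumption that a factor of two per rolling restart is the safe limit, as suggested above. The function name is hypothetical and the values are whole milliseconds.

```python
def heartbeat_steps(current_ms, target_ms):
    """List the interval values to roll out, each at most a factor 2 from the previous."""
    if current_ms <= 0 or target_ms <= 0:
        raise ValueError("intervals must be positive")
    steps = []
    value = current_ms
    while value != target_ms:
        if target_ms > value:
            value = min(value * 2, target_ms)
        else:
            # Halve, rounding up so that we never drop below half of the previous value.
            value = max((value + 1) // 2, target_ms)
        steps.append(value)
    return steps
```

Going from 1500 to 10000 milliseconds gives the steps 3000, 6000 and 10000, where each step is applied to all nodes through a rolling restart before the next one is started.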
Procedure to perform online config changes#
To change the configuration in the cluster requires more than one step. The first step is to change the configuration database as described above. This step doesn't change the running configuration. To change the configuration in the running cluster the nodes must be restarted to read in the new configuration.
To ensure that this change can be performed without downtime we use a rolling restart procedure. This means restarting the data nodes in a sequence. It is important to NOT restart all data nodes at the same time. This would give downtime. The idea is rather to restart one node per node group at a time. In a two-node cluster one first restarts the first data node, when this restart is completed we restart the next data node. In larger clusters one can stop one node per node group at a time to perform the rolling restart.
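The restart order can be derived mechanically from the node group layout. The following minimal sketch assumes the node groups are given as lists of data node ids. The function name is hypothetical.

```python
from itertools import zip_longest

def rolling_restart_batches(node_groups):
    """node_groups: list of lists of data node ids, one inner list per node group.

    Returns batches of node ids; each batch holds at most one node per node group,
    and a batch should only be started when the previous one has completed.
    """
    batches = []
    for batch in zip_longest(*node_groups):
        batches.append([node_id for node_id in batch if node_id is not None])
    return batches
```

For a two-node cluster `[[1, 2]]` this gives the batches `[1]` and `[2]`. For two node groups `[[1, 2], [3, 4]]` it gives `[1, 3]` followed by `[2, 4]`, so no node group ever has all its nodes down at the same time.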
MCM has very good support for performing rolling restarts without much user interaction. Using the management client one can perform the rolling restart through a set of RESTART commands that will perform a graceful shutdown of the cluster (ensures that impact on running transactions is minimised).
Dependent on the change we might also need to restart the API nodes and the management servers.
It is not recommended to change the configuration in the cluster by restarting all nodes at the same time.
The reason for this is that the restart of a node with the new configuration might fail. It is not possible to check the correctness of each change without actually performing a node restart. This is especially true in the case when we want to decrease some resource. In this case it is very important to ensure that the cluster can survive such a configuration change by performing a node restart with the new configuration.
For example if we want to remove some DataMemory from each data node and allocate more memory resources for some other resource such as the DiskPageBufferMemory it is important to verify that the new DataMemory setting is ok by performing a node restart with the new setting. A cluster restart would in this case fail the entire cluster and in worst case the cluster might become unrecoverable, this would not normally happen, but restarting using node restarts is a lot safer in those cases. If any form of corruption would occur due to the configuration change one can always restore the data in the data node by using an initial node restart after changing back to the old configuration.
Normally the configuration changes are safe and should not have any dire consequences, at the same time it is hard to analyse and test every conceivable configuration change that is possible. It is a good idea to follow a safe procedure that never puts the data in NDB at risk.
One possibility might simply be that the memory is overcommitted with a new configuration and this is not possible to test before the change is made. It is usually a good idea to use the configuration parameter LockPagesInMainMemory to ensure that we don't get problems with swapping when changing the configuration.
If the node restart fails, one should analyse the node logs and error logs to see why it failed. In many cases there is a proper error message written in those cases. If it fails and it is not possible to deduce why the change failed, then change back to the original configuration.
It is a good idea to consider filing a bug report or writing a forum question in the case where the error messages are not easy to understand. This will help us improve the product to make it easier to manage.
Online changes of MySQL Server configuration#
Changing the configuration of a MySQL Server can often be done as an online operation. The MySQL Server can change its configuration using various SET commands. The MySQL manual contains a detailed description of each configuration variable and how they can be changed and the scope of those variables. Some variables exist per connection and some are global in the MySQL Server. This is true also for the NDB specific configuration options in the MySQL Server.
Online add MySQL Server#
Adding a MySQL Server requires adding a new node to the configuration. The first step here is to add this new node in the configuration using the above procedure to change the configuration database. Next a rolling restart of all the data nodes is performed as described above. After this procedure the cluster is ready to use the new MySQL Servers.
Exactly the same procedure can also be used to add new API nodes.
As usual one should take special care in deciding if these new nodes should have their node id and hostname set in the configuration. Setting those provides for a safer cluster setup and a more manageable cluster.
Often it is a good idea to configure with more API nodes than necessary. This commits a bit more memory to the send buffers and receive buffers in the nodes, but if that is ok, it makes it easier to add new MySQL Servers or API nodes as desired.
Adding new API nodes doesn't require a restart of MySQL Servers and other API nodes, nor of the management servers. There are no communication links between those nodes other than between API nodes and management server at startup to get the configuration.
Online change of data node configuration#
Changing most of the configuration parameters in the data nodes requires a change of the configuration database followed by a rolling restart of all data nodes.
An exception here is if we only want to change the configuration of one data node or a subset of the data nodes. This is rarely something that should be done since most of the time it is best to let the data nodes have the same configuration.
Online add data node#
The most advanced configuration change is adding data nodes. This requires several steps to be performed.
First the new data nodes must be added to the cluster configuration. It is required that they are added using a node group id of 65536. This says that the nodes do not belong to a node group, thus they will not yet store any data.
Next a rolling restart is required to ensure that the new nodes are added to the configuration. This rolling restart must also restart all API nodes, all MySQL Servers and all management servers in the cluster. These nodes also need to set up connections with their buffers to the new data nodes. It is important to restart those to prepare them to use the new data nodes.
After this configuration change the new nodes exist in the cluster. It is possible to perform this configuration change as a preparatory step and only make use of those new data nodes long after. They will consume a bit of memory in the nodes to handle send and receive buffers, but other than that there is no special cost of having them in the configuration.
When starting a cluster initially the configuration parameter StartNoNodeGroupTimeout will specify how long we will wait for those nodes to start up before we start the cluster. They are not absolutely necessary to start up the cluster. This parameter is by default set to 15000 milliseconds (15 seconds).
The next step is to start up the new nodes. They should be started with the --initial flag to ensure that they initialise all the files before coming online.
Next we create new node groups using those new data nodes.
Here is an important thing to consider when adding data nodes. Data nodes are always added in groups at the size of NoOfReplicas. With 2 replicas we add the nodes in groups of two nodes. We can add several node groups in one go, but we cannot add single data nodes with missing nodes in the node group.
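The grouping rule can be sketched as a helper that splits a list of new data node ids into complete node groups and rejects a list that would leave a node group incomplete. The function name is hypothetical, and pairing the nodes in ascending node id order is an assumption made for the example, since which nodes form a group is the administrator's choice.

```python
def plan_node_groups(new_node_ids, no_of_replicas):
    """Split new data node ids into complete node groups of size no_of_replicas."""
    if no_of_replicas < 1:
        raise ValueError("no_of_replicas must be at least 1")
    if len(new_node_ids) % no_of_replicas != 0:
        raise ValueError("new data nodes must be added in complete node groups")
    ids = sorted(new_node_ids)
    return [ids[i:i + no_of_replicas] for i in range(0, len(ids), no_of_replicas)]
```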
A new node group is formed of the new nodes by using a command in the management client. Assume we had 4 nodes to start with in 2 node groups. Now we want to add two more nodes with node id 5 and 6 to form a new node group. The command to do this is:
CREATE NODEGROUP 5,6
Now we have created new node groups. Existing tables don't use those new node groups at the moment. Any new tables created after this command will start using those new node groups.
Now it is possible to reorganise the tables to use those new data nodes.
The algorithm to reorganise a table is specifically developed to ensure that it doesn't increase memory usage on the existing nodes while the reorganisation is ongoing. This makes it quite a lot safer to make this change.
One table at a time is changed by using the command in the MySQL Server used for meta data operations.
ALTER TABLE table_name REORGANIZE PARTITION;
After completing this command for a table it is recommended to also issue the command:
OPTIMIZE TABLE table_name;
This ensures that the space for the rows that were moved to the new node groups is reclaimed such that it can be used also by other tables in the cluster. Otherwise the memory will remain allocated to the partitions in the existing tables.
It is recommended to perform a backup of the cluster before starting this add data node operation.
The above algorithm to reorganise data in the cluster can also be used to make use of new REDO log parts after changing the configuration parameter NoOfFragmentLogParts. This change is however not entirely safe since there is no preallocation of the memory required to perform the change. In this case the data is moved from one partition in a data node to another partition in the same data node. Since data is replicated in both partitions during the change this requires additional memory during the change.
It is not recommended to use this command at this point in time. If used it is important to ensure that sufficient memory is available to perform this change otherwise the cluster will be stuck in a situation that is very hard to roll back from. The software will attempt to roll back the change, but there is not a lot of testing of this scenario in our development, thus we do not recommend using it. It should only be used when adding new data node groups.