Meta Data Changes#
In this chapter we will cover how we can change the meta data in RonDB. In most cases it can be done as an online operation.
Any time spent altering meta data and not allowing writes to the tables in parallel means downtime. Therefore much resources have been spent to ensure that most normal meta data changes can be done online.
One problem is that the storage engine API doesn’t allow for a true online operation in the MySQL server that performs the meta data change. Thus as will be discussed in a later chapter it is necessary to perform meta data changes in a specialised MySQL server that is only used for meta data changes (at least at the time the meta data changes are performed).
Creating a new table and dropping a table is obvious online operations. All meta data operations are performed as distributed schema transactions. For example when creating a table we create a base table, zero or more ordered indexes, zero or more unique index tables, zero or more BLOB tables, zero or more foreign keys and internal triggers to keep indexes up to date. All of those are created atomically.
This requires a stepwise approach where each step in the creation process is recorded on disk to ensure that we can roll forward or roll it back in recovery situations. Due to this stepwise approach the create table statements takes a bit of time. A few tables per second can be created in RonDB. Speed of meta data operations isn’t the focus in RonDB, the availability for reads and writes during those operations is rather the focus.
If Global Replication is activated it will be an integral part of the create table statement to ensure that the create table is replicated to the backup clusters. This increases the time to execute the create table statement. The create table statement have to be synchronised with all MySQL servers that are used for binary logging and we have to synchronise writes of data and meta data changes.
From this we deduce that meta data operations are online, but they are not necessarily fast. The focus of meta data operations is that they are properly synchronised with all cluster nodes and synchronised with backup clusters. RonDB is focused on applications where meta data changes are done in relation to software upgrades that happen, but are rare compared to the number of user data transactions and queries.
Drop table is an online operation. One complication for drop table is when scans on the table is still ongoing at the time when the drop table is issued, we can also have a number of internal recovery operations that could be ongoing for the table. In this case we might have to wait for a short time until the drop table is completed. As part of drop table we remove all files that are specific to the table and extents of any tablespaces are free’d up as part of drop table. Similarly the memory connected to the table is released.
Create Ordered Index#
Adding a new ordered index starts by creating a new index table. This involves adding internal triggers on update, delete and insert operations that ensure that the ordered index is kept up-to-date.
This operation is online such that no writes will be blocked by the creation of this new index. The operation will take some time, the table have to be scanned and for each row the row will be inserted into the index and internal triggers will ensure that the index is kept up-to-date.
During an update we know if the ordered index scan build has reached the row we are updating since each row has a row id and the build scan is scanning in row id order. The index contains a sorted list of row ids. The sort order is based on the index columns and whether the index is ascending or descending.
The creation of a new ordered index can take substantial time for large tables, but it will run without interfering with user transactions other than that it will use new memory resources and will use CPU resources for the index build.
Ordered indexes are always in memory, all ordered indexes are rebuilt as part of all node restarts and cluster restarts. This rebuild is an offline build that happens when the node isn’t involved in any transaction.
The ndb_restore program have the ability to create new ordered indexes that are built as an offline operation. Thus the creation of the index is much faster, but it isn’t an online operation. A restore of a backup isn’t expected to happen in parallel with user transactions.
Drop Ordered Index#
Dropping an index only releases the memory and removes the meta data about the index. As when you drop a table we will wait for up to 5 seconds for any scans that are still using the ordered index.
Create Unique Index#
Creating a unique index means creating a real table. The table is special in that it cannot be accessed as a table, it can only be used to access the main table.
Internally it is a normal table. This table have a hash index as all other tables in RonDB. It doesn’t have any other indexes. It is possible to handle cases where a column amongst the unique index columns value can be NULL.
The table is recoverable if the main table is. There is a trigger from the main table to the unique index table ensuring that any updates on the unique index columns in the main table will be reflected in the unique index table. An update leads to both a delete and an insert. An insert leads to an insert in the unique index table and a delete leads to a delete in the unique index table.
Drop Unique Index#
Dropping a unique index is similar to dropping a table, it removes the memory, the files and the internal triggers for the unique index table.
Create/Drop Foreign Key#
A foreign key have two components. The first component is an index on the reference columns in the child table. The second component are triggers that verify that the foreign key constraint isn’t broken by any updates.
As part of creating a foreign key one needs to scan the data to verify that existing data meets the foreign key constraints.
Adding one or more new columns to a table can be an online operation that is merely changing the meta data. For the operation to be online the column must be an in-memory column and it has to use the DYNAMIC format. In addition the new columns must be NULLable columns or they need to have a default value. NULL or the default value is the value of all rows for the new column after adding the column.
The internal row structure have three variants of implementing a column. One variant is fixed size, in this case the column have a fixed size which is the maximum size of the column. The second variant is variable sized columns. In this case there is at least two bytes for the column that provides the pointer to the column within the row and if NULLable there is also a bit in the NULL bit array. Both of these variants can only be added when the table is created or when the table is altered using the offline algorithm (the offline algorithm creates a new table and copies data over to the new table).
The third variant is to place the column in the DYNAMIC part. In this part columns are only present if they have a value different from the NULL or default value. So adding a column in this part can happen by simply changing the meta data and there is immediately a new column in the row. Accessing a column in the DYNAMIC part is more costly, but a lot more flexible.
When defining a column it is possible to specify whether it should be using the FIXED format, the DYNAMIC format or if it should use the DEFAULT format. The default format is the FIXED format.
When adding a new column one can specify whether the column is an in-memory column or if it is a disk column. The default is in-memory columns, the default can be changed on the table level by setting the STORAGE to DISK on table level.
Tablespaces are used to store the disk columns. Each table with disk
columns need to use a specific tablespace. Tablespaces can be added as
part of the configuration using the
variable. In this case the tablespace is added as part of the initial
start of the cluster.
New tablespaces can be added as online operations. A tablespace cannot be used until it has been created, a tablespace can only be used in a table if the table is created after the tablespace is created.
In a tablespace we can have one or more data files. The first data file is always added as part of creating the tablespace.
When a tablespace is created one needs to define the extent size. The extent size is the unit of allocation from the tablespace to a table. When an insert to a table needs a free page and there are no more free pages in the extents connected to the table, a new extent have to be allocated from the tablespace. When creating a tablespace one sets the size of the tablespace by setting the initial size of the tablespace. This is the size of the first file added to the tablespace.
Creating a tablespace can be a lengthy process since the data file will have to be initialised. The reason for this is that otherwise the OS will not ensure that the actual disk space needed for the file is allocated and this could cause the tablespace to become full even without using the full size specified.
Add Datafile to Tablespace#
It is possible to add new data files to an existing tablespace. The only things to set here are the file name of the new data file and the initial size of the data file (the size cannot be extended after it is added, only new files can be added).
Drop tablespace and data files#
Data files and tablespaces can be dropped as online operations as well. But they can only be dropped if they are completely empty. Extents allocated to a table can only be released by dropping the table. The allocation from a table is not directed to any specific data file. It is only practical to drop an entire tablespace, individual data files can only be dropped immediately after they were added before they had been used the first time.
Create Logfile group#
Disk columns need an UNDO log in addition to the tablespace. The UNDO logs are managed by the logfile group that each tablespace need. When creating a tablespace one need to define a logfile group this tablespace is using. Currently we can only have one logfile group.
A logfile group can be created from the configuration by using the
InitialLogfileGroup. It can also be created
using an SQL command. Either way we need a first log file, we need to
set the size of the UNDO log buffer size connected to this logfile group
and we need to set the initial size of the log file.
Log files are initialised when created, the command is online, but will take some time.
Add Logfile to LogfileGroup#
New log files can be added at any time to the UNDO log. We need to set the name of the file and the initial size of the file. Also here the file needs to be initialised when added and the command is an online command.
Most users of RonDB will start out with a small 2-node cluster with 2 replicas. If the usage of the cluster increases it is important to be able to add new nodes to the cluster.
To add a new node there are two steps needed. First of all one can only add entire node groups currently. If the cluster uses 2 replicas we have to add 2 nodes at a time. After adding a new node group the existing tables need to be reorganised to use the new node group(s).
Add Node Group#
The first step in adding a new node group is to add the new nodes to the
configuration. These new nodes should be configured with
configuration parameter set to 65536. Thus the node is not part of any
node group. Thus other nodes will setup the communication paths to these
nodes, but the nodes will not yet be used to store data. New tables
created will not attempt to place table partitions into those nodes yet.
In principal it is possible to add these new nodes early on, the only problem with this is that the nodes will have to allocate memory for all send buffers to these nodes.
The change of configuration is covered in the chapter on programs and more specifically in the section on the RonDB management server.
Adding a new node group is an online operation and it is a schema transaction that is atomically done. It is executed using the ndb_mgm client. The command to add two new nodes 3 and 4 to an existing cluster with node 1 and 2 in the first existing node group is done like this:
CREATE NODEGROUP 3,4
After this command is completed the new node group is created and used when new tables are created. The creation of a new node group is a fairly quick operation.
The creation of tablespaces and logfile groups are performed during the first start of the new nodes. This is not part of the creation of new node groups. Starting up a new node for the first time can take some time to complete even though no data needs to be transferred to the node during restart.
It is possible to drop a node group, but it can only be done if the node group is empty.
Immediately after creating a new node group any new table will use the new node group.
Reorganise existing tables#
After adding one or more new node groups it is recommended to reorganise existing tables as well. This is done one table at a time by issuing the command:
This command will be done online. It will take some time since the entire table have to be changed.
The algorithm used follows this procedure.
A number of new partitions are created in the new node groups
A table scan is started in each previously existing node
For each row we decide if the row should be copied to the new partitions
Wait for scan to complete
Start a new table scan on the previously existing node
Each row that was copied over will now be deleted
Wait for scan to complete
In parallel with this we ensure that each transaction will write to both the old node group and the new node group. In effect some of the rows will be in two partitions at the same time during the change.
Fully replicated tables are reorganised in a different manner. In this case there is no need to delete any rows since a new node group will increase the number of replicas.
Optimize table is a command that removes fragmentation from the in-memory columns for the variable sized parts. A row in RonDB consists of up to 3 parts. The first part is the fixed part. The fixed part cannot be compacted since the row id is the reference to the row. Changing the row id would be very complex since it would create complex interactions with scan operations that happens in parallel with optimize table. The second part is the variable sized part of the rows. This stores all variable sized components and all dynamic columns. This part can be fragmented and this part is compacted by an optimize table operation. The third part is the disk columns part which are fixed size. Currently we don’t support returning an extent to the tablespace. This part is not compacted as part of an optimize table operation.
Fixed size columns are much faster, but DYNAMIC columns are more flexible. If compacting memory is most important, one should use the DYNAMIC keyword on columns (or on the entire table). It is possible to set a configuration parameter such that all tables are created with DYNAMIC columns as the default.
It is possible to reorganise a table also when the number of ldm threads have changed. Similarly if we change the PARTITION_BALANCE strategy from e.g. FOR_RA_BY_NODE to FOR_RP_BY_NODE a reorganise action can be performed. It is only possible to increase the number of partitions in a table, thus no decrease of number of partitions is currently possible as online operations.
An important thing to consider is that performing a reorganise table isn’t a safe operation unless we’re doing it in conjunction with add nodes. The reason is that we increase the memory size during the reorganise operation. For add node this is ok since the memory allocated is in new nodes that had no data previously. If we run out of memory during the reorganise operation we will roll back the change. This is not the most well tested part of RonDB. So one should ensure that there is sufficient amount of memory to perform the reorganise operation. Again reorganise operations in conjunction with add node are safe in this respect.
There are a few things that can only be done as offline operations. For the most part these things should be avoided. It is necessary to set a configuration variable in the MySQL server to even allow those operations. The reason is that it is necessary to be aware of that an offline operation is invoked.
An offline operation requires some careful considerations. In RonDB we don’t support locking a table in the entire cluster. The only safe method to use offline operations is to ensure that we have entered a special single user mode. If you are able to control all access to the cluster you can ensure that the table isn’t updated. If you only access from MySQL servers you can use
LOCK TABLES table_name READ;
This needs to be entered in all MySQL servers in the cluster in that case.
Single User Mode#
From the RonDB management client there is a command to enter single user
mode. This is primarily intended for safe use of
statements that use the copy algorithm.
The command is
ENTER SINGLE USER MODE 55.
Thus the API node with id 55 is now the only API node allowed to send requests to the cluster.
For offline operations you first enter single user mode for the MySQL server. When starting this special MySQL server you should ensure that it is started with the node id set in the startup parameters. Thus any changes this MySQL server does while in single user mode is safe.
After completing the necessary changes you exit single user mode through
EXIT SINGLE USER MODE.
Operations that require offline operations#
Change to/from disk column
Change character set
Change column size
Change column data type
Change column memory placement
Drop column requires an offline operation.
To move a column from in-memory to disk or to move it from being a disk column to being an in-memory column requires an offline operation. This could include creating an index on the column as well when moving to in-memory.
Changing the character set of a column or of a table requires an offline operation.
Modifying a column to a new data type or change the size of a column requires an offline operation. Also moving from fixed size part to the DYNAMIC part.
All of these operations requires using the copy algorithm, thus a new table is created and row by row is copied over to the new table. When the copy is done the new table becomes the table and the old table is dropped.
Truncate table is an offline operation that will first drop the table and next the table is created again.
Online Solutions to Offline Operations#
Many users that need everything to be online often uses RonDB Replication. They have more than one cluster that is operational. These changes are often correlated to software changes. Often such a change starts by taking a backup in the cluster to change. Next the cluster is taken offline. Next the cluster software is upgraded. Next the meta data is restored again. In this restore it is possible to perform changes of column sizes, column data types, column storage, and how the columns are stored, columns can be dropped and columns and indexes can be added.
Next the data is restored again. The cluster is brought up again using binary log from the cluster that is still operational.
Next the second cluster need to be changed to the new software and new meta data. More details about how this can be achieved in detail will be provided in the chapter on Global Replication.