
RonDB as a SQL engine#

In order to understand the features of RonDB and its storage engine NDB, we will look at the SQL commands available to define tables and their supporting structures, such as tablespaces and UNDO log files, and the parts of a table such as columns, indexes and primary keys.

This is important to understand since even applications using the native RonDB APIs will query tables that have been mostly defined in SQL.

We will cover basic SQL statements, row formats, foreign keys, disk data columns and BLOB handling. We will go into some detail about the various table options that are available when creating a table in RonDB.

We will cover how various MySQL concepts such as stored procedures, views and so forth are handled in RonDB and we will cover ALTER TABLE statements.

Basic SQL statements#

Create a table#

Simplest form of table#

Creating a table happens with a normal CREATE TABLE statement in MySQL. What makes the table a RonDB table is that we add ENGINE=NDB (NDBCLUSTER is a synonym and both are case insensitive). Here is a very simple table created in SQL.

mysql> CREATE TABLE t1 (a INT) ENGINE=NDB;

This table is about as simple as a table can be in RonDB. It contains one column, a 32-bit integer. Here is the first distinction of tables in RonDB: all tables in RonDB must have a primary key. When this table is created it is created with two attributes: the first is the column a and the second is a hidden primary key, an 8-byte unique key.

Simple table with primary key#

Moving on to a bit more sophistication we create a table with a primary key.

mysql> CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=NDB;

This table contains only one column a which has a distributed hash index, the primary access method that all tables use. Here is the second distinction of tables in RonDB: a table that defines a primary key defines two indexes on the table, the first is the distributed hash index and the second is an ordered index (a T-tree data structure) that is local to each fragment replica.

By definition a primary key cannot have any columns that can have NULL values.

Simple table with only primary key hash index#

Moving to the next level of sophistication we define a table with only a primary key index, one column a and nothing more.

mysql> CREATE TABLE t1 (a INT PRIMARY KEY USING HASH)
mysql> ENGINE=NDB;

By adding USING HASH to the index definition we specify that the index will only be a hash index. The reason that normal primary key indexes also contain an ordered index is that almost everyone using a database expects an index to be ordered. Since the base primary key index in RonDB is a distributed hash index, we add another index to ensure that newcomers to RonDB get what they expect from an index.

Simple table with primary key hash index and an index#

Now we move on to a table with also an additional index and an additional column.

mysql> CREATE TABLE t1 (a INT PRIMARY KEY USING HASH, b INT, KEY (b))
mysql> ENGINE=NDB;

This adds a new column b and an ordered index on this column. This is a pure ordered index and there is no hash index involved in normal indexes.

Simple table with primary key hash index and a unique index#

Now we move on to a unique index together with a primary key index.

mysql> CREATE TABLE t1
mysql>   (a INT PRIMARY KEY USING HASH, b INT UNIQUE KEY)
mysql>   ENGINE=NDB;

Unique indexes are the third sophistication of RonDB tables. A unique index is implemented as a special unique index table that has the unique key columns as its primary key and the primary key of the base table as its remaining columns. In addition the unique key definition adds an extra ordered index on the table in the same fashion as the primary key. The reason is that the unique key index is the normal distributed hash index used for all tables. Thus it isn't an ordered index and we add one to make life easier for newcomers to RonDB.

The above definition adds the following: a table t1 with a column a and a column b, a distributed hash index on the column a and an ordered index on the column b.

In addition it adds another table t1_unique (not its real name, we simply use it here as a possible name) that has the column b as primary key using a distributed hash index and another column a.

We add internal triggers from table t1 to table t1_unique. Every time we perform an insert, a trigger fires to insert into the unique index table, and every time there is a delete, a trigger fires to perform the delete in the t1_unique table. At updates two triggers are fired that perform an insert of the new value and a delete of the old value in the unique index table.

Behind this fairly simple definition of a simple table using a primary key and a unique key we have a fairly complex index structure and trigger handling to ensure that all indexes are kept in sync with each other. The unique index is updated transactionally within the same transaction as the update of the base table.

A simple table with a generated primary key#

In MySQL a concept called auto_increment is used to generate unique keys with increasing values. For InnoDB tables the step between inserted rows is always 1. In RonDB we can have multiple APIs and multiple MySQL Servers hitting the table at the same time. We have implemented auto_increment as one row in a special system table. When we get the next key from this table we get a range of keys. By default this range is set to 1 key. This is configurable through the MySQL Server configuration option ndb-autoincrement-prefetch-sz.

The default is set to ensure that people used to MySQL aren't surprised by the behaviour of RonDB. However for any real RonDB application it is quite likely that this value should be set to at least 32 to ensure that the updates of the row used to get a new unique key don't become a bottleneck. If the insert rate on the table is beyond a thousand inserts per second (or even lower if latency between nodes in the cluster is high) this value must be increased to avoid running into a bottleneck when creating unique keys. Setting the value to 1024 means that millions of inserts per second can be sustained on the table. It can be set to at most 65535.
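
As a hedged illustration (assuming the option is set in the [mysqld] section of the MySQL Server configuration file), an application with a high insert rate could increase the prefetch range like this:

[mysqld]
ndb-autoincrement-prefetch-sz=1024

Each MySQL Server then reserves 1024 keys at a time from the system table, so fetching new auto_increment values is no longer a bottleneck.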

Here is the syntax used to create a table with autoincrement on the primary key.

mysql> CREATE TABLE t1 (a INT PRIMARY KEY USING HASH AUTO_INCREMENT)
mysql> ENGINE=NDB;

More complex tables#

Building on these simple examples we can move on to very complex tables. The limits are that we can have at most 511 columns in a table, the sum of the sizes of the fixed size columns must be less than 8052 bytes, and the maximum row size over all columns is 30000 bytes. BLOBs are treated a bit specially, we'll cover those later in this chapter. We can have up to 64 indexes per table as well.

RonDB can store any field type available in MySQL and using any character set available in MySQL. A word of caution here is to remember that complex character sets use more than one byte per character.

The syntax used is the same syntax used for any tables in MySQL. We will describe later some specialisation available to RonDB tables.

Tables in RonDB can have GIS columns, but cannot have GIS indexes or fulltext indexes. RonDB tables can have JSON columns. It is possible to index stored generated columns in RonDB, but it is not possible to have an index on a virtual generated column.
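
As a small sketch of the last point (the table and column names are hypothetical), an index can be placed on a STORED generated column, for example one extracting a value from a JSON column, whereas the same index on a VIRTUAL column would be rejected:

mysql> CREATE TABLE t1 (a INT PRIMARY KEY,
mysql>   doc JSON,
mysql>   b INT GENERATED ALWAYS AS (JSON_EXTRACT(doc, '$.b')) STORED,
mysql>   KEY (b))
mysql> ENGINE=NDB;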

Altering a table#

Once a table has been created we can use ALTER TABLE to modify it. Some changes to the table require a copying alter table statement and some can be done as online alter table statements.

Online alter table statements#

Our aim for RonDB is that a table should always be available for reads and writes after its creation.

There is one caveat to this. Due to the internal software structure of the MySQL Server, it isn't possible to write to a table from the same MySQL Server in which the change is executing. What this means is that in order to create a highly available cluster it is a good idea to have a special MySQL Server used to execute online ALTER TABLE statements.

The RonDB data nodes allow writes to be performed on a table while indexes are added, columns are added or the table partitioning is changed. However the MySQL Server where this change is performed only allows reads of this table. This is due to internal locking in the MySQL Server that is hard to get around. The MySQL Server assumes that even online operations are not truly online. To execute truly online ALTER TABLE statements, these should be executed from a MySQL Server purely used for metadata statements.

RonDB is actually more online than what one MySQL Server can be. Other MySQL Servers can thus continue to operate on the table using reads and writes since these MySQL Servers are not aware of the locks inside the MySQL Server that performs the change.

The following things can be changed as online alter table statements.

  1. Add an ordered index

  2. Add a unique index

  3. Drop an ordered index

  4. Drop a unique index

  5. Add a new column

  6. Add a foreign key

  7. Drop a foreign key

  8. Reorganise table to use new node groups

  9. Add partitions to a table and reorganise

  10. Set/Reset the Read Backup feature on the table

  11. Rename table

Things that specifically aren't supported as online alter table statements are:

  1. Remove a primary key

  2. Change primary key columns

  3. Change columns

  4. Drop column

Primary keys are the vehicle we use to control all online changes to a table. Changing the primary key of an existing table is not possible. Drop column could be implemented, but it would require tables to use a very dynamic data structure that has a high CPU overhead.

It is likely that even more ALTER TABLE statements will be possible to perform as online alter table statements as development of RonDB moves on.

The syntax used to ensure that a statement is only performed if it can be executed as an online alter table statement is the following:

mysql> ALTER TABLE t1 algorithm=inplace, add c int

In this case we want to add a new column c to a table t1. By adding the algorithm=inplace syntax we ensure that the operation will only be successful if we can perform it as an online alter table statement.

Another method is to set the configuration option ndb-allow-copying-alter-table to 0; this disallows copying alter table and fails any ALTER TABLE that attempts to use one. By default this option is set to 1.

Copying alter table statements#

In some cases we want the table recreated as a new table and the old table data to be moved to the new table. In this case the syntax used is the following:

mysql> ALTER TABLE t1 algorithm=copy, add c int

In this case the change will be performed as a copying alter table statement even if an online alter table statement is possible.

It is possible to set the configuration option ndb-use-copying-alter-table to 1 to get the same effect as providing the syntax algorithm=copy. By default this option is set to 0.

Querying a table#

Querying a RonDB table uses normal SELECT statements. Here we focus on the syntactical parts of SELECT queries; there is nothing special about SELECT queries for RonDB. They use the same syntax as for any other table in MySQL.

SELECT COUNT(*) queries#

One specific variant of queries has special treatment in RonDB. This is the SELECT COUNT(*) query on a table. In this case we want to know the number of records in the table. Actually there is no method to get an exact count since we cannot lock the entire table in the entire cluster. It is possible to lock a table in one MySQL Server but this doesn't prevent other MySQL Servers or API nodes from updating the table.

There are two methods to get the approximate row count of the table. The first is the optimised one, which uses a local counter in each fragment to keep track of the number of rows in a partition of the table. This method is quick and provides a good estimate. The other method is to scan the entire table and count the rows. This method is comparatively very slow.

The optimised method is the default; to use the slow method one can set the MySQL Server configuration option ndb-use-exact-count to 1.
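
A hedged example (assuming the option is also exposed as the session variable ndb_use_exact_count, as MySQL Server options for NDB generally are):

mysql> SET SESSION ndb_use_exact_count = 1;
mysql> SELECT COUNT(*) FROM t1;

With the variable set to 1 the count is produced by scanning the table; with the default 0 the per-fragment row counters are used instead.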

Configuration options that affect SELECT queries#

The configuration option ndb-join-pushdown set to 1 means that we will attempt to push joins down to the RonDB data nodes for execution and setting it to 0 means that we will not even attempt to push down joins to the RonDB data nodes. It defaults to 1. More on pushdown join in a later chapter.

The option ndb-force-send can be set to 1 to ensure that we attempt to send immediately to the RonDB data nodes when we have something ready for sending in our connection thread. If not, we give the responsibility to the sender thread, or send based on how much data is waiting to be sent and how many threads have attempted to send before us. This adaptive send method can sometimes provide better throughput but can increase the latency of some queries. It is set to 0 by default.

The option ndb-batch-size bounds how many bytes are allowed to be sent to the MySQL Server node as part of a scan query before we return to the MySQL Server to handle the returned rows. By default it is set to 32 kBytes.

ndb-index-stat-enable enables use of index statistics. It is on by default.

The configuration option ndb-cluster-connection-pool sets the number of API node connections (cluster connections) that the MySQL Server will use. By default it is set to 4 if you have used the managed version of RonDB or installed through the cloud script. If you installed RonDB yourself the default will be 1.

For larger MySQL Servers that go beyond 8-12 CPUs it is a good idea to use multiple API node connections per MySQL Server to make the MySQL Server scale to more CPUs. The connection threads will use the different cluster connections in a round robin scheme.
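
A hedged configuration sketch for such a larger MySQL Server (assuming the [mysqld] section of the MySQL Server configuration file):

[mysqld]
ndb-cluster-connection-pool=4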

Writing to a table#

There are many different ways of writing to a table in MySQL such as INSERT, DELETE and UPDATE. MySQL also supports REPLACE, INSERT DELAYED and a few other variants.

These statements use the same syntax for RonDB tables as they do for normal MySQL tables. There are a few configuration options that can be used to change the behaviour of those writing SQL statements.

The configuration option ndb-use-transactions is specifically designed for large insert statements. Given that large transactions might be too big for RonDB to handle, those statements can be split into multiple transactions. RonDB can handle several thousand row updates (the exact number is configurable) per transaction, but for extremely large insert statements (such as LOAD DATA ... INFILE) that load an entire file in one statement it is a good idea to split those extremely large transactions into more moderately sized transactions. When this option is set to 1 (the default) such large insert statements will be split into multiple transactions. LOAD DATA INFILE will always be split into multiple transactions.
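
As a sketch (the file path is hypothetical), a bulk load like the following is automatically split into moderately sized transactions by the NDB handler:

mysql> LOAD DATA INFILE '/path/to/large_file.csv'
mysql>   INTO TABLE t1
mysql>   FIELDS TERMINATED BY ',';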

Transactions#

Everything in RonDB is performed using transactions of varying sizes. In MySQL one can run in two modes: the first is autocommit mode, which means that each SQL statement is treated as a transaction.

The second variant is to enclose all statements belonging to a transaction by BEGIN and COMMIT (or ROLLBACK if it is necessary to roll back the transaction).

RonDB supports using autocommit and the BEGIN, COMMIT and ROLLBACK handling in MySQL. NDB doesn't support using XA transactions. NDB doesn't support savepoints.
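
A minimal sketch of the two modes, assuming the table t1 from the earlier examples:

mysql> INSERT INTO t1 (a) VALUES (1); -- autocommit: one statement, one transaction

mysql> BEGIN;
mysql> INSERT INTO t1 (a) VALUES (2);
mysql> INSERT INTO t1 (a) VALUES (3);
mysql> COMMIT; -- both inserts commit atomically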

Drop/Truncate Table#

A table can be dropped like any normal MySQL table. Tables are dropped in a graceful manner where running queries are allowed to complete before the table is dropped.

No special syntax is used for dropping tables in RonDB.

MySQL supports dropping all tables in a database by dropping the database where the tables reside. This is supported by RonDB.

Indexes#

RonDB supports three different types of indexes. Normal primary key indexes consist of both a distributed hash index and an ordered index on each fragment replica. Unique key indexes are implemented as a separate table with the unique key as primary key and the base table's primary key as columns in the unique index table (these also have an ordered index on the unique key). All normal indexes are ordered indexes.

Indexed columns in RonDB must be stored in main memory, columns that are stored on disk using a page cache cannot be indexed. All tables must have a main memory part and at least one primary key hash index.

It is possible to define a pure hash index for primary keys and unique keys by using the keyword USING HASH in SQL.

RonDB supports adding and dropping indexes on a table as an online alter table statement. This statement can either be an ALTER TABLE statement or a CREATE INDEX statement.
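
A sketch of both forms (the index name b_idx is hypothetical and assumes the t1 table with a column b from the earlier examples):

mysql> CREATE INDEX b_idx ON t1 (b) ALGORITHM=INPLACE;
mysql> ALTER TABLE t1 ALGORITHM=INPLACE, DROP INDEX b_idx;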

Index sizes#

In many DBMSs the size of an index is dependent on the number of fields in the index. This is NOT true for RonDB. In RonDB the size of a distributed hash index and an ordered index is a constant overhead. The reason is that we only store a row id reference in the index and not any values at all.

In the hash index we store one word per row that is used for some bits and also used to find the lock owner when we have ongoing transactions on the row. There is also one additional word with a row page id. Each row entry is thus 8 bytes. In addition the hash index has some overhead in how those 8 bytes are found, which adds a few more bytes to the overhead. There is a container header of two words for each container and there are around 5-10 entries per container. There will also be some free space between containers since those are stored in a number of fixed places and the placement into the pages is more or less random based on the hash of the key. One should account for around 15 bytes of overhead per row in the hash index. One important part of the hash index is that we store a part of the hash in the index. This makes it possible to skip rows that don't match more quickly than comparing the full primary key would.

hash(PartitionKey) % Number of Hash maps = distribution_id
distribution_info(distribution_id) -> Set of nodes storing replicas
hash(PrimaryKey) >> 6 = Hash index page id
hash(PrimaryKey) & 63 = Hash index page index
tree_lookup(Hash index page id) -> Physical page reference
get_mem_address(physical_page, page index) -> Memory address to container

In the box above we show how the partition key is hashed to find the hash map to use for the row. The hash map is used to translate into the partition id, which gives the set of currently alive nodes storing the replicas for this row.

When we get to the node we use the hash of the primary key (often the same as the partition key unless the table was created with a specific partition key). We use part of the hash to find the page id where the bucket of the key element is stored. A second part is used to find the page index in the page. The page and page index will point to the first container in the hash bucket.

Each bucket consists of a linked list of containers, normally there is only one container, but there is no specific limit to how many containers there can be.

The container header contains references to the next container in the bucket if one exists and a few other bits. Each container contains anywhere between zero and eleven key elements. Each key element contains a reference to the row and a part of the hash of the key and a few more bits. A key element is 8 bytes in size. Most of this stores the row id, but there are also scan bits and hash bits and some other bits.

The hash pages are 8 kBytes in size. If we cannot store any more containers in the same page as the linked list we can use overflow pages that only store overflow containers.

Given that the containers can be wildly different in sizes we have implemented a variant where we can have containers both starting from the left and from the right. Thus we can have 2 containers in each container entry in the page. Each container entry is 112 bytes. The sum of the sizes of the containers in the container entry is always 112 bytes or less.

At the bottom of each page we have eight overflow containers, these can also be filled from both ends.

The overhead of the hash index is around 15 bytes per row. A significant part of this comes from the element and container overhead. The rest is empty unused slots and memory areas that come from the random nature of the hash index.

The hash index is used in all primary key lookups and unique index lookups. It is used for full table scans and a number of scans that are part of metadata changes.


The ordered index is implemented as a T-Tree. A T-tree is similar to a balanced binary tree. A balanced binary tree stores the pointers to rows only in the leaf nodes. The T-tree stores pointers to rows in all index pages. The T-tree index uses mini-pages that are 256 bytes in size (these mini-pages are stored in the DataMemory).

Each such mini-page contains a header. It contains a reference to another mini-page for those keys that are smaller than the left pointer and a reference to another mini-page for those keys that are larger than the right pointer. The pointers are row ids that contain the boundary values.

For those keys that are between the left and right the reference to the row is found among the up to 22 references in the mini-page. The search within those 22 references uses a binary search algorithm.

In addition the mini-page also stores a subset of the left and right key values. These are used to avoid having to go and read the rows for the most part. This subset of the keys is an important optimisation of the index operation.

These mini-pages are stored in a binary tree structure that is kept balanced. When we are searching the index and we come to a mini-page we have three choices: go left, stay or go right. The most common is to go left or right since it is only at the end of the search that we stay in the same index mini-page.

The benefit of the T-tree structure compared to a binary tree is that it saves memory in that only 2 pointers are needed per 22 entries. It also brings the benefit that we can avoid going to the actual rows in many cases where it is sufficient to use the subset of the key to decide whether to move left, right or stay.

The overhead of an ordered index stays at around 10 bytes per row. Most of this memory is used for the row reference.

The main use case for the ordered index is to perform range scans.


Foreign Keys#

RonDB supports foreign keys between tables. More on this in a separate chapter on RonDB foreign keys.

Optimize/Analyze Table#

The OPTIMIZE TABLE command can be used on RonDB tables to remove fragmentation in the variable sized and dynamically formatted parts of rows. It cannot defragment the fixed size part in memory since this would change the row id of a row, and a stable row id is a very important part of the recovery algorithms in RonDB.

The ANALYZE TABLE command is used to build up the index statistics on a RonDB table. It can generate large speedups on some SELECT queries.

BLOBs#

RonDB supports BLOBs of any size. We will describe this in more detail in a specific chapter on BLOBs for RonDB.

Row format of RonDB tables#

RonDB supports a few row formats for its tables. We will describe those and the effect of them in the next chapter.

Disk Data Columns#

RonDB supports storing non-indexed columns on disk. We will describe this in more detail in a separate chapter.

MySQL Replication of RonDB tables#

RonDB supports using MySQL Replication in a fairly advanced manner. We'll cover this in a specific part of this text.

RonDB Row Data structure#

A database is about storing data for easy accessibility. Most DBMSs have some special tricks to store data and indexes in a format suitable for the application categories that the DBMS focuses on. RonDB is no different; we have heavily optimised index data structures for primary key access. We have also added a standard index for ordered access to data.

Much of the source code in RonDB is focused on the distribution aspect and reliability aspect, so we probably focus less on data structures than many other DBMSs do. In this chapter we will go through the data structures we use for rows, hash indexes and ordered indexes.

Row format#

RonDB stores rows in two different formats: FIXED format and DYNAMIC format. The term FIXED format is somewhat of a misnomer. It means that fixed size columns are stored in the fixed size part of the row and variable sized columns are stored in the variable sized part of the row.

To understand the concepts here we will introduce the data structure of the rows in RonDB. Every row in RonDB has a fixed size part. Most tables also have a variable sized part. All tables with disk columns in addition have a fixed size disk data part.

The fixed size part of the row contains a row header and all the fixed size columns that are using the FIXED format.

The variable sized part contains all columns that have a variable size. The variable sized part has two parts. It has one part for variable sized columns, which always store at least a length field. It has a second part where all columns using the DYNAMIC format are stored.

Columns stored in the dynamic format will not be stored at all if they are NULL or have the default value. This means that new columns with dynamic format can be added to a table without any need to change the row storage at all. This is why we can support add column as an online operation. Columns using the dynamic format have a bit more overhead when they are stored and the execution overhead is a bit higher as well for them (around 40% more CPU is used in retrieving data from row storage; this is only a small part of the execution time for a row operation except when performing scans that touch very many rows).
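
A sketch of such an online add column, assuming the table t1 from the earlier examples and using the COLUMN_FORMAT attribute described below:

mysql> ALTER TABLE t1 ALGORITHM=INPLACE,
mysql>   ADD COLUMN c INT COLUMN_FORMAT DYNAMIC;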

There is one more reason to use the dynamic format: the row storage for those columns can be compacted using OPTIMIZE TABLE. Columns that are stored in the fixed row part cannot be moved around and thus cannot be compacted.

The disk columns are always stored in a fixed format; this row part is always the same size for all rows of a table. Disk columns can thus not be added and dropped as online operations.


The row header contains the following information:

The first word is used to refer to the operation currently owning the record. This operation might be part of a linked list of operations that have operated on the row in the same transaction. If the row isn't currently involved in a transaction the first word contains the special value RNIL which is a variant of NULL that is equal to 0xFFFFFF00. Internally in RonDB we use relative pointers, thus 0 is a valid pointer. Thus we had to use a special NULL value and this value is RNIL. It is heavily used in the RonDB code.

The second word contains the tuple header bits. 16 bits are used as a version number which is used by the ordered index to keep track of which row version it is. The other 16 bits are various indicators for the row such as if it has a variable sized part, if it has a disk part, if it has grown in the transaction, some bits referring to local checkpoints and so forth.

The third word contains a checksum of the fixed size part of the row.

The fourth word contains the global checkpoint identifier of the last update of the row. This is an important part used in node recovery to know which rows need to be synchronised in the starting node.

The fifth word contains NULL bits. If no NULLable columns exist in the table this word doesn't exist. Each column that is NULLable has one bit in the NULL bits; if more than 32 NULLable columns exist in the table there will be more than one NULL bit word.

Finally we have the reference to the variable sized part of the row and the disk part of the row. Each of those is 8 bytes in size to accommodate large main memory sizes and large disk data sizes.


The base header that will be there for all tables is 16 bytes in size. With NULLable columns in the table there will be at least one NULL bit word, extending the size to 20 bytes. Almost every table uses a variable sized part, extending the header size to 28 bytes. If the table uses a disk part the row header extends to 36 bytes.

The most common tables will have a 28 byte row header size (NULLable columns exist, VARCHAR fields exist and there are no disk columns).

The fixed size part of the row is fixed in size at table creation. It doesn't need to contain any columns at all, but it will always contain the row header. A table always has a fixed size part since this is the entry point to the row. The fixed size part is addressable using a row id. This row id stays constant since a tuple is not allowed to move, and all replicas of the row must use the same row id. At insert the primary replica assigns the row id and this is sent to the backup replica to ensure that it uses the same row id. This is an important requirement for the synchronisation phase of a node restart.

The fixed size part is stored in fixed size pages.


The variable sized part of the row is stored in a variable sized page. Given that it is variable in size we address it through a page directory starting at the end of the page. This page directory contains one word per row: 2 bytes are the index where the row part starts in the page and the other 2 bytes are the length of the variable sized part of the row.

Now the variable sized part contains two parts. The first is the variable sized columns. Each such column has 2 bytes that contain the offset of the column inside the variable sized part. The length of the column can be calculated using the next 2 bytes (the offset of the next column). The start of the column can thus be quickly found.

This means that all variable sized columns will always use at least 2 bytes, but no more than this is needed. NULL bits for variable sized columns are stored in the fixed size part of the row.

We also have a dynamic part; in this part of the row the absence of a column means that its value is NULL. Thus NULL columns consume no space. This means that it is very easy to add a new column in the dynamic part of the row; it needs to be added with NULL or a default value as its starting value. The dynamic part is stored at the end of the variable sized part of the row.

The variable sized part also contains a reference back to the fixed part to ensure that anyone can move the variable sized part when necessary.


The variable sized part is stored in variable sized pages. These pages are addressed using an index at the end of the page. Each index entry is 32 bits in size: 16 bits are the size of the entry and 16 bits are a pointer inside the page to the row part. As mentioned, the first 8 bytes in the variable sized part are an 8 byte reference to the fixed row part. The overhead of using a variable sized part is 12 bytes in the variable sized row part in addition to the 8 bytes in the fixed size part, thus in total 20 bytes of overhead per row.

As described above there is also a 2 byte overhead for each variable sized column to store the length of the column.

The disk format is fixed in size. All columns are stored using their maximum size.


Fixed row format#

Using the fixed row format means that the column (or all columns if it is applied on the table) is stored in either the fixed size part of the row or in the variable sized part of the row.

Thus the storage for the column is always present, even if set to NULL.

There are several ways to ensure that a column is stored in FIXED format. First there is a configuration variable that sets the default column format. This is the MySQL Server configuration option ndb-default-column-format. It can be set to FIXED or DYNAMIC.

By default it is set to FIXED, so normal tables will use the fixed size row format. One can also set the property column_format on a column when creating the table or when altering the table. Here is an example of this.

mysql> CREATE TABLE t1 (a int column_format FIXED, b int)
mysql> engine=ndbcluster;

Dynamic row format#

The DYNAMIC row format means that the column is stored in the dynamic part of the row. This means that it is easy to add new columns in this part. This column format has a much higher overhead in both storage and in the computing necessary to read and write it. The cost of the actual read/write part of the transaction increases by around 40%. However for a primary key lookup this part is such a small part of the total cost that the overhead decreases to much less than 5%, whereas a full table scan that is more or less entirely about reading rows will see this 40% drop in performance. An example of this is provided below.

mysql> CREATE TABLE t1 (a int column_format DYNAMIC, b int)
mysql> engine=ndbcluster;

Foreign Key handling#

Foreign keys are a concept in SQL that enables the application to create constraints on the data in the tables.

A foreign key is defined on the table that is referred to as the child table. The parent table is the referenced table. The parent table must be created before the child table can create a foreign key referencing it.

The idea with a foreign key is that the child table has a reference that uniquely identifies a row in the parent table. The reference consists of one or more columns; normally these columns are simply the same columns as the primary key of the parent table.

We will call the columns in the child table the reference columns and the columns in the parent table the referenced columns.

Here is a simple example of a foreign key constraint.

mysql> CREATE TABLE tab_parent (a int primary key) engine NDB;
mysql> CREATE TABLE tab_child (a int primary key, b int,
mysql>   INDEX (b),
mysql>   FOREIGN KEY (b) REFERENCES tab_parent(a)
mysql>   ON UPDATE SET NULL
mysql>   ON DELETE SET NULL)
mysql>   ENGINE NDB;

The first table in this example, the parent table, has only one column, the primary key column a; this is the referenced column. The second table, the child table, has a reference to the parent table in column b, the referencing column, and it has its own primary key column a.

It is required that the child table has an index defined on the referencing columns. This is required to ensure that the foreign key actions can be executed efficiently. Note that there can be multiple rows in the child table referring to the same row in the parent table, hence the naming parent-child. There is only one parent row, but there can be multiple child rows referring to the parent.

The parent table must have a primary key or a unique key defined on the referenced columns.

In InnoDB it is sufficient to have an index defined on the referenced columns in the parent table. But again the reference should refer to one and only one row in the parent table, so for all practical use cases the parent table should have a primary key or a unique key defined on the referenced columns (also true for InnoDB).

The foreign key constraint is either checked immediately when the change is done or checked in a special pre-commit phase before the actual commit starts, but after all modifications have been prepared in the transaction. This is done through a read on the parent table, if the read fails the transaction is aborted.

If the referencing columns are set to NULL no trigger will be fired. The trigger fires already in the prepare phase; the transaction coordinator either executes it immediately or stores the trigger until the pre-commit phase, and when the pre-commit phase starts it sends off all stored pre-commit triggers.

Thus the child table has a trigger defined on changes of the referencing columns. The trigger fires on all INSERTs and on all UPDATEs that change the referencing columns. A DELETE cannot cause any inconsistency in this case, so there is no need to trigger on DELETEs.

The updates on the referencing columns and inserts into the child table are checked immediately except when we are using ON UPDATE NO ACTION. In the NO ACTION case the check is deferred until the pre-commit phase. In the pre-commit phase we check the final committed value and ignore any intermediate states we have passed through during the execution of the transaction.

The handling of NO ACTION differs from InnoDB. The handling in InnoDB for NO ACTION is the same as for RESTRICT. In RonDB we decided to follow many other DBMSs that use NO ACTION to indicate that checks are deferred to commit time.

Now if the row in the parent is deleted or if the referenced columns are updated it will affect the child row as well. Here we have a number of different options of what to do. We can CASCADE the changes to the child table such that they continue to refer to the correct row. We can SET NULL on the referencing columns (only if setting those to NULL is allowed). We can RESTRICT the change (the transaction will be aborted).

We define this when we define the table. The default is RESTRICT. We define ON DELETE to specify what to do when a row in the parent table is deleted. We specify ON UPDATE to specify what happens when we change the referenced columns, when we update the referencing columns in the child table and when we insert into the child table. For the child table the only option that differs is NO ACTION, which defers the check to the pre-commit phase. All other options mean that changes in the child table are immediately verified.

When CASCADE is used we rely on foreign key handling to remove the rows in the child table when a delete occurs in the parent table.

CASCADE is potentially a very dangerous mode. One delete could delete an entire database since the cascading delete could lead to a new delete which in turn leads to a new cascading delete and so forth. If this option is used it is important that the application developer has very good control over the changes.

ON UPDATE CASCADE is only supported if the referenced columns form a unique key. It will be rejected if the referenced columns form a primary key. The reason is that it isn't possible to update the primary key in RonDB, an update of the primary key is performed by first dropping the row and then reinserting the row.

SET NULL can trigger a fair amount of updates, but in this case the change at least cannot cascade into even more tables. Setting the referencing columns to NULL in the foreign key trigger ensures that the tables at least have no incorrect references.

RESTRICT is used when the application expects to take care of any changes itself, ensuring that it starts with a consistent database and ends with a consistent database, but the application developer wants the MySQL Server to verify that the changes are consistent. This is in my view the proper usage of foreign keys in an application with requirements on predictable latency.

NO ACTION is the same as RESTRICT except that for RESTRICT the check of the foreign key relation is immediate whereas for NO ACTION it is deferred to commit time. This means that using RESTRICT you should always drop or change the child rows before the parent rows. Using NO ACTION it is sufficient to ensure that the entire transaction is consistent; the order of changes does not matter.
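
A sketch of the difference, assuming a variant of the earlier tab_parent/tab_child example where the foreign key was defined with ON DELETE NO ACTION:

mysql> BEGIN;
mysql> DELETE FROM tab_parent WHERE a = 1; -- allowed, check is deferred
mysql> DELETE FROM tab_child WHERE b = 1;  -- remove the referencing rows
mysql> COMMIT;                             -- deferred check passes at commit

With ON DELETE RESTRICT the first DELETE would have been rejected immediately, so the child rows would have to be deleted before the parent row.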

In RonDB, foreign keys are checked and verified in the RonDB data nodes. Thus if you define a foreign key on a table, this foreign key relation will be checked and maintained by transactions performed from applications using native RonDB API applications as well as SQL applications.

This means that updating the primary key in a parent table where the referenced columns form the primary key is only possible using the SET NULL option. For RESTRICT and NO ACTION the DELETE trigger and INSERT trigger will ensure that the update of the primary key is disallowed, and ON UPDATE CASCADE on a primary key is not even allowed.

Foreign keys have a fairly high overhead in that all changes require checks and those checks extend the time for the transactions. They are seldom used for very time-critical applications. They are a great aid in quickly developing new applications that need to maintain consistency of the database.

Most applications using foreign keys in InnoDB should transfer easily into RonDB since for foreign keys the differences between InnoDB and RonDB are very small.

Foreign keys can be created when creating the table. Foreign keys can be created using the ALTER TABLE command. They can also be dropped as part of the ALTER TABLE command. Both creation and dropping of a foreign key are online operations that can happen while the table is updated.
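
A sketch of both operations (the constraint name fk_child_parent is hypothetical and the tables are those from the earlier example):

mysql> ALTER TABLE tab_child ALGORITHM=INPLACE,
mysql>   ADD CONSTRAINT fk_child_parent
mysql>   FOREIGN KEY (b) REFERENCES tab_parent(a);
mysql> ALTER TABLE tab_child ALGORITHM=INPLACE,
mysql>   DROP FOREIGN KEY fk_child_parent;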

Dropping a table while it is involved in a foreign key relation is not allowed since that would be a breach of the foreign key constraint.

InnoDB supports a special parameter called foreign_key_checks that disables all foreign key checks if set to 0. This has no effect on RonDB. Once a foreign key has been defined in RonDB it will perform the foreign key checks. Foreign keys can be created and dropped as online operations in RonDB; this removes the need to use this parameter.

Foreign keys are well integrated with RonDB Replication. Even if the slave side doesn't have the foreign keys defined on the tables, the triggered actions will still be performed on the slave side since all row operations that are part of a transaction are sent to the binlog. The binlog stores all row changes for a transaction and this means that the slave side does not need to understand how those row changes came about.

Foreign key relations can only exist between tables using the same storage engine, thus a foreign key on a RonDB table must refer to another RonDB table.

Foreign key considerations#

Given that one focus of RonDB is real-time applications it is important to consider how to use foreign keys. Foreign keys are a strong concept to support application constraints. Foreign keys use triggers to ensure that any inserts, updates and deletes leave the database with its constraints still maintained.

Foreign keys can create problems in three ways. The first is that a delete can trigger a delete of a record that in turn triggers another delete and so forth (called cascading deletes). Thus transaction sizes even for a single row change can grow to any size. Often implementing application functionality directly using foreign keys is easy, but it requires that one analyses the data usage to ensure that transactions don't become so big that they create problems for the real-time aspects of RonDB.

The preferred options to avoid these large transactions are RESTRICT and NO ACTION.

The second problem is when creating the foreign key on a live table. If we create the foreign key in a system that is constantly changing data, we cannot be absolutely sure that the foreign key constraints were valid at the point the foreign key was created. We worked hard on ensuring that building a foreign key is done correctly, such that if the data is consistent the foreign key constraints should be ok when the foreign key is created.

The third potential problem is when dropping a foreign key. In this case we drop the triggers at some point in time; after this drop we no longer maintain the foreign key relation. Such a drop is not possible to serialise against all other parallel changes in a distributed database without a global cluster lock, which isn't desirable in a real-time database that is supposed to be AlwaysOn.

The recommended manner to use foreign keys is to create them when creating the tables, drop them when dropping the tables and ensure that cascading actions don't cause extremely large transactions.

We work hard to ensure that both create and drop of foreign keys can be done safely even when performed in parallel with application changes to the data.

BLOB handling#

BLOBs are treated in a special manner in RonDB. The reason is that RonDB is a real-time engine. This means that working with large BLOBs isn't desirable for the RonDB data nodes. Working with large objects would create challenges around buffer management. A single read of a row would have to be scheduled in several steps.

To avoid these problems we split the BLOB into a number of parts and work with one part at a time.

Currently RonDB data nodes don't support storing an entire BLOB as part of the original row. A large BLOB must be split into multiple parts and these parts are stored in different rows.

A BLOB (or TEXT) field creates a VARCHAR column that can be up to 256 bytes in the table itself. If the BLOB is smaller, no more parts are needed. If the BLOB is larger, the BLOB is split such that the first 256 bytes are stored in the table itself and the rest is stored in a separate table for the BLOB column. Each row in this table contains a VARCHAR that can store up to 2 kBytes of the BLOB. For example, a BLOB of 12.8 kBytes will be stored partly in the table itself (256 bytes); there will be 6 rows that use the entire 2 kBytes and a final row that stores the remainder, which is about 0.55 kBytes in size. These 7 rows are stored in a special table that is only used for this BLOB column. Each BLOB column adds one more table in RonDB. It is important to remember this when considering the amount of metadata RonDB has to store.

One reason for storing the extra rows in a separate table is to make it easier to add and drop BLOB columns. Adding a BLOB column can be made as an online operation by adding the VARCHAR column to the base table using the dynamic row format and creating a new table for the BLOB parts.

The part rows use the same partitioning key as the base table and partly the same primary key as the base table, with an added column for the part id. Thus the BLOB rows will be stored in the same places since the BLOB tables use the same partitioning scheme.

Some of the problems of BLOBs come from load regulation. If we store a BLOB that is 100 MBytes in size, writing or reading the entire BLOB as one operation would put an enormous load on the network for a short time, most likely stopping many other operations from progressing. To avoid this we definitely want to send a large BLOB over the network in parts, even if at some point the whole BLOB is stored inside the table row.

Using BLOBs for fields that normally aren't very large is not the optimal solution in RonDB. For example, if the majority of the BLOBs are about 1 kByte in size it makes sense to create a VARCHAR(2048) or something like that instead.
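
A sketch of that recommendation (the table and column names are hypothetical; remember that with a multi-byte character set a VARCHAR(2048) can use considerably more than 2048 bytes):

mysql> CREATE TABLE t_comments (
mysql>   id INT PRIMARY KEY,
mysql>   comment_text VARCHAR(2048))
mysql> ENGINE=NDB;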

However for storing objects that can be fairly large the BLOB implementation is quite ok. We ensure that reading and writing of BLOBs is split into many operations to ensure that the network isn't overloaded by single BLOB operations.

Concurrency control of BLOB columns is a bit special due to this split of the column, this will be covered in the coming chapter on Concurrency Control.

When executing a transaction that changes a BLOB it is important to note that one operation record per BLOB part is needed; in the example above with a BLOB of 12.8 kBytes in size there are 7 additional operation records when writing the BLOB. In MySQL 5.7 all BLOB operations from SQL either write or read the entire object.

A natural choice for many BLOB columns is to store them as disk data columns. The main consideration here is that each extra BLOB part row has to store its primary key part in main memory; thus the main memory used for each BLOB part is about 50-60 bytes depending on the size of the primary key. If we need to store about 1 TByte of BLOBs we still need about 30 GBytes of main memory storage.

Disk columns#

Disk columns are columns that are stored on disk and are only cached in main memory when reading and writing them. Columns stored in main memory are pinned to a main memory location at startup and stay there as long as the node is up and running. Both the main memory columns and the disk columns are checkpointed and written to disk to be recoverable.

The main reason we decided to implement disk columns was to extend the size of databases that we can support. We have so far decided not to spend time on implementing disk-based indexes. The reason is that our focus has been on supporting applications that require predictable latency for queries and updates while maintaining very high availability. While disk-based indexes would extend our reach to new users, they would not add to our unique selling points, which are very high availability and predictable response times in a strongly consistent distributed environment with high throughput.

With the development of persistent memory, fast NVMe drives and SSD drives the picture is quickly changing. The use of persistent memory modules that came out in 2018 with Intel Cascade Lake makes it normal for a server to have 1 TByte of memory that can be used by RonDB.

The development of SSDs and NVMe drives plays very well into our unique selling points. Given that they can respond in tens of microseconds and can sustain hundreds of thousands of IO operations per second (IOPS) we can use them as main storage for many more columns.

Cloud machines like the one provided through the Oracle Cloud are perfect fits for running large RonDB installations with very low and predictable latency as well as very high availability using our new features for making use of the availability domains in the cloud. These machines come equipped with 768 GByte of memory and tens of TBytes of NVMe drives.

It is strongly recommended that all tablespaces are created on SSD devices or NVMe devices. The reason is simply that this brings a thousand times better throughput and a hundred times better latency compared to using hard drives. Access to disk for disk data is always one page at a time (currently 32 kBytes in size), thus access to the disk is always random access. Random access to hard drives is limited to around 200 accesses per second and this is very unlikely to be sufficient for normal RonDB applications. For an SSD it is possible to achieve tens of thousands of random accesses per second and for NVMe devices it is even possible to achieve hundreds of thousands of random accesses per second. Clearly one should avoid using hard drives for disk columns in production usage.

Hard drives are still an ok fit for REDO logs, UNDO logs, backups and local checkpoint files. The reason is that those are strictly written and read using sequential access. Thus the hard drives can sustain their full disk bandwidth. Hard drives are also less likely to wear out due to constant writing.

For larger systems where write speeds beyond 100 MBytes per second are needed, SSDs are needed to be able to sustain the high write speeds for logs, backups and checkpoints.

Many applications store fairly large chunks of data in BLOBs and in variable sized character strings. There could also be large amounts of statistical integer data stored in the database. Often these columns are not indexed; such columns represent natural use cases for disk columns in RonDB.

One remaining limitation of disk columns is that it isn't possible to add new disk columns as an online alter table statement. It is also not possible to move a column from disk to main memory in an online alter table statement.

Row format for disk columns#

The main memory row has a reference to the disk data part of the row. This reference is kept consistent by the recovery algorithms such that the row parts stay connected even after recovery. The disk part also has a back reference to the main memory row. The reason for this is to ensure that we can perform scans in disk order. Thus we read the pages on disk and find the main memory part rather than the opposite (this improves sequential reading of the disk).

The disk data part of the row currently always has a fixed size. All fields, including the variable sized fields, are stored as fixed size columns with the column size set to the maximum possible size of the column.

Tablespaces#

The actual data in a disk column is stored in a tablespace. Each table that has disk columns uses one tablespace. There can be several tablespaces in a cluster. Each tablespace will exist with the same size on all data nodes in the cluster. The relative file name of the tablespace file(s) will be the same on all machines.

Tables allocate extents from the tablespace. The size of an extent is decided when the tablespace is created. By default the extent size is 1 MByte. Once an extent has been allocated to a table it stays with this table until the table is dropped and a local checkpoint after the drop table has been completed.

All disk data rows are stored in pages that are 32 kBytes in size. The pages move back and forth between disk and main memory as whole pages. The pages are maintained by the Page Manager, which uses a page cache to store the pages while in memory. The size of this page cache is given by the RonDB configuration option DiskPageBufferMemory.

Each tablespace contains a few pages describing the extents; to support fast decisions on where to place new rows, these pages are always in the page cache. The extent pages have a 4-word header for each extent and in addition 4 bits per page in the extent providing some state information on how full the page currently is.

This means that with a 4 TByte tablespace with 1 MByte per extent we will have 4M extents in the tablespace and each extent will use 20 bytes in main memory. Thus the extent pages will use about 80 MByte of the page cache.

It is recommended to have at least a few gigabytes of page cache when using disk columns at a larger scale. When pages are involved in updating transactions they are temporarily locked into main memory when the commit operation starts and remain locked until the UNDO log has been written to disk. This is to ensure that we follow the WAL principle of writing to the log before we write the actual disk data pages.
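
A hedged sketch of setting the page cache size in the cluster configuration file (assuming the [ndbd default] section is used for settings common to all data nodes; 8G is only an illustrative value in line with the recommendation above):

[ndbd default]
DiskPageBufferMemory=8G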

For writes to the data pages we use the NO STEAL approach. This means that we only write committed changes to the data pages. During the transaction the data resides in memory buffers and only at commit time are the changes written into the page cache; even later the data is flushed to disk.

Tablespaces consist of one or more data files. The size of existing data files cannot be changed, but new data files of any size can be added to a tablespace as online operations.

UNDO log#

The disk data uses the WAL principle (Write Ahead Logging). Thus before we write any disk data page we must write an UNDO log record to ensure that we can roll back the pages to the state at the start of a local checkpoint.

The UNDO log files are stored in a special tablespace called Logfile Group. At the moment the implementation only supports one Logfile Group. There can be multiple log files in this Logfile group and new log files can be added as online operations.

Syntax for disk data#

When using disk data we first need to create an UNDO logfile group. Second we need to create at least one tablespace that uses this UNDO logfile group. After that we can create the tables that make use of this tablespace.

The managed version of RonDB always creates a logfile group with the name lg_1 of size 4 GByte using the configuration parameter InitialLogfileGroup, and a tablespace called ts_1 of size 8 GByte with an extent size of 16 MByte using the configuration parameter InitialTablespace. The UNDO log buffer size is also handled automatically in the managed version of RonDB.

The size of the UNDO log and the tablespace is too small for larger use cases. If you are using the managed version and are planning to use a large amount of disk columns you should add new UNDO log files to the lg_1 logfile group and more data files to the ts_1 tablespace.

When creating the data node VMs one can set the size of the storage large enough to handle the required size of these UNDO log files and tablespace files in addition to the other files required for recovery of in-memory tables.

Allocate 150% of the memory size of the data node VM for in-memory files for recovery. In addition one needs to add 64 GByte of space for REDO logs. Thus if you use a VM with 256 GByte of memory you need 448 GByte plus the space for UNDO logs and tablespaces. If you need another 500 GByte of tablespace, a storage size of 1 TByte for the VM should be sufficient. This would include a 64 GByte UNDO log space.

UNDO log syntax#

When creating an UNDO log file group we need to add at least one file to it. We provide the size of the file using the INITIAL_SIZE keyword. Each logfile group also uses an UNDO log buffer, whose size we specify when creating the logfile group.

mysql> CREATE LOGFILE GROUP lg_1 ADD UNDOFILE 'lg_1.dat'
mysql> INITIAL_SIZE 4G UNDO_BUFFER_SIZE 32M engine=ndb;

This creates a logfile group lg_1 that is stored in the file lg_1.dat. This file is placed in the DataDir of the data node (more control can be applied using FileSystemPath or FileSystemPathDD). The log file size is 4 GByte and the UNDO buffer is 32 MByte in size.

Given that only one UNDO log file group is allowed, one might as well create it already at the initial start of the cluster. This happens using the RonDB configuration variable InitialLogfileGroup.

Here is an example of how to set this variable:

InitialLogfileGroup=undo_buffer_size=32M;lg_1.dat:4G;lg2.dat:4G

This creates an initial logfile group called DEFAULT-LG with two files, lg_1.dat and lg2.dat, both 4 GByte in size. The UNDO buffer is set to be 32 MByte in size. When creating the initial logfile group it is possible to specify a logfile group name, but since InitialTablespace cannot refer to any logfile group other than DEFAULT-LG this is not a good idea.

At any time we can add more log files to an existing logfile group. This is done by using an ALTER LOGFILE GROUP command.

Here is an example:

mysql> ALTER LOGFILE GROUP lg_1
mysql> ADD UNDOFILE 'lg2.dat'
mysql> INITIAL_SIZE 4G
mysql> engine=ndb;

The setting of InitialLogfileGroup above is equivalent to these two example statements, CREATE LOGFILE GROUP and ALTER LOGFILE GROUP, combined. The last example creates a new log file that is 4 GByte in size. It uses the same UNDO buffer; there is only one UNDO buffer per logfile group and it isn't possible to change its size here.

Another good reason to use InitialLogfileGroup is that it means that the memory for the UNDO log buffer is not taken from the memory specified in the RonDB configuration option SharedGlobalMemoryPool.

Tablespace syntax#

Creating a tablespace adds the first data file to the tablespace. When creating the tablespace one needs to refer to the logfile group to be used with the tablespace. It is also possible to set the extent size (it defaults to 1 MByte). Extents are allocated to a table from the tablespace; an extent is the allocation unit when allocating disk space to a table.

mysql> CREATE TABLESPACE ts_1 ADD DATAFILE 'ts_1.dat'
mysql> INITIAL_SIZE 16G
mysql> EXTENT_SIZE 16M
mysql> USE LOGFILE GROUP lg_1
mysql> engine=ndb;

The above statement creates a tablespace named ts_1 with a first data file ts_1.dat. The data file is 16 GByte in size, and the tablespace uses an extent size of 16 MByte and the logfile group lg_1.

mysql> ALTER TABLESPACE ts_1
mysql> ADD DATAFILE 'ts2.dat'
mysql> INITIAL_SIZE 16G
mysql> engine=ndb;

The above statement adds another data file, ts2.dat, to the tablespace ts_1. The file will have an initial size of 16 GByte.

In the same fashion as with logfile groups it is possible to create an initial tablespace for the cluster using the RonDB configuration file. In this case the configuration variable is called InitialTablespace.

Here is an example of its use:

InitialTablespace=name=ts_1;extent_size=16M;ts_1.dat:16G;ts2.dat:16G

This creates the same tablespace as the CREATE TABLESPACE command and ALTER TABLESPACE command did together. The name is optional; if not provided, the name will be DEFAULT-TS. It will always use the default logfile group DEFAULT-LG.

Controlling which columns are stored on disk#

By default all columns are stored as in-memory columns. One can set the storage type of a column by specifying either STORAGE MEMORY or STORAGE DISK. This can also be set on the table; if it is set on the table it is applied to each non-indexed column.

Here is an example:

mysql> CREATE TABLE t1 (a int, b int storage memory)
mysql> engine=ndb
mysql> storage disk
mysql> tablespace ts_1;

This creates a table with a hidden primary key stored in memory and two user-defined columns: the first, a, is stored on disk and the second, b, is stored in memory.
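As a complementary sketch (the table and column names are illustrative), the same control can be applied per column without a table-level default; here only the column c is placed on disk while the primary key and b stay in memory:

mysql> CREATE TABLE t2
mysql>   (a INT PRIMARY KEY,
mysql>    b INT,
mysql>    c VARCHAR(100) STORAGE DISK)
mysql>   ENGINE=NDB
mysql>   TABLESPACE ts_1;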

Page Manager#

The Page Manager maintains the page cache. It uses a modern page cache algorithm. We will attempt to provide a basic understanding of this algorithm.

Each page in the page cache has a page cache record in the Page Manager. Each such page cache record goes through a number of states before it has a page connected to it.

At first a main memory page must be bound to the page cache record. Once this has been done the page must be mapped to a disk page. For new pages this is trivial since the page content on disk can simply be overwritten. For existing pages the page is mapped to the disk page by reading the page from disk (page in).

In order to find a clean page to bind to a page cache record it might be necessary to first clean a page. Cleaning a page means that its content might need to be written to disk before the page can be reused for another disk page (page out).

A basic part of the algorithm is that we have 10x as many page cache records as we have pages in the page cache. We maintain those records in a linked list, ensuring that we know the hotness of many pages that are not currently present in the page cache.

As a page is used by the application we increment the usage counter of the page. This counter is maintained even for pages that have been paged out, as long as the page cache record hasn't been removed as well.

Thus we can see patterns of reuse that are longer than our page cache size can handle.

Considerations for disk data tables#

Initialisation of files#

Files can be added to both logfile groups and tablespaces as online operations. The addition of a new file will take some time. The file is created as part of the command to add the file and is created with its initial size. To ensure that the file system has allocated space for the entire size of the file, we write the entire log file or tablespace file as part of adding it. Even though the operation to add a new file is an online operation, care is needed to ensure that the disk bandwidth used to initialise the file doesn't harm the ongoing transactions.

Placement of disk data files#

It is possible to specify the full path of a file when adding it to a logfile group or tablespace. This is not necessarily advisable though, since all machines might not have exactly the same directory layout.

If only a file name is provided, the path is derived from the configuration parameters in the following order.

DataDir is normally a variable that should be set in the RonDB configuration. If it is not set, the working directory of the RonDB data node process is used. The default is to place all the files used by the data node process under this directory.

By specifying the RonDB configuration option FileSystemPath, all the disk data files will be placed in this directory. However, if FileSystemPathDD is set, that directory will be used instead.

The algorithm to choose the directory for a data file in a tablespace is as follows:

  1. If the configuration option FileSystemPathDataFiles is set, use this directory.

  2. Else, if the configuration option FileSystemPathDD is set, use this directory.

  3. Else, if the configuration option FileSystemPath is set, use this directory.

  4. Else, if the configuration option DataDir is set, use this directory.

  5. Else, use the working directory of the data node process.

The algorithm for choosing the directory for UNDO log files is the same, except that in the first step FileSystemPathDataFiles is replaced by FileSystemPathUndoFiles.

DiskIOThreadPool#

One more important thing to consider for disk data is the number of OS threads used to read and write pages in the UNDO log files and tablespace files. This number is set through the RonDB configuration variable DiskIOThreadPool. By default it is set to 2. It is quite likely that this parameter should be increased, in particular when using SSDs and NVMe drives to store the tablespaces and logfile groups.
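As an illustrative sketch of what this could look like in the data node section of the cluster configuration file (the value 8 is an assumption for fast NVMe storage, not a recommendation):

DiskIOThreadPool=8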

DiskPageBufferMemory and DiskPageBufferEntries#

Two important parameters are DiskPageBufferMemory, which is simply the amount of memory dedicated to the page cache, and DiskPageBufferEntries, which is the multiplication factor for how many evicted page cache entries we keep track of. It is set to 10 by default; thus we have 10 page cache entries for each actual page in the page cache. Since each such record consumes roughly 84 bytes, this means that we use about 2.5% of the page cache memory to keep a record of old page cache entries. We could increase or decrease this slightly, but most likely this value is a good trade-off. The more page cache we have, the better its hit rate will be.
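A sketch of how these could be set in the data node section of the configuration, where the 8 GByte page cache is an illustrative assumption for a larger setup and 10 simply restates the default:

DiskPageBufferMemory=8G
DiskPageBufferEntries=10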

Local checkpoints for disk data#

Local checkpoints ensure that all dirty disk data pages are written to disk. This is important to ensure that we can cut the REDO log and the UNDO log for disk data pages. The algorithms used to handle disk data are described in a fairly detailed manner in a large comment in pgman.cpp and some comments in pgman.hpp. In Backup.cpp a very detailed description of the local checkpoint algorithm as such is provided. When a fragment replica is checkpointed, its main memory data and its disk data are checkpointed together, to ensure that the references between them stay correct after a restart. The most detailed comments are always in the latest version of RonDB.

To ensure that we don't oversubscribe the disks with writes from local checkpoints (LCPs), we never have more than 1 MByte of outstanding page writes due to LCPs per ldm thread.

Foreign keys in RonDB disk data tables#

It is fully possible to use foreign keys on tables with disk data attributes. It is however important to consider the implications of cascading actions for these tables. If a cascading action tries to change too many rows we could easily hit the limit for transaction sizes for those tables, which should be taken into consideration when designing your schema. Foreign key cascading actions can easily lead a simple transaction to update all rows in a table; this can cause havoc for transaction latency and can easily lead to overload problems.

This problem is present already for main-memory tables; for disk data tables it is even worse, since all deleted rows cause disk pages to become dirty and in need of disk writes in the next local checkpoint.

Table options#

A default table using the NDB storage engine is created such that we get balanced reads when we always read the primary replica. We create one partition per ldm thread per node in the cluster, all columns are stored in main memory and we use REDO logging on the table.

There are options available when creating the table that make it possible to change these defaults.

Read Backup feature#

The transaction protocol in RonDB is designed such that we first commit the update; as part of this we release the locks on the primary replicas while still retaining the locks on the backup replicas. Next we send the commit acknowledgement to the application. In parallel with that we complete the transaction by releasing the memory allocated for the transaction and releasing the locks on the backup replicas.

If we read the backup replica immediately after committing an update, it is theoretically possible to not see our own update to the row if we read using Read Committed (this reads the latest committed row version; a locked row is still not committed). Thus default tables always use the primary replica for reads in Read Committed mode to avoid this anomaly.

When deciding how to make it possible to read the backup replicas we opted for making this a table option. For tables that use this option we delay sending the commit acknowledgement to the application until we've completed the transaction and released the locks on the backup replicas. Thus tables using this table option will always be able to read the backup replicas when reading in Read Committed mode.

An alternative would have been to always read from the nearest replica and allow reads that do not see your own updates. This would have given the best throughput for both reads and writes, but we decided to value the ability to see your own updates higher than the absolutely shortest response time.

An example is an SQL application with two data nodes and two MySQL Servers, where each MySQL Server is colocated with one data node. In this case all SELECT queries can be completely localised to one computer and we have no network bandwidth worries for reads; only updates traverse the network to the other node.

In this example application it is very clear that if the application is doing heavy read activity it is a good idea to use this Read Backup feature. For tables that do a lot of writes the choice is not as obvious since the Read Backup feature delays response to the writes. It doesn't affect the throughput, but it does affect the latency of write transactions.

In RonDB the default is that tables use the Read Backup feature.

Read Backup syntax#

When creating the table the following should be placed in the comment section of the CREATE TABLE statement.

COMMENT="NDB_TABLE=READ_BACKUP=1"

It is also possible to set this as a MySQL configuration option. By setting the configuration option ndb-read-backup to 1, all tables created in the MySQL Server will use this property (it will not affect tables that are changed using ALTER TABLE).

It is also possible to set this property in an ALTER TABLE statement. If only this feature is changed in the ALTER TABLE statement the change will be an online alter table statement.

When used in an ALTER TABLE statement only properties actually changed in the COMMENT section will be changed. E.g. if the READ_BACKUP feature was previously set and we now change some other property in the comment section, it will not affect the value of the READ_BACKUP property after the ALTER TABLE statement.
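As a minimal sketch (the table name and column layout are illustrative), creating a table with this property and later setting it on an existing table could look like this:

mysql> CREATE TABLE t1 (a INT PRIMARY KEY, b INT)
mysql>   COMMENT="NDB_TABLE=READ_BACKUP=1"
mysql>   ENGINE=NDB;

mysql> ALTER TABLE t1 algorithm=inplace,
mysql>   COMMENT="NDB_TABLE=READ_BACKUP=1";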

Fully replicated tables#

A similar feature to Read Backup is fully replicated tables. A fully replicated table has one replica in each node of the cluster instead of one replica per node in a node group. If we have an 8-node cluster with 4 node groups of 2 nodes each, we would normally have split the table into at least 4 parts (possibly more). However for a fully replicated table each node group will contain the full table and all data will be fully replicated in all 8 nodes of the cluster.

The table is implemented as a normal table where updates use internal triggering: an update always goes first to the first node group, and an internal trigger then ensures that all the other node groups also receive the update. The changes are always transactional; thus we always guarantee that all available replicas are written.

This is very nice for clusters that want to scale reads. We can have up to 48 data nodes that are colocated with MySQL Servers using a shared memory transporter, thus scaling reads to many millions of SQL queries towards the shared data. We can still handle hundreds of thousands of updates per second on this data.

Another type of usage for fully replicated tables is so-called dimension tables: small tables that are often used in SELECT statements but rarely updated, used together with large tables containing references to these smaller tables.

Fully replicated tables give a lot more options for how to design scalable applications using RonDB. In particular we expect them to enhance performance when executing complex join operations that make use of the pushdown join feature presented in a later chapter.

Fully replicated table syntax#

When creating the table one uses the comment section to set the fully replicated table property as shown here in the CREATE TABLE or ALTER TABLE statement.

COMMENT="NDB_TABLE=FULLY_REPLICATED=1"

It is possible to set it as a configuration option in the MySQL Server with the ndb-fully-replicated variable. This ensures that all new tables created in the MySQL Server use the fully replicated table property while this configuration option is set. The option has no effect on ALTER TABLE statements.

Setting a table to be fully replicated also means that it will have the read backup property set. Reads on a fully replicated table can use any node in the cluster for reading.

When setting this feature in a table using ALTER TABLE it will use a copying alter table statement.

It is possible to reorganise a fully replicated table to make use of newly added node groups as an online alter table using the command ALTER TABLE algorithm=inplace, REORGANIZE. No downtime is needed to scale the cluster to use new node groups for fully replicated tables (the same holds for normal tables).
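As a minimal sketch (the table name and columns are illustrative), a fully replicated table could be created like this:

mysql> CREATE TABLE t1 (a INT PRIMARY KEY, b INT)
mysql>   COMMENT="NDB_TABLE=FULLY_REPLICATED=1"
mysql>   ENGINE=NDB;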

Partitioning of tables#

RonDB is a distributed DBMS; thus all tables are always partitioned even if not specifically declared to be so. The default partitioning key is the primary key. If no primary key exists, a unique key is used as the primary key. If neither a primary key nor a unique key is defined on the table, a primary key is constructed from an extra hidden column that is added; this column gets a unique value that is a sequence number.

The MySQL Server supports specifying partitioning. With RonDB we only support one variant of partitioning and this is PARTITION BY KEY. It is possible to specify PARTITION BY KEY() in which case the primary key is used. Otherwise a list of columns is defined as the partitioning key. These columns must be a subset of the columns used for the primary key.

It is often a good idea to be careful in selecting the partitioning key. Often it is good to have the same partitioning key in a set of tables. A good example of this is the TPC-C benchmark. All of the tables in TPC-C (except for the item table that describes the products) have a warehouse id as part of the table. The customers are linked to a warehouse, each district is connected to a warehouse, orders and order lines come from warehouses and the stock is also per warehouse.

In the TPC-C benchmark it is natural to use the warehouse id as the partitioning key. In today's world of big data it is very common to use a sharding key to split up the application. Finding the proper partitioning key for a set of tables is the same problem as selecting a sharding key for a set of databases that work together in a shard.
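As a hedged sketch of this idea (the table and column names are illustrative and not the official TPC-C schema), an orders table partitioned on the warehouse id could look like this:

mysql> CREATE TABLE orders
mysql>   (w_id INT NOT NULL,
mysql>    o_id INT NOT NULL,
mysql>    o_total DECIMAL(12,2),
mysql>    PRIMARY KEY (w_id, o_id))
mysql>   ENGINE=NDB
mysql>   PARTITION BY KEY (w_id);

Note that the partitioning column w_id is a subset of the primary key columns, as required.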

In HopsFS, which implements a metadata server for a distributed file system on top of RonDB, the partitioning key for the inodes table is the id of the parent inode. Thus all inodes of a certain directory are in the same partition.

RonDB has a number of similarities with a sharded set of databases. The main difference is that RonDB is much more tightly integrated and supports queries and transactions that span many shards.

By default the number of partitions is set to ensure that the primary fragment replicas are balanced over all ldm threads in all nodes of the cluster. In a cluster with 2 replicas, 2 node groups and 4 ldm threads this means that the default table will have 16 partitions spread over those 16 ldm threads. Since each partition has two replicas, each ldm thread will handle one primary fragment replica and one backup fragment replica.

If we add another node group this means that we will add another 8 partitions. To perform this change one uses the command below on each table.

mysql> ALTER TABLE tab_name algorithm=inplace, REORGANIZE PARTITION;

It is possible to create a table with a specific number of partitions. In this case there is no guarantee of any balance at all among the ldm threads, but it is possible to add a specific number of partitions as an online operation.

The recommended variant is to use the keyword PARTITION_BALANCE to set a balanced number of partitions. We will describe these options in a section below, after some introductory remarks on how we partition.

HashMap Partitioning#

One important feature of RonDB is the ability to add new nodes to the cluster with the cluster up and running. In order to take advantage of these new nodes we must be able to perform an online reorganisation of the tables in the cluster. What this means is that we need to add more partitions to the cluster while the cluster is fully operational and can handle both reads and writes.

In order to ensure that such an online reorganisation can execute without requiring extra memory in the already existing nodes a scheme was invented called hashmap partitioning.

The partitioning happens by calculating a hash value based on the partitioning key values. Normally this value would be used with linear hashing to decide which partition a row is stored in. However, the linear hashing algorithm is not sufficiently balanced for our requirements. It will grow nicely, but it will not grow in steps that are adequate for our needs.

Another layer was invented, called a hashmap. Each row is mapped into a hashmap; by default there are 3840 hashmaps in a table. If we have 4 partitions, each partition will have 960 hashmaps. If we decide to add 4 more partitions there will be 8 partitions, and in that case we will instead have 480 hashmaps per partition. Each partition will need to move 480 hashmaps to another partition (thus half of the data).

The hashmaps are balanced as long as the number of hashmaps is evenly divisible by the number of partitions. The number 3840 is equal to 2*2*2*2*2*2*2*2*3*5. For a large number of choices of the number of node groups and the number of ldm threads in RonDB the resulting number of partitions will divide 3840 evenly, although not all. If it does not, there is a small imbalance (and thus a loss of memory in some nodes that are less occupied than others). The impact is small even when there is an imbalance.

User Defined Partitioning#

The MySQL Server makes it possible to define Hash partitioning, List partitioning, Range partitioning and Range column partitioning. These are not supported by RonDB but they can still be made to work in a partial manner.

To use those types of partitioning one needs to start the MySQL Server with the --new parameter.

When using these types of partitioning the data is only accessible from the MySQL Server. The reason is that the partitioning function currently cannot be pushed down to the data nodes in RonDB. This means that the normal backup isn't sufficient to back up the tables. It is necessary to back up the frm files in addition, where the partitioning ranges and functions are saved.

It isn't possible to change the partitioning; this means that add partition, drop partition and reorganize partition don't work with those tables other than as offline operations.

Since RonDB is very much focused on online operation we have decided not to support these types of partitioning, although they still work.

Considerations for selecting number of partitions#

Normal tables are distributed with an equal burden on all ldm threads in the cluster. To achieve this each ldm thread has one primary fragment replica of the table and also one backup fragment replica. Thus with a 4-node cluster with 4 ldm threads per node there will be 16 partitions in the table and thus one primary replica and one backup replica per ldm thread.

For large tables and for tables that are heavily updated this makes sense. If we have many small tables that are rarely used it would make sense to have a smaller number of partitions for these tables, mostly to minimise the number of files to open and close at checkpoints and the amount of schema memory used.

It also makes a lot of sense to decrease the number of partitions in a table where range scans are often performed on all partitions. Assume we run on a large cluster with 8 data nodes where each data node contains 16 LDM threads. In this case the default table distribution maps the table into 128 partitions. For primary key reads and writes this has no negative impact, nor does it for partition pruned index scans. If you have written a highly scalable application this has very little consequence.

So in other words if you have written an application that makes use of the sharding key (== partition key) in all queries, the application is scalable even with very many partitions.

For an application with many queries that don't define the sharding key in index scans, where the scan is looking for a small amount of data, the overhead will be high since there is a startup cost for each ordered index scan. In this example one needs to execute 128 index scans, one on each partition.

In this case, if the table is small and isn't a major portion of the CPU load, it could make sense to decrease the number of partitions in the table.

By setting the partition balance we can adjust the number of partitions in a table.

Selecting a high number of partitions means that we spread the load of accessing the table evenly among the ldm threads. Thus we minimise the risk of any CPU bottlenecks.

Selecting a high number of partitions also increases the overhead of each index scan that is not pruned to one partition, and similarly for full table scans. The choice of partitioning scheme is a balancing act between balanced CPU usage and minimum overhead per scan.

Selecting the default mechanism also provides the possibility to use extra CPU threads to rebuild ordered indexes as part of a restart. The default mechanism is the natural selection for a highly scalable application. But for less scalable applications and special tables it might make sense to change the partitioning scheme to one that has a smaller overhead for less optimised queries.

Partition Balance#

Using partition balance we can decrease the previous example to use 4 partitions instead of 128. Each partition is located in one LDM thread, which is executed by one CPU. Thus for tables with a lot of usage one should be careful not to decrease the number of partitions too far.

Even before version 7.5 it was possible to set an explicit number of partitions on a table. In 7.5 a number of special table options were added to change the number of partitions in the table. These options are designed to ensure that we still get a balanced load among the ldm threads. Setting a specific number of partitions is still possible, but this gives less balance between the nodes and the ldm threads.

We have two dimensions of balance. The first dimension is whether we balance for usage with Read Primary replica, in which case we want the primary replicas to be balanced among the nodes and ldm threads, or whether we balance for usage with Read Any replica, in which case we balance all replicas among the node groups and ldm threads.

The second dimension is whether to balance over all ldm threads or to only balance between the nodes/node groups.

One primary partition per each LDM in cluster, FOR_RP_BY_LDM#

This is the default mechanism. It stores one primary fragment replica in each ldm thread in each node; thus the total number of partitions is the number of nodes multiplied by the number of ldm threads.

Balancing for reads of the primary replica on each ldm thread is the default behaviour. We get one partition for each ldm thread in the cluster; if we have 4 nodes and 4 ldm threads per node we will have 4 * 4 = 16 partitions in the table.

This will always bring perfect balance among the ldm threads and this is why this is the default option.

This option has the most partitions among the standard partitioning options. This is desirable for a default option since it makes it easier to parallelise restart execution.

This option is called FOR_RP_BY_LDM.

One primary partition per LDM per node group, FOR_RA_BY_LDM#

This is the natural mechanism to use for read backup tables and fully replicated tables. It stores one primary fragment replica in each ldm thread in each node group; thus the total number of partitions is the number of node groups multiplied by the number of ldm threads.

If we mainly use the read backup feature, mainly use Read Committed and rarely use SELECT ... LOCK IN SHARE MODE, it is sufficient to balance the load by ensuring that each ldm thread has one fragment replica per table. Since all fragment replicas are used almost equally, there is no special need to have multiple fragment replicas per ldm thread.

Thus in the case of 4 nodes with 4 ldm threads we will have 16 fragment replicas and with 2 replicas this means we will have 8 partitions per table.

This partition balance is interesting for tables that have a fair amount of scans touching all partitions but that still need to be balanced over all ldm threads.

This option is called FOR_RA_BY_LDM.

One primary partition per node, FOR_RP_BY_NODE#

This mechanism stores one primary fragment replica in one ldm thread in each node. Thus the total number of partitions is the number of nodes.

In case we decide not to balance a table among the ldm threads, we still want it balanced among all the nodes. Not balancing among nodes would mean using different amounts of CPU resources in different nodes, and this would create imbalances that would be hard for a user to predict.

Under the hood we will try to balance those tables among the ldm threads although this will not be perfect in any sense.

We can still decide whether to balance for reads of the primary replica or for reads of any replica. This option balances for reads of the primary replica. For a normal cluster with 2 replicas this means that each node has 2 fragment replicas in two different ldm threads. This sometimes strikes a nice balance between the CPU resources available for the table and minimising the number of partitions in the table.

Thus in the case of 4 nodes with 4 ldm threads we will have 4 partitions per table.

This option is called FOR_RP_BY_NODE.

One primary partition per node group, FOR_RA_BY_NODE#

This is the minimal mechanism. It stores one primary fragment replica in one ldm thread per node group; thus the total number of partitions is the number of node groups.

This partitioning option gives the smallest number of partitions. Thus it is very useful for tables that have mostly index scans on all partitions, and for small, rarely used tables that still need balance among the nodes in the cluster.

It is useful for decreasing the overhead of many small tables in the cluster while still maintaining a balanced memory usage on all nodes in the cluster.

Thus in the case of 4 nodes with 4 ldm threads and 2 replicas we will have 2 partitions per table.

This option is called FOR_RA_BY_NODE.

Syntax for Partition Balance#

This table option is only settable in the COMMENT section when you create a table and when you alter a table.

COMMENT="NDB_TABLE=PARTITION_BALANCE=FOR_RA_BY_NODE"

It is possible to set several properties in the comment section at the same time like this:

COMMENT="NDB_TABLE=READ_BACKUP=1, PARTITION_BALANCE=FOR_RA_BY_NODE"
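As a minimal sketch (the table name and columns are illustrative), a full statement using such a comment could look like this:

mysql> CREATE TABLE t1 (a INT PRIMARY KEY, b INT)
mysql>   COMMENT="NDB_TABLE=READ_BACKUP=1, PARTITION_BALANCE=FOR_RA_BY_NODE"
mysql>   ENGINE=NDB;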

Going from one partition balance to another can be done as an online alter table operation if the number of partitions increases. It will be a copying alter table statement if the number of partitions decreases.

The partitioning balance can be used for fully replicated tables. It is not possible to change the PARTITION_BALANCE as an online alter table option for fully replicated tables. It is possible to add new node groups to the fully replicated table as an online alter table statement.

Setting explicit number of partitions#

As mentioned previously it is still ok to explicitly set the number of partitions. This provides no guarantees on balance, but we will still attempt to balance the load among nodes and ldm threads as much as possible.

Once a table has a specific number of partitions set, it cannot be changed to any other partitioning option as an online alter table statement. It cannot be reorganised using the REORGANIZE keyword. The number of explicit partitions can, however, be increased as an online alter table statement.

Thus the setting of explicit number of partitions gives the user complete control over the number of partitions per table if this control is desirable. It does not provide control of the placement of these partitions.

The PARTITION_BALANCE options make it possible to control the number of partitions even when adding new node groups, while still maintaining a good balance between the nodes and the ldm threads.

Setting an explicit number of partitions is done using the normal partitioning syntax. Adding e.g. PARTITIONS 4 after specifying PARTITION BY KEY and the list of fields for key partitioning gives an explicit number of partitions, as in the sketch below.
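A minimal sketch of this syntax (the table and column names are illustrative); the partitioning column a is a subset of the primary key columns as required:

mysql> CREATE TABLE t1
mysql>   (a INT, b INT, PRIMARY KEY (a,b))
mysql>   ENGINE=NDB
mysql>   PARTITION BY KEY (a)
mysql>   PARTITIONS 4;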

No REDO logging#

Normally all tables are fully recoverable. There are applications where data changes so rapidly that it doesn't make sense to recover the data. An example is a stock exchange application where data is changed hundreds of thousands of times per second and the amount of data is very small. If the cluster crashes and comes back a few minutes later, that data is no longer current.

In this case we can optimise the table by removing writes to the REDO log and removing writes to the local checkpoints for the table. This removes around 30% of the overhead in performing updates on the table.

NOLOGGING syntax#

This feature is enabled as a table option. The table option cannot be changed as an online alter table statement. It uses the COMMENT section in the same fashion as the read backup, partition balance and fully replicated features.

COMMENT="NDB_TABLE=NOLOGGING=1"

By setting the MySQL configuration option ndb-table-no-logging to one we ensure that all tables created in this connection will use the NOLOGGING feature.
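A minimal sketch of a table created with this option (the table name and columns are illustrative):

mysql> CREATE TABLE t1 (a INT PRIMARY KEY, b INT)
mysql>   COMMENT="NDB_TABLE=NOLOGGING=1"
mysql>   ENGINE=NDB;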

Schema Considerations#

We've already mentioned that all primary keys and unique keys by default define an extra ordered index on top of the distributed hash index that is always part of a primary key and unique key. Adding USING HASH to an index definition is a method to avoid the ordered index if it isn't needed.

We've discussed using disk columns for columns that you don't expect to add any indexes to. This is useful if you have fast disks, such as SSDs or NVMe drives, to store your tablespaces on. Using hard drives is technically ok, but the difference between the access time of a hard drive and the access time of memory is so great that it is very unlikely that the user experience will be any good.

Nowadays there are disks that can handle many thousands of IO operations per second (IOPS), while at the same time the access time is measured in tens of microseconds. These are useful even when compared to using memory.

We've discussed the possibility to use fully replicated tables in some cases to get faster access for read and various partitioning variants.

One more consideration that we haven't mentioned is to consider the recovery algorithms.

In RonDB all deletes, updates and inserts are written to a REDO log. This REDO log is a logical change log, thus the amount of log information is proportional to the actual change made. Only the changed columns are logged in the REDO log.

In addition we execute local checkpoints; these write all the rows at checkpoint time. Rows that are updated very often will still only be written once per checkpoint. The checkpoint always writes the full row.

Similarly for disk columns we write an UNDO log where the size of what we write is dependent on the total column size. The consequence of this is that the amount of data we write to the checkpoints is dependent on the size of the rows.

This means that if we have information that is very often updated there can be reasons to put this data in a separate table to avoid having to checkpoint columns that are mostly read-only. Naturally there can be many reasons for those columns to be in the same table as well and for most applications this particular issue is not a concern at all. As with any optimisation one should measure and ensure that it is worthwhile using it before applying it.

MySQL Concepts in NDB storage engine#

There are quite a number of different concepts that exist in MySQL. These concepts are developed such that they work for any storage engine. There is one thing that differs with NDB compared to a traditional MySQL storage engine: the NDB storage engine is connected to a cluster, and in this cluster there can be many more MySQL Servers. Creating something in one MySQL Server doesn't automatically create it in all MySQL Servers connected to the cluster. In this chapter we will discuss many of those concepts and how they work in RonDB.

MySQL Queries#

Queries are the normal SELECT, INSERT, UPDATE, DELETE queries and a few more special query variants. These queries are used to read and write the data. This is an area where the syntax used to query tables in e.g. InnoDB and in NDB is the same. The performance characteristics will differ and, as mentioned, the consistency model differs, but a query will be executable towards NDB just as towards other storage engines. A query is executed in one MySQL Server and as soon as it has executed it is gone; there is no relation to other MySQL Servers for the query.

MySQL Databases#

MySQL Databases are created from the MySQL Server. A database is automatically created in the cluster when it is created in one MySQL Server. Under the hood there is no special handling of different databases in RonDB. The database is present in the internal table name of tables in RonDB. This is the only handling of databases in RonDB.

MySQL Triggers#

Triggers come into play with writes to the database. A trigger is fired and completely handled within one MySQL Server. This is the case for RonDB as well.

RonDB does not currently propagate the trigger definition to all other MySQL Servers. If one wants a trigger to be executed on the table in the whole cluster, it is necessary to add the trigger on all MySQL Servers in the cluster.

MySQL triggers are something to be a bit careful with. The reason is that they are only executed in MySQL Servers. Given that tables in RonDB can also be accessed using the NDB API, ClusterJ and other direct APIs, it is important to consider this when using triggers.

If triggers are used to maintain some sort of constraint, this constraint must be maintained also by direct RonDB applications that are not using the MySQL Servers.

MySQL Views#

Views are a method to make it appear as if a table exists that is only a view of the real data. Views could be used for security reasons or to make it simpler to access data.

Views work perfectly fine with RonDB, but it is important to remember that if a view is to be used in all MySQL Servers in the cluster, it has to be defined in all MySQL Servers in the cluster.

If a round robin router is used in front of the MySQL Servers and the queries use views, one has to ensure that all those views are created in all MySQL Servers before they are used.

There are no problems related to views for the cluster: the views will not be seen from the NDB API, which only sees the real tables. A view is used when executing a query to translate the query into a more complex query; thus the view is only related to the execution of a specific query. Views therefore work fine with NDB as with any other storage engine. The only thing to remember is that a view is local to one MySQL Server and is not automatically replicated to other MySQL Servers in the same cluster.

MySQL Events#

Events are a method to have special actions happen at predefined times. This again works perfectly fine with RonDB.

In this case it is not necessarily a good idea to install the event on all MySQL Servers. Events are often some actions that need to be taken at certain intervals. The action is quite likely an action for the cluster and not for one specific MySQL Server.

Since events often need to execute once per time period for a cluster it seems more likely that it is necessary to install the event in a few MySQL Servers. At the start of the event processing one could check some special event table that ensures that the event is only executed once per time period. In this manner the MySQL Servers will discover that someone else already has executed the event.

It is fairly straightforward to handle MySQL events in RonDB and it is fairly easy to even make events such that they are highly available.

MySQL Stored Procedures#

Stored procedures are defined per MySQL Server. If a stored procedure is needed in all MySQL Servers, it has to be created in all of them.

Defining a stored procedure is local to the MySQL Server it is executed on. There is nothing special about stored procedures for RonDB; they are executed like any other stored procedure in MySQL. A stored procedure is written in a language for executing several SQL queries with one call, and the interpreter for this language executes within one MySQL Server.

The impact on RonDB by stored procedures is the same as if there was an application program executing multiple SQL queries towards one MySQL Server.

MySQL Functions#

MySQL functions are very similar to stored procedures. There is nothing special about using MySQL functions for RonDB compared to a MySQL Server using InnoDB.

MySQL User Management#

User management in RonDB can be handled as a cluster-wide property. It is also possible to handle it separately for each MySQL Server in the cluster.

To prepare the cluster for distributed user management, the user table has to be converted to a RonDB table. We won't go through the details of how to do this in this book; check the MySQL manual if you want to use distributed user management.

MySQL Transactions#

Executing transactions in MySQL translates into transactions in RonDB. Any transaction executed in RonDB involving only RonDB tables will execute as a distributed transaction in the RonDB data nodes. There is no support for using multiple storage engines in one transaction.

Savepoints#

In MySQL we can declare savepoints in a transaction that we can roll back to. This is not supported for transactions involving RonDB tables, a rollback will always roll back the transaction entirely.

MySQL Prepared Statements#

Prepared statements are a way to avoid parsing and optimizing a query each time it is executed. This means that the MySQL Server stores a part of the query plan for the query and uses this the next time the user wants to execute the query. This works fine for RonDB tables; there is nothing special about RonDB here. Each MySQL Server prepares queries independently of all other MySQL Servers in the cluster.
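A minimal sketch using the standard MySQL prepared statement syntax (the table name and value are illustrative):

mysql> PREPARE stmt FROM 'SELECT * FROM t1 WHERE a = ?';
mysql> SET @a = 1;
mysql> EXECUTE stmt USING @a;
mysql> DEALLOCATE PREPARE stmt;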

MySQL Explain statements#

MySQL has a special statement called EXPLAIN. This statement is used by the database developer to understand how a specific query is executed. There are some variants that are specific to RonDB and RonDB tables. One example is that we can push down a join to the cluster, which means that a large part of the join execution is handled in the RonDB data nodes.

EXPLAIN lists one line per table in a join operation and shows how each table is accessed and how it is joined to the other tables. A set of tables can be pushed down as a whole to the NDB storage engine such that they are executed together. If this also pushes down the evaluation of some conditions, the speedup can be quite significant.

There are other things such as condition pushdown, execution of batched key access and many other techniques that have a major impact on query execution time and that can be viewed in the EXPLAIN output. The MySQL manual explains in more detail how the EXPLAIN command works in RonDB.
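A minimal sketch of how it is invoked (the table names are illustrative; the columns shown in the output depend on the MySQL version and are not reproduced here):

mysql> EXPLAIN SELECT t1.a, t2.b
mysql>   FROM t1 JOIN t2 ON t1.a = t2.a
mysql>   WHERE t2.b > 10;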

Table locks#

At the moment LOCK TABLES and UNLOCK TABLES are only handled locally in one MySQL Server. We currently do not support locking tables over the entire cluster; the lock only applies within one MySQL Server. The only safe way of excluding access to a table is to use the single user mode described in the chapter on the RonDB management client.

XA support#

RonDB doesn't support running transactions at the XA level.

Altering a Table#

In this chapter we will cover how we can change the meta data in RonDB. In most cases it can be done as an online operation.

Any time spent altering meta data while not allowing writes to the tables in parallel means downtime. Therefore much effort has been spent to ensure that most normal meta data changes can be done online.

One problem is that the storage engine API doesn't allow for a true online operation in the MySQL Server that performs the meta data change. Thus as will be discussed in a later chapter it is necessary to perform meta data changes in a specialised MySQL Server that is only used for meta data changes (at least at the time the meta data changes are performed).

Online meta data operations#

Online Create Table#

Creating a new table and dropping a table are obviously online operations. All meta data operations are performed as distributed schema transactions. For example, when creating a table we create a base table, zero or more ordered indexes, zero or more unique index tables, zero or more BLOB tables, zero or more foreign keys and internal triggers to keep indexes up to date. All of those are created atomically.

This requires a stepwise approach where each step in the creation process is recorded on disk to ensure that we can roll forward or roll back in recovery situations. Due to this stepwise approach the create table statements take a bit of time; a few tables per second can be created in RonDB. Speed of meta data operations isn't the focus in RonDB; rather, the focus is availability for reads and writes during those operations.

If RonDB Replication is activated it will be an integral part of the create table statement to ensure that the create table is replicated to the slave clusters. This increases the time to execute the create table statement. The create table statement has to be synchronised with all MySQL Servers that are used for binary logging, and we have to synchronise writes of data and meta data changes.

From this we deduce that meta data operations are online, but they are not necessarily fast. The focus of meta data operations is that they are properly synchronised with all cluster nodes and synchronised with slave clusters. RonDB is focused on applications where meta data changes are done in relation to software upgrades, which are rare compared to the number of user data transactions and queries.

Online Drop Table#

Drop table is an online operation. One complication for drop table is when scans on the table are still ongoing at the time the drop table is issued; there can also be a number of internal recovery operations ongoing for the table. In this case we might have to wait for a short time until the drop table is completed. As part of drop table we remove all files that are specific to the table, and extents in any tablespaces are freed up. Similarly the memory connected to the table is released.

Online Create Ordered Index#

Adding a new ordered index starts by creating a new index table. This involves adding internal triggers on update, delete and insert operations that ensure that the ordered index is kept up-to-date.

This operation is online such that no writes are blocked by the creation of the new index. The operation will take some time; the table has to be scanned, each row is inserted into the index, and internal triggers ensure that the index is kept up-to-date.

During an update we know if the ordered index scan build has reached the row we are updating since each row has a row id and the build scan is scanning in row id order. The index contains a sorted list of row ids. The sort order is based on the index columns and whether the index is ascending or descending.

The creation of a new ordered index can take substantial time for large tables, but it will run without interfering with user transactions other than that it will use new memory resources and will use CPU resources for the index build.

Ordered indexes are always in memory, all ordered indexes are rebuilt as part of all node restarts and cluster restarts. This rebuild is an offline build that happens when the node isn't involved in any transaction.

The ndb_restore program has the ability to create new ordered indexes that are built as an offline operation. Thus the creation of the index is much faster, but it isn't an online operation. A restore of a backup isn't expected to happen in parallel with user transactions.

Online Drop Ordered Index#

Dropping an index only releases the memory and removes the meta data about the index. As when you drop a table we will wait for up to 5 seconds for any scans that are still using the ordered index.

Online Create Unique Index#

Creating a unique index means creating a real table. The table is special in that it cannot be accessed as a table; it can only be used to access the main table.

Internally it is a normal table. This table has a hash index like all other tables in RonDB. It doesn't have any other indexes. It is possible to handle cases where a column amongst the unique index columns can have a NULL value.

The table is recoverable if the main table is. There is a trigger from the main table to the unique index table ensuring that any updates on the unique index columns in the main table will be reflected in the unique index table. An update leads to both a delete and an insert. An insert leads to an insert in the unique index table and a delete leads to a delete in the unique index table.

Online Drop Unique Index#

Dropping a unique index is similar to dropping a table, it removes the memory, the files and the internal triggers for the unique index table.

Online Create/Drop Foreign Key#

A foreign key has two components. The first component is an index on the referencing columns in the child table. The second component is a set of triggers that verify that the foreign key constraint isn't broken by any updates.

As part of creating a foreign key one needs to scan the data to verify that existing data meets the foreign key constraints.

Online Add Column#

Adding one or more new columns to a table can be an online operation that is merely changing the meta data. For the operation to be online the column must be an in-memory column and it has to use the DYNAMIC format. In addition the new columns must be NULLable columns or they need to have a default value. NULL or the default value is the value of all rows for the new column after adding the column.

The internal row structure has three variants of implementing a column. One variant is fixed size; in this case the column has a fixed size which is the maximum size of the column. The second variant is variable sized columns. In this case there are at least two bytes for the column that provide the pointer to the column within the row, and if the column is NULLable there is also a bit in the NULL bit array. Both of these variants can only be added when the table is created or when the table is altered using the offline algorithm (the offline algorithm creates a new table and copies the data over to the new table).

The third variant is to place the column in the DYNAMIC part. In this part columns are only present if they have a value different from the NULL or default value. So adding a column in this part can happen by simply changing the meta data and there is immediately a new column in the row. Accessing a column in the DYNAMIC part is more costly, but a lot more flexible.

When defining a column it is possible to specify whether it should be using the FIXED format, the DYNAMIC format or if it should use the DEFAULT format. The default format is the FIXED format.

When adding a new column one can specify whether the column is an in-memory column or a disk column. The default is in-memory columns; the default can be changed at the table level by setting STORAGE to DISK on the table.
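A minimal sketch of an online add column (the table and column names are illustrative); the new column is NULLable, in memory and uses the DYNAMIC format so the change can be done in place:

mysql> ALTER TABLE t1 algorithm=inplace,
mysql>   ADD COLUMN c INT NULL
mysql>   COLUMN_FORMAT DYNAMIC
mysql>   STORAGE MEMORY;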

Online Create Tablespace#

Tablespaces are used to store the disk columns. Each table with disk columns needs to use a specific tablespace. Tablespaces can be added as part of the configuration using the InitialTablespace configuration variable; in this case the tablespace is added as part of the initial start of the cluster.

New tablespaces can be added as online operations. A tablespace cannot be used until it has been created; a tablespace can only be used by a table if the table is created after the tablespace.

In a tablespace we can have one or more data files. The first data file is always added as part of creating the tablespace.

When a tablespace is created one needs to define the extent size. The extent size is the unit of allocation from the tablespace to a table. When an insert to a table needs a free page and there are no more free pages in the extents connected to the table, a new extent has to be allocated from the tablespace. When creating a tablespace one sets the size of the tablespace by setting its initial size; this is the size of the first file added to the tablespace.

Creating a tablespace can be a lengthy process since the data file will have to be initialised. The reason for this is that otherwise the OS will not ensure that the actual disk space needed for the file is allocated and this could cause the tablespace to become full even without using the full size specified.

Online Add Datafile to Tablespace#

It is possible to add new data files to an existing tablespace. The only things to set here are the file name of the new data file and the initial size of the data file (the size cannot be extended after it is added, only new files can be added).

Drop tablespace and data files#

Data files and tablespaces can be dropped as online operations as well, but they can only be dropped if they are completely empty. Extents allocated to a table can only be released by dropping the table, and the allocation to a table is not directed to any specific data file. It is therefore only practical to drop an entire tablespace; individual data files can only be dropped immediately after they were added, before they have been used for the first time.

Online Create Logfile group#

Disk columns need an UNDO log in addition to the tablespace. The UNDO logs are managed by the logfile group that each tablespace needs. When creating a tablespace one needs to define the logfile group this tablespace is using. Currently we can only have one logfile group.

A logfile group can be created from the configuration by using the configuration parameter InitialLogfileGroup. It can also be created using an SQL command. Either way we need a first log file, we need to set the size of the UNDO log buffer connected to this logfile group and we need to set the initial size of the log file.

Log files are initialised when created, the command is online, but will take some time.

Online Add Logfile to LogfileGroup#

New log files can be added at any time to the UNDO log. We need to set the name of the file and the initial size of the file. Also here the file needs to be initialised when added and the command is an online command.

Online Add Node#

Most users of RonDB will start out with a small 2-node cluster with 2 replicas. If the usage of the cluster increases it is important to be able to add new nodes to the cluster.

To add a new node there are two steps needed. First of all one can only add entire node groups currently. If the cluster uses 2 replicas we have to add 2 nodes at a time. After adding a new node group the existing tables need to be reorganised to use the new node group(s).

Add a new Node Group#

The first step in adding a new node group is to add the new nodes to the configuration. These new nodes should be configured with the NodeGroup configuration parameter set to 65536, which means that the node is not part of any node group. Other nodes will set up the communication paths to these nodes, but the nodes will not yet be used to store data. New tables created will not attempt to place table partitions into those nodes yet.

In principle it is possible to add these new nodes early on; the only problem with this is that the existing nodes will have to allocate memory for the send buffers to these nodes.

The change of configuration is covered in the chapter on programs and more specifically in the section on the RonDB management server.

Adding a new node group is an online operation and it is a schema transaction that is performed atomically. It is executed using the ndb_mgm client. The command to add two new nodes 3 and 4 to an existing cluster with nodes 1 and 2 in the first existing node group looks like this:

CREATE NODEGROUP 3,4

After this command is completed the new node group is created and used when new tables are created. The creation of a new node group is a fairly quick operation.

The creation of tablespaces and logfile groups is performed during the first start of the new nodes; it is not part of the creation of the new node group. Starting up a new node for the first time can take some time to complete even though no data needs to be transferred to the node during the restart.

It is possible to drop a node group, but it can only be done if the node group is empty.

New tables#

Immediately after creating a new node group any new table will use the new node group.

Reorganise existing tables#

After adding one or more new node groups it is recommended to reorganise existing tables as well. This is done one table at a time by issuing the command:

mysql> ALTER TABLE tab_name algorithm=inplace, REORGANIZE PARTITION;

This command is performed online. It will take some time since the entire table has to be changed.

The algorithm used follows this procedure.

  1. A number of new partitions are created in the new node groups

  2. A table scan is started in each previously existing node

  3. For each row we decide if the row should be copied to the new partitions

  4. Wait for scan to complete

  5. Start a new table scan on each previously existing node

  6. Each row that was copied over will now be deleted

  7. Wait for scan to complete

In parallel with this we ensure that each transaction will write to both the old node group and the new node group. In effect some of the rows will be in two partitions at the same time during the change.

Fully replicated tables are reorganised in a different manner. In this case there is no need to delete any rows since a new node group will increase the number of replicas.

Optimize Table#

Optimize table is a command that removes fragmentation from the in-memory columns for the variable sized parts. A row in RonDB consists of up to three parts. The first part is the fixed part. The fixed part cannot be compacted since the row id is the reference to the row; changing the row id would be very complex since it would create complex interactions with scan operations that happen in parallel with optimize table. The second part is the variable sized part of the row, which stores all variable sized components and all dynamic columns. This part can become fragmented and is compacted by an optimize table operation. The third part is the disk columns part, which is fixed size. Currently we don't support returning an extent to the tablespace, so this part is not compacted as part of an optimize table operation.

Fixed size columns are much faster, but DYNAMIC columns are more flexible. If compacting memory is most important, one should use the DYNAMIC keyword on columns (or on the entire table). It is possible to set a configuration parameter such that all tables are created with DYNAMIC columns as the default.
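As a sketch with an illustrative table t1: the column b is declared with DYNAMIC column format so that it is stored in the variable sized part, and OPTIMIZE TABLE is later used to compact that part:

mysql> CREATE TABLE t1
mysql>   (a INT PRIMARY KEY,
mysql>    b VARCHAR(255) COLUMN_FORMAT DYNAMIC)
mysql>   ENGINE=NDB;

mysql> OPTIMIZE TABLE t1;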

Reorganise Table#

It is also possible to reorganise a table when the number of ldm threads has changed. Similarly, if we change the PARTITION_BALANCE strategy, e.g. from FOR_RA_BY_NODE to FOR_RP_BY_NODE, a reorganise action can be performed. It is only possible to increase the number of partitions in a table; decreasing the number of partitions is currently not possible as an online operation. A sketch of such a change follows below.
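As a sketch (the table name is illustrative and the exact behaviour may depend on the RonDB version), the PARTITION_BALANCE setting is changed through the NDB_TABLE comment option, after which the same REORGANIZE PARTITION command as above can be used to redistribute the existing rows:

mysql> ALTER TABLE tab_name algorithm=inplace,
mysql>   COMMENT="NDB_TABLE=PARTITION_BALANCE=FOR_RP_BY_NODE";
mysql> ALTER TABLE tab_name algorithm=inplace, REORGANIZE PARTITION;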

An important thing to consider is that reorganising a table isn't a safe operation unless it is done in conjunction with adding nodes. The reason is that memory usage increases during the reorganise operation. For add node this is fine since the memory is allocated in new nodes that had no data previously. If we run out of memory during the reorganise operation the change is rolled back, but this is not the most well tested part of RonDB. One should therefore ensure that there is a sufficient amount of memory available to perform the reorganise operation; reorganise operations in conjunction with add node are safe in this respect.

Offline meta data operations#

There are a few things that can only be done as offline operations. For the most part they should be avoided. It is necessary to set a configuration variable in the MySQL Server to even allow these operations; the reason is that one must be explicitly aware that an offline operation is being invoked.
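As an illustrative sketch only, assuming the variable in question is ndb_allow_copying_alter_table (the exact variable and its default may differ between versions, so consult the configuration documentation), it could be enabled in the MySQL Server configuration file like this:

[mysqld]
ndb-allow-copying-alter-table=ON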

An offline operation requires some careful consideration. RonDB doesn't support locking a table across the entire cluster. The only safe method for offline operations is to enter the special single user mode. If you are able to control all access to the cluster you can ensure that the table isn't updated. If you only access the cluster from MySQL Servers you can use:

LOCK TABLES table_name READ;

This needs to be entered in all MySQL Servers in the cluster in that case.

Single User Mode#

From the RonDB management client there is a command to enter single user mode. This is primarily intended for safe use of ALTER TABLE statements that use the copy algorithm.

The command is:

ENTER SINGLE USER MODE 55

Thus the API node with id 55 is now the only API node allowed to send requests to the cluster.

For offline operations you first enter single user mode for the MySQL Server. When starting this special MySQL Server you should ensure that it is started with its node id set in the startup parameters. Any changes this MySQL Server performs while in single user mode are then safe.
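A sketch of the relevant startup parameter (section contents are illustrative and other options are omitted), matching the node id used in the ENTER SINGLE USER MODE command:

[mysqld]
ndb-nodeid=55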

After completing the necessary changes you exit single user mode through the command:

EXIT SINGLE USER MODE

Operations that require offline operations#

  1. Drop column

  2. Change to/from disk column

  3. Change character set

  4. Change column size

  5. Change column data type

  6. Change column memory placement

  7. Truncate table

Drop column requires an offline operation.

Moving a column from in-memory to disk, or from being a disk column to being an in-memory column, requires an offline operation. When moving to in-memory this can include creating an index on the column as well.

Changing the character set of a column or of a table requires an offline operation.

Changing a column to a new data type or changing the size of a column requires an offline operation, as does moving a column from the fixed size part to the DYNAMIC part.

All of these operations require using the copy algorithm: a new table is created and the rows are copied over one by one to the new table. When the copy is done the new table takes the place of the original table and the old table is dropped.
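A sketch of such a copying operation, assuming an illustrative table tab_name with an INT column b that is widened to BIGINT:

mysql> ALTER TABLE tab_name algorithm=copy, MODIFY COLUMN b BIGINT;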

Truncate table is an offline operation that will first drop the table and next the table is created again.

Online solutions to offline operations#

Many users that need everything to be online use RonDB Replication; they have more than one cluster that is operational. These changes are often correlated with software changes. Such a change typically starts by taking a backup in the cluster that is to be changed. The cluster is then taken offline, the cluster software is upgraded and the meta data is restored. In this restore it is possible to change column sizes, column data types and column storage; columns can be dropped and columns and indexes can be added.

Next the data is restored. The cluster is then brought up again using the binary log from the cluster that is still operational.

Next the second cluster needs to be changed to the new software and new meta data. More details about how this can be achieved will be provided in the chapter on Multi-Site RonDB.