Table options#

A default table using the NDB storage engine will be created such that we get balanced reads when we always read the primary replica. In RonDB we introduced the use of query threads as the default setup. This means that a table always 2 partitions per node in the cluster. Query threads can perform committed reads when not disk columns are involved. RonDB 21.10 extends this to also handle locked reads in query threads.

There are options available when creating the table that makes it possible to change these defaults.

Read Backup Feature#

The two-phase commit transaction protocol in RonDB runs in 3 phases:

Prepare: Lock rows and write data
Commit: Add a Global Checkpoint Index (GCI) to the written data
Complete: Release locks and transaction memory

Within each row, each phase runs in linear order between the transaction coordinator (TC) and replicas. I.e. the TC passes the prepare message to the primary replica which passes it to the first backup replica and so on. The commit phase is run in reverse linear order - from the "last" backup replica to the primary replica. Since the primary knows that all other replicas have also committed, it can safely release the locks during the commit phase.

The default lock-less Read Committed read in RonDB will always read the latest committed row. However, a locked row is not considered committed. Therefore, reading an updated row after the commit phase will return different results depending on the replica that is read. Only reading from the primary will guarantee read-after-write consistency.

If one wants read-after-write consistency for all replicas, one must send the transaction acknowledgment (ACK) after the complete phase. This is called the Read Backup feature, which is settable as a table option and default in RonDB.

Since replicas can be placed in different geographical locations, being able to read from backup replicas can decrease the latency of the read. The latency can also decrease if the read spans multiple partitions. One does not have to query the primary for each partition. On the other hand, returning an ACK directly after the commit phase will decrease the latency of writes.

The following table lists the effects of the different options. The choice of architecture essentially becomes a question of whether to prioritize reads or writes.

Only Primary Reads

Read-after-write consistency
Higher latency reads with multi-location setup
Higher latency for cross-partition reads
Lower latency writes

This is the NDB original.

Read-after-write consistency
Higher latency reads with multi-location setup
Higher latency for cross-partition reads
Higher latency writes

Primary or Backup Reads

No Read-after-write consistency
Lower latency reads with multi-location setup
Lower latency for cross-partition reads
Lower latency writes

Read-after-write consistency
Lower latency reads with multi-location setup
Lower latency for cross-partition reads
Higher latency writes

This is the Read Backup feature. It is the RonDB default.

An example application is an application using SQL that has two data nodes, two MySQL Servers and each MySQL Server is colocated with one data node. Using the Read Backup feature, all SELECT queries can be completely localized to one computer and we have no network bandwidth worries for the application. Only updates traverse the network to the other node. If a table experiences a lot of writes, using this feature may however be questionable.

Whether the Read Backup feature is disabled or not, the throughput of writes is not affected.

Read Backup syntax#

Note using the Read Backup option means that updating transactions has a slightly lower latency. In this case, the below syntax can be used to disable the Read Backup option. When creating the table the following should be placed in the comment section of the CREATE TABLE statement.

COMMENT="NDB_TABLE=READ_BACKUP=0"

It is also possible to set this as a MySQL configuration option. By settting the configuration option ndb-read-backup to 1 all tables created in the MySQL Server will be using this property (it will not affect tables that are changed using ALTER TABLE). Setting it to 0 means that tables will not use the read backup option.

It is also possible to set this property in an ALTER TABLE statement. If only this feature is changed in the ALTER TABLE statement the change will be an online alter table statement.

When used in an ALTER TABLE statement only properties changed in the COMMENT section will be changed. E.g. if the READ_BACKUP feature was previously set and we now changed another property in the comment section, it will not affect the value of the READ_BACKUP property after the ALTER TABLE statement.

Fully replicated tables#

A similar feature to Read Backup is fully replicated tables. A fully replicated table means that the table will have one replica in each node of the cluster instead of one replica per node in a node group. If we have an 8-node cluster with 4 node groups with 2 nodes in each, we would normally have splitted the table into at least 4 parts (possibly more parts). However for a fully replicated table each node group will contain the full table and all data will be fully replicated in all 8 nodes of the cluster.

The table is implemented as a normal table where updates use an internal triggering that always first updates the first node group and then an internal trigger ensures that all the other node groups also receive the update. The changes are always transactional, thus we always guarantee that we write all available replicas.

This is very nice for clusters that want to scale reads. We can have up to 48 data nodes that are colocated with MySQL Servers using a shared memory transporter thus scaling reads to many millions of SQL queries towards the shared data. We can still handle hundreds of thousands of updates on this data per second.

Another type of usage for fully replicated tables is so called dimensional tables, small fact tables that are often used in SELECT statements but rarely updated and used together with large tables containing references to these smaller tables.

Fully replicated tables gives a lot more options for how to design scalable applications using RonDB. In particular we expect it to enhance performance when executing complex join operations that makes use of the pushdown join feature presented in a later chapter.

Fully replicated table syntax#

When creating the table one uses the comment section to set the fully replicated table property as shown here in the CREATE TABLE or ALTER TABLE statement.

COMMENT="NDB_TABLE=FULLY_REPLICATED=1"

It is possible to set it as a configuration option in the MySQL Server with the ndb-fully-replicated variable. This will ensure that all tables in the MySQL Server use the fully replicated table property when creating new tables while this configuration option is set. This option will have no effect on ALTER TABLE statements.

Setting a table to be fully replicated also means that it will have the read backup property set. Reads on a fully replicated table can use any node in the cluster for reading.

When setting this feature in a table using ALTER TABLE it will use a copying alter table statement.

It is possible to reorganise the fully replicated table to handle new added node groups as an online alter table using the command ALTER TABLE algorithm=inplace, REORGANIZE. No downtime is needed to scale the cluster to use new node groups for fully replicated tables (similar with normal tables).

Partitioning of tables#

RonDB is a distributed DBMS, thus all tables are always partitioned even if not specifically declared to be so. There are a number of things that affect the partitioning. First one has to select the partition key. By default the primary key is the partition key, but it can also be a subset of the primary key columns.

RonDB uses horizontal partitioning and all tables are partitioned using hash partitioning based on the partition key. Thus all records with the same partition key will be in the same partition, two records with different partition key are partitioned independent of each other and are unlikely to be in the same partition.

It is also important to select the number of partitions appropriately. In RonDB a partition is located within one LDM thread, only this LDM thread can perform writes to the data in the partition. Reads can be performed by many LDM threads, up to 8 different LDM threads could perform reads on a partition in parallel.

Selection of number of partitions is a trade off, increasing the number of partitions means that a table can handle more writes per second. Each partition will be able to handle about 100.000 writes per second (more with small rows and less with very large rows). However increasing the number of partitions can have a negative impact on scan performance.

The best scans to perform are scans where the partition key is a single value. In this case the scan is performed in a single partition and this scales very well. We call this partition pruned scans. These are not affected negatively by the number of partitions.

Scans that don’t use the partition key in the WHERE clause will require the scan to be performed in all partitions. This means that the startup cost of a scan will be multiplied by the number of partitions.

Thus the default setting tries to find a balance between write throughput and scan performance.

Partition Key#

The default partitioning key is the primary key. If no primary key exists a unique key is used as primary key. If neither a primary key nor a unique key is defined on the table a primary key is constructed with an extra hidden column added, this column has a unique value that is a sequence number.

The MySQL Server supports specifying partitioning. With RonDB we only support one variant of partitioning and this is PARTITION BY KEY. It is possible to specify PARTITION BY KEY() in which case the primary key is used. Otherwise a list of columns is defined as the partitioning key. These columns must be a subset of the columns used for the primary key.

It is often a good idea to be careful in selecting the partitioning key. Often it is good to have the same partitioning key in a set of tables. A good example of this is the TPC-C benchmark. All of the tables in TPC-C (except for the item table that describes the products) has a warehouse id as part of the table. The customers are linked to a warehouse, a warehouse is connected to a district, orders and order lines comes from warehouses and stock is also per warehouse.

In the TPC-C benchmark it is natural to use the warehouse id as the partitioning key. In todays world of big data it is very common to use a sharding key to split up the application. To find the proper partitioning key for a set of tables is the same problem as when selecting a sharding key for a set of databases that works together in a shard.

In HopsFS that implements a meta data server for a distributed file system on top of RonDB the partitioning key for the inodes table is id of the parent inode. Thus all inodes of a certain directory are all in the same partition and can use partition pruned scans.

RonDB has a number of similarities with a sharded set of databases. The main difference is that RonDB is much more tightly integrated and supports queries and transactions that spans many shards.

Number of partitions per table, PartitionsPerNode#

By default the number of partitions is set to provide a decent number of ldm threads to handle the writes, but also to avoid increasing the cost of range scans not using the partition key. This is controlled by the configuration parameter PartitionsPerNode.

In RonDB we try to make it possible to use a distributed DBMS even when the application isn’t perfectly written for scalability. This means that we will create 2 partitions times the number of nodes in the cluster. It is possible to change this by setting PartitionsPerNode in the data node configuration.

The number of nodes in RonDB includes all data nodes, also those that currently are not active through setting NodeActive to 0 in configuration. In Managed RonDB we always configure clusters with 3 replicas and if the user only wants to use 2 replicas we will set one of the nodes in each node group to inactive. Later the 3rd node in the node group can be added to the RonDB cluster as an online operation.

RonDB allows all LDM threads to act as query threads to ensure that we still can spread execution on all CPUs accessible in the RonDB data node. Query threads can read partitions from other LDM threads, thus reads can scale to millions of reads per second even with a single partition.

It is possible to create a table with a specific number of partitions. In this case it is possible to add a specific number of partitions as an online operation.

The variant that is recommended is to use the keyword PARTITION_BALANCE to set a balanced number of partitions. We will describe these in a section below after some introductory remarks on how we partition.

Adding a node group and reorganising a table#

If we add another node group this means that we will add another set of partitions. To perform this change one uses the command below on each table.

ALTER TABLE tab_name ALGORITHM=inplace, REORGANIZE PARTITION;

HashMap Partitioning#

One important feature of RonDB is the ability to add new nodes to the cluster with the cluster up and running. In order to take advantage of these new nodes we must be able to perform an online reorganisation of the tables in the cluster. What this means is that we need to add more partitions to the cluster while the cluster is fully operational and can handle both reads and writes.

In order to ensure that such an online reorganisation can execute without requiring extra memory in the already existing nodes a scheme was invented called hashmap partitioning.

The partitioning happens by calculating a hash value based on the partitioning key values. Normally this value would be using linear hashing to decide which partition a row is stored in. However the linear hashing algorithm is not sufficiently balanced for our requirements. It will grow nicely, but it will not grow in steps that are adequate for our needs.

Another layer was invented called a hashmap. Each row is mapped into a hashmap, by default there are e.g. 3840 hashmaps in a table. If we have 4 partitions each partition will have 960 hashmaps. If we decide to add 4 more partitions there will be 8 partitions and in that case we will instead have 480 hashmaps per partition. Each partition will need to move 480 hashmaps to another partition (thus half of the data).

The hashmaps are balanced as long as the number of partitions is evenly divisible with the number of hashmaps. The number 3840 is equal to 2*2*2*2*2*2*2*2*3*5. For a large number of selections of number of node groups and number of ldm threads in RonDB it will be divisible with 3840 although not all. If they are not divisible it means a small imbalance (and thus loss of memory in some nodes that are less occupied than others). The impact is small even when there is an imbalance.

Considerations for selecting number of partitions#

It makes a lot of sense to decrease the number of partitions in a table where range scans are often performed on all partitions. Assume we run on a large cluster with 8 data nodes where we have set PartitionsPerNode to 16 and we have 16 ldm threads per data node. In this case a default table distribution maps the table into 128 partitions. For primary key reads and writes this have no negative impact, for partition pruned index scans neither. If you have written a highly scalable application this have very little consequence and your writes will scale to handle tens of millions writes per second.

So in other words if you have written an application that makes use of the sharding key (== partition key) in all queries, the application is scalable even with very many partitions.

For an application that uses a lot of queries that don’t define the sharding key on index scans and the scan is looking for small amount of data the overhead will be high since there is a startup cost of each ordered index scan. In this example one needs to execute 128 index scans, one on each partition.

Startup cost of a range scans means traversing the index tree to find the first row to scan in the range. Finding the next row is cheaper since this simply follows the links in the index tree. Startup cost of full table scan is cheaper but still requires a fair amount of setup of variables and sending and receiving messages on each partition.

In this case it could make sense to decrease the amount of partitions in the table. We can do this by changing the PartitionsPerNode to a smaller number. This will affect all new tables created.

By setting the partition balance we can adjust the number of partitions in a specific table. This is very useful when we don’t want to change the default setting of PartitionsPerNode but have a few tables that need specific settings of number of partitions.

Selecting a high number of partitions means that we spread the load of accessing the table evenly among the ldm threads. Thus we minimise the risk of any CPU bottlenecks.

Selecting a high number of partitions increases the overhead of each index scan that is not pruned to one partition and similarly for full table scans. The choice of partitioning scheme is a balance act between balanced CPU usage and minimum overhead per scan.

The number of partitions can also have an impact on recovery times, having more partitions means more CPUs can be used to rebuild indexes in the restart phase.

To change the number of partitions for a single table one can either use a specific number of partitions when creating the table or using partition balance which we will cover in the next section.

Partition Balance#

We have two dimensions of balance. We can balance for usage with Read Primary replica, thus we want primary replicas to be balanced among the nodes and ldm threads. We can balance for usage with Read Any replica and in this case we balance the replicas among the node groups and ldm threads.

The second dimension is whether to balance one ldm threads or whether to only balance between the nodes/node groups.

More options on this can be found in the chapter on Node Groups.

FOR_RP_BY_LDM#

This is the default mechanism, it creates number of partitions as PartitionsPerNode times the number of RonDB data nodes,

FOR_RA_BY_LDM#

This creates number of partitions as PartitionsPerNode times the number of node groups.

FOR_RP_BY_NODE#

This mechanism stores one primary fragment replica in one ldm thread in each node. Thus the total number of partitions is the number of nodes.

FOR_RA_BY_NODE#

This is the minimal mechanism, it stores one primary fragment replica in one ldm thread per node group. Thus the total number of partitions is the number of node groups.

This partitioning option is the one that gives the least amount of partitions. Thus it is very useful for tables that have mostly index scans on all partitions and for small tables rarely used but still desiring balance among the nodes in the cluster.

It is useful to decrease overhead of many small tables in the cluster while still maintaining a balance of memory usage on all nodes in the cluster.

Thus in the case of 4 nodes with and 2 replicas we will have 2 partitions per table.

Other FOR_RP_BY_LDM settings#

It is also possible to use FOR_RP_BY_LDM_X_2, FOR_RP_BY_LDM_X_4, FOR_RP_BY_LDM_X_6, FOR_RP_BY_LDM_X_8 and FOR_RP_BY_LDM_X_16.

E.g. FOR_RP_BY_LDM_X_8 means that we will create:

PartitionsPerNode * number of nodes * 8 partitions. Thus by default 48 partitions per node group if we have 3 replicas.

Other FOR_RA_BY_LDM settings#

It is also possible to use FOR_RA_BY_LDM_X_2, FOR_RA_BY_LDM_X_3 and FOR_RA_BY_LDM_X_4.

E.g. FOR_RA_BY_LDM_X_2 means that we will create:

PartitionsPerNode * number of node groups * 2 partitions. Thus by default 4 partitions per node group independent of number of replicas.

Syntax for Partition Balance#

This table option is only settable in the COMMENT section when you create a table and when you alter a table.

COMMENT="NDB_TABLE=PARTITION_BALANCE=FOR_RA_BY_NODE"

It is possible to set several properties in the comment section at the same time like this:

COMMENT="NDB_TABLE=READ_BACKUP=1, PARTITION_BALANCE=FOR_RA_BY_NODE"

Going from a partition balance to another can be done as an online alter table operation if the number of partitions increase. It will be a copying alter table statement if the number of partitions decrease.

The partitioning balance can be used for fully replicated tables. It is not possible to change the PARTITION_BALANCE as an online alter table option for fully replicated tables. It is possible to add new node groups to the fully replicated table as an online alter table statement.

Setting explicit number of partitions#

As mentioned previously it is still ok to explicitly set the number of partitions. This provides no guarantees on balance, but we will still attempt to balance the load among nodes and ldm threads as much as possible.

Once a table have set a specific number of partitions it cannot be changed to any other partitioning option as an online alter table statement. It cannot be reorganized using the REORGANIZE keyword. It can increase the number of explicit partitions as an online alter table statement.

Thus the setting of explicit number of partitions gives the user complete control over the number of partitions per table if this control is desirable. It does not provide control of the placement of these partitions.

The PARTITION_BALANCE options makes it possible to control the number of partitions even when adding new node groups and still maintaining a good balance between the nodes and the ldm threads.

Setting explicit number of partitions is done using the normal partitioning syntax. Adding e.g PARTITIONS 4 after specifying the PARTITION BY KEY and the list of fields for key partitioning gives explicit number of partitions.

No REDO logging#

Normally all tables are fully recoverable. There are applications where data changes so rapidly that it doesn’t make sense to recover the data. An example is a stock exchange application where data is changed hundreds of thousands of times per second and the amount of data is very small. Thus in the case of a cluster crash and returning a few minutes later the data is no longer current.

In this case we can optimise the table by removing writes to the REDO log and removing writes to the local checkpoints for the table. This removes around 30% of the overhead in performing updates on the table.

NOLOGGING syntax#

To enable this feature can be done as a table option. This table option can not be changed as an online alter table statement. It uses the COMMENT section in the same fashion as the read backup, partition balance and fully replicated features.

COMMENT="NDB_TABLE=NOLOGGING=1"

By setting the MySQL configuration option ndb-table-no-logging to one we ensure that all tables created in this connection will use the NOLOGGING feature.

Schema Considerations#

We’ve already mentioned that all primary keys and unique keys by default define an extra ordered index on top of the distributed hash index that is always part of a primary key and unique key. Adding USING HASH to an index definition is a method to avoid the ordered index if it isn’t needed.

We’ve discussed to use disk columns for columns that you don’t expect to add any indexes to. This is useful if you have a fast disk, such as SSDs or NVMe’s to store your tablespaces in. Using hard drives is technically ok, but the difference between the access time to a hard drive and the access time to memory is so great that it is very unlikely that the user experience will be any good.

Nowadays there are disks that can handle many, many thousands of IO operations per seconds (IOPS) and even millions of IO operations in a single server is achievable, while at the same time the access time is measured in tens of microseconds. These are useful even when compared to using memory.

We’ve discussed the possibility to use fully replicated tables in some cases to get faster access for read and various partitioning variants.

One more consideration that we haven’t mentioned is to consider the recovery algorithms.

In RonDB all deletes, updates and inserts are written to a REDO log. This REDO log is a logical change log, thus the amount of log information is proportional to the actual change made. Only the changed columns are logged in the REDO log.

In addition we execute local checkpoints, these will write all the rows at checkpoint time. Rows that are updated very often will only be written once per checkpoint. The checkpoint always writes the full row.

Similarly for disk columns we write an UNDO log where the size of what we write is dependent on the total column size. The consequence of this is that the amount of data we write to the checkpoints is dependent on the size of the rows.

This means that if we have information that is very often updated there can be reasons to put this data in a separate table to avoid having to checkpoint columns that are mostly read-only. Naturally there can be many reasons for those columns to be in the same table as well and for most applications this particular issue is not a concern at all. As with any optimisation one should measure and ensure that it is worthwhile using it before applying it.