A default table using the NDB storage engine will be created such that we get balanced reads when we always read the primary replica. In RonDB we introduced the use of query threads as the default setup. This means that a table always 2 partitions per node in the cluster. Query threads can perform committed reads when not disk columns are involved. RonDB 21.10 extends this to also handle locked reads in query threads.
There are options available when creating the table that makes it possible to change these defaults.
Read Backup Feature#
The two-phase commit transaction protocol in RonDB runs in 3 phases:
Prepare: Lock rows and write data
Commit: Add a Global Checkpoint Index (GCI) to the written data
Complete: Release locks and transaction memory
Within each row, each phase runs in linear order between the transaction coordinator (TC) and replicas. I.e. the TC passes the prepare message to the primary replica which passes it to the first backup replica and so on. The commit phase is run in reverse linear order - from the "last" backup replica to the primary replica. Since the primary knows that all other replicas have also committed, it can safely release the locks during the commit phase.
The default lock-less Read Committed read in RonDB will always read the latest committed row. However, a locked row is not considered committed. Therefore, reading an updated row after the commit phase will return different results depending on the replica that is read. Only reading from the primary will guarantee read-after-write consistency.
If one wants read-after-write consistency for all replicas, one must send the transaction acknowledgment (ACK) after the complete phase. This is called the Read Backup feature, which is settable as a table option and default in RonDB.
Since replicas can be placed in different geographical locations, being able to read from backup replicas can decrease the latency of the read. The latency can also decrease if the read spans multiple partitions. One does not have to query the primary for each partition. On the other hand, returning an ACK directly after the commit phase will decrease the latency of writes.
The following table lists the effects of the different options. The choice of architecture essentially becomes a question of whether to prioritize reads or writes.
|ACK after commit phase
|ACK after complete phase
|Only Primary Reads
This is the NDB original.
|Primary or Backup Reads
This is the Read Backup feature. It is the RonDB default.
An example application is an application using SQL that has two data
nodes, two MySQL Servers and each MySQL Server is colocated with one
data node. Using the Read Backup feature, all
SELECT queries can be
completely localized to one computer and we have no network bandwidth
worries for the application. Only updates traverse the network to the
other node. If a table experiences a lot of writes, using this feature
may however be questionable.
Whether the Read Backup feature is disabled or not, the throughput of writes is not affected.
Read Backup syntax#
Note using the Read Backup option means that updating transactions has a
slightly lower latency. In this case, the below syntax can be used to
disable the Read Backup option. When creating the table the following
should be placed in the comment section of the
CREATE TABLE statement.
It is also possible to set this as a MySQL configuration option. By
settting the configuration option ndb-read-backup to 1 all tables
created in the MySQL Server will be using this property (it will not
affect tables that are changed using
ALTER TABLE). Setting it to 0
means that tables will not use the read backup option.
It is also possible to set this property in an
ALTER TABLE statement.
If only this feature is changed in the
ALTER TABLE statement the
change will be an online alter table statement.
When used in an
ALTER TABLE statement only properties changed in the
COMMENT section will be changed. E.g. if the
READ_BACKUP feature was
previously set and we now changed another property in the comment
section, it will not affect the value of the
ALTER TABLE statement.
Fully replicated tables#
A similar feature to Read Backup is fully replicated tables. A fully replicated table means that the table will have one replica in each node of the cluster instead of one replica per node in a node group. If we have an 8-node cluster with 4 node groups with 2 nodes in each, we would normally have splitted the table into at least 4 parts (possibly more parts). However for a fully replicated table each node group will contain the full table and all data will be fully replicated in all 8 nodes of the cluster.
The table is implemented as a normal table where updates use an internal triggering that always first updates the first node group and then an internal trigger ensures that all the other node groups also receive the update. The changes are always transactional, thus we always guarantee that we write all available replicas.
This is very nice for clusters that want to scale reads. We can have up to 48 data nodes that are colocated with MySQL Servers using a shared memory transporter thus scaling reads to many millions of SQL queries towards the shared data. We can still handle hundreds of thousands of updates on this data per second.
Another type of usage for fully replicated tables is so called dimensional tables, small fact tables that are often used in SELECT statements but rarely updated and used together with large tables containing references to these smaller tables.
Fully replicated tables gives a lot more options for how to design scalable applications using RonDB. In particular we expect it to enhance performance when executing complex join operations that makes use of the pushdown join feature presented in a later chapter.
Fully replicated table syntax#
When creating the table one uses the comment section to set the fully
replicated table property as shown here in the
CREATE TABLE or
ALTER TABLE statement.
It is possible to set it as a configuration option in the MySQL Server with the ndb-fully-replicated variable. This will ensure that all tables in the MySQL Server use the fully replicated table property when creating new tables while this configuration option is set. This option will have no effect on ALTER TABLE statements.
Setting a table to be fully replicated also means that it will have the read backup property set. Reads on a fully replicated table can use any node in the cluster for reading.
When setting this feature in a table using ALTER TABLE it will use a copying alter table statement.
It is possible to reorganise the fully replicated table to handle new
added node groups as an online alter table using the command
ALTER TABLE algorithm=inplace, REORGANIZE. No downtime is needed to
scale the cluster to use new node groups for fully replicated tables
(similar with normal tables).
Partitioning of tables#
RonDB is a distributed DBMS, thus all tables are always partitioned even if not specifically declared to be so. The default partitioning key is the primary key. If no primary key exists a unique key is used as primary key. If neither a primary key nor a unique key is defined on the table a primary key is constructed with an extra hidden column added, this column has a unique value that is a sequence number.
The MySQL Server supports specifying partitioning. With RonDB we only support one variant of partitioning and this is PARTITION BY KEY. It is possible to specify PARTITION BY KEY() in which case the primary key is used. Otherwise a list of columns is defined as the partitioning key. These columns must be a subset of the columns used for the primary key.
It is often a good idea to be careful in selecting the partitioning key. Often it is good to have the same partitioning key in a set of tables. A good example of this is the TPC-C benchmark. All of the tables in TPC-C (except for the item table that describes the products) has a warehouse id as part of the table. The customers are linked to a warehouse, a warehouse is connected to a district, orders and order lines comes from warehouses and stock is also per warehouse.
In the TPC-C benchmark it is natural to use the warehouse id as the partitioning key. In todays world of big data it is very common to use a sharding key to split up the application. To find the proper partitioning key for a set of tables is the same problem as when selecting a sharding key for a set of databases that works together in a shard.
In HopsFS that implements a meta data server for a distributed file system on top of RonDB the partitioning key for the inodes table is id of the parent inode. Thus all inodes of a certain directory are all in the same partition.
RonDB has a number of similarities with a sharded set of databases. The main difference is that RonDB is much more tightly integrated and supports queries and transactions that spans many shards.
By default the number of partitions is set to provide a decent number of ldm threads to handle the writes, but also to avoid increasing the cost of range scans not using the partition key.
In RonDB we try to make it possible to use a distributed DBMS even when the application isn’t perfectly written for scalability. This means that we will create 2 partitions times the number of nodes in the cluster. It is possible to change this by setting PartitionsPerNode in the data node configuration.
RonDB introduces query threads to ensure that we still can spread execution on all CPUs accessible in the RonDB data node.
If we add another node group this means that we will add another set of partitions. To perform this change one uses the command below on each table.
It is possible to create a table with a specific number of partitions. In this case it is possible to add a specific number of partitions as an online operation.
The variant that is recommended is to use the keyword PARTITION_BALANCE to set a balanced number of partitions. We will describe these in a section below after some introductory remarks on how we partition.
One important feature of RonDB is the ability to add new nodes to the cluster with the cluster up and running. In order to take advantage of these new nodes we must be able to perform an online reorganisation of the tables in the cluster. What this means is that we need to add more partitions to the cluster while the cluster is fully operational and can handle both reads and writes.
In order to ensure that such an online reorganisation can execute without requiring extra memory in the already existing nodes a scheme was invented called hashmap partitioning.
The partitioning happens by calculating a hash value based on the partitioning key values. Normally this value would be using linear hashing to decide which partition a row is stored in. However the linear hashing algorithm is not sufficiently balanced for our requirements. It will grow nicely, but it will not grow in steps that are adequate for our needs.
Another layer was invented called a hashmap. Each row is mapped into a hashmap, by default there are e.g. 3840 hashmaps in a table. If we have 4 partitions each partition will have 960 hashmaps. If we decide to add 4 more partitions there will be 8 partitions and in that case we will instead have 480 hashmaps per partition. Each partition will need to move 480 hashmaps to another partition (thus half of the data).
The hashmaps are balanced as long as the number of partitions is evenly divisible with the number of hashmaps. The number 3840 is equal to 2*2*2*2*2*2*2*2*3*5. For a large number of selections of number of node groups and number of ldm threads in RonDB it will be divisible with 3840 although not all. If they are not divisible it means a small imbalance (and thus loss of memory in some nodes that are less occupied than others). The impact is small even when there is an imbalance.
Considerations for selecting number of partitions#
It makes a lot of sense to decrease the number of partitions in a table where range scans are often performed on all partitions. Assume we run on a large cluster with 8 data nodes where each data node contains 16 LDM threads. In this case a default table distribution maps the table into 128 partitions. For primary key reads and writes this have no negative impact, for partition pruned index scans neither. If you have written a highly scalable application this have very little consequence.
So in other words if you have written an application that makes use of the sharding key (== partition key) in all queries, the application is scalable even with very many partitions.
For an application that uses a lot of queries that don’t define the sharding key on index scans and the scan is looking for small amount of data the overhead will be high since there is a startup cost of each ordered index scan. In this example one needs to execute 256 index scans, one on each partition.
In this case if the table is small and the table isn’t a major portion of the CPU load, it could make sense to decrease the amount of partitions in the table.
By setting the partition balance we can adjust the number of partitions in a table.
Selecting a high number of partitions means that we spread the load of accessing the table evenly among the ldm thread. Thus we minimise the risk of any CPU bottlenecks.
Selecting a high number of partitions increases the overhead of each index scan that is not pruned to one partition and similarly for full table scans. The choice of partitioning scheme is a balance act between balanced CPU usage and minimum overhead per scan.
Selecting the default mechanism also provides a possibility to use extra CPU threads to rebuild ordered index as part of a restart. The default mechanism is the natural selection for a highly scalable application. But for less scalable applications and special tables it might make sense to change the partitioning scheme to a scheme that have a smaller overhead for less optimised queries.
In RonDB we have introduced query threads that can be used for read queries to minimise the problems with partitioning. PartitionsPerNode can be used to define a higher or lower amount of partitions on a cluster scale.
To change the number of partitions for a single table one can either use a specific number of partitions when creating the table or using partition balance.
We have two dimensions of balance. We can balance for usage with Read Primary replica, thus we want primary replicas to be balanced among the nodes and ldm threads. We can balance for usage with Read Any replica and in this case we balance the replicas among the node groups and ldm threads.
The second dimension is whether to balance one ldm threads or whether to only balance between the nodes/node groups.
One primary partition per each LDM in cluster, FOR_RP_BY_LDM#
This is the default mechanism, it creates number of partitions as PartitionsPerNode times the number of nodes,
One primary partition per LDM per node group, FOR_RA_BY_LDM#
, this creates number of partitions as PartitionsPerNode times the number of node groups.
One primary partition per node, FOR_RP_BY_NODE#
This mechanism stores one primary fragment replica in one ldm thread in each node. Thus the total number of partitions is the number of nodes.
One primary partition per node group, FOR_RA_BY_NODE#
This is the minimal mechanism, it stores one primary fragment replica in one ldm thread per node group. Thus the total number of partitions is the number of node groups.
This partitioning option is the one that gives the least amount of partitions. Thus it is very useful for tables that have mostly index scans on all partitions and for small tables rarely used but still desiring balance among the nodes in the cluster.
It is useful to decrease overhead of many small tables in the cluster while still maintaining a balance of memory usage on all nodes in the cluster.
Thus in the case of 4 nodes with 4 ldm threads and 2 replicas we will have 2 partitions per table.
Syntax for Partition Balance#
This table option is only settable in the COMMENT section when you create a table and when you alter a table.
It is possible to set several properties in the comment section at the same time like this:
Going from a partition balance to another can be done as an online alter table operation if the number of partitions increase. It will be a copying alter table statement if the number of partitions decrease.
The partitioning balance can be used for fully replicated tables. It is not possible to change the PARTITION_BALANCE as an online alter table option for fully replicated tables. It is possible to add new node groups to the fully replicated table as an online alter table statement.
Setting explicit number of partitions#
As mentioned previously it is still ok to explicitly set the number of partitions. This provides no guarantees on balance, but we will still attempt to balance the load among nodes and ldm threads as much as possible.
Once a table have set a specific number of partitions it cannot be changed to any other partitioning option as an online alter table statement. It cannot be reorganized using the REORGANIZE keyword. It can increase the number of explicit partitions as an online alter table statement.
Thus the setting of explicit number of partitions gives the user complete control over the number of partitions per table if this control is desirable. It does not provide control of the placement of these partitions.
The PARTITION_BALANCE options makes it possible to control the number of partitions even when adding new node groups and still maintaining a good balance between the nodes and the ldm threads.
Setting explicit number of partitions is done using the normal partitioning syntax. Adding e.g PARTITIONS 4 after specifying the PARTITION BY KEY and the list of fields for key partitioning gives explicit number of partitions.
No REDO logging#
Normally all tables are fully recoverable. There are applications where data changes so rapidly that it doesn’t make sense to recover the data. An example is a stock exchange application where data is changed hundreds of thousands of times per second and the amount of data is very small. Thus in the case of a cluster crash and returning a few minutes later the data is no longer current.
In this case we can optimise the table by removing writes to the REDO log and removing writes to the local checkpoints for the table. This removes around 30% of the overhead in performing updates on the table.
To enable this feature can be done as a table option. This table option can not be changed as an online alter table statement. It uses the COMMENT section in the same fashion as the read backup, partition balance and fully replicated features.
By setting the MySQL configuration option ndb-table-no-logging to one we ensure that all tables created in this connection will use the NOLOGGING feature.
We’ve already mentioned that all primary keys and unique keys by default define an extra ordered index on top of the distributed hash index that is always part of a primary key and unique key. Adding USING HASH to an index definition is a method to avoid the ordered index if it isn’t needed.
We’ve discussed to use disk columns for columns that you don’t expect to add any indexes to. This is useful if you have a fast disk, such as SSDs or NVMe’s to store your tablespaces in. Using hard drives is technically ok, but the difference between the access time to a hard drive and the access time to memory is so great that it is very unlikely that the user experience will be any good.
Nowadays there are disks that can handle many, many thousands of IO operations per seconds (IOPS) and even millions of IO operations in a single server is achievable, while at the same time the access time is measured in tens of microseconds. These are useful even when compared to using memory.
We’ve discussed the possibility to use fully replicated tables in some cases to get faster access for read and various partitioning variants.
One more consideration that we haven’t mentioned is to consider the recovery algorithms.
In RonDB all deletes, updates and inserts are written to a REDO log. This REDO log is a logical change log, thus the amount of log information is proportional to the actual change made. Only the changed columns are logged in the REDO log.
In addition we execute local checkpoints, these will write all the rows at checkpoint time. Rows that are updated very often will only be written once per checkpoint. The checkpoint always writes the full row.
Similarly for disk columns we write an UNDO log where the size of what we write is dependent on the total column size. The consequence of this is that the amount of data we write to the checkpoints is dependent on the size of the rows.
This means that if we have information that is very often updated there can be reasons to put this data in a separate table to avoid having to checkpoint columns that are mostly read-only. Naturally there can be many reasons for those columns to be in the same table as well and for most applications this particular issue is not a concern at all. As with any optimisation one should measure and ensure that it is worthwhile using it before applying it.