Routing of reads and writes

An important part of a distributed DBMS is how reads and writes are routed among the various nodes of the cluster.

In RonDB we distribute data using a hash algorithm together with a hash map. Each partition of each table is fully replicated within one node group of the cluster. Each partition is also handled by one ldm thread in each node of this node group. The location of data is decided once the table has been created.
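As an illustration, here is a minimal sketch of this kind of hash-based distribution. The mapping tables and sizes below are hypothetical, and std::hash is only a stand-in for RonDB's actual hash function:

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: a key is hashed into a fixed-size hash map, each
// bucket maps to a partition, and each partition (with all its replicas)
// belongs to exactly one node group.
struct DistributionSketch {
  std::vector<uint32_t> bucket_to_partition;    // hash map bucket -> partition
  std::vector<uint32_t> partition_to_nodegroup; // partition -> node group

  uint32_t partition_for(const std::string &key) const {
    // std::hash is only a stand-in for RonDB's real hash function.
    size_t bucket = std::hash<std::string>{}(key) % bucket_to_partition.size();
    return bucket_to_partition[bucket];
  }

  uint32_t node_group_for(const std::string &key) const {
    return partition_to_nodegroup[partition_for(key)];
  }
};

int main() {
  // Tiny made-up example: 8 hash map buckets, 4 partitions, 2 node groups.
  DistributionSketch d{{0, 1, 2, 3, 0, 1, 2, 3}, {0, 1, 0, 1}};
  std::printf("key 'abc' -> partition %u, node group %u\n",
              (unsigned)d.partition_for("abc"), (unsigned)d.node_group_for("abc"));
  return 0;
}
```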

However, reads can be routed differently if we are using the Read Backup feature or the fully replicated feature.

The transaction coordinator can be placed in any node in the cluster. Any MySQL Server can be chosen to execute a query in RonDB.

Choosing the MySQL Server

With RonDB every MySQL Server is capable of handling any transaction. There is no need to choose a special MySQL Server for read and write transactions. We have special MySQL Servers for replicating between clusters, and it is possible (although not required) to have special MySQL Servers for handling metadata changes. But there is no need to use a special MySQL Server for normal read and write transactions.

The natural choice here is some sort of round robin scheme. There are many ways to implement such a round robin scheme. One method is to use facilities in the MySQL connector APIs; for example, the JDBC API for MySQL supports listing several MySQL Servers and performing a round robin over them.
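For an application that keeps its own list of MySQL Servers, the round robin itself is simple. Below is a minimal sketch (the Endpoint and RoundRobinSelector names are made up for illustration); the selected endpoint would then be passed to whatever MySQL connector the application uses:

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical client-side round robin over a set of MySQL Server endpoints.
struct Endpoint {
  std::string host;
  uint16_t port;
};

class RoundRobinSelector {
 public:
  explicit RoundRobinSelector(std::vector<Endpoint> endpoints)
      : endpoints_(std::move(endpoints)) {}

  // Each call returns the next MySQL Server in a thread-safe round robin.
  const Endpoint &next() {
    size_t i = counter_.fetch_add(1, std::memory_order_relaxed);
    return endpoints_[i % endpoints_.size()];
  }

 private:
  std::vector<Endpoint> endpoints_;
  std::atomic<size_t> counter_{0};
};
```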

The second option is to use the MySQL Router.

One more option is to use a standard load balancer in front of the MySQL Servers. The set of MySQL Servers then has a single IP address and a single port to access. In e.g. Oracle Cloud Infrastructure this is a standard service that you can connect to your infrastructure, and it is possible to set the load balancer up such that it is highly available.

If you are developing an application that uses the NDB API directly, you need a similar mechanism on top of your application nodes.

From a MySQL client we have no way of guessing the best MySQL Server to use. This requires knowledge only available through the NDB API.

The principle for best scaling is to spread the load evenly among the MySQL Servers in the cluster.

Choosing the node of the transaction coordinator

It is possible to choose any node as transaction coordinator, but there is great value in choosing one that is close to the data of the transaction. Choosing the transaction coordinator based on locality minimises the latency of the transaction and avoids, as far as possible, oversubscribed network links that can become bottlenecks.

Choosing at NDB API level

Using the NDB API it is possible to provide a hint about where to place the transaction coordinator. The hint consists of the table object and the primary key used for the access. Given the primary key and the table object we can calculate the partition identity, and from this and the table object we can deduce the nodes where the replicas are stored. This is not exact knowledge since e.g. an ALTER TABLE REORGANIZE can change it on the fly without immediately informing the NDB API. It is only a hint; the impact of an erroneous hint is that the transaction will take slightly longer to execute.
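A minimal sketch of providing such a hint is shown below, using the Ndb::startTransaction() variant that takes a table object and a key. The connect string, database, table name and key value are made-up examples and error handling is kept minimal:

```cpp
#include <NdbApi.hpp>
#include <cstdio>

int main() {
  ndb_init();
  {
    // Connect string, database, table and key value are made-up examples.
    Ndb_cluster_connection connection("mgmd-host:1186");
    if (connection.connect(4, 5, 1) != 0 ||
        connection.wait_until_ready(30, 0) < 0) {
      std::fprintf(stderr, "could not connect to cluster\n");
      return 1;
    }

    Ndb ndb(&connection, "test_db");
    if (ndb.init() != 0) return 1;

    const NdbDictionary::Table *tab = ndb.getDictionary()->getTable("t1");
    if (tab == nullptr) return 1;

    // The primary key we intend to access. Passing the table object and the
    // key to startTransaction() hints that the transaction coordinator should
    // be placed on a node holding a replica of this row.
    int pk = 17;
    NdbTransaction *trans = ndb.startTransaction(
        tab, reinterpret_cast<const char *>(&pk), sizeof(pk));
    if (trans == nullptr) return 1;

    // ... define and execute read/write operations on the transaction here ...

    ndb.closeTransaction(trans);
  }
  ndb_end(0);
  return 0;
}
```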

A key hint provided to start the transaction

We will select one of the data nodes that have a replica. The selection depends on a few things: whether we are using the read backup feature and whether we are using the fully replicated feature.

If the table has the read backup feature we can select any of the data nodes with a replica. We will select the one that is considered closest. Closeness is based on a configuration parameter called Group on each transporter (this parameter should normally not be set by the user). By default this is set to 55 on a TCP transporter. When we start an API node and the transporter is a TCP transporter, we check whether the transporter can use a localhost connection; if it can, the Group is dropped to 54. The Group can be lowered all the way to 5 if we have set the node to be our neighbour, which is done by setting the MySQL Server configuration parameter --ndb-data-node-neighbour to the node id of our closest data node.

All of this assumes that --ndb-optimized-node-selection is set to 3, which is the default value. This parameter exists to provide backwards-compatible behaviour with older versions and should normally not be changed.

In short, we select the node among the replicas, prioritising first our data node neighbour, next any node that can be reached over localhost, and finally any other node.
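The priority order can be illustrated with the Group values mentioned above (5 for the configured data node neighbour, 54 for a localhost connection, 55 for a default TCP transporter). The following is only an illustration of that ordering, not RonDB's actual code, and the struct and function names are made up:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustration only: rank replica nodes by the "Group" proximity values
// described above and pick the lowest one.
struct ReplicaNode {
  uint32_t node_id;
  bool is_neighbour;   // set via --ndb-data-node-neighbour
  bool via_localhost;  // transporter can use a localhost connection
};

static uint32_t proximity_group(const ReplicaNode &n) {
  if (n.is_neighbour) return 5;    // configured data node neighbour
  if (n.via_localhost) return 54;  // reachable via localhost
  return 55;                       // default TCP transporter
}

// Select the closest node among the nodes holding a replica of the row.
// Assumes the replica list is non-empty.
uint32_t select_tc_node(const std::vector<ReplicaNode> &replicas) {
  return std::min_element(replicas.begin(), replicas.end(),
                          [](const ReplicaNode &a, const ReplicaNode &b) {
                            return proximity_group(a) < proximity_group(b);
                          })->node_id;
}
```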

If the table is fully replicated we will perform the same algorithm, but this time using all nodes that have a replica (normally all data nodes in the cluster).

No key was provided

When no key is provided we have no hint, so we select a node based on its closeness as described above. This time the nodes to choose among are the full set of data nodes in the cluster.

Choosing at the MySQL Server level

In the MySQL Server the hint mechanism of the NDB API is integrated in the following manner: when starting a transaction, the first query decides the key to provide in the hint. If the primary key of a table is given in the first query, we use it as the key in the hint mechanism. For range scans, full table scans and pushdown joins no key is provided.
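As an illustration, consider the following sketch using the MySQL C API (host, credentials, table and key values are made up). Because the first statement of the transaction looks up a row by primary key, that key can serve as the hint for placing the transaction coordinator:

```cpp
#include <mysql.h>
#include <cstdio>

int main() {
  MYSQL *conn = mysql_init(nullptr);
  // Made-up host, credentials and schema.
  if (mysql_real_connect(conn, "127.0.0.1", "app", "secret", "test_db",
                         3306, nullptr, 0) == nullptr) {
    std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
    return 1;
  }

  mysql_query(conn, "BEGIN");

  // First query uses the primary key (id = 17); this key decides the hint.
  if (mysql_query(conn, "SELECT balance FROM accounts WHERE id = 17") == 0) {
    MYSQL_RES *res = mysql_store_result(conn);
    mysql_free_result(res);
  }

  // Later statements in the same transaction do not influence the hint.
  mysql_query(conn, "UPDATE accounts SET balance = balance - 10 WHERE id = 17");
  mysql_query(conn, "COMMIT");

  mysql_close(conn);
  return 0;
}
```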

Choosing the tc thread

The tc thread is chosen after the node has been chosen, using a round robin scheme.

Choosing the data node for reading

For tables without the read backup feature we always read from the primary replica, with the exception of some BLOB reads as explained in the BLOB chapter.

For tables with the read backup feature we first attempt to read from a replica residing in the data node where the transaction coordinator is placed. Otherwise we read the primary replica.

Choosing the ldm thread

The ldm thread is determined by the row that is accessed. For a given partition the same ldm thread is used in all nodes of the node group. Thus there is nothing to choose here.

Choosing the data nodes for writing

This is completely determined by the row to write. We always write to all available replicas, and these replicas have a strict order where one replica is the primary replica and the others are backup replicas organised in a fixed order. Thus there is nothing to choose here.