
Multi-Site RonDB#

RonDB is designed for Local Area Network (LAN) environments. It is possible to have nodes a few kilometers away from each other and still get a good user experience as is the case within a cloud region.

There are a number of reasons to make use of multiple clusters that contain the same data. Each cluster uses synchronous replication between nodes in the cluster. Global Replication uses asynchronous replication between the clusters.

A few reasons to use Global Replication are:

  1. The application must survive even earthquakes and other such drastic problems. An earthquake will impact a fairly large area, and other types of problems such as local conflicts or local unrest affect larger areas than a LAN or cloud region will cover.

    For the absolutely highest availability it is of utmost importance to have availability beyond what a single cluster can provide.

    This is the reason many of the high-profile users of NDB use Global Replication. A telecom operator cannot accept that the phone service or mobile internet service is disrupted completely by one catastrophic event. The application has to be built to handle such catastrophic events.

  2. The application needs 100% uptime in all types of software upgrades. Many applications upgrade not only the DBMS when upgrading; they might decide to upgrade their application, and they might even decide to move a data center from one place to another. A fail-over cluster can be very useful in all of those cases.

    RonDB is designed to handle most software changes without downtime. But there are still some things we don’t support that might be needed, such as an online drop of a column. In this case another cluster can be used to perform an online upgrade. Next the clusters are swapped and the first cluster is brought online again; after that the second cluster can be brought down to upgrade it as well.

  3. Global Replication can be used to build a true globally distributed database where updates can be performed in any cluster anywhere in the world. Given the latency between systems on a global scale it is not possible to wait for commit until all clusters have accepted the transaction. Instead we use a conflict detection scheme. RonDB supports a number of different conflict detection schemes.

    The globally distributed database setup can be used in companies that operate over multiple continents and need a cluster on each continent that can act as primary cluster.

    Even with a single primary cluster one can use conflict detection to ensure that failover to a new primary cluster can be done without downtime. In this case the conflicts only occur in the short time that two clusters are both primary clusters.

  4. Replication to a different storage engine of MySQL. We will not cover this in any detail in this book. It should be straightforward to do with the descriptions in this chapter.

MySQL Replication#

The asynchronous replication between clusters is built on top of MySQL Replication. It reuses most of the basic things about MySQL Replication such as a binlog, a relay log, source MySQL Servers and replica MySQL Servers. There is a set of differences as well.

First of all, RonDB only supports row-based replication. Metadata changes are replicated using statements, but all data changes are replicated as row changes.

The row changes are not generated in the MySQL Server (remember that RonDB can be accessed also from NDB API applications), rather they are generated inside the data nodes. The data node generates events for row changes and transports them to a specific MySQL Server that handles replication between clusters.

RonDB uses a concept called epochs to transport changes over to the backup cluster. Epochs are a sort of group commit: all changes committed in the cluster during the last time period (100 milliseconds by default) are grouped into one epoch. This epoch is transported to the backup cluster and applied there as one transaction.

MySQL using InnoDB relies on GTIDs to handle replication consistency. RonDB gets its replication consistency from its use of epochs.

In research papers about database replication, there are two main themes on how to replicate. One is the epoch style, which only requires sending over the epochs and applying them to the backup cluster. The actual replication is simpler in this case. The problem with this approach is that one needs to create epochs, which puts a burden on the primary cluster. In RonDB this isn’t a problem since we already have an architecture to generate epochs efficiently through our global checkpoint protocol. Thus using epochs is a natural choice for RonDB.

The other approach has a smaller impact on the source; it instead requires that one keeps track of the read and write sets and ensures that the replica applier has access to this information. This is a natural architecture for a DBMS where it is harder to change the transaction architecture to create epochs.

Both architectures have merits and disadvantages; for RonDB epochs were a very natural choice. For MySQL using InnoDB, the approach of using GTIDs to serially order transactions (which still lacks the read sets) was the more natural approach.

Global Replication actually supports more features compared to MySQL Replication. The features we support in this area are very advanced. The lead developer in this area took his Ph.D. on this topic about 20 years ago and presented a distributed replication architecture at the VLDB conference in Rome more than 15 years ago.

This is an area that we have developed in close cooperation with our telecom users that all require this feature and have built a lot of support functionality around it.

Architecture of Global Replication#

As a first step we will describe the nodes required to set up replication between clusters. Given that almost all applications of RonDB are high availability applications, we will only consider replication channels that have a primary and a backup channel.

A replication channel always depends on a source MySQL Server and a replica MySQL Server. Thus to have a primary replication channel and a backup replication channel means that we need at least two source MySQL Servers and two replica MySQL Servers to replicate from one RonDB Cluster to another RonDB Cluster.

In active-active architectures replication goes in both directions. This can be accomplished with the same set of MySQL Servers or it can be accomplished with independent MySQL Servers. There is no functional difference between these choices; in this description we will assume that the four MySQL Servers are responsible for only replicating in one direction.

Thus active-active replication will require four or eight MySQL Servers for two RonDB Clusters. Circular replication requires four MySQL Servers per RonDB Cluster.

Specialised MySQL Servers for replication#

The optimal design for the MySQL Servers used for replicating to a backup cluster, and for the MySQL Servers that apply the binlogs in the backup cluster, is that those MySQL Servers are specialised servers only used for replication. These specialised MySQL Servers can also be used for performing schema changes.

Base architecture#

The base architecture is replicating from one RonDB cluster to another RonDB cluster. Here we have a set of local MySQL Servers and a set of data nodes in each cluster that are local to that cluster. There are two MySQL Servers in the primary cluster that receive the events from the primary cluster, and both store the events in their binlogs.

There is a backup cluster, also with a set of local MySQL Servers and data nodes. In addition it has two replica MySQL Servers. These two MySQL Servers receive replication events from one of the MySQL Replication Servers in the primary cluster. Only one of those is active at any time. Failover between those replica MySQL Servers happens on epoch boundaries. Thus if one replication channel fails, the other can deduce which epochs have been applied and start from the next epoch after that.


If the primary cluster fails or if both replication channels fail, we will fail over to the backup cluster. Again using epochs this is straightforward, we discover which is the last epoch we received in its entirety and we will continue from there.

The application must also be redirected towards the new primary cluster before the failover is completed.

Active-Standby Clusters#

Active-Standby clusters are the most basic configuration, used primarily for high availability, complex software upgrades, hardware upgrades and so forth. No conflict detection is required since there is only one active cluster at a time.

Read-Replica Clusters#

In this case there is no failover handling. The replication is done to increase read throughput. We could have more than one backup cluster.

Multi-level Read-Replica Clusters#

With high throughput of writes it can be hard for one primary cluster to replicate to all other backup clusters. In this case one might have multiple levels of replication.

Thus the primary cluster replicates to a set of clusters, each of those clusters in its turn can replicate onto another set of RonDB Clusters.


Active-Active Clusters#

Active-Active clusters replicate in both directions. This means that we can get conflicts. Thus it is important to decide how to handle those conflicts; we will go through the options available in the chapter on globally distributed databases using RonDB and in the chapter on extremely available solutions with RonDB.

A special case of this architecture is when we normally run in Active-Standby mode, but during the failover we run in Active-Active mode. Supporting this is one of the reasons we provide conflict detection in the Active-Active setup.

Circular Active-Active Clusters#

With more than two active clusters we build the replication in circular fashion. This still requires conflict detection handling. We will cover this in the chapter on globally distributed databases.

Tables used by Global Replication#

Global Replication uses a set of internal tables to provide information about replication status. These tables are created when installing a MySQL Server.

The ndb_binlog_index table is an InnoDB table that is local on each MySQL Server used to replicate cluster tables to another cluster or another storage engine.

The ndb_apply_status, ndb_schema and ndb_replication tables are RonDB tables.

All those tables belong to the mysql database.

ndb_binlog_index table#

The ndb_binlog_index table contains an index to what has been written in the local binlog in a MySQL Server handling Global Replication. This table is created by the MySQL installation program.

One row is created per epoch executed. Only epochs that actually perform changes are recorded in this table. There is a configuration parameter --ndb-log-empty-epochs that can be used to write entries to this table even for empty epochs. We recommend NOT to use this feature, since it produces a lot of rows in the ndb_binlog_index table. It is better that the failover solution adds a heartbeat table that is updated on a regular basis (we will discuss this in some detail in later chapters). This ensures that we minimize the overhead of keeping the replication channel active such that we know when to fail over to the backup replication channel.
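A minimal sketch of such a heartbeat table, assuming the hypothetical name heartbeat and that the failover solution runs the update on a regular basis (for example once per second via a cron job):

CREATE TABLE heartbeat (
  id int unsigned NOT NULL,
  beat bigint unsigned NOT NULL,
  PRIMARY KEY (id)
) ENGINE=NDB;

INSERT INTO heartbeat VALUES (1, 0);

-- Executed periodically by the failover solution; every execution
-- generates a non-empty epoch and thus a row in ndb_binlog_index.
UPDATE heartbeat SET beat = beat + 1 WHERE id = 1;

The definition of the ndb_binlog_index table itself is shown next.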

CREATE TABLE ndb_binlog_index
(
  Position bigint unsigned NOT NULL,
  File varchar(255) NOT NULL,
  epoch bigint unsigned NOT NULL,
  inserts int unsigned NOT NULL,
  updates int unsigned NOT NULL,
  deletes int unsigned NOT NULL,
  schemaops int unsigned NOT NULL,
  orig_server_id int unsigned NOT NULL,
  orig_epoch bigint unsigned NOT NULL,
  gci int unsigned NOT NULL,
  next_position bigint unsigned NOT NULL,
  next_file varchar(255) NOT NULL,
  PRIMARY KEY (`epoch`,`orig_server_id`,`orig_epoch`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The Position field specifies the start position of the epoch in the binlog and File is the filename of the file where the start position is placed.

The epoch column lists the 64-bit epoch id, consisting of a 32-bit GCI number and a 32-bit micro-GCI. The GCI is also reported in its own column.

inserts, updates, deletes and schemaops count how many of each of those operations the epoch consists of.

next_file and next_position point to the file and position of the next epoch after this one.

orig_server_id is a column that is filled in when the ndb-log-orig MySQL server option is set to 1 (it defaults to 0). In this case it is set to the originating server id of the epoch, and orig_epoch contains the epoch number in the originating cluster. Each MySQL replication server must have its own unique server id. The responsibility to make it unique and to set up the replication configuration lies with the DBA. The server id is set when you start up the MySQL replication server using the --server-id parameter.

The server id is a number that must be unique among all the MySQL Replication servers in a global replication setup. This number is in no way related to any other number, thus not related to node ids. These numbers can be selected in any fashion.

When circular replication is used, each epoch will generate as many entries in this table as there are clusters in the circle. Assume that epoch 13 originated in cluster A and that this epoch is applied to cluster B in epoch 209. Epoch 209 will then generate two rows in this table. One is for epoch 13 originating from cluster A: this row has epoch set to 209, orig_epoch set to 13 and orig_server_id set to the server id of the originating MySQL replication server in cluster A. The second is for the epoch generated from transactions in cluster B: this row has epoch set to 209, orig_epoch also set to 209 and orig_server_id set to the server id of the MySQL replication server in cluster B where this transaction is written to the binlog.

These multiple entries in the ndb_binlog_index table are used by the conflict detection code as we will later discuss.
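As a hedged illustration of the example above (and assuming ndb-log-orig is enabled), a query such as the following in the binlog MySQL Server of cluster B would show both rows recorded for epoch 209:

SELECT epoch, orig_server_id, orig_epoch, inserts, updates, deletes
FROM mysql.ndb_binlog_index
WHERE epoch = 209;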

ndb_apply_status#

The ndb_apply_status table contains rows that describe the epochs that have been written into the cluster by a replica applier. Updates to this table are not generated by events in the cluster; they are generated by a MySQL replication server acting as a binlog server. Thus in the primary cluster this table will be empty, and it is filled in the backup cluster when the replica applier applies the epoch. This table has a primary key that is the server id, thus each epoch will overwrite the values in this table. Only the latest applied epoch is represented in this table for each server id.

Each epoch with changes contains one write into this table.

CREATE TABLE ndb_apply_status
(
  server_id int unsigned NOT NULL,
  epoch bigint unsigned NOT NULL,
  log_name varchar(255) binary NOT NULL,
  start_pos bigint unsigned NOT NULL,
  end_pos bigint unsigned NOT NULL,
  PRIMARY KEY (server_id) USING HASH
) ENGINE=NDB DEFAULT CHARSET=latin1;

server_id contains the server id from the MySQL Server generating the binlog entry for the epoch. Thus it is the server id of the MySQL replication server in the primary cluster. The information stored in this record is the following.

epoch is the epoch number in the primary cluster. The log_name is the file name of the start position of the epoch. start_pos and end_pos are the start and end position in the binlog of the epoch.

Given that transactions are applied atomically and this write is part of the epoch transaction, it is possible to deduce if an epoch has been applied in the backup cluster by looking at this table.
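For example, to check in the backup cluster whether a given epoch from a given source server has been applied, a query along these lines can be used (the two user variables are placeholders to be filled in):

SELECT COUNT(*) > 0 AS epoch_applied
FROM mysql.ndb_apply_status
WHERE server_id = @source_server_id
  AND epoch >= @epoch_to_check;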

Normally the writes to this table aren’t replicated from the backup cluster. But for globally distributed database implementations using multi-primary setups it is necessary to also replicate these changes onwards to the next cluster to ensure that conflict detection can work correctly.

This is controlled by the ndb-log-apply-status option. If the server id in a change is the server’s own id, the change is ignored to avoid replication loops.

ndb_schema#

Schema changes (CREATE TABLE, RENAME TABLE, DROP TABLE, ALTER TABLE, ...) are replicated as well. They are replicated through the aid of the ndb_schema table. When a schema change is executed a row is inserted into the ndb_schema table. This table is never replicated to the replicas.

Each MySQL Server listens to events on this table. Thus each MySQL Server will hear about any schema modifications as part of the schema change process. In addition each MySQL Server listens to events that provide detailed information about the resulting table(s) from the event.

The schema change is not necessarily executed in the same MySQL Server as where the binlog is written. By writing a row in the ndb_schema table we transport information to the MySQL replication server(s) about the schema change. We also ensure that the schema change is executed in a specific epoch.

The table provides a column called slock; this column is a bitmap that is updated by each MySQL Server when it hears about the schema change. Thus through this table we ensure that all MySQL Servers are informed of all metadata changes. If a MySQL Server is down when a schema modification occurs, it can find out about the schema change in the recovery phase by looking into this table.

CREATE TABLE ndb_schema (
    db varbinary(63) NOT NULL,
    name varbinary(63) NOT NULL,
    slock binary(32) NOT NULL,
    query blob NOT NULL,
    node_id int unsigned NOT NULL,
    epoch bigint unsigned NOT NULL,
    id int unsigned NOT NULL,
    version int unsigned NOT NULL,
    type int unsigned NOT NULL,
    PRIMARY KEY USING HASH (db,name)
) ENGINE=NDB DEFAULT CHARSET=latin1;

In this table we record the database name in db, the table name in name, the actual query executed is inserted into the query column. The query might be a bit rewritten for multi-table schema changes since ndb_schema will only contain single-table schema changes.

node_id contains the node id of the MySQL Server that executed the schema change. id contains the table id, version contains the table version and type contains table type. epoch contains the epoch number of the schema change.

ndb_replication#

The ndb_replication table has two major use cases. The first is that it can be used in normal replication setups to control which tables are to be replicated and how their changes are written to the binlog; the second is to configure conflict detection for individual tables. The table doesn’t exist by default, thus the user must create the table and fill it with data for it to be used. If the table doesn’t exist, the MySQL server option settings control what to replicate, and by default no conflict detection is used.

The table should be placed in the mysql database and below is the CREATE TABLE command used to create it.

CREATE TABLE ndb_replication (
  db varbinary(63),
  table_name varbinary(63),
  server_id int unsigned,
  binlog_type int unsigned,
  conflict_fn varbinary(128),
  PRIMARY KEY USING HASH (db, table_name, server_id)
) ENGINE=NDB PARTITION BY KEY(db,table_name);

db#

Database name; it can contain wildcard characters such as % so that the database name acts as a pattern. In this case the row applies to all tables in databases that match the pattern.

table_name#

Table name; it can also contain wildcard patterns.

server_id#

If this is set to 0 it applies to all MySQL replication servers in the cluster. If set to a non-zero value it only applies to the server id listed in the table.

binlog_type#

The values used here are defined in sql/ndb_share.h. The values control the way we write the binlog. There are three configuration options that control this for the default setting: ndb-log-update-as-write, ndb-log-updated-only, and ndb-log-update-as-minimal.

ndb-log-update-as-minimal is OFF for all settings except 8 and 9 below, for which it is ON. The 8 and 9 settings can only be used in conjunction with the conflict detection functions NDB$EPOCH2 and NDB$EPOCH2_TRANS.

Setting this to 0 (NBT_DEFAULT) means that the server default settings rule. This is the default if no row for a table exists or if the entire ndb_replication table is missing.

Setting it to 1 (NBT_NO_LOGGING) means that the table or set of tables will not be logged in the binary log.

Setting the value to 2 (NBT_UPDATED_ONLY) means the same thing as when ndb-log-updated-only and ndb-log-update-as-write are set to ON. Thus only changed values are sent in events to the MySQL replication servers, and thus only changed columns are written into the binlog. This is the recommended value for tables that are replicated using RonDB without conflict detection. The default settings of the configuration options result in this binlog type.

Setting this to 3 (NBT_FULL) means that the full row is sent in an event from the data node and written into the binlog. We still only write the after image. This is the same behaviour as when ndb-log-updated-only is set to OFF with no other changes to the default settings.

Setting this to 4 (NBT_USE_UPDATE) or to 6 (NBT_UPDATED_ONLY_USE_UPDATE) means that updates are logged as updates and inserts as writes. This is the behaviour one gets by setting ndb-log-update-as-write to OFF and ndb-log-updated-only to ON.

These settings are the ones to use for replication to another storage engine.

Setting this to 7 (NBT_FULL_USE_UPDATE) means the same as setting 6, except that ndb-log-updated-only is set to OFF.

This is the setting when using conflict detection with either of the functions NDB$OLD(column_name) and NDB$MAX_DELETE_WIN(column_name).

8 (NBT_UPDATED_ONLY_MINIMAL) is the setting used for replication with conflict detection using either of the conflict detection functions NDB$EPOCH2 or NDB$EPOCH2_TRANS. It writes updates as updates and writes only the changed columns in the after image and only the primary key columns of the before image.

9 (NBT_UPDATED_FULL_MINIMAL) is the same as 8 except that all non-primary key columns are written in the after image. This can be used to replicate to older versions of RonDB.

conflict_fn#

This specifies the conflict detection function; this will be covered in the chapter on globally distributed databases.
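As a sketch, assuming a hypothetical table mydb.mytable with a timestamp column ts used for conflict resolution, a row that activates the NDB$OLD conflict function for all MySQL replication servers could look as follows (binlog_type 7 matches the NBT_FULL_USE_UPDATE setting recommended for NDB$OLD above):

INSERT INTO mysql.ndb_replication
  (db, table_name, server_id, binlog_type, conflict_fn)
VALUES
  ('mydb', 'mytable', 0, 7, 'NDB$OLD(ts)');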

Epochs#

RonDB uses a concept called epochs to replicate transactions. An epoch is a group of transactions that are serialised to execute before the next epoch and after the previous epoch as described in earlier chapters.

We achieve these transaction groups by blocking transactions from starting to commit when we receive a prepare message from the source node; after the source node has received a response from all nodes, it will send a message that transactions are allowed to start committing again.

Transactions can be in three states. In the first state they are running operations such as read, insert, update and delete, but haven’t yet reached the commit point. In the second state they have reached the commit point, and in the third state they are executing the commit. Transactions can continue committing if they have passed the commit point and they can continue executing in the prepare phase. The performance isn’t affected much by creating these transaction groups that become the epochs used for replicating to a backup cluster.

Epochs are created at a configurable interval. By default this interval is 100 milliseconds. By setting the cluster configuration parameter TimeBetweenEpochs we can decrease or increase the interval for the creation of epochs.

Some of the epochs are also used to create a global checkpoint that the cluster can restore from in the case of a complete cluster crash. This is configurable by the cluster configuration parameter TimeBetweenGlobalCheckpoints. This is set to 2000 milliseconds by default.
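As a sketch, the relevant part of the cluster configuration file could look as follows, here simply restating the default values mentioned above:

[NDBD DEFAULT]
# Create an epoch every 100 milliseconds (the default)
TimeBetweenEpochs=100
# Create a restorable global checkpoint every 2000 milliseconds (the default)
TimeBetweenGlobalCheckpoints=2000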

As mentioned before the epoch is executed as one transaction in the backup cluster.

Setting up a replication channel between two clusters#

In order to handle the more complex replication scenarios we will first show the basic building block of setting up a single replication channel. To make it even easier this channel is brought up before starting any operations on the primary cluster.

To begin with, it is highly recommended to set the MySQL server parameter replica-allow-batching to 1 in the MySQL Server in the backup cluster where the binlog is applied. This allows the replica thread to execute many operations in parallel and is very important for keeping up with the update rates in the primary cluster.

Starting up the MySQL replication servers#

First, the MySQL Server in the primary cluster must be started with a number of MySQL server options set properly. The source must set a unique server id using the parameter server-id, and it must activate the binlog by setting the log-bin option to the base filename of the binlog files.

The NDB storage engine must also be activated by setting the option ndbcluster. Since RonDB only supports row-based replication, it is necessary to set the option binlog-format to ROW. This will automatically use row-based replication for normal operations.

To activate replication from the NDB storage engine it is also necessary to set the configuration parameter ndb-log-bin to 1. This is by default set to 0, while log-bin is set by default. Thus to not enable the binlog in a MySQL Server it is necessary to set disable-log-bin. Without ndb-log-bin set to 1, the only actions that will be in the binlog are the metadata statements.

It is important to have the ndb-log-updated-only option ON (which it is by default); this ensures that the RonDB data nodes only send changed columns to the MySQL replication servers after a write transaction is completed.

ndb-log-update-as-write converts updates to writes to ensure that execution of the binlog is idempotent. This is on by default and should be kept on.

So the default behaviour is to record all updates as writes with only the changed columns recorded. This setting is good unless you are using any of the conflict detection setups; this will be covered in a later chapter. If you replicate to another storage engine it is important to set ndb-log-update-as-write to 0.

Setting ndb-extra-logging to 99 might be useful if you want more understanding of what goes on in the MySQL replication server.

In the source MySQL server we should ensure that file writes are synchronised to disk to ensure that the epochs are written to disk as soon as they are completed. This means we need to set sync-binlog to 1.

Thus starting up the MySQL replication server in the primary cluster should at least contain the following options.

mysqld --server-id=$id \
       --log-bin=mysql-bin \
       --ndb-log-bin=1 \
       --sync-binlog=1 \
       --ndbcluster \
       --binlog-format=ROW \
       --binlog-cache-size=1M

The MySQL replication server in the backup cluster must also set a unique server-id, and it must set the ndbcluster option. If it only operates as a replica it doesn’t need to configure the binlog. To prevent the replica from starting replication before we are ready it is important to add the option skip-replica-start. The replica applier should use batching to increase its speed; setting replica-allow-batching to 1 means that we can use batching of operations at the replica.

In RonDB 21.04 one still has to use --skip-slave-start since RonDB 21.04 is based on MySQL Cluster 8.0.23. RonDB 22.10 is based on MySQL Cluster 8.0.34 and thus will accept --skip-replica-start. Similarly RonDB 21.04 requires the use of --slave-allow-batching.

Thus starting up the MySQL replication server in the backup cluster should at least contain the following options.

mysqld --server-id=$id \
       --ndbcluster \
       --skip-replica-start \
       --replica-allow-batching

Creating a user on the source MySQL Server#

Given that the replica server needs to connect to the source MySQL server, it is necessary to set up a special replication user on the source MySQL Server. This is recommended for security reasons.

CREATE USER 'user'@'hostname' IDENTIFIED BY 'password';

GRANT REPLICATION SLAVE ON *.*
  TO 'user'@'hostname';

Set the replication channel starting point on the replica side#

Before the replica applier can start it needs to know the following things:

  1. Hostname of source MySQL Server

  2. Port number of the source MySQL Server

  3. Username that the replica will use on the source MySQL Server

  4. Password of this user on the source MySQL Server

  5. The starting binlog file on the source MySQL Server

  6. The starting position on the source MySQL Server.

When starting replication from a new cluster, we start from the first binlog file, and a binlog file always starts at position 4. The command executed in the replica MySQL Server is the following.

CHANGE REPLICATION SOURCE TO
SOURCE_HOST='source_host',
SOURCE_PORT=source_port,
SOURCE_USER='replica_user',
SOURCE_PASSWORD='replica_password',
SOURCE_LOG_FILE='mysql-bin.000001',
SOURCE_LOG_POS=4;

There is also a set of options in the CHANGE REPLICATION SOURCE TO command to setup a secure communication between the replica and the source using SSL options.

Start replica applier#

Now we are prepared to start our replication channel. We execute the command

START REPLICA;

in the replica MySQL Server. Now the replication channel has started and any transactions executed in the primary cluster will be applied in the backup cluster through the replica MySQL Server.

Importing a backup from the primary cluster#

Bringing up a replication channel in a live cluster is not that much more complex. All the preparations are exactly the same except for one thing. We must first import the entire data set from the primary cluster followed by catching up with the binlog in the primary cluster.

Nine steps are required here:

  1. Ensure that the binlog has been started on the source MySQL Server

  2. Execute a backup in the source cluster

  3. Create any databases needed in the replica cluster not yet present there

  4. Reset the replica in the replica MySQL Server

  5. Restore the backup in the replica cluster

  6. Discover the last epoch that was in the backup

  7. Discover the start position in the binlog for the first epoch not in backup

  8. Set this position in the CHANGE REPLICATION SOURCE TO command

  9. Start the replica applier

The binlog is started immediately when the MySQL Server is started with log-bin set.

Executing a backup in RonDB is very simple; we provide more details on this in the chapter on backups in RonDB. It is enough to connect to the RonDB management server using a management client and issue the command START BACKUP. After some time the command completes.

Now, before restoring the backup, ensure that any needed CREATE DATABASE commands are issued on a MySQL Server connected to the backup cluster, since databases are not stored in RonDB backups. If the metadata was restored using data from mysqldump that also restored the databases, this is not necessary.
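For example, if the primary cluster contains a database named mydb (a hypothetical name), the following would be issued on a MySQL Server connected to the backup cluster before restoring the backup:

CREATE DATABASE IF NOT EXISTS mydb;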

Restoring the backup in the backup cluster is covered in detail in the chapter on Restore in RonDB. Essentially we install a backup from the primary cluster, ensure that the ndb_apply_status table is up to date and ensure that the restore is not logged to any binlogs.

Now ensure that the replica is reset before we set up the replication channel, to ensure that the old replica settings aren’t interfering. This is performed through the following command in the replica MySQL Server.

RESET REPLICA;

Now, to discover the last epoch in the backup we use the following command in the replica MySQL Server (or any other MySQL Server connected to the replica cluster).

SELECT @latest:= MAX(epoch) FROM mysql.ndb_apply_status;

We have now read the last epoch number that has been restored in the replica cluster.

Equipped with this information we can now use the ndb_binlog_index table in the source MySQL Server to discover the binlog position where to start the replica applier.

We execute the following query in the source MySQL Server, where @latest comes from the previous command and @file and @pos are what we are looking for.

The epoch number written into the backup isn’t necessarily a real epoch number. This means that we need to start replication from the position of the next binlog entry after the highest epoch in the ndb_binlog_index table that is equal to or smaller than the epoch found in the ndb_apply_status table.

SELECT @file:=SUBSTRING_INDEX(next_file, '/', -1),
       @pos:=next_position
FROM mysql.ndb_binlog_index
WHERE epoch <= @latest
ORDER BY epoch DESC LIMIT 1;

Now we take those two parameters back to the replica MySQL Server and use those to set the binlog file and binlog position from where the replica MySQL Server should start applying the binlog.

CHANGE REPLICATION SOURCE TO
SOURCE_HOST='source_host',
SOURCE_PORT=source_port,
SOURCE_USER='replica_user',
SOURCE_PASSWORD='replica_password',
SOURCE_LOG_FILE='@file',
SOURCE_LOG_POS=@pos;

Note that the above is pseudo code for how to set up the replication channel. In reality these commands are executed on different MySQL Servers and thus one needs to use a shell script, Python, Perl or some other means of executing those commands, as sketched below.
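A minimal sketch of such a script in shell, assuming hypothetical host names source_host and replica_host and that the mysql client can log in to both servers without further options (error handling omitted):

#!/bin/bash
# Minimal sketch with hypothetical host names and credentials;
# adapt users, passwords and ports to the actual setup.
SOURCE_HOST=source_host
REPLICA_HOST=replica_host

# Step 1: read the last applied epoch from the backup cluster.
LATEST=$(mysql -h "$REPLICA_HOST" -N -e \
  "SELECT MAX(epoch) FROM mysql.ndb_apply_status")

# Step 2: look up the corresponding binlog file and position
# in the source MySQL Server.
set -- $(mysql -h "$SOURCE_HOST" -N -e \
  "SELECT SUBSTRING_INDEX(next_file, '/', -1), next_position
   FROM mysql.ndb_binlog_index
   WHERE epoch <= $LATEST ORDER BY epoch DESC LIMIT 1")
FILE=$1
POS=$2

# Step 3: point the replica applier at this position and start it.
mysql -h "$REPLICA_HOST" -e \
  "CHANGE REPLICATION SOURCE TO
     SOURCE_HOST='$SOURCE_HOST', SOURCE_PORT=3306,
     SOURCE_USER='replica_user', SOURCE_PASSWORD='replica_password',
     SOURCE_LOG_FILE='$FILE', SOURCE_LOG_POS=$POS;
   START REPLICA;"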

Now we are ready to start the replica applier.

START REPLICA;

After starting the replica applier it will take some time before the backup cluster has caught up with the primary cluster. The backup cluster will never entirely catch up since replication is asynchronous. But the replication lag starts out as long as the time it took to restore the backup, which could be fairly long, definitely minutes and potentially even hours in larger clusters. It should be possible to get the replication lag down to seconds.

The MySQL manual contains an example script written in Perl called reset-replica.pl that automates parts of this process.

Point-in-Time backups#

Normally the second cluster is used for fail-over cases. If the primary cluster is able to sustain a high load while the backup cluster only has to keep up with the update load, it could be interesting to use the backup cluster to provide a point-in-time backup solution for RonDB.

To install such a point-in-time backup of the primary cluster one starts with a backup; next one uses the replica cluster to catch up to the point in time that should be restored.

The MySQL manual has a detailed description of how to handle this.

Read Replica Clusters#

The above setup can be used directly for Read Replica Cluster setups. In this case it might even be acceptable to skip the fail-over replication channel between the primary cluster and the backup cluster. Loss of the replication channel then means that the backup cluster cannot be used until it has been restarted and come back again.

For a backup cluster used only for extra read capacity this can be acceptable. The primary cluster can thus serve more backup clusters instead of sustaining fail-over replication channels.

Characteristics of replica applier for RonDB#

The replica applier is the replica applier used by MySQL Replication. This applier is single-threaded. At the same time this single thread works on very large batches, where transactions spanning the last 100 milliseconds are applied in one large transaction.

Given that the NDB API can sustain very high levels of batching, the performance of this single thread is extremely good. More than a hundred thousand operations per second can be handled through this single thread.

Global Replication can be used for most applications using RonDB. The exception is applications requiring many hundreds of thousands of write operations per second, or even millions of write operations per second. Those cases can still be handled, but will require setting up a parallel replication channel per database as described in later chapters.

Purging MySQL Replication files#

MySQL Replication uses binlog files that have a specific size; after completing the write of one binlog file, the file is closed and a new one is written to instead. It goes on like this as long as replication continues. A binlog file can also be closed and a new one created by restarting the MySQL Server.

MySQL will never delete any binlog files automatically; it can delete them on command, but the command must be executed by the DBA. Thus the DBA must always have a strategy for deleting binlog files.

In case the backup clusters are not used for fail-over scenarios it would be sufficient to keep the binlog files until all backup clusters have applied the binlog file.

Thus, to calculate which files to delete at any point in time one would issue the following query towards the ndb_apply_status table in each backup cluster to deduce what epochs have been applied in that backup cluster.

SELECT @latest:= MAX(epoch) FROM mysql.ndb_apply_status;

Now take the minimum of the latest epochs from all backup clusters. Next use the following query on the ndb_binlog_index table in the primary cluster.

SELECT @file:=SUBSTRING_INDEX(next_file, '/', -1),
       @pos:=next_position
FROM mysql.ndb_binlog_index
WHERE epoch <= @minimum_latest
ORDER BY epoch DESC LIMIT 1;

Now the @file variable contains the oldest file that needs to be retained. All files before this one can be deleted.

Now, if we have a fail-over scenario it is essential to be able to fail over after a complete crash of the primary cluster. It is also important to be able to handle a simultaneous crash of the primary cluster and the backup cluster used for fail-over. In this case we can restart one of the clusters using a backup from the primary cluster followed by applying the binlog up to its end.

To ensure that the binlog can be used together with a backup one checks the StopGCP number output from the backup at completion. In the example below the StopGCP is 3845.

ndb_mgm> START BACKUP
Waiting for completed, this may take several minutes
Node 2: Backup 1 started from node 1
Node 2: Backup 1 started from node 1 completed
 StartGCP: 3839 StopGCP: 3845 
 #Records: 193234 #LogRecords: 12349
 Data: 20519789 bytes Log: 1334560 bytes

An epoch number is formed by using the StopGCP as the 32 most significant bits of a 64-bit number and 0 as the 32 least significant bits. Use this epoch number in the same manner as above to calculate the binlog files that can be deleted using the above query on the ndb_binlog_index table.
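For instance, with the StopGCP of 3845 from the example output above, the epoch threshold can be computed directly in SQL and used in the same query as before on the source MySQL Server:

SET @latest = 3845 << 32;

SELECT @file:=SUBSTRING_INDEX(next_file, '/', -1),
       @pos:=next_position
FROM mysql.ndb_binlog_index
WHERE epoch <= @latest
ORDER BY epoch DESC LIMIT 1;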

In the fail-over case we have to ensure that all backup clusters can survive a primary cluster failure, and we need to be able to handle a restore from backup; thus we need to retain the files needed for both of the above cases.

When we delete a binlog file it is important to also delete all entries in the ndb_binlog_index table for this file. Running PURGE BINARY LOGS ensures that this is done in a correct manner.

PURGE BINARY LOGS TO 'mysql-bin.000009';

Thus an important script in the DBA toolbox for Global Replication is a script that automates the above to ensure that we purge binlog files in the MySQL replication servers in the primary cluster. A sketch of such a script follows.
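A minimal sketch in shell, assuming a single backup cluster and the same hypothetical host names as before; a real script would loop over all backup clusters and also take the StopGCP of the latest backup into account before purging:

#!/bin/bash
# Minimal sketch with hypothetical hosts; no error handling.
SOURCE_HOST=source_host
REPLICA_HOST=replica_host

# The last epoch applied in the backup cluster.
MIN_LATEST=$(mysql -h "$REPLICA_HOST" -N -e \
  "SELECT MAX(epoch) FROM mysql.ndb_apply_status")

# The oldest binlog file that must be retained in the source MySQL Server.
KEEP_FILE=$(mysql -h "$SOURCE_HOST" -N -e \
  "SELECT SUBSTRING_INDEX(next_file, '/', -1)
   FROM mysql.ndb_binlog_index
   WHERE epoch <= $MIN_LATEST ORDER BY epoch DESC LIMIT 1")

# Purge all older binlog files and their ndb_binlog_index entries.
mysql -h "$SOURCE_HOST" -e "PURGE BINARY LOGS TO '$KEEP_FILE'"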