Global Fail-over Architecture#
This is probably the most common reason to replicate between different clusters. In this scenario it is necessary to setup more than one replication channel between the clusters. We use this scenario for increased availability, thus it is important that the replication itself doesn't hurt availability and thus we need at least two replication channels. In this chapter we will always discuss how to use two replication channels, one is the primary replication channel and one is the backup replication channel. To setup with 3 or even 4 replication channels is not any major concern and the required changes to the decription in this chapter should be small.
We assume here that both clusters are designed such that they can be used as the primary cluster after a fail-over.
Setup for fail-over scenario#
The setup for a fail-over scenario is exactly the same as the setup for a single replication channel. The only difference is that we set it up for two pairs of MySQL Servers instead of only one. We will call the operational pair the primary replication channel and the fail-over replication channel we will call backup replication channel.
It is important to start up both MySQL replication servers in the primary cluster. These two MySQL replication servers will have very similar content in their binlog files. We can however never trust they use exactly the same positions. Thus when setting up the replica MySQL Server on the backup replication channel we always have to use data from queries executed in the source MySQL Server on the backup replication channel and vice versa for the primary replication channel.
The last two commands, the CHANGE REPLICATION SOURCE TO and the START REPLICA command must only be performed on the primary replication channel.
These commands cannot be executed on the backup replication channel. Doing so would lead to two replica appliers trying to insert the same data in parallel and could cause all sorts of troubles.
The backup replication channel is not used until it is time to perform a fail-over from the primary replication channel to the backup replication channel.
Discovering failure of a replication channel#
What could cause a replication channel cutover to start using the backup replication channel?
If either the source MySQL Server or the replica MySQL Server on the primary replication channel fails, we need to cut over to the backup replication channel.
A cutover can also be caused by a network failure that stops the source MySQL server and the replica MySQL server on the primary replication channel from communicating with each other.
Indeed it can also be caused by a lost connection from the source MySQL Server towards the data nodes in the primary cluster. If this happens it will insert a special gap event called LOST_EVENT. When the replica applier finds this record in the binlog it will immediately stop applying the binlog and thus stop the replica.
Given that these failures can be complex and require a view from several independent systems it can even be worthwhile to make the decision to cutover as a manual decision.
Important things to watch#
In both the VM of the source MySQL Server and the VM of the replica MySQL Server it is important to see if the process is still up and running. In Linux this can be achieved with a fairly simple script using the ps program. In the iClaustron hobby project the following command is used to watch if a process is up and running.
ps -p $SEARCHED_PID -o command | \ grep $SEARCHED_PROCESS | \ wc -l | sed -e 's/ //g'`
SEARCHED_PID is the pid of the MySQL server process in this case, SEARCHED_PROCESS is mysqld in this case. If this returns 1 the process is still running, if it returns 0 it means that the process has for some reason stopped.
Returning 0 here means that the cut-over should be initiated. If 1 is returned it simply means that we need to gather more information, the process is still running, but is it making progress is another question we need to ask.
An alternative method is to issue a simple command to the MySQL Server, if it returns a result it is up and running and if it doesn't we can attempt to reconnect and if we cannot connect to the MySQL Server it is likely that it has stopped.
SELECT @latest:= MAX(epoch) FROM mysql.ndb_apply_status;
If the value of this have changed in the replica cluster since the last call we know that we are making progress. If the value is the same we have made no progress, but that is still not necessarily an indication that the replication channel have failed.
Now if we discover that no epochs are applied in the backup cluster we can check the status of the replica applier by issuing the command SHOW REPLICA STATUS in the replica MySQL Server. This will most likely provide details on why the replication channel is not working properly.
One reason the replica has not advanced the epoch number could be a large transaction running. In this case the SHOW REPLICA STATUS command will show that the replica applier is busy executing. In this case we will simply wait one more time slot.
Another thing the command could show is that the replica has stopped. In almost all cases this should lead to a cut-over to start.
To ensure that the replication channel is constantly active we can create a simple table with a primary key and a counter. We can update this table from the script that checks for failovers.
Thoughts on an automatic discover fail-over script#
A simple script to handle fail-over is to have a connection to the replica MySQL Server.
Next step is to decide on the time to wait between checks of the replication channel. Let's say we use 10 seconds here.
Every 3 seconds perform the following.
Issue a write towards a MySQL Server in primary cluster
Issue a read of latest epoch from ndb_apply_status in MySQL Server in the backup cluster
If epoch has changed since last 3 second check we stop with OK
We issue a SHOW REPLICA STATUS to check the state of the replica applier
If the state is stopped we start a cut-over of the replication channel
Add 1 to a counter, if the count is above some limitation start a cut-over
If the state is that we are executing we will stop with OK
The script to check for failover can execute in both the primary replica applier and in the backup replica applier. Both will issue a stop on the replication in the primary replica applier when a cut-over is initiated. However only the script on the backup replication channel will start the replication channel on the backup replication channel.
The check if any epochs have been applied will discover all types of failures of the primary replication channel. In a sense this command is sufficient to perform a health check of the state of the primary replication channel. Given that a new epoch is created once per 100 milliseconds we simply need to decide how long time we think that it is ok to wait for an epoch to complete. This is affected by the size of the largest transactions executed, it is also dependent on what type of ALTER TABLE statements that are replicated across the channel.
Thoughts on a replication channel cut-over script#
The first step is to ensure that the replica applier in the replica MySQL server is stopped. Normally this is simple enough using the STOP REPLICA command. It is a bit more complicated if the replica MySQL Server doesn't respond. If so we can ensure that the replica is stopped by killing the replica MySQL Server.
After ensuring that the replica applier in the currently active replication channel has stopped we perform the usual logic to discover from where to start the backup replication channel.
This means issuing the following command in the replica MySQL server in the backup replication channel.
SELECT @latest:= MAX(epoch) FROM mysql.ndb_apply_status;
Next we use the \@latest in the command towards the source MySQL Server in the backup replication channel.
SELECT @file:=SUBSTRING_INDEX(next_file, '/', -1), @pos:=next_position FROM mysql.ndb_binlog_index WHERE epoch = @latest;
If this command returns the empty string it could be because this MySQL replication server is a bit behind, if this is suspected we could check more states or simply wait a bit more. A more common reason is that returning an empty string simply means that the replication channel didn't record this epoch. In this case the replication channel is broken since both the primary and the backup replication channel is broken.
A broken replication channel means that one have to start from a backup again, the backup must be taken after the stop since we lack binlog information for some period around the stop time.
Now in the successful case where we did find the epoch we use the \@file and \@pos from this query in the following command towards the replica MySQL server in the standby replication channel.
CHANGE REPLICATION SOURCE TO SOURCE_HOST='source_host', SOURCE_PORT=source_port, SOURCE_USER='replica_user', SOURCE_PASSWORD='replica_password', SOURCE_LOG_FILE=@file, SOURCE_LOG_POS=@pos;
We are now ready to resume replica applier operations in the backup cluster by issuing the START REPLICA command in the replica MySQL Server in the backup replication channel.
Now the backup replication channel have been promoted to primary replication channel.
Handling GAP events#
GAP events or LOST_EVENT is something inserted in the binlog when the MySQL replication server have lost the ability to write the binlog. The most common reason for this, using RonDB, is that the binlog server have lost connection to the primary cluster or that the binlog server could not keep up with the replication and ran out of event buffer memory.
When the replica encounters a GAP event in the binlog it will stop. When this happens we will soon discover that the replication channel have stopped. After discovering that this happened due to a GAP event we will perform a normal cutover to the other replication channel.
If the backup replication channel had a GAP event even before the primary replication channel we will not find any record for the current epoch in the ndb_binlog_index table in the binlog server in the standby replication channel. Thus the replication channel is broken and we will need to start it up from scratch again.
If the backup replication channel had a GAP event at exactly the same place as the primary replication channel we will be able to start the backup replication channel. But it will stop immediately on the GAP event. In this case we will compare the binlog position we have stopped at, if this is the same as when we started it, we will not attempt to cutover the replication channel since we know that it will not work.
If the backup replication channel had a GAP event further on in the binlog we will treat it as the first GAP event and will execute the above procedure for that GAP event as well. There is no way of knowing if this GAP event is enough to enable us to cutover the original primary replication channel, thus we will try and see if it works.
Discover failure of a cluster#
To fail over to the backup cluster entirely after a failure of the primary cluster requires discovery logic of when the primary cluster have failed.
To discover that a cluster have failed is a fairly complex process given that there are many potential variants where we can have partial failures.
We need to have a perfectly working backup cluster for a fail over to be of any use. Thus we need to check the operations of both the primary cluster as well as the backup cluster.
There are two simple methods of checking that a cluster is up and running. The first uses a MySQL client to perform a simple operation and the other uses an RonDB management client to check the cluster status.
Checking status through MySQL client#
We can use the MySQL replication servers to also check the status of the cluster. A simple SELECT query towards the cluster will show if the cluster is up. If the query succeeds the cluster must be up.
SELECT a_column FROM an_ndb_table WHERE pk=a_key;
If the cluster fails we can attempt the same query also towards the other MySQL replication servers.
Checking status through RonDB management client#
Another possibility to check cluster status is using the SHOW command in the RonDB management client. This command reports the status as it is seen from the management server we contact to issue the SHOW command.
ndb_mgm> SHOW Cluster Configuration --------------------- [ndbd(NDB)] 2 node(s) id=1 @192.168.1.9 (RonDB-21.04.1, Nodegroup: 0, *) id=2 (not connected, accepting connect from 192.168.1.10) [ndb_mgmd(MGM)] 1 node(s) id=65 @192.168.1.8 (RonDB-21.04.1) id=66 @192.168.1.9 (RonDB-21.04.1) [mysqld(API)] 2 node(s) id=67 @192.168.1.9 (RonDB-21.04.1) id=68 @192.168.1.10 (RonDB-21.04.1) id=69 (not connected, accepting connect from 192.168.1.9) id=70 (not connected, accepting connect from 192.168.1.10)
If the cluster is up and running at least one of the data nodes (ndbd(NDB) above) will be presented with its hostname, its RonDB version and its node group id. Nodes that are considered down are listed as not connected, data nodes can also be listed as being in a start phase.
If none of them are up and running this management node considers the cluster as down.
Building blocks for discover cluster failure script#
A natural building block is to use a MySQL client at first to check the cluster status. As long as this is successful we need not do anything more. If it is unsuccessful we can try the same thing from another MySQL replication server in the same cluster. If this is not successful either we move on to attempt with the first management server in the cluster. If it considers the cluster as down as well we proceed and check with the remaining management servers connected to the cluster.
If all MySQL replication servers and all RonDB management servers connected to the cluster all report the cluster as not up, we will deduce that the cluster is down.
This will be a signal to fail over to the backup cluster.
Application assistance in fail-over discovery#
Applications can consider the cluster useless even if the cluster is still up and running. One such case happens when we run out REDO log (the application receives error code 410), this turns RonDB into a read-only database that is normally not an acceptable state and can lead to a fail-over to the backup cluster.
We can discover this state by turning the heartbeat query into an update query (possibly even updating disk columns).
We can also run out UNDO log for disk columns, there are various memory buffers that one can run out of that causes intermittent failures.
The application logic can also participate in the discovery process to discover when a cluster is no longer in an acceptable state and thus requiring a fail-over to the backup cluster.
Cluster fail over#
If we discover that the primary cluster have failed we need to fail over to the backup cluster. This requires the replica applier to complete the application of the log records that was produced in the primary cluster before it failed.
Once this is completed we will stop the replica using the STOP REPLICA command.
Now the backup cluster is up and running, the primary cluster is down, at this moment the live data is in the backup cluster. There is one major problem still though. All applications are still connected to the primary cluster where no action is happening.
The next step is application specific, all application logic must be transferred from the primary cluster to the backup cluster. For web server applications this usually involves redirecting the load balancers to the MySQL Servers of the backup cluster.