Fail-over clusters#

Failing over clusters means making the backup cluster the primary cluster. This can be done if the primary cluster fails, or if one wants to perform maintenance work on the primary cluster while minimizing downtime. Examples of maintenance work include:

changing the application software
changing the RonDB software version
changing the MySQL schema

This tutorial shows how to prepare the backup cluster to take over the primary cluster’s role and how to gracefully switch roles between the two.

(Optional) Preparing backup cluster#

After failing over to the backup cluster, we may want to add a new backup cluster to the system (or restart the primary cluster). This cluster would then replicate from the current backup cluster (the new primary). One can accelerate this process by preparing the (current) backup cluster in advance.

A simple way of doing so is to add binlog servers to the backup cluster. In that case, the initial start of the backup cluster looks identical to that of the primary cluster, as described in the Prepare primary tutorial. The only difference is that one would use `server_id=202` and `server_id=203` instead of `server_id=100` and `server_id=101`. After having set up replication, the system would look as follows:

| Primary cluster | Replication direction | Backup cluster |
|---|---|---|
| `server_id=100` | ➡️ | `server_id=200` |
| `server_id=101` | ➡️ | `server_id=201` |
| | (⬅️) | `server_id=202` |
| | (⬅️) | `server_id=203` |
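As a rough illustration of the layout above, the following Python sketch records which server IDs belong to which cluster and renders a minimal configuration fragment for a binlog server. The role labels are our reading of the table, the names `SERVER_IDS`, `all_ids` and `render_binlog_cnf` are hypothetical, and the option set (only `server-id` and `log-bin`) is an assumption for illustration; a real setup needs the full configuration from the Prepare primary tutorial.

```python
# Server IDs taken from the table above; the role labels are illustrative.
SERVER_IDS = {
    "primary": {"binlog_servers": [100, 101]},
    "backup": {"replica_appliers": [200, 201], "binlog_servers": [202, 203]},
}


def render_binlog_cnf(server_id: int) -> str:
    """Render a minimal [mysqld] fragment for one binlog server.

    Assumption: only two options are shown here; the remaining settings
    are those of the Prepare primary tutorial.
    """
    return "\n".join([
        "[mysqld]",
        f"server-id={server_id}",
        "log-bin=binlog",
    ])


def all_ids(layout: dict) -> list[int]:
    """Flatten all server IDs across clusters and roles."""
    return [sid for roles in layout.values() for ids in roles.values() for sid in ids]


if __name__ == "__main__":
    ids = all_ids(SERVER_IDS)
    # Server IDs must be unique across both clusters.
    assert len(ids) == len(set(ids))
    for sid in SERVER_IDS["backup"]["binlog_servers"]:
        print(render_binlog_cnf(sid), end="\n\n")
```

Keeping the IDs in one place makes it easy to check that no ID is reused when a further cluster is added later.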

Need for downtime#

We are not yet running active-active replication, since the primary cluster does not have replica appliers. Nor have we set up any conflict detection. Both of these elements are introduced in later tutorials.

Because we are only running an active-passive setup, we can only write to one cluster at a time. We therefore need a load balancer to switch between the clusters when failing over; setting one up is outside the scope of this tutorial. In addition, since we do not have any conflict detection in place, we must wait for the replica appliers to apply all log records before we can use the backup cluster for writes. Hence, we will experience downtime when failing over. Usually, this downtime should be less than a minute, and sometimes only a few seconds.
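The waiting step can be sketched as a simple polling loop. This is a minimal sketch, not a definitive implementation: `wait_until_applied` and its parameters are hypothetical names, and the two callables are assumed to wrap epoch queries against the `mysql.ndb_binlog_index` table on a binlog server of the primary cluster and the `mysql.ndb_apply_status` table on a replica applier of the backup cluster.

```python
import time
from typing import Callable


def wait_until_applied(
    source_epoch: Callable[[], int],
    applied_epoch: Callable[[], int],
    timeout_s: float = 60.0,
    poll_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the replica has applied the last epoch binlogged by the source.

    Assumption: source_epoch() wraps something like
        SELECT MAX(epoch) FROM mysql.ndb_binlog_index;  -- on the source binlog server
    and applied_epoch() wraps something like
        SELECT MAX(epoch) FROM mysql.ndb_apply_status;  -- on the replica applier
    Writes to the source must already be stopped, otherwise the target keeps moving.
    Returns True once caught up, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if applied_epoch() >= source_epoch():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # Fake epochs standing in for real queries: the replica catches up on the third poll.
    applied = iter([10, 12, 15])
    print(wait_until_applied(lambda: 15, lambda: next(applied), sleep=lambda s: None))
```

The time spent in this loop, plus the load balancer switch, is the downtime discussed above.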

Defining cluster failures#

A cluster failure can be defined at multiple levels:

1. Query servers (e.g. MySQL servers) are down
2. All data nodes of one node group are down
3. Data nodes are out of memory or disk space (the cluster is now read-only)
4. Data is missing or inconsistent

Each level is more severe than the previous one. Unless deeper issues are at hand, the first two levels can often be resolved within seconds or minutes. One may want to take this into account when deciding whether to fail over between clusters.
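To make the ordering concrete, the levels can be modelled as an ordered enumeration. This is only a sketch: the names `FailureLevel` and `often_quick_to_recover` are hypothetical, and the helper merely encodes the rule of thumb stated above rather than a fail-over policy.

```python
from enum import IntEnum


class FailureLevel(IntEnum):
    """Cluster failure levels, ordered by increasing severity."""
    QUERY_SERVERS_DOWN = 1
    NODE_GROUP_DOWN = 2
    DATA_NODES_OUT_OF_RESOURCES = 3
    DATA_MISSING_OR_INCONSISTENT = 4


def often_quick_to_recover(level: FailureLevel) -> bool:
    """True for the first two levels, which can often be resolved within
    seconds or minutes unless deeper issues are at hand."""
    return level <= FailureLevel.NODE_GROUP_DOWN


if __name__ == "__main__":
    observed = [FailureLevel.QUERY_SERVERS_DOWN, FailureLevel.DATA_NODES_OUT_OF_RESOURCES]
    worst = max(observed)  # IntEnum members compare by severity
    print(worst.name, often_quick_to_recover(worst))
```

When several failures are observed at once, the most severe one is the natural input for the fail-over decision.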

How a cluster failure is detected is outside the scope of this tutorial. Detection is part of operating a cluster and is described here.

Graceful switch of active role#

The following figure shows all the failure and responsibility zones before our primary cluster fails or is shut down:

A stopped primary cluster is an entire responsibility zone that is down. Hence, we will be required to take action on a global level. For simplicity, we call the initial primary cluster "cluster A" and the initial backup cluster "cluster B". The switch consists of the following steps:

1. Wait for replica applier in cluster B to apply all log records
2. Stop replica applier in cluster B
   Either run `STOP REPLICA` or kill the MySQL Server process.
3. Switch load balancer between clusters
   Cluster B is now the primary cluster.
4. Create backup in cluster B
5. Restore cluster A from backup
6. Restart cluster A’s binlog servers from scratch
7. Start cluster A’s replica appliers
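The order of these steps matters, so an automated switch would typically run them strictly in sequence and stop at the first error. The following is a minimal sketch under that assumption; the step names in `STEPS` and the function `run_switch` are hypothetical, and each action is a caller-supplied callable (for example one that runs `STOP REPLICA` or triggers a backup).

```python
from typing import Callable, Dict, List

# Step names mirror the list above; they are illustrative identifiers only.
STEPS: List[str] = [
    "wait_for_replica_applier_b",
    "stop_replica_applier_b",
    "switch_load_balancer_to_b",
    "create_backup_in_b",
    "restore_a_from_backup",
    "restart_binlog_servers_a",
    "start_replica_appliers_a",
]


def run_switch(actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every step in order and return the names of the completed steps.

    Raises KeyError before doing anything if an action is missing. An
    exception raised by an action propagates, so later steps never run
    after a failed one.
    """
    missing = [step for step in STEPS if step not in actions]
    if missing:
        raise KeyError(f"no action registered for: {missing}")
    done: List[str] = []
    for step in STEPS:
        actions[step]()
        done.append(step)
    return done


if __name__ == "__main__":
    # Dry run: every action just prints its own name.
    dry_run = {step: (lambda s=step: print("would run:", s)) for step in STEPS}
    run_switch(dry_run)
```

Checking for missing actions up front avoids starting a switch that cannot be completed, for instance stopping the replica applier without being able to switch the load balancer afterwards.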