Create RonDB cluster#
The link above presents the documentation for how to create a RonDB cluster as part of Hopsworks. In this interface the user is able to decide on the following parameters:
Number of Data Nodes
VM Type for the Data nodes
Storage Size for Data nodes
Number of MySQL Servers
VM Type for the MySQL Servers
Storage Size for MySQL Servers
In the Detailed section it is also possible to decide on the following parameters.
Number of Replicas
Number of API Nodes
VM Type of API Nodes
Storage Size for API nodes
The number of data nodes must be a multiple of the number of replicas.
The first decision is whether to use 1, 2 or 3 replicas (4 replicas is also possible). If high availability isn't required, then 1 replica is sufficient. If high availability is required, at least 2 replicas should be chosen.
For the most part the number of data nodes should be equal to the number of replicas. One reason for having more data nodes than replicas is that we need a larger database size than can be accomplished through one set of data nodes.
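The relationship between data nodes and replicas can be sketched as a small shell check. This mirrors the constraint stated above; it is an illustration, not part of the installer script:

```shell
# The number of node groups is the number of data nodes divided by the
# number of replicas; the division must be exact for a valid configuration.
NUM_DATA_NODES=6
NUM_REPLICAS=2
if [ $((NUM_DATA_NODES % NUM_REPLICAS)) -ne 0 ]; then
  echo "Invalid: data nodes must be a multiple of replicas" >&2
  exit 1
fi
echo "Node groups: $((NUM_DATA_NODES / NUM_REPLICAS))"
```

With 6 data nodes and 2 replicas this yields 3 node groups, each storing a distinct partition of the database.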
In a cluster with the same amount of data nodes as replicas the database size is limited by the memory size of the data node VM. The largest standard VM is the r5-metal which can house around 768 GByte of memory. This means that around 650 GBytes of memory is available for in-memory rows and indexes.
RonDB supports storing columns on disk, thus if disk columns are used this size can be even bigger. There are also even bigger VM sizes that we haven't tried out very much yet.
Thus the size of the database could potentially affect the choice of the number of data nodes.
It is unlikely that the database processing requirements call for more data nodes. If they seem to, one should most likely first verify that the requirement is larger than what can be accomplished with a single node group of data nodes.
The number of MySQL Servers should be chosen based on the amount of query processing required towards RonDB. In Sysbench benchmarks we find that we need about twice as many CPUs on the MySQL Servers as on the data nodes. If we want high availability we should have at least 2 MySQL Servers.
We should not create VMs for MySQL Servers with more than 32 VCPUs at the moment.
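As a rough sizing sketch based on the 2x rule above (the numbers are illustrative, not prescriptive):

```shell
# Data layer: 2 data nodes with 16 VCPUs each (illustrative example)
DATA_NODE_VCPUS=$((2 * 16))
# MySQL layer: about twice the data node VCPUs, split over servers
# of at most 32 VCPUs each (also at least 2 for high availability)
MYSQL_VCPUS=$((2 * DATA_NODE_VCPUS))
NUM_MYSQL_SERVERS=$((MYSQL_VCPUS / 32))
echo "MySQL VCPUs needed: ${MYSQL_VCPUS} (${NUM_MYSQL_SERVERS} servers with 32 VCPUs)"
```

For this example, 64 MySQL Server VCPUs split over 2 servers of 32 VCPUs satisfies both the throughput rule and the high availability recommendation.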
Storage size for MySQL Servers and API nodes can normally be set to the minimum allowed.
RonDB data nodes require around 50% as much storage space as the memory size of the VM. In addition they require around 128 GByte of storage space for REDO logs and UNDO logs. Some amount of storage space should also be allocated for any disk column storage. Thus a minimum of 256 GByte is required for a data node. A very large data node VM with 768 GByte of memory would require at least 2 TByte of storage space and potentially more if disk columns are heavily used.
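Following the rule above (50% of VM memory plus about 128 GByte for the logs), the minimum can be computed directly; disk column space is excluded in this sketch:

```shell
# Minimum data node storage = 50% of VM memory + 128 GB for REDO/UNDO logs,
# plus whatever space disk columns need (not included here)
VM_MEMORY_GB=256
LOG_GB=128
MIN_STORAGE_GB=$((VM_MEMORY_GB / 2 + LOG_GB))
echo "Minimum storage: ${MIN_STORAGE_GB} GB"
```

For a 256 GByte VM this gives the 256 GByte minimum mentioned above.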
Logical Clocks provides a simple script that can be used to create a RonDB cluster. This script can be used to create a RonDB cluster in Google Cloud (GCP) and in Azure. It can also be used to create a RonDB cluster using your own computers.
The script can be used in either an interactive mode or a non-interactive mode. We will first go through the parameters used when starting a RonDB cluster in Azure, and then those used when starting a RonDB cluster in GCP.
The cloud script can be used to create RonDB clusters in Azure and in GCP. In Azure the script relies on using the Azure command line program. Thus before a cluster can be created in Azure it is necessary to download the Azure command line to your local computer. A prerequisite in the current version of the script is that it is executed from a Linux OS. Thus if your computer uses Mac OS X or Windows you need to execute those scripts from a local Linux VM on the machine.
The script is downloaded using the following command:
Use the search term "Azure CLI download" on Google and you will quickly find out how to install the Azure CLI.
After installing the Azure CLI the first step is to authenticate the CLI towards Azure. This is performed by the command:
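The standard Azure CLI login command (assuming a current version of the CLI) is:

```shell
# Opens a browser window to authenticate the CLI against your Azure account
az login
```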
This command will launch a web page from where you are able to log into Azure. After this command has been executed the CLI will be able to use the authenticated user on your machine.
The next step is to configure your Azure CLI environment. This will among other things set up the default region to use. This configure command is executed through the command:
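The Azure CLI offers an interactive configuration command for this; the region westeurope below is only an example value:

```shell
# Interactive configuration of the Azure CLI environment
az configure
# Alternatively, set defaults directly, e.g. the default region
az configure --defaults location=westeurope
```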
Finally it is necessary to set up a virtual network (referred to as VPN in the script's parameters) in the Azure account and an Azure resource group to use for the experiments. After those preparations one is ready to execute the script that creates the RonDB cluster.
The first step is the same as for Azure: install the CLI for Google Cloud. Before proceeding with the installation there needs to be a GCP user and a GCP project to connect your CLI to. After ensuring these exist and installing the CLI, the environment is set up using the command:
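The standard setup command in the GCP CLI is gcloud init (assuming the Google Cloud SDK is installed):

```shell
# Walks through account selection, project selection and default
# compute region/zone for the local gcloud environment
gcloud init
```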
This command will set up the GCP account to use, the GCP project to use, the compute region and zone to use and some other things. These configurations only apply to your local environment.
After these preparations you are ready to create your first RonDB cluster in GCP.
Create a RonDB cluster in Azure#
In my testing I usually create a quick start script that I call rondb-quick-start.sh. Before executing the script to start a new RonDB cluster I edit this file. I will explain how to create a RonDB cluster in Azure by explaining how to write such a quick start script.
The RonDB cluster created through this script will always have 1 head node, which will contain the RonDB management server. It can also be used as the VM from which benchmarks are executed, as we will show in another chapter.
The number of data nodes needs to be a multiple of the number of replicas. The normal number of replicas is 2, and the normal number of data nodes is also 2. It is possible to run with lowered availability using only 1 replica, or to increase the availability even further by using 3 replicas.
The number of MySQL Servers is flexible and can be scaled up; in an SQL benchmark the MySQL Servers will normally use twice as many CPUs as the RonDB data nodes.
This is always the configuration of a RonDB cluster created through this script.
When executing a benchmark in RonDB one can use a 3-tiered approach, with the head node as the benchmark driver, the MySQL Servers as the second tier handling requests to the data layer, and the RonDB data nodes as the third tier. It is also possible to execute 2-tiered benchmarks by running the benchmark on the MySQL Server VMs and only starting the benchmark programs from the head node over SSH.
Thus it is possible to decide whether one wants to use a 3-tiered approach or a 2-tiered approach.
The next important thing to decide is which VM types to use for the head node, the RonDB data nodes and the MySQL Servers. These are set through the options azure-head-instance-type, azure-data-node-instance-type and azure-api-node-instance-type. In our testing we have seen that v4 VM types have superior performance and we have mainly used Standard_Dx_v4 for the head node and MySQL Servers, where x is replaced by the number of VCPUs to use in the VM. The data nodes mainly use Standard_Ex_v4 in a similar fashion.
It is necessary to set a prefix to VM names to ensure that VMs get a unique name.
For head nodes and MySQL Servers we recommend to use the default boot size. For RonDB data nodes we recommend to set the boot size to twice the size of the memory in the VM. Thus in the case of Standard_E16_v4 with 16 VCPUs and 128 GByte memory we recommend using a boot size of 256 (GBytes).
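The boot size rule for data nodes can be computed directly; the VM type in the comment is the example from above:

```shell
# Recommended data node boot size = 2x the VM memory
VM_MEMORY_GB=128               # Standard_E16_v4: 16 VCPUs, 128 GB memory
DATA_NODE_BOOT_SIZE=$((2 * VM_MEMORY_GB))
echo "DATA_NODE_BOOT_SIZE=${DATA_NODE_BOOT_SIZE}"
```

This yields the boot size of 256 GBytes used for Standard_E16_v4 data nodes.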
The script should set the availability-zone parameter to ensure that low latency is achieved in the RonDB cluster. This is a number between 1 and 3. In some large regions it might be necessary to create a proximity placement group and specify this when creating the VMs.
The install action specifies whether to setup a full cluster or whether everything should be set up in a single VM. For the single VM case the install action should be localhost instead of cluster.
It is also possible to select the OS image to use when creating the VM.
The script executes the following steps. First it creates the necessary VMs required for the cluster, one VM at a time; the managed version creates all VMs in parallel instead. After this step the head node controls the installation of all the necessary software components. In the managed version we have already created OS images with all software components prepared, so the managed version is considerably faster in creating the RonDB cluster.
The actual command to create the RonDB cluster is seen below showing the source code of a shell script that I call rondb-quick-start.sh.
HEAD_INSTANCE_TYPE=Standard_D16_v4
DATA_NODE_INSTANCE_TYPE=Standard_E16_v4
API_NODE_INSTANCE_TYPE=Standard_D16_v4
NUM_DATA_NODES=2
NUM_API_NODES=2
NUM_REPLICAS=2
VM_NAME=test
CLOUD=azure
INSTALL_ACTION=cluster
VPN=name_of_vpn
PPG=name_of_proximity_placement_group
RESOURCE_GROUP=name_of_resource_group
DATA_NODE_BOOT_SIZE=256
ZONE=3
OS_IMAGE=Canonical:UbuntuServer:18.04-LTS:latest
./rondb-cloud-installer.sh \
  --non-interactive \
  --cloud $CLOUD \
  --install-action $INSTALL_ACTION \
  --vm-name-prefix $VM_NAME \
  --azure-resource-group $RESOURCE_GROUP \
  --azure-virtual-network $VPN \
  --azure-head-instance-type $HEAD_INSTANCE_TYPE \
  --azure-data-node-instance-type $DATA_NODE_INSTANCE_TYPE \
  --azure-api-node-instance-type $API_NODE_INSTANCE_TYPE \
  --num-data-nodes $NUM_DATA_NODES \
  --num-api-nodes $NUM_API_NODES \
  --num-replicas $NUM_REPLICAS \
  --availability-zone $ZONE \
  --database-node-boot-size $DATA_NODE_BOOT_SIZE
  #--debug \
  #--proximity-placement-group $PPG \
  #--os-image $OS_IMAGE
After the above shell script has completed, the actual installation is still ongoing. One can then log in to the head node using its external IP address, which is shown at the end of the script execution. This IP address is the IP address of the head node, where the RonDB management server resides and from where one can execute benchmarks towards RonDB.
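Logging in is a plain SSH connection to that address. The user name depends on the OS image used; ubuntu below is an assumption for the default Ubuntu image, not something the script dictates:

```shell
# Replace external_IP_address with the address printed by the installer,
# and adjust the user name to match your OS image
ssh ubuntu@external_IP_address
```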
Before proceeding allow the installation process to complete. The installation process can be followed through executing the command in the head node:
tail -f installation.log
When this log contains the words DAG IS DONE, the RonDB cluster is created and is up and running. The start of the cluster is part of the installation of the RonDB cluster. At this point we are ready to use the RonDB cluster; we will show in another chapter how to e.g. execute a benchmark towards RonDB.
At this point you can show the status of the cluster through a command using the RonDB management client. However, before issuing this command we should log in as the mysql user.
sudo su - mysql
ndb_mgm -e "show"
Recommended VM types in Azure#
There are many other VM types in Azure besides the ones mentioned here. These are the VM types that we have tested and verified to work well with RonDB. In particular we have seen that one should use v4 VMs. v3 VMs have significantly higher network latency and thus much higher latency and lower throughput compared to the v4 VMs.
All recommended VMs use remote block storage for the file system. Later we will also work on VMs with a local file system based on NVMe drives, which have much higher storage bandwidth.
Head node VM#
The head node contains the RonDB management server. This process requires very little CPU and memory, so a normal 1-2 VCPU VM is good enough. However, if the VM is also used to execute benchmarks it must contain enough VCPUs to drive the benchmark load. Expect the head node to need around 25% as many VCPUs as the MySQL Servers to drive the benchmark.
Data node VM#
The data node VM needs to have sufficient amount of memory AND sufficient amount of VCPUs to handle the load. We recommend using the E category that have 8 GByte of memory per VCPU for data node VMs. About 75% of the memory is available for database rows and indexes (a bit more in larger VMs).
MySQL Server VM#
The MySQL Server VMs are clients only in RonDB, so they mainly consume CPU resources. Thus the focus should be on a CPU-oriented VM. This means that the D category of VMs in Azure should work fine.
D category VM#
The Standard_D4_v4 is a VM in this category. It contains 4 VCPUs and 16 GByte of memory. The number of VCPUs can be 2, 4, 8, 16 or 32. It is currently not recommended to use MySQL Server VMs larger than 32 VCPUs. For head nodes one can use even larger VMs if required.
E category VM#
The Standard_E4_v4 is a VM in this category. It contains 4 VCPUs and 32 GBytes of memory. The number of VCPUs can be 2, 4, 8, 16, 20, 32, 48 or 64.
Install RonDB on localhost#
When RonDB is installed on localhost, all RonDB nodes are placed on a single VM. In this case only the head VM is created. Thus we only need to define the head VM instance type, the boot size of the head VM and the VM name prefix. The command below shows how this is done in Azure.
HEAD_INSTANCE_TYPE=Standard_E4_v4
VM_NAME=test
CLOUD=azure
INSTALL_ACTION=localhost
HEAD_NODE_BOOT_SIZE=64
ZONE=3
./rondb-cloud-installer.sh \
  --non-interactive \
  --cloud $CLOUD \
  --install-action $INSTALL_ACTION \
  --vm-name-prefix $VM_NAME \
  --azure-head-instance-type $HEAD_INSTANCE_TYPE \
  --availability-zone $ZONE \
  --head-node-boot-size $HEAD_NODE_BOOT_SIZE
  #--debug \
Create a RonDB cluster in GCP#
Creating a RonDB cluster in GCP is very similar to creating one in Azure. A few more parameters are defined when setting up the GCP CLI, so the command line to create a RonDB cluster is a bit shorter. Some parameters use gcp instead of azure. Below is an example of a script that will create a RonDB cluster.
The default OS used on GCP is CentOS, on Azure the default was Ubuntu 18.04.
HEAD_INSTANCE_TYPE=n2-highcpu-16
DATA_NODE_INSTANCE_TYPE=n2-highmem-16
API_NODE_INSTANCE_TYPE=n2-highcpu-16
NUM_DATA_NODES=2
NUM_API_NODES=2
NUM_REPLICAS=2
VM_NAME=test
CLOUD=gcp
INSTALL_ACTION=cluster
DATA_NODE_BOOT_SIZE=256
OS_IMAGE=centos-7-v20210401
ZONE=3
./rondb-cloud-installer.sh \
  --non-interactive \
  --cloud $CLOUD \
  --install-action $INSTALL_ACTION \
  --vm-name-prefix $VM_NAME \
  --gcp-head-instance-type $HEAD_INSTANCE_TYPE \
  --gcp-data-node-instance-type $DATA_NODE_INSTANCE_TYPE \
  --gcp-api-node-instance-type $API_NODE_INSTANCE_TYPE \
  --num-data-nodes $NUM_DATA_NODES \
  --num-api-nodes $NUM_API_NODES \
  --num-replicas $NUM_REPLICAS \
  --availability-zone $ZONE \
  --database-node-boot-size $DATA_NODE_BOOT_SIZE
  #--os-image $OS_IMAGE
  #--debug \
After executing the script one logs into the head VM in the same fashion as in Azure to verify that the installation process completed successfully.
Recommended VM types in GCP#
We found it imperative to use the second generation VMs in Google Cloud. There are 3 main types of VMs in the second generation: n2-standard-x, n2-highmem-x and n2-highcpu-x, where x is the number of VCPUs in the VM.
The data nodes are recommended to use n2-highmem-x, which has 8 GByte of memory per VCPU. The head VM and MySQL Server VMs are recommended to use the n2-highcpu-x VM type, which has 1 GByte of memory per VCPU.
x can be the numbers 2, 4, 8, 16, 32, 48, 64 and 80.
We will test more VM types in GCP moving forward; these are the types we currently recommend. If you have recommendations for VMs that we should try out, feel free to provide us with this information.