Yahoo Cloud Serving Benchmarks (YCSB)

YCSB is widely used for comparing cloud data serving systems. You can run YCSB in two configurations:

  • Using JDBC API that connects to RonDB via MySQL Server(s)

  • Using native ClusterJ API that connects directly to RonDB data nodes

Using ClusterJ saves the computer resources needed for MySQL Servers. ClusterJ also offers slightly better performance and latency for key-value lookups and writes. On the other hand, the MySQL Server is somewhat more efficient at complex range scan operations, as tested in Workload E.

Setting up a YCSB benchmark environment

The source code for YCSB is available at github.com/logicalclocks/YCSB, so first clone this tree. Building the YCSB tree requires Java and Maven.

git clone http://github.com/logicalclocks/YCSB.git
cd YCSB

Build the rondb-binding to run the benchmark using the native ClusterJ API.

mvn -pl site.ycsb:rondb-binding -am clean package

Alternatively, build the jdbc-binding to run the benchmark using the JDBC driver.

mvn -pl site.ycsb:jdbc-binding -am clean package

Extra actions before using ClusterJ

The rondb-binding uses the latest stable ClusterJ version. To use a different version of the ClusterJ library, edit the pom.xml file at the top level of the YCSB tree before compiling and set rondb.clusterj.version to the version you want to use.
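As a sketch, assuming the version is defined as a Maven property in the top-level pom.xml (the property name is taken from the text above; the version number is only illustrative):

<properties>
  <!-- illustrative version; replace with the ClusterJ version you want -->
  <rondb.clusterj.version>21.04.9</rondb.clusterj.version>
</properties>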

After compiling the YCSB tree it is necessary to point to the NDB API library that ClusterJ uses. In a managed RonDB installation this is found in the directory /srv/hops/mysql/lib. Also make sure that the JVM gets enough memory to execute the YCSB benchmark by setting the JAVA_OPTS variable.

export JAVA_OPTS="-Xmx4g"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/srv/hops/mysql/lib

Create YCSB table

Each YCSB client uses a single table, but it is possible to use different tables from different clients. A single table has limited write scalability, since the default number of partitions per table is 2 * the number of data nodes. Thus in a standard 2-node setup we will have 4 partitions per table.

When we create the table we decide how many payload columns to use and the size of those columns. In addition, we can choose whether the payload columns are stored in memory or on disk.

The default name for the table used by YCSB is usertable. It is possible to run multiple simultaneous YCSB client instances operating on different tables. When you create a table with a different name, you can pass the table name to YCSB through the table parameter, either with the -p command-line option or in the file describing the workload, as described further below.
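For example, to run a client against a table named usertable2 (the table name is only illustrative; table is a standard YCSB property):

./bin/ycsb run rondb -s -P workloads/workloada -p table=usertable2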

The example below creates a YCSB table with a single payload column, a 4 kB binary data column.

CREATE TABLE usertable (YCSB_KEY VARCHAR(255) PRIMARY KEY, FIELD0 varbinary(4096))
ENGINE = NDB;

Below is the standard YCSB table using 10 payload columns of VARCHAR data type, each storing up to 100 characters. The default character set in RonDB is UTF8MB4, which means that each character can take up to 4 bytes of space, so a row can require up to 4000 bytes of payload storage.

CREATE TABLE usertable (YCSB_KEY VARCHAR(255) PRIMARY KEY,
                        field0 varchar(100),
                        field1 varchar(100),
                        field2 varchar(100),
                        field3 varchar(100),
                        field4 varchar(100),
                        field5 varchar(100),
                        field6 varchar(100),
                        field7 varchar(100),
                        field8 varchar(100),
                        field9 varchar(100))
ENGINE = NDB;

Creating a table with payload columns stored on disk requires a bit more effort. First it is necessary to create a logfile group that stores the UNDO log for the disk parts. An UNDO log file of 64 GByte is sufficiently large to handle all YCSB workloads.

Next one needs to create a tablespace and map a data file to this tablespace. There is an important consideration to make here: if one uses a set of NVMe drives, it is better to have one data file per NVMe drive than to combine the drives in a RAID setup.

The blog post at http://mikaelronstrom.blogspot.com/2020/08/setting-up-nvme-drives-on-oracle-cloud.html gives many details on how to set up an optimal disk configuration for RonDB, except that, as mentioned above, 64 GByte of UNDO log is sufficient.

When creating the tablespace we connect it to the logfile group and set the size of the initial data file. Using the ALTER TABLESPACE command one can then add more data files as required.

Finally, when creating the table, one connects it to the tablespace and sets the non-indexed columns to use disk storage. In this case the primary key will still use in-memory storage, but the payload column will be stored on disk.

CREATE LOGFILE GROUP lg1
ADD UNDOFILE 'undofile.dat'
INITIAL_SIZE 64G
ENGINE=NDB;
CREATE TABLESPACE ts1
ADD DATAFILE 'datafile.dat'
USE LOGFILE GROUP lg1
INITIAL_SIZE 256G
ENGINE=NDB;
ALTER TABLESPACE ts1
ADD DATAFILE 'datafile2.dat'
INITIAL_SIZE 256G
ENGINE=NDB;
CREATE TABLE usertable (YCSB_KEY VARCHAR(255) PRIMARY KEY, field0 varbinary(4096))
TABLESPACE ts1 STORAGE DISK ENGINE = ndb;

The final example shows how to create a table with a specified number of table partitions. The default in RonDB is to create 2 * the number of data nodes. In managed RonDB there are always 3 nodes set up for each node group. Thus if one has set up a benchmark in managed RonDB with 2 replicas and 2 data nodes, we will actually create 2 * 3 table partitions.

In most applications with a fair amount of reads and many tables this works perfectly fine. However, in benchmarks like YCSB that hit a single table and perform a lot of write operations, performance can be limited by the fact that writes to one partition can only be handled by one ldm thread. To create more write bandwidth one can create the table with more table partitions. In the example below we create a table with 16 partitions. This would be a good setting when executing towards data nodes with 64 VCPUs and thus 16 ldm threads per data node.

Another way to handle this is to use different tables for different YCSB clients, as sketched after the example below.

CREATE TABLE usertable (YCSB_KEY VARCHAR(255) PRIMARY KEY, field0 varbinary(4096))
ENGINE = NDB PARTITION BY KEY() PARTITIONS 16;
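As a minimal sketch of the multiple-tables approach mentioned above, each client gets its own table (the name usertable2 is only illustrative) and is then started with -p table=usertable2:

CREATE TABLE usertable2 (YCSB_KEY VARCHAR(255) PRIMARY KEY, field0 varbinary(4096))
ENGINE = NDB;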

Executing a YCSB benchmark using ClusterJ

Before executing the benchmark it is necessary to edit the workload file; if one is executing Workload A, the file is found at workloads/workloada in the YCSB tree.

To use the ClusterJ client one needs to set the following parameters:

rondb.connection.string=MGM_SERVER_IP:1186
rondb.schema=ycsb
rondb.fieldcount=1
rondb.fieldlength=4096
fieldnameprefix=field
threadcount=32

The first parameter sets the RonDB connection string, which needs to point to the RonDB management server in the cluster setup. Next one sets the database to be used by the client. We then set the number of fields stored in the table and the size of the fields. The fields are named FIELD0, FIELD1 and so forth; it is possible to use a different base for the field names by setting fieldnameprefix.

We probably want to change recordcount and operationcount as well. recordcount sets the number of records to insert and to choose from when executing operations. operationcount sets the number of operations to execute in one benchmark run.
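For example, the following settings (the values are only illustrative) insert 1 million records and execute 10 million operations:

recordcount=1000000
operationcount=10000000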

threadcount specifies how many threads YCSB should use when executing the benchmark. It is probably best passed as a command parameter, since it will change between runs (see the example at the end of this section).

The load phase, which fills usertable with rows, is executed using the following command:

./bin/ycsb load rondb -s -P workloads/workloada

Next we execute the actual benchmark command:

./bin/ycsb run rondb -s -P workloads/workloada

It is also possible to set parameters with the -p command option when starting the ycsb binary. The -s option means that we want status updates during the execution of the benchmark.
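For example, to override the thread count for a single run (64 is only an illustrative value):

./bin/ycsb run rondb -s -P workloads/workloada -p threadcount=64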

Executing a YCSB benchmark using JDBC

There is a set of parameters to set in the workload file before executing the benchmark. The reasoning above on threadcount holds here as well: it is probably best passed as a command parameter.

db.driver=com.mysql.cj.jdbc.Driver
db.url=jdbc:mysql://MYSQL_IP:3306/ycsb
db.user=root
db.passwd=passwd
threadcount=32
db.batchsize=1000
jdbc.fetchsize=10
jdbc.autocommit=true
jdbc.batchupdateapi=false

db.driver specifies that the MySQL JDBC driver is used.

db.url specifies where to find the MySQL Server: the hostname, the port (defaults to 3306), and the database to connect to. The MySQL JDBC driver supports load balancing, so it is possible to specify a list of MySQL Servers in the URL; see the MySQL documentation of the JDBC driver for more details.
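For example, a load-balanced URL over two MySQL Servers could look as follows (the hostnames are only illustrative):

db.url=jdbc:mysql:loadbalance://MYSQL_IP1:3306,MYSQL_IP2:3306/ycsb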

db.user and db.passwd are used to log in to the MySQL Server.

Next is a set of parameters that control how JDBC performs batching and when it commits.

Finally, when executing the benchmark we need to ensure that we have access to the JAR file of the MySQL JDBC driver, which can be downloaded from MySQL.

./bin/ycsb load rondb -s -P workloads/workloada -cp /path/to/mysql-connector.jar

Next we execute the actual benchmark command:

./bin/ycsb run rondb -s -P workloads/workloada -cp /path/to/mysql-connector.jar