# RonDB REST API
The RonDB REST API has two endpoints, the pkread endpoint and the batch endpoint. The pkread endpoint issues a read using a primary key lookup on a table, given a set of columns to read.
To execute this benchmark one can simply start a cluster using MTR. The simplest manner is to install your binary tarball somewhere where you want to run it. Next, go into the mysql-test directory and start a cluster. Before starting you likely want to change the contents of mysql-test/include/default_ndbd.cnf. You most likely want to set AutomaticMemoryConfig and AutomaticThreadConfig to 1, and you should also set NumCPUs to the number of CPUs you want each of the two RonDB data nodes to use. Next you will want to remove the line about ThreadConfig. It is also a good idea to increase the following parameters: DataMemory, SharedGlobalMemory, RedoBuffer, FragmentLogFileSize and LongMessageBuffer. One should also add the parameter TotalMemoryConfig.
The reason you need to change so many parameters is that the default settings are aimed at functional test cases, which mostly execute very small amounts of concurrent transactions and schema changes. A benchmark requires more memory and CPU resources. In my case I usually use the following settings (there is nothing scientific about them, they are simply good enough):
- AutomaticMemoryConfig=1
- AutomaticThreadConfig=1
- NumCPUs=6
- DataMemory=2500M
- TotalMemoryConfig=5G
- FragmentLogFileSize=1G
- SharedGlobalMemory=200M
- RedoBuffer=48M
- LongMessageBuffer=80M
- SpinMethod=StaticSpinning
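In mysql-test/include/default_ndbd.cnf these settings end up looking roughly like the sketch below; the [ndbd default] section name and the surrounding layout are assumptions about how that file is organised in your distribution, only the parameter values come from the list above.

```ini
# Hedged sketch of the edited mysql-test/include/default_ndbd.cnf;
# the exact layout of the file in your distribution may differ.
[ndbd default]
AutomaticMemoryConfig=1
AutomaticThreadConfig=1
NumCPUs=6
DataMemory=2500M
TotalMemoryConfig=5G
FragmentLogFileSize=1G
SharedGlobalMemory=200M
RedoBuffer=48M
LongMessageBuffer=80M
SpinMethod=StaticSpinning
```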
In a default setup one would use SpinMethod=LatencyOptimisedSpinning. However, using StaticSpinning means that we can rely on top to see how much CPU the RonDB data nodes actually use. Spinning improves throughput in all cases, but it does show a higher CPU usage, sometimes even much higher than the CPU actually used for real work.
These numbers assume you have at least 16 CPUs and 16 GB of memory in your machine. With more or fewer CPUs you can adjust NumCPUs appropriately, and with more memory you can increase DataMemory and TotalMemoryConfig.
Starting the RonDB cluster is done with the following command, executed in the mysql-test directory of the binary distribution after updating the configuration file as directed above.
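A hedged sketch of a typical way to start the cluster with MTR, assuming your build contains the ndb test suite: the --start option starts the servers configured for the given test without running the test itself, and the test name ndb.ndb_basic is an assumption, any small test from the ndb suite should do.

```bash
# Hedged sketch: start a RonDB/NDB cluster via MTR without running a test.
# The test name is an assumption; pick any small test from the ndb suite.
cd mysql-test
./mtr --start ndb.ndb_basic
```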
This command will start the cluster. The benchmarks for the REST API are executed by Go programs found in storage/ndb/rest-server2/server/test_go. The following command will list the available test cases for the REST API.
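A hedged sketch of such a command is shown below; it assumes the Go toolchain is installed and that the command is run from the test_go directory. The go test -list flag prints the tests and benchmarks matching a regular expression without running them.

```bash
# Hedged sketch: list all test cases and benchmarks in the REST API test suite.
cd storage/ndb/rest-server2/server/test_go
go test -list '.*' ./...
```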
Starting the cluster also creates two files in the var directory, rdrs.1.1.config.json and rdrs.2.1.config.json. We first need to stop these two REST API servers and reconfigure them. So first kill them using kill -9 as shown below. Next, edit the configuration files and place them in a path known to you. Finally, move to the bin directory of the RonDB binary distribution and run the two REST API servers with the proper configuration.
```bash
killall -9 rdrs2
./rdrs2 --config /path/rdrs.1.1.config.json
./rdrs2 --config /path/rdrs.2.1.config.json
```
Here is the configuration we used in the benchmarks below (rdrs.1.1.config.json):
```json
{
  "PIDFile": "/path/rdrs.1.1.pid",
  "REST": {
    "ServerIP": "0.0.0.0",
    "ServerPort": 13005,
    "NumThreads": 64
  },
  "RonDB": {
    "Mgmds": [ { "IP": "localhost", "Port": 13000 } ],
    "ConnectionPoolSize": 2
  },
  "Security": {
    "TLS": {
      "EnableTLS": false,
      "RequireAndVerifyClientCert": false
    },
    "APIKey": {
      "UseHopsworksAPIKeys": false
    }
  },
  "Rondis": {
    "Enable": true,
    "ServerPort": 6379,
    "NumThreads": 64,
    "NumDatabases": 2,
    "Databases": [
      {
        "Index": 0,
        "DirtyINCRDECR": true
      },
      {
        "Index": 1,
        "DirtyINCRDECR": true
      }
    ]
  },
  "Log": {
    "Level": "INFO"
  },
  "Testing": {
    "MySQL": {
      "Servers": [ { "IP": "localhost", "Port": 13001 } ],
      "User": "root",
      "Password": ""
    }
  }
}
```
The second REST API server uses ports 13006 and 6380 instead of 13005 and 6379.
Before running the benchmarks you need to point to the configuration file of the REST API server with the environment variable RDRS_CONFIG_FILE. Executing the following command will then benchmark the pkread endpoint. The only benchmark on the pkread endpoint is the BenchmarkSimple test case.
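A hedged sketch of that command is shown below; it assumes the Go toolchain is installed, that the command is run from the test_go directory, and that the benchmark lives in the pkread package mentioned later in this text. The configuration path is whatever path you used for the REST API server above.

```bash
# Hedged sketch: run the pkread BenchmarkSimple benchmark against the
# REST API server configured in rdrs.1.1.config.json.
export RDRS_CONFIG_FILE=/path/rdrs.1.1.config.json
cd storage/ndb/rest-server2/server/test_go
go test -run '^$' -bench BenchmarkSimple ./internal/integrationtests/pkread/
```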
Pointing the same command at the batchpkread package instead will execute the 3 different benchmarks on the batch endpoint. These are BenchmarkSimple, BenchmarkManyColumns and BenchmarkBinary.
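Again a hedged sketch, under the same assumptions as above:

```bash
# Hedged sketch: run all three benchmarks on the batch endpoint.
go test -run '^$' -bench . ./internal/integrationtests/batchpkread/
```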
One can ensure that only one of those is executed using the following command.
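For example, a hedged sketch that selects only BenchmarkBinary (the same assumptions as above apply):

```bash
# Hedged sketch: run only the BenchmarkBinary benchmark on the batch endpoint.
go test -run '^$' -bench '^BenchmarkBinary$' ./internal/integrationtests/batchpkread/
```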
The first time you run one of those benchmarks, the script will fill the database with the set of tables used in the benchmarks. This will take some time, after which it starts running the benchmark and outputs the benchmark results.
BenchmarkSimple runs a simple retrieval from a table with a single primary key column, reading a single VARCHAR(100) column. Thus it tests the maximum throughput of the REST API server.

BenchmarkManyColumns runs a retrieval from a table with one primary key column and 100 small columns storing some string. This benchmark measures the effect of handling many columns in the REST API server.

BenchmarkBinary runs a retrieval from a table with a single primary key column and a single large column that stores around 3 kBytes of data. This benchmark shows how the REST API server handles large volumes of binary data.
All benchmarks report the batch size, the number of threads, the throughput, the median response time and the 99% response time.
Since the benchmark is running on a single machine or VM, it is simple to check the CPU resources used by the different processes. For the most part the focus of the benchmarking is around the efficiency of the REST API server process.
# RonDB REST API Results
The results provided here were obtained on a desktop with an Intel Raptor Lake CPU with 8 performance cores and 16 efficiency cores, the 13th Gen Intel(R) Core(TM) i9-13900K. The memory size was 64 GB and the disks were fast enough not to affect the benchmarking.
No special attempt was made to lock processes or threads to specific CPUs. In a production setup this is something RonDB would normally do. Normally each RonDB process runs in its own virtual machine or in its own pod, where the pod mainly consists of the container running the RonDB process, plus possibly some init container. In such a setup it is natural to also lock threads to CPUs for the RonDB data nodes. No such attempts are made for REST API servers and MySQL Servers.
## BenchmarkSimple with pkread endpoint
This benchmark scales only with the number of threads used in the Go client. This can be changed by editing the file storage/ndb/rest-server2/server/test_go/internal/integrationtests/pkread/performance_test.go. Find GOMAXPROCS and change it from 24 to whatever number of threads you want to use.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Simple | 23654/s | 0.04 ms | 0.06 ms | 1 | 1 | 38% |
Benchmark Simple | 38230/s | 0.05 ms | 0.08 ms | 1 | 2 | 70% |
Benchmark Simple | 56663/s | 0.07 ms | 0.17 ms | 1 | 4 | 140% |
Benchmark Simple | 90713/s | 0.08 ms | 0.20 ms | 1 | 8 | 290% |
Benchmark Simple | 141985/s | 0.10 ms | 0.23 ms | 1 | 16 | 495% |
Benchmark Simple | 199978/s | 0.14 ms | 0.41 ms | 1 | 32 | 715% |
Benchmark Simple | 236875/s | 0.24 ms | 0.72 ms | 1 | 64 | 790% |
If one runs the same benchmark in a distributed setup there will be more latency, since the packets have to be transferred over a network, which usually adds around 50 microseconds per hop in a public cloud setting. So the numbers here show how much latency is due to the actual execution of the REST calls.
The CPU usage increases faster than the throughput since the NDB API provides very good latency when the number of requests is very low. At the same time the NDB API becomes more efficient again when the load increases, thus at extreme loads there will be more throughput available.
## BenchmarkSimple with batchpkread endpoint
Also in this case it is possible to change the behaviour by editing the Go client. This time the file to edit is storage/ndb/rest-server2/server/test_go/internal/integrationtests/batchpkread/performance_test.go. Here there are two parameters that can be changed: as above, the number of client threads can be changed by setting GOMAXPROCS, and in addition the batch size can be set through the variable batchSize.
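A hedged sketch of the two knobs in the batchpkread test file; apart from the names batchSize and GOMAXPROCS, which come from the text, the surrounding code is an assumption and the real performance_test.go will look different.

```go
package batchpkread

import "runtime"

// batchSize controls how many primary key reads go into each batch request
// sent to the batchpkread endpoint (hypothetical sketch of the variable).
var batchSize = 64

// Hypothetical sketch of where the client thread count is set.
func init() {
	runtime.GOMAXPROCS(24) // number of concurrent client threads
}
```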
We start by looking at how performance changes with a single thread while increasing the batch size.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Simple | 22595/s | 0.04 ms | 0.06 ms | 1 | 1 | 35% |
Benchmark Simple | 36949/s | 0.05 ms | 0.08 ms | 2 | 1 | 40% |
Benchmark Simple | 63319/s | 0.06 ms | 0.09 ms | 4 | 1 | 44% |
Benchmark Simple | 87474/s | 0.09 ms | 0.17 ms | 8 | 1 | 48% |
Benchmark Simple | 124857/s | 0.12 ms | 0.25 ms | 16 | 1 | 50% |
Benchmark Simple | 162073/s | 0.19 ms | 0.35 ms | 32 | 1 | 52% |
Benchmark Simple | 185027/s | 0.33 ms | 0.55 ms | 64 | 1 | 55% |
Benchmark Simple | 203803/s | 0.60 ms | 0.87 ms | 128 | 1 | 56% |
Benchmark Simple | 211251/s | 1.16 ms | 1.50 ms | 256 | 1 | 59% |
As can be seen in those numbers, we get almost the same results with the batch size replaced by the number of threads, both in terms of throughput and latency. However, we can see from the CPU usage that sending batches consumes about 10x less CPU in the REST API server. Both the batch size and the number of threads reach a plateau around 100.
So with these results it is now interesting to move to testing with a combination of threads and batch sizes.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Simple | 137791/s | 0.11 ms | 0.24 ms | 4 | 4 | 142% |
Benchmark Simple | 319844/s | 0.19 ms | 0.41 ms | 16 | 4 | 185% |
Benchmark Simple | 551096/s | 0.43 ms | 0.75 ms | 64 | 4 | 206% |
Benchmark Simple | 319658/s | 0.19 ms | 0.40 ms | 8 | 8 | 350% |
Benchmark Simple | 691409/s | 0.35 ms | 0.65 ms | 32 | 8 | 372% |
Benchmark Simple | 1028243/s | 0.95 ms | 1.52 ms | 128 | 8 | 405% |
Benchmark Simple | 693094/s | 0.34 ms | 0.75 ms | 16 | 16 | 542% |
Benchmark Simple | 1379632/s | 0.68 ms | 1.19 ms | 64 | 16 | 710% |
Benchmark Simple | 1688128/s | 2.22 ms | 5.07 ms | 256 | 16 | 725% |
Benchmark Simple | 1277227/s | 0.73 ms | 1.68 ms | 32 | 32 | 800% |
Benchmark Simple | 1640773/s | 0.73 ms | 1.68 ms | 64 | 32 | 800% |
Benchmark Simple | 2015032/s | 1.92 ms | 3.89 ms | 128 | 32 | 800% |
Benchmark Simple | 2146081/s | 3.59 ms | 6.10 ms | 256 | 32 | 1060% |
Benchmark Simple | 1799511/s | 1.98 ms | 5.42 ms | 64 | 64 | 985% |
Benchmark Simple | 2072006/s | 2.358 ms | 4.44 ms | 128 | 40 | 800% |
What we see from the above numbers is that batching provides the most efficient way of achieving high performance. However, to reach the full potential of the REST API server it is necessary to use multiple threads in combination with batching.
RonDB is heavily used in AI applications such as personalized recommendation systems. In these systems it is very common to perform batches of hundreds of key lookups with a requirement of very low latency. As can be seen here, we can achieve latency well below one millisecond when responding to hundreds of key lookups.
The application can also choose to split the batches into several requests and execute them in parallel on either different REST API servers or on different threads in the same REST API server. Thus it is possible to optimise for both latency and throughput using RonDB REST API servers.
The next question is what benefits there are from using multiple REST API servers. It is fairly easy to extend the above tests to run 2 benchmark clients using different REST API servers. They will still use the same RonDB data nodes, but will not interact other than by competing for CPU resources, since those benchmarks are executed on a single desktop machine.
In this table we report numbers from 2 REST API servers; we add up the CPU usage, so it is the total CPU usage and total throughput that are provided.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Simple | 40667/s | 0.04 ms | 0.08 ms | 1 | 1 | 70% |
Benchmark Simple | 218801/s | 0.13 ms | 0.28 ms | 16 | 1 | 97% |
Benchmark Simple | 399650/s | 0.60 ms | 0.85 ms | 128 | 1 | 114% |
Benchmark Simple | 498013/s | 0.25 ms | 0.50 ms | 16 | 4 | 378% |
Benchmark Simple | 1086516/s | 0.35 ms | 0.65 ms | 128 | 4 | 416% |
Benchmark Simple | 1946243/s | 0.97 ms | 1.88 ms | 64 | 16 | 1025% |
Benchmark Simple | 2250000/s | 2.40 ms | 4.80 ms | 128 | 24 | 1168% |
What we see from these numbers is that one benefits more from adding REST API servers than from adding threads. At the same time there is a limit on the number of slots for REST API servers in a cluster, so it makes sense to scale on threads as well as on the number of servers.
The general rule we use at Hopsworks is to allocate 16 CPUs per REST API server. If more capacity is needed one adds more REST API servers; it is possible to have quite a few REST API servers in a cluster. One can also go down to 8 CPUs or even 4 CPUs and still get good behaviour.
## BenchmarkManyColumns with batchpkread endpoint
This benchmark reads 100 small columns instead of a single column. This puts a lot more load on the REST API server since it has to parse a lot more in the request and also the RonDB data node will require more processing to fetch 100 columns instead of 1.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark ManyColumns | 7899/s | 0.12 ms | 0.30 ms | 1 | 1 | 41% |
Benchmark ManyColumns | 16098/s | 0.47 ms | 0.80 ms | 8 | 1 | 41% |
Benchmark ManyColumns | 51137/s | 0.60 ms | 1.05 ms | 8 | 4 | 153% |
Benchmark ManyColumns | 121893/s | 1.01 ms | 1.66 ms | 8 | 16 | 516% |
Benchmark ManyColumns | 150031/s | 1.65 ms | 3.19 ms | 8 | 32 | 695% |
One problem with this benchmark is that the Go client consumes almost all of the CPU resources. Thus it is hard to scale the benchmark any further than shown above. In the last line the RonDB data nodes used only 210% CPU whereas the Go client used more than 1400%.
## BenchmarkBinary with batchpkread endpoint
This benchmark is also quite heavy on the Go client, so again it is a bit harder to fully test the REST API server on a single machine.
Benchmark | Throughput Rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Binary | 12280/s | 0.06 ms | 0.16 ms | 1 | 1 | 26% |
Benchmark Binary | 24008/s | 0.22 ms | 0.53 ms | 8 | 1 | 26% |
Benchmark Binary | 74830/s | 0.29 ms | 0.56 ms | 8 | 4 | 109% |
Benchmark Binary | 111947/s | 0.38 ms | 0.75 ms | 8 | 8 | 208% |
Benchmark Binary | 170986/s | 0.50 ms | 0.98 ms | 8 | 16 | 366% |
Benchmark Binary | 201251/s | 0.88 ms | 2.06 ms | 8 | 32 | 482% |
Benchmark Binary | 229624/s | 1.18 ms | 2.47 ms | 12 | 32 | 495% |
The benchmark client seems not to scale well with larger batch sizes, but the picture looks as expected: the REST API server requires more time to handle many columns than to handle large columns. However the CPU usage is still very light and one can easily scale to millions of requests with both a large number of columns and/or large columns.