# RonDB REST API
The RonDB REST API has two endpoints, the pkread endpoint and the batch endpoint. The pkread endpoint issues a read using a primary key lookup on a table, given a set of columns to read.
To execute this benchmark one can simply start a cluster using MTR. The simplest manner is to install your binary tarball somewhere where you want to run it. Next, go into the mysql-test directory and start a cluster. Before starting you likely want to change the contents of mysql-test/include/default_ndbd.cnf. You most likely want to set AutomaticMemoryConfig and AutomaticThreadConfig to 1, and you should also set NumCPUs to the number of CPUs you want each of the two RonDB data nodes to use. Next you will want to remove the line about ThreadConfig. It is also a good idea to increase the following parameters: DataMemory, SharedGlobalMemory, RedoBuffer, FragmentLogFileSize and LongMessageBuffer. One should also add the parameter TotalMemoryConfig.
The reason you need to change so many parameters is that the default settings are aimed at functional test cases, which mostly execute very small amounts of concurrent transactions and schema changes. A benchmark requires more memory and CPU resources. In my case I usually use the following settings (there is nothing scientific about them, they are simply good enough):
- AutomaticMemoryConfig=1
- AutomaticThreadConfig=1
- NumCPUs=6
- DataMemory=2500M
- TotalMemoryConfig=5G
- FragmentLogFileSize=1G
- SharedGlobalMemory=200M
- RedoBuffer=48M
- LongMessageBuffer=80M
- SpinMethod=StaticSpinning
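In mysql-test/include/default_ndbd.cnf these settings end up looking roughly like the sketch below; the [ndbd default] section name and the surrounding layout are assumptions about how that file is organised in your distribution, only the parameter values come from the list above.

```ini
# Hedged sketch of the edited mysql-test/include/default_ndbd.cnf;
# the exact layout of the file in your distribution may differ.
[ndbd default]
AutomaticMemoryConfig=1
AutomaticThreadConfig=1
NumCPUs=6
DataMemory=2500M
TotalMemoryConfig=5G
FragmentLogFileSize=1G
SharedGlobalMemory=200M
RedoBuffer=48M
LongMessageBuffer=80M
SpinMethod=StaticSpinning
```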
In a default setup one would use SpinMethod=LatencyOptimisedSpinning. However, using StaticSpinning means that we can rely on top to see how much CPU the RonDB data nodes actually use. Spinning improves throughput in all cases, but it does show a higher CPU usage, sometimes even much higher than the CPU actually used for real work.
These numbers assume you have at least 16 CPUs and 16 GB of memory in your machine. With more or fewer CPUs you can adjust NumCPUs appropriately, and with more memory you can increase DataMemory and TotalMemoryConfig.
Starting the RonDB cluster is done with the following command, executed in the mysql-test directory of the binary distribution after updating the configuration file as directed above.
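A hedged sketch of a typical way to start the cluster with MTR, assuming your build contains the ndb test suite: the --start option starts the servers configured for the given test without running the test itself, and the test name ndb.ndb_basic is an assumption, any small test from the ndb suite should do.

```bash
# Hedged sketch: start a RonDB/NDB cluster via MTR without running a test.
# The test name is an assumption; pick any small test from the ndb suite.
cd mysql-test
./mtr --start ndb.ndb_basic
```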
This command will start the cluster. The benchmarks for the REST API are executed by Go programs found in storage/ndb/rest-server2/server/test_go. The following command will list the available test cases for the REST API.
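A hedged sketch of such a command is shown below; it assumes the Go toolchain is installed and that the command is run from the test_go directory. The go test -list flag prints the tests and benchmarks matching a regular expression without running them.

```bash
# Hedged sketch: list all test cases and benchmarks in the REST API test suite.
cd storage/ndb/rest-server2/server/test_go
go test -list '.*' ./...
```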
Starting the cluster also creates two files in the var directory, rdrs.1.1.config.json and rdrs.2.1.config.json. We first need to stop these two REST API servers and reconfigure them. So first kill them using kill -9 as shown below. Next, edit the configuration files and place them in a path known to you. Finally, move to the bin directory of the RonDB binary distribution and run the two REST API servers with the proper configuration.
```bash
killall -9 rdrs2
./rdrs2 --config /path/rdrs.1.1.config.json
./rdrs2 --config /path/rdrs.2.1.config.json
```
Here is the configuration we used in the benchmarks below (rdrs.1.1.config.json):
```json
{
  "PIDFile": "/path/rdrs.1.1.pid",
  "REST": {
    "ServerIP": "0.0.0.0",
    "ServerPort": 13005,
    "NumThreads": 64
  },
  "RonDB": {
    "Mgmds": [ { "IP": "localhost", "Port": 13000 } ],
    "ConnectionPoolSize": 2
  },
  "Security": {
    "TLS": {
      "EnableTLS": false,
      "RequireAndVerifyClientCert": false
    },
    "APIKey": {
      "UseHopsworksAPIKeys": false
    }
  },
  "Rondis": {
    "Enable": true,
    "ServerPort": 6379,
    "NumThreads": 64,
    "NumDatabases": 2,
    "Databases": [
      {
        "Index": 0,
        "DirtyINCRDECR": true
      },
      {
        "Index": 1,
        "DirtyINCRDECR": true
      }
    ]
  },
  "Log": {
    "Level": "INFO"
  },
  "Testing": {
    "MySQL": {
      "Servers": [ { "IP": "localhost", "Port": 13001 } ],
      "User": "root",
      "Password": ""
    }
  }
}
```
The second REST API server uses ports 13006 and 6380 instead of 13005 and 6379.
Before running the benchmarks you need to point to the configuration file of the REST API server with the environment variable RDRS_CONFIG_FILE. Executing the following command will then benchmark the pkread endpoint. The only benchmark on the pkread endpoint is the BenchmarkSimple test case.
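A hedged sketch of that command is shown below; it assumes the Go toolchain is installed, that the command is run from the test_go directory, and that the benchmark lives in the pkread package mentioned later in this text. The configuration path is whatever path you used for the REST API server above.

```bash
# Hedged sketch: run the pkread BenchmarkSimple benchmark against the
# REST API server configured in rdrs.1.1.config.json.
export RDRS_CONFIG_FILE=/path/rdrs.1.1.config.json
cd storage/ndb/rest-server2/server/test_go
go test -run '^$' -bench BenchmarkSimple ./internal/integrationtests/pkread/
```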
Pointing the same command at the batchpkread package instead will execute the 3 different benchmarks on the batch endpoint. These are BenchmarkSimple, BenchmarkManyColumns and BenchmarkBinary.
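Again a hedged sketch, under the same assumptions as above:

```bash
# Hedged sketch: run all three benchmarks on the batch endpoint.
go test -run '^$' -bench . ./internal/integrationtests/batchpkread/
```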
One can ensure that only one of those is executed using the following command.
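For example, a hedged sketch that selects only BenchmarkBinary (the same assumptions as above apply):

```bash
# Hedged sketch: run only the BenchmarkBinary benchmark on the batch endpoint.
go test -run '^$' -bench '^BenchmarkBinary$' ./internal/integrationtests/batchpkread/
```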
The first time you run one of those benchmarks, the script will fill the database with the set of tables used in the benchmarks. This will take some time, after which it starts running the benchmark and outputs the benchmark results.
BenchmarkSimple runs a simple retrieval from a table with a single primary key column, reading a single VARCHAR(100) column. Thus it tests the maximum throughput of the REST API server.

BenchmarkManyColumns runs a retrieval from a table with one primary key column and 100 small columns storing some string. This benchmark measures the effect of handling many columns in the REST API server.

BenchmarkBinary runs a retrieval from a table with a single primary key column and a single large column that stores around 3 kBytes of data. This benchmark shows how the REST API server handles large volumes of binary data.
All benchmarks report the batch size, the number of threads, the throughput, the median response time and the 99% response time.
Since the benchmark is running on a single machine or VM, it is simple to check the CPU resources used by the different processes. For the most part the focus of the benchmarking is around the efficiency of the REST API server process.
# RonDB REST API Results
The results provided here were obtained on a desktop with an Intel Raptor Lake CPU with 8 performance cores and 16 efficiency cores, the 13th Gen Intel(R) Core(TM) i9-13900K. The memory size was 64 GB and the disks were fast enough not to affect the benchmarking.
No special attempt was made to lock processes or threads to specific CPUs. In a production setup this is something RonDB would normally do. Normally each RonDB process runs in its own virtual machine or in its own pod, where the pod mainly consists of the container running the RonDB process, plus possibly some init container. In such a setup it is natural to also lock threads to CPUs for the RonDB data nodes. No such attempts are made for REST API servers and MySQL Servers.
## BenchmarkSimple with pkread endpoint
This benchmark scales only with the number of threads used in the Go client. This can be changed by editing the file storage/ndb/rest-server2/server/test_go/internal/integrationtests/pkread/performance_test.go. Find GOMAXPROCS and change it from 24 to whatever number of threads you want to use.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Simple | 23654/s | 0.04 ms | 0.06 ms | 1 | 1 | 38% |
Benchmark Simple | 38230/s | 0.05 ms | 0.08 ms | 1 | 2 | 70% |
Benchmark Simple | 56663/s | 0.07 ms | 0.17 ms | 1 | 4 | 140% |
Benchmark Simple | 90713/s | 0.08 ms | 0.20 ms | 1 | 8 | 290% |
Benchmark Simple | 141985/s | 0.10 ms | 0.23 ms | 1 | 16 | 495% |
Benchmark Simple | 199978/s | 0.14 ms | 0.41 ms | 1 | 32 | 715% |
Benchmark Simple | 236875/s | 0.24 ms | 0.72 ms | 1 | 64 | 790% |
If one runs the same benchmark in a distributed setup there will be more latency, since the packets have to be transferred over a network, which usually adds around 50 microseconds per hop in a public cloud setting. So the numbers here show how much latency is due to the actual execution of the REST calls.
The CPU usage increases faster than the throughput since the NDB API provides very good latency when the number of requests is very low. At the same time the NDB API becomes more efficient again when the load increases, thus at extreme loads there will be more throughput available.
## BenchmarkSimple with batchpkread endpoint
Also in this case it is possible to change the behaviour by editing the Go client. This time the file to edit is storage/ndb/rest-server2/server/test_go/internal/integrationtests/batchpkread/performance_test.go. Here there are two parameters that can be changed: as above, the number of client threads can be changed by setting GOMAXPROCS, and in addition the batch size can be set through the variable batchSize.
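A hedged sketch of the two knobs in the batchpkread test file; apart from the names batchSize and GOMAXPROCS, which come from the text, the surrounding code is an assumption and the real performance_test.go will look different.

```go
package batchpkread

import "runtime"

// batchSize controls how many primary key reads go into each batch request
// sent to the batchpkread endpoint (hypothetical sketch of the variable).
var batchSize = 64

// Hypothetical sketch of where the client thread count is set.
func init() {
	runtime.GOMAXPROCS(24) // number of concurrent client threads
}
```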
We start by looking at how performance changes with a single thread while increasing the batch size.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Simple | 22595/s | 0.04 ms | 0.06 ms | 1 | 1 | 35% |
Benchmark Simple | 36949/s | 0.05 ms | 0.08 ms | 2 | 1 | 40% |
Benchmark Simple | 63319/s | 0.06 ms | 0.09 ms | 4 | 1 | 44% |
Benchmark Simple | 87474/s | 0.09 ms | 0.17 ms | 8 | 1 | 48% |
Benchmark Simple | 124857/s | 0.12 ms | 0.25 ms | 16 | 1 | 50% |
Benchmark Simple | 162073/s | 0.19 ms | 0.35 ms | 32 | 1 | 52% |
Benchmark Simple | 185027/s | 0.33 ms | 0.55 ms | 64 | 1 | 55% |
Benchmark Simple | 203803/s | 0.60 ms | 0.87 ms | 128 | 1 | 56% |
Benchmark Simple | 211251/s | 1.16 ms | 1.50 ms | 256 | 1 | 59% |
As can be seen in those numbers, we get almost the same results with the batch size replaced by the number of threads, both in terms of throughput and latency. However, we can see from the CPU usage that sending batches consumes about 10x less CPU in the REST API server. Both the batch size and the number of threads reach a plateau around 100.
So with these results it is now interesting to move to testing with a combination of threads and batch sizes.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Simple | 137791/s | 0.11 ms | 0.24 ms | 4 | 4 | 142% |
Benchmark Simple | 319844/s | 0.19 ms | 0.41 ms | 16 | 4 | 185% |
Benchmark Simple | 551096/s | 0.43 ms | 0.75 ms | 64 | 4 | 206% |
Benchmark Simple | 319658/s | 0.19 ms | 0.40 ms | 8 | 8 | 350% |
Benchmark Simple | 691409/s | 0.35 ms | 0.65 ms | 32 | 8 | 372% |
Benchmark Simple | 1028243/s | 0.95 ms | 1.52 ms | 128 | 8 | 405% |
Benchmark Simple | 693094/s | 0.34 ms | 0.75 ms | 16 | 16 | 542% |
Benchmark Simple | 1379632/s | 0.68 ms | 1.19 ms | 64 | 16 | 710% |
Benchmark Simple | 1688128/s | 2.22 ms | 5.07 ms | 256 | 16 | 725% |
Benchmark Simple | 1277227/s | 0.73 ms | 1.68 ms | 32 | 32 | 800% |
Benchmark Simple | 1640773/s | 0.73 ms | 1.68 ms | 64 | 32 | 800% |
Benchmark Simple | 2015032/s | 1.92 ms | 3.89 ms | 128 | 32 | 800% |
Benchmark Simple | 2146081/s | 3.59 ms | 6.10 ms | 256 | 32 | 1060% |
Benchmark Simple | 1799511/s | 1.98 ms | 5.42 ms | 64 | 64 | 985% |
Benchmark Simple | 2072006/s | 2.358 ms | 4.44 ms | 128 | 40 | 800% |
What we see from the above numbers is that batching provides the most efficient way of achieving high performance. However, to reach the full potential of the REST API server it is necessary to use multiple threads in combination with batching.
RonDB is heavily used in AI applications such as personalized recommendation systems. In these systems it is very common to perform batches of hundreds of key lookups with a requirement of very low latency. As can be seen here, we can achieve latency well below one millisecond when responding to hundreds of key lookups.
The application can also choose to split the batches into several requests and execute them in parallel on either different REST API servers or on different threads in the same REST API server. Thus it is possible to optimise for both latency and throughput using RonDB REST API servers.
The next question is what benefits there are from using multiple REST API servers. It is fairly easy to extend the above tests to run 2 benchmark clients using different REST API servers. They will still use the same RonDB data nodes, but will not interact other than by competing for CPU resources, since those benchmarks are executed on a single desktop machine.
In this table we report numbers from 2 REST API servers; we add up the CPU usage, so it is the total CPU usage and total throughput that are provided.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Simple | 40667/s | 0.04 ms | 0.08 ms | 1 | 1 | 70% |
Benchmark Simple | 218801/s | 0.13 ms | 0.28 ms | 16 | 1 | 97% |
Benchmark Simple | 399650/s | 0.60 ms | 0.85 ms | 128 | 1 | 114% |
Benchmark Simple | 498013/s | 0.25 ms | 0.50 ms | 16 | 4 | 378% |
Benchmark Simple | 1086516/s | 0.35 ms | 0.65 ms | 128 | 4 | 416% |
Benchmark Simple | 1946243/s | 0.97 ms | 1.88 ms | 64 | 16 | 1025% |
Benchmark Simple | 2250000/s | 2.40 ms | 4.80 ms | 128 | 24 | 1168% |
What we see from these numbers is that one benefits more from adding REST API servers than from adding threads. At the same time there is a limit on the number of slots for REST API servers in a cluster, so it makes sense to scale on threads as well as on the number of servers.
The general rule we use at Hopsworks is to allocate 16 CPUs per REST API server. If more capacity is needed one adds more REST API servers; it is possible to have quite a few REST API servers in a cluster. One can also go down to 8 CPUs or even 4 CPUs and still get good behaviour.
## BenchmarkManyColumns with batchpkread endpoint
This benchmark reads 100 small columns instead of a single column. This puts a lot more load on the REST API server since it has to parse a lot more in the request and also the RonDB data node will require more processing to fetch 100 columns instead of 1.
Benchmark | Throughput rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark ManyColumns | 7899/s | 0.12 ms | 0.30 ms | 1 | 1 | 41% |
Benchmark ManyColumns | 16098/s | 0.47 ms | 0.80 ms | 8 | 1 | 41% |
Benchmark ManyColumns | 51137/s | 0.60 ms | 1.05 ms | 8 | 4 | 153% |
Benchmark ManyColumns | 121893/s | 1.01 ms | 1.66 ms | 8 | 16 | 516% |
Benchmark ManyColumns | 150031/s | 1.65 ms | 3.19 ms | 8 | 32 | 695% |
One problem with this benchmark is that the Go client consumes almost all of the CPU resources. Thus it is hard to scale the benchmark any further than shown above. In the last line the RonDB data nodes used only 210% CPU whereas the Go client used more than 1400%.
## BenchmarkBinary with batchpkread endpoint
This benchmark is also quite heavy on the Go client, so again it is a bit harder to fully test the REST API server on a single machine.
Benchmark | Throughput Rows/sec | Median latency | 99% latency | Batch | Threads | CPU |
---|---|---|---|---|---|---|
Benchmark Binary | 12280/s | 0.06 ms | 0.16 ms | 1 | 1 | 26% |
Benchmark Binary | 24008/s | 0.22 ms | 0.53 ms | 8 | 1 | 26% |
Benchmark Binary | 74830/s | 0.29 ms | 0.56 ms | 8 | 4 | 109% |
Benchmark Binary | 111947/s | 0.38 ms | 0.75 ms | 8 | 8 | 208% |
Benchmark Binary | 170986/s | 0.50 ms | 0.98 ms | 8 | 16 | 366% |
Benchmark Binary | 201251/s | 0.88 ms | 2.06 ms | 8 | 32 | 482% |
Benchmark Binary | 229624/s | 1.18 ms | 2.47 ms | 12 | 32 | 495% |
The benchmark client seems not to scale well with larger batch sizes, but the picture looks as expected: the REST API server requires more time to handle many columns than to handle large columns. However the CPU usage is still very light and one can easily scale to millions of requests with both a large number of columns and/or large columns.