RonDB Benchmarks#
This chapter reports some benchmark numbers for RonDB. The numbers in this chapter come mostly from simple benchmarks executed on one of our development machines. This means that they are very simple to reproduce, but obviously they don't show the scalability of RonDB. For that we will reference some larger benchmark efforts.
Benchmarks of RonDB focus on showing the performance of the following data services provided by RonDB:
- RonDB REST API
- RonDB Feature Store REST API
- Rondis
- MySQL
- RonSQL
RonDB REST API#
The RonDB REST API has two endpoints, the pkread endpoint and the batch endpoint.
The pkread endpoint issues a read using a primary key lookup on a table, given a number of columns to read.
To execute this benchmark one can simply start a cluster using MTR. The simplest manner is to install your binary tarball somewhere where you want to run it. Next go into the directory mysql-test and start a cluster. Before starting you likely want to change the contents of mysql-test/include/default_ndbd.cnf. You most likely want to set AutomaticMemoryConfig and AutomaticThreadConfig to 1, and you should also set NumCPUs to the number of CPUs you want each of the two RonDB data nodes to use. Next you will want to remove the line about ThreadConfig. It is also a good idea to increase the following parameters: DataMemory, SharedGlobalMemory, RedoBuffer, FragmentLogFileSize and LongMessageBuffer. One should also add the parameter TotalMemoryConfig.
The reason you need to change so many parameters is that the default setting is geared towards running functional test cases, and these mostly execute very small amounts of concurrent transactions and schema changes. A benchmark requires more memory and CPU resources. In my case I usually use the following settings (there is nothing scientific about these settings, they are simply good enough):
- AutomaticMemoryConfig=1
- AutomaticThreadConfig=1
- NumCPUs=6
- DataMemory=2500M
- TotalMemoryConfig=5G
- FragmentLogFileSize=1G
- SharedGlobalMemory=200M
- RedoBuffer=48M
- LongMessageBuffer=80M
- SpinMethod=StaticSpinning
In a default setup one would use SpinMethod=LatencyOptimisedSpinning. However, using StaticSpinning means that we can rely on top to see how much CPU the RonDB data nodes actually use. Spinning improves the throughput in all cases, but it does show a higher CPU usage, sometimes even much higher than the CPU actually used for real work.
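For example, assuming the data nodes run as the multithreaded ndbmtd binary, their CPU usage can be observed with something like:

```bash
# Show only the RonDB data node processes in top, refreshing every second
top -d 1 -p "$(pgrep -d, ndbmtd)"
```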
These numbers assume you have at least 16 CPUs and 16 GB of memory in your machine. With more or fewer CPUs you can adjust NumCPUs appropriately, and with more memory you can increase DataMemory and TotalMemoryConfig.
The RonDB cluster is started using the following command, executed in the mysql-test directory of the binary distribution after updating the configuration file as directed above.
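For example, mysql-test-run.pl can start the configured servers for a given test without running it via its --start option. The exact suite and test name that also brings up the two REST API servers depends on your RonDB distribution, so the test name below is only a placeholder:

```bash
cd mysql-test
# --start brings up the configured servers and waits; replace ndb.ndb_basic
# with the test your RonDB distribution uses for the REST API benchmarks.
./mysql-test-run.pl --start ndb.ndb_basic
```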
This command will start the cluster. The benchmarks for the REST API are executed by Go programs found in storage/ndb/rest-server2/server/test_go. The following command will list the available test cases for the REST API.
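A minimal way to list them, assuming standard Go tooling and the test_go module layout mentioned above:

```bash
cd storage/ndb/rest-server2/server/test_go
# Lists the benchmarks matching the pattern without running them
go test -list 'Benchmark' ./internal/integrationtests/...
```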
Starting the cluster will also create two files in the var directory, rdrs.1.1.config.json and rdrs.2.1.config.json. We first need to stop these two REST API servers and reconfigure them. So first kill them using kill -9 as shown below. Next edit the configuration files and place them in a path known to you. Finally move to the bin directory of the RonDB binary distribution and run the two REST API servers with the proper configuration.
```bash
killall -9 rdrs2
./rdrs2 --config /path/rdrs.1.1.config.json
./rdrs2 --config /path/rdrs.2.1.config.json
```
Here is the configuration we used in the benchmarks below (rdrs.1.1.config.json):
```json
{
  "PIDFile": "/path/rdrs.1.1.pid",
  "REST": {
    "ServerIP": "0.0.0.0",
    "ServerPort": 13005,
    "NumThreads": 64
  },
  "RonDB": {
    "Mgmds": [ { "IP": "localhost", "Port": 13000 } ],
    "ConnectionPoolSize": 2
  },
  "Security": {
    "TLS": {
      "EnableTLS": false,
      "RequireAndVerifyClientCert": false
    },
    "APIKey": {
      "UseHopsworksAPIKeys": false
    }
  },
  "Rondis": {
    "Enable": true,
    "ServerPort": 6379,
    "NumThreads": 64,
    "NumDatabases": 2,
    "Databases": [
      { "Index": 0, "DirtyINCRDECR": true },
      { "Index": 1, "DirtyINCRDECR": true }
    ]
  },
  "Log": {
    "Level": "INFO"
  },
  "Testing": {
    "MySQL": {
      "Servers": [ { "IP": "localhost", "Port": 13001 } ],
      "User": "root",
      "Password": ""
    }
  }
}
```
The second REST API server uses ports 13006 and 6380 instead of 13005 and 6379.
Now before running the benchmarks you need to point to the configuration file of the REST API server with the environment variable RDRS_CONFIG_FILE.
Executing the following command will benchmark the pkread endpoint. The only benchmark on the pkread endpoint is the BenchmarkSimple test case.
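A plausible invocation, assuming standard Go tooling and the configuration file edited above (paths are examples):

```bash
cd storage/ndb/rest-server2/server/test_go
export RDRS_CONFIG_FILE=/path/rdrs.1.1.config.json
# -run '^$' skips the ordinary unit tests so that only the benchmark runs
go test -bench BenchmarkSimple -run '^$' ./internal/integrationtests/pkread/
```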
The following command will execute the 3 different benchmarks on the batch endpoint: BenchmarkSimple, BenchmarkManyColumns and BenchmarkBinary.
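Along the same lines, a run of all benchmarks in the batchpkread package could look like this (again a sketch using standard Go tooling):

```bash
# Runs every benchmark in the batchpkread package
go test -bench . -run '^$' ./internal/integrationtests/batchpkread/
```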
One can ensure that only one of those is executed using the following command.
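For example, to run only BenchmarkBinary one can anchor the benchmark pattern (an illustrative regular expression):

```bash
go test -bench '^BenchmarkBinary$' -run '^$' ./internal/integrationtests/batchpkread/
```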
The first time you run one of those benchmarks, the script will fill the database with a set of tables used in those benchmarks. This will take some time, and after that it will start running the benchmark and output benchmark results.
BenchmarkSimple runs a simple retrieval in a table with a single primary key column, reading a simple VARCHAR(100) column. Thus it tests the maximum throughput of the REST API server.
BenchmarkManyColumns runs a retrieval from a table with one primary key column and 100 small columns storing some string. This benchmark measures the effect of handling many columns in the REST API server.
BenchmarkBinary runs a retrieval from a table with a single primary key column using a single large column that stores around 3 kBytes of data. This benchmark shows how the REST API server handles large volumes of binary data.
All benchmarks report the batch size, the number of threads, the throughput, the median response time and the 99% response time.
Since the benchmark is running on a single machine or VM, it is simple to check the CPU resources used by the different processes. For the most part the focus of the benchmarking is around the efficiency of the REST API server process.
RonDB REST API Results#
The results provided here were obtained on a desktop with an Intel Raptor Lake CPU with 16 efficiency cores and 8 power cores. The memory size was 64 GB and the disks were fast enough not to affect the benchmarking.
The CPU is called 13th Gen Intel(R) Core(TM) i9-13900K.
No special attempt was made to lock processes or threads to specific CPUs. In a production setup this is something RonDB would normally do. Normally each RonDB process runs in its own virtual machine or in its own pod, where the pod mainly consists of the container running the RonDB process; only some init container is there as well. In this setup it is natural to also lock threads to CPUs for RonDB data nodes. No such attempts are made for REST API servers and MySQL Servers.
BenchmarkSimple with pkread endpoint#
This benchmark scales only on the number of threads used in the Go client. This can be changed by editing the file storage/ndb/rest-server2/server/test_go/internal/integrationtests/pkread/performance_test.go. Find GOMAXPROCS and change it from 24 to whatever number of threads you want to use.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkSimple | 23.654/s | 40 us | 59 us | 1 | 1 | 38% |
BenchmarkSimple | 38.230/s | 49 us | 81 us | 1 | 2 | 70% |
BenchmarkSimple | 56.663/s | 66 us | 170 us | 1 | 4 | 140% |
BenchmarkSimple | 90.713/s | 84 us | 198 us | 1 | 8 | 290% |
BenchmarkSimple | 141.985/s | 105 us | 234 us | 1 | 16 | 495% |
BenchmarkSimple | 199.978/s | 144 us | 408 us | 1 | 32 | 715% |
BenchmarkSimple | 236.875/s | 238 us | 724 us | 1 | 64 | 790% |
If one runs the same benchmark in a distributed setup there will be more latency, since the packets have to be transferred over a network, which usually adds around 50 microseconds per hop in a public cloud setting. So the numbers here show how much latency is due to the actual execution of the REST calls.
The CPU usage increases faster than throughput since the NDB API provides very good latency when the number of requests is very low. At the same time the NDB API becomes more efficient again when the load increases, thus at extreme loads there will be more throughput available.
BenchmarkSimple with batchpkread endpoint#
Also in this case it is possible to change the behaviour by editing the Go client. This time the file to edit is storage/ndb/rest-server2/server/test_go/internal/integrationtests/batchpkread/performance_test.go. Here there are two parameters that can be changed: as above, the number of client threads can be changed by setting GOMAXPROCS, and in addition the batch size can be set through the variable batchSize.
We will start by seeing how performance changes with a single thread while increasing the batch size.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkSimple | 22.595/s | 42 us | 62 us | 1 | 1 | 35% |
BenchmarkSimple | 36.949/s | 52 us | 78 us | 2 | 1 | 40% |
BenchmarkSimple | 63.319/s | 60 us | 90 us | 4 | 1 | 44% |
BenchmarkSimple | 87.474/s | 86 us | 174 us | 8 | 1 | 48% |
BenchmarkSimple | 124.857/s | 121 us | 252 us | 16 | 1 | 50% |
BenchmarkSimple | 162.073/s | 187 us | 349 us | 32 | 1 | 52% |
BenchmarkSimple | 185.027/s | 327 us | 548 us | 64 | 1 | 55% |
BenchmarkSimple | 203.803/s | 596 us | 872 us | 128 | 1 | 56% |
BenchmarkSimple | 211.251/s | 1164 us | 1496 us | 256 | 1 | 59% |
As can be seen in those numbers, we get almost the same results when the number of threads is replaced by an equivalent batch size, both in terms of throughput and latency. However, we can see from the CPU usage that sending batches consumes around 10x less CPU in the REST API server. Both the batch size and the number of threads reach a plateau around 100.
So with these results it is now interesting to move to testing with a combination of threads and batch sizes.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkSimple | 137.791/s | 106 us | 241 us | 4 | 4 | 142% |
BenchmarkSimple | 319.844/s | 187 us | 410 us | 16 | 4 | 185% |
BenchmarkSimple | 551.096/s | 434 us | 747 us | 64 | 4 | 206% |
BenchmarkSimple | 319.658/s | 190 us | 397 us | 8 | 8 | 350% |
BenchmarkSimple | 691.409/s | 349 us | 647 us | 32 | 8 | 372% |
BenchmarkSimple | 1.028.243/s | 947 us | 1520 us | 128 | 8 | 405% |
BenchmarkSimple | 693.094/s | 337 us | 750 us | 16 | 16 | 542% |
BenchmarkSimple | 1.379.632/s | 678 us | 1193 us | 64 | 16 | 710% |
BenchmarkSimple | 1.688.128/s | 2220 us | 5071 us | 256 | 16 | 725% |
BenchmarkSimple | 1.277.227/s | 730 us | 1684 us | 32 | 32 | 800% |
BenchmarkSimple | 1.640.773/s | 730 us | 1684 us | 64 | 32 | 800% |
BenchmarkSimple | 2.015.032/s | 1917 us | 3893 us | 128 | 32 | 800% |
BenchmarkSimple | 2.146.081/s | 3587 us | 6099 us | 256 | 32 | 1060% |
BenchmarkSimple | 1.799.511/s | 1984 us | 5419 us | 64 | 64 | 985% |
BenchmarkSimple | 2.072.006/s | 2358 us | 4442 us | 128 | 40 | 800% |
What we see from the above numbers is that batching provides the most efficient way of achieving high performance. However, to get to the full potential of the REST API server it is necessary to use multiple threads in combination with batching.
RonDB is heavily used in AI applications such as personalized recommendation systems. In these systems it is very common to perform batches of hundreds of key lookups with requirements on very low latency. As can be seen here, we can achieve latency well below one millisecond when responding to hundreds of key lookups.
The application can also choose to split the batches in several requests and execute them in parallel on either different REST API servers or on different threads in the same REST API server. Thus it is possible to optimise on both latency and throughput using RonDB REST API servers.
The next question is what benefits there are from using multiple REST API servers. It is fairly easy to extend the above tests to run 2 benchmark clients using different REST API servers. They will still use the same RonDB data nodes, but will not interact other than competing for the CPU resources since those benchmarks are executed on a single desktop machine.
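One way to set this up, sketched here under the assumption that the two REST API servers were started with the configurations described earlier, is simply to run two copies of the Go benchmark in parallel, each pointed at its own configuration file:

```bash
cd storage/ndb/rest-server2/server/test_go
# Each benchmark instance talks to a different REST API server (paths are examples)
RDRS_CONFIG_FILE=/path/rdrs.1.1.config.json \
  go test -bench BenchmarkSimple -run '^$' ./internal/integrationtests/batchpkread/ &
RDRS_CONFIG_FILE=/path/rdrs.2.1.config.json \
  go test -bench BenchmarkSimple -run '^$' ./internal/integrationtests/batchpkread/ &
wait
```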
In this table we report numbers from 2 REST API servers; we add up the CPU usage, so it is the total CPU usage and the total throughput that are provided.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkSimple | 40.667/s | 45 us | 80 us | 1 | 1 | 70% |
BenchmarkSimple | 218.801/s | 132 us | 280 us | 16 | 1 | 97% |
BenchmarkSimple | 399.650/s | 602 us | 850 us | 128 | 1 | 114% |
BenchmarkSimple | 498.013/s | 250 us | 500 us | 16 | 4 | 378% |
BenchmarkSimple | 1.086.516/s | 349 us | 647 us | 128 | 4 | 416% |
BenchmarkSimple | 1.946.243/s | 975 us | 1880 us | 64 | 16 | 1025% |
BenchmarkSimple | 2.250.000/s | 2400 us | 4800 us | 128 | 24 | 1168% |
What we see from these numbers is that one benefits more from adding REST API servers rather than adding threads. At the same time there is a limit on the number of slots for REST API servers in a cluster, so it makes sense to scale on threads as well as on number of servers.
The general rule we use at Hopsworks is to allocate 16 CPUs per REST API server. If more is needed one adds more REST API servers. It is possible to have quite a few REST API servers in a cluster. One can also go down to 8 CPUs or even 4 CPUs and still have good behaviour.
BenchmarkManyColumns with batchpkread endpoint#
This benchmark reads 100 small columns instead of a single column. This puts a lot more load on the REST API server since it has to parse a lot more in the request and also the RonDB data node will require more processing to fetch 100 columns instead of 1.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkManyColumns | 7.899/s | 120 us | 299 us | 1 | 1 | 41% |
BenchmarkManyColumns | 16.098/s | 472 us | 804 us | 8 | 1 | 41% |
BenchmarkManyColumns | 51.137/s | 597 us | 1053 us | 8 | 4 | 153% |
BenchmarkManyColumns | 121.893/s | 1006 us | 1665 us | 8 | 16 | 516% |
BenchmarkManyColumns | 150.031/s | 1652 us | 3191 us | 8 | 32 | 695% |
One problem with this benchmark is that the Go client consumes almost all of the CPU resources. Thus it is hard to scale the benchmarks any further than shown above. In the last line the RonDB data nodes used only 210% CPU resources whereas the Go client used more than 1400%.
BenchmarkBinary with batchpkread endpoint#
This benchmark is also quite heavy on the Go client, so again it is a bit harder to fully test the REST API server on a single machine.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkBinary | 12.280/s | 64 us | 160 us | 1 | 1 | 26% |
BenchmarkBinary | 24.008/s | 225 us | 530 us | 8 | 1 | 26% |
BenchmarkBinary | 74.830/s | 294 us | 563 us | 8 | 4 | 109% |
BenchmarkBinary | 111.947/s | 380 us | 748 us | 8 | 8 | 208% |
BenchmarkBinary | 170.986/s | 497 us | 982 us | 8 | 16 | 366% |
BenchmarkBinary | 201.251/s | 883 us | 2064 us | 8 | 32 | 482% |
BenchmarkBinary | 229.624/s | 1177 us | 2475 us | 12 | 32 | 495% |
The benchmark client seems not to scale well with larger batch sizes, but the picture looks as expected: the REST API server requires more time to handle many columns compared to handling large columns. However, the CPU usage is still very light and one can easily scale to millions of requests with both a large number of columns and/or large columns.
Rondis#
All the setup for the REST API is also prepared to run Rondis tests using the same REST API binary.
Rondis results#
The Rondis benchmarks will be executed using a simple tool called redis-benchmark.
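For example, a stock redis-benchmark run against the first Rondis server (port 6379, as configured above) could look like this; -t selects the commands, -n the total request count, -d the value size in bytes, and -c the number of parallel client connections, which roughly corresponds to the Threads column in the tables below:

```bash
redis-benchmark -p 6379 -t set,get -n 1000000 -c 16 -d 100
```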
Result from SET operation#
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
SET | 8.356/s | 119 us | 175 us | 1 | 1 | 15% |
SET | 13.670/s | 143 us | 199 us | 1 | 2 | 35% |
SET | 24.456/s | 159 us | 327 us | 1 | 4 | 68% |
SET | 47.994/s | 159 us | 271 us | 1 | 8 | 127% |
SET | 87.612/s | 175 us | 279 us | 1 | 16 | 196% |
SET | 104.196/s | 239 us | 391 us | 1 | 32 | 235% |
SET | 108.295/s | 399 us | 575 us | 1 | 64 | 256% |
In this benchmark we are limited by the RonDB data nodes, but also by the fact that the SET command only issues one write operation at a time.
Result from MSET operation#
To make this benchmark we did some minor changes to valkey-benchmark to make the batch size in the MSET test flexible. So, as with the REST API server, we will start by seeing the impact of increasing the batch size. We report the numbers as the number of rows changed per second rather than the number of commands finished as reported by the benchmark tool.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
MSET | 8.313/s | 119 us | 175 us | 1 | 1 | 16% |
MSET | 14.279/s | 135 us | 199 us | 2 | 1 | 20% |
MSET | 24.808/s | 159 us | 207 us | 4 | 1 | 28% |
MSET | 45.661/s | 175 us | 223 us | 8 | 1 | 34% |
MSET | 80.882/s | 199 us | 255 us | 16 | 1 | 40% |
MSET | 139.300/s | 231 us | 303 us | 32 | 1 | 45% |
MSET | 212.371/s | 303 us | 415 us | 64 | 1 | 55% |
MSET | 307.028/s | 415 us | 551 us | 128 | 1 | 61% |
MSET | 356.050/s | 695 us | 1031 us | 256 | 1 | 73% |
From these numbers we can conclude that it takes very little effort for Rondis using MSET with large batch sizes to effectively run enough requests to keep the RonDB data node busy. Latency numbers are also very good and very stable; there is a fairly small difference between the median latency and the 99% latency.
So let’s see what mixing batching and threading gives in the Rondis case.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
MSET | 88.574/s | 170 us | 359 us | 4 | 4 | 90% |
MSET | 260.036/s | 231 us | 487 us | 16 | 4 | 112% |
MSET | 571.327/s | 423 us | 767 us | 64 | 4 | 144% |
MSET | 782.205/s | 639 us | 967 us | 64 | 8 | 205% |
MSET | 837.148/s | 1199 us | 1567 us | 128 | 8 | 191% |
MSET | 927.760/s | 231 us | 303 us | 128 | 16 | 247% |
MSET | 883.369/s | 303 us | 415 us | 256 | 16 | 285% |
MSET | 970.638/s | 2439 us | 3615 us | 100 | 24 | 285% |
MSET | 1.005.025/s | 2823 us | 4223 us | 90 | 32 | 338% |
So even with write operations it is possible to get much better throughput and improved latency using batching, even pushing it past 1M key writes per second with a bit higher latency. As we saw previously, batching can improve throughput at similar latency by a factor of 10. Interestingly, this factor was the same 25 years ago. So over time, with so many hardware developments, the benefit of asynchronous programming over synchronous programming stays at a factor of 10x.
Result from GET operation#
We will now look at the GET operation, which is limited in performance since it can only handle one operation at a time. As can be seen from the numbers, the performance increases stepwise as the number of threads increases.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
GET | 32.957/s | 31 us | 39 us | 1 | 1 | 40% |
GET | 50.691/s | 39 us | 55 us | 1 | 2 | 79% |
GET | 83.787/s | 47 us | 95 us | 1 | 4 | 151% |
GET | 119.653/s | 63 us | 103 us | 1 | 8 | 247% |
GET | 117.589/s | 103 us | 159 us | 1 | 16 | 270% |
GET | 162.588/s | 119 us | 207 us | 1 | 32 | 356% |
GET | 153.941/s | 207 us | 375 us | 1 | 64 | 340% |
Result from MGET operation#
Now we turn to MGET operations. As usual, we start by seeing the impact of increasing the batch size.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
MGET | 34.249/s | 31 us | 39 us | 1 | 1 | 40% |
MGET | 54.607/s | 39 us | 55 us | 2 | 1 | 52% |
MGET | 121.902/s | 31 us | 47 us | 4 | 1 | 45% |
MGET | 180.438/s | 47 us | 63 us | 8 | 1 | 58% |
MGET | 362.154/s | 47 us | 63 us | 16 | 1 | 54% |
MGET | 557.055/s | 55 us | 79 us | 32 | 1 | 61% |
MGET | 728.307/s | 87 us | 119 us | 64 | 1 | 67% |
MGET | 827.194/s | 143 us | 199 us | 128 | 1 | 79% |
MGET | 870.185/s | 271 us | 375 us | 256 | 1 | 88% |
We can see that batching is significantly more efficient compared to threading.
Next we look at the combination of batching and threading.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
MGET | 1.703.713/s | 143 us | 287 us | 64 | 4 | 270% |
MGET | 3.124.480/s | 143 us | 271 us | 64 | 8 | 485% |
MGET | 4.524.375/s | 303 us | 511 us | 100 | 16 | 700% |
MGET | 4.825.091/s | 407 us | 655 us | 100 | 24 | 745% |
MGET | 4.880.429/s | 511 us | 807 us | 100 | 32 | 780% |
MGET | 5.086.117/s | 559 us | 879 us | 110 | 32 | 815% |
As usual we see a clear increase when threading and batching are combined. Obviously the key lookups are very small here, so this is a very synthetic benchmark, but 5M lookups per second on a single machine is still fairly impressive.
MySQL Results#
RonDB has a number of different interfaces. One of the prominent ones is the MySQL Server that provides an SQL interface towards RonDB. As we have seen in the benchmarks on the REST API and Rondis we have very good results when RonDB can combine threading and batching.
It is a lot more difficult to achieve batching in the SQL interface. Actually the NDB team spent 10 years figuring out ways to achieve the batching required to get good performance using SQL.
However SQL is more useful for queries that are more complex in nature. RonDB has its greatest performance strength in batching key lookups. However RonDB still has amazingly low latency even through the SQL interface. It also has a very decent ability to send very large data sets over the network. Actually, Sysbench OLTP RW sends 1 GByte of data from the RonDB data node to the MySQL Server in the results reported below.
Another benefit of RonDB is that it scales very well, thus one can simply add more MySQL Servers to achieve higher throughput. In this blog post we discuss the scalability of RonDB using up to 12 MySQL Servers. Similarly, in this blog post we discuss how RonDB behaves when running Sysbench.
There is even a blog post describing how one can do a bit of batching using SQL in Sysbench to achieve 5.5M key lookups per second.
There is a post on the RonDB website providing a detailed benchmark study using YCSB (Yahoo Cloud Serving Benchmark). This post is unique in the sense that it compares a number of different benchmarks in normal operation with how the system behaves when a node fails, and how the node recovery affects throughput and latency. In addition it compares the behaviour of in-memory tables with disk tables.
In order to understand Sysbench OLTP benchmarks one needs to understand what queries Sysbench uses. A traditional Sysbench test executes 20 SQL queries as a single transaction. There are 10 Point queries performing single key lookups, 4 write queries (1 Insert, 2 Updates and 1 Delete), 4 scan queries that each return 100 rows within a range, and finally a BEGIN and a COMMIT query.
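For reference, this is the standard Sysbench oltp_read_write statement mix, shown here with the default sbtest1 table name; it is the stock Sysbench workload, not something specific to RonDB:

```sql
BEGIN;
-- 10 point selects
SELECT c FROM sbtest1 WHERE id = ?;
-- 4 range queries, each over a range of 100 rows
SELECT c FROM sbtest1 WHERE id BETWEEN ? AND ?;
SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN ? AND ?;
SELECT c FROM sbtest1 WHERE id BETWEEN ? AND ? ORDER BY c;
SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN ? AND ? ORDER BY c;
-- 4 write queries
UPDATE sbtest1 SET k = k + 1 WHERE id = ?;
UPDATE sbtest1 SET c = ? WHERE id = ?;
DELETE FROM sbtest1 WHERE id = ?;
INSERT INTO sbtest1 (id, k, c, pad) VALUES (?, ?, ?, ?);
COMMIT;
```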
Thus Sysbench is putting the transaction system into action and not only handling single row interactions. However most of the CPU usage in Sysbench is spent on handling the 4 scan queries. These 4 queries return 100 rows each to the MySQL Server and this means around 50-60 kByte of data sent on the network. It also means that the scanning needs to be efficient in quickly moving from row to row. RonDB is slightly faster than InnoDB in scanning rows. Thus RonDB behaves very well also in Sysbench tests.
Sysbench OLTP RW Results#
Executing a Sysbench test is very simple. The following command will execute a Sysbench test on a single machine using the MTR framework. This test will run through the benchmark using a number of different thread counts.
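The MTR-integrated command is not reproduced here. As an illustration only, an equivalent direct invocation against the MySQL Server that MTR starts (port 13001, user root with empty password, as in the REST API configuration above) could look roughly like this with standard Sysbench 1.0 options:

```bash
# First create and load the table in RonDB, then run the OLTP RW test
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=13001 \
  --mysql-user=root --mysql-db=test --mysql-storage-engine=ndbcluster \
  --tables=1 --table-size=1000000 prepare
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=13001 \
  --mysql-user=root --mysql-db=test \
  --tables=1 --table-size=1000000 --threads=16 --time=60 run
```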
In our case we made it a little bit more complex by using CPU binding, with the following settings (CPUs 16-31 are the power-efficient ones and 0-15 are the performance CPUs):
```ini
[cluster_config.ndbd.1.crunch]
cpubind=16-20

[cluster_config.ndbd.2.crunch]
cpubind=21-24

[mysqld.1.crunch]
cpubind=0-7

[mysqld.2.crunch]
cpubind=8-15

[mysqltest]
cpubind=26-31
```
To run with this configuration one uses the following command from the mysql-test directory in the binary distribution.
We also made a minor change in the my.cnf file by adding a more aggressive SpinMethod and removing NumCPUs, since locking to CPUs automatically determines the number of CPUs and which CPUs to use.
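As a sketch of what that my.cnf change could look like (assuming DatabaseMachineSpinning is the more aggressive setting; check the SpinMethod values supported by your RonDB version):

```ini
[cluster_config.ndbd.1.crunch]
cpubind=16-20
SpinMethod=DatabaseMachineSpinning

[cluster_config.ndbd.2.crunch]
cpubind=21-24
SpinMethod=DatabaseMachineSpinning
```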
Ok so on to results (TPS = Transactions per second):
Benchmark | TPS | Average latency | 95% | Threads | CPU |
---|---|---|---|---|---|---|
OLTP RW | 833/s | 1210 us | 1300 us | 1 | 41% | |
OLTP RW | 1.300/s | 1540 us | 2030 us | 2 | 100% | |
OLTP RW | 2.434/s | 1640 us | 2220 us | 4 | 190% | |
OLTP RW | 4.142/s | 1930 us | 2570 us | 8 | 350% | |
OLTP RW | 6.605/s | 2420 us | 3070 us | 16 | 620% | |
OLTP RW | 9678/s | 3330 us | 4100 us | 32 | 920% | |
OLTP RW | 12.267/s | 5210 us | 6670 us | 64 | 1190% | |
OLTP RW | 13.041/s | 9810 us | 14460 us | 128 | 1370% | |
OLTP RW | 11.995/s | 21300 us | 41850 us | 256 | 1380% |
So from these numbers one can see that the latency is very low, less than 65 microseconds per SQL query, and remember that some queries handle 100 rows and some of them are write queries. So latency is very low.
One can also see that the MySQL Server uses a fair amount of CPU, so for this type of applications one needs a considerable amount of CPUs on the MySQL Server and on the application side.
Sysbench OLTP Write Only Results#
In this benchmark we only kept the write queries and BEGIN and COMMIT. One can see that the base latency of write queries is below 100 microseconds even when performing a distributed two-phase commit transaction.
Benchmark | TPS | Average latency | 95% | Threads | CPU |
---|---|---|---|---|---|---|
OLTP W | 2.579/s | 390 us | 430 us | 1 | 20% | |
OLTP W | 4.304/s | 460 us | 570 us | 2 | 58% | |
OLTP W | 7.642/s | 520 us | 680 us | 4 | 105% | |
OLTP W | 12.717/s | 630 us | 840 us | 8 | 200% | |
OLTP W | 20.753/s | 770 us | 1040 us | 16 | 350% | |
OLTP W | 33.372/s | 960 us | 1270 us | 32 | 543% | |
OLTP W | 46.613/s | 1370 us | 2030 us | 64 | 725% | |
OLTP W | 52.691/s | 2430 us | 4820 us | 128 | 840% | |
OLTP W | 55.073/s | 4640 us | 11240 us | 256 | 960% |
From the numbers we can see that in this test the RonDB data node became the bottleneck, so one could easily increase the number of CPUs used here to get higher numbers.
Sysbench OLTP Read Only Results#
In this benchmark we remove the write queries and execute only the read queries. We note that the majority of the latency comes from these queries and we also note that they use considerably more CPU in the MySQL Servers. This is mainly due to the scan queries; the 10 Point queries make up a very small part of this load.
Benchmark | TPS | Average latency | 95% | Threads | CPU |
---|---|---|---|---|---|---|
OLTP R | 1.251/s | 800 us | 840 us | 1 | 51% | |
OLTP R | 1.882/s | 1060 us | 1470 us | 2 | 118% | |
OLTP R | 3.540/s | 1130 us | 1610 us | 4 | 230% | |
OLTP R | 6.110/s | 1310 us | 1760 us | 8 | 430% | |
OLTP R | 9.848/s | 1620 us | 2030 us | 16 | 735% | |
OLTP R | 14.429/s | 2220 us | 2710 us | 32 | 1120% | |
OLTP R | 16.948/s | 3770 us | 4910 us | 64 | 1350% | |
OLTP R | 17.181/s | 7450 us | 11040 us | 128 | 1450% | |
OLTP R | 15.639/s | 16360 us | 31370 us | 256 | 1500% |
Sysbench OLTP Point Select Results#
This benchmark only runs single key lookup using SQL. Thus we can compare the cost of issuing those through MySQL with those using RonDB REST API server.
Benchmark | TPS | Average latency | 95% | Threads | CPU |
---|---|---|---|---|---|---|
OLTP P | 25.568/s | 40 us | 40 us | 1 | 43% | |
OLTP P | 40.913/s | 50 us | 70 us | 2 | 115% | |
OLTP P | 76.666/s | 50 us | 70 us | 4 | 220% | |
OLTP P | 137.118/s | 60 us | 90 us | 8 | 420% | |
OLTP P | 238.947/s | 70 us | 100 us | 16 | 740% | |
OLTP P | 397.695/s | 80 us | 120 us | 32 | 1160% | |
OLTP P | 505.836/s | 130 us | 230 us | 64 | 1360% | |
OLTP P | 565.536/s | 230 us | 440 us | 128 | 1530% | |
OLTP P | 563.731/s | 450 us | 840 us | 256 | 1730% |
As the numbers show, the MySQL Server scales very nicely all the way up to 400k Point queries per second. After that the MySQL Server becomes overloaded and latency grows.
Compared to the REST API server the MySQL Server consumes a bit more CPU, while the latency is similar. So for key lookups the REST API server is preferable, and even more so when considering batches of key lookups as often happens in AI applications.
We can also see that the 10 Point queries account for only about a third of the Read Only load in the MySQL Server, compared to the 4 scan queries.
RonSQL#
RonSQL supports pushdown aggregation of queries on a single table. The reason for developing this support is that AI inferencing is often built on features that constitute aggregates of many events. These aggregates are often calculated by offline systems that go through historical data.
However real-time AI inferencing requires the features to be calculated on the current events as well as historical events. Thus RonDB supports storing event data in time windows, using RonSQL to calculate aggregates based on data in those tables and getting rid of this data when it is too old using the TTL feature.
To benchmark this we used a very simple benchmark consisting of one query in TPC-H. Q1 in TPC-H does a scan of all rows in the lineitem table. There are about 6M rows per scale factor in this table. Using the MySQL Server, this query requires most of this table to be sent up to the MySQL Server to calculate the aggregates there. This takes about 1 second with scale factor 1.
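For reference, TPC-H Q1 is the following single-table aggregation over lineitem (standard formulation; the 90-day substitution parameter may vary):

```sql
SELECT l_returnflag, l_linestatus,
       SUM(l_quantity)                                       AS sum_qty,
       SUM(l_extendedprice)                                  AS sum_base_price,
       SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
       SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       AVG(l_quantity)                                       AS avg_qty,
       AVG(l_extendedprice)                                  AS avg_price,
       AVG(l_discount)                                       AS avg_disc,
       COUNT(*)                                              AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL 90 DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
```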
With RonSQL the calculation happens where the data resides, and it happens on all partitions in parallel. This means that in a standard setup with 6 partitions we can handle this query 10x faster, in 100 ms. By doubling the number of partitions we can improve this to a 20x speedup, and the query executes in 50 ms instead.