
RonDB Benchmarks#

This chapter reports some benchmark numbers for RonDB. The numbers are mostly from simple benchmarks executed on one of our development machines. This means they are very simple to reproduce, but obviously they do not show the scalability of RonDB. For that we will reference some larger benchmark efforts.

Benchmarks of RonDB focus on showing the performance of the following data services provided by RonDB.

  1. RonDB REST API

  2. RonDB Feature Store REST API

  3. Rondis

  4. MySQL

  5. RonSQL

RonDB REST API#

The RonDB REST API has two endpoints, the pkread endpoint and the batch endpoint.

The pkread endpoint issues a read using a primary key lookup on a table, given a set of columns to read.

To execute this benchmark one can simply start a cluster using MTR. The simplest manner is to install the binary tarball where you want to run it. Next go into the mysql-test directory and start a cluster. Before starting you likely want to change the contents of mysql-test/include/default_ndbd.cnf. You most likely want to set AutomaticMemoryConfig and AutomaticThreadConfig to 1, and you should set NumCPUs to the number of CPUs you want each of the two RonDB data nodes to use. Next you will want to remove the line about ThreadConfig. It is also a good idea to increase the following parameters: DataMemory, SharedGlobalMemory, RedoBuffer, FragmentLogFileSize and LongMessageBuffer. One should also add the parameter TotalMemoryConfig.

The reason you need to change so many parameters is that the default settings target functional test cases, which mostly execute very small numbers of concurrent transactions and schema changes. A benchmark requires more memory and CPU resources. In my case I usually use the following settings (there is nothing scientific about these settings, they are simply good enough); a sketch of applying them follows the list:

  1. AutomaticMemoryConfig=1

  2. AutomaticThreadConfig=1

  3. NumCPUs=6

  4. DataMemory=2500M

  5. TotalMemoryConfig=5G

  6. FragmentLogFileSize=1G

  7. SharedGlobalMemory=200M

  8. RedoBuffer=48M

  9. LongMessageBuffer=80M

  10. SpinMethod=StaticSpinning
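
A minimal sketch of applying these settings, executed from the mysql-test directory. It assumes the cluster configuration section is the last section of default_ndbd.cnf; otherwise place the appended lines in that section by hand.

# Drop the existing ThreadConfig line, then append the benchmark settings.
sed -i '/^ThreadConfig/d' include/default_ndbd.cnf
cat >> include/default_ndbd.cnf << 'EOF'
AutomaticMemoryConfig=1
AutomaticThreadConfig=1
NumCPUs=6
DataMemory=2500M
TotalMemoryConfig=5G
FragmentLogFileSize=1G
SharedGlobalMemory=200M
RedoBuffer=48M
LongMessageBuffer=80M
SpinMethod=StaticSpinning
EOF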

In a default setup one would use SpinMethod=LatencyOptimisedSpinning. However, using StaticSpinning means that we can rely on top to see how much CPU the RonDB data nodes actually use. Spinning improves throughput in all cases, but it inflates the reported CPU usage, sometimes far beyond the CPU actually used for real work.

These numbers assume you have at least 16 CPUs and 16 GB of memory in your machine. With fewer or more CPUs you can adjust NumCPUs appropriately, and with more memory you can increase DataMemory and TotalMemoryConfig.

After updating the configuration file as directed above, start the RonDB cluster with the following command, executed in the mysql-test directory of the binary distribution.

./mtr --suite=rdrs2-golang --start-and-exit

This command will start the cluster. The benchmarks for the REST API are executed by Go programs found in storage/ndb/rest-server2/server/test_go; the commands to run them are shown further below.

Starting the cluster also creates two configuration files in the var directory, rdrs.1.1.config.json and rdrs.2.1.config.json, and starts two REST API servers using them. We first need to stop these two REST API servers and reconfigure them. So first kill them using kill -9 as shown below. Next edit the configurations and place them in a path known to you. Finally move to the bin directory of the RonDB binary distribution and run the two REST API servers with the proper configurations.

killall -9 rdrs2
./rdrs2 --config /path/rdrs.1.1.config.json
./rdrs2 --config /path/rdrs.2.1.config.json

Here is the configuration we used in the benchmarks below (rdrs.1.1.config.json):

{
  "PIDFile": "/path/rdrs.1.1.pid",
  "REST": {
    "ServerIP": "0.0.0.0",
    "ServerPort": 13005,
    "NumThreads": 64
  },
  "RonDB": {
      "Mgmds": [ { "IP": "localhost",
                   "Port": 13000 } ],
      "ConnectionPoolSize": 2
  },
  "Security": {
    "TLS": {
      "EnableTLS": false,
      "RequireAndVerifyClientCert": false,
    },
    "APIKey": {
      "UseHopsworksAPIKeys": false
    }
  },
  "Rondis": {
    "Enable": true,
    "ServerPort": 6379,
    "NumThreads": 64,
    "NumDatabases": 2,
    "Databases":
      [
        {
          "Index": 0,
          "DirtyINCRDECR": true
        },
        {
          "Index": 1,
          "DirtyINCRDECR": true
        }
      ]
  },
  "Log": {
    "Level": "INFO"
  },
  "Testing": {
    "MySQL": {
        "Servers": [ { "IP": "localhost",
                       "Port": 13001 } ],
      "User": "root",
      "Password": ""
    }
  }
}

The second REST API server uses ports 13006 and 6380 instead of 13005 and 6379.

Before running the benchmarks you need to point to the configuration file of the REST API server through the environment variable RDRS_CONFIG_FILE, as shown below.
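
A minimal sketch, assuming the configuration file was placed at the path used above:

export RDRS_CONFIG_FILE=/path/rdrs.1.1.config.json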

So executing the following command will benchmark the pkread endpoint.

./script.sh bench hopsworks.ai/rdrs2/internal/integrationtests/pkread

The only benchmark on the pkread endpoint is the BenchmarkSimple test case. The batch endpoint is benchmarked with the following command.

./script.sh bench hopsworks.ai/rdrs2/internal/integrationtests/batchpkread

This will execute the three different benchmarks on the batch endpoint: BenchmarkSimple, BenchmarkManyColumns and BenchmarkBinary.

One can ensure that only one of those is executed by using the following command.

./script.sh bench hopsworks.ai/rdrs2/internal/integrationtests/batchpkread BenchmarkSimple

The first time you run one of those benchmarks, the script will fill the database with the set of tables used in the benchmarks. This takes some time, after which it starts running the benchmark and outputs the benchmark results.

BenchmarkSimple runs a simple retrieval on a table with a single primary key column, reading a single VARCHAR(100) column. Thus it tests the maximum throughput of the REST API server.

BenchmarkManyColumns runs a retrieval from a table with one primary key column and 100 small columns, each storing a short string. It measures the effect of handling many columns in the REST API server.

BenchmarkBinary runs a retrieval from a table with a single primary key column using a single large column that stores around 3 kBytes of data. This benchmark shows how the REST API server handles large volumes of binary data.

All benchmarks report the batch size, the number of threads, the throughput, the median response time and the 99% response time.

Since the benchmark is running on a single machine or VM, it is simple to check the CPU resources used by the different processes. The focus of the benchmarking is mostly on the efficiency of the REST API server process.

RonDB REST API Results#

The results provided here were produced on a desktop with an Intel Raptor Lake CPU with 16 efficiency cores and 8 performance cores. The memory size was 64 GB and the disks were fast enough to not affect the benchmarking.

The CPU is called 13th Gen Intel(R) Core(TM) i9-13900K.

No special attempt was made to lock processes or threads to specific CPUs. In a production setup this is something RonDB would normally do. Normally each RonDB process runs in its own virtual machine or in its own pod, where the pod mainly consists of the container running the RonDB process, apart from some init container. In such a setup it is natural to also lock threads to CPUs for the RonDB data nodes. No such attempts are made for REST API servers and MySQL Servers.

BenchmarkSimple with pkread endpoint#

This benchmark scales only with the number of threads used in the Go client. This can be changed by editing the file storage/ndb/rest-server2/server/test_go/internal/integrationtests/pkread/performance_test.go. Find the GOMAXPROCS setting and change it from 24 to whatever number of threads you want to use, as sketched below.
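
A quick way to locate the setting, run from storage/ndb/rest-server2/server/test_go (a sketch; the exact layout of the file may differ):

grep -n GOMAXPROCS internal/integrationtests/pkread/performance_test.go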

Benchmark Throughput Median latency 99% Batch Threads CPU
BenchmarkSimple 23.654/s 40 us 59 us 1 1 38%
BenchmarkSimple 38.230/s 49 us 81 us 1 2 70%
BenchmarkSimple 56.663/s 66 us 170 us 1 4 140%
BenchmarkSimple 90.713/s 84 us 198 us 1 8 290%
BenchmarkSimple 141.985/s 105 us 234 us 1 16 495%
BenchmarkSimple 199.978/s 144 us 408 us 1 32 715%
BenchmarkSimple 236.875/s 238 us 724 us 1 64 790%

If one runs the same benchmark in a distributed setup there will be more latency, since each packet has to be transferred over a network, which usually adds around 50 microseconds per hop in a public cloud setting. So the numbers here show how much of the latency is due to the actual execution of the REST calls.

The CPU usage increases faster than throughput since the NDB API provides very good latency when the number of requests is very low. At the same time the NDB API becomes more efficient again when the load increases, thus at extreme loads there will be more throughput available.

BenchmarkSimple with batchpkread endpoint#

Also in this case it is possible to change the behaviour by editing the Go client. This time the file to edit is storage/ndb/rest-server2/server/test_go/internal/integrationtests/batchpkread/performance_test.go.

Here, though, there are two parameters that can be changed. As above, the number of client threads is changed by setting GOMAXPROCS. In addition, we can set the batch size through the variable batchSize. Both can be located as sketched below.
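
A sketch for locating both knobs, run from storage/ndb/rest-server2/server/test_go:

grep -n -e GOMAXPROCS -e batchSize internal/integrationtests/batchpkread/performance_test.go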

We will start by seeing how performance changes with a single thread and an increasing batch size.

Benchmark Throughput Median latency 99% Batch Threads CPU
BenchmarkSimple 22.595/s 42 us 62 us 1 1 35%
BenchmarkSimple 36.949/s 52 us 78 us 2 1 40%
BenchmarkSimple 63.319/s 60 us 90 us 4 1 44%
BenchmarkSimple 87.474/s 86 us 174 us 8 1 48%
BenchmarkSimple 124.857/s 121 us 252 us 16 1 50%
BenchmarkSimple 162.073/s 187 us 349 us 32 1 52%
BenchmarkSimple 185.027/s 327 us 548 us 64 1 55%
BenchmarkSimple 203.803/s 596 us 872 us 128 1 56%
BenchmarkSimple 211.251/s 1164 us 1496 us 256 1 59%

As can be seen from those numbers, we get almost the same results with the batch size replacing the number of threads, both in terms of throughput and latency. However, the CPU usage shows that sending batches consumes about 10x less CPU in the REST API server. Both the batch size and the number of threads reach a plateau around 100.

So with these results it is now interesting to move to testing with a combination of threads and batch sizes.

Benchmark Throughput Median latency 99% Batch Threads CPU
BenchmarkSimple 137.791/s 106 us 241 us 4 4 142%
BenchmarkSimple 319.844/s 187 us 410 us 16 4 185%
BenchmarkSimple 551.096/s 434 us 747 us 64 4 206%
BenchmarkSimple 319.658/s 190 us 397 us 8 8 350%
BenchmarkSimple 691.409/s 349 us 647 us 32 8 372%
BenchmarkSimple 1.028.243/s 947 us 1520 us 128 8 405%
BenchmarkSimple 693.094/s 337 us 750 us 16 16 542%
BenchmarkSimple 1.379.632/s 678 us 1193 us 64 16 710%
BenchmarkSimple 1.688.128/s 2220 us 5071 us 256 16 725%
BenchmarkSimple 1.277.227/s 730 us 1684 us 32 32 800%
BenchmarkSimple 1.640.773/s 730 us 1684 us 64 32 800%
BenchmarkSimple 2.015.032/s 1917 us 3893 us 128 32 800%
BenchmarkSimple 2.146.081/s 3587 us 6099 us 256 32 1060%
BenchmarkSimple 1.799.511/s 1984 us 5419 us 64 64 985%
BenchmarkSimple 2.072.006/s 2358 us 4442 us 128 40 800%

What we see from the above numbers is that batching provides the most efficient way of achieving high performance. However, to reach the full potential of the REST API server it is necessary to use multiple threads in combination with batching.

RonDB is heavily used in AI applications such as personalized recommendation systems. In these systems it is very common to perform batches of hundreds of key lookups with requirements on very low latency. As can be seen here, we can achieve latency well below one millisecond while responding to hundreds of key lookups.

The application can also choose to split the batches into several requests and execute them in parallel on either different REST API servers or on different threads in the same REST API server. Thus it is possible to optimise for both latency and throughput using RonDB REST API servers.

The next question is what benefits there are from using multiple REST API servers. It is fairly easy to extend the above tests to run 2 benchmark clients against different REST API servers, as sketched below. They will still use the same RonDB data nodes, but will not interact other than by competing for the CPU resources, since those benchmarks are executed on a single desktop machine.
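
A minimal sketch of driving both servers at once, assuming the two configuration files from above and that the two clients can share the test tables:

RDRS_CONFIG_FILE=/path/rdrs.1.1.config.json ./script.sh bench hopsworks.ai/rdrs2/internal/integrationtests/batchpkread BenchmarkSimple &
RDRS_CONFIG_FILE=/path/rdrs.2.1.config.json ./script.sh bench hopsworks.ai/rdrs2/internal/integrationtests/batchpkread BenchmarkSimple &
wait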

In this table we report numbers from 2 REST API servers; we add up the CPU usage, so it is the total CPU usage and total throughput that are provided.

Benchmark Throughput Median latency 99% Batch Threads CPU
BenchmarkSimple 40.667/s 45 us 80 us 1 1 70%
BenchmarkSimple 218.801/s 132 us 280 us 16 1 97%
BenchmarkSimple 399.650/s 602 us 850 us 128 1 114%
BenchmarkSimple 498.013/s 250 us 500 us 16 4 378%
BenchmarkSimple 1.086.516/s 349 us 647 us 128 4 416%
BenchmarkSimple 1.946.243/s 975 us 1880 us 64 16 1025%
BenchmarkSimple 2.250.000/s 2400 us 4800 us 128 24 1168%

What we see from these numbers is that one benefits more from adding REST API servers than from adding threads. At the same time there is a limit on the number of slots for REST API servers in a cluster, so it makes sense to scale on threads as well as on the number of servers.

The general rule we use at Hopsworks is to allocate 16 CPUs per REST API server. If more is needed one adds more REST API servers. It is possible to have quite a few REST API servers in a cluster. One can also go down to 8 CPUs or even 4 CPUs and still get good behaviour.

BenchmarkManyColumns with batchpkread endpoint#

This benchmark reads 100 small columns instead of a single column. This puts a lot more load on the REST API server since it has to parse a much larger request, and the RonDB data node also requires more processing to fetch 100 columns instead of 1.

Benchmark Throughput Median latency 99% Batch Threads CPU
BenchmarkManyColumns 7.899/s 120 us 299 us 1 1 41%
BenchmarkManyColumns 16.098/s 472 us 804 us 8 1 41%
BenchmarkManyColumns 51.137/s 597 us 1053 us 8 4 153%
BenchmarkManyColumns 121.893/s 1006 us 1665 us 8 16 516%
BenchmarkManyColumns 150.031/s 1652 us 3191 us 8 32 695%

One problem with this benchmark is that the Go client consumes almost all of the CPU resources. Thus it is hard to scale the benchmark any further than shown above. In the last line the RonDB data nodes used only 210% CPU whereas the Go client used more than 1400%.

BenchmarkBinary with batchpkread endpoint#

This benchmark is also quite heavy on the Go client, so again it is a bit harder to fully test the REST API server on a single machine.

Benchmark Throughput Median latency 99% Batch Threads CPU
BenchmarkBinary 12.280/s 64 us 160 us 1 1 26%
BenchmarkBinary 24.008/s 225 us 530 us 8 1 26%
BenchmarkBinary 74.830/s 294 us 563 us 8 4 109%
BenchmarkBinary 111.947/s 380 us 748 us 8 8 208%
BenchmarkBinary 170.986/s 497 us 982 us 8 16 366%
BenchmarkBinary 201.251/s 883 us 2064 us 8 32 482%
BenchmarkBinary 229.624/s 1177 us 2475 us 12 32 495%

The benchmark client seems to not scale well with larger batch sizes, but the picture looks as expected: the REST API server requires more time to handle many columns than to handle large columns. However, the CPU usage is still very light and one can easily scale to millions of requests with both a large number of columns and/or large columns.

Rondis#

All the setup done for the REST API also prepares for running Rondis tests, using the same REST API binary.

Rondis results#

The Rondis benchmarks are executed using a simple tool called redis-benchmark, for example as sketched below.
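
One plausible invocation for the single-threaded SET numbers below (a sketch; the request count and flags are assumptions, not the exact command lines used):

redis-benchmark -p 6379 -t set -n 100000 -c 1 --threads 1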

Result from SET operation#

Benchmark Throughput Median latency 99% Batch Threads CPU
SET 8.356/s 119 us 175 us 1 1 15%
SET 13.670/s 143 us 199 us 1 2 35%
SET 24.456/s 159 us 327 us 1 4 68%
SET 47.994/s 159 us 271 us 1 8 127%
SET 87.612/s 175 us 279 us 1 16 196%
SET 104.196/s 239 us 391 us 1 32 235%
SET 108.295/s 399 us 575 us 1 64 256%

In this benchmark we are limited by the RonDB data nodes, but also by the fact that the SET command only issues one write operation at a time.

Result from MSET operation#

To enable this benchmark we made some minor changes to valkey-benchmark to make the batch size in the MSET test flexible. As with the REST API server, we start by looking at the impact of increasing the batch size.

We report the numbers as the number of rows changed per second rather than the number of commands completed as reported by the benchmark tool. A sketch of such a run follows below.
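
In this sketch, --mset-batch is a hypothetical name for the knob added by our modification; it is not a flag in stock valkey-benchmark:

valkey-benchmark -p 6379 -t mset -n 1000000 -c 1 --threads 1 --mset-batch 16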

Benchmark Throughput Median latency 99% Batch Threads CPU
MSET 8.313/s 119 us 175 us 1 1 16%
MSET 14.279/s 135 us 199 us 2 1 20%
MSET 24.808/s 159 us 207 us 4 1 28%
MSET 45.661/s 175 us 223 us 8 1 34%
MSET 80.882/s 199 us 255 us 16 1 40%
MSET 139.300/s 231 us 303 us 32 1 45%
MSET 212.371/s 303 us 415 us 64 1 55%
MSET 307.028/s 415 us 551 us 128 1 61%
MSET 356.050/s 695 us 1031 us 256 1 73%

From these numbers we can conclude that it takes very little effort for Rondis, using MSET with large batch sizes, to issue enough requests to keep the RonDB data node busy. Latency numbers are also very good and very stable; there is a fairly small difference between the median latency and the 99% latency.

So let’s see what mixing batching and threading gives in the Rondis case.

Benchmark Throughput Median latency 99% Batch Threads CPU
MSET 88.574/s 170 us 359 us 4 4 90%
MSET 260.036/s 231 us 487 us 16 4 112%
MSET 571.327/s 423 us 767 us 64 4 144%
MSET 782.205/s 639 us 967 us 64 8 205%
MSET 837.148/s 1199 us 1567 us 128 8 191%
MSET 927.760/s 231 us 303 us 128 16 247%
MSET 883.369/s 303 us 415 us 256 16 285%
MSET 970.638/s 2439 us 3615 us 100 24 285%
MSET 1.005.025/s 2823 us 4223 us 90 32 338%

So even with write operations it is possible to get much better throughput and improved latency using batching, even pushing past 1M key writes per second at somewhat higher latency. As we saw previously, batching can improve throughput at similar latency by a factor of 10. Interestingly, this factor was the same 25 years ago. So over time, through many HW developments, the benefit of asynchronous programming over synchronous programming stays at a factor of 10x.

Result from GET operation#

We now look at the GET operation, which is limited in performance since it can only handle one operation at a time. As can be seen from the numbers, the performance increases stepwise as the number of threads increases.

Benchmark Throughput Median latency 99% Batch Threads CPU
GET 32.957/s 31 us 39 us 1 1 40%
GET 50.691/s 39 us 55 us 1 2 79%
GET 83.787/s 47 us 95 us 1 4 151%
GET 119.653/s 63 us 103 us 1 8 247%
GET 117.589/s 103 us 159 us 1 16 270%
GET 162.588/s 119 us 207 us 1 32 356%
GET 153.941/s 207 us 375 us 1 64 340%

Result from MGET operation#

Now we turn to MGET operations. As usual we start by looking at the impact of increasing the batch size.

Benchmark Throughput Median latency 99% Batch Threads CPU
MGET 34.249/s 31 us 39 us 1 1 40%
MGET 54.607/s 39 us 55 us 2 1 52%
MGET 121.902/s 31 us 47 us 4 1 45%
MGET 180.438/s 47 us 63 us 8 1 58%
MGET 362.154/s 47 us 63 us 16 1 54%
MGET 557.055/s 55 us 79 us 32 1 61%
MGET 728.307/s 87 us 119 us 64 1 67%
MGET 827.194/s 143 us 199 us 128 1 79%
MGET 870.185/s 271 us 375 us 256 1 88%

We can see that batching is significantly more efficient compared to threading.

Next we look at the combination of batching and threading.

Benchmark Throughput Median latency 99% Batch Threads CPU
MGET 1.703.713/s 143 us 287 us 64 4 270%
MGET 3.124.480/s 143 us 271 us 64 8 485%
MGET 4.524.375/s 303 us 511 us 100 16 700%
MGET 4.825.091/s 407 us 655 us 100 24 745%
MGET 4.880.429/s 511 us 807 us 100 32 780%
MGET 5.086.117/s 559 us 879 us 110 32 815%

As usual we see a clear increase when threading and batching are combined. Obviously the key lookups are very small here, so this is a very synthetic benchmark, but 5M lookups per second on a single machine is still fairly impressive.

MySQL Results#

RonDB has a number of different interfaces. One of the prominent ones is the MySQL Server, which provides an SQL interface towards RonDB. As we have seen in the benchmarks on the REST API and Rondis, we get very good results when RonDB can combine threading and batching.

It is a lot more difficult to achieve batching in the SQL interface. Actually the NDB team spent 10 years figuring out ways to achieve the batching required to get good performance using SQL.

However, SQL is more useful for queries that are more complex in nature. RonDB has its greatest performance strength in batched key lookups, yet it still has amazingly low latency even through the SQL interface. It also has a very decent ability to send very large data sets over the network. In fact, Sysbench OLTP RW sends 1 GByte of data from the RonDB data node to the MySQL Server in the results reported below.

Another benefit of RonDB is that it scales very well; one can simply add more MySQL Servers to achieve higher throughput. In this blog post we discuss the scalability of RonDB using up to 12 MySQL Servers. Similarly, in this blog post we discuss how RonDB behaves when running Sysbench.

There is even a blog post describing how one can do a bit of batching using SQL in Sysbench to achieve 5.5M key lookups per second.

There is a post on the RonDB website providing a detailed benchmark study using YCSB (Yahoo Cloud Serving Benchmark). This post is unique in the sense that it runs a number of different benchmarks in normal operation and compares this to the behaviour when a node fails, showing how node recovery affects throughput and latency. In addition it compares the behaviour of in-memory tables with disk tables.

In order to understand Sysbench OLTP benchmarks one needs to understand the queries that Sysbench uses. A traditional Sysbench test executes 20 SQL queries as a single transaction: 10 point queries performing single key lookups, 4 write queries (1 Insert, 2 Updates and 1 Delete), 4 scan queries that each return 100 rows within a range, and finally a BEGIN and a COMMIT query. A sketch of such a transaction follows below.
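
A sketch of one such transaction, following the standard sysbench oltp_read_write query mix. The literals vary per run, the sbtest database name is the sysbench default and an assumption here, and the port and user are the ones used earlier in this chapter:

mysql -h 127.0.0.1 -P 13001 -u root sbtest << 'SQL'
BEGIN;
SELECT c FROM sbtest1 WHERE id = 5021;  -- 10 point queries with random ids
SELECT c FROM sbtest1 WHERE id BETWEEN 5001 AND 5100;
SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 5001 AND 5100;
SELECT c FROM sbtest1 WHERE id BETWEEN 5001 AND 5100 ORDER BY c;
SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 5001 AND 5100;
UPDATE sbtest1 SET k = k + 1 WHERE id = 5021;        -- index update
UPDATE sbtest1 SET c = 'new-value' WHERE id = 5021;  -- non-index update
DELETE FROM sbtest1 WHERE id = 5021;
INSERT INTO sbtest1 (id, k, c, pad) VALUES (5021, 5021, 'c-value', 'pad-value');
COMMIT;
SQL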

Thus Sysbench exercises the transaction system and not only single row interactions. However, most of the CPU usage in Sysbench is spent on handling the 4 scan queries. These 4 queries return 100 rows each to the MySQL Server, which means around 50-60 kBytes of data sent over the network. It also means that the scanning needs to be efficient in quickly moving from row to row. RonDB is slightly faster than InnoDB at scanning rows. Thus RonDB behaves very well also in Sysbench tests.

Sysbench OLTP RW Results#

Executing a Sysbench test is very simple. The following command will execute Sysbench on a single machine using the MTR framework. The test runs through a series of different thread counts.

./mtr --suite=ndbcrunch sb_oltp_read_write

In our case we made it a little bit more complex by using CPU binding. We used the following settings (CPUs 16-31 are the power-efficient ones and 0-15 are the performance CPUs).

[cluster_config.ndbd.1.crunch]
cpubind=16-20

[cluster_config.ndbd.2.crunch]
cpubind=21-24

[mysqld.1.crunch]
cpubind=0-7

[mysqld.2.crunch]
cpubind=8-15

[mysqltest]
cpubind=26-31

To run with this, one uses the following command from the mysql-test directory in the binary distribution.

./mtr --suite=ndbcrunch --defaults-extra-file=suite/ndbcrunch/cpubind.cnf sb_oltp_read_write

We also made a minor change in the my.cnf file by adding a more aggressive SpinMethod, and removed NumCPUs since locking to CPUs automates both how many CPUs to use and which ones.

SpinMethod=DatabaseMachineSpinning
#NumCPUs=6

OK, so on to the results (TPS = transactions per second):

Benchmark TPS Average latency 95% Threads CPU
OLTP RW 833/s 1210 us 1300 us 1 41%
OLTP RW 1.300/s 1540 us 2030 us 2 100%
OLTP RW 2.434/s 1640 us 2220 us 4 190%
OLTP RW 4.142/s 1930 us 2570 us 8 350%
OLTP RW 6.605/s 2420 us 3070 us 16 620%
OLTP RW 9.678/s 3330 us 4100 us 32 920%
OLTP RW 12.267/s 5210 us 6670 us 64 1190%
OLTP RW 13.041/s 9810 us 14460 us 128 1370%
OLTP RW 11.995/s 21300 us 41850 us 256 1380%

From these numbers one can see that the latency is very low, less than 65 microseconds per SQL query on average, and remember that some queries handle 100 rows and some of them are write queries.

One can also see that the MySQL Server uses a fair amount of CPU, so for this type of application one needs a considerable number of CPUs for the MySQL Server and on the application side.

Sysbench OLTP Write Only Results#

In this benchmark we only kept the write queries plus BEGIN and COMMIT. One can see that the base latency of write queries is below 100 microseconds even when performing a distributed two-phase commit transaction.

Benchmark TPS Average latency 95% Threads CPU
OLTP W 2.579/s 390 us 430 us 1 20%
OLTP W 4.304/s 460 us 570 us 2 58%
OLTP W 7.642/s 520 us 680 us 4 105%
OLTP W 12.717/s 630 us 840 us 8 200%
OLTP W 20.753/s 770 us 1040 us 16 350%
OLTP W 33.372/s 960 us 1270 us 32 543%
OLTP W 46.613/s 1370 us 2030 us 64 725%
OLTP W 52.691/s 2430 us 4820 us 128 840%
OLTP W 55.073/s 4640 us 11240 us 256 960%

From the numbers we can see that in this test the RonDB data node became the bottleneck, so one could easily increase the number of CPUs used here to get higher numbers.

Sysbench OLTP Read Only Results#

In this benchmark we remove the write queries and execute only the read queries. We note that the majority of the latency comes from these queries, and also that they use considerably more CPU in the MySQL Server. This is mainly due to the scan queries; the 10 point queries make up a very small part of this load.

Benchmark TPS Average latency 95% Threads CPU
OLTP R 1.251/s 800 us 840 us 1 51%
OLTP R 1.882/s 1060 us 1470 us 2 118%
OLTP R 3.540/s 1130 us 1610 us 4 230%
OLTP R 6.110/s 1310 us 1760 us 8 430%
OLTP R 9.848/s 1620 us 2030 us 16 735%
OLTP R 14.429/s 2220 us 2710 us 32 1120%
OLTP R 16.948/s 3770 us 4910 us 64 1350%
OLTP R 17.181/s 7450 us 11040 us 128 1450%
OLTP R 15.639/s 16360 us 31370 us 256 1500%

Sysbench OLTP Point Select Results#

This benchmark only runs single key lookups using SQL. Thus we can compare the cost of issuing those through MySQL with the cost of using the RonDB REST API server. A sketch of the command to run it is shown below.
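
Assuming the ndbcrunch suite follows the same naming convention as above; the test name sb_oltp_point_select is an assumption here:

./mtr --suite=ndbcrunch --defaults-extra-file=suite/ndbcrunch/cpubind.cnf sb_oltp_point_select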

Benchmark TPS Average latency 95% Threads CPU
OLTP P 25.568/s 40 us 40 us 1 43%
OLTP P 40.913/s 50 us 70 us 2 115%
OLTP P 76.666/s 50 us 70 us 4 220%
OLTP P 137.118/s 60 us 90 us 8 420%
OLTP P 238.947/s 70 us 100 us 16 740%
OLTP P 397.695/s 80 us 120 us 32 1160%
OLTP P 505.836/s 130 us 230 us 64 1360%
OLTP P 565.536/s 230 us 440 us 128 1530%
OLTP P 563.731/s 450 us 840 us 256 1730%

As the numbers show, the MySQL Server scales very nicely all the way up to 400k point queries per second. After that the MySQL Server becomes overloaded and latency grows.

Compared to the REST API server, the MySQL Server consumes a bit more CPU while the latency is similar. So for key lookups the REST API server is preferable, and even more so when considering batches of key lookups, as often happens in AI applications.

We can also see that in the MySQL Server the 10 point queries account for about a third of the cost of the Read Only transaction, compared to the 4 scan queries.

RonSQL#

RonSQL supports pushdown aggregation for queries on a single table. The reason for developing this support is that AI inferencing is often built on features that constitute aggregates of many events. These aggregates are often calculated by offline systems that go through historical data.

However, real-time AI inferencing requires the features to be calculated on current events as well as historical events. Thus RonDB supports storing event data in time windows, using RonSQL to calculate aggregates based on the data in those tables, and getting rid of the data when it is too old using the TTL feature.

To benchmark this we used a very simple benchmark consisting of one query from TPC-H. Q1 in TPC-H scans all rows in the lineitem table, which contains about 6M rows per scale factor. Using the MySQL Server, this query requires most of the table to be sent up to the MySQL Server to calculate the aggregates there. This takes about 1 second at scale factor 1. Q1 is sketched below.
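
For reference, a sketch of TPC-H Q1 as it could be issued through the mysql client (the standard TPC-H formulation; the tpch database name, host and port are assumptions):

mysql -h 127.0.0.1 -P 13001 -u root tpch << 'SQL'
SELECT l_returnflag, l_linestatus,
       SUM(l_quantity) AS sum_qty,
       SUM(l_extendedprice) AS sum_base_price,
       SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       AVG(l_quantity) AS avg_qty,
       AVG(l_extendedprice) AS avg_price,
       AVG(l_discount) AS avg_disc,
       COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL 90 DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
SQL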

With RonSQL the calculation happens where the data resides, and it happens on all partitions in parallel. This means that in a standard setup with 6 partitions we can handle this query 10x faster, in 100 ms. By doubling the number of partitions we get a 20x improvement, and the query executes in 50 ms instead.