RonDB Benchmarks#
This chapter reports some benchmark numbers for RonDB. The numbers in this chapter come mostly from simple benchmarks executed on one of our development machines. This means that they are very simple to reproduce, but obviously they don't show the scalability of RonDB. For that we will reference some larger benchmark efforts.
Benchmarks of RonDB focus on showing the performance of the following data services provided by RonDB:
- RonDB REST API
- RonDB Feature Store REST API
- Rondis
- MySQL
- RonSQL
RonDB REST API#
The RonDB REST API has two endpoints, the pkread endpoint and the batch endpoint.
The pkread endpoint issues a read using a primary key lookup on a table, given a number of columns to read.
To execute this benchmark one can simply start a cluster using MTR. The simplest manner is to install your binary tarball somewhere where you want to run it. Next go into the directory mysql-test and start a cluster. Before starting you likely want to change the contents of mysql-test/include/default_ndbd.cnf. You most likely want to set AutomaticMemoryConfig and AutomaticThreadConfig to 1, and you should also set NumCPUs to the number of CPUs you want each of the two RonDB data nodes to use. Next you will want to remove the line about ThreadConfig. It is also a good idea to increase the following parameters: DataMemory, SharedGlobalMemory, RedoBuffer, FragmentLogFileSize and LongMessageBuffer. One should also add the parameter TotalMemoryConfig.
The reason you need to change so many parameters is that the default setting is geared towards running functional test cases, and these mostly execute very small amounts of concurrent transactions and schema changes. A benchmark requires more memory and CPU resources. In my case I usually use the following settings (there is nothing scientific about these settings, they are simply good enough):
- AutomaticMemoryConfig=1
- AutomaticThreadConfig=1
- NumCPUs=6
- DataMemory=2500M
- TotalMemoryConfig=5G
- FragmentLogFileSize=1G
- SharedGlobalMemory=200M
- RedoBuffer=48M
- LongMessageBuffer=80M
- SpinMethod=StaticSpinning
In a default setup one would use SpinMethod=LatencyOptimisedSpinning. However, using StaticSpinning means that we can rely on top to see how much CPU the RonDB data nodes actually use. Spinning improves the throughput in all cases, but it does show a higher CPU usage, sometimes even much higher than the CPU actually used for real work.
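For example, assuming the data nodes run as the multithreaded ndbmtd binary, their CPU usage can be observed with something like:

```bash
# Show only the RonDB data node processes in top, refreshing every second
top -d 1 -p "$(pgrep -d, ndbmtd)"
```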
These numbers assume you have at least 16 CPUs and 16 GB of memory in your machine. With more or fewer CPUs you can adjust NumCPUs appropriately, and with more memory you can increase DataMemory and TotalMemoryConfig.
The RonDB cluster is started using the following command, executed in the mysql-test directory of the binary distribution after updating the configuration file as directed above.
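For example, mysql-test-run.pl can start the configured servers for a given test without running it via its --start option. The exact suite and test name that also brings up the two REST API servers depends on your RonDB distribution, so the test name below is only a placeholder:

```bash
cd mysql-test
# --start brings up the configured servers and waits; replace ndb.ndb_basic
# with the test your RonDB distribution uses for the REST API benchmarks.
./mysql-test-run.pl --start ndb.ndb_basic
```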
This command will start the cluster. The benchmarks for the REST API are executed by Go programs found in storage/ndb/rest-server2/server/test_go. The following command will list the available test cases for the REST API.
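A minimal way to list them, assuming standard Go tooling and the test_go module layout mentioned above:

```bash
cd storage/ndb/rest-server2/server/test_go
# Lists the benchmarks matching the pattern without running them
go test -list 'Benchmark' ./internal/integrationtests/...
```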
Starting the cluster will also create two files in the var directory, rdrs.1.1.config.json and rdrs.2.1.config.json. We first need to stop these two REST API servers and reconfigure them. So first kill them using kill -9 as shown below. Next edit the configuration files and place them in a path known to you. Finally move to the bin directory of the RonDB binary distribution and run the two REST API servers with the proper configuration.
```bash
killall -9 rdrs2
./rdrs2 --config /path/rdrs.1.1.config.json
./rdrs2 --config /path/rdrs.2.1.config.json
```
Here is the configuration we used in the benchmarks below (rdrs.1.1.config.json):
```json
{
  "PIDFile": "/path/rdrs.1.1.pid",
  "REST": {
    "ServerIP": "0.0.0.0",
    "ServerPort": 13005,
    "NumThreads": 64
  },
  "RonDB": {
    "Mgmds": [ { "IP": "localhost", "Port": 13000 } ],
    "ConnectionPoolSize": 2
  },
  "Security": {
    "TLS": {
      "EnableTLS": false,
      "RequireAndVerifyClientCert": false
    },
    "APIKey": {
      "UseHopsworksAPIKeys": false
    }
  },
  "Rondis": {
    "Enable": true,
    "ServerPort": 6379,
    "NumThreads": 64,
    "NumDatabases": 2,
    "Databases": [
      { "Index": 0, "DirtyINCRDECR": true },
      { "Index": 1, "DirtyINCRDECR": true }
    ]
  },
  "Log": {
    "Level": "INFO"
  },
  "Testing": {
    "MySQL": {
      "Servers": [ { "IP": "localhost", "Port": 13001 } ],
      "User": "root",
      "Password": ""
    }
  }
}
```
The second REST API server uses ports 13006 and 6380 instead of 13005 and 6379.
Now before running the benchmarks you need to point to the configuration file of the REST API server with the environment variable RDRS_CONFIG_FILE.
Executing the following command will benchmark the pkread endpoint. The only benchmark on the pkread endpoint is the BenchmarkSimple test case.
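A plausible invocation, assuming standard Go tooling and the configuration file edited above (paths are examples):

```bash
cd storage/ndb/rest-server2/server/test_go
export RDRS_CONFIG_FILE=/path/rdrs.1.1.config.json
# -run '^$' skips the ordinary unit tests so that only the benchmark runs
go test -bench BenchmarkSimple -run '^$' ./internal/integrationtests/pkread/
```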
The following command will execute the 3 different benchmarks on the batch endpoint: BenchmarkSimple, BenchmarkManyColumns and BenchmarkBinary.
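Along the same lines, a run of all benchmarks in the batchpkread package could look like this (again a sketch using standard Go tooling):

```bash
# Runs every benchmark in the batchpkread package
go test -bench . -run '^$' ./internal/integrationtests/batchpkread/
```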
One can ensure that only one of those is executed using the following command.
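For example, to run only BenchmarkBinary one can anchor the benchmark pattern (an illustrative regular expression):

```bash
go test -bench '^BenchmarkBinary$' -run '^$' ./internal/integrationtests/batchpkread/
```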
The first time you run one of those benchmarks, the script will fill the database with a set of tables used in those benchmarks. This will take some time, and after that it will start running the benchmark and output benchmark results.
BenchmarkSimple runs a simple retrieval in a table with a single primary key column, reading a simple VARCHAR(100) column. Thus it tests the maximum throughput of the REST API server.
BenchmarkManyColumns runs a retrieval from a table with one primary key column and 100 small columns storing some string. This benchmark measures the effect of handling many columns in the REST API server.
BenchmarkBinary runs a retrieval from a table with a single primary key column using a single large column that stores around 3 kBytes of data. This benchmark shows how the REST API server handles large volumes of binary data.
All benchmarks report the batch size, the number of threads, the throughput, the median response time and the 99% response time.
Since the benchmark is running on a single machine or VM, it is simple to check the CPU resources used by the different processes. For the most part the focus of the benchmarking is around the efficiency of the REST API server process.
RonDB REST API Results#
The results provided here were obtained on a desktop with an Intel Raptor Lake CPU with 16 efficiency cores and 8 power cores. The memory size was 64 GB and the disks were fast enough not to affect the benchmarking.
The CPU is called 13th Gen Intel(R) Core(TM) i9-13900K.
No special attempt was made to lock processes or threads to specific CPUs. In a production setup this is something RonDB would normally do. Normally each RonDB process runs in its own virtual machine or in its own pod, where the pod mainly consists of the container running the RonDB process; only some init container is there as well. In this setup it is natural to also lock threads to CPUs for RonDB data nodes. No such attempts are made for REST API servers and MySQL Servers.
BenchmarkSimple with pkread endpoint#
This benchmark scales only on the number of threads used in the Go client. This can be changed by editing the file storage/ndb/rest-server2/server/test_go/internal/integrationtests/pkread/performance_test.go. Find GOMAXPROCS and change it from 24 to whatever number of threads you want to use.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkSimple | 23.654/s | 40 us | 59 us | 1 | 1 | 38% |
BenchmarkSimple | 38.230/s | 49 us | 81 us | 1 | 2 | 70% |
BenchmarkSimple | 56.663/s | 66 us | 170 us | 1 | 4 | 140% |
BenchmarkSimple | 90.713/s | 84 us | 198 us | 1 | 8 | 290% |
BenchmarkSimple | 141.985/s | 105 us | 234 us | 1 | 16 | 495% |
BenchmarkSimple | 199.978/s | 144 us | 408 us | 1 | 32 | 715% |
BenchmarkSimple | 236.875/s | 238 us | 724 us | 1 | 64 | 790% |
If one runs the same benchmark in a distributed setup there will be more latency, since the packets have to be transferred over a network, which usually adds around 50 microseconds per hop in a public cloud setting. So the numbers here show how much latency is due to the actual execution of the REST calls.
The CPU usage increases faster than throughput since the NDB API provides very good latency when the number of requests is very low. At the same time the NDB API becomes more efficient again when the load increases, thus at extreme loads there will be more throughput available.
BenchmarkSimple with batchpkread endpoint#
Also in this case it is possible to change the behaviour by editing the Go client. This time the file to edit is storage/ndb/rest-server2/server/test_go/internal/integrationtests/batchpkread/performance_test.go. Here there are two parameters that can be changed: as above, the number of client threads can be changed by setting GOMAXPROCS, and in addition the batch size can be set through the variable batchSize.
We will start by seeing how performance changes with a single thread while increasing the batch size.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkSimple | 22.595/s | 42 us | 62 us | 1 | 1 | 35% |
BenchmarkSimple | 36.949/s | 52 us | 78 us | 2 | 1 | 40% |
BenchmarkSimple | 63.319/s | 60 us | 90 us | 4 | 1 | 44% |
BenchmarkSimple | 87.474/s | 86 us | 174 us | 8 | 1 | 48% |
BenchmarkSimple | 124.857/s | 121 us | 252 us | 16 | 1 | 50% |
BenchmarkSimple | 162.073/s | 187 us | 349 us | 32 | 1 | 52% |
BenchmarkSimple | 185.027/s | 327 us | 548 us | 64 | 1 | 55% |
BenchmarkSimple | 203.803/s | 596 us | 872 us | 128 | 1 | 56% |
BenchmarkSimple | 211.251/s | 1164 us | 1496 us | 256 | 1 | 59% |
As can be seen in those numbers, we get almost the same results when the number of threads is replaced by an equivalent batch size, both in terms of throughput and latency. However, we can see from the CPU usage that sending batches consumes around 10x less CPU in the REST API server. Both the batch size and the number of threads reach a plateau around 100.
So with these results it is now interesting to move to testing with a combination of threads and batch sizes.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkSimple | 137.791/s | 106 us | 241 us | 4 | 4 | 142% |
BenchmarkSimple | 319.844/s | 187 us | 410 us | 16 | 4 | 185% |
BenchmarkSimple | 551.096/s | 434 us | 747 us | 64 | 4 | 206% |
BenchmarkSimple | 319.658/s | 190 us | 397 us | 8 | 8 | 350% |
BenchmarkSimple | 691.409/s | 349 us | 647 us | 32 | 8 | 372% |
BenchmarkSimple | 1.028.243/s | 947 us | 1520 us | 128 | 8 | 405% |
BenchmarkSimple | 693.094/s | 337 us | 750 us | 16 | 16 | 542% |
BenchmarkSimple | 1.379.632/s | 678 us | 1193 us | 64 | 16 | 710% |
BenchmarkSimple | 1.688.128/s | 2220 us | 5071 us | 256 | 16 | 725% |
BenchmarkSimple | 1.277.227/s | 730 us | 1684 us | 32 | 32 | 800% |
BenchmarkSimple | 1.640.773/s | 730 us | 1684 us | 64 | 32 | 800% |
BenchmarkSimple | 2.015.032/s | 1917 us | 3893 us | 128 | 32 | 800% |
BenchmarkSimple | 2.146.081/s | 3587 us | 6099 us | 256 | 32 | 1060% |
BenchmarkSimple | 1.799.511/s | 1984 us | 5419 us | 64 | 64 | 985% |
BenchmarkSimple | 2.072.006/s | 2358 us | 4442 us | 128 | 40 | 800% |
What we see from the above numbers is that batching provides the most efficient way of achieving high performance. However, to get to the full potential of the REST API server it is necessary to use multiple threads in combination with batching.
RonDB is heavily used in AI applications such as personalized recommendation systems. In these systems it is very common to perform batches of hundreds of key lookups with requirements on very low latency. As can be seen here, we can achieve latency well below one millisecond when responding to hundreds of key lookups.
The application can also choose to split the batches in several requests and execute them in parallel on either different REST API servers or on different threads in the same REST API server. Thus it is possible to optimise on both latency and throughput using RonDB REST API servers.
The next question is what benefits there are from using multiple REST API servers. It is fairly easy to extend the above tests to run 2 benchmark clients using different REST API servers. They will still use the same RonDB data nodes, but will not interact other than competing for the CPU resources since those benchmarks are executed on a single desktop machine.
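One way to set this up, sketched here under the assumption that the two REST API servers were started with the configurations described earlier, is simply to run two copies of the Go benchmark in parallel, each pointed at its own configuration file:

```bash
cd storage/ndb/rest-server2/server/test_go
# Each benchmark instance talks to a different REST API server (paths are examples)
RDRS_CONFIG_FILE=/path/rdrs.1.1.config.json \
  go test -bench BenchmarkSimple -run '^$' ./internal/integrationtests/batchpkread/ &
RDRS_CONFIG_FILE=/path/rdrs.2.1.config.json \
  go test -bench BenchmarkSimple -run '^$' ./internal/integrationtests/batchpkread/ &
wait
```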
In this table we report numbers from 2 REST API servers; we add up the CPU usage, so it is the total CPU usage and the total throughput that are provided.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkSimple | 40.667/s | 45 us | 80 us | 1 | 1 | 70% |
BenchmarkSimple | 218.801/s | 132 us | 280 us | 16 | 1 | 97% |
BenchmarkSimple | 399.650/s | 602 us | 850 us | 128 | 1 | 114% |
BenchmarkSimple | 498.013/s | 250 us | 500 us | 16 | 4 | 378% |
BenchmarkSimple | 1.086.516/s | 349 us | 647 us | 128 | 4 | 416% |
BenchmarkSimple | 1.946.243/s | 975 us | 1880 us | 64 | 16 | 1025% |
BenchmarkSimple | 2.250.000/s | 2400 us | 4800 us | 128 | 24 | 1168% |
What we see from these numbers is that one benefits more from adding REST API servers rather than adding threads. At the same time there is a limit on the number of slots for REST API servers in a cluster, so it makes sense to scale on threads as well as on number of servers.
The general rule we use at Hopsworks is to allocate 16 CPUs per REST API server. If more is needed one adds more REST API servers. It is possible to have quite a few REST API servers in a cluster. One can also go down to 8 CPUs or even 4 CPUs and still have good behaviour.
BenchmarkManyColumns with batchpkread endpoint#
This benchmark reads 100 small columns instead of a single column. This puts a lot more load on the REST API server since it has to parse a lot more in the request and also the RonDB data node will require more processing to fetch 100 columns instead of 1.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkManyColumns | 7.899/s | 120 us | 299 us | 1 | 1 | 41% |
BenchmarkManyColumns | 16.098/s | 472 us | 804 us | 8 | 1 | 41% |
BenchmarkManyColumns | 51.137/s | 597 us | 1053 us | 8 | 4 | 153% |
BenchmarkManyColumns | 121.893/s | 1006 us | 1665 us | 8 | 16 | 516% |
BenchmarkManyColumns | 150.031/s | 1652 us | 3191 us | 8 | 32 | 695% |
One problem with this benchmark is that the Go client consumes almost all of the CPU resources. Thus it is hard to scale the benchmarks any further than shown above. In the last line the RonDB data nodes used only 210% CPU resources whereas the Go client used more than 1400%.
BenchmarkBinary with batchpkread endpoint#
This benchmark is also quite heavy on the Go client, so again it is a bit harder to fully test the REST API server on a single machine.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
BenchmarkBinary | 12.280/s | 64 us | 160 us | 1 | 1 | 26% |
BenchmarkBinary | 24.008/s | 225 us | 530 us | 8 | 1 | 26% |
BenchmarkBinary | 74.830/s | 294 us | 563 us | 8 | 4 | 109% |
BenchmarkBinary | 111.947/s | 380 us | 748 us | 8 | 8 | 208% |
BenchmarkBinary | 170.986/s | 497 us | 982 us | 8 | 16 | 366% |
BenchmarkBinary | 201.251/s | 883 us | 2064 us | 8 | 32 | 482% |
BenchmarkBinary | 229.624/s | 1177 us | 2475 us | 12 | 32 | 495% |
The benchmark client seems not to scale well with larger batch sizes, but the picture looks as expected: the REST API server requires more time to handle many columns compared to handling large columns. However, the CPU usage is still very light and one can easily scale to millions of requests with both a large number of columns and/or large columns.
Rondis#
All the setup for the REST API is also prepared to run Rondis tests using the same REST API binary.
Rondis results#
The Rondis benchmarks will be executed using a simple tool called redis-benchmark.
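For example, a stock redis-benchmark run against the first Rondis server (port 6379, as configured above) could look like this; -t selects the commands, -n the total request count, -d the value size in bytes, and -c the number of parallel client connections, which roughly corresponds to the Threads column in the tables below:

```bash
redis-benchmark -p 6379 -t set,get -n 1000000 -c 16 -d 100
```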
Result from SET operation#
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
SET | 8.356/s | 119 us | 175 us | 1 | 1 | 15% |
SET | 13.670/s | 143 us | 199 us | 1 | 2 | 35% |
SET | 24.456/s | 159 us | 327 us | 1 | 4 | 68% |
SET | 47.994/s | 159 us | 271 us | 1 | 8 | 127% |
SET | 87.612/s | 175 us | 279 us | 1 | 16 | 196% |
SET | 104.196/s | 239 us | 391 us | 1 | 32 | 235% |
SET | 108.295/s | 399 us | 575 us | 1 | 64 | 256% |
In this benchmark we are limited by the RonDB data nodes, but also by the fact that the SET command only issues one write operation at a time.
Result from MSET operation#
To make this benchmark we did some minor changes to valkey-benchmark to make the batch size in the MSET test flexible. So, as with the REST API server, we will start by seeing the impact of increasing the batch size. We report the numbers as the number of rows changed per second rather than the number of commands finished as reported by the benchmark tool.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
MSET | 8.313/s | 119 us | 175 us | 1 | 1 | 16% |
MSET | 14.279/s | 135 us | 199 us | 2 | 1 | 20% |
MSET | 24.808/s | 159 us | 207 us | 4 | 1 | 28% |
MSET | 45.661/s | 175 us | 223 us | 8 | 1 | 34% |
MSET | 80.882/s | 199 us | 255 us | 16 | 1 | 40% |
MSET | 139.300/s | 231 us | 303 us | 32 | 1 | 45% |
MSET | 212.371/s | 303 us | 415 us | 64 | 1 | 55% |
MSET | 307.028/s | 415 us | 551 us | 128 | 1 | 61% |
MSET | 356.050/s | 695 us | 1031 us | 256 | 1 | 73% |
From these numbers we can conclude that it takes very little effort for Rondis using MSET with large batch sizes to effectively run enough requests to keep the RonDB data node busy. Latency numbers are also very good and very stable; there is a fairly small difference between the median latency and the 99% latency.
So let’s see what mixing batching and threading gives in the Rondis case.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
MSET | 88.574/s | 170 us | 359 us | 4 | 4 | 90% |
MSET | 260.036/s | 231 us | 487 us | 16 | 4 | 112% |
MSET | 571.327/s | 423 us | 767 us | 64 | 4 | 144% |
MSET | 782.205/s | 639 us | 967 us | 64 | 8 | 205% |
MSET | 837.148/s | 1199 us | 1567 us | 128 | 8 | 191% |
MSET | 927.760/s | 231 us | 303 us | 128 | 16 | 247% |
MSET | 883.369/s | 303 us | 415 us | 256 | 16 | 285% |
MSET | 970.638/s | 2439 us | 3615 us | 100 | 24 | 285% |
MSET | 1.005.025/s | 2823 us | 4223 us | 90 | 32 | 338% |
So even with write operations it is possible to get much better throughput and improved latency using batching, even pushing it past 1M key writes per second with a bit higher latency. As we saw previously, batching can improve throughput at similar latency by a factor of 10. Interestingly, this factor was the same 25 years ago. So over time, with so many hardware developments, the benefit of asynchronous programming over synchronous programming stays at a factor of 10x.
Result from GET operation#
We will now look at the GET operation, which is limited in performance since it can only handle one operation at a time. As can be seen from the numbers, the performance increases stepwise as the number of threads increases.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
GET | 32.957/s | 31 us | 39 us | 1 | 1 | 40% |
GET | 50.691/s | 39 us | 55 us | 1 | 2 | 79% |
GET | 83.787/s | 47 us | 95 us | 1 | 4 | 151% |
GET | 119.653/s | 63 us | 103 us | 1 | 8 | 247% |
GET | 117.589/s | 103 us | 159 us | 1 | 16 | 270% |
GET | 162.588/s | 119 us | 207 us | 1 | 32 | 356% |
GET | 153.941/s | 207 us | 375 us | 1 | 64 | 340% |
Result from MGET operation#
Now we turn to MGET operations. As usual, we start by seeing the impact of increasing the batch size.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
MGET | 34.249/s | 31 us | 39 us | 1 | 1 | 40% |
MGET | 54.607/s | 39 us | 55 us | 2 | 1 | 52% |
MGET | 121.902/s | 31 us | 47 us | 4 | 1 | 45% |
MGET | 180.438/s | 47 us | 63 us | 8 | 1 | 58% |
MGET | 362.154/s | 47 us | 63 us | 16 | 1 | 54% |
MGET | 557.055/s | 55 us | 79 us | 32 | 1 | 61% |
MGET | 728.307/s | 87 us | 119 us | 64 | 1 | 67% |
MGET | 827.194/s | 143 us | 199 us | 128 | 1 | 79% |
MGET | 870.185/s | 271 us | 375 us | 256 | 1 | 88% |
We can see that batching is significantly more efficient compared to threading.
Next we look at the combination of batching and threading.
Benchmark | Throughput | Median latency | 99% | Batch | Threads | CPU |
---|---|---|---|---|---|---|
MGET | 1.703.713/s | 143 us | 287 us | 64 | 4 | 270% |
MGET | 3.124.480/s | 143 us | 271 us | 64 | 8 | 485% |
MGET | 4.524.375/s | 303 us | 511 us | 100 | 16 | 700% |
MGET | 4.825.091/s | 407 us | 655 us | 100 | 24 | 745% |
MGET | 4.880.429/s | 511 us | 807 us | 100 | 32 | 780% |
MGET | 5.086.117/s | 559 us | 879 us | 110 | 32 | 815% |
As usual we see a clear increase when threading and batching are combined. Obviously the key lookups are very small here, so this is a very synthetic benchmark, but 5M lookups per second on a single machine is still fairly impressive.
MySQL Results#
RonDB has a number of different interfaces. One of the prominent ones is the MySQL Server that provides an SQL interface towards RonDB. As we have seen in the benchmarks on the REST API and Rondis we have very good results when RonDB can combine threading and batching.
It is a lot more difficult to achieve batching in the SQL interface. Actually the NDB team spent 10 years figuring out ways to achieve the batching required to get good performance using SQL.
However SQL is more useful for queries that are more complex in nature. RonDB has its greatest performance strength in batching key lookups. However RonDB still has amazingly low latency even through the SQL interface. It also has a very decent ability to send very large data sets over the network. Actually, Sysbench OLTP RW sends 1 GByte of data from the RonDB data node to the MySQL Server in the results reported below.
Another benefit of RonDB is that it scales very well, thus one can simply add more MySQL Servers to achieve higher throughput. In this blog post we discuss the scalability of RonDB using up to 12 MySQL Servers. Similarly, in this blog post we discuss how RonDB behaves when running Sysbench.
There is even a blog post describing how one can do a bit of batching using SQL in Sysbench to achieve 5.5M key lookups per second.
There is a post on the RonDB website providing a detailed benchmark study using YCSB (Yahoo Cloud Serving Benchmark). This post is unique in the sense that it compares a number of different benchmarks in normal operation with how the system behaves when a node fails, and how the node recovery affects throughput and latency. In addition it compares the behaviour of in-memory tables with disk tables.
In order to understand Sysbench OLTP benchmarks one needs to understand what queries Sysbench uses. A traditional Sysbench test executes 20 SQL queries as a single transaction. There are 10 Point queries performing single key lookups, 4 write queries (1 Insert, 2 Updates and 1 Delete), 4 scan queries that each return 100 rows within a range, and finally a BEGIN and a COMMIT query.
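For reference, this is the standard Sysbench oltp_read_write statement mix, shown here with the default sbtest1 table name; it is the stock Sysbench workload, not something specific to RonDB:

```sql
BEGIN;
-- 10 point selects
SELECT c FROM sbtest1 WHERE id = ?;
-- 4 range queries, each over a range of 100 rows
SELECT c FROM sbtest1 WHERE id BETWEEN ? AND ?;
SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN ? AND ?;
SELECT c FROM sbtest1 WHERE id BETWEEN ? AND ? ORDER BY c;
SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN ? AND ? ORDER BY c;
-- 4 write queries
UPDATE sbtest1 SET k = k + 1 WHERE id = ?;
UPDATE sbtest1 SET c = ? WHERE id = ?;
DELETE FROM sbtest1 WHERE id = ?;
INSERT INTO sbtest1 (id, k, c, pad) VALUES (?, ?, ?, ?);
COMMIT;
```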
Thus Sysbench is putting the transaction system into action and not only handling single row interactions. However most of the CPU usage in Sysbench is spent on handling the 4 scan queries. These 4 queries return 100 rows each to the MySQL Server and this means around 50-60 kByte of data sent on the network. It also means that the scanning needs to be efficient in quickly moving from row to row. RonDB is slightly faster than InnoDB in scanning rows. Thus RonDB behaves very well also in Sysbench tests.
Sysbench OLTP RW Results#
Executing a Sysbench test is very simple. The following command will execute a Sysbench test on a single machine using the MTR framework. This test will run through the benchmark using a number of different thread counts.
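The MTR-integrated command is not reproduced here. As an illustration only, an equivalent direct invocation against the MySQL Server that MTR starts (port 13001, user root with empty password, as in the REST API configuration above) could look roughly like this with standard Sysbench 1.0 options:

```bash
# First create and load the table in RonDB, then run the OLTP RW test
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=13001 \
  --mysql-user=root --mysql-db=test --mysql-storage-engine=ndbcluster \
  --tables=1 --table-size=1000000 prepare
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=13001 \
  --mysql-user=root --mysql-db=test \
  --tables=1 --table-size=1000000 --threads=16 --time=60 run
```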
In our case we made it a little bit more complex by using CPU binding, with the following settings (CPUs 16-31 are the power-efficient ones and 0-15 are the performance CPUs):
```ini
[cluster_config.ndbd.1.crunch]
cpubind=16-20

[cluster_config.ndbd.2.crunch]
cpubind=21-24

[mysqld.1.crunch]
cpubind=0-7

[mysqld.2.crunch]
cpubind=8-15

[mysqltest]
cpubind=26-31
```
To run with this configuration one uses the following command from the mysql-test directory in the binary distribution.
We also made a minor change in the my.cnf file by adding a more aggressive SpinMethod and removing NumCPUs, since locking to CPUs automatically determines the number of CPUs and which CPUs to use.
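As a sketch of what that my.cnf change could look like (assuming DatabaseMachineSpinning is the more aggressive setting; check the SpinMethod values supported by your RonDB version):

```ini
[cluster_config.ndbd.1.crunch]
cpubind=16-20
SpinMethod=DatabaseMachineSpinning

[cluster_config.ndbd.2.crunch]
cpubind=21-24
SpinMethod=DatabaseMachineSpinning
```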
Ok so on to results (TPS = Transactions per second):
Benchmark | TPS | Average latency | 95% | Threads | CPU |
---|---|---|---|---|---|---|
OLTP RW | 833/s | 1210 us | 1300 us | 1 | 41% | |
OLTP RW | 1.300/s | 1540 us | 2030 us | 2 | 100% | |
OLTP RW | 2.434/s | 1640 us | 2220 us | 4 | 190% | |
OLTP RW | 4.142/s | 1930 us | 2570 us | 8 | 350% | |
OLTP RW | 6.605/s | 2420 us | 3070 us | 16 | 620% | |
OLTP RW | 9678/s | 3330 us | 4100 us | 32 | 920% | |
OLTP RW | 12.267/s | 5210 us | 6670 us | 64 | 1190% | |
OLTP RW | 13.041/s | 9810 us | 14460 us | 128 | 1370% | |
OLTP RW | 11.995/s | 21300 us | 41850 us | 256 | 1380% |
So from these numbers one can see that the latency is very low, less than 65 microseconds per SQL query, and remember that some queries handle 100 rows and some of them are write queries. So latency is very low.
One can also see that the MySQL Server uses a fair amount of CPU, so for this type of applications one needs a considerable amount of CPUs on the MySQL Server and on the application side.
Sysbench OLTP Write Only Results#
In this benchmark we only kept the write queries and BEGIN and COMMIT. One can see that the base latency of write queries is below 100 microseconds even when performing a distributed two-phase commit transaction.
Benchmark | TPS | Average latency | 95% | Threads | CPU |
---|---|---|---|---|---|---|
OLTP W | 2.579/s | 390 us | 430 us | 1 | 20% | |
OLTP W | 4.304/s | 460 us | 570 us | 2 | 58% | |
OLTP W | 7.642/s | 520 us | 680 us | 4 | 105% | |
OLTP W | 12.717/s | 630 us | 840 us | 8 | 200% | |
OLTP W | 20.753/s | 770 us | 1040 us | 16 | 350% | |
OLTP W | 33.372/s | 960 us | 1270 us | 32 | 543% | |
OLTP W | 46.613/s | 1370 us | 2030 us | 64 | 725% | |
OLTP W | 52.691/s | 2430 us | 4820 us | 128 | 840% | |
OLTP W | 55.073/s | 4640 us | 11240 us | 256 | 960% |
From the numbers we can see that in this test the RonDB data node became the bottleneck, so one could easily increase the number of CPUs used here to get higher numbers.
Sysbench OLTP Read Only Results#
In this benchmark we remove the write queries and execute only the read queries. We note that the majority of the latency comes from these queries and we also note that they use considerably more CPU in the MySQL Servers. This is mainly due to the scan queries; the 10 Point queries make up a very small part of this load.
Benchmark | TPS | Average latency | 95% | Threads | CPU |
---|---|---|---|---|---|---|
OLTP R | 1.251/s | 800 us | 840 us | 1 | 51% | |
OLTP R | 1.882/s | 1060 us | 1470 us | 2 | 118% | |
OLTP R | 3.540/s | 1130 us | 1610 us | 4 | 230% | |
OLTP R | 6.110/s | 1310 us | 1760 us | 8 | 430% | |
OLTP R | 9.848/s | 1620 us | 2030 us | 16 | 735% | |
OLTP R | 14.429/s | 2220 us | 2710 us | 32 | 1120% | |
OLTP R | 16.948/s | 3770 us | 4910 us | 64 | 1350% | |
OLTP R | 17.181/s | 7450 us | 11040 us | 128 | 1450% | |
OLTP R | 15.639/s | 16360 us | 31370 us | 256 | 1500% |
Sysbench OLTP Point Select Results#
This benchmark only runs single key lookup using SQL. Thus we can compare the cost of issuing those through MySQL with those using RonDB REST API server.
Benchmark | TPS | Average latency | 95% | Threads | CPU |
---|---|---|---|---|---|---|
OLTP P | 25.568/s | 40 us | 40 us | 1 | 43% | |
OLTP P | 40.913/s | 50 us | 70 us | 2 | 115% | |
OLTP P | 76.666/s | 50 us | 70 us | 4 | 220% | |
OLTP P | 137.118/s | 60 us | 90 us | 8 | 420% | |
OLTP P | 238.947/s | 70 us | 100 us | 16 | 740% | |
OLTP P | 397.695/s | 80 us | 120 us | 32 | 1160% | |
OLTP P | 505.836/s | 130 us | 230 us | 64 | 1360% | |
OLTP P | 565.536/s | 230 us | 440 us | 128 | 1530% | |
OLTP P | 563.731/s | 450 us | 840 us | 256 | 1730% |
As the numbers show, the MySQL Server scales very nicely all the way up to 400k Point queries per second. After that the MySQL Server becomes overloaded and latency grows.
Compared to the REST API server the MySQL Server consumes a bit more CPU, while the latency is similar. So for key lookups the REST API server is preferable, and even more so when considering batches of key lookups as often happens in AI applications.
We can also see that the 10 Point queries account for only about a third of the Read Only load in the MySQL Server, compared to the 4 scan queries.
RonSQL#
RonSQL supports pushdown aggregation of queries on a single table. The reason for developing this support is that AI inferencing is often built on features that constitute aggregates of many events. These aggregates are often calculated by offline systems that go through historical data.
However real-time AI inferencing requires the features to be calculated on the current events as well as historical events. Thus RonDB supports storing event data in time windows, using RonSQL to calculate aggregates based on data in those tables and getting rid of this data when it is too old using the TTL feature.
To benchmark this we used a very simple benchmark consisting of one query in TPC-H. Q1 in TPC-H does a scan of all rows in the lineitem table. There are about 6M rows per scale factor in this table. Using the MySQL Server, this query requires most of this table to be sent up to the MySQL Server to calculate the aggregates there. This takes about 1 second with scale factor 1.
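For reference, TPC-H Q1 is the following single-table aggregation over lineitem (standard formulation; the 90-day substitution parameter may vary):

```sql
SELECT l_returnflag, l_linestatus,
       SUM(l_quantity)                                       AS sum_qty,
       SUM(l_extendedprice)                                  AS sum_base_price,
       SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
       SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       AVG(l_quantity)                                       AS avg_qty,
       AVG(l_extendedprice)                                  AS avg_price,
       AVG(l_discount)                                       AS avg_disc,
       COUNT(*)                                              AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL 90 DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
```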
With RonSQL the calculation happens where the data resides, and it happens on all partitions in parallel. This means that in a standard setup with 6 partitions we can handle this query 10x faster, in 100 ms. By doubling the number of partitions we can improve this to a 20x speedup, and the query executes in 50 ms instead.