
When to use RonDB#

What is special about RonDB#

RonDB is a key-value store with SQL capabilities. It is based on NDB Cluster, also known as MySQL Cluster.

It is a DBMS (Database Management System) designed for the most demanding mission-critical applications: telecom networks, internet infrastructure, financial applications, storage infrastructure, web applications, mobile apps and many other areas such as gaming, train control, vehicle control and a lot more that you can probably come up with better than me.

It is probably the DBMS and key-value database with the highest availability statistics, surpassing most competitors by more than an order of magnitude.

When a transaction in RonDB has completed, all replicas have been updated and you will always be able to see your own updates. This is an important feature that makes application development a lot easier compared to using eventual consistency. It is also a requirement for constant uptime even in the presence of failures.

The requirement from the telecom network was that complex transactions had to complete within ten milliseconds. RonDB meets this requirement, which also makes RonDB useful in financial applications and many other application categories that require bounded latency on queries.

It is designed with the following features in mind:

  1. Class 6 Availability (less than 30 seconds downtime per year)

  2. Data consistency in large clusters

  3. High Write Scalability and Read Scalability

  4. Predictable response time

  5. Available with MySQL interface, LDAP interface, file system interfaces

  6. Available with APIs from all modern languages

To reach Class 6, RonDB supports online software changes, online schema changes, online addition of nodes and global solutions with highly available fail-over. It is designed with two levels of replication: the first level is local replication that protects against hardware and software failures. The second level is global replication that protects against outages due to conflicts, power issues, earthquakes and so forth. Global replication can also be used to perform more complex changes than the local replication level allows.

At both the local and the global replication level, NDB is designed to support multi-primary solutions. In local replication this is completely transparent to the user.

NDB was developed in response to the development of network databases in the telecom world in the 1980s and 1990s. It was originally developed at Ericsson, where I spent 13 years learning telecom and databases and developing real-time systems for switches and databases. It has been a part of MySQL since 2003 and has been in production in many different mission-critical applications across the world since then.

At the moment there are at least several tens of thousands of clusters running in production and probably a lot more. NDB has proven itself in all the areas listed above and has added a few more unique selling points over time, such as a parallel query feature and good read scalability, in addition to the scalable writes that have been there from day one.

Experience has shown that NDB meets Class 6 availability (less than 30 seconds of downtime per year) and for extended periods even Class 8 availability (less than 0.3 seconds of downtime per year).

RonDB is based on NDB and moves the design focus to usage in cloud applications. RonDB is designed for automated management: the user only specifies the desired amount of replication and the size of the hardware to use. The RonDB managed version ensures that the user always has access to the data through the MySQL APIs and the native RonDB APIs.

To give a quick idea of whether RonDB is something for your application, I will list the unique selling points (USPs) of RonDB.

AlwaysOn for reads and writes#

I use the term AlwaysOn here to mean a DBMS that is essentially never down. RonDB's feature set makes it possible to perform most changes online.

This includes the following points:

  1. Can survive multiple node crashes in one cluster

  2. Can handle Add Column while writes happen

  3. Can Add/Drop indexes while writes happen

  4. Can Add/Drop foreign keys while writes happen

  5. Can Add new shards while writing

  6. Can reorganise data to use new shards while writing

  7. Can perform software upgrade while writing (multiple version steps)

  8. Automated node failure detection and handling

  9. Automated recovery after node failure

  10. Transactional node failures => Can survive multiple node failures per node group while writes happen

  11. Schema changes are transactional

  12. Support global failover for the most demanding changes

  13. Support global online switch over between clusters in different geographies

One of the base requirements of RonDB is to always support both reads and writes. The only acceptable downtime is a short period when a node fails. It can take up to a few seconds to discover that the node has failed (the time depends on the responsiveness of the operating system used). As soon as the failure has been discovered, data is immediately available for reads and writes; the reconfiguration time is measured in microseconds.

There are many other solutions that build on a federation of databases, that is, stand-alone DBMSs that replicate to each other at commit time. Given that they are built as stand-alone DBMSs, they are not designed to communicate with other systems until it is time to commit the transactions. Thus, with large transactions, the whole transaction has to be applied on the backup replicas before commit if immediate failover is to be possible. This would in turn stop all other commits, since current replication techniques use some form of token that is passed around, and thus large transactions would block the entire system in that case.

Thus these systems can never provide synchronous replication and at the same time provide immediate failover and predictable response times independent of transaction size. RonDB can deliver this since it is designed as a distributed DBMS where all replicas are involved before committing the transaction. Large transactions will lock all rows they touch, but all other rows are available for other transactions to concurrently read and write.

RonDB is designed from the start to handle as many failures as possible. It is possible, for example, to start with 4 replicas, see one failure at a time, end up with only 1 replica alive, and still continue to both read and write the database.

Many management operations can be performed while the system continues to serve both reads and writes. This is a unique area where RonDB is at the forefront. It is possible to add/drop indexes, add columns, add/drop foreign keys, upgrade software, add new data nodes and reorganise data to use those new nodes. All failure handling is automated: failure detection, failure handling and recovery of the nodes involved.

Schema changes are transactional, such that if they fail they can be rolled back. Schema transactions and user transactions cannot be combined in one transaction.
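
As a sketch of what such online operations can look like through the MySQL interface (table, column and index names here are illustrative, and NDB places some restrictions on which changes can be done in place, e.g. online-added columns need to be dynamic and NULLable):

mysql> ALTER TABLE t1 ALGORITHM=INPLACE, ADD COLUMN c2 INT NULL COLUMN_FORMAT DYNAMIC;
mysql> ALTER TABLE t1 ALGORITHM=INPLACE, ADD INDEX idx_c2 (c2);
mysql> ALTER TABLE t1 ALGORITHM=INPLACE, REORGANIZE PARTITION;

The last statement is the kind of operation used after adding new node groups, asking RonDB to redistribute the table's data onto the new nodes while reads and writes continue.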

Global solution for replication#

This includes the following points:

  1. Synchronous replication inside one cluster

  2. Asynchronous replication between clusters

  3. Conflict detection when using multiple master clusters

  4. Replication architecture designed for real-world physics

RonDB has a unique feature in that it supports multiple levels of replication. The base replication is the replication inside the cluster. This replication is synchronous and as soon as you have updated an object you will see the update in all other parts of the cluster.

The next level of replication is asynchronous replication between clusters. This replication takes the set of transactions committed during the last 10-millisecond period or so and replicates it to the other cluster. Thus the other cluster will see the updates from the updated cluster with a small delay.

These replication modes are independent of each other. It is possible to replicate to/from InnoDB using this approach as well.

The asynchronous replication supports multiple primary clusters. This requires conflict detection handling, and NDB provides APIs to handle conflicts when conflicting updates occur. Several different conflict detection mechanisms are supported. It is also possible to create complex replication architectures, such as circular replication.

The replication architecture is designed to handle real-world physics. The default synchronous replication is designed for communication inside a data center where latency is less than 100 microseconds to communicate between nodes and communication paths are wide (nowadays often 10G Ethernet and beyond).

With the development of public clouds we have a new level, availability zones; these normally communicate with each other in less than a millisecond, down to around 400 microseconds. Local replication can be used between availability zones.

When communicating between data centers in different regions the latency is normally at least 10 milliseconds and can reach 100 milliseconds if they are very far apart. In this case we provide the asynchronous replication option.

This means that wherever you place your data in the world, there is a replication solution for it inside RonDB.

The global replication solution enables continued operation in the presence of earthquakes and other major disturbances. It makes it possible to perform the most demanding changes with small disturbances. There are methods to ensure that switching applications over from one cluster to another can be performed in a completely online fashion using transactional update-anywhere logic.

Thus, if users want to follow the trends and move their data in RonDB from on-premise to the cloud, this can be done without any downtime at all.

Hardened High Availability#

NDB was originally designed for latency in the order of a few milliseconds. It was designed for Class 5 availability (less than 5 minutes of downtime per year). It turns out that we have achieved Class 6 availability in reality (less than 30 seconds of downtime per year).

NDB was first put into production in a high availability environment in 2004 using version 3.4 of the product. This user is still operating this cluster and now using a 7.x version.

Thousands and thousands of clusters have since been put into production usage.

With modern hardware we are now able to deliver response times on the order of 100 microseconds, and even faster using specialised communication hardware. At the same time the applications create larger transactions. Thus we still maintain transaction latencies on the order of a few milliseconds.

Consistency of data in large clusters#

This includes the following points:

  1. Fully synchronous transactions with non-blocking two-phase-commit protocol

  2. Data immediately seen after commit from any node in the cluster

  3. Can always read your own updates

  4. Cross-shard transactions and queries

In the MySQL world as well as in the NoSQL world there is a great debate about how to replicate for highest availability. Our design uses fully synchronous transactions.

Most designs have moved towards complex replication protocols since the simple two-phase commit protocol is a blocking protocol. Instead of moving to a complex replication protocol we solved the blocking part. Our two-phase commit protocol is non-blocking since we can rebuild the transaction state after a crash in a new node. Thus independent of how many crashes we experience we will always be able to find a new node to take over the transactions from the failed node (as long as the cluster is still operational, a failed cluster will always require a recovery action, independent of replication protocol).

During this node failure takeover, the rows that were involved in the transactions that lost their transaction coordinator remain locked. The remainder of the rows are unaffected; they can immediately be used in new transactions from any live transaction coordinator. The locked rows will remain locked until we have rebuilt the transaction states and decided the outcome of the transactions that lost their transaction coordinator.

When a transaction has completed, RonDB has replicated not only the logs of the changes but has also updated the data in each replica. Thus, independent of which replica is read, the reader will always see the latest changes.

Thus we can handle cross-shard transactions and queries; the data is available for reads immediately after commit.

The requirement to always be able to read your own updates is solved either by always sending reads to the primary replica, or through a setting on the table enabling the read backup feature, in which case the commit acknowledgement is delayed slightly to ensure that we can immediately read the backup replicas and see our own update.
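
A sketch of how the read backup feature can be enabled per table through the MySQL interface (table name and columns are illustrative); with this setting, committed data can be read from any replica and you still see your own updates:

mysql> CREATE TABLE t1 (
    ->   pk INT PRIMARY KEY,
    ->   payload VARCHAR(100)
    -> ) ENGINE=NDBCLUSTER COMMENT="NDB_TABLE=READ_BACKUP=1";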

Cluster within one data center#

[Figure: cluster within one data center]

RonDB was originally designed for clusters that reside within one data center. The latency to send messages within one data center can be anywhere from a few microseconds up to below one hundred microseconds for very large data centers. This means that we can complete a transaction in less than a millisecond and that complex transactions with tens of changes can be completed within ten milliseconds.

Using low latency hardware, the latency can be brought down even further. Dolphin ICS in Norway is a company that has specialised in low latency interconnect technology. Using their SuperSocket drivers it is possible to bring the latency of sending a message from one node to another down to less than one microsecond. In fact, the first communication technology that worked with NDB was based on Dolphin hardware already in the 1990s.

Using SuperSocket hardware is equivalent in speed to using RDMA protocols. NDB has supported specialised hardware interconnects in the past from OSE Delta and from Dolphin, and there has been experimental work on Infiniband transporters. Using SuperSocket driver technology removed the need for specialised transporter technology.

It is possible to use Infiniband technology with RonDB using IPoIB (IP over Infiniband). This gives great bandwidth, but no other significant advantages compared to Ethernet technologies.

Cluster within one region#

Cloud providers customarily bring down entire data centers from time to time. Building a highly available solution in a cloud therefore often requires using several data centers in the same region. Most cloud providers have three data centers per region. Most cloud providers promise latency of 1-2 milliseconds between data centers, whereas the Oracle cloud provides latency below half a millisecond. Inside a cloud data center (within an availability zone) the latency is below 100 microseconds for all cloud vendors.

This setup works fine with RonDB and will be covered in more detail in the chapter on RonDB in the cloud. The main difference is that the latency to communicate is a magnitude higher in this case. Thus not all applications will be a fit for this setup.

Several clusters in different regions#

[Figure: clusters in several regions]

For global replication we use an asynchronous replication protocol that is able to handle many clusters doing updates at the same time, providing conflict detection protocols (more on that below). Thus we use two levels of replication for the highest availability. RonDB is safe against earthquakes, data center failures and the most complex software and hardware upgrades.

RonDB is unique in combining the best possible behaviour in a local cluster with support for global replication.

The figure shows a complex setup that uses two clusters in different regions, where each cluster resides in several availability domains (ADs) within one region. RonDB Replication is used to ensure that a RonDB setup with global replication survives even if an entire region goes down.

We will cover this setup in greater detail in the part on Global Replication.

High read and write Scalability#

RonDB 21.04 can scale to 64 data nodes and 24 MySQL Servers and 12 API nodes (each MySQL Server and API node can use 4 cluster connections).

NDB showed in benchmarks already in 2015 that it can scale to 20 million write transactions per second and more than 200 million reads per second.

To achieve the optimal scaling the following points are important:

  1. Partition pruned index scans (optimised routing and pruning of queries)

  2. Distribution-aware transaction routing (semi-automatic)

  3. Partitioning schemes

RonDB is designed to scale both for reads and writes. RonDB implements sharding transparent to the user. It can include up to 32 node groups where data is stored. It is also possible to fully replicate tables within all node groups. The data nodes can be accessed from up to 24 MySQL Servers that all see the same view of the data.

This makes it possible to issue tens of millions of SQL queries per second against the same data set using local access thus providing perfect read scalability.
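
As a sketch, a small, rarely-updated table can also be declared fully replicated so that every node group holds a complete copy and any data node can serve reads on it locally (the table definition is illustrative, using the NDB_TABLE comment syntax):

mysql> CREATE TABLE country (
    ->   code CHAR(2) PRIMARY KEY,
    ->   name VARCHAR(64)
    -> ) ENGINE=NDBCLUSTER COMMENT="NDB_TABLE=FULLY_REPLICATED=1";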

To provide optimal scaling in a distributed system it is still important to provide hints and write code taking account of the partitioning scheme used.

The cost of sending messages has been, and continues to be, one of the main costs in a distributed system. To optimise sending, it is a good idea to start the transaction at the node where the data resides. We call this distribution-aware transaction routing.

When using the NDB API it is possible to explicitly state which node to start the transaction at. For the most part it is easier for the application to state that the transaction should start at the node where the primary replica of a certain record resides.

When using SQL the default mechanism is to place the transaction at the node that the first query is using. If the first query is

mysql> SELECT * FROM t1 WHERE pk = 1;

then the transaction will be started at the node where the record with primary key column pk equal to 1 is stored. This can be found by using the hash on the primary key (or distribution key if only a part of the primary key is used to distribute the records).
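
A sketch of how the distribution key can be declared as a subset of the primary key (table and column names are illustrative); the hash of the distribution key decides which node stores the primary replica of a row and thus where the transaction coordinator is best placed:

mysql> CREATE TABLE t2 (
    ->   user_id INT,
    ->   seq_no INT,
    ->   data VARCHAR(100),
    ->   PRIMARY KEY (user_id, seq_no)
    -> ) ENGINE=NDBCLUSTER PARTITION BY KEY (user_id);

A transaction whose first query provides user_id will then be routed to the node holding that user's partition, even if seq_no differs between the rows it touches.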

The next problem in a shared-nothing database is that the number of partitions grows with the number of nodes, and in RonDB also with the number of threads used. When scanning a table using an ordered index it is necessary to scan all partitions unless the full partition key is provided. Thus the cost of scanning using an ordered index grows as the cluster grows.

To counter this it is important to use partition pruned index scans as much as possible. A partition pruned index scan is an index scan where the full partition key is provided as part of the scan. Since the partition key is provided, we can limit the search to the partition where the rows with this partition key are stored.

In TPC-C, for example, it is a good idea to use the warehouse id as the distribution key for all tables. This means that any query that accesses only one warehouse needs to be routed to only one partition. Thus there is no negative scaling effect.
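
A sketch of this pattern (table, column and index names are illustrative): with the warehouse id as distribution key, an index scan whose condition provides the full partition key is pruned to a single partition.

mysql> CREATE TABLE orders (
    ->   w_id INT,
    ->   o_id INT,
    ->   c_id INT,
    ->   o_entry_d DATETIME,
    ->   PRIMARY KEY (w_id, o_id),
    ->   INDEX idx_cust (w_id, c_id)
    -> ) ENGINE=NDBCLUSTER PARTITION BY KEY (w_id);
mysql> SELECT o_id FROM orders WHERE w_id = 17 AND c_id = 42;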

If it isn't possible to use partition pruned index scans there is one more method to decrease the cost of ordered index scans. By default, tables are distributed over all nodes and all ldm threads (ldm stands for Local Database Manager and contains the record storage, hash indexes, ordered indexes and the local recovery logic). It is possible to specify at table creation time that the table should use fewer partitions. The minimum with the standard partitioning schemes is one partition per node group. It is also possible to specify an exact number of partitions, but then the data will not be evenly spread over the nodes, and this requires more understanding of the system and its behaviour. Therefore we have designed options that use fewer partitions in a balanced manner, as sketched below.
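
A sketch of one such balanced option, assuming the PARTITION_BALANCE settings of the NDB_TABLE comment syntax (FOR_RP_BY_NODE asks for one primary-replica partition per node instead of one per ldm thread; names are illustrative):

mysql> CREATE TABLE t3 (
    ->   pk BIGINT PRIMARY KEY,
    ->   val INT
    -> ) ENGINE=NDBCLUSTER COMMENT="NDB_TABLE=PARTITION_BALANCE=FOR_RP_BY_NODE";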

In RonDB we have added a new thread type called query threads. This makes it possible to read a partition from multiple threads concurrently, which increases scalability without having to create more partitions. In RonDB a table by default gets 6 partitions per node group.

By using these schemes in an efficient manner it is possible to develop scalable applications that can perform many millions of complex transactions per second. A good example of an application that has been successful in this is HopsFS, which implements the metadata layer of Hadoop HDFS using RonDB.

They use the id of the parent directory as the distribution key, which gives good use of partition pruned index scans for almost all file operations. They have shown that this scales to at least 12 data nodes and 60 application nodes (they ran out of lab resources to test any bigger clusters).

A presentation of these results in a research paper at the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing won the IEEE Scale Challenge Award in 2017.

Predictable response time#

This includes the following points:

  1. Main memory storage for predictable response time

  2. Durable by committing on multiple computers in memory (Network Durable)

  3. CPU cache optimised distributed hash algorithm

  4. CPU cache optimised T-tree index (ordered index)

  5. Application defined partitioning (primary key partitioning by default)

  6. Batched key lookup access

  7. Several levels of piggybacking in network protocols

  8. Maintaining predictable response time while improving throughput

Piggybacking happens at several levels. One level is that we pack many commit messages and some other messages together into special PACKED_SIGNAL messages. Another is that we use one socket for communication between nodes. Given that our internal architecture uses an asynchronous programming model, we automatically batch a lot of messages together into one TCP/IP message: we execute a set of messages, collect the responses in internal buffers until we are done executing, and only then send the messages. Finally, we execute in many threads in parallel, so when sending we can collect messages from many threads.

Telecom applications, financial applications and gaming applications have strict requirements on the time one is allowed to take before responding to the set of queries in a transaction. We have demanding users that require complex transactions with tens of interactions touching hundreds of rows to complete within a few milliseconds in a system loaded to 90% of its capacity.

This is an area where one would probably not expect to find an open source DBMS such as MySQL at the forefront. But in this application area RonDB comes out at the top. The vendors selecting NDB Cluster often have tough requirements: up to millions of transactions per second, Class 6 availability and at the same time providing 24x7 services and being available to work with users when they have issues.

The data structures used in NDB have been engineered with a lot of thought about how to use CPU caches in the best possible manner. As a testament to this we have shown that the threads handling the indexes and the data have an IPC of 1.27 (IPC = Instructions Per Cycle). A normal DBMS usually reports an IPC of around 0.25.

Base platform for data storage#

This includes the following points:

  1. MySQL storage engine for SQL access

  2. OpenLDAP backend for LDAP access

  3. HopsFS metadata backend for Hadoop HDFS (100s of PByte data storage)

  4. Prototypes of distributed disk block devices

  5. Integrated into Oracle OpenStack

RonDB can be used as a base platform for many different types of data services. A popular use is SQL access through MySQL. Another popular approach is to use NDB as the backend of an LDAP server. OpenLDAP has a backend storage engine that can use RonDB, and some users have developed their own proprietary LDAP servers on top of NDB (e.g. Juniper).

SICS, a research lab connected to KTH in Stockholm, Sweden, has developed HopsFS, a metadata backend for Hadoop HDFS that enables HDFS to store hundreds of petabytes of data and to handle millions of file operations per second. HopsFS uses disk data in NDB to store small files that are not efficient to store in HDFS itself.

Based on the work in this research lab a startup, Logical Clocks AB, was created that focuses on developing a Feature Store for Machine Learning applications, where RonDB is used for the online Feature Store and the offline Feature Store is built on top of HopsFS.

The development of RonDB is done by Logical Clocks AB.

Prototypes have been built successfully where NDB was used to store disk blocks to implement a distributed block device.

The list of use cases where it is possible to build advanced data services on top of RonDB is long and will hopefully grow longer over the years in the future. One of the main reasons for this is that RonDB separates the data server functionality from the query server functionality.

Multi-dimensional scaling#

This includes the following points:

  1. Scaling inside a thread

  2. Scaling with many threads

  3. Scaling with many nodes

  4. Scaling with many MySQL Servers

  5. Scaling with many clusters

  6. Built-in Load balancing in NDB API (even in presence of node failures)

RonDB has always been built for numerous levels of scaling. From the beginning it scaled to a number of nodes. The user can always scale further by using many clusters (some of our users take this approach). Since MySQL Cluster 7.0 there is a multithreaded data node, and each successive version has improved the support for multithreading in the data node and in the API nodes. The MySQL Server scales to a significant number of threads. A data node together with a MySQL Server scales to use the largest dual socket x86 servers.

RonDB has a scalable architecture inside each thread that makes good use of the CPU resources inside a node. This architecture has the benefit that the more load the system gets, the more efficiently it executes. Thus we get automatic overload protection.

In the NDB API we have an automatic load balancer built in. This load balancer will automatically pick the most suitable data node to perform the task specified by the application in the NDB API.

There is even load balancing performed inside a RonDB data node where the receive thread decides which LDM thread or query thread to send the request to.

Non-blocking 2PC transaction protocol#

RonDB uses a two-phase commit protocol. The research literature describes this protocol as a blocking protocol. Our variant of the protocol is non-blocking. The blocking problem stems from the fact that each transaction has a coordinator role; the question is what to do when the node holding the coordinator role crashes.

RonDB has a protocol to take over the transaction coordinator role of a crashed node. The state of the transaction coordinator is rebuilt in the new transaction coordinator by asking the transaction participants about the state of all ongoing transactions. We use this take-over protocol to decide on either abort or commit for each transaction that belonged to the crashed node. Thus there is no need to wait for the crashed node to come back.

The protocol is recursive, such that the transaction coordinator role can be taken over repeatedly through multiple node failures until the entire cluster fails.

Thus there are no blocking states in our two-phase commit protocol. Normally a node failure is handled within a second or two, or less, unless large transactions were ongoing at crash time.

Global checkpoints#

RonDB uses a method called Network Durable transactions; this means that when a transaction is acknowledged towards the API we know that the transaction is safe on several computers. It is however not yet safe on durable media (e.g. hard drive, SSD, NVMe or persistent memory).

In order to ensure that we recover to a consistent point after a cluster crash, we create regular consistent commit points called global checkpoints. We actually create two types of global checkpoints. One type is used for RonDB Replication; these are created about once per 10 milliseconds and are called epochs in RonDB Replication. The epochs are not durable on disk. Every second or two we create a global checkpoint that is durable. When we recover after a complete cluster crash we recover to one of those durable global checkpoints.
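
On a replica cluster in a RonDB Replication setup, the epochs that have been applied can be observed through SQL, assuming the standard mysql.ndb_apply_status table maintained by NDB replication:

mysql> SELECT server_id, epoch FROM mysql.ndb_apply_status;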

The NDB API provides the global checkpoint identifier of a committed transaction; this makes it possible to wait for that global checkpoint to become durable on disk if necessary.

The global checkpoint identifier is heavily used in our recovery protocols. It is a very important building block of RonDB.

Automatically Parallelised and Distributed Queries#

Any range scan that scans more than one partition is automatically parallelised. As an example, there are 4 range scans in the Sysbench OLTP benchmark. With 8 partitions per table those scan queries execute twice as fast as with tables that have only one partition. This is a case without any filtering in the data nodes; with filtering the improvement is bigger.

In MySQL Cluster 7.2 a method to execute complex SQL queries was added. This method uses a framework where a multi-table join can be sent as one query to the data nodes. It executes by reading one table at a time; each table reads a set of rows and sends the data of those rows back to the API node, while at the same time sending key information onwards to the next table together with information from the original request. This query execution is automatically parallelised in the data nodes.

There is still a bottleneck in that only one thread in the MySQL Server will be used to process the query results. Queries where a lot of the filtering is pushed to the data nodes can be highly parallelised.

There are a number of limitations on this support; the EXPLAIN command in MySQL gives a good idea of what will be used and of the reasons why a join cannot be pushed down to the data nodes (pushing joins down to the data nodes enables the query to be automatically parallelised).

Interestingly the MySQL Server can divide a large join into several parts where one part is pushed to the data node and another part is executed from the MySQL Server using normal single-table reads.
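
A sketch of how to check what is pushed down (table and column names are illustrative; the exact wording varies between versions, but pushed joins are reported in the Extra column of EXPLAIN, and EXPLAIN FORMAT=JSON gives more detail):

mysql> EXPLAIN SELECT t1.pk, t2.val
    ->   FROM t1 JOIN t2 ON t2.pk = t1.ref_id
    ->  WHERE t1.grp = 7;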

Rationale for RonDB#

The idea to build NDB Cluster, which later was merged into the MySQL framework and became RonDB, came from the analysis Mikael Ronström did as part of his Ph.D studies. He worked at Ericsson and participated in a study about the new UMTS mobile system developed in Europe that later turned into 3G. As part of this study there was a lot of work on understanding the requirements of network databases for the UMTS system. In addition he studied a number of related areas like Number Portability, Universal Personal Telephony, Routing Servers, Name Servers (e.g. DNS Servers) and Intelligent Network Servers. Other application areas studied were News-on-demand, multimedia email services and event data services, and finally he studied data requirements for genealogy.

The areas of study were requirements on response times, transaction rates, availability and a few more aspects.

These studies led to a few conclusions:

  1. Predictable response time requirements required a main memory database

  2. Predictable response time requirements required using a real-time scheduler

  3. Throughput requirements required a main memory database

  4. Throughput requirements required building a scalable database

  5. Availability requirements required using replicated data

  6. Availability requirements required using a Shared-Nothing Architecture

  7. Availability requirements mean that applications should not execute inside the DBMS unless in a protected manner

  8. Certain applications need to store large objects on disk inside the DBMS

  9. The most common operation is key lookup

  10. Most applications studied were write intensive

Separation of Data Server and Query Server#

In MySQL there is a separation between the Query Server and the Data Server functionality. The Query Server is what takes care of handling the SQL queries and maps those to lower-level calls into the Data Server. The API to the Data Server in MySQL is the storage engine API.

Almost all DBMSs have a similar separation between Data Server and Query Server. However, there are many differences in where the interfaces are placed: between Data Server and Query Server, and between Query Server and application.

Fully integrated model#

The figure below shows a fully integrated model where the application runs in the same binary as the DBMS. This gives very low latency. The first database product Mikael Ronström worked on was called DBS and was a database subsystem in AXE that generated PLEX code (PLEX was the programming language used in AXE) from SQL statements. These are probably the most efficient SQL queries in existence: a select query that read one column using a primary key took 2 assembler instructions in the AXE CPU (APZ).

There were methods to query and update the data through external interfaces, but those were a lot slower compared to internal access.

[Figure: fully integrated model]

This model is not flexible. It is great in some applications, but the applications have to be developed with the same care as the DBMS since they run in the same memory. It is not a model that works well when the absolutely best availability is required, and it doesn't provide the benefits of scaling the DBMS in many different directions.

The main reason for avoiding this approach is the high availability requirement. If the application gets a wrong pointer and writes some data out of place, it can change the data inside the DBMS. Another problem with this approach is that it becomes difficult to manage in a shared nothing architecture since the application will have to be colocated with its data for the benefits to be of any value.

Colocated Query Server and Data Server#

The next step one could take is to separate the Query Server from the API, but still colocate the Query Server and the Data Server, as shown in the figure.

This is the model used by most DBMSs. It makes the DBMS highly specialised: if the external API is SQL, every access to the Data Server has to go through SQL; if the external API is LDAP, all access has to use LDAP to get to the data.

[Figure: colocated Query Server and Data Server]

Some examples of this model are Oracle, IBM DB2, MySQL/InnoDB and various LDAP server solutions. Its drawback is that it does not handle mixed workloads well, where complex queries for which a language like SQL is suitable are combined with the many simple queries that make up the bulk of the application.

The Query Server functionality requires a lot of code to manage all sorts of complex queries; it requires a flexible memory model that can handle many concurrent memory allocations from many different threads.

The Data Server is much more specialised software; it handles a few rather simple operations, and the code to handle those is much smaller than the Query Server functionality.

Thus even in this model the availability argument for separating the Query Server and the Data Server is a strong one. Separating them means that there is no risk that flaws in the Query Server code corrupt the data. The high availability requirements on NDB led to a choice of model where we wanted to minimise the amount of code that has direct access to the data.

Most DBMSs use SQL or some other high-level language for access. The translation from SQL to low-level primitives adds a fairly high overhead. The requirement of high throughput for primary key operations in NDB meant that it was important to provide an access path to data that didn't require going through a layer like SQL.

These two arguments were strong advocates for separating the Query Server and Data Server functionality.

Thus RonDB, although designed in the 1990s, was designed as a key-value database with SQL capabilities as added value.

RonDB Model#

[Figure: the RonDB model]

The above reasoning led to the conclusion to separate the Query Server facilities from the Data Server facility. The Data Server provides a low-level API where it is possible to access rows directly through a primary key or unique key and it is possible to scan a table using a full table scan or an index scan on an ordered index. It is possible to create and alter meta data objects. It is possible to create asynchronous event streams that provide information about any changes to the data in the Data Server. The Data Server will handle the recovery of data, it will handle the data consistency and replication of data.

This ensures that the development of the Data Server can be focused and there is very little interaction with the development of the Query Server parts. It does not have to handle SQL, LDAP, query optimisation, file system APIs or any other higher level functionality required to support the Query Server functionality. This greatly diminishes the complexity of the Data Server.

Later on we have added capabilities to perform a bit more complex query executions to handle complex join operations.

Benefits of RonDB Model#

The figure of the RonDB model above shows how, with a network API between the Data Server and the Query Server, it is possible to use SQL as the API, to access the Data Server API directly, and to use an LDAP API towards the same Data Servers that also serve SQL, applications using the Data Server API directly, and file system services on top of RonDB. This provides flexible application development.

The nice thing about the development of computers is that networking bandwidth keeps increasing. In addition it is possible to colocate the Query Server process and the Data Server process on the same machine. In this setup the integration of the Query Server and the Data Server is almost at the same level as when they are part of the same process. Some DBMSs use the colocated model, but still use multiple processes to implement the DBMS.

This separation of the Data Server and the Query Server has been essential for RonDB. It means that you can use RonDB to build an SQL server, an LDAP server, various networking servers, metadata servers for scalable file systems, a file server, and even a distributed block device handling petabytes of data. We will discuss all those use cases and a few more later on in this book.

Drawbacks of RonDB Model#

There is a drawback to the separation of Query Server and Data Server. To query the data we have to pass through several threads, which increases the latency of data access when you use SQL. A combined Query Server and Data Server implementing an SQL engine will always be faster on a single DBMS server. RonDB is, however, designed for a distributed environment with replication and sharding from the ground up.

We have worked hard to optimise SQL access in the single server case, and we're continuing this work. The performance is now almost on par with what can be achieved with a combined Query Server and Data Server.

Understanding the RonDB Model#

An important part of the research work was to understand the impact of the separation of the Data Server and the Query Server.

Choosing a network-based protocol as the Data Server API was a choice to ensure the highest level of reliability of the DBMS and its applications. We have a clear place where we can check the correctness of the API calls and only through this API can the application change the data in the Data Server.

Another reason that made it natural to choose a network protocol as API was that the development of technologies for low latency and high bandwidth had already started in the 1990s. The first versions of NDB Cluster had an SCI transporter as the main transporter, which ensured that communication between computers could happen in microseconds. TCP/IP sockets have since replaced it, as SCI and Infiniband now support TCP/IP socket interfaces that are more or less as fast as direct use of SCI and Infiniband.

One more reason for using a network protocol as API is that it enables us to build scalable Data Servers.

As part of the research we looked deeply into the next generation of mobile networks and their use of network databases. From these studies it was obvious that the traffic part of the applications almost always issued simple queries, mostly key lookups, with slightly more complex queries in some cases. Thus the requirements pointed clearly at developing RonDB as a key value database.

At the same time there are management applications in the telecom network; these applications often use more complex queries and almost always use some sort of standard access method such as SQL.

Normally the traffic queries have strict requirements on response times whereas the management applications have less strict requirements on response times.

It was clear that there was a need for both a fast path to the data as well as standard APIs used for more complex queries and for data provisioning.

From this it was clear that it was desirable to have a clearly defined Data Server API in a telecom DBMS to handle most of the traffic queries. Thus we can develop the Data Server to handle real-time applications while at the same time supporting applications that focus more on complex queries and analysing the data. Applications requiring real-time access can use the same cluster as applications with less requirements on real-time access but with needs to analyse the data.

How we did it#

To make it easier to program applications we developed the C++ NDB API that is used to access the RonDB Data Server protocol.

The marriage between MySQL and NDB Cluster was a natural one since NDB Cluster had mainly focused on the Data Server parts and by connecting to the MySQL Server we had a natural implementation of the Query Server functionality. Based on this marriage we have now created RonDB as a key value database optimised for the public cloud.

The requirements on fast failover times in telecom DBMSs made it necessary to implement RonDB as a shared nothing DBMS.

The Data Server API has support for storing relational tables in a shared-nothing architecture. The methods available in the RonDB Data Server API are key-value access for read and write, and scan access using full table scans and ordered index scans. In 7.2 we added functionality to push down join execution into the Data Server API. There is a set of interfaces to create and drop tables, create and drop foreign keys, alter a table, add and drop columns, add and drop indexes and so forth. There is also an event API to track data changes in RonDB.

To decrease the amount of interaction we added an interpreter to the RonDB Data Server. It can be used for simple pushdown of filters, it can perform simple update operations (such as incrementing a value) and it can handle LIKE filters.
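
One place where this interpreter is visible from SQL is engine condition pushdown: when a filter such as a LIKE predicate can be evaluated by the data nodes, EXPLAIN typically reports a pushed condition in the Extra column (table and column names are illustrative):

mysql> EXPLAIN SELECT * FROM t1 WHERE name LIKE 'abc%';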

Advantages and Disadvantages#

What have the benefits and disadvantages of these architectural choices been over the 15 years NDB has been in production usage?

One advantage is that RonDB can be used for many different things.

One important use case is what it was designed for. There are many good examples of applications written against one of the direct NDB APIs to serve telecom applications, financial applications and web applications, while still being able to access the same data through an SQL interface in the MySQL Server. These applications use the performance advantage that makes it possible to scale applications to as much as hundreds of millions of operations per second.

Another category in popular use with NDB is implementing an LDAP server as a Query Server on top of the NDB Data Server (usually based on OpenLDAP). This would have been difficult using a Query Server API, since the Query Server adds a significant overhead to simple requests.

The latest example is using the Data Server to implement a scalable file system. This has been done by HopsFS, which replaces the NameNode in Hadoop HDFS with a set of name servers that use a set of NDB Data Servers to store the actual metadata. Most people that hear about such an architecture being built on something with MySQL in the name will immediately think of the overhead of using an SQL interface to implement a file system. But it isn't the SQL interface which is used; it is implemented directly on top of the Java implementation of the NDB API, ClusterJ.

The final category is using RonDB with the SQL interface. There are many applications that want to get the high availability of RonDB but still want to use the SQL interface. There are many pure SQL applications that see the benefit of the data consistency of MySQL Servers using RonDB and the availability and scalability of RonDB.

The disadvantage is that DBMSs with a more direct API between the Query Server and the Data Server benefit from not having to go over a network API to access their data. With RonDB you pay this extra cost to get higher availability, more flexible access to your data, higher scalability and more predictable response times.

This cost was a surprise to many early users of NDB. We have worked hard since the inception of NDB Cluster into MySQL to ensure that the performance of SQL queries is as close as possible to that of colocated Query Server and Data Server APIs. We've gotten close and we are continuously working on getting closer. Early versions lacked many optimisations, such as batching, and many more important optimisations have been added over the years.

The performance of RonDB is today close to the performance of MySQL/InnoDB for SQL applications, and as soon as some of the parallelisation abilities of RonDB are made use of, the performance is better.

By separating the Data Server and the Query Server we have made it possible to work on parallelising some queries in an easy manner. This makes the gap much smaller and for many complex queries RonDB will outperform local storage engines.

Predictable response time requirements#

As part of Mikael Ronström's Ph.D thesis, databases in general were studied, with a focus on what issues DBMSs have when executing on top of an OS. What was discovered was that a DBMS spends most of its time handling context switches, waiting for disks and performing various networking operations. Thus a solution was required that avoided the overhead of context switches between different tasks in the DBMS while at the same time integrating networking close to the operations of the DBMS.

When analysing the requirements for predictable response times in NDB Cluster based on its usage in telecom databases two things were important. The first requirement is that we need to be able to respond to queries within a few milliseconds (today down to tens of microseconds). The second requirement is that we need to do this while at the same time supporting a mix of simple traffic queries combined with a number of more complex queries.

The first requirement was the main requirement that led to NDB Cluster using a main memory storage model with durability on disk using a REDO log and various checkpoints.

The second requirement is a bit harder to handle. In a large environment with many CPUs it can be solved by letting the traffic queries and the management queries run on different CPUs. This model will not work at all in a confined environment with only 1-2 CPUs, and it is hard to get it to work in a large environment since the load from management queries comes and goes quickly.

The next potential solution is to simply leave the problem to the OS. Modern OSs use a time-sharing model. However, each time quantum is fairly long compared to our requirement of responding within parts of a millisecond.

Another solution would be to use a real-time operating system, but this would make the product too limited. Even in the telecom application space, real-time operating systems are mostly used in the access network.

There could be simple transactions with only a single key lookup; number translation isn't much more than such a simple key lookup query. At the same time, most realistic transactions look more like the one below, where there are a number of lookups and a number of index scans as part of the transaction, and some of those lookups do updates.

[Figure: a transaction consisting of several key lookups and index scans]

There are complex queries that analyse data (these are mostly read-only transactions); these could do thousands or more of these lookups and scans, all executing as part of one single query. It would be hard to provide predictable response times if these were mixed using normal threads with time-sharing in the OS.

Most DBMSs today rely on the OS to handle the requirements on response times. As an example, if one uses MySQL/InnoDB and sends various queries to the MySQL Server, some traffic queries and some management queries, MySQL will use a different thread for each query. MySQL will deliver good throughput in the context of varying workloads, since the OS will use time-sharing to fairly split the CPU usage among the various threads. It will not be able to handle response time requirements of parts of a millisecond with a mixed load of simple and complex queries.

Most vendors of key value databases have now understood that the threading model isn't working for key value databases. However RonDB has an architecture that has slowly improved over the years and is still delivering the best latency and throughput of the key value databases available.

AXE VM#

When designing NDB Cluster we wanted to avoid this problem. NDB was initially designed within Ericsson. At Ericsson a real-time telecom switch, the AXE, had been developed in the 1970s. The AXE is still in popular use today and new versions of it are still being developed. AXE had a solution to this problem which was built around a message passing machine.

AXE had its own operating system and, at the time, its own CPUs. The computer architecture of the AXE was very efficient, and the reason behind this was exactly that the cost of switching between threads was almost zero. It achieved this by using asynchronous programming with modules called blocks and messages called signals.

Mikael Ronström spent a good deal of the 90s developing a virtual machine for AXE called AXE VM that inspired the development of a real product called APZ VM (APZ is the name of the CPU subsystem in the AXE). This virtual machine was able to execute on any computer.

The AXE VM used a model where execution was handled as execution of signals. A signal is simply a message; it contains an address label, a signal number and data of various sizes. A signal is executed inside a block; a block is a self-contained module that owns all its data, and the only way to get to the data in the block is by sending a signal to the block.

The AXE VM implemented a real-time operating system inside a normal operating system such as Windows, Linux, Solaris or Mac OS X. This solves the problem with predictable response times, gives low overhead for context switching between the execution of different queries, and makes it possible to integrate networking close to the execution of signals.

The implementation of RonDB borrows ideas from the AXE architecture and implements an architecture with a number of blocks and signals sent between those blocks. A nice side effect of this is that the code is using a message passing oriented implementation. This is nice when implementing a distributed database engine.

The first figure shows how RonDB uses blocks with its own data and signals passed between blocks to handle things. This was the initial architecture.

[Figure: blocks and signals in the initial NDB architecture]

The figure below shows what the full RonDB architecture looks like in the data nodes. It uses the AXE architecture to implement a distributed message passing system with blocks and a VM (virtual machine). The VM implements message passing between blocks in the same thread, between blocks in the same process and between blocks in different processes; it can handle different transporters for the messages between processes (currently only TCP/IP sockets), scheduling of signal execution, and interfacing with the OS for CPU locking, thread priorities and various other things.

We have further advanced this such that the performance of a data node is now much higher; it can nicely fill up a 16-core machine and a 32-core machine and, with some drop in scalability, even a 48-core machine. The second figure shows the full model used in RonDB.

[Figure: the full RonDB data node architecture]

A signal can be local to the thread, in which case the virtual machine will discover this and put it into the scheduler queue of the thread itself. It could be a message to another block in the same node but in a different thread; in this case we have highly optimised lock-free code, using memory barriers, to transfer the message to the other thread and place it into the scheduler queue of that thread. A message could also be targeted at another node; in this case we use the send mechanism to send it over the network using a TCP/IP socket.

The target node could be another data node, a management server or a MySQL Server. The management server and the MySQL Server use a different model where each thread is its own block. The addressing label contains a block number and a thread id, but this label is interpreted by the receiver; only the node id is used on the send side to determine where to send the message. In the data node a receive thread will receive the message and transport it to the correct block using the same mechanism used for messages to other threads.

Another positive side effect of basing the architecture on the AXE was that we inherited an easy model for tracing crashes. Signals go through so-called job buffers. Thus we have access to a few thousand of the last executed signals in each thread, including the data of the signals. This trace comes for free. In addition we borrowed another concept called Jump Address Memory (JAM), which records the last jumps that the thread did just before the crash. This is implemented through a set of macros that we use in a great number of places in the code. This has some overhead, but the payback is huge in terms of fixing crashes. It means that we can get informative crash traces from users that consume much less disk space than a core file would, and still deliver more useful information than the core files would. The overhead of generating those jump addresses is on the order of 10-20% in the data node threads. This overhead has paid off in making it a lot easier to support the product.

The AXE VM had a lot of handling of the language used in AXE called PLEX. This is no longer present in RonDB, but RonDB is still implemented using signals and blocks. The blocks are implemented in C++; in the AXE VM it was possible to have such blocks, and they were called simulated blocks. In NDB all blocks are nowadays simulated blocks.

How does it work#

How does this model enable response times of down to parts of a millisecond in a highly loaded system?

First of all it is important to state that RonDB handles this.

There are demanding users in the telecom, networking and financial sectors, and lately in the storage world, that expect to run complex transactions involving tens of different key lookups and scan queries and expect these transactions to complete within a few milliseconds at 90-95% load in the system.

For example, in the financial sector missing the deadline might mean that you miss the opportunity to buy or sell some stock equity in real-time trading. In the telecom sector your telephone call setup and other telco services depend on immediate responses to complex transactions.

At the same time these systems need to be able to analyse the data in real-time. These queries have less demanding response time requirements, but running them isn't allowed to impact the response time of the traffic queries.

The virtual machine model implements this by using a design technique where each signal is only allowed to execute for a few microseconds. A typical key lookup query on modern CPUs takes less than two microseconds to execute. Scanning a table is divided into scanning a few rows at a time, where each such scan step takes less than ten microseconds. All other maintenance work, such as handling restarts, node failures, aborts, creating new tables and so forth, is implemented with the same requirements on signal execution.

Thus a typical traffic interaction is normally handled by one key lookup or a short scan query, after which the response is sent back to the API node. A transaction consists of a number of such interactions, normally on the order of tens of such queries. Each interaction therefore needs to complete within 100-200 microseconds in order to keep transaction response times at a few milliseconds.

RonDB can meet this response time requirement even when 20-30 messages are queued up ahead of our message, given that each message only takes on the order of 1-2 microseconds to execute. Most of the time is thus still spent in the transporter layer, sending and receiving the messages.

A complex query executes in this model by being split into many small signal executions. Each time a signal completes, it puts itself back into the signal queue and waits for its next turn.
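A minimal sketch of this pattern is shown below, assuming a simple per-thread job queue. The names (Scheduler, ScanContext, SCAN_BATCH) are invented for illustration; the real RonDB scheduler and scan protocol are far more elaborate.

```cpp
// Hypothetical sketch: a long scan splits itself into short executions so that
// other signals in the queue get their turn every few microseconds.
#include <algorithm>
#include <cstdint>
#include <deque>
#include <functional>

struct Scheduler {
  std::deque<std::function<void()>> jobs;   // stand-in for the scheduler queue
  void post(std::function<void()> f) { jobs.push_back(std::move(f)); }
  void run() {
    while (!jobs.empty()) {
      auto job = std::move(jobs.front());
      jobs.pop_front();
      job();                                // each job is kept to microseconds of work
    }
  }
};

struct ScanContext { uint64_t nextRow = 0; uint64_t lastRow = 100000; };

constexpr uint64_t SCAN_BATCH = 16;         // a few rows per signal execution

void scanStep(Scheduler &sched, ScanContext ctx) {
  const uint64_t end = std::min(ctx.nextRow + SCAN_BATCH, ctx.lastRow);
  for (uint64_t row = ctx.nextRow; row < end; row++) {
    /* examine one row; bounded, short work */
  }
  ctx.nextRow = end;
  if (ctx.nextRow < ctx.lastRow)            // not done: requeue and wait for next turn
    sched.post([&sched, ctx] { scanStep(sched, ctx); });
}

int main() {
  Scheduler sched;
  sched.post([&sched] { scanStep(sched, ScanContext{}); });
  sched.run();   // short traffic queries posted in between would interleave with the scan
  return 0;
}
```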

Traffic queries will thus always be able to meet strict requirements on response time. Another nice thing about this model is that it adapts to varying workloads within a few microseconds. If there are currently no traffic queries to execute, the complex query gets the CPU to itself, since the next signal executes immediately after being put on the queue.

In the figure below we show an example of how the execution of a primary key read operation might be affected by delays in the various steps of the lookup. In this example (the execution times are rough estimates from an example installation) the total latency to execute the primary key lookup query is 19.7 microseconds plus the time for a few thread wakeups. The example represents a load of around 80%. We will go through in detail later in this book what happens during the execution of a transaction.

The figure clearly shows the benefits of avoiding the wakeups. Most of the threads in the RonDB data nodes can spin before going back to sleep. By spinning for a few hundred microseconds in the data nodes, the latency of requests goes down since most of the wakeup times disappear. Decreasing the wakeup times also benefits query execution time, since the CPU caches are warmer when we come back to continue executing a transaction.

In benchmarks we often see performance scaling better than linearly when going from one to four threads. The reason is that with a few more active transactions the chance of finding threads already awake increases, and thus the latency incurred by wakeups decreases as load increases.

image

We opted for a functional division of the threads. We could have put everything into one thread type that handles it all; this would decrease the number of wakeups needed to execute one lookup, but it would also decrease scalability. This is why RonDB instead implements an adaptive CPU spinning model that decreases the wakeup latency, as sketched below.
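The sketch below illustrates spin-before-sleep in its simplest form. The actual adaptive algorithm in RonDB tunes the spin time based on observed load, whereas this sketch just uses a fixed, caller-supplied spin budget; the ThreadWaiter type and its members are invented for illustration.

```cpp
// Hypothetical sketch of spin-then-sleep waiting; not the RonDB implementation.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

struct ThreadWaiter {
  std::mutex m;
  std::condition_variable cv;
  std::atomic<bool> signalled{false};

  // Spin for up to spin_us microseconds hoping that work arrives, then fall
  // back to a blocking wait (which is where the wakeup latency is paid).
  void waitForWork(int spin_us) {
    auto deadline = std::chrono::steady_clock::now() +
                    std::chrono::microseconds(spin_us);
    while (std::chrono::steady_clock::now() < deadline) {
      if (signalled.exchange(false)) return;   // work arrived while spinning
    }
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [this] { return signalled.load(); });
    signalled = false;
  }

  void notifyWork() {
    { std::lock_guard<std::mutex> lk(m); signalled = true; }
    cv.notify_one();
  }
};

int main() {
  ThreadWaiter w;
  std::thread producer([&w] { w.notifyWork(); });
  w.waitForWork(200);                          // spin up to ~200 microseconds
  producer.join();
  return 0;
}
```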

There is one more important aspect of this model. As load increases two things happen. First, we execute more and more signals each time we have received a set of signals, which means that the overhead of collecting each signal decreases. Second, executing larger sets of signals means that we send larger packets, so the cost per packet decreases. RonDB data nodes thus execute more and more efficiently as load increases, an important characteristic that avoids many overload problems. A toy model of this effect is sketched below.
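The toy model below uses completely made-up numbers; the point is only that a fixed per-round overhead gets amortised over however many signals arrived, so the cost per signal shrinks as the batches grow.

```cpp
// Toy model only: the fixed cost of a poll/receive round is amortised over the
// number of signals that arrived in that round.
#include <cstdio>

int main() {
  const double poll_overhead_us = 3.0;   // assumed fixed cost per receive round
  const double per_signal_us    = 1.5;   // assumed execution cost of one signal

  for (int batch : {1, 4, 16, 64}) {
    double cost_per_signal = per_signal_us + poll_overhead_us / batch;
    std::printf("%2d signals per round -> %.2f us per signal\n",
                batch, cost_per_signal);
  }
  return 0;
}
```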

In the above example this means that as load increases the number of wakeups decreases: we spend more time in queues, but at the same time less time waiting for wakeups. The response time is therefore kept almost stable until we reach a high load. The latency of requests only starts to increase dramatically at around 90-95% load; up to 90% load the latency is stable and you get very stable response times.

The functional division of threads makes RonDB data nodes behave better than if everything was implemented as one single thread type. If the extra latency from wakeups is a concern, one can use the spinning feature discussed above. This has the effect that we can get very low response times in a lightly loaded cluster: response times down to below 30 microseconds for most operation types. Update transactions take a bit longer since they have to visit each node several times before the transaction is complete, a minimum of 5 times for a simple update transaction and a minimum of 7 times for a simple update transaction on a table that uses the read backup feature (explained in a later chapter).

The separation of Data Server and Query Server functionality makes it possible to use different Query Servers for traffic queries than for complex queries. In the RonDB model this means that you can use one set of MySQL Servers in the cluster to handle short real-time queries and a different set of MySQL Servers to handle complex queries. With a proper cluster configuration RonDB can thus meet real-time requirements also when executing SQL queries.

The separation of Data Server and Query Server functionality might mean that RonDB has a slightly longer minimum response time compared to a local storage engine in MySQL, but RonDB will continue to deliver low and predictable response times under varying workloads and at high loads.

One experiment done while developing the pushdown join functionality showed that the latency of pushed-down joins was the same in an idle cluster as in a cluster performing 50,000 update queries per second concurrently with the join queries.

We have now built a benchmark setup that makes it very easy to run an OLTP benchmark and an OLAP benchmark at the same time in RonDB.

RonDB and multi-core architectures#

In 2008 the MySQL team started working hard on the challenge that modern processors have more and more CPUs and CPU cores. Before that the normal server computer had 2, and sometimes 4, CPUs, and MySQL scaled well to those computers. However, as single-threaded performance became harder and harder to improve, the processor manufacturers started developing processors with many more CPUs and CPU cores. Lately we have seen the introduction of server processors from AMD, Intel and Oracle with 32 CPU cores per processor, where each CPU core can have 2-8 CPUs depending on the CPU architecture (2 for x86 and 8 for SPARC and Power).

MySQL has now developed such that each MySQL Server can scale beyond 64 CPUs, and each RonDB data node can scale to around 64 CPUs as well. Modern standard dual-socket servers come equipped with up to 128 CPUs for x86 and up to 512 CPUs for SPARC computers.

The growth in the number of CPUs per processor generation has slowed down compared to a few years ago. But both the MySQL Server and the RonDB data nodes are developed in such a fashion that we try to keep up with the continued scalability development of modern computers.

Both NDB data nodes and MySQL Server nodes can scale to at least a large x86 processor with 32 CPU cores. The RonDB architecture still makes it possible to have more than one node per computer, so it works nicely for all types of computers, even the largest ones. NDB has been tested on computers with up to 120 CPUs for x86 and up to 1024 CPUs for SPARC machines.

Shared nothing architecture#

image

Most traditional DBMSs used in high-availability setups use a shared disk approach, as shown in the figure above. Shared disk architecture means that the disks are shared by all nodes in the cluster. In reality the shared disk is a set of storage servers, implemented either in software or in a combination of hardware and software. Modern shared disk DBMSs mostly rely on pure software solutions for the storage server part.

Shared disk has advantages for query throughput in that all data is accessible from each server: each server reads the data it needs from the storage servers to perform its queries. This approach has limited scalability, but each server can be huge. In most cases the unit transported between Storage Server and Data Server is a page of some standard size such as 8 kByte.

It is very hard to get fast failover in a shared disk architecture since the global REDO log has to be replayed as part of failover handling. Only after this replay can pages that were owned by a failed node be read or written again.

This is the reason why all telecom databases instead use the shared nothing approach shown in the figure below. With shared nothing the surviving node can take over instantly; the only delay is the time it takes to discover that the node has failed. This discovery time is mostly dependent on the operating system, and a shared nothing database can handle discovery and takeover in sub-second time if the OS supports it (a rough model is sketched after the figure).

image
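As a back-of-the-envelope model, the unavailability window in a shared nothing takeover is roughly the failure detection time plus the takeover time. All numbers and parameter names below are assumptions made for illustration, not RonDB defaults; the actual detection depends on the failure detection mechanism and the operating system, as noted above.

```cpp
// Rough model with made-up parameters.
#include <cstdio>

int main() {
  const double heartbeat_interval_ms = 100.0;  // assumed heartbeat period
  const int    missed_heartbeats     = 4;      // assumed misses before declaring failure
  const double takeover_ms           = 50.0;   // assumed takeover time for the surviving node

  const double detection_ms = heartbeat_interval_ms * missed_heartbeats;
  std::printf("approximate unavailability: %.0f ms\n", detection_ms + takeover_ms);
  return 0;
}
```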

An interesting note is that shared disk implementations depend on a highly available file system connected to all nodes. The normal way to implement such a highly available storage system is to use a shared nothing approach. RonDB could itself be used to implement such a storage system; it is an intriguing fact that you could build a highly available shared disk DBMS on top of a storage subsystem implemented on top of RonDB.

RonDB and the CAP theorem#

The CAP theorem says that one can achieve at most two of the three properties Consistency, Availability and Partition tolerance, where partitioning refers to network partitioning. RonDB focuses on consistency and availability. If the cluster becomes partitioned we don't allow both partitions to continue, since this would create consistency problems. We try to achieve the maximum availability without sacrificing consistency.

We go a long way to ensure that we will survive as many partitioning scenarios as possible without violating consistency. We will describe this in a later chapter on handling crashes in RonDB.
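As a greatly simplified illustration of the principle that only one side of a partition may continue, consider a rule where a partition survives only if it can still see a strict majority of the data nodes. The real RonDB algorithm, covered in the later chapter on crash handling, is considerably more refined, so treat this purely as a sketch.

```cpp
// Simplified majority rule; not the actual RonDB partition handling algorithm.
#include <cstdio>

bool partitionMaySurvive(int nodesVisible, int totalNodes) {
  return 2 * nodesVisible > totalNodes;   // strict majority keeps running
}

int main() {
  std::printf("3 of 4 nodes visible -> survive? %s\n",
              partitionMaySurvive(3, 4) ? "yes" : "no");
  std::printf("2 of 4 nodes visible -> survive? %s\n",
              partitionMaySurvive(2, 4) ? "yes" : "no");
  return 0;
}
```

Note that with this naive rule an even split would stop both halves; handling such cases is one of the refinements discussed in the later chapter.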

However, there is more to the CAP theorem and RonDB. We support replicating from one cluster to another cluster in another part of the world. This replication is asynchronous; for replication between clusters we sacrifice consistency to improve availability in the presence of network partitioning. The figure below shows an architecture for replication between two clusters; as can be seen, it is possible to have two replication channels. Global replication is implemented using standard MySQL Replication, where replication happens in steps of groups of transactions (epochs).

image

If the network is partitioned between the two clusters (or more than two clusters), each cluster will continue to operate. Failures at this level are handled manually, typically implemented as a set of automated scripts with some manual interaction points.

RonDB thus makes it possible to focus on consistency in each individual local cluster while still providing an extra level of availability by handling partitioned networks and disasters hitting a complete region of the world. Global replication is implemented and used by many users in critical applications.

The replication between clusters can support writing in all clusters with conflict resolution. More on this in the chapter on RonDB Replication.

OLTP, OLAP and relational models#

There are many DBMSs in this world. They all strive to do similar things: take data and place it in some data structure for efficient access. DBMSs can be divided into OLTP (On-line Transaction Processing) and OLAP (On-line Analytical Processing). There are also databases focused on special data sets, such as graph databases for hierarchical and connected data, specialised DBMSs for geographical data (GIS), and a set of DBMSs focused on time-series data.

The most commonly used DBMSs are based on the relational model. Lately we have seen the development of numerous Big Data solutions that focus on massive data sets requiring scaling to thousands of computers and more. RonDB scales to hundreds of computers, but does so while maintaining consistency inside the cluster, and it provides predictable response times even in large clusters.

RonDB is a DBMS focused on OLTP. One of its most unique points is that 1 transaction with 10 operations takes a similar time to 10 transactions with 1 operation each. An OLTP DBMS always focuses on row-based storage and so does RonDB. An OLAP DBMS will often use columnar storage since it makes scanning data on modern processors much faster, but at the same time columnar storage increases the overhead of changing the data. RonDB uses row storage.

Data storage requirements#

In the past almost all DBMSs focused on storing data on disk with a page cache. In the 1990s the size of main memory grew to the point where it became interesting to keep data that always resides in memory. Nowadays most major DBMSs have a main memory option; many of those are used for efficient analytics, but some also use it for higher transaction throughput and the best possible response time.

RonDB chose to store data in main memory mainly to provide predictable response times. At the same time we have developed an option for tables to store some columns on disk with a page cache. This can be used to store a much larger data set in RonDB using SSD storage. Disk data is primarily intended for use with SSDs, since most current RonDB applications have stringent response time requirements.

RonDB and Big Data#

RonDB was designed long before Big Data solutions came around, but its design fits well into the Big Data world. It has automatic sharding of the data, and queries can access data from all shards in the cluster. RonDB is easy to work with since it uses transactions to change data, which means that the application doesn't have to consider how to resolve data inconsistencies.

The data structures in RonDB are designed for main memory. The hash index is designed to avoid CPU cache misses, and so is the access to the row storage. The ordered index uses a T-tree data structure specifically designed for main memory. The page cache uses a modern caching algorithm that, when deciding the heat level of a page, takes into account both data currently in the page cache and data that was long since evicted from it.

image

Thus the basic data structure components in RonDB are all of top-notch quality, which gives us the ability to deliver benchmark results of 200 million reads per second and beyond.

Unique selling point of RonDB#

The most important unique selling points of RonDB come from our ability to fail over instantly and to make changes online: add and drop indexes, add columns, add and drop tables, add node groups and reorganise data to use those new node groups, change software in one node at a time or several at a time, add tablespaces and add files to tablespaces. All of these changes, and more, can be done while the DBMS is up and running, and all data can still be both updated and read, including the tables currently being changed. All of this can happen even with nodes in the cluster leaving and joining. The most unique feature of RonDB I call AlwaysOn: it is designed to never go down; only bugs in the software or very complex changes can bring the system down, and even those can most of the time be handled by using MySQL replication to a backup cluster.

There are many thousands of NDB clusters up and running that have met tough availability requirements. It is important to remember that reaching such high availability requires a skilled operations team, processes to handle all types of problems that can occur, and of course 24x7 support.