Detailed RonDB Internals#
This chapter covers a few basic foundations of the software design in RonDB. We have already touched upon the fact that we use the fail-fast model for software development. This is natural when you expect that your nodes are always set up in a replicated fashion. It is not so natural when you have downtime associated with each node failure.
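As a rough illustration of the fail-fast idea, the sketch below shows a hypothetical invariant check that aborts the process as soon as an assumption is violated. The macro name and layout are illustrative and not RonDB's actual code; the model only makes sense because a crashed node is restarted and its data is replicated on other nodes.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical fail-fast check, loosely in the spirit of RonDB's internal
// invariant checks: if an invariant does not hold, crash immediately so that
// the node can be restarted from a known-good state instead of limping on.
#define FAIL_FAST_REQUIRE(cond)                                         \
  do {                                                                  \
    if (!(cond)) {                                                      \
      std::fprintf(stderr, "Invariant violated: %s (%s:%d)\n",          \
                   #cond, __FILE__, __LINE__);                          \
      std::abort();  /* fail fast: rely on replication plus restart */  \
    }                                                                   \
  } while (0)

int main() {
  int replicas = 2;
  FAIL_FAST_REQUIRE(replicas >= 1);  // passes; a violation would abort here
  return 0;
}
```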
Internal triggers in RonDB#
In Mikael Ronström's Ph.D. thesis he developed quite a few algorithms for node recovery, online schema changes and so forth. Most of those algorithms are based on a trigger approach.
In RonDB we make use of triggers for a wide range of purposes. We use triggers to maintain our ordered indexes (T-trees). Thus every change of a column that is part of an ordered index will generate an insert trigger and a delete trigger on the ordered index.
We use triggers to maintain our unique indexes. We use triggers to maintain our foreign keys. We use triggers for various metadata operations, such as building an index as an online operation.
When adding nodes we dynamically add more partitions to the tables. These partitions are built as an online operation and we use triggers to maintain the new partitions. We also use triggers to maintain fully replicated tables.
The nice thing about our triggers is that they are very well integrated with our transactional changes. RonDB performs almost every change in a transactional context. This means that using triggers makes it easy for us to perform the additional changes required by online metadata operations, index maintenance and so forth.
There are also asynchronous triggers, used to maintain the event reporting APIs that replicate changes from one cluster to another.
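The following sketch shows the shape of this idea: an update of an indexed column fires a delete trigger for the old key and an insert trigger for the new key, and the trigger work is applied as part of the same transaction. The classes and names are illustrative, not RonDB's actual internals.

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

// Conceptual sketch of trigger-based index maintenance inside a transaction.
struct OrderedIndex {
  std::multiset<int64_t> keys;  // stand-in for a T-tree
  void insertKey(int64_t k) { keys.insert(k); }
  void deleteKey(int64_t k) { keys.erase(keys.find(k)); }
};

struct Transaction {
  std::vector<std::function<void()>> triggerOps;  // deferred trigger actions
  void fireTrigger(std::function<void()> op) { triggerOps.push_back(std::move(op)); }
  void commit() {
    for (auto &op : triggerOps) op();  // trigger work commits with the change
    triggerOps.clear();
  }
};

// Updating an indexed column generates a delete trigger for the old value
// and an insert trigger for the new value on the ordered index.
void updateIndexedColumn(Transaction &trans, OrderedIndex &index,
                         int64_t oldValue, int64_t newValue) {
  trans.fireTrigger([&index, oldValue] { index.deleteKey(oldValue); });
  trans.fireTrigger([&index, newValue] { index.insertKey(newValue); });
}

int main() {
  OrderedIndex idx;
  idx.insertKey(10);
  Transaction trans;
  updateIndexedColumn(trans, idx, 10, 42);
  trans.commit();  // index now contains 42 instead of 10
  return 0;
}
```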
Transporter Model#
All our communication methods are encapsulated in a transporter. Transporters interact with the remaining parts of the virtual machine in the data nodes. The virtual machine need not worry about which technique we are using to transport the data.
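A minimal sketch of this separation is shown below: the virtual machine only sees an abstract interface and does not care whether bytes travel over TCP/IP, shared memory or some other medium. Class and method names are assumptions for illustration, not RonDB's actual API.

```cpp
#include <cstddef>
#include <memory>

// Abstract transporter interface seen by the rest of the data node.
class Transporter {
 public:
  virtual ~Transporter() = default;
  virtual bool connect() = 0;
  virtual size_t send(const void *buf, size_t len) = 0;
  virtual size_t receive(void *buf, size_t maxLen) = 0;
};

class TcpTransporter : public Transporter {
 public:
  bool connect() override { /* would set up a TCP socket */ return true; }
  size_t send(const void *, size_t len) override { return len; }
  size_t receive(void *, size_t) override { return 0; }
};

class ShmTransporter : public Transporter {
 public:
  bool connect() override { /* would map a shared memory segment */ return true; }
  size_t send(const void *, size_t len) override { return len; }
  size_t receive(void *, size_t) override { return 0; }
};

// The virtual machine sends signals through whichever transporter the
// configuration selected for a given node pair.
void sendSignal(Transporter &t, const void *signal, size_t len) {
  t.send(signal, len);
}

int main() {
  std::unique_ptr<Transporter> t = std::make_unique<ShmTransporter>();
  t->connect();
  const char signal[] = "example signal";
  sendSignal(*t, signal, sizeof(signal));
  return 0;
}
```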
Over the years we have implemented transporters for Dolphin SCI technology, shared memory, TCP/IP and a special transporter for OSE/Delta messages.
The SCI transporter is no longer needed since Dolphin technology can be used through Dolphin SuperSockets. These provide communication to another node in less than a microsecond and are even better integrated with the Linux OS than a native SCI transporter.
The shared memory transporter was revived in MySQL Cluster 7.6. This greatly improves latency when colocating the NDB API nodes and the RonDB data nodes.
RonDB also runs well on InfiniBand. Previously we used SDP in the same fashion as Dolphin SuperSockets, but now it is more common to use IPoIB (IP over InfiniBand).
Memory Allocation principles#
Historically, memory allocation was a very expensive operation. Thus, to get optimal performance, it was a good idea to avoid allocating many small chunks of memory.
This reason for managing your own memory is weaker with modern malloc implementations, which scale very well and have little overhead. However, using malloc to extend the memory of a process can still cause extra latency.
Managing your own memory is, however, still a very good idea to ensure the best possible real-time behaviour. It also greatly improves your control over the execution of the RonDB data nodes. Since RonDB data nodes allocate all their memory at startup, and can even lock this memory, other programs executing on the same machine cannot disturb the setup of the RonDB data node.
The preferred method for using RonDB is to set the LockInMainMemory parameter to 1 such that the memory is locked once allocated. RonDB data nodes allocate all their memory in the startup phase. Once a node has allocated the memory it maintains the memory itself.
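The sketch below illustrates the principle of allocating everything at startup and locking it in memory. A real data node sizes its pool from the configuration; the 1 GiB figure and the code structure here are purely illustrative.

```cpp
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
  const size_t poolSize = 1ULL << 30;  // 1 GiB pool, illustrative only
  void *pool = std::malloc(poolSize);
  if (pool == nullptr) {
    std::fprintf(stderr, "Failed to allocate memory pool at startup\n");
    return 1;  // fail fast: better to stop now than to swap later
  }
  std::memset(pool, 0, poolSize);  // touch the pages so they are really mapped
  if (mlock(pool, poolSize) != 0) {
    std::perror("mlock");  // corresponds to locking the memory once allocated
  }
  // From here on, internal data structures would be carved out of this pool
  // by the node's own memory manager; malloc is not used on the hot path.
  munlock(pool, poolSize);
  std::free(pool);
  return 0;
}
```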
Allocating memory dynamically provides a more flexible environment. It is much harder to provide the highest availability in a flexible environment. Almost every program in the world can get into hang situations when overallocating memory. Overallocation of memory leads to swapping, which slows the process down to a grinding halt. So even if the process is still considered up and running, it is not delivering the expected service.
In contrast, the RonDB data nodes take the pain at startup. Once a node is up and running, it won't release its memory, thus ensuring continued operation. Our flexibility instead comes from the fact that data nodes can be restarted with a new configuration without causing any downtime in the system.
The MySQL Server and the NDB API use a much more traditional memory allocation principle where memory is allocated and released as needs arise.
Automatic Memory Configuration#
In RonDB we have implemented automatic memory configuration. This means that we use OS information to retrieve the amount of memory available in the machine, and we use almost all of it. RonDB decides how much memory is used for various purposes. RonDB ensures that it is possible to create a large number of tables, columns, indexes and other schema objects. It also ensures that it is possible to handle fairly large transactions and many parallel transactions. The remainder of the memory is divided between the page cache for disk columns and the memory used to store the actual in-memory rows (90% of the remaining memory is used for this purpose).
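To show the shape of such a calculation, the sketch below queries the OS for total memory and splits it up. The reservations for the OS, schema objects and transaction state are assumed constants chosen only for illustration; the 90/10 split of the remainder follows the description above.

```cpp
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
  // Total physical memory reported by the OS (Linux/glibc; assumes success).
  const uint64_t pages = sysconf(_SC_PHYS_PAGES);
  const uint64_t pageSize = sysconf(_SC_PAGESIZE);
  const uint64_t totalMemory = pages * pageSize;

  const uint64_t osReserve = totalMemory / 10;       // leave memory for the OS (assumed share)
  const uint64_t schemaAndTrans = totalMemory / 10;  // schema objects + transaction state (assumed share)
  const uint64_t remaining = totalMemory - osReserve - schemaAndTrans;

  // As described above: 90% of the remainder goes to in-memory rows,
  // the rest to the page cache for disk columns.
  const uint64_t rowMemory = remaining * 9 / 10;
  const uint64_t diskPageCache = remaining - rowMemory;

  std::printf("In-memory rows:  %llu MiB\n", (unsigned long long)(rowMemory >> 20));
  std::printf("Disk page cache: %llu MiB\n", (unsigned long long)(diskPageCache >> 20));
  return 0;
}
```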
Angel process#
Automated responses to failures are important in RonDB. Our approach to solving this problem is to ensure that any node failure is followed by a node restart. We implement this by using an angel process when starting a data node.
When starting a data node, it is the angel process that is started. After reading the configuration from the management server, the angel process forks. The parent remains as the angel process that simply waits for the running data node process to fail, while the actual work of the data node is executed by the forked child process.
When the forked process stops, the angel process discovers this and restarts it. Thus a node restart is initiated immediately after the node failure.
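A simplified sketch of this fork-and-wait pattern is shown below. Real restart handling (configuration reload, restart limits, clean shutdown) is omitted, and the loop is bounded only to keep the example finite.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// Placeholder for the actual data node work done by the forked child.
static void runDataNode() {
  std::printf("data node process running, pid %d\n", getpid());
  std::exit(0);
}

int main() {
  // The real angel keeps restarting the child; here we stop after two runs.
  for (int runs = 0; runs < 2; ++runs) {
    pid_t child = fork();
    if (child == 0) {
      runDataNode();               // forked process does the actual work
    } else if (child > 0) {
      int status = 0;
      waitpid(child, &status, 0);  // angel simply waits for the child to stop
      std::printf("data node stopped (status %d), restarting\n", status);
    } else {
      std::perror("fork");
      return 1;
    }
  }
  return 0;
}
```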
It is possible to configure the cluster such that we don’t get those automatic restarts, but there is very little reason to use this configuration option.