Blocks in a Data Node#

To understand the signaling diagrams for RonDB, it is necessary to understand the various blocks in RonDB at a high level. These blocks are the execution modules inside the RonDB data nodes. An execution module must execute within one thread, but different execution modules can be colocated within one thread in a variety of manners. The configuration variable ThreadConfig provides a great deal of flexibility in configuring the placement of execution modules into threads.

The code also documents some of the most important signaling flows in RonDB. A few examples of this are:

Block Ndbcntr (file NdbcntrMain.cpp): Contains a long description of how a restart works
Block Qmgr (file QmgrMain.cpp): Contains a signaling diagram for how the registration of new nodes into the cluster at startup happens; also describes how this is the base for heartbeat signals
Block Dbdih (file DbdihMain.cpp): Contains a long description of how ALTER TABLE REORGANIZE PARTITIONS works using a detailed signaling flow

Hopefully, the following explanation aids in understanding the descriptions in the code. This especially holds for anyone interested in understanding the details of complex online metadata changes.

LDM Blocks#

LDM stands for Local Data Manager. It is a set of blocks that cooperate to deliver the local database service that is a very central part of RonDB. In ndbmtd these blocks all operate inside the same ldm thread. It is not possible to break those blocks apart and execute them in different threads. It is however possible to have several instances of ldm threads. Each ldm thread handles one instance of each of the below blocks.

Being data managers, data ownership is divided up by LDM instances. Thereby, the LDM instances can also take care of the parts of the REDO log. In RonDB we can have fewer REDO log parts than the number of LDM instances. In this case, the REDO log parts will be handled by the first LDM instances, but all LDM instances will write to the REDO log controlled through a mutex on the REDO log part.

DBLQH#

DBLQH stands for Database Local Query Handler. This block is where most interactions with the LDM blocks start. It does handle the REDO log, but its main responsibility is to interact with the other blocks operating the tuple storage, hash index, ordered index, page cache, restore operations, backup and local checkpoint services.

The most common signal DBLQH receives is the LQHKEYREQ, which is used for reading or writing a row using a primary key of a table (either a user table or a unique index table or a BLOB table). To serve this query it uses DBACC to lookup in the hash table and DBTUP. The latter has the responsibility of the tuple storage and performs the operation requested by the user.

Scan operations are the second most common operation. This means receiving a SCAN_FRAGREQ signal to either run a full table scan (implemented by either DBTUP) or a range scan (implemented by DBTUX).

DBLQH also controls the execution of local checkpoints together with the BACKUP block.

One can say that DBLQH is the service provider for the LDM blocks that makes use of internal services delivered by the other blocks.

DBACC#

DBACC stands for Database Access Manager. DBACC has two main functions. Firstly, it controls the local part of the distributed hash index that every primary key uses. Every table in RonDB, including unique index and BLOB tables, has a primary key. Even tables without a defined primary key use a hidden primary key.

Secondly, the hash table in DBACC also acts as the locking data structure to ensure that rows are locked properly by user operations.

DBTUP#

DBTUP means Database Tuple Manager. DBTUP is where the actual in-memory rows are stored using the fixed row parts and the variable-sized row parts. DBTUP can execute all sorts of operations requested by the higher levels of software in RonDB. Thus this is where the actual reads and writes of data happen. DBTUP also contains a fairly simple interpreter that can execute several simple statements that can be used to push down condition evaluation to RonDB. It also reads and writes the disk columns delivered by PGMAN.

DBTUX#

DBTUX means Database Tuple Index. It contains an implementation of a T-tree. The T-tree is an index specifically developed to support ordered indexes in main memory. Every update of a row involving an ordered index will perform changes of the T-tree and it is used heavily during restart when the T-trees are rebuilt using recovered rows.

The DBTUX software has been very stable for a long time. It does what it is supposed to do and there is very little reason to change it. Most new features developed are done at a higher service level in RonDB and DBTUP has lots of capabilities that interact with higher service levels in RonDB.

PGMAN#

PGMAN stands for Page Manager and it handles the page cache for disk pages used for disk columns. Each ldm thread has a part of the page cache. It contains a state-of-the-art page caching algorithm that is explained in comments for the get_page function in this block.

BACKUP#

The BACKUP block is responsible for writing local checkpoints and backup information to files. It does so by using a full table scan using the DBTUP block to ensure that all rows are analyzed to see if they should be written into this local checkpoint or not. The local checkpoints in RonDB use a mechanism called Partial Local Checkpoint (Partial LCP). This mechanism was introduced in MySQL Cluster 7.6 and provides the ability to write checkpoints that only write a part of the in-memory data. This is an extremely important feature to enable the use of RonDB with large amounts of memory. The previous implementation made every checkpoint a full checkpoint. With terabytes of data in memory, this is no longer a feasible approach.

The Partial LCP uses an adaptive algorithm to decide on the size of each checkpoint of each table fragment. The full table scan in DBTUP for checkpoints uses a special mechanism to quickly go through the database to find those rows that have changed since the last checkpoint. Pages that are not written to since the last checkpoint will be quickly discovered and skipped. The partial checkpoint means that a subset of the pages will be written fully and a subset will only write changed rows. The number of pages to write fully depends on how much write activity has occurred in the table fragment since the last checkpoint.

The Partial LCP has to handle dropped pages and new pages properly. The code in DbtupScan.cpp contains a long description of how this works together with proof that the implementation works.

RESTORE#

The RESTORE block is only used during the early phases of a restart when it is used to restore a local checkpoint. The local checkpoint contains saved in-memory rows that are written back into DBTUP at recovery using more or less the same code path as a normal insert or write operation would use in RonDB.

With Partial LCP we have to write several checkpoints back in time to ensure that we have restored the checkpoint properly. Thus this phase of the recovery takes a slightly longer time. However, the Partial LCP speeds up the checkpoints so much that the cost of executing the REDO log decreases substantially. Thus recovery with Partial LCP can go up 5x faster compared to using full checkpoints.

Query worker blocks#

Each query worker contains the blocks DBQLQH, DBQACC, DBQTUP, DBQTUX, QBACKUP and QRESTORE. These are the same blocks as in the LDM threads, but have a variable m_is_query_thread that is set to true.

Query workers only handle Read Committed queries. Any thread acting as a query worker can handle queries to any table fragment. However, when routing the requests to query workers we ensure that they only execute queries for the same Round Robin Group. Round Robin groups strive to share a common L3 cache to ensure that we make good use of CPU caches and TLB data.

TC blocks#

TC stands for Transaction Coordinator. These blocks handle a small but important part of the RonDB architecture. There are only two blocks here: DBTC and DBSPJ. All transactions pass through the TC block and for pushdown joins, the DBSPJ block is used as the join processor.

DBTC#

The most common signaling interface to DBTC is TCKEYREQ and TCINDXREQ. The TCKEYREQ is used for primary key operations towards all tables. DBTC works intimately with the DBDIH block, DBDIH contains all distribution information. DBTC queries this information for all queries to ensure that we work with an up-to-date version of the distribution information. TCINDXREQ handles unique key requests.

The distribution information in DBDIH is used from all TC blocks and requires special handling of concurrency. Since it is read extremely often and updates to it are rare, we used a special mechanism where we read an atomic variable before accessing the data, we read the same variable again after accessing the data. If the variable has changed since we started reading, we will retry and read the data once again. This requires special care in handling errors in the read processing.

DBTC uses the DBLQH service provider interface to implement transactional changes. DBTC ensures that transactional changes are executed in the correct order. This includes handling triggers fired when executing change operations. These triggers are fired to handle unique indexes, foreign keys and table reorganizations. The triggers are often fired in DBTUP and sent to DBTC for processing as part of the transaction that fired the change.

Scan operations and pushdown join operations are started through the SCAN_TABREQ signal. Scan operations on a single table are handled by DBTC and pushdown joins are handled by DBSPJ.

DBTC handles timeouts to ensure that we make progress even in the presence of deadlocks.

If a data node fails, a lot of ongoing transactions will lose their transaction state. To be able to complete those transactions (by either aborting or committing them) DBTC can build up the state again by asking DBLQH for information about ongoing transactions from the failed node. It is the first tc thread in the master data node that will take care of this. The master data node is the oldest one in the cluster.

DBSPJ#

DBSPJ stands for Database Select-Project-Join. SPJ is a popular phrase in discussing join queries in databases. The DBSPJ blocks implement the linked join operations that are used in RonDB to implement a form of parallel query. DBSPJ has no function for anything apart from join queries.

It implements the joins by sending SCAN_FRAGREQ for scans to DBLQH and LQHKEYREQ to DBLQH for primary key lookups. The results of the scans and key lookups are sent directly to the NDB API. These also get some keys that make it possible for the NDB API to put together the result rows. Some parts of the reads can also be sent back to DBSPJ when the result from one table is required as input to a later table in the join processing.

Main thread blocks#

There are two execution modules used for various things, this is the main and the rep execution module. Here we will first go through the blocks found in the main execution module. These blocks all only have one instance. There are also a few proxy blocks. Proxy blocks are a special kind of block that was introduced to handle multiple block instances. The proxy blocks implement some functionalities such that it is possible to send a signal to a block and have it executed on all block instances of e.g. the DBLQH block. This helped change the code from the single-threaded version to the multi-threaded version.

DBDICT#

DBDICT stands for Database Dictionary. It contains metadata about tables, columns, indexes, foreign keys and various internal metadata objects.

DBDICT implements a framework for implementing any type of schema transaction. A schema transaction always tracks its progress. If a failure happens we will either roll the schema transaction backwards or forwards. If it has passed a point where it can no longer be rolled back, we will ensure that it will be completed even in the presence of node crashes and even in the presence of a cluster crash. Schema transactions cannot contain any normal user transactions.

DBDICT implements this by executing a careful stepwise change of each schema transaction. Thus the schema transaction is divided into many smaller steps. Each of those steps is applied in all nodes in parallel. After completing the steps we record the information about the step we have completed in a file flushed to disk. This means that schema transactions in RonDB are not fast, but it makes them capable of handling very complex changes. However, introducing new types of schema changes still requires writing new functions and verifying that the new schema change variant works well.

DBDICT has an important role in the startup of a node where it decides which tables to restore. The node asks DBDIH to perform the actual recovery and will assist DBDIH in some of those recovery steps.

DBDIH#

DBDIH stands for Database Distribution Handler. This is one of the most important parts of RonDB. In DBDIH we maintain all the distribution information about all tables in the cluster. It contains algorithms to ensure that this information is synchronized with all other nodes in the cluster.

This distribution information is updated by schema changes, node starts and node failures. Thus for a specific table, the information is very stable. Days and even years can pass without it being updated. To handle this we have used a mechanism similar to the RCU mechanism used in the Linux kernel. The idea is that when reading one performs the following, before starting to read we check if the data is currently being changed (checked by reading a flag on the table). If it isn’t currently being changed we go on and read the information, but we record the flags we read before reading it. If it was currently being changed we will simply loop until the data isn’t being updated. After reading the data we will check if the data have changed since we read (checked by again reading the table flags). If the data didn’t change we’re done, if it changed, we restart the read once again.

This procedure is used to protect several variables regarding data distribution maintained by DBDIH. All the code to handle these protected regions is assembled in DBDIH to make it easier to follow this part of the code.

DBDIH runs the global checkpoint protocol, which generates recoverable checkpoints on a cluster level. DBDIH also runs the local checkpoint protocol. DBDIH gathers information during the local checkpoint process that is used in restarts. With the implementation of Partial LCP most of the handling of checkpoints is handled locally in the DBLQH block. However, DBDIH is still used to control the flow of checkpoints although some information it retrieves from DBLQH isn’t used anymore.

TRIX#

TRIX stands for Trigger Execution. It is involved in handling online index builds, copying data during table reorganizations and online foreign key builds. It is a block that assists DBDICT by performing several important operations to move data around to implement complex online schema changes.

QMGR#

QMGR stands for Cluster Manager with a little play on pronunciation. This block is responsible for the heartbeat protocol in RonDB. It is the lowest level of cluster management in RonDB. It is involved in early start phases to register a new node into the cluster and maintains the order used to decide which node is the master node (it is the oldest node registered). Nodes are registered one at a time and all nodes are involved in this registration, thus all nodes agree on the age of a node. Next, we decide that the oldest node is always chosen as the new master node whenever a master node fails.

NDBCNTR#

NDBCNTR stands for Network Database Controller. From the beginning, this block was the block responsible for the database start of the data node. This is still true to a great extent.

The original idea with the design was that QMGR was a generic block that handled being part of the cluster. The NDBCNTR block was specifically responsible for the database module. The idea in the early design was that there could be other modules as well. Nowadays QMGR and NDBCNTR assist each other in performing important parts of the cluster management and parts of the restart control. Mostly NDBCNTR is used for new functionality and QMGR takes care of the lower level functionality which among other things relates to setting up connections between nodes in the cluster.

CMVMI#

CMVMI stands for Cluster Manager Virtual Machine Interface. This block was originally a block that was used as a gateway between blocks implemented in PLEX and blocks implemented in C++. Nowadays and for 21 years all blocks are C++ blocks. It now implements a few support functions for other blocks.

DBUTIL#

DBUTIL stands for Database Utility. This block implements a signal-based API towards DBTC and DBLQH. It is a support block to other blocks like NDBCNTR, TRIX and so forth to execute database operations from the block code.

NDBFS#

NDBFS stands for Network Database File System. It implements a file system API using signals. It was developed to ensure that file system accesses could use the normal signaling protocols. Given that all the rest of the code in the RonDB data nodes is asynchronous and uses signals between blocks, it was important to move also the RonDB file system accesses into this structure. Not every OS has an asynchronous API towards their filesystem, we implemented the interface to the file system through many small threads that each can do blocking file system calls. But they interact with the main execution module whenever they are requested to perform a file operation and the threads send back information to the main execution modules when the file operation is completed.

The APIs implemented towards NDBFS support several different models. The various uses of a file system are rather different for the functions in RonDB. Some functions simply need to write a file of random size, others work with strict page sizes and so forth.

Proxy blocks#

The main execution module handles the proxy blocks for DBTC and DBSPJ.

Conclusion#

The main execution module handles a variety of things. For the most part, it is involved heavily in restarts and metadata operations. NDBFS operations and checkpoint operations in DBDIH are the most common operations it does during normal traffic load.

Rep blocks#

The blocks that execute in the rep execution module are used for a variety of purposes. The name rep is short for replication and stems from that the SUMA block is used to distribute events from the data node to the MySQL replication servers.

SUMA#

SUMA stands for Subscription Manager. Events about changes on rows are sent from the SUMA block to the NDB API nodes that have subscribed to changes on the changed tables. This is mainly used for RonDB Replication, but can also be used for event processing such as in HopsFS where it is used to update ElasticSearch to enable generic searches in HopsFS for files.

LGMAN#

LGMAN stands for Log Manager. There is only one LGMAN block in the data node. But this block can be accessed from any ldm execution module. In this case the Logfile_client class is used, when creating such an object a mutex is grabbed to ensure that only one thread at a time is accessing the UNDO log. These accesses from the ldm execution modules are only writing into the UNDO log buffer. The actual file writes are executed in the LGMAN block in the rep execution module.

TSMAN#

TSMAN stands for Tablespace Manager. The creation and opening of tablespace files happens in the TSMAN block. The actual writes to the tablespace files are however handled by the PGMAN blocks in the ldm threads and the extra PGMAN block described below.

Any writes to extent information is handled by TSMAN. These accesses are executed from the ldm execution modules. Before entering the TSMAN block these accesses have to create a Tablespace_client object. This object will grab a mutex to ensure that only one ldm execution module at a time reads and writes into the extent information.

DBINFO#

DBINFO is a block used to implement scans used to get information that is presented in the ndbinfo tables. There are many ndbinfo tables and they are used to present information about what goes on in RonDB. In a managed RonDB setup, this information is presented in various graphs using Grafana.

Proxy blocks#

The proxy blocks for DBLQH, DBACC, DBTUP, DBTUX, BACKUP, RESTORE and PGMAN are handled in the rep execution module.

PGMAN extra#

The PGMAN block has one instance in each ldm execution module. There is also an extra PGMAN instance. The main responsibility of this instance is to handle checkpointing the extent information in tablespaces. The tablespace data used by normal tables is handled by the PGMAN instances in the ldm execution module.

The extent information is a few pages in each tablespace data file that are locked into the page cache. The checkpointing of those pages is handled by this extra PGMAN block instance. This block instance also executes in the rep execution module.

THRMAN#

THRMAN stands for Thread Manager. This block has one instance in each thread. It handles a few things such as tracking CPU usage in the thread and anything that requires access to a specific thread in the data node.

TRPMAN#

TRPMAN stands for Transporter Manager. Transporters are used to implement communication between nodes in RonDB. There is a TRPMAN block in each recv thread. This block can be used to open and close communication between nodes at the request of other blocks.