Blocks in a Data Node#
In order to understand the signalling diagrams that we will present in a later chapter on common use cases, it is necessary to at least understand the various blocks in RonDB at a high level.
The code also documents some of the most important signalling flows in RonDB. As an example the block Ndbcntr and in the code file NdbcntrMain.cpp contains a long description of a how a restart works. In the block Qmgr and in the file QmgrMain.cpp we have a signalling diagram for how the registration of new nodes into the cluster at startup happens and how this is the base for heartbeat signals sent later. In the block Dbdih and in the file DbdihMain.cpp there is a long description of how ALTER TABLE REORGANIZE PARTITIONS works with a detailed signalling flow. Hopefully with the aid of this explanation of the internals it will be easier to understand those descriptions in the code for anyone interested in understanding on a detailed level what goes on inside RonDB to handle the complex online meta data changes.
In the description we will describe the execution modules inside the RonDB data nodes. An execution module must execute within one thread, but different execution modules can be colocated within one thread in a variety of manners. The configuration variable ThreadConfig provides a great deal of flexibility in configuring the placement of execution modules into threads.
LDM stands for Local Data Manager. It is a set of blocks that cooperate to deliver the local database service that is a very central part of RonDB. In ndbmtd these blocks all operate inside the same thread in the ldm threads. It is not possible to break those blocks apart and execute them in different threads. It is however possible to have several instances of the ldm threads. Each ldm thread handles one instance of each of the below blocks, in addition it can take care log parts of the REDO log. In RonDB we can have fewer REDO log parts than the number of LDM instances. In this case the REDO log parts will be handled by the first LDM instances, but all LDM instances will write to the REDO log controlled through a mutex on the REDO log part.
DBLQH stands for DataBase Local Query Handler. This block is where most interactions with the LDM blocks starts. It does handle the REDO log, but its main responsibility is to interact with the other blocks operating the tuple storage, hash index, ordered index, page cache, restore operations, backup and local checkpoint services.
The most common signal DBLQH receive is the LQHKEYREQ that is used for reading or writing a row using a primary key of a table (either a user table or a unique index table or a BLOB table). To serve this query it uses DBACC to lookup in the hash table and DBTUP that has the responsibility of the tuple storage and that performs the operation requested by the user.
Scan operations is the second most common operation and this is receiving a SCAN_FRAGREQ signal to order either a full table scan (implemented by either DBACC or DBTUP) or a range scan (implemented by DBTUX).
DBLQH also controls the execution of local checkpoints together with the BACKUP block.
One can say that DBLQH is the service provider for the LDM blocks that makes use of internal services delivered by the other blocks.
DBACC has two main functions. It controls the local part of the distributed hash index that every primary key uses, and every table in RonDB, including unique index tables and BLOB tables, have a primary key. Even tables without a primary key uses a hidden primary key.
The hash table in DBACC also acts as the locking data structure to ensure that rows are locked in the proper fashion by user operations.
DBACC stands for DataBase ACCess manager.
DBTUP means DataBase TUPle manager. DBTUP is where the actual in-memory rows are stored using the fixed row parts and the variable sized row parts. DBTUP has the ability to execute all sorts of operations requested by the higher levels of software in RonDB. Thus this is where the actual reads and writes of data happens. DBTUP also contains a fairly simple interpreter that can execute a number of simple statements that can be used to push down condition evaluation to RonDB. It also reads and writes the disk columns delivered by PGMAN.
DBTUX means DataBase TUple IndeX. It contains an implementation of a T-tree. The T-tree is an index specifically developed to support ordered indexes in main memory. Every update of a row involving an ordered index will perform changes of the T-tree and it is used heavily during restart when the T-trees are rebuilt using rows recovered.
The DBTUX software have been very stable for a long time. It does what it is supposed to do and there is very little reason to change it. Most new features developed are done at a higher service level in RonDB and DBTUP has lots of capabilities that interact with higher service levels in RonDB.
PGMAN stands for PaGe MANager and it handles the page cache for disk pages used for disk columns. Each ldm thread has a part of the page cache. It contains a state-of-the-art page caching algorithm that is explained in comments for the get_page function in this block.
The backup block is responsible to write local checkpoint and backup information to files. It does so by using a full table scan using the DBTUP block to ensure that all rows are analysed to see if they should be written into this local checkpoint or not. The local checkpoints in RonDB uses a mechanism called Partial Local Checkpoint (Partial LCP). This mechanism was introduced in MySQL Cluster 7.6 and provides the ability to write checkpoints that only write a part of the in-memory data. This is an extremely important feature to enable use of RonDB with large memories. The previous implementation made every checkpoint a full checkpoint. With terabytes of data in memory, this is no longer a feasible approach.
The Partial LCP uses an adaptive algorithm to decide on the size of each checkpoint of each table fragment. The full table scan in DBTUP for checkpoints uses some special mechanism to quickly go through the database to find those rows that have changed since last checkpoint. Pages that are not written to since last checkpoint will be quickly discovered and skipped. The partial checkpoint means that a subset of the pages will be written fully and a subset will only write changed rows. The number of pages to write fully depends on how much write activity have occurred in the table fragment since the last checkpoint.
The Partial LCP has to handle dropped pages and new pages in a proper manner. The code in DbtupScan.cpp contains a long description of how this works together with a proof of that the implementation actually works.
The RESTORE block is only used during the early phases of a restart when it is used to restore a local checkpoint. The local checkpoint contains saved in-memory rows that are written back into DBTUP at recovery using more or less the same code path as a normal insert or write operation would use in RonDB.
With Partial LCP we have to write several checkpoints back in time to ensure that we have restored the checkpoint properly. Thus this phase of the recovery takes slightly longer time, however the Partial LCP speeds up the checkpoints so much that the cost of executing the REDO log decreases substantially. Thus recovery with Partial LCP can go up 5x faster compared to using full checkpoints.
Query thread blocks#
Each query thread contains the blocks DBQLQH, DBQACC, DBQTUP, DBQTUX, QBACKUP and QRESTORE. These are using exactly the same code except that they have a variable m_is_query_thread that is set to true. A query thread doesn't have a PGMAN instance since the query threads currently don't handle queries using disk columns.
Query threads only handle Read Committed queries. Any query thread can handle queries to any table fragment. However when routing the requests to query threads we ensure that query threads only execute queries for the same Round Robin Group. Round Robin groups strive to share a common L3 cache to ensure that we make good use of CPU caches and TLB data.
TC stands for Transaction Coordinator. These blocks handle a small but important part of the RonDB architecture. There are only two blocks here DBTC and DBSPJ. All transactions pass through the TC block and for pushdown joins the DBSPJ block is used as the join processor.
The most common signalling interface to DBTC is TCKEYREQ and TCINDXREQ. The TCKEYREQ is used for primary key operations towards all tables. DBTC works intimately with the DBDIH block, DBDIH contains all distribution information. DBTC queries this information for all queries to ensure that we work with an up to date version of the distribution information. TCINDXREQ handles unique key requests.
The distribution information in DBDIH is used from all TC blocks, this information required special handling of concurrency. Since it is read extremely often and updates to it are rare, we used a special mechanism where we read an atomic variable before accessing the data, we read the same variable again after accessing the data, if the variable has changed since we started reading, we will retry and read the data once again. Obviously this requires special care in handling errors in the read processing.
DBTC uses the DBLQH service provider interface to implement transactional changes. DBTC ensures that transactional changes are executed in the correct order. This includes handling triggers fired when executing change operations. These triggers are fired to handle unique indexes, foreign keys and table reorganisations. The triggers are often fired in DBTUP and sent to DBTC for processing as part of the transaction that fired the change.
Scan operations and pushdown join operations are started through the SCAN_TABREQ signal. Scan operations on a single table is handled by DBTC and pushdown joins are handled by DBSPJ.
DBTC handles timeouts to ensure that we get progress even in the presence of deadlocks.
If a data node fails, a lot of ongoing transactions will lose their transaction state. To be able to complete those transactions (by either aborting or committing them) DBTC can build up the state again by asking DBLQH for information about ongoing transactions from the failed node. It is the first tc thread in the master data node that will take care of this.
DBSPJ stands for DataBase Select-Project-Join. SPJ is a popular phrase in discussing join queries in databases. The DBSPJ blocks implements the linked join operations that is used in RonDB to implement a form of parallel query. DBSPJ has no function for anything apart from join queries.
It implements the joins by sending SCAN_FRAGREQ for scans to DBLQH and LQHKEYREQ to DBLQH for primary key lookups. The results of the scans and key lookups are sent directly to the NDB API that also get some keys that makes it possible for the NDB API to put together the result rows. Some parts of the reads can also be sent back to DBSPJ when the result from one table is required as input to a later table in the join processing.
Main thread blocks#
There are two execution modules used for various things, this is the main and the rep execution module. Here we will first go through the blocks found in the main execution module. These blocks all have only one instance. There are also a few proxy blocks. Proxy blocks is a special kind of block that was introduced to handle multiple block instances. The proxy blocks implements some functionalities such that it is possible to send a signal to a block and have it executed on all block instances of e.g. the DBLQH block. This was helpful in changing the code from the single-threaded version to the multi-threaded version.
DBDICT stands for DataBase DICTionary. It contains meta data about tables, columns, indexes, foreign keys, various internal meta data objects. It implements a framework for implementing any type of schema transactions. A schema transaction tracks the progress of a schema transaction. If a failure happens we will either roll the schema transaction backwards or forwards. If it has passed a point where it can no longer be rolled back, we will ensure that it will be completed even in the presence of node crashes and even in the presence of a cluster crash. Schema transactions cannot contain any normal user transactions.
DBDICT implements this by executing a careful stepwise change of each schema transaction. Thus the schema transaction is divided into many smaller steps, each of those steps are applied in all nodes in parallel. After completing the steps we record the information about the step we have completed in a file flushed to disk. This means that schema transactions in RonDB are not so fast, but they can handle very complex changes. It is possible to use this framework for quite advanced schema changes. But introducing new types of schema changes still requires writing a few new functions and verifying that the new schema change variant is working well.
DBDICT has an important role in the startup of a node where it decides which tables to restore. It asks DBDIH to perform the actual recovery and will assist DBDIH in some of those recovery steps.
DBDIH stands for DataBase DIstribution Handler. This is one of the most important parts of RonDB. In DBDIH we maintain all the distribution information about all tables in the cluster. It contains algorithms to ensure that this information is synchronised with all other nodes in the cluster.
This distribution information is updated by schema changes and by node starts and by node failures. Thus for a specific table the information is very stable. Days and even years can pass without it being updated. To handle this we have used a mechanism similar to the RCU mechanism used in the Linux kernel. The idea is that when reading one performs the following, before starting to read we check if the data is currently being changed (checked by reading a flag on the table). If it isn't currently being changed we go on and read the information, but we record the flags we read before reading it. If it was currently being changed we will simply loop until the data isn't being updated. After reading the data we will check if the data have changed since we read (checked by again reading the table flags). If the data didn't change we're done, if it changed, we restart the read once again.
This procedure is used to protect a number of variables regarding data distribution maintained by DBDIH. All the code to handle these protected regions are assembled together in DBDIH to make it easier to follow this part of the code.
DBDIH runs the global checkpoint protocol, global checkpoints was described in an earlier chapter and represents recoverable checkpoints on a cluster level. DBDIH also runs the local checkpoint protocol. DBDIH gathers information during the local checkpoint process that is used in restarts. With the implementation of Partial LCP most of the handling of checkpoints is handled locally in the DBLQH block. However DBDIH is still used to control the flow of checkpoints although some information it retrieves from DBLQH isn't really used any more.
TRIX stands for TRIgger eXecution. It is involved in handling online index builds, copying data during table reorganisations and online foreign key builds. It is a block that assists DBDICT by performing a number of important operations to move data around to implement complex online schema changes.
QMGR stands for Cluster Manager with a little play on pronunciation. This block is responsible for the heartbeat protocol in RonDB. It is the lowest level of the cluster management in RonDB. It is involved in early start phases to register a new node into the cluster and maintains the order used to decide which node is the master node (it is the oldest node registered). Nodes are registered one at a time and all nodes are involved in this registration, thus all nodes agree on the age of a node. Next we decide that the oldest node is always choosen as the new master node whenever a master node fails.
NDBCNTR stands Network DataBase CoNTRoller. From the beginning this block was the block responsible for the database start of the data node. This is still true to a great extent.
The original idea with the design was that QMGR was a generic block that handled being part of the cluster. The NDBCNTR block was specifically responsible for the database module. The idea in the early design was that there could be other modules as well. Nowadays QMGR and NDBCNTR assist each other in performing important parts of the cluster management and parts of the restart control. Mostly NDBCNTR is used for new functionality and QMGR takes care of the lower level functionality which among other things relates to setting up connections between nodes in the cluster.
CMVMI stands Cluster Manager Virtual Machine Interface. This block was originally a block that was used as a gateway between blocks implemented in PLEX and blocks implemented in C++. Nowadays and since 21 years all blocks are C++ blocks. It now implements a few support functions for other blocks.
DBUTIL stands DataBase UTILity. This blocks implements a signal based API towards DBTC and DBLQH. It is a support block to other blocks like NDBCNTR, TRIX and so forth to execute database operations from the block code.
NDBFS stands for Network DataBase File System. It implements a file system API using signals. It was developed to ensure that file system accesses could use the normal signalling protocols. Given that all the rest of the code in the RonDB data nodes is asynchronous and uses signals between blocks, it was important to move also the RonDB file system accesses into this structure. Not every OS have an asynchronous API towards their filesystem, we implemented the interface to the file system through many small threads that each can do blocking file system calls. But they interact with the main execution module whenever they are requested to perform a file operation and the threads send back information to the main execution modules when the file operation is completed.
The APIs implemented towards NDBFS supports a number of different models. The various uses of a file system is rather different for the functions in RonDB. Some functions simply need to write a file of random size, others work with strict page sizes and so forth.
The main execution module handles the proxy blocks for DBTC and DBSPJ.
The main execution module handles a variety of things. For the most part it is involved heavily in restarts and in meta data operations. NDBFS operations and checkpoint operations in DBDIH is the most common operation it does during normal traffic load.
The blocks that execute in the rep execution module are used for a variety of purposes. The name rep is short for replication and stems from that the SUMA block is used to distribute events from the data node to the MySQL replication servers.
SUMA stands for SUbscribption MAnager. Events about changes on rows are sent from the SUMA block to the NDB API nodes that have subscribed to changes on the changed tables. This is mainly used for RonDB Replication, but can also be used for event processing such as in HopsFS where it is used to update ElasticSearch to enable generic searches in HopsFS for files.
LGMAN stands for LoG MANager. There is only one LGMAN block in the data node. But this block can be accessed from any ldm execution module. In this case the Logfile_client class is used, when creating such an object a mutex is grabbed to ensure that only one thread at a time is accessing the UNDO log. These accesses from the ldm execution modules are only writing into the UNDO log buffer. The actual file writes are executed in the LGMAN block in the rep execution module.
TSMAN stands for TableSpace MANager. The creation and opening of tablespace files happens in the TSMAN block. The actual writes to the tablespace files are however handled by the PGMAN blocks in the ldm threads and the extra PGMAN block described below.
Any writes to extent information is handled by TSMAN. These accesses are executed from the ldm execution modules. Before entering the TSMAN block these accesses have to create a Tablespace_client object. This object will grab a mutex to ensure that only one ldm execution module at a time reads and writes into the extent information.
DBINFO is a block used to implement scans used to get information that is presented in the ndbinfo tables. There are lots of ndbinfo tables, those are used to present information about what goes on in RonDB. In a managed RonDB setup this information is presented in lots of graphs using Grafana.
The proxy blocks for DBLQH, DBACC, DBTUP, DBTUX, BACKUP, RESTORE and PGMAN are handled in the rep execution module.
The PGMAN block has one instance in each ldm execution module. There is also an extra PGMAN instance. The main responsibility of this instance is to handle checkpointing the extent information in tablespaces. The tablespace data used by normal tables is handled by the PGMAN instances in the ldm execution module.
The extent information is a few pages in each tablespace data file that are locked into the page cache. The checkpointing of those pages are handled by this extra PGMAN block instance. This block instance also executes in the rep execution module.
THRMAN stands for THRead MANager. This block has one instance in each thread. It handles a few things such as tracking CPU usage in the thread and anything that requires access to a specific thread in the data node.
TRPMAN stands for TRansPorter MANager. Transporters are used to implement communication between nodes in RonDB. There is a TRPMAN block in each recv thread. This block can be used to open and close communication between nodes at request of other blocks.