Signal flows in RonDB#
We will go through some of the most important signalling flows to give an idea about what happens when a query is executed against RonDB.
For queries against the database there are three main flows that we use. The key lookup flow used to read and write data through primary keys or unique keys. The scan flow used to perform full table scans, range scans and pruned versions of these. We have pushdown join that is essentially linked key lookups and scan operations. Pushdown joins builds entirely on top of the key lookup and scan flows.
In addition there are a number of protocols implemented in RonDB to handle all sorts of node failures, node restarts, schema changes and so forth. We will go through a few of those and in some cases simply point out where one can find descriptions of these in the source code.
To maintain various secondary data structures we use triggers and these are integrated into the key lookup protocol and we will describe how this happens.
RonDB was designed with efficient key lookup transactions in mind. Almost all applications that was analysed before implementing RonDB had a major portion of key lookups. Most of them had a high percentage of writes as well.
One of the problems with DBMSs in the 90s was that the overhead of transactions was very high. This is still to a great extent true. For example in other MySQL clustering solutions there are limitations on how many transactions that can commit in parallel.
With RonDB each row operation is committed in exactly the same fashion if it is a single row transaction or if it is a transaction with 1000 rows being updated. We have worked on streamlining this implementation and there is absolutely no limitation on how many transactions that can commit in parallel. The only limit is the CPU resources, memory resources, networking resources and disk resources.
Basic write transaction#
We send one signal from the NDB API for each row to be read, updated, deleted, written or inserted. The same signal is used for all types of key operations.
The figure below shows the signal flow for a single row transaction.
The signal used is called TCKEYREQ and this signal is sent to any tc thread based on some hinting techniques described in the chapter on the NDB API.
DBTC uses DBDIH to discover the placement of all the live replicas in the cluster of the row. Next it sends LQHKEYREQ to DBLQH in the node where the primary replica belongs. The signal is sent directly to the proper ldm thread owning the partition where this primary row resides.
The idea with sending it to the primary replica first is to avoid deadlocks which would easily happen if the writes are sent to replicas in any order.
There is no special signal to start a transaction. Start of a transaction is indicated by one bit in the TCKEYREQ signal.
There is a commit signal that can be used called TC_COMMITREQ. In the signal diagram above it is used. It is not necessary to use this signal. One can also send a TCKEYREQ with the commit bit set. In this case the transaction is committed immediately after completing the last operation and no more TCKEYREQs will be accepted in this transaction after receiving this bit in a TCKEYREQ signal.
The primary sends a signal to the backup replica, the first LQHKEYREQ contained the list of nodes to update. It sends a LQHKEYREQ, the code executed in the backup replica is almost exactly the same as executed in the primary replica, both replicas will update the row in the prepare phase. The new row will be stored in a copy row, the old row is kept until the commit point. This means that readers that want to see the latest committed row state can always proceed and perform the read.
When all replicas have been updated the last replica sends LQHKEYCONF back to the DBTC that handles the transaction. In this case the TCKEYREQ had both a start bit and a commit bit set, thus the transaction will immediately be committed. We will get the GCI (global checkpoint identifier) of the transaction from DBDIH and send COMMIT to the backup replica. Backup replicas will simply record the commit state and send COMMIT to the next replica in the chain until it reaches the primary replica.
At the primary replica it will release the lock and commit the row change by installing the copy row into the row state and by writing the commit message into the REDO log buffer. After this any reader will see the new row state. Next the primary replica will send COMMITTED to DBTC.
In RonDB all tables default to use the Read Backup feature added in MySQL Cluster 7.5. For those tables the primary replica will not release the locks in the Commit phase, instead it will wait with releasing locks until the Complete phase.
When receiving the COMMITTED signal in DBTC the behaviour depends on if the table is a READ BACKUP table or not. If it isn't DBTC will send an ack to the API using the TCKEYCONF signal. In parallel it will send the COMPLETE signal to the last backup replica. In the complete phase the backup replicas will release the locks and commit the row changes. It will also release operation records and any other memory attached to the operation.
The primary replica will send COMPLETED back to DBTC. At this point the READ BACKUP tables will send TCKEYCONF. After executing the COMPLETED in DBTC the transaction is committed and we have released all records belonging to the transaction.
A final part is happening in the background. The NDB API when receiving TCKEYCONF indicating transaction commit will send TC_COMMIT_ACK to the DBTC. This will generate a number of REMOVE_MARKER_ORD signals. These signals clear transaction state that we needed to keep around when for some reason the NDB API didn't receive the confirmation of the commit decision.
A final note is that most of these signals are sent as packed signals. This is true for LQHKEYCONF, COMMIT, COMMITTED, COMPLETE, COMPLETED and REMOVE_MARKER_ORD. This saves 12 bytes of overhead for each of those signals in the transporter part. Cost of sending data over TCP/IP sockets is a large part of the cost of executing distributed transactions.
One more important lesson that I learned through a master thesis project more than 20 years ago is the cost of flexible protocols. The original idea for the NDB protocol was to use a standard telecom protocol using BER encoding. This turned out to be much too expensive. Thus we invented a streamlined protocol where all parts of the protocol are 32 bits in size. Thus we don't have to process one byte at a time, rather we can process 4 bytes at a time which is using the most efficient CPU instructions.
Most signals are fairly static, thus the code that reads the signal data knows exactly where to find each item. This makes the code very efficient.
A special note on key writes is that they can also read data while updating it. For example it is possible to send a delete operation that reads the record while deleting it. These operations can read either the values before the update or after and it is even possible to send back calculated values.
Basic read transaction#
The reads through a primary key uses TCKEYREQ as well. In this case the signal goes to DBTC and from there it is sent to one of the DBLQH blocks. For default tables it will always be sent to the DBLQH of the primary replica.
For READ BACKUP tables and fully replicated tables it depends on the type of the read. READ COMMITTED can be sent to any node with a replica. For reads using read locks or exclusive locks it is sent to the primary replica.
DBLQH sends the data immediately from the ldm thread back to the API using the signal TRANSID_AI. For READ COMMITTED there is no other signal sent to the API. I've highlighted the signals that are sent in this specific case in the figure above.
For all other reads there is a signal going back to DBTC called LQHKEYCONF to ensure that the read can be integrated in writing transactions, this signal will lead to a TCKEYCONF signal sent back to the API. TCKEYCONF can contain responses to multiple TCKEYREQ signals in one message.
Updates in fully replicated table#
Now let's look at how updates happens in a fully replicated table. The protocol is the same as for a normal write transaction. The difference lies in that a trigger is fired on the primary replica when updating the row. This trigger sends a signal called FIRE_TRIG_REQ to DBTC. The trigger contains the information to change in the rows. This information is packed into a number of LQHKEYREQ signals, one LQHKEYREQ signal per extra node group. The first LQHKEYREQ for fully replicated tables is always sent to the first node group. Thus in a single node group cluster the only difference for fully replicated tables is that the trigger signal is sent to DBTC from DBTUP.
The updates in the base fragment and the replicated fragments will then proceed in parallel and commit of each such replicated fragment will happen as if they were independent writes. But since they are part of a transaction the changes will be atomic, thus either all of them happens or none of them.
Using this trigger mechanism integrates the changes to the other fragments naturally into the transaction.
Maintaining a unique index#
Unique indexes are maintained using exactly the same technique as updates are performed in a fully replicated table. In this case there will be only one LQHKEYREQ to update the fragment where the unique key resides. A special case is a unique index in a fully replicated table, in this case the unique index trigger fires a transactions similar to the update in a fully replicated table.
Building new partitions#
When executing the SQL statement ALTER TABLE t1 REORGANIZE PARTITIONS we will reorganise the table to use the new nodes in the cluster.
While building the new partitions we will have triggers that ensure that operations on the table partitions that are reorganised will be sent to the new table fragments. These fragments are normally present in new nodes in the cluster.
This follows the same type of protocol as for fully replicated tables. In this case only rows that are moved to the new table fragments will execute those triggers.
Asynchronous operations in the NDB API is not the standard method to use the NDB API. But it is a very efficient method that at times have been shown to outperform other methods by as much as 10x.
The reason for a higher efficiency can easily be seen in the below figure showing the protocol at a higher level.
In this mode of operation we receive multiple operations on the table in a small number of TCP/IP packets. Next those processed signals send a batch of LQHKEYREQ signals. In principle the protocol is the same as before, but now everything is heavily parallelised. By sending 200 updates in a batch instead of 1-2 updates at a time, we can gain up to 10x higher throughput.
Asynchronous operations can only be key operations at the moment.
Unique index read operations#
Unique indexes can be used for both updates and reads. As mentioned before they constitue a challenge since they can cause deadlocks since reads of the index and updates of the base table can interact with primary key updates of the base table in sequences that can cause a deadlock.
The signalling is easy, it uses a normal read of the unique index table to get the primary key. This key is read back to DBTC that will use the key to start up a normal TCKEYREQ to read/write the row.
Table scan operations#
The second class of query operations that RonDB handles are scans of a table. Scans can use a full table scan if no indexes are present. Scans can use ordered indexes to get data in index order. There are three types of full table scans, one uses the hash index in DBACC, a second one scans in rowid order, a third one scans in disk order in DBTUP. Currently the scan in disk order isn't enabled.
Basic scan operation#
All scan operations use the same base protocol. It starts by sending SCAN_TABREQ from the API to the choosen DBTC. DBTC discovers where the partitions of the table reside. Next it sends SCAN_FRAGREQ in parallel to all table partitions. An exception is for pruned scans that only scan one partition. It is possible to limit the amount of parallelism using a parameter in the SCAN_TABREQ signal.
SCAN_FRAGREQs are sent in parallel to a set of DBLQHs that house the partitions. DBLQH has a local scan protocol to gather the data using DBACC, DBTUP, and DBTUX blocks. This protocol is local to one thread in the data node.
Each signal execution in DBLQH will gather up to five rows. This takes roughly from 2-3 microseconds to 20-30 microseconds or more. Thus even if a scan goes through 1 million rows, it will not cause real-time issues for other scans or key lookup operations. Such a scan will be divided up into hundreds of thousands of signal executions automatically. Thus millions of operations can execute on the same data in parallel with the scan operation even in this case.
DBTC serialises communication with the API. The communication between DBTC and the NDB API has one talker at a time. When one of the DBLQH blocks sends back SCAN_FRAGCONF to DBTC, two things can happen. If the NDB API is still waiting for input from DBTC one sends a SCAN_TABCONF signal back to the API immediately. If we have sent SCAN_TABCONF and are waiting for the SCAN_NEXTREQ signal to proceed we will queue the scan results in DBTC. When SCAN_NEXTREQ arrives we will immediately send a new SCAN_TABCONF to the API if any results were queued up. At the same time we will send SCAN_NEXTREQ to the DBLQH blocks that already reported ack to the API.
SCAN_TABCONF will normally report back rows read from multiple DBLQHs at a time. How many is completely dependent on current system load.
SCAN_TABREQ carries two signal parts. One part is the interpreted program to execute on each row scanned. The second is an index bound (only used by range scans).
The interpreted program carries out filtering and will report back whether the row read was sent back to the API or if it was skipped. The actual row data is sent directly from the ldm thread to the API using the TRANSID_AI signal.
The API can always send SCAN_NEXTREQ with a close flag if it decides that it has seen enough. This will start a close protocol of all the outstanding fragment scans.
Similarly DBTC can close a scan in DBLQH by sending a SCAN_NEXTREQ signal with the close flag set.
Range scans are by far the most common scan operations. It carries an index bound that can contain more than one range. All rows that lies within those index bounds will be scanned.
In executing pushdown joins it is very common that DBSPJ creates scans with multiple ranges to execute for one scan.
Pruned Scan operations#
Scans have two options, either they can scan all partitions of a table, or they can scan just one partition of the table. The scans that only scan a single partition are called pruned scans. The reason that they only need to scan one partition is that they have knowledge of the partition key.
This is similar to the design in sharded databases. In a sharded database it is necessary to know the sharding key to be able to query the data. Otherwise all shards must be queried.
This is exactly how RonDB works as well. A shard in RonDB is here a table partition and the sharding key is the partitioning key.
Designing a scalable solution with many data nodes and many ldm threads requires use of a carefully selected partition key on the tables.
Full Table scans#
Full table scans are necessary when no keys are present. Even in this case RonDB can scan fast. Since scans are automatically parallelised we can easily scan many millions of rows per second if the query uses a filter that ignores most rows. RonDB is capable of shipping many millions of rows per second to the NDB API as well for further analysis.
Pushdown joins are implemented on top of the key lookup protocol and scan protocol. They use a linked join algorithm where keys for the tables down in the join are sent to the DBSPJ block for use with later scans and key lookups. Columns that are read are immediately sent to the NDB API for storage in memory there waiting for the rest of the queried parts that are fetched further down in the join processing.
Foreign keys have two parts. The first part are the consistency checks that the foreign key defines. The checks are executed in a special prepare-to-commit phase executed when starting to commit. If this phase is successful the transaction will be committed, otherwise it will be aborted.
The second part is the cascading actions. These are executed exactly as the other triggering operations. One special thing to consider about foreign keys is however that triggers can fire recursively. One small change is capable of triggering very massive changes. This is not desirable, it is important to consider this when defining your foreign key relations.
In the block DBDIH in the file DbdihMain.cpp, there is a section called Transaction Handling module. This describes exactly how the global data in DBDIH used for query processing is maintained. In particular it describes how table reorganisations are executed in RonDB. It contains a description of the signal flows for this algorithm.
Meta data operations#
Meta data operations starts with a phase where the meta data transaction is defined by using signals such as CREATE_TABLE_REQ, DROP_TABLE_REQ, ALTER_TABLE_REQ, CREATE_FILE_REQ and so forth. These signals are spread to all data nodes and are processed by the DBDICT block.
After defining the meta data transaction, there are a great many steps for each of those meta data parts where there is a number of different phases to step forward, there is also a set of steps defined to step backwards when the schema change didn't work.
These steps are implemented by a signal SCHEMA_TRANS_IMPL_REQ and its response SCHEMA_TRANS_IMPL_CONF. All the nodes steps forward in lockstep until all phases have been completed. If something goes wrong in a specific phase, there is a well defined set of phases defined to abort the schema transaction. Obviosuly aborting a meta data operation is a lot more complex than a normal transaction.
Cluster Manager operations#
The node registration protocol and the heartbeat protocol following it are described in some detail in the Qmgr block in the QmgrMain.cpp file in the RonDB source code. It contains signal flows for this algorithm.
Local checkpoint protocol#
The local checkpoint protocol is actually a number of protocols. The first step is a protocol using signals TC_GETOPSIZEREQ/CONF, TC_CLOPSIZEREQ/CONF. These are used to decide when to start a new LCP based on the amount of changes in the data nodes.
We have to acquire a schema lock also for a short time to decide on which nodes to integrate into the LCP. Only the tables defined at the start of the LCP will be checkpointed. New tables created after starting a local checkpoint will have to wait until the next local checkpoint.
One important part of the LCP protocol added in 7.4 is the pause LCP protocol. This ensures that we can pause the LCP execution when a node is starting up to copy the meta data in the cluster. There is a long description of this protocol and the LCP protocol itself in the block DBDIH, in the source code file DbdihMain.cpp. The aim of this protocol is to ensure that DBLQH stops sending LCP_FRAG_REP signals that will change the state of distribution information in DBDIH. This protocol is only required during restart of a node.
The above figure shows the execution phase protocol of a local checkpoint. It starts by sending START_LCP_REQ to all nodes. Before sending this signal we acquired a mutex on the distribution information in DBDIH. The aim of this signal is to set the flags on what table fragments that will perform this local checkpoint. Since we have locked the distribution information all nodes will calculate this locally. After completing this preparatory step of a local checkpoint each node will respond to the DBDIH block in the master node with the START_LCP_CONF signal.
After receiving this signal from all nodes we have finished the preparation of the local checkpoint. We can now release the mutex.
The next step is to start sending the signal LCP_FRAG_ORD to DBLQH. Each such signal asks DBLQH to checkpoint a specific table fragment. Many such signals can be outstanding at a time, they are queued up if DBLQH isn't ready to execute them. DBDIH will send lots of those signals to give DBLQH the chance to control the speed of executing the local checkpoint.
Whenever DBLQH finishes execution of a local checkpoint for a table fragment it will send the signal LCP_FRAG_REP to report about the completed checkpoint. This signal is sent to all DBDIHs (one per node).
When the master DBDIH have completed sending all LCP_FRAG_ORD to all nodes, it will send a final LCP_FRAG_ORD with the last fragment flag set to true. This indicates to all DBLQHs that no more checkpoint requests will be sent. Thus when DBLQH finishes executing the last checkpoint in the queue and this signal have been received, it knows that its part of the local checkpoint execution is completed.
At completion of local checkpoint execution in a DBLQH it will send the signal LCP_COMPLETE_REP to all DBDIH blocks. This signal will first go to the DBLQH proxy block, only when all local DBLQH instances have sent this signal will the proxy block send this to the local DBDIH block, that in turn will broadcast it to all DBDIH blocks. This signal will be marked by DBLQH completing its part of the local checkpoint execution.
When a DBDIH block have heard this signal from all nodes and it has completed saving the changes to the distribution information from local checkpoint executions, it will send LCP_COMPLETE_REP to the master DBDIH block. This signal will be marked that DBDIH has completed its part of the local checkpoint execution.
When the master have received this signal from all DBDIH blocks, it will send LCP_COMPLETE_REP to all DBDIH blocks with a 0 indicating that now the entire local checkpoint execution is completed in all nodes.