Signal flows in RonDB#
We will go through some of the most important signaling flows to give an idea about what happens when a query is executed against RonDB.
For queries against the database, there are three main flows that we use. The key lookup flow is used to read and write data through primary keys or unique keys. The scan flow is used to perform full table scans, range scans and pruned versions of these. The pushdown join flow is essentially a set of linked key lookups and scan operations; pushdown joins are built entirely on top of the key lookup and scan flows.
In addition, there are several protocols implemented in RonDB to handle all sorts of node failures, node restarts, schema changes and so forth. We will go through a few of those and in some cases simply point out where one can find descriptions of these in the source code.
To maintain various secondary data structures we use triggers, which are integrated into the key lookup protocol. We will describe how this happens later on.
Key operations#
RonDB was designed with efficient key lookup transactions in mind. Almost all applications that were analyzed before implementing RonDB had a major portion of key lookups. Most of them had a high percentage of writes as well.
One of the problems with DBMSs in the 90s was that the overhead of transactions was very high. This is still to a great extent true. For example, in other MySQL clustering solutions, there are limitations on how many transactions can commit in parallel.
With RonDB, each row operation is committed in the same fashion whether it is a single-row transaction or if it is a transaction updating 1000 rows. We have worked on streamlining this implementation and there is no limitation on how many transactions can commit in parallel. The only limit is the CPU resources, memory resources, networking resources and disk resources.
Basic write transaction#
We send one signal from the NDB API for each row to be read, updated, deleted, written or inserted. The same signal is used for all types of key operations.
The figure below shows the signal flow for a single-row transaction.
The signal used is called TCKEYREQ
and this signal is sent to any tc
thread based on some hinting techniques described in the chapter on the
NDB API.
DBTC uses DBDIH to discover the placement of all live replicas of the row in the cluster. Next, it sends LQHKEYREQ to DBLQH in the node holding the primary replica. The signal is sent directly to the ldm thread owning the partition where this primary row resides.
The idea of sending it to the primary replica first is to avoid deadlocks, which would easily happen if the writes were sent to the replicas in arbitrary order.
There is no special signal to start a transaction. The start of a
transaction is indicated by one bit in the TCKEYREQ
signal.
There is a separate commit signal called TC_COMMITREQ, which is used in the signal diagram above. It is not necessary to use this signal. One can also send a TCKEYREQ with the commit bit set. In this case, the transaction is committed immediately after completing the last operation, and no more TCKEYREQs will be accepted in this transaction after receiving this bit in a TCKEYREQ signal.
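
To make this concrete, here is a minimal sketch of a single-row update through the NDB API (the table t1 and its columns pk and val are hypothetical). When a single-operation transaction is executed with the commit option, the API can send a single TCKEYREQ carrying both the start bit and the commit bit, so the whole transaction is resolved as described above.

```cpp
#include <NdbApi.hpp>
#include <cstdio>

// Minimal sketch: update one row by primary key in a single transaction.
// Table t1 (columns pk, val) is a hypothetical example table.
int update_row(Ndb_cluster_connection &conn)
{
  Ndb ndb(&conn, "test");          // Ndb object bound to database "test"
  if (ndb.init() != 0) return -1;

  NdbTransaction *trans = ndb.startTransaction();
  if (trans == nullptr) return -1;

  NdbOperation *op = trans->getNdbOperation("t1");
  if (op == nullptr) { ndb.closeTransaction(trans); return -1; }

  op->updateTuple();               // key-based update, one LQHKEYREQ per replica
  op->equal("pk", (Uint32)1);      // primary key of the row to update
  op->setValue("val", (Uint32)42); // new column value

  // Executing with Commit lets the API set the commit bit on the TCKEYREQ.
  int res = trans->execute(NdbTransaction::Commit);
  if (res != 0)
    printf("Transaction failed: %s\n", trans->getNdbError().message);

  ndb.closeTransaction(trans);
  return res;
}
```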
The first LQHKEYREQ
contains the list of nodes to update. The primary
uses this to send a signal to the backup replica. It sends a LQHKEYREQ
and the code executed in the backup replica is almost identical to that
executed in the primary replica. Both replicas will update the row in
the prepare phase. The new row is stored in a copy row, while the old row is kept until the commit point. This means that readers who want to see the latest committed row state can always proceed and perform the read.
When all replicas have been updated the last replica sends LQHKEYCONF
back to the DBTC that handles the transaction. In this case, the
TCKEYREQ
had both a start bit and a commit bit set, thus the
transaction will immediately be committed. We will get the GCI (global
checkpoint identifier) of the transaction from DBDIH and send COMMIT
to the backup replica. Backup replicas will simply record the commit
state and send COMMIT
to the next replica in the chain until it
reaches the primary replica.
The primary replica will release the lock and commit the row
change by installing the copied row into the row state and writing the
commit message into the REDO log buffer. After this, any reader will see
the new row state. Next, the primary replica will send COMMITTED
to
DBTC.
In RonDB, all tables use the Read Backup feature (added in MySQL Cluster 7.5) by default. For those tables, the primary replica will not release the locks in the Commit phase. Instead, it waits to release the locks until the Complete phase.
When receiving the COMMITTED
signal in DBTC the behavior depends on
whether the table is a read backup table or not. If it isn’t, DBTC
will send an ack to the API using the TCKEYCONF
signal. In parallel it
will send the COMPLETE
signal to the last backup replica. In the
complete phase, the backup replicas will release the locks and commit the row changes. They will also release operation records and any other memory attached to the operation.
The primary replica will send COMPLETED back to DBTC. At this point, for read backup tables, DBTC will send TCKEYCONF. After executing the COMPLETED signal in DBTC, the transaction is committed and all records belonging to the transaction have been released.
A final part happens in the background. When receiving TCKEYCONF, indicating that the transaction has committed, the NDB API will send TC_COMMIT_ACK to DBTC. This will generate several REMOVE_MARKER_ORD signals. These signals clear the transaction state that we keep around in case the NDB API for some reason does not receive the confirmation of the commit decision.
Most of these signals are sent as packed signals. This is true for LQHKEYCONF, COMMIT, COMMITTED, COMPLETE, COMPLETED and REMOVE_MARKER_ORD. This saves 12 bytes of overhead for each of those signals in the transporter part. The cost of sending data over TCP/IP sockets is a large part of the cost of executing distributed transactions.
Most signals have a fairly static layout; thus the code that reads the signal data knows exactly where to find each item. This makes the code very efficient. The original idea for the NDB protocol was to use a standard telecom protocol with BER encoding, but that protocol's flexibility made it too expensive. Thus we invented a streamlined protocol where all parts of the protocol are 32 bits in size. This lets us process four bytes at a time instead of one byte at a time, which uses the CPU instructions most efficiently.
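
To illustrate why fixed 32-bit words are cheap to decode, here is a purely illustrative sketch; this is not the actual NDB signal layout, just an example of the principle: every field sits at a known word offset, so decoding is a few aligned word reads rather than a byte-by-byte parse as with BER.

```cpp
#include <cstdint>

// Illustrative only: a hypothetical fixed-layout signal of 32-bit words.
// The real NDB signal formats differ, but share the same principle:
// every field lives at a fixed word offset, so no byte-wise parsing is needed.
struct ExampleKeySignal {
  uint32_t transId1;     // word 0
  uint32_t transId2;     // word 1
  uint32_t tableId;      // word 2
  uint32_t requestInfo;  // word 3: bit-packed flags (e.g. start/commit bits)
  uint32_t keyValue;     // word 4
};

static_assert(sizeof(ExampleKeySignal) == 5 * sizeof(uint32_t),
              "fixed layout of 32-bit words");

// Decoding is just indexing into the word array received from the transporter.
inline ExampleKeySignal decode(const uint32_t *words)
{
  return ExampleKeySignal{words[0], words[1], words[2], words[3], words[4]};
}
```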
A special note on key writes is that they can also read data while updating it. For example, it is possible to send a delete operation that reads the record while deleting it. These operations can read either the values before or after the update, and it is even possible to send back calculated values.
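
In the NDB API, this corresponds to attaching read calls to a write operation. The sketch below deletes a row by primary key and reads back one of its column values in the same operation (table t1 with columns pk and val is hypothetical).

```cpp
#include <NdbApi.hpp>
#include <cstdio>

// Minimal sketch: delete a row and read one of its columns in the same
// key operation. Table t1 with columns pk and val is hypothetical.
int delete_and_read(Ndb &ndb)
{
  NdbTransaction *trans = ndb.startTransaction();
  if (trans == nullptr) return -1;

  NdbOperation *op = trans->getNdbOperation("t1");
  if (op == nullptr) { ndb.closeTransaction(trans); return -1; }

  op->deleteTuple();
  op->equal("pk", (Uint32)1);

  // Read the value of "val" as it was before the delete took effect.
  NdbRecAttr *oldVal = op->getValue("val");

  int res = trans->execute(NdbTransaction::Commit);
  if (res == 0 && oldVal != nullptr && !oldVal->isNULL())
    printf("Deleted row had val = %u\n", oldVal->u_32_value());

  ndb.closeTransaction(trans);
  return res;
}
```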
Basic read transaction#
Reads using a primary key also utilize the TCKEYREQ
signal. In this
case, the signal goes to DBTC and from there, it is sent to one of the
DBLQH blocks. For default tables, it will always be sent to the DBLQH of
the primary replica. For read backup and fully replicated tables, it
depends on the type of the read. Read committed can be sent to any
node with a replica. For reads using shared or exclusive locks, it is
sent to the primary replica.
DBLQH sends the data immediately from the ldm thread back to the API using the signal TRANSID_AI. For read committed there is no other signal sent to the API.
For reads with shared or exclusive locks, the LQHKEYREQ
signal will
cause the row to be locked. In this case, there is also a signal going
back to DBTC called LQHKEYCONF
to ensure that the read can be
integrated into write transactions. This signal will lead to a
TCKEYCONF
signal sent back to the API. TCKEYCONF
is a reference to
one or more TCKEYREQ
signals but does not contain any row data. One
transaction can have multiple TCKEYREQs, whereby the last one will have a flag set, indicating that it is the last. TCKEYCONFs can be sent back to the API in the middle of a transaction (before other reads have started).
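
The difference between the two read variants is visible in the lock mode chosen in the NDB API: a committed read returns data via TRANSID_AI only, while a locked read also produces the LQHKEYCONF/TCKEYCONF signals described above. A minimal sketch follows (table t1 with columns pk and val is hypothetical).

```cpp
#include <NdbApi.hpp>

// Minimal sketch: the same primary key read with two different lock modes.
// Table t1 with columns pk and val is hypothetical.
int read_row(Ndb &ndb, bool locked)
{
  NdbTransaction *trans = ndb.startTransaction();
  if (trans == nullptr) return -1;

  NdbOperation *op = trans->getNdbOperation("t1");
  if (op == nullptr) { ndb.closeTransaction(trans); return -1; }

  // LM_CommittedRead: lock-free read of the latest committed row state,
  //   which can be served by any replica of a read backup table.
  // LM_Read: shared lock, served by the primary replica.
  op->readTuple(locked ? NdbOperation::LM_Read
                       : NdbOperation::LM_CommittedRead);
  op->equal("pk", (Uint32)1);
  NdbRecAttr *val = op->getValue("val");

  int res = trans->execute(NdbTransaction::Commit);
  if (res == 0 && val != nullptr && !val->isNULL())
    (void)val->u_32_value();   // use the value delivered via TRANSID_AI

  ndb.closeTransaction(trans);
  return res;
}
```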
Updates in fully replicated table#
Fully replicated tables are tables that are replicated across all node
groups. The protocol here is the same as for a normal write transaction.
The difference is that a trigger is fired from the primary replica when the row is updated. This trigger sends a signal called
FIRE_TRIG_REQ
to DBTC.
The trigger contains several LQHKEYREQ
signals - one LQHKEYREQ
signal per extra node group. The initial LQHKEYREQ
is always sent to
the first node group to avoid deadlocks. If the initial LQHKEYREQ
contains an executable program (e.g. increment), the first primary DBLQH
can execute this on behalf of the other primary DBLQHs. Thereby, the
trigger will only contain the information to change in the rows.
The updates in the base fragment and the replicated fragments will then proceed in parallel, and the commit of each such replicated fragment happens as if it were an independent write. But since they are part of one transaction, the changes will be atomic: either all or none of them happen. This is ensured by using a single transaction coordinator for all fragments.
In a single node group cluster, the only difference for fully replicated
tables is that the FIRE_TRIG_REQ
signal is sent to DBTC.
Note: Technically, the FIRE_TRIG_REQ
signal is sent from DBTUP to
DBTC. Both the DBTUP and the DBLQH blocks are part of the ldm thread and work closely together.
Unique index read operations#
Unique indexes are hidden tables in RonDB’s "backend" that take care
of the UNIQUE
constraint in the schema of normal tables.
The unique index tables use the unique index as a primary key and then
store the primary key of the main table as a column.
In RonDB, each table is partitioned into its own fragments. The fragment that a row is placed in is decided by the hash of its partition key. In the case of the unique index table, the primary key is also the partition key. Since however each table is partitioned separately, the row in the unique index table can be on a different node than the row in the main table. Partitioning is further explained here.
Unique indexes can both be read and updated. Reading from the main table using a unique index requires a read from the unique index table to get the primary key of the main table. The signaling of this is shown in the figure below.
The DBTC sends a TCKEYREQ
signal to itself to trigger a read operation
using a shared lock that returns the primary key of the main table. This
key is then used to start up a normal TCKEYREQ
to read/write the row.
The reason why the read of the unique index table uses a shared lock and why this can cause deadlocks is described in the section on locking unique indexes.
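
From the NDB API, a read through a unique index is issued as an index operation; the two-step lookup via the hidden index table described above happens transparently inside the data nodes. A minimal sketch follows, where the unique index name uk_val, the table t1 and its columns are hypothetical, and the index name is assumed to match the name registered in the NDB dictionary.

```cpp
#include <NdbApi.hpp>

// Minimal sketch: read a row through a unique index. The index name "uk_val"
// and table t1 with columns pk, val, other are hypothetical.
int read_via_unique_index(Ndb &ndb, Uint32 uniqueKey)
{
  NdbTransaction *trans = ndb.startTransaction();
  if (trans == nullptr) return -1;

  // An index operation on the unique index of table t1.
  NdbIndexOperation *op = trans->getNdbIndexOperation("uk_val", "t1");
  if (op == nullptr) { ndb.closeTransaction(trans); return -1; }

  op->readTuple(NdbOperation::LM_Read);   // lock mode for the main table read
  op->equal("val", uniqueKey);            // value of the unique key column
  NdbRecAttr *other = op->getValue("other");

  int res = trans->execute(NdbTransaction::Commit);
  ndb.closeTransaction(trans);
  (void)other;
  return res;
}
```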
Updating a unique index#
To update the unique index of a row, we require a lock on the row in the
main table. The update towards the main row, using the LQHKEYREQ
signal, will then trigger the FIRE_TRIG_REQ
signal, similar to the
fully replicated table case. In this case, the trigger’s FIRE_TRIG_REQ
signal will however only contain one LQHKEYREQ
to update the fragment
where the unique key resides.
A special case is a unique index in a fully replicated table. Here, the
trigger will contain LQHKEYREQs for each extra node group, like in an update in a fully replicated table.
Building new partitions#
When executing the SQL statement ALTER TABLE t1 REORGANIZE PARTITION, we will reorganize the table to use the new nodes in the cluster.
While building the new partitions, we will have triggers that ensure that operations on the table partitions being reorganized are also sent to the new table fragments. These fragments are normally present in new nodes in the cluster.
This follows the same type of protocol as for fully replicated tables. In this case, only rows that are moved to the new table fragments will execute those triggers.
Asynchronous operations#
Asynchronous operations are not the standard way of using the NDB API. They are, however, a very efficient method that has at times been shown to outperform other methods by as much as 10x.
In contrast to the standard NDB API, the asynchronous NDB API lets
the client send multiple transactions in a single function
call. The results of these transactions can then be polled in a separate function, pollNdb(). The advantage of this, as opposed to executing one blocking transaction per thread, is that:
- The transactions can be bundled into fewer TCP/IP packets
- We avoid the overhead of creating and destroying threads
The figure below shows the signal flow for an asynchronous operation:
In principle, the protocol is the same as before, but now everything is heavily parallelized. By sending 200 update transactions in a batch instead of 1-2 updates at a time, we can gain up to 10x higher throughput. Note that also spinning up 200 threads to execute 200 transactions in parallel would not be a good alternative due to the threading overhead.
Currently, only key operations are supported asynchronously.
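
As an illustration, the sketch below prepares a batch of single-row updates with executeAsynchPrepare() and then sends and polls them with sendPollNdb(), which combines the send with the poll mentioned above. The table t1 and its columns pk and val are hypothetical, and the Ndb object is assumed to have been initialized to allow enough parallel transactions (e.g. init(batchSize)).

```cpp
#include <NdbApi.hpp>
#include <cstdio>

// Callback invoked when one of the asynchronous transactions completes.
static void txn_done(int result, NdbTransaction *trans, void * /*arg*/)
{
  if (result != 0)
    printf("async transaction failed: %s\n", trans->getNdbError().message);
}

// Minimal sketch: prepare a batch of single-row updates and execute them
// asynchronously. Table t1 with columns pk and val is hypothetical.
int async_batch_update(Ndb &ndb, int batchSize)
{
  NdbTransaction *transactions[256];
  if (batchSize > 256) batchSize = 256;

  for (int i = 0; i < batchSize; i++) {
    NdbTransaction *trans = ndb.startTransaction();
    if (trans == nullptr) return -1;
    NdbOperation *op = trans->getNdbOperation("t1");
    if (op == nullptr) return -1;
    op->updateTuple();
    op->equal("pk", (Uint32)i);
    op->setValue("val", (Uint32)(i * 2));
    // Define the transaction but do not send it yet.
    trans->executeAsynchPrepare(NdbTransaction::Commit, txn_done, nullptr);
    transactions[i] = trans;
  }

  // Send all prepared transactions and wait until they have all completed.
  int completed = 0;
  while (completed < batchSize) {
    int n = ndb.sendPollNdb(3000, batchSize - completed);
    if (n == 0) break;              // timeout; give up in this sketch
    completed += n;
  }

  for (int i = 0; i < batchSize; i++)
    ndb.closeTransaction(transactions[i]);
  return (completed == batchSize) ? 0 : -1;
}
```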
Table scan operations#
The most basic scan operation is a full table scan. This is run in row-id order or in disk order in DBTUP; the latter is however currently disabled.
A range scan is a scan operation that places a range on an ordered index. A range filter on a column that is not an ordered index is not a range scan and will instead cause a full table scan. Range scans are by far the most common scan operations.
A partition pruned scan is a scan operation where an equals filter is on the partition key. This will cause the scan to only scan one partition. Any other scan will scan all partitions of the table. Therefore, this scan is the most efficient. This also makes it very important to choose the partition key carefully.
The second class of query operations that RonDB handles is scans of a table.
Basic scan operation#
All scan operations use the same base protocol. It starts by sending
SCAN_TABREQ
from the API to the chosen DBTC. DBTC discovers where the
partitions of the table reside. Next, it sends SCAN_FRAGREQ
in
parallel to all table partitions. An exception is for pruned scans that
only scan one partition. It is possible to limit the amount of
parallelism using a parameter in the SCAN_TABREQ
signal.
SCAN_FRAGREQs are sent in parallel to a set of DBLQHs that house the
partitions. DBLQH has a local scan protocol to gather the data using
DBACC, DBTUP, and DBTUX blocks. This protocol is local to one thread in
the data node.
Each signal execution of SCAN_FRAGREQ
in DBLQH will scan up to 16 rows
at a time and then return these rows via the TRANSID_AI
signal. It
then sends a signal to itself to scan the next 16 rows in a new
execution. Execution times range from about 2-3 microseconds up to 20-30 microseconds. Thus, even if a scan goes through 1 million rows, it will not cause real-time issues for other scans or key lookup operations. Such a scan will automatically be divided up into hundreds of thousands of signal executions. Even in this case, millions of
operations can execute on the same data in parallel with the scan
operation.
DBTC serializes communication with the API. The communication between
DBTC and the NDB API has one talker at a time. When one of the DBLQH
blocks sends back SCAN_FRAGCONF
to DBTC, two things can happen. If the
NDB API is still waiting for input from DBTC, DBTC sends a SCAN_TABCONF signal back to the API immediately. If we have already sent SCAN_TABCONF and
and
are waiting for the SCAN_NEXTREQ
signal to proceed we will queue the
scan results in DBTC. When SCAN_NEXTREQ
arrives we will immediately
send a new SCAN_TABCONF
to the API if any results were queued up. At
the same time, we will send SCAN_NEXTREQ
to the DBLQH blocks whose results have already been reported to the API.
SCAN_TABCONF
will normally report back rows read from multiple DBLQHs
at a time. How many depends entirely on the current system load.
SCAN_TABREQ
carries two signal parts. One part is the interpreted
program to execute on each row scanned. The second is an index bound
(only used by range scans).
The interpreted program carries out filtering and will report back
whether the row read was sent back to the API or if it was skipped. The
actual row data is sent directly from the ldm thread to the API using
the TRANSID_AI
signal.
The API can always send SCAN_NEXTREQ
with a close flag if it decides
that it has seen enough. This will start a close protocol of all the
outstanding fragment scans.
Similarly, DBTC can close a scan in DBLQH by sending a SCAN_NEXTREQ
signal with the close flag set.
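
From the NDB API side, this batching is what the nextResult() calls map onto: with fetchAllowed set to false, only rows already cached in the API are returned, while true sends SCAN_NEXTREQ to fetch further batches. Below is a minimal sketch of a range scan over a hypothetical ordered index idx_val on table t1; the bound constants follow NdbIndexScanOperation.

```cpp
#include <NdbApi.hpp>
#include <cstdio>

// Minimal sketch: range scan over a hypothetical ordered index idx_val on
// table t1 (columns pk, val).
int range_scan(Ndb &ndb, Uint32 low, Uint32 high)
{
  NdbTransaction *trans = ndb.startTransaction();
  if (trans == nullptr) return -1;

  NdbIndexScanOperation *scan =
      trans->getNdbIndexScanOperation("idx_val", "t1");
  if (scan == nullptr) { ndb.closeTransaction(trans); return -1; }

  // Lock-free committed read; parallelism and batch size use their defaults.
  scan->readTuples(NdbOperation::LM_CommittedRead);
  scan->setBound("val", NdbIndexScanOperation::BoundLE, &low);   // lower bound, inclusive
  scan->setBound("val", NdbIndexScanOperation::BoundGE, &high);  // upper bound, inclusive
  NdbRecAttr *pk = scan->getValue("pk");

  if (trans->execute(NdbTransaction::NoCommit) != 0) {
    ndb.closeTransaction(trans);
    return -1;
  }

  int check;
  while ((check = scan->nextResult(true)) == 0) {   // may send SCAN_NEXTREQ
    do {
      printf("pk = %u\n", pk->u_32_value());
    } while (scan->nextResult(false) == 0);          // drain rows cached in the API
  }
  if (check == -1)
    printf("scan failed: %s\n", trans->getNdbError().message);

  ndb.closeTransaction(trans);
  return (check == 1) ? 0 : -1;
}
```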
Pushdown joins#
Pushdown joins are implemented on top of the key lookup protocol and scan protocol. They use a linked join algorithm where keys for the tables further down in the join are sent to the DBSPJ block for use in later scans and key lookups. Columns that are read are immediately sent to the NDB API, where they are stored in memory while waiting for the remaining query parts that are fetched further down in the join processing.
Foreign Keys#
Foreign keys have two parts. The first part is the consistency checks that the foreign key defines. These checks are executed in a special prepare-to-commit phase that runs when the commit starts. If this phase is successful, the transaction will be committed; otherwise it will be aborted.
The second part is the cascading actions. These are executed exactly like other triggered operations. One special thing to consider about foreign keys is however that triggers can fire recursively: one small change is capable of triggering massive changes. Since this is usually not desirable, it is important to keep this in mind when defining your foreign key relations.
Table Reorganisation#
In the block DBDIH in the file DbdihMain.cpp, there is a section called Transaction Handling module. This describes exactly how the global data in DBDIH used for query processing is maintained. In particular, it describes how table reorganizations are executed in RonDB. It contains a description of the signal flows for this algorithm.
Metadata operations#
Metadata operations start with a phase where the metadata transaction is defined by using signals such as CREATE_TABLE_REQ, DROP_TABLE_REQ, ALTER_TABLE_REQ, CREATE_FILE_REQ and so forth. These signals are spread to all data nodes and are processed by the DBDICT block.
After defining the metadata transaction, there are a great many steps for each of those metadata parts, with several different phases to step forward through. There is also a set of steps defined to step backward when the schema change doesn't work.
These steps are implemented by a signal SCHEMA_TRANS_IMPL_REQ and its response SCHEMA_TRANS_IMPL_CONF. All the nodes step forward in
lockstep until all phases have been completed. If something goes wrong
in a specific phase, there is a well-defined set of phases to abort the schema transaction. Aborting a metadata operation is a lot more complex than aborting a normal transaction.
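
Such a schema transaction is triggered, for example, whenever a table is created, whether through SQL DDL or directly through the NDB API dictionary. A minimal sketch of the latter follows; the table t2 and its columns are hypothetical.

```cpp
#include <NdbApi.hpp>

// Minimal sketch: create a simple table through the NDB API dictionary.
// This results in a metadata (schema) transaction in the data nodes.
// Table name t2 and its columns are hypothetical.
int create_table(Ndb &ndb)
{
  NdbDictionary::Dictionary *dict = ndb.getDictionary();

  NdbDictionary::Table tab("t2");

  NdbDictionary::Column pk("pk");
  pk.setType(NdbDictionary::Column::Unsigned);
  pk.setPrimaryKey(true);
  pk.setNullable(false);
  tab.addColumn(pk);

  NdbDictionary::Column val("val");
  val.setType(NdbDictionary::Column::Unsigned);
  val.setNullable(true);
  tab.addColumn(val);

  // Triggers the CREATE_TABLE_REQ based schema transaction in the data nodes.
  return dict->createTable(tab);
}
```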
Cluster Manager operations#
The node registration protocol and the heartbeat protocol following it are described in some detail in the Qmgr block in the QmgrMain.cpp file in the RonDB source code. It contains signal flows for this algorithm.
Local checkpoint protocol#
The local checkpoint protocol consists of several protocols. The first step is a protocol using the signals TC_GETOPSIZEREQ/CONF and TC_CLOPSIZEREQ/CONF. These are used to decide when to start a new LCP based on the amount of changes in the data nodes.
We also have to acquire a schema lock for a short time to decide which nodes to integrate into the LCP. Only the tables defined at the start of the LCP will be checkpointed. New tables created after starting a local checkpoint will have to wait until the next local checkpoint.
One important part of the LCP protocol added in 7.4 is the pause LCP
protocol. This ensures that we can pause the LCP execution when a node
is starting up to copy the metadata in the cluster. There is a long
description of this protocol and the LCP protocol itself in the block
DBDIH, in the source code file DbdihMain.cpp. This protocol aims to
ensure that DBLQH stops sending LCP_FRAG_REP
signals that will change
the state of distribution information in DBDIH. This protocol is only
required during the restart of a node.
The above figure shows the execution phase protocol of a local
checkpoint. It starts by sending START_LCP_REQ
to all nodes. Before
sending this signal we acquired a mutex on the distribution information
in DBDIH. The aim of this signal is to set the flags on which table fragments will take part in this local checkpoint. Since we have locked the
distribution information all nodes will calculate this locally. After
completing this preparatory step of a local checkpoint each node will
respond to the DBDIH block in the master node with the START_LCP_CONF
signal.
After receiving this signal from all nodes we have finished the preparation of the local checkpoint. We can now release the mutex.
The next step is to start sending the signal LCP_FRAG_ORD
to DBLQH.
Each such signal asks DBLQH to checkpoint a specific table fragment.
Many such signals can be outstanding at a time; they are queued up if
DBLQH isn’t ready to execute them. DBDIH will send lots of those signals
to give DBLQH the chance to control the speed of executing the local
checkpoint.
Whenever DBLQH finishes the execution of a local checkpoint for a table
fragment it will send the signal LCP_FRAG_REP
to report about the
completed checkpoint. This signal is sent to all DBDIHs (one per node).
When the master DBDIH has completed sending all LCP_FRAG_ORD
to all
nodes, it will send a final LCP_FRAG_ORD
with the last fragment flag
set to true. This indicates to all DBLQHs that no more checkpoint
requests will be sent. Thus when DBLQH finishes executing the last
checkpoint in the queue and this signal has been received, it knows that
its part of the local checkpoint execution is completed.
After a DBLQH block has finished its local checkpoint execution, it will send the signal LCP_COMPLETE_REP to all DBDIH blocks. This signal will first go to the DBLQH proxy block; only when all local DBLQH instances have sent this signal will the proxy block send it to the local DBDIH block, which in turn will broadcast it to all DBDIH blocks. This signal marks that DBLQH has completed its part of the local checkpoint execution.
When a DBDIH block has heard this signal from all nodes and it has
completed saving the changes to the distribution information from local
checkpoint executions, it will send LCP_COMPLETE_REP
to the master
DBDIH block. This signal marks that DBDIH has completed its part of the local checkpoint execution.
When the master has received this signal from all DBDIH blocks, it will
send LCP_COMPLETE_REP
to all DBDIH blocks with a 0, indicating that the entire local checkpoint execution is now completed in all nodes.
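
To summarize the execution phase, the following is a highly simplified, illustrative model of the master DBDIH's role, with signal sending reduced to plain function calls. It is not the actual kernel implementation; see DbdihMain.cpp for that.

```cpp
#include <vector>
#include <cstdint>

// Illustrative model only: the master DBDIH side of the LCP execution phase.
// The real implementation is signal driven and far more involved.
struct FragmentId { uint32_t tableId; uint32_t fragId; };

struct LcpMaster {
  std::vector<FragmentId> fragmentsToCheckpoint;  // decided while the mutex is held
  size_t lcpFragRepReceived = 0;
  size_t lcpCompleteRepReceived = 0;
  size_t nodeCount;

  explicit LcpMaster(size_t nodes) : nodeCount(nodes) {}

  void startLcp() {
    // START_LCP_REQ to all nodes: each node flags the fragments to checkpoint
    // locally while the distribution information is locked. After all
    // START_LCP_CONF have arrived, the mutex is released and LCP_FRAG_ORD
    // signals are sent towards DBLQH.
    for (const FragmentId &frag : fragmentsToCheckpoint)
      sendLcpFragOrd(frag);
    sendLcpFragOrdLastFragment();   // "last fragment" flag set to true
  }

  void onLcpFragRep(const FragmentId &) {
    // DBLQH reports one completed fragment checkpoint to all DBDIH blocks.
    ++lcpFragRepReceived;
  }

  void onLcpCompleteRep() {
    // One node has finished its part of the LCP (reported via its proxy block).
    if (++lcpCompleteRepReceived == nodeCount)
      broadcastLcpCompleteRep();    // the entire LCP is done in all nodes
  }

  // Stand-ins for the real signal sending; hypothetical helpers.
  void sendLcpFragOrd(const FragmentId &) {}
  void sendLcpFragOrdLastFragment() {}
  void broadcastLcpCompleteRep() {}
};
```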