Transactions is a concept that was developed many thousands of years ago to ensure that business transactions could be done in a safe way, such that buyer and seller could trust that they will receive what they expect from the business transaction.
In databases transactions have been an important concept since the first DBMS was developed. Using transactions makes it easy to reason about changes to your data.
Basic transaction theory#
An important part of the transaction theory is ensuring the ACID properties. A stands for Atomic, thus a transaction is either fully done or nothing is done. If the transaction is committed all operations of the transaction is performed, when the transaction is aborted no part of the transaction survives.
C stands for Consistent. Today many different models of consistency are discussed, but when we specify that a database is ACID we mean that transactions can be seen as happening one at a time although they happen in parallel, this is called serialisable in database theory. Most relational databases implement at least a subset of the requirements of serialisable transactions. Fully serialisable transactions can easily become a bottleneck, most database systems have a consistency level which is slightly lower.
Serialisability makes it easy to reason about correctness of changes and correctness of your data. Having a model that is close to this makes it easier to develop your application.
Each step in relaxing the serialisability requirement means that one will have to consider this in developing the application as it can no longer be treated as serial events happening to one set of data.
I stands for Isolation. It means that two different transactions is isolated from each other, a transaction make changes based on the state at the start of the transaction and will not see any changes of transactions that haven't yet committed. It can see the result of committed transactions, even if those have started after it since DBMSs do not support serial execution, but rather serialisable execution. In principle this means that a transaction happens instantly at commit time. Thus the transaction that committed before it is viewed as it happened before it. Isolation means that we're allowed to see results of transactions happening before us, but not after us.
D stands for Durability, most early database users were banks and the database used for monetary transfers. It is important that each individual transaction is safely stored even in the presence of crashes and similar events.
RonDB is designed for ACID but does a few compromises. Serialisability is possible to achieve by the use of row locks. Many DBMSs ensure that ranges of rows can be protected. Given that the intended set of applications for RonDB had no immediate need of this type of range locks, this is not supported. A range lock would make it very hard to meet response time requirements and applications could easily be made less scalable by misuse of range locks. Implementing range locks in a distributed architecture will limit the scalability of the DBMS even with a correct application usage.
In addition RonDB supports a SQL mode called READ COMMITTED. This means that a transaction can read any row, even if it is currently being updated. If it is updated the read will return the latest committed value. This mode relaxes the serialisability requirement but ensures that reads can be performed without using locks. Thus reads in this mode have a very predictable latency which is independent of other transactions use of the rows that one reads.
Durability requires the transaction to be committed on disk. This is hard to meet while still meeting the response time requirements. An alternative model is used in RonDB. The transaction is considered safe if at least two computers have received the transaction at commit time. If multiple nodes crash simultaneously a transaction could be lost. The restored system will always be a consistent database and if it is important to not report a successful transaction before it is durable on all disks in all nodes, there are API calls that make it possible to wait for this durability level.
This type of Durability is called Network Durable.
Most applications using RonDB have many small transactions, each transaction have a value, they represent user changes and any disruption of the transaction model is a cost, but it is not the cost of a missed bank transfer of a billion dollar. RonDB strives for the highest availability requirements and durability requirements while at the same meeting requirements of response times in the order of parts of a millisecond.
Transactions in RonDB are fully atomic and transactions are isolated from each other.
Most DBMSs use transactions to update data. When updating data it is important to consider the state of the database at the time of the update. Those transactions require as much ACID capabilities as possible.
DBMSs are used for querying, analysing and many other things. Queries do not have the same requirements on consistency. A query could easily run for seconds, minutes or even hours. It isn't possible to use serialisable transactions for this.
There are two methods to support this. One method is that when scanning data you see the latest committed data, however the data isn't locked, so it might be concurrently updated. This model will always provide you with the most up-to-date version of the data. The returned data is not necessarily consistent with what the database contained at any specific point in time. This model is used by many DBMSs and will prioritise recentness of data for consistency. This mode is called read committed in MySQL. It is the model used in DB2 and normally in Microsoft SQL Server.
The other model is that the scan delivers the query results based on what the database content was at the start time of the transaction. This will deliver a consistent view of the database, but the view could be very old. Thus if a query runs for a minute, the query will not take into account anything of what changes have been made to the data in the last minute, it will only take into account those changes that were in the database one minute ago. This mode is called repeatable read in MySQL. It is used in MySQL/InnoDB and is used by the Oracle DBMS.
RonDB uses the read committed mode since RonDB is designed for real-time data. Thus it is important to consider any changes happening while the query is running. Supporting repeatable read would mean extra memory overhead, extra processing overhead and even the risk of running out of memory due to long-running queries.
Both of the models have advantages and disadvantages. The read committed is better for analysing data in real-time and is a more efficient implementation. repeatable read has advantages in analysing the data when the data isn't moving so fast and it is ok to provide a result that is based on old data.
RonDB is suitable for applications with fast-moving data.
Implementation of isolation can be done in numerous ways and the research literature is full of ideas on how to achieve isolation. The most common ways are pessimistic approaches that ensures that locks are held already in the prepare phase, the other set of popular approaches are optimistic approaches that hopes for the best and at commit time checks if the optimism was valid, if the optimism was valid and no transaction has conflicted with ours the commit is done, otherwise if a conflict is found, at least one of the conflicting transactions have to be aborted.
The most popular approaches uses locks both for pessimistic and optimistic approaches, but also timestamps, and also timestamps combined with locks are other approaches and there are many more variants.
When implementing a distributed DBMS it is notable that the selection of the implementation method for isolation have a large impact on all the recovery algorithms. Selection of the concurrency control implementation to a large extent drives the implementation of the recovery algorithms.
Research have shown that optimistic approaches works well when there is low contention, but in high contention these approaches tend to fall apart since no work gets done since there is too often a conflict.
Pessimistic approach perform less optimal in low contention but continues to do fairly well in high contention scenarios.
In RonDB the pessimistic locking approach was used. The main reason for this choice was simplicity. It is a lot easier to reason about things in recovery situations when one can hold a lock on a row, we make use of this both in node recovery situations and when reorganising a table.
The pessimistic approach is the most common approach in implementing distributed databases. But other approaches such as optimistic locking approaches are common as well. Another popular variant is the MVCC (Multi-version Concurrency Control) which combines timestamps and locks, it is an optimistic approach.
The most common approach when selecting the pessimistic approach is to use strict two-phase locking. What this means is that each transaction goes through two phases, one is the acquisition phase where locks are acquired. The second phase is the phase where locks are released. Using this approach one can prove that serialisability is achievable. The commit happens after the acquisition phase and before the release phase. In a distributed database the commit point is a distributed event as we will show later when describing the RonDB transaction algorithm, thus it's a tad more complex.
Using strict two-phase locking it is possible to reason about transactions using dependency graphs. A transaction T1 depends on a transaction T2 if T1 waits on T2 to release a lock. One important problem to consider when choosing this approach is deadlocks. A deadlock occurs when all of a sudden a set of transactions cannot proceed. In the dependency graph below we can see that there are 4 transactions and there are two deadlocks, first there is a cycle where T1 waits for T2, T2 waits for T3, T3 waits for T4 and T4 waits for T1. Additionally we see that T1 waits for T3 and thus we have a shorter cycle. Actually one can show that a deadlock occurs when a cycle is present in the dependency graph.
In order to get out of a deadlock the only method is to abort one or more transactions. In the case below we can abort T1, T3 or T4 to be able to proceed again. T2 isn't sufficient since it doesn't break the cycle between T1, T3 and T4.
To avoid deadlocks one can use two approaches. The first is to avoid deadlocks, the second is to detect deadlocks and select an appropriate candidate for abort.
The most well-known method of avoiding deadlocks is to take locks in an order that all transactions uses. As an example we might have transactions that are required to take locks on one row in each table T1, T2 and T3. If all transactions take locks in the order T1 followed by T2 followed by T3, no deadlock is possible since no cycles can be present in the dependency graph. If one transaction decides to take a lock on a row in T3 and then one in T2, immediately a deadlock is possible.
The ordering of rows can happen in many different ways other than by table. Thus the application developer have the biggest influence on whether deadlocks will occur or not. One can have deadlocks that arise due to implementations of unique indexes, foreign keys, BLOBs, recovery and so forth.
Algorithms to detect deadlocks can be shown to be NP-complete. This means that it isn't possible to construct an algorithm with cost that is linear to the problem size. This means that the cost of detecting deadlocks increases exponentially as the number of potential deadlocks increases.
The practical consequence of this is that all DBMSs that can be involved in deadlocks have to implement a time-out based abort scheme. Thus deadlocks are not detected per se, rather the stop of progress of a transaction is detected and this leads to an abort of the transaction. This is safe and easy method of handling deadlocks. It means that some transactions will be aborted that were never involved in deadlocks. In a distributed database this can be used to detect transactions that have stopped progressing due to a node failure.
In practice it could be a good idea to discover the simplest deadlocks, the most common are the once having 2 participants, by avoiding these deadlocks to take time to discover we are removing one obstacle from the database. Trying to discover more complex wait graphs can easily consume too much CPU and network resources since the number of messages needed quickly grows. Even more so with more than 3 participants.
RonDB currently implements a time-out based deadlock detection algorithm.
The next problem to consider in concurrency control based on strict two-phase locking is dirty reads. Dirty reads happens when a transaction T2 reads the non-committed values from a transaction T1 not yet completed. This means that T2 can read states that will never be committed, either T1 might update again the same row or the transaction T1 might be aborted in which case the value T2 read, never existed.
Using strict two-phase locking dirty reads will never happen as T1 holds an exclusive lock on the row and thus T2 cannot read it. As discussed before RonDB supports the mode read committed in which case the transaction T2 will read the row which T1 is currently updating although it is exclusively locked. We ensure that those reads are not dirty in the sense that they read uncommitted values. They will always read the row as it existed before T2 started updating it. In this manner a dirty read as defined by the NDB API is a read of the latest committed value. This is what a normal SELECT query in MySQL will use when in the mode read committed. To use locking one uses SELECT .... LOCK IN SHARED MODE in which case all rows that are read will be read locked until the end of the transaction. It is possible to use SELECT ... LOCK IN EXCLUSIVE MODE to acquire exclusive locks on the rows read.
The next problem in concurrency control to consider is unrepeatable read. This happens when T1 reads the value of a row, then T2 updates the value and commits, next T1 reads it again and now it has changed. Using strict two-phase locking this isn't possible since the read lock from the first lock will prevent T2 from writing the row. Using read committed mode means that T1 might read the latest committed row and thus in this mode unrepeatable reads can happen. To prevent this one has to use SELECT ... LOCK IN SHARED MODE again from SQL.
MVCC (Multi-version Concurrency Control) does allow for reads to be repeatable, however if someone went in an updated the row in the middle of the transaction, it must be aborted to ensure a serialisable execution. If we ignore this problem we are running in a mode called repeatable read. This mode doesn't ensure a serialisable execution, it simply means that we can always repeat a read inside a transaction, in MySQL it means that we will perform all reads based on the value that the rows had at the start of the transaction.
RonDB currently doesn't support the MVCC mode as discussed.
As we have touched upon there are many different dependencies we can have. We can have a transaction that reads a row and another that writes the row. These can conflict, similarly for two writing transactions. two reading transactions cannot conflict since they do not change the value. Thus strict two-phase locking differs between shared locks and exclusive locks. Shared locks can be held by many transactions in parallel whereas exclusive locks can only be held one at a time.
Other locking thoughts#
One could use a special lock for increments and decrements. This lock would conflict with shared and exclusive locks, but would not conflict with other increment locks. In a generic database used for SQL and many other use cases the difficulty in discovering those operations and the fact that it requires lock upgrade is an issue. We haven't implemented this approach in RonDB.
The final thing we will discuss here is lock upgrades. Upgrading a lock from shared lock to an exclusive lock is an operation which is similar to getting a new lock, however it blocks others from touching the row, lock upgrades increases the risk of deadlocks more than simply taking an exclusive lock.
This is why MySQL supports SELECT ... LOCK IN EXCLUSIVE MODE to ensure that one can acquire the exclusive lock immediately and not perform any later lock upgrades. Using the NDB API one can also control the locking mode through the API calls.
Since RonDB is a distributed database any transactions must be implemented as distributed transactions. The most common method to implement distributed transactions is the two-phase commit protocol. This means that first all participants are prepared, then if all participants decided to accept the coordinator, they will decide to commit. This approach have a few problems, first of all if the coordinator fails while doing a commit the transaction is blocked until the coordinator comes back. The second problem is that this approach can easily cause deadlocks by sending lock requests to all replicas in parallel. Thus we can easily end up in a deadlock situation although the application is designed to avoid deadlocks.
To handle this we do two things, first to solve the blocking state when the coordinator dies, we implement a coordinator take over algorithm. Each time a transaction coordinator fails another node will collect the transaction state from the participants and will ensure that the transaction is completed. If this coordinator fails another coordinator will take over until we run out of nodes in which case the cluster have crashed and then we will perform cluster recovery.
The second problem we solve by combining the normal two-phase commit protocol with the linear two-phase commit problem. Linear two-phase commit means that we send the prepare phase in linear order, thus the transaction is started by the first node and it is committed by the last node. This approach decreases the number of messages needed to commit a transaction. It also ensures that the transaction protocol doesn't create any deadlocks in itself.
We keep one coordinator, but we use linear commit for each row update. Thus we always start the prepare phase by going to the primary replica first, then we go to the first backup replica and so forth until we reach the last backup replica from where we go back to the coordinator.
When committing we move through the nodes in reverse order. Thus we reach the primary replicas last in the chain as seen in the figure.
The interesting thing is when the actual commit point happens. As it turns out the actual commit point is when the commit message reaches the first primary replica. At this point the only thing that can make the transaction fail is a complete cluster failure. Since the cluster will survive as long as not all nodes in a node group fails, thus when all replicas have received the commit decision then at least one live node will know about the commit decision even in the presence of multiple node failures. If the cluster fails another algorithm will come into play, this algorithm is called the global checkpoint algorithm. It ensures that we will always recover a consistent database even in the case of a cluster crash.
In the coordinator take over protocol all nodes will be queried on any state of transactions from the crashed node. Thus we will not miss any knowledge about committed state information even in the presence of node failures. The only problem occurs when the API node fails, then there is no one to report the outcome of the transaction to.
Thus the commit point becomes theoretically well defined, but not easy to pinpoint practically.
The commit and complete messages are small messages, to avoid paying the price of the full overhead of a message we use a message called PACKED_SIGNAL, this message contains a few types of messages including COMMIT, COMMITTED, COMPLETE, COMPLETED and the response back to the coordinator in the prepare phase (called LQHKEYCONF) are all such piggybacked messages. This means that we save a lot of network bandwidth. In addition the RonDB sending is performed for a set of signals and not just one signal. Thus we have double levels of piggybacking of messages in the RonDB transporter module. Thus when the transaction throughput increases, the RonDB efficiency increases.
This has the nice benefit that as RonDB nears its saturation point it becomes more and more efficient. This is very important in building a highly available system. We will show more reasons for how RonDB becomes more efficient as the load increases later in this book.
The commit protocol can deliver the acknowledge message to the API already after the commit phase and before starting the complete phase. At this point we know that the commit is safe and it can be communicated.
For tables where we want to support the ability to read from the backup replica we will wait until the complete phase is done before we send the acknowledge message to the application.
The reason is that there are still locks on the backup replica. Thus it is possible to start a read in the next transaction after the commit is done that will not see the updates we did ourselves. To avoid this anomaly we will wait a bit longer for tables that support read from backup replica. Read from any replica will be used for any reads that reads the latest committed results. It will not be used for locking reads, these will continue to use the primary replica for reading.