For an application developer it is important to understand the concurrency control model used by a DBMS. There are two approaches to concurrency control, there is an optimistic approach and there is a pessimistic approach. The optimistic approach is used when low number of conflicts between changes are expected. The optimistic approach goes ahead and does the updates without checking for other updates. At commit time the transactions are checked for conflicts, if any two transactions are in conflict one of them has to be aborted.
The pessimistic approach tries to ensure, already before commit, that we don't cause any concurrent actions on the same object to occur. Most commonly this employs a method called strict two-phase locking. This means that there is first a lock phase where the applications grab their locks and after getting all locks the transaction is committed (or aborted) whereafter all locks are released whereafter no one is allowed to lock any more rows in this particular transaction.
The rule of strict two-phase locking is that no locks are taken after the first row has been unlocked.
The pessimistic approach have much higher chance of handling a highly concurrent system better.
There is a great variety of different methods for both optimistic and pessimistic algorithms. Most of them is based on either locking or some timestamp method. It is one of the most well researched topics in the area of DBMSs.
So what to use? We opted for the pessimistic approach. The main reason for this was a number of factors. I read many, many research articles on the topic of concurrency control and there are tons of research papers in this area. The concurrency control mechanism used in most DBMSs is the strict two-phase locking.
The optimistic approach have a serious impact on application development. Any DBMS using the optimistic approach have to handle many more aborted transactions. In the pessimistic approach the only aborted transactions that comes about from concurrency control is due to deadlocks. These are much less frequent than conflicts. A conflict using the pessimistic approach causes a delay, but not an aborted transaction. Applications using the optimistic approach have to be written with the understanding that aborts are common, this makes it even more difficult to port an application to a new DBMS than otherwise would be the case (it is difficult enough as it is).
In addition the optimistic approach has the disadvantage that the CPU cost per transaction increases as load increases and the chance of conflicts increase.
Using the optimistic approach would have had an impact on all recovery algorithms. It is a lot easier to reason about recovery algorithms when using locks compared to when using the optimistic approach. Given that our focus is online algorithms for all metadata changes, it made a lot of sense to try to use simpler algorithms. As an example the ability to lock a row for a short time is used in the node restart algorithm when synchronising the live node with the starting node. This short lock makes it very easy to reason about the node restart algorithm.
For these reasons we opted to use the standard strict two-phase locking approach that acquires locks in the prepare phase and releases the locks in the commit phase and continues to release the locks in the complete phase.
The approach in RonDB is to use row locking. We always access records using a hash index when any type of lock is needed. We can scan using the hash index, but we can also scan in row order and in disk data order. In the two latter cases we have to ask the hash index for a lock on the row for each row we scan when a lock is needed.
In the hash index we implemented our row locks. If a scan comes from an ordered index and needs to lock the row it will use the hash index to get the lock. Most DBMS use a hash index to implement the lock data structure, in our case this hash index is also used as an access data structure, not just to handle locks.
RonDB has no support for range locks. This is required to fully support serialisability. Thus we support serialisability on everything except range queries. Our analysis of the applications that RonDB focused on suggested that serialisable mode for range queries was of limited use in the analysed applications.
Supporting range locks in a distributed database imposes serious scalability problems. Locking a range using hash partitioning requires locking the range in all nodes and the alternative to use range partitioning only supports range locks on the columns in the partition key. Range partitioning is much harder to make scalable in a distributed system. This is a major reason why most classic SQL DBMSs use a shared disk implementation to scale the DBMS. Shared nothing have a number of algorithms that are harder to implement such as range locking, deadlock handling and so forth.
The choice between shared nothing and shared disk is a trade off where you have to select what matters most to you. When implementing RonDB what mattered most was the time to handle a node failure. With shared nothing we were able to get this down to a small number of seconds and even below a second in a real-time environment. For shared disk it is hard to get this number down this far given that all nodes do not have access to the latest updates, the latest updates can only be retrieved from the logs. Thus shared disk can never be as fast to recover as shared nothing.
Another benefit of shared nothing is that it doesn't require a shared nothing file system. Interestingly all shared disk implementations are implemented on top of a file system that is more or less a shared nothing database. This is why it makes sense to build a file system on top of RonDB.
When developing RonDB we decided to go for shared nothing and thus decided that we will not support range locks in all situations.
If the application has special needs to lock certain parts of the database, it is possible to implement this by ensuring that the access to those parts always go through a specific record. Thus special records can be used in the same fashion as a read-write mutex is used in lower-level designs.
In essence it is possible to handle all locking needs using RonDB, but it might require a bit more work on the application level to get the most fancy lock variants.
A good example of this is HopsFS that had a special problem where they needed to support the mv command. This moves a whole hierarchy of directories. This was implemented by performing the move step by step and using application level locks while doing so. Every DBMS is implemented for a sweet spot and the engineers have to decide on which requirement is the most important for them. The aim of RonDB is to meet the highest availability requirements and at the same time extreme performance requirements for key lookups. We are aiming long-term to provide a very good parallelisation of complex queries since this is such a good fit in the RonDB architecture.
For the same reasons no sharded NoSQL system supports range locks. Since RonDB is a distributed system that automatically supports up to 24 shards using 2 replicas, it is a natural choice to not support range locks in RonDB.
In conclusion all rows in RonDB can be locked either exclusively (when updating or when using exclusive lock mode for reads) or locked for reads. It is possible to read the latest committed value without using locks.
DBMSs mainly uses two different consistency models. The first model is the one implemented in RonDB that performs read using read committed mode. The second popular consistency model in DBMS is to use repeatable read mode. We will explain those two variants, they both provide a consistency level that uses transactions to update the data.
Most of the NoSQL systems employs something called eventual consistency. Eventual consistency means that updates are not transactional, eventual consistency is more about replicating changes. In NoSQL systems all updates are first employed on a master node, after that various methods are used to replicate those changes to other nodes in the system.
Eventual consistency can not be combined with distributed transactions. One can keep the locks in the master node for such a long time that the changes are also employed in the nodes replicated to. But the term eventual consistency only promise that each node replicated to will eventually be consistent with the master node. Thus the only node where you can see the latest changes are in the master node.
InnoDB Cluster, Galera Cluster, MongoDB are examples of systems that uses eventual consistency. In InnoDB Cluster the other nodes have received the logs of the changes when the transaction is committed, but the other nodes have not yet updated the actual data at commit time. Reads that are sent to other nodes than the master node will not see the latest writes.
Thus a major difference between systems employing eventual consistency is that the application have to know in detail what the concurrency implications are due to the eventual consistency. This makes it a lot harder to implement a scalable application using NoSQL systems compared to using RonDB.
We will now try to explain how read committed differs from repeatable read and why the choice for RonDB to use the read committed mode was a natural choice.
Repeatable read means that the query results will be the result that would result if the database was frozen at the start time of the query. This means that the query can be repeated again and again and the result remains the same. During the query execution there is no locks on the database, what happens is rather that the DBMS have to save the old row until the query is completed.
This have two consequences. The first consequence is that the database size will grow, rows have to be kept until all queries have completed that was started at the time when the row change happened. The combination of long-running queries and high update rates will cause a high increase of storage needs. With RonDB we have a focus on in-memory data, thus we would get too much RAM used for old rows.
Additionally to use repeatable read requires adding a timestamp to each row, thus even more use of memory resources. It adds CPU overhead to handle all those extra old rows although this is not a major issue.
The second part is that repeatable read will not deliver results that is based on the latest changes. Thus repeatable read is completely focused on delivering consistency at the cost of recentness. The read committed mode focuses on delivering the most recent changes at the cost of complete consistency of the result.
Delivering a completely consistent result and up-to-date result can only be achieved with massive locks in the database, thus this is not an alternative for any scalable application.
So which mode is most appropriate is dependent on the application. But for the type of applications that RonDB was developed for, the read committed mode is the preferred one.
Read Committed mode#
The model used in RonDB is called read committed mode. Thus we will read the latest committed row. There could be one or two row versions at any point in time (except for transactions that update the same row many times). There is one row when the row isn't currently updated and there is two rows when a transaction is currently changing the row.
A read committed read will read the old committed rows in cases where the user is currently updating the row. These reads are lock-free.
When performing a read of the database in read committed mode we will always read the latest committed row, so even when running a long-running query the result of the query will be impacted by changes in the database.
Locking and Unique indexes#
Unique indexes are simply a special table that have the unique key columns as primary key and the primary key of the main table is the remaining columns. Thus we can with one key lookup translate the unique key to the primary key.
When we read the unique key we use a shared lock to ensure that no one is allowed to delete or insert the row we are looking for while we are performing this translation. As soon as we have found or completed the read of the main table we can release the lock on the unique index row. The NDB storage engine has special code to ensure that we release those unique key locks immediately after returning from the read/write of the main table.
We need to use this shared lock also when the read of the main table is using the read committed mode.
Reading through the unique index and updating the main table can happen in parallel. Given that those comes from different directions it is possible for them to cause a deadlock. These are normally rare, but frequent inserts and deletes of rows in a table using a unique key and at the same time reading this table using the unique key can cause deadlocks to occur. Deadlocks cause no harm other than timing out, but constitutes a problem to consider to achieve predictable latency of an application.
The best use of unique keys is to use them primarily for constraint checks. In this case they are only used during inserts and updates to ensure that only row with a given unique key exists. In this case we cannot have deadlocks due to reads using unique key since we never use such reads.
Locking and BLOB tables#
Locking for BLOB tables is a bit special since the BLOB consists of multiple rows in RonDB. The first row accessed is always the main row, any changes of any BLOB requires holding an exclusive lock on both the main row (even if no part of it is updated) and on the BLOB parts that are updated.
Reading of a BLOB is a bit special, especially when using the READ COMMITTED mode. Normally this mode requires no locks for reading since we always ensure that the read is reading the latest committed values. For BLOBs this method doesn't work since reading rows with BLOBs requires reading multiple rows.
The manner used here is that we take a shared lock on the main row of the BLOB table. Next we read the BLOB parts one by one. By holding the shared lock on the main row we ensure that no one is allowed to change any BLOB parts while we are reading the BLOB parts.
In addition the reads on the BLOB parts use a special lock mode when reading the BLOB parts. This mode takes a lock while reading the row, but releases the lock before sending the result back to the NDB API. This has the effect that we ensure that the read is consistent, if we would have read the BLOB parts using the latest committed we could read an old version of the BLOB through a race condition.
The short-lived lock is almost certain to be successfully acquired immediately, but in rare situations it might have to wait for some commit messages from an old transaction to arrive before it can start the read.
The effect of this is that BLOB tables have less concurrency than ordinary tables, but they will still be consistently read and updated.
Another positive effect of the above is that reads of BLOB tables are always done using any replica independent of the table mode for READ_BACKUP.