Many SQL queries are performed with some amount of parallelism in RonDB. It happens when the SQL queries uses scans that are not limited to scanning a single partition.
It can happen as part of table scans, it can also happen as part of more complex scans that are linked together. These linked scans is a variant of a parallel join algorithm. The scanning of data in the data nodes can be parallelised to the same level as the number of ldm threads in the cluster. The result rows are sent to the single MySQL Server that has a single user thread that is executing the query, but much of the filtering of the rows can be parallelised.
This parallel join algorithm is a bit more advanced compared to a parallel nested loop join. The best descriptive name for it is to call it a parallel linked scan algorithm.
Full table scan in RonDB#
Scans comes in a number of flavors, the first one is the full table scan. It will read each and every row in the table and apply a search filter on each row to see if it should be returned to the application. These scans are automatically parallelised such that they execute on a set of partitions in parallel. For example if the table contains 1 million rows and is divided into 8 partitions, eight concurrent scans will be started that each will scan around 125k rows. Conditions from the SQL query can be pushed to this scan such that only a subset of the rows is returned to the MySQL Server. When the row returns to the MySQL Server it has to be evaluated by the single thread that the SQL query is executed within, but during the table scan we can have up to eight CPUs running in parallel on filtering out the rows to be returned. This happens automatically, no special action is needed to make this happen, it is the default behaviour.
We will go through in more details later what conditions that can be pushed down as part of a scan (the same condition pushdown can be applied on a primary key or unique key request to flag if the read row is to be returned to the application).
Range scan in RonDB#
Next the same thing applies to range scans. Ordered indexes in RonDB are implemented per partition. Thus in order to perform a range scan the same scan algorithm is used as for full table scan except that we scan in index order between two boundaries. RonDB can also start multi-range scans using one scan operation. These scans can also have conditions pushed down and the scan can be parallelised.
Partition pruned range scans#
We have a special case which is very important, this is the partition pruned range scans. In this case the range scan is not performed on all partitions, it is rather performed only on one partition. There is an easy condition for when this pruned partition scan happens. As part of the condition the partition key must be fully set. For example in TPC-C there are lots of tables that contain a warehouse id. One manner to set up the tables in TPC-C is that all tables use the warehouse id as the partition key. Now any query that has a condition that contains warehouse_id = variable/constant will now be able to perform a partition pruned range scan. By retrieving the warehouse_id from the query we can figure out which partition to scan. We know that the data cannot be in any other partition and we can thus prune away all the other partitions.
This technique is what almost all Big Data applications is using when they are sharding. With RonDB it is possible to set up the cluster tables to be easily sharded. If all or at least most of the tables in a database or application uses a common partition key this becomes a sharding key of the application inside RonDB.
As an example in HopsFS where a file system meta data layer was implemented a clever partition key was selected that made it possible to ensure that almost all scan queries, e.g. the scan used to perform ls (list the files in a directory in Unix) was always directed to only one partition and thus no extra overhead was created in starting up many parallel scans since each parallel scan only accessed one partition each.
In a number of situations we want to perform queries that span the entire data set. In this case we are most likely analysing the data and we are scanning major portions of the data to find some specific property of the data. In this case the fact that scans are automatically parallelised is good and decreases the latency significantly in processing complex queries.
In the descriptions below we use the term scan, but often a scan is a key lookup. In the presentation below this would simply be a special scan that can return 0 or 1 rows.
Now in RonDB we have made another step forward. We implement a technique where these scans can be linked. Say that we start with a scan of table t1, this scan provides 10.000 rows. Now we want to join the table t1 with the table t2, from each of those 10.000 rows we get a key that is used in the condition for table t2 and we perform a new index scan for each of those 10.000 rows. This index scan can also be a key lookup, it can also be a full table scan and the scan can also be partition pruned such that it only scans one partition. Let's say that each of those 10.000 scans now gives us 3 more rows back, thus we now have 30.000 rows from t2. Now we can continue and use data from table t2 to join it with a table t3 in exactly the same manner.
Using this technique we can execute arbitrarily complex queries. In our testing of this we've used heavily a query that we got from an early user of RonDB. In the early days of RonDB this query took something like 6 seconds to execute (in those days MySQL had almost no batching techniques, the query was more or less executed one row at a time). Now in RonDB using the normal technique of single table scans performed in sequence and always returning to the MySQL Server the query takes 1.2 seconds to execute in a very small cluster with one data node and one LDM instance. Using condition pushdown we are able to avoid much of the interaction with the MySQL Server and most of the filtering is done before sending it to the MySQL Server. Using these techniques the query can be executed in 0.2 seconds instead. With a few more ldm threads, e.g. 4 of them the query is executed in 50 milliseconds. This query scans and analyses many tens of thousands of rows as part of the query execution, mostly in the data node.
One can perform more odd variants of joins using this technique, so for example if one starts by reading table t1 and based on the data retrieved in table t1 we can immediately start fetching the data in table t2, table t3 and so forth.
Limitations of pushdown joins#
The conditions used in joining tables that are part of the pushed down query must use equality conditions. The columns used in those equality conditions must be of exactly the same data type. It is currently not possible to use pushdown joins if any of the columns retrieved by the query is a BLOB column. Subqueries can sometimes inhibit the pushdown of the join from happening. One can use show warnings after executing the query to discover any problems in handling the pushdown of the join. Likewise the EXPLAIN command can be used to discover exactly how the pushdown join is handled.
Pushdown of joins can only occur in read only queries that are not using any locking constructs such as LOCK IN SHARED MODE or other similar construct. The join execution must be lock-free, so it cannot lock rows as part of the join execution which would be required if it is part of an updating transaction where the result can be used to update rows in a scan takeover.
Details of pushdown joins#
A complex join can be divided into multiple parts that are pushed down to RonDB. As an example assume that the MySQL Server comes up with a plan that does a 5-way join, for each row in the inner join of this 5-way join we have a subquery that is in itself a 4-way join. Now the 5-way join can be handled with a scan on the first table and for each row in the first table we execute a pushed down 4-way join to the NDB API.
Thus even if there are specific parts that makes it impossible to handle the entire query in the RonDB data nodes, we can still send parts of the query at a time down to the RonDB data nodes.
The MySQL Server does the query optimisation before the joins are pushed down to the RonDB data nodes. The query optimisation doesn't take the distribution aspects into account. However once RonDB decides to push a join down to RonDB data nodes it can decide using some heuristics to reorder the join order of the tables.
Aggregation, group by and order by processing is currently not handled by pushdown joins. Thus all the rows to be put into the group by processing must be returned to the MySQL Server as part of the join processing. This makes the MySQL Server user thread a bottleneck for complex queries using group by, at least if many rows are used to calculate the aggregates. The parallel filtering can still make those queries go significantly faster using pushdown join.
Pushdown of joins and fully replicated tables#
Fully replicated tables are a bit special, these tables are replicated in all data nodes. Thus a join where all tables are fully replicated can be handled in a completely local execution where the entire query execution is processed within one data node.
The same is true for a table using the read backup feature and executing in a cluster with only one node group (e.g. 2 nodes with 2 replicas).
This means that we can execute the entire query without performing any communication with other computers if the MySQL Server and data node are colocated.
A normal join execution starts up a number of parallel join execution parts that each take care of a subset of the query. This is handled by limiting each subquery to only scan a subset of the partitions on the first table. The result of the query in this is case is a union of the subqueries and can thus easily be merged back again. E.g. the first query part could take the first partition of the first joined table and join this table with the full tables of the remainder of the joined tables.
The join execution for each query part will be controlled from the node where the primary replica of the partitions used in the first table resides. This is independent of if this first table is fully replicated or not. The remainder of the reads will always go to the same node if a replica is stored on the node, otherwise it will go to the nearest replica available.
Execution of pushdown joins will often spread over all nodes to make use of all nodes and ldm threads in the cluster.
RonDB can be limited by Gigabit Ethernet very easily and even with 10G Ethernet it is fairly easy to overload the network in a dual socket server. Thus it makes a lot of sense to try to ensure that queries are executed as much as possible in the same node as the query was issued from.
Pushdown of conditions to the RonDB data node can happen in cases where we compare with constants either for equality, greater than and so forth and ranges of the same. We can perform checks if the column is NULL or not, we can perform like conditions and we can compare string values if they are of the same collation.
This can be used both for full table scan, index scans and key lookups and it can be part of a pushdown join execution. It is a very powerful concept when it gets used.