Disk columns#

Disk columns are columns that are stored on disk and are only cached in main memory when reading and writing them. Columns stored in main memory are pinned to a main memory location at startup and stay there as long as the node is up and running. Both the main memory columns and the disk columns are still checkpointed and written to disk to be recoverable.

The main reason we decided to implement disk columns was to extend the size of databases that we can support. We have so far decided not to spend time on implementing disk-based indexes. The reason is that our focus has been on supporting applications that require predictable latency for queries and updates while maintaining very high availability. While disk-based indexes would extend our reach to new users, they would not add to our unique selling points, which are very high availability and predictable response times in a strongly consistent distributed environment with high throughput.

With the development of persistent memory, fast NVMe drives and SSDs, the picture is quickly changing. Persistent memory modules, which came out in 2018 with Intel Cascade Lake, make it normal for a server to have 1 TByte of memory that can be used by RonDB.

The development of SSDs and NVMe drives plays very well into our unique selling points. Given that they can respond in tens of microseconds and can sustain hundreds of thousands of IO operations per second (IOPS), we can use them as main storage for many more columns.

In RonDB 21.04 disk columns are always stored using a fixed format. In RonDB 22.10 disk columns are instead stored in the same fashion as in-memory columns. Thus variable sized columns only use the storage they require, which makes disk columns use considerably less disk space as well as less space in the disk page buffer memory (the page cache for disk columns).

Cloud machines like the ones provided through the Oracle Cloud are perfect fits for running large RonDB installations with very low and predictable latency as well as very high availability, using our new features for making use of the availability domains in the cloud. These machines come equipped with 768 GByte of memory and tens of TBytes of NVMe drives.

It is strongly recommended that all tablespaces are created on SSD or NVMe devices. The reason is simply that this brings a thousand times better throughput and a hundred times better latency compared to using hard drives. Access to disk for disk data is always one page at a time (currently 32 kBytes in size), so access to the disk is always random access. Random access to hard drives is limited to around 200 accesses per second, which is very unlikely to be sufficient for normal RonDB applications. For an SSD it is possible to achieve tens of thousands of random accesses per second, and for NVMe devices it is even possible to achieve hundreds of thousands of random accesses per second. Clearly one should avoid using hard drives for disk columns in production.

Hard drives are still an acceptable fit for REDO logs, UNDO logs, backups and local checkpoint files. The reason is that those are strictly written and read using sequential access, so hard drives can sustain their full disk bandwidth. Hard drives are also less likely to wear out due to constant writing.

For larger systems where write speeds beyond 100 MBytes per second are required, SSDs are needed to sustain the high write speeds for logs, backups and checkpoints.

Many applications store fairly large chunks of data in BLOBs and in variable sized character strings. There can also be large amounts of statistical data stored as integers in the database. Often these columns are not indexed; such columns represent natural use cases for disk columns in RonDB.

One remaining limitation of disk columns is that it isn't possible to add new disk columns in an online ALTER TABLE statement. It is also not possible to move a column from disk storage to main memory storage in an online ALTER TABLE statement.
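
A hedged sketch of this limitation, assuming a table t1 that already has a tablespace assigned (such as the t1 created in the syntax section further below); the ALGORITHM clauses follow standard MySQL behaviour:

-- Expected to be rejected: disk columns cannot be added as an online (in-place) operation
ALTER TABLE t1 ALGORITHM=INPLACE, ADD COLUMN c INT STORAGE DISK;

-- A copying ALTER TABLE is needed instead
ALTER TABLE t1 ALGORITHM=COPY, ADD COLUMN c INT STORAGE DISK;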

Row format for disk columns#

The main memory row has a reference to the disk data part of the row. This reference is kept consistent by the recovery algorithms such that the two parts stay connected also after recovery. The disk part also has a back reference to the main memory row. The reason for this is to ensure that we can perform scans in disk order. Thus we read the pages on disk and find the main memory part, rather than the opposite (this improves sequential reading from disk).

The disk data part of the row is currently always a fixed size row. All fields, including the variable sized ones, are stored as fixed size columns, with the column size equal to the maximum possible size of the column.

Tablespaces#

The actual data in a disk column is stored in a tablespace. Each table that has disk columns uses one tablespace. There can be several tablespaces in a cluster. Each tablespace exists with the same size on all data nodes in the cluster. The relative file name of the tablespace file(s) is the same on all machines.

Tables allocate extents from the tablespace. The size of an extent is decided when the tablespace is created. By default the extent size is 1 MByte. Once an extent has been allocated to a table it stays with this table until the table is dropped and a local checkpoint after the drop table has been completed.

All disk data rows are stored in pages of 32 kByte in size. The pages move back and forth between disk and main memory as complete pages. The pages are maintained by the Page Manager, which uses a page cache to store the pages while in memory. The size of this page cache is set by the RonDB configuration option DiskPageBufferMemory.

Each tablespace contains a few pages describing each extent; to support fast decisions on where to place new rows, these pages are always kept in the page cache. The extent pages have a 4-word header for each extent and in addition they have 4 bits per page in the extent providing some state information about how full the page currently is.

This means that with a 4 TByte tablespace using 1 MByte extents we will have 4M extents in the tablespace, and each extent will use 20 bytes in main memory. Thus the extent pages will use about 80 MByte of the page cache.

It is recommended to have at least a few gigabytes of page cache when using disk columns at a larger scale. When pages are involved in updating transactions they are temporarily locked into main memory when the commit operation starts and remain locked until the UNDO log has been written to disk. This is to ensure that we follow the WAL principle of writing to the log before we write the actual disk data pages.

For writes to the data pages we use the NO STEAL approach. This means that we only write committed changes to the data pages. During the transaction the data resides in memory buffers; only at commit time are the changes written into the page cache, and even later the data is flushed to disk.

Tablespaces consist of one or more data files. The size of an existing data file cannot be changed, but new data files of any size can be added to a tablespace as online operations.

UNDO log#

The disk data uses the WAL principle (Write Ahead Logging). Thus before we write any disk data page we must write an UNDO log record to ensure that we can roll back the pages to the state at the start of a local checkpoint.

The UNDO log files are stored in a special tablespace called Logfile Group. At the moment the implementation only supports one Logfile Group. There can be multiple log files in this Logfile group and new log files can be added as online operations.

Syntax for disk data#

When using disk data we first need to create an UNDO logfile group. Second we need to create at least one tablespace that uses this UNDO logfile group. After that we can create the tables that make use of this tablespace.

The managed version of RonDB always creates a logfile group with the name lg_1 of size 4 GByte (using the configuration parameter InitialLogfileGroup) and a tablespace called ts_1 of size 8 GByte with an extent size of 16 MByte (using the configuration parameter InitialTablespace). The UNDO log buffer size is also handled automatically in the managed version of RonDB.

The sizes of this UNDO log and tablespace are too small for larger use cases. If you are using the managed version and are planning to use large amounts of disk columns, you should add new UNDO log files to the lg_1 logfile group and more data files to the ts_1 tablespace.

When creating the data node VMs one can set the size of the storage large enough to handle the required size of these UNDO log files and tablespace files in addition to the other files required for recovery of in-memory tables.

Allocate 150% of the memory size of the data node VM for the in-memory files used for recovery. In addition one needs to add 64 GByte of space for REDO logs. Thus if you use a VM with 256 GByte of memory you need 448 GByte plus the space for UNDO logs and tablespaces. If you then need another 500 GByte of tablespace, a storage size of 1 TByte for the VM should be sufficient; this would include a 64 GByte UNDO log space.
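
As a sketch under the sizing assumptions above (the file names and the split into files are ours, purely for illustration; the syntax is explained in detail in the following sections), growing the managed lg_1 logfile group to 64 GByte and the ts_1 tablespace by another 500 GByte could look like this:

ALTER LOGFILE GROUP lg_1
ADD UNDOFILE 'lg_1_2.dat'
INITIAL_SIZE 60G
ENGINE ndb;

ALTER TABLESPACE ts_1
ADD DATAFILE 'ts_1_2.dat'
INITIAL_SIZE 250G
ENGINE ndb;

ALTER TABLESPACE ts_1
ADD DATAFILE 'ts_1_3.dat'
INITIAL_SIZE 250G
ENGINE ndb;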

UNDO log syntax#

When creating an UNDO logfile group we need to add at least one file to it. We provide the size of the file using the INITIAL_SIZE keyword. Each logfile group also uses an UNDO log buffer; the size of this buffer is specified when creating the logfile group.

CREATE LOGFILE GROUP lg_1 ADD UNDOFILE 'lg_1.dat'
INITIAL_SIZE 4G UNDO_BUFFER_SIZE 32M ENGINE=ndb;

This creates a logfile group lg_1 that is stored in the file lg_1.dat (this file is placed in the DataDir of the data node; more control can be applied using FileSystemPath or FileSystemPathDD). The size is 4 GByte and the UNDO buffer is 32 MByte in size.

Given that only one UNDO logfile group is allowed, one might as well create it already at the initial start of the cluster. This is done using the RonDB configuration variable InitialLogfileGroup.

Here is an example of how to set this variable:

InitialLogfileGroup=undo_buffer_size=32M;lg_1.dat:4G;lg2.dat:4G

This creates an initial logfile group called DEFAULT-LG with two files, lg_1.dat and lg2.dat, both 4 GByte in size. The UNDO buffer is set to be 32 MByte in size. When creating an initial tablespace it is possible to specify the logfile group name, but since InitialTablespace cannot refer to any logfile group other than DEFAULT-LG, this is not a good idea.

At any time we can add more log files to an existing logfile group. This is done by using an ALTER LOGFILE GROUP command.

Here is an example:

ALTER LOGFILE GROUP lg_1
ADD UNDOFILE 'lg2.dat'
INITIAL_SIZE 4G
ENGINE ndb;

The setting of InitialLogfileGroup above is equivalent to these two example queries, CREATE LOGFILE GROUP and ALTER LOGFILE GROUP. The last example creates a new log file that is 4 GByte in size. It uses the same UNDO buffer; there is only one UNDO buffer per logfile group, and it isn't possible to change that here.

Another good reason to use InitialLogfileGroup is that it means that the memory for the UNDO log buffer is not taken from the memory specified in the RonDB configuration option SharedGlobalMemoryPool.

Tablespace syntax#

Creating a tablespace adds the first data file to the tablespace. When creating the tablespace one needs to refer to the logfile group to be used with the tablespace. It is also possible to set the extent size (it defaults to 1 MByte). Extents are allocated to a table from the tablespace; an extent is the allocation unit when allocating disk space to a table.

CREATE TABLESPACE ts_1 ADD DATAFILE 'ts_1.dat'
USE LOGFILE GROUP lg_1
INITIAL_SIZE 16G
EXTENT_SIZE 16M
ENGINE ndb;

The above statement creates a tablespace named ts_1 with a first data file ts_1.dat. The data file is 16 GByte in size; the tablespace uses an extent size of 16 MByte and the logfile group lg_1.

ALTER TABLESPACE ts_1
ADD DATAFILE 'ts2.dat'
INITIAL_SIZE 16G
ENGINE ndb;

The above statement adds another data file, called ts2.dat, to the tablespace ts_1; the file will have an initial size of 16 GByte.

In the same fashion as with logfile groups it is possible to create an initial tablespace for the cluster using the RonDB configuration file. In this case the configuration variable is called InitialTablespace.

Here is an example of its use:

InitialTablespace=name=ts_1;extent_size=16M;ts_1.dat:16G;ts2.dat:16G

This creates the same tablespace as the CREATE TABLESPACE command and ALTER TABLESPACE command did together. The name is optional; if not provided, the name will be DEFAULT-TS. It will always use the default logfile group DEFAULT-LG.

Controlling which columns are stored on disk#

By default all columns are stored as in-memory columns. One can set the storage type of a column by specifying either STORAGE MEMORY or STORAGE DISK. This can also be set on the table; if it is set on the table it is applied to each non-indexed column.

Here is an example:

CREATE TABLE t1 (a INT, b INT STORAGE MEMORY)
ENGINE ndb
STORAGE DISK
TABLESPACE ts_1;

This creates a table with a hidden primary key stored in memory and two attributes: the first, a, stored on disk and the second, b, stored in memory.
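
For completeness, here is a hedged sketch (the table name t2 and the column types are illustrative, not from the original text) of the opposite approach, where the table default is left as memory and a single column is placed on disk:

CREATE TABLE t2 (a INT PRIMARY KEY, b VARCHAR(1000) STORAGE DISK, c INT)
ENGINE ndb
TABLESPACE ts_1;

Here only b is stored on disk, while the primary key a (indexed columns are always in memory) and c use the default in-memory storage.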

Page Manager#

The Page Manager maintains the page cache. It uses a modern page cache algorithm. We will attempt to provide a basic understanding of this algorithm.

Each page in the page cache has a page cache record in the Page Manager. Each such page cache record goes through a number of states before it has a page connected to it.

At first a main memory page must be bound to the page cache record. Once this has been done the page must be mapped to a disk page. For new pages this is trivial since the page content on disk can simply be overwritten. For existing pages the page is mapped to the disk page by reading the page from disk (page in).

In order to find a clean page to bind to a page cache record, it might be necessary to first clean a page. Cleaning a page means that it might need to be written to disk before it can be reused for another disk page (page out).

A basic part of the algorithm is that we have 10x as many page cache records as we have pages in the page cache. We maintain these records in a linked list, ensuring that we know the hotness of many pages that are not currently present in the page cache.

As a page is used by the application we increment the usage counter of the page. This counter is maintained even for pages that have been paged out, as long as the page cache record hasn't been removed as well.

Thus we can see patterns of reuse that are longer than our page cache size can handle.

Considerations for disk data tables#

Initialisation of files#

Files can be added to both logfile groups and tablespaces as online operations. The addition of a new file will take some time. The file is created as part of the command to add the file and is created with its initial size. To ensure that the file system has allocated space for the entire size of the file, we write the entire log file or tablespace file as part of adding it. Even though the operation to add a new file is an online operation, care is needed to ensure that the disk bandwidth used to initialise the file doesn't harm the ongoing transactions.

Placement of disk data files#

It is possible to specify the full path of a file when adding a file to a logfile group or tablespace. It is not necessarily advisable though, since all machines might not look exactly the same.

If only a file name is provided, the path is derived from the configuration parameters in the following order.

DataDir is normally a variable that should be set in the RonDB configuration. If it is not set, the working directory of the RonDB data node process is used. The default is to place all the files used by the data node process under this directory.

By specifying the RonDB configuration option FileSystemPath, all the disk data files will be placed in this directory. However, if FileSystemPathDD is set, that directory will be used instead.

The algorithm to choose the directory for a data file in a tablespace is as follows:

1. If the configuration option FileSystemPathDataFiles is set, use this directory.
2. Else, if FileSystemPathDD is set, use this directory.
3. Else, if FileSystemPath is set, use this directory.
4. Else, if DataDir is set, use this directory.
5. Else, use the working directory of the data node process.

The algorithm for choosing the directory for UNDO log files is the same, except that in the first step FileSystemPathDataFiles is replaced by FileSystemPathUndoFiles.
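
As a hedged example (the paths are invented for illustration), a data node configuration that separates tablespace data files and UNDO log files onto different mounts could look like this:

DataDir=/var/lib/rondb
FileSystemPathDataFiles=/nvme1/rondb/datafiles
FileSystemPathUndoFiles=/nvme2/rondb/undofiles

With these settings, tablespace data files end up under /nvme1/rondb/datafiles, UNDO log files under /nvme2/rondb/undofiles, and the remaining data node files under DataDir.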

DiskIOThreadPool#

One more important thing to consider for disk data is the number of OS threads used to read and write pages in the UNDO log files and tablespace files. This number is set through the RonDB configuration variable DiskIOThreadPool. By default it is set to 2. It is quite likely that this parameter should be increased, in particular when using SSDs and NVMe drives to store the tablespaces and logfile groups.
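
As an illustration (the value 8 is an assumption, not a recommendation from this text), increasing the pool in the data node configuration could look like this:

DiskIOThreadPool=8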

DiskPageBufferMemory and DiskPageBufferEntries#

Two important parameters are DiskPageBufferMemory, which is simply the amount of memory dedicated to the page cache, and DiskPageBufferEntries, which is the multiplication factor for how many old page cache entries we keep for pages that have been evicted from the cache. The latter is set to 10 by default. Thus we have 10 page cache entries for each actual page in the page cache. Since each such record consumes roughly 84 bytes, this means that we use about 2.5% of the page cache memory to keep records of old page cache entries. We could potentially increase or decrease this slightly, but most likely this value is a good trade-off. The more page cache we have, the better its hit rate will be.
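
As a hedged sizing sketch (the 8 GByte figure is purely illustrative): with DiskPageBufferMemory=8G and 32 kByte pages, the page cache holds 256K pages; with DiskPageBufferEntries=10 there are about 2.6M page cache records, which at roughly 84 bytes each consume a bit more than 200 MByte, i.e. around 2.5% of the page cache memory.

DiskPageBufferMemory=8G
DiskPageBufferEntries=10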

Local checkpoints for disk data#

Local checkpoints ensure that all dirty disk data pages are written to disk. This is important to ensure that we can cut the REDO log and the UNDO log for disk data pages. The algorithms used to handle disk data are described in a fairly detailed manner in a large comment in pgman.cpp and in some comments in pgman.hpp. In Backup.cpp a very detailed description of the local checkpoint algorithm as such is provided. When a fragment replica is checkpointed, its main memory data and its disk data are checkpointed together, to ensure that the references between them stay correct after a restart. The most detailed comments are always in the latest version of RonDB.

To ensure that we don't oversubscribe the disks with writes from local checkpoints (LCPs), we never have more than 1 MByte of outstanding page writes per ldm thread due to LCPs.

Foreign keys in RonDB disk data tables#

It is fully possible to use foreign keys on tables with disk data attributes. It is important, however, to consider the implications of cascading actions for these tables. If a cascading action tries to change too many rows we could easily hit the limit for transaction sizes. It is important to take this into consideration when designing your schema. A foreign key cascading action can easily lead a simple transaction to update all rows in a table, which can cause havoc for transaction latency and easily lead to overload problems.

This problem is present already for main-memory tables; for disk data tables it is even worse, since every deleted row causes disk pages to become dirty and in need of disk writes in the next local checkpoint.
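
As a hedged schema sketch of the kind of design to be careful with (table and column names are invented; they are not from the original text), a cascading delete on the parent table below touches every matching child row in a single transaction:

CREATE TABLE parent (id INT PRIMARY KEY)
ENGINE ndb;

CREATE TABLE child (
  id INT PRIMARY KEY,
  parent_id INT,
  payload VARCHAR(1000) STORAGE DISK,
  KEY parent_idx (parent_id),
  FOREIGN KEY (parent_id) REFERENCES parent(id) ON DELETE CASCADE)
ENGINE ndb
TABLESPACE ts_1;

-- Deleting one parent row cascades to all of its child rows in one transaction.
-- With many children this risks hitting the transaction size limit and dirties
-- many disk data pages that must be written in the next local checkpoint.
DELETE FROM parent WHERE id = 17;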