
Ndbinfo Tables#

Manageability of RonDB is something we have worked hard on. One of the most important features here is the set of ndbinfo tables that export lots of internal information about data node behaviour to DBAs.

These tables are all accessed as normal tables in the ndbinfo database. They are read-only tables that can be used in any place in a query where one selects from a table. They are accessed from a MySQL server.

There are currently 48 different tables exporting many different types of data useful in understanding NDB operations. In addition there is a set of views on these tables.

We will describe most of them here, some in more detail and some we will gloss over. The MySQL manual has descriptions of every table and every column in these tables. In this chapter we focus on explaining why you could be interested in a specific table and drill a bit deeper into some specific tables and columns.

New ndbinfo tables are constantly added as we see the need to make information about the cluster accessible. It is a fairly simple process to add a new ndbinfo table.

Reasons to use ndbinfo#

There are many reasons to use ndbinfo, such as understanding how the cluster is being used, understanding its dynamic behaviour, solving various problems with your application, and understanding RonDB itself, the hardware, the OS and so forth. It is important to understand whether the current configuration is appropriate and thus to know when it is time to make various configuration changes. These configuration changes can include adding more nodes to the cluster.

Understanding how the cluster is used#

The most common use case is for the DBA to understand how much the cluster is used. Is it running close to its limits or is there plenty of resources?

There are many types of resources to manage: CPU resources, memory resources, disk resources, internal buffer resources and so forth.

Solving application problems#

One typical problem in a database application is trouble with locks. Hot locks can easily lead to deadlocks, lower throughput and many other issues. Other problems could be related to the application using more memory than expected.

In addition we might need to watch the exact behaviour of the cluster in a live situation. This helps us understand what our application code does to the database.

Query analysis can be performed on the MySQL level. Some query analysis can also be performed by watching statistics from ndbinfo.

Finding scalability issues#

If an application is designed properly it can scale to the full cluster size of 144 data nodes. Very few applications reach that level of scalability. Understanding how to improve scalability requires understanding what impacts scalability.

ndbinfo can assist in this work by providing statistics about the number of local operations (operations where the transaction coordinator is in the same node as the data) versus the number of remote operations (not a local operation).

CPU statistics give hints on which threads are most active. There is also thread information about the number of signals executed per loop and other things that give a better understanding of the behaviour of RonDB.

Solving problems with RonDB#

Some of the tables in ndbinfo provide very detailed information about important internal data structures in NDB. When users report issues, the ndbinfo tables are one tool that can be used to analyse these issues.

RonDB developers can use the tables to understand the dynamic behaviour of the programs in various situations. This can lead to new ideas on development for RonDB.

Understanding configuration issues#

Configuring a DBMS can be challenging. It is important to get a tradeoff between the usage of memory, CPU and disk resources and find an appropriate configuration that fits the application.

How ndbinfo works#

The ndbinfo tables are implemented using a special scan operation in the NDB API. It sends the scan operation to all nodes in the cluster using the table id as key to this scan. Each node will send the scan to every thread and every block in each thread. For the most part these query parts will return no rows.

If the block recognizes the table id as something it has data for, it will send the data in one or more rows per signal until all data has been provided.

There is currently no indexing of ndbinfo tables. The MySQL server can apply any search conditions as on any SQL query; the actual data returned to the user will only be the requested subset of the table.

The ndbinfo scans are performed in NDB data nodes; the information retrieved comes only from the data nodes. Thus any information in ndbinfo tables about MySQL servers and API nodes is sent from those nodes to the data nodes.

The lowest level in ndbinfo is the base tables. All these base tables can be accessed directly using the table name ndb$base_table_name. The list of base tables and their columns is listed in the file NdbinfoTables.cpp in the directory storage/ndb/src/kernel/vm. These tables can be accessed using SQL, but they are not listed when you use the SHOW TABLES command in the MySQL client.

Most of the ndbinfo tables are fairly straightforward mappings of these base tables. The mappings are defined in the file ndbinfo.sql in the directory storage/ndb/tools. This file also contains the definitions of the base tables.

Some of the ndbinfo tables are more complex mappings where we map integers to names, perform join operations and so forth. Some of the tables in ndbinfo are also views.

As in any monitoring software there is an overhead when asking for monitoring information. For the most part the ndbinfo queries are cheap and will only return a limited amount of rows. There are some examples that could potentially return a substantial amount of rows.

In particular, all tables that list data about fragments and fragment replicas can potentially contain millions of rows in a large cluster with many tables. These should be used with some level of care.

At the same time the ndbinfo implementation is done in such a way as to ensure that queries against ndbinfo cannot overload the cluster, so there is a fair amount of serialisation. If an ndbinfo query returns a large result, the query will take a longer time.

Each query will scan one node at a time, and each round trip to this node will retrieve at most 256 rows and at most 65535 bytes (both sizes are configurable through the MySQL options ndbinfo_max_rows and ndbinfo_max_bytes). It is also possible to make the ndb$... tables visible through SHOW TABLES by setting the MySQL option ndbinfo_show_hidden.

For the most part it is ok to query often. The ndb_top program, discussed in a later chapter, uses a query against the cpustat table in ndbinfo. It issues such a query once per second and the load from this is barely noticeable.

The performance schema in MySQL has a considerably higher overhead even when not queried. The ndbinfo tables only add overhead when actually queried. All the statistics reported are gathered all the time and are mostly simple variables that are incremented while performing other operations on the same data structures. The performance overhead from statistics in NDB is very small.

The manual contains information about all stable tables in ndbinfo. The base tables can change from version to version; they can even disappear and new ones can be added. Thus any tools developed towards ndbinfo should only use the official ndbinfo tables.

But for debugging and other purposes it can at times be useful to know about the base tables as well. Here we list also the base tables and their purposes.

We will skip most of the detailed information on columns here. It can be found in the MySQL manual and in the files mentioned above, available in a source tarball of RonDB.

Metadata tables#

ndb$tables and ndb$columns#

These tables list the tables and columns in the ndbinfo database.

dict_obj_info#

This table contains a list of the metadata objects stored in the cluster. This includes objects for tables, ordered indexes, unique indexes, system tables, various trigger types, tablespaces, log file groups, data files for tablespaces, undo log files for log file groups and foreign keys.

For most tables about metadata it can be useful to join the table id with the id in this table to find the name of the table; by also joining with dict_obj_types we can get the type name of the object.

Some objects, such as BLOB tables, have a parent object, which is also listed in this table. The table version can be retrieved from this table as well.
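As a sketch of such a join (run against the ndbinfo database; column names may differ slightly between versions):

```sql
-- List every metadata object together with its type name
SELECT o.id, o.fq_name, t.type_name
FROM ndbinfo.dict_obj_info o
JOIN ndbinfo.dict_obj_types t ON o.type = t.type_id
ORDER BY o.id;
```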

dict_obj_types#

The dict_obj_info table has a column called type. This is the type of the object (e.g. table, ordered index, unique index, event object and so forth). The dict_obj_types table has a mapping from type id to type name. Joining with this table is very useful when there is an interest in what type of object we are looking at.

table_info#

This contains information about table distribution: flags for reading from backup and for fully replicated tables, the GCI at which the table was created, and so forth.

ndb$stored_tables#

This is the base table listing information retrieved from the DBDIH block. It contains a fair amount of flags on table distribution and special hidden columns. A fair number of detailed flags available in the base table are not available in table_info.

table_distribution_status#

This table also contains information from DBDIH. It has many more status variables on local checkpoints, online table reorganisations, the status of the table and the number of partitions and fragments (which can be changed through an online reorganisation of the table). It contains the same information as the base table, except that the base table also provides the type of the table; this can also be found in the dict_obj_info table.

table_fragments#

This contains one row for each table fragment stored in RonDB. This is the information that RonDB uses to decide from where to read the data. The information is the same as in the base table ndb$table_fragments.

This table can be used to discover the table distribution; it can be useful for understanding what goes on if the load is unbalanced in the cluster.

We have both partition id and fragment id in this table. For most tables these are the same, but for fully replicated tables there is a difference between fragment and partition.

A partition is the division of data seen from the user's point of view. Fragments contain the data of partitions internally in RonDB. Fully replicated tables store a partition in multiple fragments, one fragment per node group. All other tables store each partition in one fragment, so for those tables partition and fragment are synonyms.
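A hedged sketch of how to spot this difference (assuming the partition_id and fragment_id column names of the table_fragments table):

```sql
-- Fragments whose fragment id differs from their partition id;
-- only fully replicated tables are expected to show up here
SELECT table_id, partition_id, fragment_id, node_id
FROM ndbinfo.table_fragments
WHERE partition_id <> fragment_id;
```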

ndb$table_fragments_all#

This contains the same information as the table_fragments table. The difference is that table_fragments only retrieves information from the current master data node in the cluster, while this table retrieves it from all data nodes.

The table distribution is stored in the DBDIH block in each node. This information should always be the same in all nodes. This base table can be used to verify that the table distribution information is correctly set in all nodes. This table can be useful in researching potential issues with table distribution.

table_replicas#

This contains one row for each fragment replica in RonDB. Each fragment is stored in a set of fragment replicas, one fragment replica per node in the node group where the fragment is stored. Most of the information in this table is low level information used by RonDB in recovery situations and is thus not so important for a user. This table is mostly intended for support cases where one can check this table to get a detailed glimpse into what is going on inside the DBDIH block.

ndb$table_replicas_all#

The same information as in table_replicas, but here the information is fetched from all nodes, not only from the current master data node.

Connection information#

A table was added to ensure that external tools had a way to discover the cluster when connecting to one of the MySQL Servers.

processes#

This is an important table in some cases: it is the table for cluster discovery. By pointing a tool to one of the MySQL servers in the cluster you can automatically find out where all the cluster processes are placed. Each node that is currently up and running is listed in this table.

A node can be a data node. In this case we get the pid of the data node process and the pid of the angel process as well. The service_URI column shows the hostname of this node.

The same information is also available for RonDB management servers. In addition, the service_URI column here also contains the port number used to connect to the management server.

For MySQL servers we get the hostname to connect to and the port the server listens on. We also get its server-id.

Multiple node ids can be used by the same MySQL server; they are distinguished by the cluster connection they use, shown in the service_URI column, and they also have different node ids.

For MySQL servers the port number is the port that MySQL clients use to access MySQL.

One example usage of this table is to write your application so that it first connects to one of the MySQL servers. From this server we get the rest of the MySQL servers connected to the cluster, as well as the RonDB management servers. From this we can automatically set up round robin lists of MySQL servers to connect to, for those MySQL connectors that support load balancing (e.g. the JDBC connector).
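As a minimal sketch of such discovery (assuming MySQL servers report the process name mysqld in this table):

```sql
-- Discover all MySQL servers in the cluster from any one of them
SELECT node_id, process_name, service_URI
FROM ndbinfo.processes
WHERE process_name = 'mysqld';
```

The service_URI values returned can then be used to build the round robin list.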

Configuration information#

It can be useful to discover the configuration from a MySQL client as well.

config_params#

This table contains a list of all configuration parameters with their name, description, default value, minimum value, maximum value, type information and whether the parameter is mandatory or not.

This table can be useful to join with when checking the configuration in the cluster at the moment.

config_nodes#

This table contains a list of all configured nodes in the cluster with their node type, node id and the hostname where they are allowed to execute from.

config_values#

Each node receives a full configuration from the RonDB management server. This information is only updated when a node restarts; thus during a configuration change the configuration can differ between nodes. This table can be used to view the current configuration stored in each node. We can only get this information for data nodes.
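For example, the current value of a parameter in every data node can be retrieved by joining config_values with config_params (a sketch, assuming the parameter number columns config_param and param_number):

```sql
-- Show the DataMemory setting currently stored in each data node
SELECT v.node_id, p.param_name, v.config_value
FROM ndbinfo.config_values v
JOIN ndbinfo.config_params p ON v.config_param = p.param_number
WHERE p.param_name = 'DataMemory';
```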

Transaction information#

Transactions and the operations that they consist of are very important to analyse. We want to have statistics on how they behave to see long-term trends.

We need to understand latency issues. An important characteristic of RonDB is predictable latency. We want to enable you to track latency in the RonDB data nodes to understand if latency issues are coming from the RonDB data nodes or if they are added somewhere else in the software stack.

For application development it is important to see in more detail what is currently going on in the cluster. We have tables that show the current locks, the currently executing operations and the currently executing transactions.

Transaction statistics#

counters table#

The counters table contains statistical counters from DBLQH, DBTC and DBSPJ. DBTC is the transaction coordinator, so every transaction starts and ends here. Almost every operation starts and stops here as well; the exception is committed reads using key lookups, where the query is routed to the owner of the data and never returns to DBTC. DBSPJ works together with DBTC to handle complex join queries. DBLQH handles the actual reads, writes and scans of data; its counters are thus incremented only in the nodes that contain the data used in the queries. The DBTC counters are incremented in whatever node the transaction is executed from.

In DBTC we collect information about transactions, commits, reads, writes, simple reads, aborts, full table scans, ordered index scans (range scans), local reads and local writes.

A local read reads data where DBTC is located in the same node as the data. A remote read is the opposite, where the data is not locally accessible. A local write is one where the transaction coordinator is on the same node as the primary replica of the row to be written.

Each operation contains an interpreted program for execution. For reads this is normally one word (32 bits) per column read, and for writes it is normally one word per column plus the new column value. A delete normally has no interpreted program sent at all. In some cases a longer interpreted program is sent as well, for example the filters in a scan. For pushed-down joins it contains the join execution plan. These words are summed together in the counter named ATTRINFO.

Partition pruned scans will only scan one fragment since the partition key is bound in the index scan. There are currently no specific statistics about partition pruned scans in DBTC; DBSPJ has more information about those things for pushed-down joins.

DBLQH counts the number of operations (one per row access and one per scan). In addition it collects detailed information about overload situations. Overload occurs when a row operation is initiated in DBLQH and the communication to the nodes receiving the query result (the TC node, the API node and event API nodes for updates) is overloaded.

The following simple query collects the total value of each counter for the entire cluster, grouped by counter name.

```sql
SELECT counter_name, SUM(val)
FROM counters
GROUP BY counter_name;
```

And the following query counts the total number of transactions started in the cluster (including aborted and committed transactions as well as those not yet completed) since the cluster was started.

```sql
SELECT counter_name, SUM(val)
FROM counters
WHERE counter_name = 'TRANSACTIONS'
GROUP BY counter_name;
```

The counters table provides an easy way to get transaction and operation statistics with simple queries. Executing a query with some time in between, for example the above query on transactions, gives the current execution rate. E.g. if the above returns 10200 and 5 seconds later it returns 24200, we have executed (24200 - 10200) / 5 = 2800 transactions per second during these 5 seconds. By summing READS, WRITES and SIMPLE_READS we get the total number of key lookups, and TABLE_SCANS provides the number of scans. These are interesting measurements to understand the current throughput of the cluster.
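The key lookup sum mentioned above can be computed with a query like the following (a sketch; counter names as reported by DBTC):

```sql
-- Total number of key lookups since node start, summed over the cluster
SELECT SUM(val) AS key_lookups
FROM ndbinfo.counters
WHERE counter_name IN ('READS', 'WRITES', 'SIMPLE_READS');
```

Running the same query twice with a known interval in between and dividing the difference by the interval gives the key lookup rate.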

If the application throughput is low and the cluster is performing many operations it can be interesting to drill down on what low level queries the application generates.

tc_time_track_stats#

Every time a transaction or an operation is completed we measure the time it took to execute it. We put the result in a histogram that starts at 50 microseconds and goes up to 14.3 seconds in 32 logarithmic steps.

We measure response times for queries, transactions, scans, fragment scans and so forth. Each entry in this histogram is one row in this table; the row lists the upper bound for the bucket, and the lower bound is two thirds of the upper bound.

The query below collects the number of transactions that have completed within 500 microseconds and the total number of transactions executed, both since the nodes were started.

```sql
SELECT SUM(transactions)
FROM tc_time_track_stats
WHERE upper_bound < 500
UNION
SELECT SUM(transactions)
FROM tc_time_track_stats;
```

By collecting this information regularly we can discover whether the percentage of transactions with higher latency goes up or down.

Transaction latency includes time the transaction is waiting for the application.

By focusing on a specific type of operation, e.g. write_key_ops, we can see the latency of individual operations that have no interaction with the API nodes included in the latency.
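A sketch of such a per-operation latency histogram (assuming the column name write_key_ops):

```sql
-- Latency histogram for key writes, summed over all TC instances
SELECT upper_bound, SUM(write_key_ops) AS ops
FROM ndbinfo.tc_time_track_stats
GROUP BY upper_bound
ORDER BY upper_bound;
```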

Live transaction information#

cluster_locks#

Every locked row in the cluster is reported in this table, with information about the table, the lock mode and what the lock is waiting for. It is useful for discovering application problems with concurrency.

cluster_operations#

This lists every operation currently ongoing in RonDB. There can be millions of those; thus this table is mostly intended for exploratory usage. It is probably most interesting as an educational tool for RonDB.

cluster_transactions#

The same as cluster_operations except that it lists all ongoing transactions. It is probably most interesting as an educational tool for RonDB.

server_*#

The server_locks, server_operations and server_transactions tables are equivalent to the cluster_* tables except that they list only those locks, operations and transactions that were initiated from the MySQL server where the query originated.

These tables are views on the cluster_* tables.

Memory information#

We have a few ndbinfo tables that can be used to gather information about memory usage. In particular they give insights into how we should configure SharedGlobalMemory, DataMemory, LongMessageBuffer and many other configuration parameters that affect memory usage.

resources#

For the most part this table shows how much memory is allocated statically to a number of memory pools. A fair amount of the memory in RonDB is allocated in various pools, and we are constantly adding more resources to these pools. There are still parts of the memory that are allocated without assignment to a specific pool.

The generic reasoning is that a pool can reserve memory; this memory cannot be shared with other pools. Each pool has a maximum allowed memory; if this is 0 there is no upper bound. The reserved memory for a pool is allocated when the data node starts.

The extra memory that a pool can allocate comes from the memory allocated according to the configuration parameter SharedGlobalMemory. This memory is allocated when the data node starts. Almost no memory at all can be allocated after the data node has started; the data nodes manage the memory while up and running.

Allocation of memory at startup takes considerable time. In addition to allocating the memory we touch the memory before we move on with the start. The reason is that we need to ensure that the memory is allocated for real and not just in some swap area. This is necessary to achieve the predictable latency. It is possible to lock the memory in the data node such that it cannot be swapped out. This is achieved by setting the LockPagesInMemory configuration parameter to 1.

This ndbinfo table shows how much memory is reserved for a pool, how much memory the pool can allocate at most (0 means no limit) and how much memory is currently allocated to it. All memory used above the reserved space comes from SharedGlobalMemory. The table lists the number of pages (32 kByte pages) for each resource.

When a log file group for disk data is created the memory space for the UNDO log buffer is taken from the SharedGlobalMemory. If a log file group is created as part of the configuration, the memory for the UNDO log buffer is automatically added to the memory allocated for SharedGlobalMemory.

JOB_BUFFERS are used for communication between blocks inside the NDB data node; they are also used when sending to other nodes. This can be a fairly sizable memory chunk since it is important not to run out of memory for internal communication. The size of this memory depends on the number of threads in the data node. It is not configurable in any other way; all its memory is reserved and nothing extra can be allocated.

TRANSPORTER_BUFFERS is the memory used to buffer data before it is sent over the network (using TCP/IP sockets or shared memory). This is configurable and depends on SendBufferMemory and the number of threads in the data node. It can also be set directly through the configuration parameter TotalSendBufferMemory.

This memory is shared for communication with all nodes. The buffer used towards one node can use extra memory segments if available. It can allocate up to 25% more memory from SharedGlobalMemory compared to its configuration if necessary and if available.

DISK_RECORDS has no memory permanently attached to it; it allocates memory from the extra memory available through SharedGlobalMemory. This memory is used for all page and file operations associated with disk data columns and stores request information.

QUERY_MEMORY has no reserved memory and no upper bound on its allocation. It is memory used for executing complex join queries in the NDB data nodes. It allocates its memory from SharedGlobalMemory.

DISK_PAGE_BUFFER is the disk page cache. All its memory is reserved and no more can be allocated. The size is set through the configuration parameter DiskPageBufferMemory.

DATA_MEMORY is the memory used for in-memory data. This is the pool that the user wants to pay special attention to, since running out of memory in this pool means that no more data can be stored in NDB. There is a limit such that during normal operation one cannot allocate more than 95% of this memory; the rest is only available during restarts. During a restart it is possible that memory usage increases slightly in the various restart phases. To ensure that we don't run out of memory during recovery, we ensure that not all DataMemory is used during normal operation.

FILE_BUFFERS are used for REDO log buffers. This is configured with the configuration parameter RedoBuffer.

memoryusage#

The memoryusage table provides information about DataMemory again; it also provides information about LongMessageBuffer. The LongMessageBuffer is an important resource used for longer signals sent internally. When a long message arrives, a memory segment is allocated from this resource and held until the signal has been executed; similarly, resources are allocated when long signals are sent and released when put into the send buffers of the transporters. It is important to have a sizeable amount of memory available for this purpose, especially in a data node with much traffic and large amounts of data being transported in and out of the database.

Tracking the amount of memory used for the long messages, to compare it with the total available memory, is a vital management operation to discover issues before they happen.
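A simple way to track this is to compute the used fraction per memory resource (a sketch using the used and total columns of memoryusage):

```sql
-- Fraction of each memory resource currently in use, per node
SELECT node_id, memory_type, used, total,
       ROUND(100 * used / total, 1) AS pct_used
FROM ndbinfo.memoryusage;
```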

ndb$pools#

This table contains information not represented in any ndbinfo table. The reason is that most of this information is in the process of being moved to the resources table described above. From the beginning NDB had all its data in large memory arrays, each of a fixed size. We have been moving data piece by piece from these fixed size arrays to pools instead. A significant number of those fixed arrays still remain, and many of them can be seen in the ndb$pools table.

Moving those arrays into pools makes RonDB less susceptible to various configuration mistakes and much more capable of adapting to different user needs. The aim is to move the most memory consuming resources first; this ensures that the remaining fixed arrays can be allocated in a way that ensures they won't run out of memory.

Therefore this table mostly represents things that are not stable over many versions of RonDB. Thus it is not a good idea to make this table part of the stable ndbinfo tables.

It still contains lots of information about how much memory is allocated in NDB for various purposes. The table specifies the configuration parameters that these array pools depend on when allocated; there can be up to four configuration parameters affecting them. Some of them are static and not changeable.

Thread statistics#

The internal architecture of the RonDB data nodes uses several different thread types. Each thread type handles a functional part. Almost all databases today use multiple threads per program (some use many programs working together on one computer).

RonDB internally uses a model where execution is divided functionally into a number of different thread types. We have one set of threads handling receiving from the network, another set handling sending to the network, a set handling transaction coordination and routing of query processing, a set handling the local database parts, and a set handling interaction with the OS file system.

There are thread types that always have exactly one thread. One is the main thread, handling management of metadata and some restart functions. Another is the rep thread, handling events, proxy functions for the other thread types, a set of restart functions and functions related to online reorganisation of data parts.

We keep track of the use of these threads in several dimensions.

In the cpustat table one can track the exact amount of CPU time that each thread has used, for all thread types except the io threads that interact with the file system. The OS has exact knowledge about when a thread is active and when it isn't; RonDB retrieves this information through OS APIs.

In addition, for those threads RonDB also tracks the wall clock time that they are active. If the wall clock time is close to the time reported by the OS there is very little contention for CPU resources. If the wall clock time is much higher than the time reported by the OS, there is much contention for CPU resources. Contention for CPU resources means that throughput can vary quite a lot over time. To avoid this, RonDB provides the capability to lock each thread type to individual CPUs or groups of CPUs (including the io threads).

In addition we have the threadstat table, which provides statistics on how block threads (all thread types except io and send) execute. Block threads execute signals in a loop; below is simple pseudo code showing what this loop looks like.

```
while (true) {
  c_loop++;
  gather signals to execute;
  for each signal {
    execute signal;
    c_exec++;
  }
  if (no more signals to execute) {
    c_wait++;
    sleep, waiting for signals to arrive;
  }
}
```

The pseudo code shows where the columns c_loop, c_exec and c_wait are updated. If the number of signals executed per signal gathering is high, we make very efficient use of the CPU. Similarly, we want to execute as many signals as possible before going to sleep. This is one reason why NDB data nodes become more and more efficient as the load increases.
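The counters can be turned into efficiency measures with a query such as the following (a sketch; higher signals per loop means more efficient CPU usage):

```sql
-- Average signals executed per loop iteration for each block thread
SELECT thr_no, thr_nm,
       c_exec / c_loop AS signals_per_loop,
       c_wait / c_loop AS sleeps_per_loop
FROM ndbinfo.threadstat;
```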

A thread that goes to sleep is woken up by another thread that wants to communicate with it. The recv threads are woken up when signals arrive on a socket or in shared memory. Waking up a thread is quite fast, it takes at most a microsecond or two. But if it happens a lot it could hamper performance. RonDB provides the possibility to spin for a few microseconds before going to sleep. This removes the need to wait for a wakeup at the expense of not returning the CPU to the OS. In a system with CPUs exclusive to the RonDB data nodes it can be useful. In a shared environment where RonDB shares CPUs with other programs it is not a good idea to use it.

Tracking the threadstat table is an important tool to see if it is useful to employ spinning.

blocks#

This table contains a mapping from block number to block name that can be useful to join with.

threads#

This contains a mapping from thread number to thread name and thread description that is useful to join with.

threadblocks#

This table shows exactly which blocks execute in which threads. Most blocks execute in more than one thread.

threadstat#

In addition to the statistics on the execution loop, this table provides information about the number of signals sent at different priorities and how much OS time is reported for each thread.

cpustat#

This lists the CPU usage, both real usage from the OS and wall clock time, per thread as a percentage of the last second. This table is used as input to the ndb_top tool.
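A sketch of a contention check based on this table (assuming the OS_user, OS_system, thread_exec, thread_send and thread_spinning columns; percentages refer to the last second):

```sql
-- Wall clock percentages well above the OS percentages indicate
-- contention for CPU resources
SELECT node_id, thr_no, thr_nm,
       OS_user + OS_system AS os_pct,
       thread_exec + thread_send + thread_spinning AS wall_pct
FROM ndbinfo.cpustat;
```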

cpustat_50ms#

This table provides 20 measurements per thread, one per 50 milliseconds. Each measurement lists the time spent in user mode, system mode and idle as reported by the OS, and the time spent executing and sleeping according to wall clock time. It also provides the time spent sending, spinning and waiting for buffers when no buffers are available, and the elapsed time since the previous measurement.

The measurements are in microseconds.

cpustat_1sec#

This is an aggregation of cpustat_50ms that reports one row per second for each thread for the last 20 seconds. The columns are exactly the same.

cpustat_20sec#

This is an aggregation of cpustat_1sec that reports one row per 20 seconds for each thread for the last 400 seconds.

Node information#

We have a few tables that show information about nodes.

nodes#

This table has one row per data node. It shows the node id, uptime in seconds, the node status (e.g. STARTED), the start phase currently executing (0 when already started) and the version number of the cluster configuration it uses.

This table can be used instead of the SHOW command in the NDB management client to see the status of data nodes.
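A quick status check can thus be written as a plain query; this sketch assumes the column names used in the MySQL manual (node_id, uptime, status, start_phase, config_generation):

```sql
-- Data node status, replacing SHOW in the management client.
SELECT node_id, uptime, status, start_phase, config_generation
FROM ndbinfo.nodes;
```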

membership#

This is another table that shows the node group of a data node and various information related to the selection of the master node in the cluster and the arbitrator.

When nodes enter the cluster, they do so one at a time. Each node gets a unique id based on when they started. The node with the lowest such id is always the master node in the cluster.

This table shows the master role, the succession order for who becomes the next master. The master role is called president in the heartbeat protocol. In higher level protocols the president is called master. One node handles both the presidency and the master role in a number of protocols. It is normally not a role that requires a lot of work, but it is required to know who is in charge of a protocol. Otherwise it is very hard to figure out what to do in error situations.

This table also presents the arbitration data. The president selects the arbitrator. The table shows which node is president, the state of arbitration and which nodes are the potential successors if the current one fails.

This table provides a view on the heartbeat order. Each node sends a heartbeat to its right neighbour and expects a heartbeat from its left neighbour. This table shows who is left and right neighbour. The selection of neighbours is also based on the order of joining the cluster.

arbitrator_validity_detail#

This table is a view on the membership table that focuses only on the arbitration data.

arbitrator_validity_summary#

Yet another view on the membership table; it presents the arbitrator, its ticket, its status and the number of nodes that agree on it.
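For example, assuming the column names arbitrator, arb_ticket, arb_connected and consensus_count as documented in the MySQL manual, one can check that all nodes agree on the arbitrator with:

```sql
-- A healthy cluster shows one row per (arbitrator, ticket)
-- combination, with consensus_count covering all data nodes.
SELECT arbitrator, arb_ticket, arb_connected, consensus_count
FROM ndbinfo.arbitrator_validity_summary;
```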

Restart information#

Analyzing restarts is important. We have a table called restart_info that provides information about 19 different restart phases and how many seconds they took to execute.

This is an important tool to understand recovery performance in your NDB setup. Most of the crucial phases benefit from parallelising to many ldm threads.

restart_info#

The information about the latest restarts is available in this table. It can only report data about node restarts. Initial cluster starts are not logged and neither are complete cluster restarts.

Most of the times measured in these phases are very short. But we provide the details about each phase to ensure that we can understand what is going on in the cluster also in the not so normal states.

Only one row per node can exist. If the table is used during a node restart it will report 0 for those parts of the restart that haven't started yet. If the node hasn't made any node restart since the master node started there will be no record in this table for that node.

The first column secs_to_complete_node_failure keeps track of the current time used to complete the node restart.

A node restart consists of the following stages:

First the cluster must complete the node failure handling. This ensures that all nodes know about the failed node. A node with the same node id cannot join again until this is completed. It includes cleaning up all transactions where the failed node was transaction coordinator. It usually takes a few seconds. When this is completed the node is allowed to allocate its node id again.

After allocating the node id, the node must perform its initial startup code. The major time consumers here are allocation of memory (which can take about 1 second per GByte of memory) and, for initial node restarts, the initialisation of REDO logs and disk data files. Entry into the heartbeat protocol also takes a bit of time since only one node is allowed to enter at a time; expect a few seconds per node restarting in parallel.

secs_until_wait_for_ndbcntr_master is the time it takes to get permission to start from the master. Normally this is immediate, but only one node at a time is allowed to copy the meta data from the master node. There could be some wait here for other nodes to complete their meta data copying.

secs_wait_for_ndbcntr_master is the time it takes to wait for the master to allow the restart to continue. Normally immediate.

secs_to_get_start_permitted is the time it takes to receive the permit also from all other nodes. Normally immediate.

Next we measure the time it takes to wait until we can copy the meta data from the master node. The meta data is locked for a short time during startup of a local checkpoint.

Next we measure the time it takes to copy the meta data. Normally not a very long time unless there is a large amount of tables and other meta data in the cluster.

Next we measure the time it takes to include node in various cluster protocols such as local checkpoint and global checkpoint. Normally immediate.

Next we measure the time it takes for the starting node to start requesting to restore its data. Normally immediate.

secs_for_local_recovery is the total time it took to restore fragments, execute UNDO log, execute REDO log and rebuild indexes.

secs_for_restore_fragments is the time it takes to restore the local checkpoint files from disk. Each ldm thread can restore about one million rows of about a hundred bytes per second. Roughly 100-200 MBytes per second are restored here. This phase can take several minutes to complete if the database is hundreds of GBytes.

secs_undo_disk_data is the time it takes to execute the UNDO log. The UNDO log is used to ensure that the disk data rows are synchronized with the in-memory rows before we start executing the REDO log to generate a consistent checkpoint of the local data. The time it takes to perform this part is dependent on the length of the checkpoint before crash and the amount of writes to disk data before the crash.

secs_exec_redo_log is the time it takes to execute the REDO log.

secs_index_rebuild is the time it takes to rebuild all ordered indexes. This is heavily parallelised since it can easily become a major time consumer with many ordered indexes.

secs_to_synchronize_starting_node is the timing of an important phase where we synchronize our data with a live node in our node group. The time it takes to perform this phase is very much dependent on how many changes have occurred since the crash.

secs_wait_lcp_for_restart contains the time it takes to wait for a local checkpoint to complete that we are part of. It is necessary to participate in a local checkpoint before we complete the node restart. Without a completed local checkpoint our node would not be recoverable and we could create an unrecoverable cluster if we were to continue running the cluster without the aid of a recoverable node.

A local checkpoint is started as soon as possible when a node is in this wait state. But we allow some extra time before we start a local checkpoint if other nodes are close to coming to this state.

secs_wait_subscription_handover is the final step in a restart. This synchronizes the new node with all MySQL replication servers and other API nodes that use the NDB Event API. If such an API node or MySQL server is currently down this phase will time out after two minutes.

total_restart_secs is the total time the restart took.
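The phase columns can be selected directly to see where the restart time went. This sketch uses the column names as given in this chapter; a few exact names may differ slightly between versions, so verify with DESCRIBE ndbinfo.restart_info:

```sql
-- Where did the restart time go? The local recovery phases are
-- usually the big consumers on large databases.
SELECT node_id,
       secs_for_local_recovery,
       secs_to_synchronize_starting_node,
       secs_wait_lcp_for_restart,
       secs_wait_subscription_handover,
       total_restart_secs
FROM ndbinfo.restart_info;
```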

Local checkpoint information#

We have a few tables that provide information about the local checkpoint characteristics and REDO logging. The disk_write_speed_aggregate_node table is a natural place to start looking. This table contains a summary per node of disk write speed of local checkpoint and REDO logs. It has columns for last second, last 10 seconds and last 60 seconds.
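As an example, assuming the column names used in the MySQL manual (separate backup/LCP and REDO speeds for the last second and the last 60 seconds), the per-node write rates can be read out with:

```sql
-- Checkpoint and REDO write rates per node, in bytes per second.
SELECT node_id,
       backup_lcp_speed_last_sec, redo_speed_last_sec,
       backup_lcp_speed_last_60sec, redo_speed_last_60sec
FROM ndbinfo.disk_write_speed_aggregate_node;
```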

Next the disk_write_speed_aggregate table has more detailed information. First it provides the information per ldm thread. It also provides the standard deviation of the aggregate numbers.

The disk write speed of a local checkpoint is set by an adaptive algorithm. It is slowed down when the CPU usage is too high, it is slowed down when we see a too high IO lag on the REDO log. Each second we make adjustments to the disk write speed of local checkpoints. If no slowdown is performed we will use the maximum disk write speed as configured by MaxDiskWriteSpeed. Slowdowns can slow it down to MinDiskWriteSpeed. The current disk write speed is reported in this table in the column current_target_disk_write_speed. At restarts we can use MaxDiskWriteOtherNodeRestart or MaxDiskWriteSpeedOwnRestart as the maximum local checkpoint speed.

In RonDB a more elaborate scheme is implemented on top of the above where it is possible to increase the speed of writing the checkpoints to ensure that we don't run out of REDO log. This is explained in a separate chapter on checkpointing in RonDB.

The disk_write_speed_base table contains the base data for the disk write speed of local checkpoints and REDO log for the last 60 seconds.

All the speeds are reported in bytes per second.

Watching those tables together with the table logspaces is useful to track if the size of our REDO log is sufficient. If it isn't we can either make it bigger (increased time for recovery), or we can increase the disk write speed (increased CPU overhead for checkpoints).

Increasing the disk write speed is handled by the *DiskWriteSpeed* parameters.

Increasing the REDO log size must be performed using an initial node restart since RonDB currently doesn't support changing the size of the REDO log as an online operation. It does support this for the UNDO log.

Disk page cache information#

RonDB has a normal disk page cache used to access data for disk columns. The diskpagebuffer table has statistics for the accesses to this page cache.

diskpagebuffer table#

There is one row for each ldm thread; in addition there is one row for the rep thread. This is the last thread, the so-called extra PGMAN worker thread. It hosts the proxy block of PGMAN that is used to read and write the extent pages. The extent pages are read in at creation of the data file of a tablespace. Once they have been read, they will not be swapped out again from the page cache. Thus the extra thread (the one with the highest thr_no) will mostly only perform page writes.

The column pages_written contains the number of pages written due to congestion in the page cache; the write is needed to remove a dirty page from the page cache to be able to read in new pages from disk.

The column pages_written_lcp is the number of pages written as part of a local checkpoint. This is the normal write: the page was written to and the next local checkpoint writes it out to ensure that we can cut the UNDO log and REDO log.

The counter pages_read is incremented every time a page completes a read from disk.

The counter log_waits shows the operation of the WAL protocol for disk data pages. Thus if a page is made dirty by an updating transaction an UNDO log entry is written. The page cannot be written until the UNDO log entry has been flushed to disk. If we attempt to write the page before the UNDO log has been flushed, this counter is incremented.

page_requests_direct_return counts the number of times that the requested page is already in the page cache and can be returned immediately.

page_requests_wait_queue is the number of times we request a page that has already been requested but has not yet arrived in the page cache. We wait in the queue for this page to be delivered.

page_requests_wait_io is the number of times we request a page and we have to start an IO request to fetch it from the disk.

page_requests_direct_return divided by the sum of page_requests_direct_return, page_requests_wait_io and page_requests_wait_queue is the page cache hit rate.

By making regular SELECTs against this table we can gather the page cache hit rate as it changes over time.
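The hit rate calculation above can be expressed directly in SQL; this sketch sums over all PGMAN instances per node:

```sql
-- Page cache hit rate per node, in percent.
SELECT node_id,
       100 * SUM(page_requests_direct_return) /
       SUM(page_requests_direct_return +
           page_requests_wait_queue +
           page_requests_wait_io) AS hit_rate_pct
FROM ndbinfo.diskpagebuffer
GROUP BY node_id;
```

Since the counters are cumulative, diffing two samples taken some seconds apart gives the hit rate over that interval rather than since node start.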

Communication information#

To keep track of communication speed and potential overload problems in the communication paths we have the transporters table to our disposal.

transporters#

There is one row for each communication link. Each data node is linked to all other nodes. There are no links between API nodes and there are no links from the RonDB management servers to other nodes (other than when fetching the configuration from the RonDB management server).

The row is identified by the node id and the remote node id of the communication link. Communication links between data nodes are reported from both data nodes. This communication link can be a bottleneck in heavy update scenarios.

status is the status of the link, CONNECTED when connected and CONNECTING when not connected. It can also be in some intermediate states while disconnecting.

When the status is CONNECTED we also have a hostname of the remote node.

We provide bytes_sent and bytes_received as counters that are constantly increasing. To measure the current transfer rate we need to make repeated queries against this table.

Every time a connection uses more than 60% of the configured SendBufferMemory it enters the slowdown state. The slowdown column is 1 if it is in this state and 0 otherwise. The slowdown_count is incremented every time we enter this state.

In the slowdown state any scan operations using this connection will automatically have its batch size set to a minimum to avoid overloading it even more.

Every time a connection uses more than 80% of the configured SendBufferMemory it enters the overload state. The overload column is 1 if it is in this state and 0 otherwise. The overload_count is incremented every time we enter this state.

Any attempt to start a new transaction or a new scan in this state will lead to an error indicating that the send buffer is overloaded.

We also track the number of times the connection between those nodes has been established.
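A quick health check for the communication links can then be sketched as follows, using only the counter columns described above plus connect_count:

```sql
-- Find links that have entered slowdown or overload state.
SELECT node_id, remote_node_id, status, connect_count,
       slowdown_count, overload_count
FROM ndbinfo.transporters
WHERE slowdown_count > 0 OR overload_count > 0;
```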

Status of REDO and UNDO logs#

The REDO log and UNDO log are important to track. If we run out of buffer space for any of those logs we will have to abort transactions.

The ndb$logspaces and ndb$logbuffers tables have one additional column high that records the highest use so far of the log space/log buffer.

logspaces table#

This is a simple table that has one row for each log. The REDO log thus has one row per log part (by default 4 of them) whereas the UNDO log only has one row since there is always just one log part.

The column total contains the total size of the log part and the column used contains the current usage.

Using this table it is very easy to see the percentage of the various log parts that are currently in use.

The columns in this table are node_id; log_type, which is either REDO or DD-UNDO; log_id, which is 0 for REDO logs and 10 for UNDO logs; and log_part, which is always 0 for UNDO logs and is the log part id for REDO logs (starting at 0).
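Putting this together, the REDO log fill level per log part can be read out as a percentage. This sketch uses the column names as documented in the MySQL manual (the part column is named log_part there):

```sql
-- Percentage used of each REDO log part.
SELECT node_id, log_part, 100 * used / total AS pct_used
FROM ndbinfo.logspaces
WHERE log_type = 'REDO';
```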

logbuffers table#

This table has similar columns where the most important ones are total and used. It represents the memory buffer for the UNDO log and the REDO log for each ldm thread. The log part id for REDO logs starts at 1 and runs up to the number of ldm threads.

Fragment statistics#

We have a number of tables that list information about all the fragment replicas in the cluster. There can be quite many such fragments in a large cluster with many nodes, so it is recommended to use these tables with a bit of care.

The base tables contain the same information; the ndbinfo tables are views that join in a bit of useful information such as table names and object types. The base tables for those views are ndb$frag_locks, ndb$frag_operations and ndb$frag_mem_use.

locks_per_fragment#

This table has lots of statistics about how many accesses of various types a specific fragment has received. We have columns tracking the number of requests with exclusive lock and other columns showing how many of those requests succeed immediately, how many put us in a lock queue that ends successfully and how many locks result in an aborted transaction. There are similar columns for shared locks.

There are also columns for how many milliseconds we spent waiting for row locks in each fragment and how many milliseconds of waiting resulted in aborts.

This table can be useful to get an overview of how many lock issues we have in our application. If we discover that we have issues we can drill down on those using the cluster_locks and server_locks tables that show the current locks and the waiters for those.
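For instance, assuming column names such as fq_name, ex_req, wait_ok_millis and wait_fail_millis as documented in the MySQL manual, the fragments suffering the most from lock waits can be found with:

```sql
-- Fragments with the most time spent in lock waits that aborted.
SELECT fq_name, node_id, fragment_num,
       ex_req, wait_ok_millis, wait_fail_millis
FROM ndbinfo.locks_per_fragment
ORDER BY wait_fail_millis DESC
LIMIT 10;
```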

operations_per_fragment#

This table gives very detailed statistics for each fragment. We can see the number of key reads, updates, writes and deletes and the number of key lookups that failed.

We can see the total number of bytes of interpreted programs we send, the total amount of key information we send, the total number of instructions executed in the interpreter by key lookups and the total number of bytes returned to the application from key lookups.

For scan operations we record the number of scans, the number of rows examined while scanning and the number of rows returned from scans. We also record the number of bytes returned from scans, the number of interpreted program bytes received, the number of index bound bytes received for scans, the total number of interpreted instructions executed as part of scans, the total number of queued scans, the current number of active scans and the current number of queued scans.

Finally there is the number of commits on this fragment. There are a lot of statistics we can derive for each fragment.

This table can be used to get a very detailed view on which tables and fragments are being used and in what way they are used. One should though be a bit careful about querying it with small delays between queries since the overhead of querying this table is fairly high, especially if the number of tables and nodes is high.
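As an illustration, assuming the scan counter columns as named in the MySQL manual (tot_frag_scans, tot_scan_rows_examined, tot_scan_rows_returned), the fragments doing the most scan work can be found with:

```sql
-- Most-scanned fragments; a large gap between rows examined and
-- rows returned suggests scans with poor filtering.
SELECT fq_name, fragment_num,
       tot_frag_scans, tot_scan_rows_examined, tot_scan_rows_returned
FROM ndbinfo.operations_per_fragment
ORDER BY tot_scan_rows_examined DESC
LIMIT 10;
```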

memory_per_fragment#

This table gives a very precise view on how much memory is allocated to each table fragment.

fixed_elem_alloc_bytes is the number of bytes allocated to the fragment for fixed size rows in total. This is always a multiple of 32 kBytes since pages are allocated to fragments.

var_elem_alloc_bytes is the number of bytes allocated to the fragment for variable sized rows in total. This is always a multiple of 32 kBytes since pages are allocated to fragments.

hash_index_alloc_bytes is the number of bytes allocated to the fragment for the hash index. This is always a multiple of 8 kBytes since a quarter of a 32 kByte page is the allocation unit for hash pages.

fixed_elem_size_bytes is the size of the fixed size part of the table and fixed_elem_count is the number of rows in the fragment (each row has a fixed size element, thus this is a count of the rows as well as a count of the fixed size parts in the fragment).

fixed_elem_free_bytes is the number of free bytes available in the already allocated space for the fixed size part of the rows. This free size is equal to fixed_elem_free_count multiplied by fixed_elem_size_bytes.

var_elem_free_bytes is the number of free bytes on already allocated variable sized pages.

var_elem_count is the number of variable sized parts allocated to the table.
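Summing these columns per table gives a precise memory footprint; a sketch assuming the view exposes the table name in its fq_name column:

```sql
-- Total memory allocated per table across all fragments.
SELECT fq_name,
       SUM(fixed_elem_alloc_bytes + var_elem_alloc_bytes +
           hash_index_alloc_bytes) AS total_bytes
FROM ndbinfo.memory_per_fragment
GROUP BY fq_name
ORDER BY total_bytes DESC;
```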

Error messages#

The error_messages table is a static table that reports all errors in RonDB. It lists the error code (error_code), the description of the error (error_description), the error status (error_status) and the classification of the error (error_classification).

This table can be used to quickly translate an error message to something hopefully understandable.
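A lookup is then a one-liner; the error code used here is just a placeholder for whatever code an application received:

```sql
-- Translate an NDB error code into a readable description.
SELECT error_code, error_description, error_status, error_classification
FROM ndbinfo.error_messages
WHERE error_code = 826;  -- the code to look up
```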

Summary of ndbinfo information#

ndbinfo is a useful tool to gather information about CPU usage, memory usage, network usage, disk usage, buffer spaces, log spaces, about meta data and detailed information about table fragments and transaction statistics and live information about transactions.