Monitoring Restarts

There is an ndbinfo table called restart_info that records how long each of the 20 phases of a restart takes. It can be used for debugging restarts and for understanding how to improve restart times.
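For a quick overview, a query like the following returns the latest restart status and total restart time for every data node; it uses only columns that appear in the example outputs below:

mysql> SELECT node_id, node_restart_status, total_restart_secs
    -> FROM ndbinfo.restart_info;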

Let’s say we have just started a new cluster with one node group and a replication factor of two. We therefore have two data nodes.

After the initial cluster start, the table is empty:

mysql> USE ndbinfo;
mysql> SELECT * FROM restart_info;
Empty set (0.01 sec)

Now we stop one data node and check the table again after a few seconds:

mysql> SELECT * FROM restart_info \G
*************************** 1. row ***************************
                                     node_id: 2
                         node_restart_status: Node failure handling complete
                     node_restart_status_int: 3
               secs_to_complete_node_failure: 0
                    secs_to_allocate_node_id: 0
       secs_to_include_in_heartbeat_protocol: 0
          secs_until_wait_for_ndbcntr_master: 0
                secs_wait_for_ndbcntr_master: 0
                 secs_to_get_start_permitted: 0
     secs_to_wait_for_lcp_for_copy_meta_data: 0
                      secs_to_copy_meta_data: 0
                        secs_to_include_node: 0
secs_starting_node_to_request_local_recovery: 0
                     secs_for_local_recovery: 0
                      secs_restore_fragments: 0
                         secs_undo_disk_data: 0
                          secs_exec_redo_log: 0
                          secs_index_rebuild: 0
           secs_to_synchronize_starting_node: 0
                   secs_wait_lcp_for_restart: 0
             secs_wait_subscription_handover: 0
                          total_restart_secs: 0
1 row in set (0.01 sec)

As we can see, one of the first phases of the restart has already completed: node failure handling.
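While a node is down or restarting, its progress can be polled with a more targeted query; for example, for the node with node_id 2 shown above:

mysql> SELECT node_restart_status, node_restart_status_int
    -> FROM restart_info
    -> WHERE node_id = 2;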

Now we start the data node again and check the table after a few seconds:

mysql> SELECT * FROM restart_info \G
*************************** 1. row ***************************
                                     node_id: 2
                         node_restart_status: Wait LCP to ensure durability
                     node_restart_status_int: 17
               secs_to_complete_node_failure: 0
                    secs_to_allocate_node_id: 57
       secs_to_include_in_heartbeat_protocol: 2
          secs_until_wait_for_ndbcntr_master: 0
                secs_wait_for_ndbcntr_master: 0
                 secs_to_get_start_permitted: 0
     secs_to_wait_for_lcp_for_copy_meta_data: 0
                      secs_to_copy_meta_data: 0
                        secs_to_include_node: 3
secs_starting_node_to_request_local_recovery: 0
                     secs_for_local_recovery: 0
                      secs_restore_fragments: 0
                         secs_undo_disk_data: 0
                          secs_exec_redo_log: 0
                          secs_index_rebuild: 0
           secs_to_synchronize_starting_node: 0
                   secs_wait_lcp_for_restart: 0
             secs_wait_subscription_handover: 0
                          total_restart_secs: 0
1 row in set (0.00 sec)

As shown above, waiting for the distributed local checkpoint (LCP) is one of the last phases of the restart; it is required to ensure that the latest data can be recovered from the newly started node.
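The time spent waiting for this checkpoint, together with the subsequent subscription handover, is reported in dedicated columns that can be queried on their own:

mysql> SELECT node_id, secs_wait_lcp_for_restart,
    -> secs_wait_subscription_handover, total_restart_secs
    -> FROM restart_info;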

To make sure the restart has completed, we check one final time and see it confirmed:

mysql> SELECT * FROM restart_info \G
*************************** 1. row ***************************
                                     node_id: 2
                         node_restart_status: Restart completed
                     node_restart_status_int: 19
               secs_to_complete_node_failure: 0
                    secs_to_allocate_node_id: 57
       secs_to_include_in_heartbeat_protocol: 2
          secs_until_wait_for_ndbcntr_master: 0
                secs_wait_for_ndbcntr_master: 0
                 secs_to_get_start_permitted: 0
     secs_to_wait_for_lcp_for_copy_meta_data: 0
                      secs_to_copy_meta_data: 0
                        secs_to_include_node: 3
secs_starting_node_to_request_local_recovery: 0
                     secs_for_local_recovery: 0
                      secs_restore_fragments: 0
                         secs_undo_disk_data: 0
                          secs_exec_redo_log: 0
                          secs_index_rebuild: 0
           secs_to_synchronize_starting_node: 0
                   secs_wait_lcp_for_restart: 4
             secs_wait_subscription_handover: 8
                          total_restart_secs: 77
1 row in set (0.01 sec)

When reading the secs columns, keep in mind that we had not loaded any data into the database, so this restart was particularly fast.
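On a cluster that holds real data, these phases can take considerably longer. A monitoring script could poll the table and flag restarts that are still in progress or that took unusually long; a minimal sketch, using the status string shown above (the 600-second threshold is just an illustrative value):

mysql> SELECT node_id, node_restart_status, total_restart_secs
    -> FROM ndbinfo.restart_info
    -> WHERE node_restart_status <> 'Restart completed'  -- still in progress
    -> OR total_restart_secs > 600;                      -- arbitrary example threshold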