Recovery in RonDB
The next set of chapters deals with something we have put a lot of effort into: how to handle failures in RonDB. Node failures in RonDB are expected, but we work hard to ensure that they cause only temporary disruptions, and that the cluster is operational again within milliseconds after a software failure and within seconds after a hardware or operating system failure.
The first chapter in this part goes through how we handle node failures in the cluster, which failures lead to cluster crashes and how we handle network partitioning problems.
Next we go through our recovery architecture. It contains a large number of protocols required to handle node failures, start up nodes, and recover data in a crash situation.
Next we go through what the user can do to improve restart times. To understand this, we go through the major time-consuming parts of a restart and how to get information about how much time is consumed in each phase of the restart.
The next chapter goes through the start types we use and in which situations they are used.
The restart of a RonDB data node goes through a set of phases. We go through the details of these phases and also point to places in the code where they are covered in even greater detail.