Timescale Cloud availability features depend on the service plan:
- Basic & Dev Plans: Single-node plans with limited availability and a two-day backup history (one day for Dev).
- Pro Plans: Two-node plans (master + standby) with higher availability and a three-day backup history.
Minor failures, such as a service process crash or a temporary loss of network access, are handled automatically in all plans without any major changes to the service deployment. The service returns to normal operation once the crashed process is restarted or network access is restored.
More severe failures, such as losing a node entirely, require more drastic recovery measures. An entire node (virtual machine) can be lost, for example, to a hardware failure or a sufficiently severe software failure.
The Timescale Cloud monitoring infrastructure detects a failing node automatically: either the node's self-diagnostics start reporting problems, or the node stops communicating entirely. When this happens, the monitoring infrastructure automatically schedules the creation of a replacement node.
Note that after a database failover the Service URL of your service remains the same. Only the IP address changes, so that it points at the new master node.
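Because only the IP address behind the Service URL changes, a client that opens a fresh connection from the hostname after a failure will reach the new master. The following is a minimal client-side sketch of that idea, not part of Timescale Cloud itself. The function name `connect_with_retry`, its parameters, and the default backoff values are assumptions made for illustration.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, backing off exponentially.

    connect() is expected to open a brand-new connection from the Service URL
    on every call, so the hostname is resolved again each time and can pick
    up the new master's IP address after a failover.
    """
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on as exc:  # e.g. connection refused, reset, DNS failure
            last_error = exc
            if attempt < attempts - 1:
                # Hypothetical backoff schedule: 0.5 s, 1 s, 2 s, ...
                sleep(base_delay * 2 ** attempt)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

With the standard library this could be called as `connect_with_retry(lambda: socket.create_connection((host, port)))`, where `host` and `port` are taken from your own Service URL. When using a database driver, pass the driver's own connection error type in `retry_on`, since driver errors are not necessarily subclasses of `OSError`.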
Single-Node: Basic & Dev Plans
Losing the only node of the service immediately starts the automatic process of creating a replacement node. The new node starts up, restores its state from the latest available backup, and resumes serving clients.
Because a single node was providing the service, the service is unavailable for the duration of the restore operation. In addition, any writes made since the most recent Write Ahead Log (WAL) file was backed up are lost. This window is typically limited to either five minutes or one WAL file.
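As a rough illustration of that window, the sketch below estimates the worst-case span of lost writes for a given WAL write rate. It rests on an assumption that is one reading of the text above, not a documented guarantee: a WAL file is backed up when it fills up or when five minutes have passed, whichever comes first. The 16 MiB figure is PostgreSQL's default WAL segment size; the function name and parameter are hypothetical.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # PostgreSQL's default WAL segment size
MAX_WINDOW_SECONDS = 300.0            # the five-minute bound quoted above


def worst_case_loss_seconds(wal_bytes_per_second: float) -> float:
    """Estimate how many seconds of writes could be lost with the node.

    Assumption: the current WAL file is backed up once it is full or once
    five minutes have elapsed, whichever happens first.
    """
    if wal_bytes_per_second <= 0:
        # An idle database never fills a segment, so the time bound applies.
        return MAX_WINDOW_SECONDS
    seconds_to_fill_segment = WAL_SEGMENT_BYTES / wal_bytes_per_second
    return min(MAX_WINDOW_SECONDS, seconds_to_fill_segment)
```

Under this assumption a busy database fills WAL files quickly and so loses a shorter span of time, while a quiet one is bounded by the five-minute limit.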
Highly Available: Pro Plans
When the failed node is a PostgreSQL Standby, the Master node keeps running and provides the normal service level to client applications. Once the replacement Standby node is ready and synchronized with the Master, it starts replicating the Master in real time and the situation returns to normal.
When the failed node is a PostgreSQL Master, the failover decision is based on combined information from the Timescale Cloud monitoring infrastructure and the Standby node. On the nodes themselves, the open source monitoring daemon PGLookout is used together with the information from the Timescale Cloud system infrastructure. If the Master node appears to be gone for good, the Standby node promotes itself to be the new Master node and immediately starts serving clients. A replacement node is scheduled automatically and becomes the new Standby node, as described in the Standby node failure case above.
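The idea of combining two independent views before promoting can be sketched as follows. This is a hypothetical illustration only and does not reproduce PGLookout's actual logic or configuration. The `MasterObservation` fields, the function name, and the 30-second grace period are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class MasterObservation:
    standby_sees_master: bool          # the Standby can still reach the Master
    monitoring_sees_master: bool       # the external monitoring can reach it
    seconds_since_last_contact: float  # time since anyone last heard from it


def should_promote_standby(obs: MasterObservation,
                           grace_period: float = 30.0) -> bool:
    """Decide whether the Standby should promote itself to Master.

    Promotion is only considered when BOTH views agree that the Master is
    unreachable, and only after a grace period, so that a brief network
    blip does not trigger a failover. The threshold is an assumed value.
    """
    if obs.standby_sees_master or obs.monitoring_sees_master:
        return False
    return obs.seconds_since_last_contact >= grace_period
```

Requiring agreement from both sources guards against the case where only the link between the Standby and the Master is broken while the Master is still serving clients.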
If both the Master and Standby nodes fail at the same time, two new nodes are automatically scheduled for creation and become the new Master and Standby nodes respectively. The Master node restores itself from the latest available backup, which means that some data loss is possible: any writes made since the most recent Write Ahead Log (WAL) file was backed up are lost. This window is typically limited to either five minutes or one WAL file.
The time it takes to replace a failed node depends mainly on the cloud region in use and the amount of data that needs to be restored. However, for services on two-node Pro plans, the surviving node keeps serving clients while the other node is being recreated. All of this is automatic and requires no administrator intervention.
For backups and restoration, Timescale Cloud uses PGHoard, the popular open source backup daemon that Timescale Cloud maintains. It copies Write Ahead Log (WAL) files to an object store in real time, in compressed and encrypted format.