Timescale Cloud Pro plans include one standby read-replica server. A read-replica server can be queried but accepts no writes. If the master server fails, the standby replica server is automatically promoted to master. This differs from read-replica services added after service creation: those manually created read-replica services are not promoted if the master server fails.
Failovers and switchovers occur in two distinct cases:
- Unexpected loss of the master or a replica (for example, the hardware hosting the virtual machine fails)
- Controlled switchover during rolling-forward upgrades
Unexpected loss of the master or a replica
When a server leaves unexpectedly, there is no way to know whether it has really disappeared or whether there is a temporary glitch in the cloud provider's network.
For a replica, there is a 300-second timeout before the Timescale Cloud management platform decides the server is gone and spins up a new one. During this 300-second period, replica.servicename.timescaledb.io points to a server that may no longer serve queries. The DNS record pointing to the master, servicename.timescaledb.io, works fine. If the replica server does not come back within the 300-second period, replica.servicename.timescaledb.io is pointed to the master server until a new replica server is built.
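A client that reads from the replica can cover the 300-second window by falling back to the master hostname when the replica hostname does not answer. The following is a minimal sketch, not an official client: `connect_with_fallback` and the `connect` callable are hypothetical names, and `connect` stands in for whatever your database driver uses to open a connection to a host.

```python
from typing import Callable, Iterable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_fallback(
    connect: Callable[[str], T],
    hosts: Iterable[str],
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Try each host in order and return the first successful connection.

    `connect` is a hypothetical callable supplied by the caller (for example,
    a thin wrapper around a database driver). `retry_on` lists the exception
    types that mean "this host is unreachable, try the next one".
    """
    last_error = None
    for host in hosts:
        try:
            return connect(host)
        except retry_on as exc:
            # Remember the failure and move on to the next host.
            last_error = exc
    if last_error is None:
        raise ValueError("no hosts given")
    raise last_error
```

Passing the hosts in the order `["replica.servicename.timescaledb.io", "servicename.timescaledb.io"]` prefers the replica and uses the master only when the replica cannot be reached. Note that the fallback sends read traffic to the master, so it trades master load for availability.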
If the master disappears, a replica server waits 60 seconds before promoting itself to master. During this 60-second timeout the master is unavailable (i.e., servicename.timescaledb.io does not respond), while replica.servicename.timescaledb.io works fine (in read-only mode). After the replica server promotes itself, servicename.timescaledb.io points to the new master server, and replica.servicename.timescaledb.io does not change (i.e., it continues to point to the same server, which is now the new master). A new replica server is built automatically, and once it is in sync, replica.servicename.timescaledb.io is pointed to the new replica server.
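Because the master hostname does not respond during the 60-second promotion timeout, write clients usually need to retry for somewhat longer than that. The sketch below shows one way to do so with a fixed retry interval and an overall deadline. The helper name `retry_until`, the 90-second default (60 seconds plus an assumed margin for the DNS update), and the 5-second interval are illustrative assumptions, not values from the platform.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_until(
    operation: Callable[[], T],
    timeout_s: float = 90.0,   # assumption: 60 s promotion wait plus margin
    interval_s: float = 5.0,   # assumption: pause between attempts
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or the deadline would be exceeded.

    `clock` and `sleep` are injectable so the helper can be tested without
    actually waiting.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return operation()
        except retry_on:
            # Give up if another wait would take us past the deadline.
            if clock() + interval_s > deadline:
                raise
            sleep(interval_s)
```

Wrap only operations that are safe to repeat. A write that failed midway may or may not have been applied, so the operation should be idempotent or run inside a transaction that is retried as a whole.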
Controlled switchover during upgrades
When applying upgrades (or plan changes) to business or premium plans, we first replace the standby server(s):
- A new server is started, a backup is restored, and the new server starts following the old master server. Once the new server is up and running, replica.servicename.timescaledb.io is updated to point to it and the old replica server is deleted. For premium plans, this step is executed for both replica servers before the master server is replaced.
- Another server is started, a backup is restored, and the new server is synced with the old master server. Replication is then changed to quorum-commit synchronous replication where available (lower performance, but stronger guarantees against data loss when changing the master server). At this point one extra server is running: there is the old master server plus two or three new replica servers (for business and premium plans, respectively).
- When it is time to switch the master to a new server, the old master is terminated (synchronous replication guarantees that the data has been received by at least one of the new replica servers), and one of the new replica servers is immediately promoted to master. At this point servicename.timescaledb.io is updated to point to the new master server, and the new master is removed from the replica.servicename.timescaledb.io record.
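The order of the steps above can be traced with a small in-memory model. This is only an illustration of the sequence as described here, not the platform's implementation: the `Cluster` class, the `controlled_switchover` function, and the server names are all hypothetical, and the choice of which new replica gets promoted is arbitrary because the text does not specify it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    master: str
    replicas: List[str]
    synchronous: bool = False
    events: List[str] = field(default_factory=list)


def controlled_switchover(cluster: Cluster) -> Cluster:
    # Step 1: replace every standby with a new server that follows the old master.
    new_replicas: List[str] = []
    for index, old in enumerate(cluster.replicas, start=1):
        new = f"new-{index}"
        new_replicas.append(new)
        cluster.events.append(f"{new} replaces {old}; replica DNS record updated")

    # Step 2: start one more server, sync it, and switch to synchronous replication.
    extra = f"new-{len(new_replicas) + 1}"
    new_replicas.append(extra)
    cluster.synchronous = True
    cluster.events.append(
        f"{extra} synced; old master plus {len(new_replicas)} new replicas running"
    )

    # Step 3: terminate the old master and promote one new replica.
    # Promoting the last server is an arbitrary choice for this sketch.
    old_master = cluster.master
    promoted = new_replicas.pop()
    cluster.master = promoted
    cluster.replicas = new_replicas
    cluster.events.append(f"{old_master} terminated; {promoted} promoted to master")
    return cluster
```

Starting from one master and two standbys (the premium layout), the model ends with one new master and two new replicas, and at its peak holds the old master plus three new replicas, which matches the counts given above. The model leaves `synchronous` set after the switchover because the text does not say when replication returns to its normal mode.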