There are three ways to recover from disasters, depending on the scope and severity of the failure:
- Single node affected, network healthy — Restore the node from backup.
- Full backup unavailable but identities backup exists — Re-onboard the SV and recover Amulets through a standalone validator.
- Network-wide failure — Follow the CometBFT disaster recovery procedure across all SV operators.
Recovery of assets requires at least one of the following: a recent database backup, an up-to-date identities backup, or an external KMS that still retains the participant keys.
Restoring from data corruption
While all components in the system are designed to be crash-fault-tolerant, there is always a chance of failures from which the system cannot immediately recover, e.g. due to misconfiguration or bugs. In such cases, you will need to restore a component, a full SV node, or even the whole network from backups or from dumps taken from components that are still operational.
Restoring a full SV node from backups
The backup of the apps database instance must be taken at a time strictly earlier than that of the participant backup. This ordering is required for the set of backup snapshots to be consistent. Given a consistent set of backups, follow these steps:
- Scale down all components in the SV node to 0 replicas (replace `-0` with the correct migration ID if a migration has already been performed); see the sketch after this list.
- Restore the storage and databases of all components from the backups. The exact process depends on your storage and database setup.
- Scale up all components back to 1 replica (see the sketch after this list).
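The exact commands depend on how the node is deployed. As a rough sketch, assuming the SV node components run as Kubernetes deployments in a namespace called `sv` (the namespace, the use of `--all`, and the single-replica counts are assumptions; adjust them, and the migration ID suffix in the deployment names, to your setup):

```bash
# Assumed namespace "sv"; list the deployments first and adjust as needed.
kubectl -n sv get deployments

# Scale everything down to 0 replicas before touching storage and databases.
kubectl -n sv scale deployment --all --replicas=0

# ... restore storage and databases from the consistent backup set ...

# Scale everything back up to 1 replica once the restore has completed.
kubectl -n sv scale deployment --all --replicas=1
```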
Re-onboard an SV and recover Amulets with a validator
In the case of a catastrophic failure of the SV node, the amulets owned by the SV can be recovered by deploying a standalone validator node with control over the SV's participant keys. The SV node can then be re-onboarded with a new identity, and the amulets can be transferred from the validator to the new SV node.
Prerequisites
You need the backup of your SV node identities (see the backup procedures documentation). From that backup, extract the participant identities, for example as sketched below.
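A minimal sketch of that extraction, assuming the identities backup is a single JSON file (called `sv_identities_backup.json` here, a hypothetical name) with a top-level `participant` section; both the file name and the structure are assumptions, so check your actual backup before running it:

```bash
# Extract the participant identities into the file used in the recovery steps below.
# "sv_identities_backup.json" and the ".participant" path are assumptions.
jq '.participant' sv_identities_backup.json > dump_identities.json
```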
Recovery steps
- Wait for the failed SV node to be offboarded by a majority of SVs through a governance vote on an `OffboardMember` action.
- Deploy a standalone validator node following the standard validator installation steps. When doing so:
  - Set `validatorPartyHint` to the name you chose when creating the SV identity.
  - Restore the validator with the identities from the backup, using the `dump_identities.json` file prepared above.
- Log in to the wallet at `https://wallet.validator.YOUR_HOSTNAME` with the validator user account. Confirm that the wallet balance matches what the original SV owned.
- Deploy and onboard a fresh SV node (reusing your SV identity but otherwise starting from a clean slate).
- Log in to the wallet of the new SV node and copy the new party ID.
- In the validator wallet, create a transfer offer sending the amulets to the new SV node using the copied party ID.
- In the new SV node wallet, accept the transfer offer and verify that the amulets arrived as expected.
Disaster recovery from loss of CometBFT layer
This procedure applies when the entire CometBFT layer of the network is lost beyond repair. It follows a process similar to migration dumps but is unscheduled, and the recovery timestamp will likely be earlier than the time of the incident. Data loss between the recovery timestamp and the incident is expected.
High-level steps
- All SVs agree on a timestamp from which they will recover.
- Each SV operator fetches a data dump from their SV app for that timestamp.
- Each SV operator creates a migration dump file combining the data dump with their backed-up identities.
- SV operators deploy a new synchronizer.
- Each SV operator copies the migration dump file to their SV app’s PVC and restarts the app to migrate the data (a sketch of this step follows the list).
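As an illustration of that last step, a sketch assuming the SV app runs in a Kubernetes namespace `sv`, its pods carry an `app=sv-app` label, and the dump PVC is mounted at `/domain-upgrade-dump`; every one of these names, including the target file name, is an assumption to be adapted to your deployment:

```bash
# Namespace, label, mount path, and file names below are all assumptions.
SV_APP_POD=$(kubectl -n sv get pods -l app=sv-app -o jsonpath='{.items[0].metadata.name}')
kubectl -n sv cp migration_dump.json "$SV_APP_POD:/domain-upgrade-dump/migration_dump.json"
kubectl -n sv rollout restart deployment sv-app
```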
Finding a consistent recovery timestamp
Unlike the migration process, in a disaster the synchronizer has not been paused in an orderly manner, so you cannot assume that all SVs are caught up to the same point. The SV operators therefore need to agree on a timestamp that many of the SV nodes have reached. Search your participant log files for `CommitmentPeriod` to identify the periods for which your participant has committed to the Active Contract Set (ACS), and look for “Commitment correct for sender” messages.
The `toInclusive` time in those messages indicates when your participant agreed with another node on the ACS state. SVs should find a period for which most of them have mutually committed.
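A quick way to pull these messages out of the logs, assuming plain-text participant log files (the log path and format are assumptions; adjust them to your logging setup):

```bash
# Show the most recent ACS commitment confirmations from the participant log.
grep "Commitment correct for sender" /var/log/participant/participant.log | tail -n 20
```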
All transactions after the chosen timestamp will be lost. Any SV that has not committed to the chosen timestamp must either be re-onboarded or copy a data dump from another SV that has reached that timestamp.
It is better to go a bit further back in time than the last agreed-upon ACS commitment — roughly 15 minutes is a reasonable margin, since most validators have at least two transactions per round.
Creating a migration dump
Fetch the data dump from the SV app, replacing `<token>` with an OAuth2 Bearer token and `<timestamp>` with the agreed-upon timestamp in ISO format (e.g., 2024-04-17T19:12:02Z).
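A sketch of what the call might look like, assuming the SV app is reachable at `https://sv.sv.YOUR_HOSTNAME` and exposes a data-snapshot endpoint under `/api/sv/v0/admin/domain/data-snapshot`; both the hostname and the endpoint path are assumptions, so confirm them against your SV app's API reference:

```bash
# Hostname and endpoint path are assumptions; the token and timestamp follow
# the conventions described above.
curl -sSf "https://sv.sv.YOUR_HOSTNAME/api/sv/v0/admin/domain/data-snapshot?timestamp=2024-04-17T19:12:02Z" \
  -H "Authorization: Bearer <token>" \
  -o data_dump.json
```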
Both the participant and sequencer must still be running for this call to succeed. If the call fails with a 400 error, your participant has been pruned beyond the chosen timestamp. If it fails with a 429, the timestamp is too late for your participant.
Merge the identities dump with the data dump:
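A minimal sketch of the merge, assuming the migration dump is a single JSON object carrying the identities and the data dump as top-level fields; the field names (`identities`, `dataSnapshot`) and file names used here are assumptions, so check the dump format your SV app version expects:

```bash
# Combine the backed-up identities and the fetched data dump into one file.
# Field names and file names are assumptions and must match your SV app's format.
jq -s '{ identities: .[0], dataSnapshot: .[1] }' \
  sv_identities_backup.json data_dump.json > migration_dump.json
```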