Regular backups are essential for Canton Network nodes. This page covers backup procedures for both validators and SV nodes, restore operations, and disaster recovery scenarios.

Documentation Index
Fetch the complete documentation index at: https://cantonfoundation-issue-365-details-history.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Validator Backups
Identity Backups
Back up your validator’s node identities immediately after initial setup. The identities backup contains the participant’s private keys and is required for disaster recovery. Without an identities backup or a database backup, it is not possible to recover your validator’s assets in a disaster scenario. If your participant uses an external KMS, the KMS-stored keys can serve as a fallback, though recovery from KMS keys alone is more complex.

Database Backups
Back up all PostgreSQL databases used by your validator. You can use `pg_dump`, cloud provider snapshot tools, or persistent volume snapshots.
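As a sketch of the `pg_dump` approach, the following prints the dump commands for the two databases named elsewhere on this page (`validator` and `participant-<migration id>`, user `cnadmin`); the host, migration ID default, and file layout are assumptions to adjust for your deployment:

```shell
#!/usr/bin/env bash
# Dry-run sketch: build and print the pg_dump commands for a validator's
# databases. Database names and the cnadmin user mirror the Docker Compose
# restore commands later on this page; host and paths are assumptions.
set -euo pipefail

migration_id="${MIGRATION_ID:-0}"                     # assumed current migration ID
backup_dir="backups/$(date -u +%Y%m%dT%H%M%SZ)"

cmds=()
for db in validator "participant-${migration_id}"; do
  cmds+=("pg_dump -U cnadmin -h localhost $db -f ${backup_dir}/${db}.sql")
done

# Prints the commands for review; run them once the values are verified.
printf '%s\n' "${cmds[@]}"
```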
Database backups must be less than 30 days old to be usable for restore. Due to sequencer pruning, a participant that is more than 30 days behind cannot catch up on the synchronizer.
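The 30-day freshness rule can be checked mechanically. A minimal sketch, assuming backup timestamps are available as Unix epoch seconds:

```shell
#!/usr/bin/env bash
# Sketch: check whether a backup is still within the 30-day window required
# for restore; older backups cannot catch up due to sequencer pruning.
set -euo pipefail

backup_age_ok() {
  local backup_epoch="$1" now_epoch="$2"
  local max_age=$(( 30 * 24 * 3600 ))          # 30 days in seconds
  [ $(( now_epoch - backup_epoch )) -le "$max_age" ]
}

now=$(date -u +%s)
fresh=$(( now - 5 * 24 * 3600 ))               # 5-day-old backup
stale=$(( now - 40 * 24 * 3600 ))              # 40-day-old backup

backup_age_ok "$fresh" "$now" && echo "fresh backup: restorable"
backup_age_ok "$stale" "$now" || echo "stale backup: cannot catch up"
```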
SV Node Backups
Identity Backups
Once your SV node is onboarded, back up the node identities. This information is highly sensitive and contains the private keys of your participant, sequencer, and mediator. Store it in a secure location such as a secrets manager, outside the cluster. Fetch identities from the SV app; `<token>` is an OAuth2 Bearer Token obtained from your OAuth provider.
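As an illustration only (the hostname and endpoint path below are placeholders, not the documented API route), fetching the identities dump with curl might look like:

```shell
#!/usr/bin/env bash
# Sketch: fetch the SV node identities dump over the SV app's admin API.
# SV_APP_URL and the endpoint path are placeholders (assumptions); <token>
# is the OAuth2 Bearer Token obtained from your OAuth provider.
set -euo pipefail

SV_APP_URL="${SV_APP_URL:-https://sv.example.com}"   # placeholder host
TOKEN="${TOKEN:-<token>}"

curl_cmd=(curl -sSf
  -H "Authorization: Bearer ${TOKEN}"
  "${SV_APP_URL}/api/sv/admin/identities")           # placeholder path

printf '%s\n' "${curl_cmd[*]}"                # dry run: show the command
# "${curl_cmd[@]}" > sv-identities.json       # uncomment to run and save the dump
```

Store the resulting dump outside the cluster, e.g. in a secrets manager, as described above.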
PostgreSQL Backups
There is one strict ordering requirement: the apps PostgreSQL backup must be taken at a point in time strictly earlier than the participant backup. Complete the apps backup before starting the participant backup. Back up using `pg_dump`, persistent volume snapshots, or cloud provider tools (RDS snapshots, Cloud SQL backups). Back up at least every 4 hours.
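The ordering requirement can be expressed as a simple invariant: record when the apps dump finishes and when the participant dump starts, and require the first to precede the second. A sketch, with `dump_db` as a hypothetical stand-in for `pg_dump` or your snapshot tool:

```shell
#!/usr/bin/env bash
# Sketch of the ordering requirement: the apps backup must complete before
# the participant backup begins. dump_db is a stand-in (assumption) for
# pg_dump or a snapshot tool.
set -euo pipefail

dump_db() {
  echo "dumping $1"            # replace with the real dump command
}

dump_db apps                   # 1. apps backup runs first...
apps_finished=$(date -u +%s)

participant_started=$(date -u +%s)
dump_db participant            # 2. ...and only then the participant backup

[ "$apps_finished" -le "$participant_started" ] && echo "ordering satisfied"
```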
Historical Backups
Historical backups preserve a gap-less history from genesis for audit purposes and proving synchronizer state correctness. For the sequencer (when pruning is enabled): keep backups with a time difference smaller than the configured `retentionPeriod`.
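The retention rule can be checked the same way you would check backup freshness. A sketch, assuming a 30-day `retentionPeriod` and epoch-second backup timestamps:

```shell
#!/usr/bin/env bash
# Sketch: check that consecutive historical sequencer backups are spaced
# more closely than the configured retentionPeriod, so pruning never opens
# a gap in the preserved history. The 30-day value is an assumed example.
set -euo pipefail

retention_period_days=30
retention_period_s=$(( retention_period_days * 24 * 3600 ))

gap_ok() {  # args: epoch seconds of two consecutive backups
  local prev="$1" next="$2"
  [ $(( next - prev )) -lt "$retention_period_s" ]
}

now=$(date -u +%s)
prev_backup=$(( now - 14 * 24 * 3600 ))   # backup taken 14 days ago
old_backup=$(( now - 45 * 24 * 3600 ))    # backup taken 45 days ago

gap_ok "$prev_backup" "$now" && echo "gap within retentionPeriod: history preserved"
if gap_ok "$old_backup" "$now"; then
  echo "unexpected"
else
  echo "gap exceeds retentionPeriod: pruned events missing between backups"
fi
```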
Retain backups across major upgrades — both historical sequencer backups and backups of other components.
CometBFT Backups
Back up CometBFT storage (persistent volume snapshots) at least every 4 hours. CometBFT does not use PostgreSQL. CometBFT has pruning enabled by default, targeting at least 30 days of retained blocks. Keep historical backups such that the block height difference between two backups is smaller than the configured retention count. We recommend taking and preserving backups every 2 weeks.

Restoring a Validator
From Database Backups
Restoring from database backups is possible when all of the following hold:
- A database backup is available
- The backup is less than 30 days old
- If taken before a major upgrade, synchronizer nodes on the old migration ID are still available

On Kubernetes:
- Scale down all validator components to 0 replicas
- Restore all databases and storage from backups
- Scale components back to 1 replica
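The Kubernetes scale-down/restore/scale-up sequence can be sketched as a dry-run script; the namespace and deployment names are placeholders, not the actual chart's component names:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the Kubernetes restore sequence: scale components to 0,
# restore data, scale back to 1. Namespace and deployment names below are
# placeholders (assumptions); substitute your validator's components.
set -euo pipefail

namespace="validator"                         # assumed namespace
components=(validator-app participant)        # assumed deployment names

scale_cmds=()
for replicas in 0 1; do                       # 0 before restore, 1 after
  for c in "${components[@]}"; do
    scale_cmds+=("kubectl -n $namespace scale deployment/$c --replicas=$replicas")
  done
  if [ "$replicas" -eq 0 ]; then
    scale_cmds+=("# ...restore databases and persistent volumes here...")
  fi
done

printf '%s\n' "${scale_cmds[@]}"              # review, then run for real
```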
With Docker Compose:
- Stop the validator: `./stop.sh`
- Wipe the existing database volume: `docker volume rm compose_postgres-splice`
- Start only PostgreSQL: `docker compose up -d postgres-splice`
- Wait for readiness: `docker exec splice-validator-postgres-splice-1 pg_isready`
- Restore the validator database: `docker exec -i splice-validator-postgres-splice-1 psql -U cnadmin validator < $validator_dump_file`
- Restore the participant database: `docker exec -i splice-validator-postgres-splice-1 psql -U cnadmin participant-$migration_id < $participant_dump_file`
- Stop PostgreSQL: `docker compose down`
- Start your validator as usual
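The Docker Compose steps above can be collected into one script. This is a sketch that only prints the commands for review; the dump file names and migration ID defaults are assumptions:

```shell
#!/usr/bin/env bash
# Dry-run sketch: the Docker Compose restore steps as one ordered list.
# Defaults for the dump files and migration ID are assumptions; set the
# environment variables to match your backups before running for real.
set -euo pipefail

migration_id="${migration_id:-0}"
validator_dump_file="${validator_dump_file:-validator.sql}"
participant_dump_file="${participant_dump_file:-participant.sql}"
pg="splice-validator-postgres-splice-1"

restore_steps=(
  "./stop.sh"
  "docker volume rm compose_postgres-splice"
  "docker compose up -d postgres-splice"
  "docker exec $pg pg_isready"
  "docker exec -i $pg psql -U cnadmin validator < $validator_dump_file"
  "docker exec -i $pg psql -U cnadmin participant-$migration_id < $participant_dump_file"
  "docker compose down"
)

printf '%s\n' "${restore_steps[@]}"   # review, then execute step by step
```

In practice you may want to retry `pg_isready` in a loop rather than calling it once, since PostgreSQL can take a few seconds to accept connections after startup.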
From Identity Backups (Re-onboarding)
If a full database backup is unavailable but you have an identities backup, you can recover Canton Coin balances and CNS entries by deploying a new validator with the original namespace key. The new validator uses the identity backup to migrate parties from the original validator. SVs assist by providing contract information for the migrated parties.

Restoring an SV Node
There are three recovery paths depending on the failure scope:
- Single node failure (network healthy) — Restore from backup. Scale down all SV components, restore all databases and CometBFT storage, scale back up. Components catch up from peer SVs automatically.
- No full backup but identities available — Deploy a standalone validator using the SV’s participant keys, recover assets, then onboard a fresh SV and transfer assets.
- Network-wide failure (CometBFT layer lost) — All SVs coordinate a disaster recovery procedure (see below).
Full Node Restore (Kubernetes)
Network-Wide Disaster Recovery
If the entire CometBFT layer is lost beyond repair, SVs follow a coordinated recovery process similar to the migration dump procedure used for major upgrades. The key difference is that the downtime is unscheduled and the recovery timestamp will likely be earlier than the incident, resulting in some data loss. High-level steps:
- All SVs agree on a recovery timestamp
- Each SV fetches a data dump from their SV app for that timestamp
- Each SV creates a migration dump by merging the data dump with their identity backups
- SVs deploy a new synchronizer
- Each SV copies the migration dump to their SV app’s PVC and restarts to import it
Next Steps
- Security Operations — Key management and security hardening
- Upgrade Procedures — Major and minor upgrade processes
Repair procedures
Choose one of the following repair procedures based on the type of Participant Node fault in need of repair. Repairing Participant Nodes is dangerous and complex. Therefore, you are discouraged from attempting to repair Participant Nodes on your own unless you have expert knowledge and experience. Instead, you are strongly advised to repair Participant Nodes only with the help of technical support.

Recovering from a lost Synchronizer
If a Synchronizer is irreparably lost or no longer trusted, the Participant Nodes previously connected to and sharing active contracts via that Synchronizer can recover and migrate their contracts to a new Synchronizer. The Participant Node administrators need to coordinate as follows:
- Disconnect all Participant Nodes from the Synchronizer deemed lost or untrusted.
- Identify the Synchronizer configuration of the newly provisioned Synchronizer.
- All Participant Node administrators invoke the `migrate_synchronizer` method, specifying the “old” Synchronizer as the “source” and the “new” Synchronizer as the “target”. Additionally, set the “force” flag to true, given that the old Synchronizer is either lost or untrusted in this scenario.
- The Participant Nodes connect to the new Synchronizer.
- The Synchronizer loss may have resulted in active contract set inconsistencies. In such cases, the Participant Node administrators need to agree on whether contracts in an inconsistent state on the different Participant Nodes should be removed or added. Refer to troubleshooting ACS commitments to identify and repair ACS differences. As these methods are powerful but dangerous, do not attempt to repair your Participant Nodes on your own, as you risk severe data corruption; perform the repair in the context of technical support.
Fully rehydrating a Participant Node
Fully rehydrating a Participant Node means recovering the Participant Node after its database has been fully emptied or lost. If a Participant Node needs to be rehydrated, but no Participant Node backup is available or the backup is faulty, you may be able to rehydrate the Participant Node from a Synchronizer, as long as the Synchronizer has never been pruned and you have local console access to the Participant Node. You can preserve the Participant Node’s identity and secret keys.

If you are running your production Participant Node in a container, you need to create a new configuration file that allows you to access the database of the Participant Node from an interactive console. Make sure that the Participant Node process is stopped and that nothing else is accessing the same database. Ensure that database replication is turned on if the Participant Node has been configured for high availability. Also, make sure that the Participant Node does not perform auto-initialization, which would create a new identity, by disabling the `auto-init` configuration option. Start the console with the `--manual-start` option to prevent the Participant Node from reconnecting to the Synchronizers.
The `dars.download` command is a convenience command to download all Daml archive files (DARs) that have been added to the participant via the console command `participant.dars.upload`. DARs that were uploaded through the Ledger API need to be manually re-uploaded to the new Participant Node.
After downloading, stop the Participant Node, back up the database, and then truncate the database. Then, restart the Participant Node and upload the data again:
Unblocking a Participant Node Synchronizer connection
If a Participant Node is unable to process events from a Synchronizer, and the Participant Node process continuously crashes when reconnecting to the Synchronizer, it may be necessary to “ignore” the problematic event. This procedure describes how to unblock the Participant Node.
- Ensure that the cause is not a more common issue such as lack of database connectivity. Only proceed with the following steps if the Participant Node logs rule out more common issues. Inspect the Participant Node logs for “internal” errors referring to the Synchronizer ID. If the logs show an `ApplicationHandlerException`, note the first and last “sequencing timestamps”.
- Disconnect the Participant Node from the Synchronizer whose events you want to ignore, and restart the Participant Node after enabling the repair commands and internal state inspection. Make sure that the Participant Node is disconnected from the Synchronizer.
- Determine the “Sequencer counters” to ignore.
  - If in step 1 you identified the first and last sequencing timestamps, translate the timestamps to first and last sequencer counters:
  - Otherwise, you may choose to ignore the last event that the Participant Node has processed from the Synchronizer and look up the sequencer counter as follows:
- Use the `repair.ignore_events` command to ignore the sequenced events that the Participant Node is unable to process. The command takes the Synchronizer ID and the first and last sequencer counters of the events to ignore as parameters. You should not set the force flag. If you find that you have made a mistake and chose the wrong sequencer counters, you can invoke `repair.unignore_events` to “unignore” the events.
- Next, reconnect the Participant Node to the Synchronizer and check that the Participant Node is able to process sequenced events again. If the Participant Node is still blocked, you may need to repeat the previous steps.
- Once the Participant Node is unblocked and consuming events from the Synchronizer again, Participant Nodes may have an inconsistent ACS. Look for errors in the log that begin with “ACS_COMMITMENT”, for example “ACS_COMMITMENT_MISMATCH”. Inspect not only the repaired Participant Node, but also the other Participant Nodes that were involved in the transactions behind the ignored events.
- If there are ACS inconsistencies, refer to troubleshooting ACS commitments to identify and repair ACS differences.
- Disable repair commands and internal state inspection by removing the configurations added in step 2 and restarting the Participant Node. This is important to prevent accidental misuse of the repair commands in the future.