
Regular backups are essential for Canton Network nodes. This page covers backup procedures for both validators and SV nodes, restore operations, and disaster recovery scenarios.

Validator Backups

Identity Backups

Back up your validator’s node identities immediately after initial setup. The identities backup contains the participant’s private keys and is required for disaster recovery.
Identity backups contain private keys. Store them in a secure location such as a secrets manager, separate from your cluster.
Without an identities backup or a database backup, it is not possible to recover your validator’s assets in a disaster scenario. If your participant uses an external KMS, the KMS-stored keys can serve as a fallback, though recovery from KMS keys alone is more complex.
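As a minimal sketch of capturing this backup (the endpoint path and hostname below are assumptions modeled on the SV identities-dump endpoint shown later; verify them against your validator app's admin API):
# Fetch the identities dump and write it to a local file (endpoint path is an assumption).
curl -sSLf "https://wallet.validator.YOUR_HOSTNAME/api/validator/v0/admin/domain/identities-dump" \
  -H "authorization: Bearer ${TOKEN}" \
  -o validator-identities-dump.json
# Then store the file in your secrets manager rather than on cluster storage.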

Database Backups

Back up all PostgreSQL databases used by your validator. You can use pg_dump, cloud provider snapshot tools, or persistent volume snapshots. Database backups must be less than 30 days old to be usable for restore. Due to sequencer pruning, a participant that is more than 30 days behind cannot catch up on the synchronizer.
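For example, a minimal pg_dump sketch using the database names and user that appear in the Docker Compose restore steps below (validator, participant-<migration_id>, user cnadmin); adjust names and connection details to your deployment:
# Dump the validator app and participant databases as plain SQL files.
migration_id=0   # set to your current migration ID
pg_dump -U cnadmin -h "$DB_HOST" -f "validator-$(date +%F).sql" validator
pg_dump -U cnadmin -h "$DB_HOST" -f "participant-${migration_id}-$(date +%F).sql" "participant-${migration_id}"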

SV Node Backups


Identity Backups

Once your SV node is onboarded, back up the node identities. This information is highly sensitive and contains the private keys of your participant, sequencer, and mediator. Store it in a secure location such as a secrets manager, outside the cluster. Fetch identities from the SV app:
curl "https://sv.sv.YOUR_HOSTNAME/api/sv/v0/admin/domain/identities-dump" \
  -H "authorization: Bearer <token>"
Where <token> is an OAuth2 Bearer Token obtained from your OAuth provider.

PostgreSQL Backups

There is one strict ordering requirement: the apps PostgreSQL backup must be taken at a point in time strictly earlier than the participant backup. Complete the apps backup before starting the participant backup. Back up using pg_dump, persistent volume snapshots, or cloud provider tools (RDS snapshots, Cloud SQL backups). Back up at least every 4 hours.
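For illustration, a sketch of that ordering with plain pg_dump; the database names and hosts are assumptions, so substitute the ones used by your deployment:
# 1. Apps databases first (names are illustrative).
for db in sv scan validator; do
  pg_dump -U cnadmin -h "$APPS_DB_HOST" -f "${db}-$(date +%FT%H%M).sql" "$db"
done
# 2. Only after all apps dumps have completed, dump the participant database.
pg_dump -U cnadmin -h "$PARTICIPANT_DB_HOST" -f "participant-$(date +%FT%H%M).sql" participant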

Historical Backups

Historical backups preserve a gap-less history from genesis, both for audit purposes and for proving synchronizer state correctness. For the sequencer (when pruning is enabled): keep backups with a time difference smaller than the configured retentionPeriod. Retain backups across major upgrades, including both historical sequencer backups and backups of other components.
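As a sketch only: if the configured retentionPeriod were 30 days, a weekly cadence leaves ample margin. The database name, host, and schedule below are hypothetical placeholders:
# Hypothetical cron entry: take a historical sequencer backup weekly and keep it indefinitely
# (comfortably within an assumed 30-day retentionPeriod).
# 0 2 * * 0  /opt/backups/sequencer-historical-backup.sh
# The script could dump an assumed sequencer database name and move the file to long-term storage:
pg_dump -U cnadmin -h "$SEQUENCER_DB_HOST" -f "sequencer-historical-$(date +%F).sql" sequencer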

CometBFT Backups

Back up CometBFT storage (persistent volume snapshots) at least every 4 hours. CometBFT does not use PostgreSQL. CometBFT has pruning enabled by default, targeting at least 30 days of retained blocks. Keep historical backups such that the block height difference between two consecutive backups is smaller than the configured retention count; we recommend taking and preserving a historical backup every 2 weeks.
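One way to take those snapshots on Kubernetes is the CSI VolumeSnapshot API; a minimal sketch, assuming a CSI snapshot class is installed and using a hypothetical PVC name (check the actual claim name in your cluster):
kubectl apply -n sv -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: cometbft-backup-$(date +%Y%m%d%H%M)
spec:
  volumeSnapshotClassName: YOUR_SNAPSHOT_CLASS
  source:
    persistentVolumeClaimName: global-domain-0-cometbft-data  # assumed PVC name
EOF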

Restoring a Validator


From Database Backups

Restoring from database backups is possible when all of the following hold:
  • A database backup is available
  • The backup is less than 30 days old
  • If taken before a major upgrade, synchronizer nodes on the old migration ID are still available
Kubernetes:
  1. Scale down all validator components to 0 replicas
  2. Restore all databases and storage from backups
  3. Scale components back to 1 replica
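As an illustrative sketch of that pattern (the namespace and deployment names are assumptions; the SV restore example further below shows the same approach with its own names):
# Scale down (deployment names are illustrative placeholders).
kubectl scale deployment --replicas=0 -n validator participant validator-app
# ... restore databases and persistent volumes from backups ...
kubectl scale deployment --replicas=1 -n validator participant validator-app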
Users onboarded after the backup was taken must be re-onboarded manually.
Docker Compose:
  1. Stop the validator: ./stop.sh
  2. Wipe the existing database volume: docker volume rm compose_postgres-splice
  3. Start only PostgreSQL: docker compose up -d postgres-splice
  4. Wait for readiness: docker exec splice-validator-postgres-splice-1 pg_isready
  5. Restore the validator database: docker exec -i splice-validator-postgres-splice-1 psql -U cnadmin validator < $validator_dump_file
  6. Restore the participant database: docker exec -i splice-validator-postgres-splice-1 psql -U cnadmin participant-$migration_id < $participant_dump_file
  7. Stop PostgreSQL: docker compose down
  8. Start your validator as usual

From Identity Backups (Re-onboarding)

If a full database backup is unavailable but you have an identities backup, you can recover Canton Coin balances and CNS entries by deploying a new validator with the original namespace key. The new validator uses the identity backup to migrate parties from the original validator. SVs assist by providing contract information for the migrated parties.

Restoring an SV Node

There are three recovery paths depending on the failure scope:
  1. Single node failure (network healthy) — Restore from backup. Scale down all SV components, restore all databases and CometBFT storage, scale back up. Components catch up from peer SVs automatically.
  2. No full backup but identities available — Deploy a standalone validator using the SV’s participant keys, recover assets, then onboard a fresh SV and transfer assets.
  3. Network-wide failure (CometBFT layer lost) — All SVs coordinate a disaster recovery procedure (see below).

Full Node Restore (Kubernetes)

# Scale down all components
kubectl scale deployment --replicas=0 -n sv \
  global-domain-0-cometbft \
  global-domain-0-mediator \
  global-domain-0-sequencer \
  participant-0 \
  scan-app \
  sv-app \
  validator-app

# Restore all databases and storage from backups
# (process depends on your storage and DB setup)

# Scale back up
kubectl scale deployment --replicas=1 -n sv \
  global-domain-0-cometbft \
  global-domain-0-mediator \
  global-domain-0-sequencer \
  participant-0 \
  scan-app \
  sv-app \
  validator-app
Components will catch up their state from peer SVs once healthy.
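To confirm they have come back healthy before relying on them, you can watch the rollouts, using the same namespace and deployment names as in the example above:
# Wait for each component to report a successful rollout, then check pod status.
for d in global-domain-0-cometbft global-domain-0-mediator global-domain-0-sequencer \
         participant-0 scan-app sv-app validator-app; do
  kubectl rollout status deployment/$d -n sv --timeout=15m
done
kubectl get pods -n sv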

Network-Wide Disaster Recovery

If the entire CometBFT layer is lost beyond repair, SVs follow a coordinated recovery process similar to the migration dump procedure used for major upgrades. The key difference is that the downtime is unscheduled and the recovery timestamp will likely be earlier than the incident, resulting in some data loss. High-level steps:
  1. All SVs agree on a recovery timestamp
  2. Each SV fetches a data dump from their SV app for that timestamp
  3. Each SV creates a migration dump by merging the data dump with their identity backups
  4. SVs deploy a new synchronizer
  5. Each SV copies the migration dump to their SV app’s PVC and restarts to import it
Validators follow a parallel process: they fetch their own data dump, copy it to the validator app’s volume, and redeploy with migration configuration enabled. The recovery timestamp is chosen by inspecting participant logs for ACS commitment periods where most SVs have mutually committed. Going 15 minutes further back than the last agreed commitment is a good safety margin.
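For step 2, fetching the timestamped data dump from the SV app typically resembles the identities-dump call shown earlier; the endpoint path and query parameter below are assumptions, so confirm them against your SV app's admin API before relying on them:
# Hypothetical example: fetch a data dump for the agreed recovery timestamp.
curl -sSLf "https://sv.sv.YOUR_HOSTNAME/api/sv/v0/admin/domain/data-snapshot?timestamp=2024-01-01T00:00:00Z" \
  -H "authorization: Bearer ${TOKEN}" \
  -o sv-data-snapshot.json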

Next Steps


Repair procedures

Choose one of the following repair procedures based on the type of Participant Node fault in need of repair. Repairing Participant Nodes is dangerous and complex. Therefore, you are discouraged from attempting to repair Participant Nodes on your own unless you have expert knowledge and experience. Instead, you are strongly advised to only repair Participant Nodes with the help of technical support.

Recovering from a lost Synchronizer

If a Synchronizer is irreparably lost or no longer trusted, the Participant Nodes previously connected to and sharing active contracts via that Synchronizer can recover and migrate their contracts to a new Synchronizer. The Participant Node administrators need to coordinate as follows:
  1. Disconnect all Participant Nodes from the Synchronizer deemed lost or untrusted.
  2. Identify the Synchronizer configuration of the newly provisioned Synchronizer.
  3. All Participant Node administrators invoke the migrate_synchronizer method, specifying the “old” Synchronizer as “source” and the “new” Synchronizer as “target”. Additionally, set the “force” flag to true, given that the old Synchronizer is either lost or untrusted in this scenario.
participants.all.foreach { participant =>
  participant.repair.migrate_synchronizer(
    source = oldSynchronizer,
    target = newSynchronizerConfig,
    force = true,
  )
}
  4. The Participant Nodes connect to the new Synchronizer.
  5. The Synchronizer loss may have resulted in active contract set inconsistencies. In such cases the Participant Node administrators need to agree on whether contracts in an inconsistent state on the different Participant Nodes should be removed or added. Refer to troubleshooting ACS commitments to identify and repair ACS differences. As these methods are powerful but dangerous, you should not attempt to repair your Participant Nodes on your own, as you risk severe data corruption; do so only in the context of technical support.

Fully rehydrating a Participant Node

Fully rehydrating a Participant Node means recovering it after its database has been fully emptied or lost. If a Participant Node needs to be rehydrated but no backup is available, or the backup is faulty, you may be able to rehydrate it from a Synchronizer, as long as the Synchronizer has never been pruned and you have local console access to the Participant Node. You can preserve the Participant Node’s identity and secret keys. If you are running your production Participant Node in a container, you need to create a new configuration file that allows you to access the Participant Node’s database from an interactive console. Make sure that the Participant Node process is stopped and that nothing else is accessing the same database. Ensure that database replication is turned on if the Participant Node has been configured for high availability. Also, make sure that the Participant Node is configured not to perform auto-initialization, which would create a new identity, by disabling the relevant init configuration options:
canton.participants.participant1.init = {
    generate-topology-transactions-and-keys = false
    identity.type = manual
}
Then start Canton interactively, using the --manual-start option to prevent the Participant Node from reconnecting to the Synchronizers:
./bin/canton -c myconfig --manual-start
Then, download the identity state of the Participant Node to a directory on the machine you are running the process:
repair.identity.download(
  participant1,
  synchronizerId,
  testedProtocolVersion,
  tempDirParticipant,
)
repair.dars.download(participant1, tempDirParticipant)
participant1.stop()
This stores the topology state, the identity, and, if the Participant Node is not using a Key Management System, the secret keys on the disk in the specified directory. The dars.download command is a convenience command to download all Daml archive files (DARs) that have been added to the participant via the console command participant.dars.upload. DARs that were uploaded through the Ledger API need to be manually re-uploaded to the new Participant Node. After downloading, stop the Participant Node, back up the database, and then truncate the database. Then, restart the Participant Node and upload the data again:
repair.identity.upload(
  participant2,
  tempDirParticipant,
  synchronizerId,
  EnvironmentDefinition.defaultStaticSynchronizerParameters,
  sequencerConnections,
)
repair.dars.upload(participant2, tempDirParticipant)
Reconnect the Participant Node to the Synchronizer using a normal connect:
participant2.synchronizers.connect_by_config(
  SynchronizerConnectionConfig(
    synchronizerAlias = daName,
    sequencerConnections = sequencerConnections,
    initializeFromTrustedSynchronizer = true,
  )
)
Note that this replays all transactions from the Synchronizer. However, command deduplication is only fully functional once the Participant Node catches up with the Synchronizer. Therefore, you need to ensure that applications relying on command deduplication do not submit commands during recovery.

Unblocking a Participant Node Synchronizer connection

If a Participant Node is unable to process events from a Synchronizer, and the Participant Node process continuously crashes when reconnecting to the Synchronizer, it may be necessary to “ignore” the problematic event. This procedure describes how to unblock the Participant Node.
  1. Ensure that the cause is not a more common issue such as lack of database connectivity. Only proceed with the following steps if the Participant Node logs rule out more common issues. Inspect the Participant Node logs for “internal” errors referring to the Synchronizer ID. If the logs show an ApplicationHandlerException, note the first and last “sequencing timestamps”.
  2. Disconnect the Participant Node from the Synchronizer whose events you want to ignore, and restart the Participant Node after enabling the repair commands and internal state inspection.
    canton.features.enable-repair-commands = true
    canton.features.enable-testing-commands = true
    
    Make sure that the Participant Node is disconnected from the Synchronizer.
  3. Determine the “Sequencer counters” to ignore.
    • If in step 1 you identified the first and last sequencing timestamps, translate the timestamps to first and last sequencer counters:
      import com.digitalasset.canton.store.SequencedEventStore.ByTimestamp
      val sequencedEvent = participant.testing.state_inspection
        .findMessage(synchronizerId, ByTimestamp(sequencedEventTimestamp))
        .getOrElse(throw new Exception("Sequenced event not found"))
      val sequencerCounter = sequencedEvent.counter
      
    • Otherwise you may choose to ignore the last event that the Participant Node has processed from the Synchronizer and look up the sequencer counter as follows:
      import com.digitalasset.canton.store.SequencedEventStore.SearchCriterion
      val lastSequencedEvent = participant.testing.state_inspection
        .findMessage(synchronizerId, SearchCriterion.Latest)
        .getOrElse(throw new Exception("Sequenced event not found"))
      val lastCounter = lastSequencedEvent.counter
      
  4. Use the repair.ignore_events command to ignore the sequenced events that the Participant Node is unable to process. The command takes the Synchronizer ID and the first and last sequencer counters of the events to ignore as parameters. You should not set the force flag.
    participant.repair.ignore_events(
      synchronizerId,
      fromInclusive,
      toInclusive,
      force = false,
    )
    
    If you find that you have made a mistake and chose the wrong sequencer counters, you can invoke repair.unignore_events to “unignore” the events:
    participant.repair.unignore_events(
      synchronizerId,
      fromInclusive,
      toInclusive,
      force = false,
    )
    
  5. Next, reconnect the Participant Node to the Synchronizer and check that the Participant Node is able to process sequenced events again. If the Participant Node is still blocked, you may need to repeat the previous steps.
  6. Once the Participant Node is again consuming events from the Synchronizer, the Participant Nodes may have an inconsistent ACS. Look for errors in the log that begin with “ACS_COMMITMENT”, for example “ACS_COMMITMENT_MISMATCH”. Inspect not only the repaired Participant Node, but also the other Participant Nodes that were involved in the transactions behind the ignored events.
  7. If there are ACS inconsistencies, refer to troubleshooting ACS commitments to identify and repair ACS differences.
  8. Disable repair commands and internal state inspection by removing the configurations added in step 2 and restarting the Participant Node. This is important to prevent accidental misuse of the repair commands in the future.
As the above steps are powerful but dangerous, you should perform the procedure in the context of technical support.