There are three ways to recover from disasters, depending on the scope and severity of the failure:
- Single node affected, network healthy — Restore the node from backup.
- Full backup unavailable but identities backup exists — Re-onboard the SV and recover Amulets through a standalone validator.
- Network-wide failure — Follow the CometBFT disaster recovery procedure across all SV operators.
Recovery of assets requires at least one of the following: a recent database backup, an up-to-date identities backup, or an external KMS that still retains the participant keys.
Restoring from data corruption
While all components in the system are designed to be crash-fault-tolerant, there is always a chance of failures from which the system cannot immediately recover, e.g. due to misconfiguration or bugs. In such cases, you will need to restore a component, a full SV node, or even the whole network from backups or from dumps taken from components that are still operational.
Restoring a full SV node from backups
The backup of the apps database instance must be taken at a time strictly earlier than that of the participant backup. This ordering is required for the set of backup snapshots to be consistent. Given a consistent set of backups, follow these steps:
- Scale down all components in the SV node to 0 replicas (replace `-0` with the correct migration ID if a migration has already been performed); see the sketch after this list.
- Restore the storage and databases of all components from the backups. The exact process depends on your storage and database setup.
- Scale up all components back to 1 replica (see the sketch after this list).
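The exact commands depend on how the node is deployed. As a rough sketch, assuming the SV node components run as Kubernetes deployments in a namespace called `sv` (the namespace, the use of `--all`, and the single-replica counts are assumptions; adjust them, and the migration ID suffix in the deployment names, to your setup):

```bash
# Assumed namespace "sv"; list the deployments first and adjust as needed.
kubectl -n sv get deployments

# Scale everything down to 0 replicas before touching storage and databases.
kubectl -n sv scale deployment --all --replicas=0

# ... restore storage and databases from the consistent backup set ...

# Scale everything back up to 1 replica once the restore has completed.
kubectl -n sv scale deployment --all --replicas=1
```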
Re-onboard an SV and recover Amulets with a validator
In the case of a catastrophic failure of the SV node, the amulets owned by the SV can be recovered by deploying a standalone validator node with control over the SV's participant keys. The SV node can then be re-onboarded with a new identity, and the amulets can be transferred from the validator to the new SV node.
Prerequisites
You need the backup of your SV node identities (see the backup procedures documentation). From that backup, extract the participant identities, for example as sketched below.
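A minimal sketch of that extraction, assuming the identities backup is a single JSON file (called `sv_identities_backup.json` here, a hypothetical name) with a top-level `participant` section; both the file name and the structure are assumptions, so check your actual backup before running it:

```bash
# Extract the participant identities into the file used in the recovery steps below.
# "sv_identities_backup.json" and the ".participant" path are assumptions.
jq '.participant' sv_identities_backup.json > dump_identities.json
```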
Recovery steps
- Wait for the failed SV node to be offboarded by a majority of SVs through a governance vote on an `OffboardMember` action.
- Deploy a standalone validator node following the standard validator installation steps. When doing so:
  - Set `validatorPartyHint` to the name you chose when creating the SV identity.
  - Restore the validator with the identities from the backup, using the `dump_identities.json` file prepared above.
- Log in to the wallet at `https://wallet.validator.YOUR_HOSTNAME` with the validator user account. Confirm that the wallet balance matches what the original SV owned.
- Deploy and onboard a fresh SV node (reusing your SV identity but otherwise starting from a clean slate).
- Log in to the wallet of the new SV node and copy the new party ID.
- In the validator wallet, create a transfer offer sending the amulets to the new SV node using the copied party ID.
- In the new SV node wallet, accept the transfer offer and verify that the amulets arrived as expected.
Disaster recovery from loss of CometBFT layer
This procedure applies when the entire CometBFT layer of the network is lost beyond repair. It follows a process similar to migration dumps but is unscheduled, and the recovery timestamp will likely be earlier than the time of the incident. Data loss between the recovery timestamp and the incident is expected.
High-level steps
- All SVs agree on a timestamp from which they will recover.
- Each SV operator fetches a data dump from their SV app for that timestamp.
- Each SV operator creates a migration dump file combining the data dump with their backed-up identities.
- SV operators deploy a new synchronizer.
- Each SV operator copies the migration dump file to their SV app’s PVC and restarts the app to migrate the data (a sketch of this step follows the list).
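As an illustration of that last step, a sketch assuming the SV app runs in a Kubernetes namespace `sv`, its pods carry an `app=sv-app` label, and the dump PVC is mounted at `/domain-upgrade-dump`; every one of these names, including the target file name, is an assumption to be adapted to your deployment:

```bash
# Namespace, label, mount path, and file names below are all assumptions.
SV_APP_POD=$(kubectl -n sv get pods -l app=sv-app -o jsonpath='{.items[0].metadata.name}')
kubectl -n sv cp migration_dump.json "$SV_APP_POD:/domain-upgrade-dump/migration_dump.json"
kubectl -n sv rollout restart deployment sv-app
```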
Finding a consistent recovery timestamp
Unlike the migration process, in a disaster the synchronizer has not been paused in an orderly manner, so you cannot assume that all SVs are caught up to the same point. The SV operators therefore need to agree on a timestamp that many of the SV nodes have reached. Search your participant log files for `CommitmentPeriod` to identify the periods for which your participant has committed to the Active Contract Set (ACS), and look for “Commitment correct for sender” messages.
The `toInclusive` time in those messages indicates when your participant agreed with another node on the ACS state. SVs should find a period for which most of them have mutually committed.
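A quick way to pull these messages out of the logs, assuming plain-text participant log files (the log path and format are assumptions; adjust them to your logging setup):

```bash
# Show the most recent ACS commitment confirmations from the participant log.
grep "Commitment correct for sender" /var/log/participant/participant.log | tail -n 20
```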
All transactions after the chosen timestamp will be lost. Any SV that has not committed to the chosen timestamp must either be re-onboarded or copy a data dump from another SV that has reached that timestamp.
It is better to go a bit further back in time than the last agreed-upon ACS commitment — roughly 15 minutes is a reasonable margin, since most validators have at least two transactions per round.
Creating a migration dump
Fetch the data dump from the SV app, replacing `<token>` with an OAuth2 Bearer token and `<timestamp>` with the agreed-upon timestamp in ISO format (e.g., 2024-04-17T19:12:02Z).
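A sketch of what the call might look like, assuming the SV app is reachable at `https://sv.sv.YOUR_HOSTNAME` and exposes a data-snapshot endpoint under `/api/sv/v0/admin/domain/data-snapshot`; both the hostname and the endpoint path are assumptions, so confirm them against your SV app's API reference:

```bash
# Hostname and endpoint path are assumptions; the token and timestamp follow
# the conventions described above.
curl -sSf "https://sv.sv.YOUR_HOSTNAME/api/sv/v0/admin/domain/data-snapshot?timestamp=2024-04-17T19:12:02Z" \
  -H "Authorization: Bearer <token>" \
  -o data_dump.json
```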
Both the participant and sequencer must still be running for this call to succeed. If the call fails with a 400 error, your participant has been pruned beyond the chosen timestamp. If it fails with a 429, the timestamp is too late for your participant.
Merge the identities dump with the data dump:
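A minimal sketch of the merge, assuming the migration dump is a single JSON object carrying the identities and the data dump as top-level fields; the field names (`identities`, `dataSnapshot`) and file names used here are assumptions, so check the dump format your SV app version expects:

```bash
# Combine the backed-up identities and the fetched data dump into one file.
# Field names and file names are assumptions and must match your SV app's format.
jq -s '{ identities: .[0], dataSnapshot: .[1] }' \
  sv_identities_backup.json data_dump.json > migration_dump.json
```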