
Every Canton Network node exposes metrics on port 10013 at the /metrics path in Prometheus format. These metrics are built with OpenTelemetry and cover the participant, validator app, and (for super validators) the SV and scan apps.
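
For a quick spot check, you can fetch the endpoint directly from a machine or container that can reach the node. A minimal sketch, assuming plain HTTP and local access:

# Print the first few Prometheus-format metric lines
curl -s http://localhost:10013/metrics | head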

Enabling metrics

Kubernetes (Helm)

Set metrics.enable to true in your Helm values. This creates a ServiceMonitor custom resource, which requires the Prometheus Operator to be installed in your cluster. Alternatively, add Prometheus scrape annotations targeting port 10013.
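
A minimal values-file sketch. The exact key layout depends on your chart version, and podAnnotations is a hypothetical key here; the prometheus.io/* annotations are the common community convention, not a Prometheus requirement:

metrics:
  enable: true   # creates a ServiceMonitor; requires the Prometheus Operator

# Alternative without the operator: scrape annotations on the pods
# (podAnnotations is illustrative; check your chart's values schema)
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "10013"
  prometheus.io/path: "/metrics"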

Docker Compose

Histogram format

Nodes publish Prometheus native histograms by default. To revert to regular (classic) histograms on a specific node, set the environment variable:
ADDITIONAL_CONFIG_DISABLE_NATIVE_HISTOGRAMS="canton.monitoring.metrics.histograms=[]"
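
In a Docker Compose deployment, this could look as follows. The service name is hypothetical; use the one from your compose file:

services:
  participant:   # hypothetical service name
    environment:
      # Reverts this node to classic (non-native) histograms
      - ADDITIONAL_CONFIG_DISABLE_NATIVE_HISTOGRAMS=canton.monitoring.metrics.histograms=[]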

Health metrics

Validator Health

You can check your validator’s health using the readiness endpoints. All CN applications provide the /readyz and /livez endpoints, which are used for readiness and liveness probes.
  • Checking readiness
    • In Kubernetes: readiness and liveness probes are already configured. You can also manually check validator readiness with the following command:
      kubectl exec <pod-name> -n <namespace> -- curl -v https://localhost:5003/api/validator/readyz
      
    • In Docker: run for example this command to check validator liveness inside a container:
      docker exec <container-name> curl -v https://localhost:5003/api/validator/livez
      
    In both cases, expect HTTP status code 200 if the validator is ready and live.
  • Using metrics: The splice_store_last_ingested_record_time_ms metric represents the last ingested record time in each validator store. It can be used to track the general activity of the node:
    • If this value continues to increase over time, your node is active and staying in sync with the network. Note that it only advances if your node actually ingests new transactions. A validator collecting validator liveness rewards ingests new transactions every round, so the lag should never exceed 20 minutes.
    • If it remains static, further investigation may be required.
    For more details, and to visualize this metric on its dedicated dashboard (Splice Store Last Ingested Record Time), refer to the Metrics documentation.

Splice app metrics

Topology metrics (optional)

This section will be expanded in a future update. Topology metrics track synchronizer membership changes and party-to-participant mappings. For monitoring guidance, see Performance Optimization.

Key participant metrics

These are the most operationally significant metrics from the participant node. For the full catalog of several hundred metrics, see the Canton 3.x metrics reference.

Sequencer client

The sequencer client metrics tell you whether your node is keeping up with the synchronizer’s message stream.
| Metric | Type | What to watch for |
| --- | --- | --- |
| daml.sequencer-client.handler.delay | gauge | Event processing delay in milliseconds. A large, growing value means the node is falling behind. Cross-reference with clock skew before assuming a processing bottleneck. |
| daml.sequencer-client.handler.sequencer-events | counter | Number of events received from the sequencer. Tracks overall event throughput. |
| daml.sequencer-client.handler.actual-in-flight-event-batches | counter | How many event batches are being processed in parallel. If this constantly sits near max-in-flight-event-batches, the node’s resources may be under-utilized (raise the limit). If you see OOM errors, lower it. |
| daml.sequencer-client.submissions.dropped | counter | Send requests that were not sequenced within the max-sequencing-time. An increasing count points to sequencer capacity issues or network problems. |
| daml.sequencer-client.submissions.overloaded | counter | Requests that received an overloaded response from the sequencer. |
| daml.sequencer-client.submissions.sequencing | timer | End-to-end time from submission to sequencing confirmation. |
| daml.sequencer-client.submissions.in-flight | counter | Submissions waiting for an outcome. High values indicate backpressure. |

Connection pool

These metrics track the health of connections between your node and the synchronizer’s sequencers.
| Metric | Type | What to watch for |
| --- | --- | --- |
| daml.sequencer-client.sequencer-connection-pool.active-subscriptions | gauge | Number of active sequencer subscriptions. Should stay at or above the subscription threshold. |
| daml.sequencer-client.sequencer-connection-pool.validated-connections | gauge | Connections that are up and validated. A drop signals connectivity problems. |
| daml.sequencer-client.sequencer-connection-pool.trust-threshold | gauge | The minimum number of consistent sequencer connections needed before the pool will initialize. |

ACS commitments

Commitment metrics reveal whether your participant is staying in sync with counter-participants during reconciliation.
| Metric | Type | What to watch for |
| --- | --- | --- |
| daml.participant.sync.commitments.compute | timer | Time spent computing bilateral commitments. If this approaches or exceeds the reconciliation interval, the participant will perpetually lag behind. |
| daml.participant.sync.commitments.sequencing-time | gauge | Time between the end of a commitment period and when the sequencer observes the commitment. An unexplained increase may indicate the participant is falling behind. |
| daml.participant.sync.commitments.catch-up-mode-triggered | meter | How often catch-up mode has been activated. A healthy value is 0. An increasing count signals intermittent performance degradation. |

Ledger API command processing

| Metric | Type | What to watch for |
| --- | --- | --- |
| daml.participant.api.commands.submissions | timer | Time to validate and interpret a command before it is sent for finalization. |
| daml.participant.api.commands.submissions_running | counter | Commands currently being processed. Indicates load on the Ledger API server. |
| daml.participant.api.commands.failed_command_interpretations | meter | Commands rejected by the Daml interpreter (for example, unauthorized actions). |
| daml.grpc.server.requests.rejections | counter | Requests rejected due to active request limits. Sustained rejections mean you need to raise limits or reduce load. |

Database

Database metrics share a common pattern across all node types. The general pool handles reads; the write pool handles writes.
| Metric | Type | What to watch for |
| --- | --- | --- |
| daml.db-storage.{general,write}.executor.load | gauge | Current queries running divided by available connections. A sustained value near 1.0 means the pool is saturated. |
| daml.db-storage.{general,write}.executor.queued | counter | Tasks waiting for a database connection. A growing queue indicates the database cannot keep up. |
| daml.db-storage.{general,write}.executor.waittime | timer | Time tasks spend waiting in the queue before execution. |
| daml.db-storage.{general,write}.executor.exectime | timer | Time tasks spend executing on the database. |

Pruning

| Metric | Type | What to watch for |
| --- | --- | --- |
| daml.pruning.max-event-age | gauge | Age of the oldest unpruned event in hours. A large or growing value means pruning is falling behind, which increases storage consumption. |
| daml_services_pruning_prune_started_total | counter | Number of pruning processes started. |
| daml_services_pruning_prune_completed_total | counter | Number of pruning processes completed. Compare with the started count to detect stalled pruning. |
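
One way to surface stalled pruning in PromQL, a minimal sketch assuming both counters come from the same scrape target:

# Pruning runs that started but have not (yet) completed;
# a persistently growing value suggests pruning is stalling.
daml_services_pruning_prune_started_total - daml_services_pruning_prune_completed_total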

Traffic control

These metrics are relevant if your node participates in traffic-based rate limiting.
| Metric | Type | What to watch for |
| --- | --- | --- |
| daml.sequencer-client.traffic-control.event-delivered | counter | Events that were sequenced and delivered. |
| daml.sequencer-client.traffic-control.event-rejected | counter | Events sequenced but not delivered (insufficient traffic credits). |
| daml.sequencer-client.traffic-control.submitted-event-cost | meter | Cost of events submitted. May not exactly match actual consumed traffic since some events may not be sequenced. |

JVM metrics

Grafana dashboards

The dashboards use queries specific to Prometheus native histograms, so make sure native histogram support is enabled in your Prometheus instance.
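
If you operate Prometheus yourself, native histograms are gated behind a feature flag in current Prometheus releases. A sketch of the launch command, assuming a standard standalone deployment:

# Enables native histograms (switches scraping to the protobuf format)
prometheus --config.file=prometheus.yml --enable-feature=native-histograms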

Alerting recommendations

The following thresholds are starting points. Adjust them based on your environment and workload patterns. A sample Prometheus rule file covering a few of them follows the list.
  • Node health: Alert when daml_health_status equals 0 for any component for more than 2 minutes.
  • Sequencer delay: Alert when daml.sequencer-client.handler.delay exceeds 30 seconds and is increasing.
  • Dropped submissions: Alert on any sustained increase in daml.sequencer-client.submissions.dropped.
  • Overloaded responses: Alert on any increase in daml.sequencer-client.submissions.overloaded.
  • Database pool saturation: Alert when daml.db-storage.*.executor.load exceeds 0.85 for more than 5 minutes.
  • Pruning backlog: Alert when daml.pruning.max-event-age exceeds your retention policy threshold.
  • Store ingestion lag: Alert when splice_store_last_ingested_record_time_ms stops advancing for more than 20 minutes.
  • JVM memory: Alert when runtime_jvm_memory_area with area=heap and type=used exceeds 85% of type=max over a sustained period.
  • Commitment catch-up: Alert when daml.participant.sync.commitments.catch-up-mode-triggered is non-zero.
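
A sketch of a Prometheus alerting rules file covering three of these thresholds. The alert names, severities, the 10-minute window, and the assumption that splice_store_last_ingested_record_time_ms is a gauge are illustrative; verify the exact names and labels your nodes export:

groups:
  - name: canton-node-alerts
    rules:
      # Node health: a component reports unhealthy for more than 2 minutes.
      - alert: CantonComponentUnhealthy
        expr: daml_health_status == 0
        for: 2m
        labels:
          severity: critical

      # Store ingestion lag: the last ingested record time has not advanced
      # in 20 minutes (assumes the metric is exported as a gauge).
      - alert: SpliceStoreIngestionStalled
        expr: delta(splice_store_last_ingested_record_time_ms[20m]) == 0
        labels:
          severity: warning

      # JVM heap: used heap above 85% of max for a sustained period.
      - alert: JvmHeapHighUsage
        expr: |
          runtime_jvm_memory_area{area="heap", type="used"}
            > ignoring(type) (0.85 * runtime_jvm_memory_area{area="heap", type="max"})
        for: 10m
        labels:
          severity: warning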