# xrpld Telemetry Operator Runbook

## Overview

xrpld supports OpenTelemetry distributed tracing to provide visibility into RPC requests, transaction processing, and consensus rounds.

## Quick Start

### 1. Start the observability stack

```bash
docker compose -f docker/telemetry/docker-compose.yml up -d
```

This starts:

- **OTel Collector** on ports 4317 (gRPC) and 4318 (HTTP)
- **Jaeger** UI on http://localhost:16686
- **Prometheus** on http://localhost:9090
- **Grafana** on http://localhost:3000

### 2. Enable telemetry in xrpld

Add to your `xrpld.cfg`:

```ini
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
```

### 3. Build with telemetry support

```bash
conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default
```

## Configuration Reference

| Option | Default | Description |
| --- | --- | --- |
| `enabled` | `0` | Master switch for telemetry |
| `endpoint` | `http://localhost:4318/v1/traces` | OTLP/HTTP endpoint |
| `service_name` | `xrpld` | OpenTelemetry service name resource attribute |
| `service_instance_id` | node public key | OpenTelemetry service instance ID resource attribute |
| `sampling_ratio` | `1.0` | Head-based sampling ratio (0.0 to 1.0) |
| `trace_rpc` | `1` | Enable RPC request tracing |
| `trace_transactions` | `1` | Enable transaction tracing |
| `trace_consensus` | `1` | Enable consensus tracing |
| `trace_peer` | `0` | Enable peer message tracing (high volume) |
| `trace_ledger` | `1` | Enable ledger tracing |
| `consensus_trace_strategy` | `deterministic` | Consensus trace ID strategy (`deterministic` or `random`) |
| `batch_size` | `512` | Max spans per batch export |
| `batch_delay_ms` | `5000` | Delay between batch exports, in milliseconds |
| `max_queue_size` | `2048` | Max spans queued before dropping |
| `use_tls` | `0` | Use TLS for exporter connection |
| `tls_ca_cert` | (empty) | Path to CA certificate bundle |

## Span Reference

All spans instrumented in xrpld, grouped by subsystem:

### RPC Spans (Phase 2)

| Span Name | Source File | Attributes | Description |
| --- | --- | --- | --- |
| `rpc.http_request` | ServerHandler.cpp | -- | Top-level HTTP RPC request |
| `rpc.ws_upgrade` | ServerHandler.cpp | -- | WebSocket upgrade handshake |
| `rpc.ws_message` | ServerHandler.cpp | -- | WebSocket RPC message |
| `rpc.process` | ServerHandler.cpp | -- | RPC processing (child of `rpc.http_request` or `rpc.ws_message`) |
| `rpc.command.` | RPCHandler.cpp | `command`, `version`, `rpc_role` | Per-command span (e.g., `rpc.command.server_info`) |

### Transaction Spans (Phase 3)

| Span Name | Source File | Attributes | Description |
| --- | --- | --- | --- |
| `tx.process` | NetworkOPs.cpp | `xrpl.tx.hash`, `local`, `path` | Transaction submission and processing |
| `tx.receive` | PeerImp.cpp | `xrpl.peer.id`, `xrpl.tx.hash`, `peer_version`, `suppressed`, `tx_status` | Transaction received from peer relay |

### Transaction Queue Spans (Phase 3)

| Span Name | Source File | Attributes | Description |
| --- | --- | --- | --- |
| `txq.enqueue` | TxQ.cpp | `xrpl.tx.hash` | Transaction enqueue decision (child of `tx.process`) |
| `txq.apply_direct` | TxQ.cpp | -- | Direct apply attempt (bypassing queue) |
| `txq.batch_clear` | TxQ.cpp | -- | Batch clear of queued transactions for an account |
| `txq.accept` | TxQ.cpp | `queue_size` | Ledger-close accept loop over queued transactions |
| `txq.accept_tx` | TxQ.cpp | `xrpl.tx.hash`, `retries_remaining`, `ter_code` | Per-transaction apply during accept |
| `txq.cleanup` | TxQ.cpp | `xrpl.ledger.seq` | Post-close cleanup of expired queue entries |

### Consensus Spans (Phase 4)

| Span Name | Source File | Attributes | Description |
| --- | --- | --- | --- |
| `consensus.round` | RCLConsensus.cpp | `xrpl.consensus.ledger_id`, `xrpl.ledger.seq`, `xrpl.consensus.mode`, `trace_strategy`, `xrpl.consensus.round_id` | Root span for a consensus round (deterministic or random trace ID) |
| `consensus.phase.open` | Consensus.h | -- | Open phase duration (child of round) |
| `consensus.proposal.send` | RCLConsensus.cpp | `xrpl.consensus.round` | Consensus proposal broadcast |
| `consensus.ledger_close` | RCLConsensus.cpp | `xrpl.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
| `consensus.establish` | Consensus.h | `converge_percent`, `establish_count`, `proposers` | Establish phase duration (child of round) |
| `consensus.update_positions` | Consensus.h | `converge_percent`, `proposers`, `disputes_count` | Position update and dispute resolution (see Events below) |
| `consensus.check` | Consensus.h | `agree_count`, `disagree_count`, `converge_percent`, `have_close_time_consensus`, `threshold_percent`, `consensus_result` | Consensus threshold check |
| `consensus.accept` | RCLConsensus.cpp | `proposers`, `round_time_ms`, `quorum` | Ledger accepted by consensus |
| `consensus.accept.apply` | RCLConsensus.cpp | `xrpl.ledger.seq`, `close_time`, `close_time_correct`, `close_resolution_ms`, `consensus_state`, `proposing`, `round_time_ms`, `parent_close_time`, `close_time_self`, `close_time_vote_bins`, `resolution_direction`, `tx_count` | Ledger application with close time details (see Events below) |
| `consensus.validation.send` | RCLConsensus.cpp | `xrpl.ledger.seq`, `proposing` | Validation sent after accept (follows-from link) |
| `consensus.mode_change` | RCLConsensus.cpp | `mode_old`, `mode_new` | Consensus mode transition |
| `consensus.proposal.receive` | PeerImp.cpp | `trusted`, `xrpl.consensus.round` | Proposal received from peer (extracts parent context from TraceContext when present; falls back to standalone span for older peers) |
| `consensus.validation.receive` | PeerImp.cpp | `trusted`, `xrpl.ledger.seq` | Validation received from peer (extracts parent context from TraceContext when present; falls back to standalone span for older peers) |

#### Consensus Span Events

| Parent Span | Event Name | Event Attributes | Description |
| --- | --- | --- | --- |
| `consensus.update_positions` | `dispute.resolve` | `xrpl.tx.id`, `dispute_our_vote`, `dispute_yays`, `dispute_nays` | Emitted per dispute when votes are tallied |
| `consensus.accept.apply` | `tx.included` | `xrpl.tx.id` | Emitted per transaction included in the accepted ledger |

#### Close Time Queries (Tempo TraceQL)

```
# Find rounds where validators disagreed on close time
{ name = "consensus.accept.apply" && span.close_time_correct = false }

# Find consensus failures (moved_on)
{ name = "consensus.accept.apply" && span.consensus_state = "moved_on" }

# Find slow ledger applications (>5s)
{ name = "consensus.accept.apply" && duration > 5s }

# Find specific ledger's consensus details
{ name = "consensus.accept.apply" && span.xrpl.ledger.seq = 92345678 }

# Find all spans in a consensus round (deterministic trace strategy)
{ name = "consensus.round" && span.xrpl.consensus.round_id = "" }

# Find dispute resolutions
{ name = "consensus.update_positions" && event:name = "dispute.resolve" }
```

## Cross-Node Trace Propagation

xrpld propagates trace context
across nodes via protobuf `TraceContext` fields embedded in peer-to-peer messages. When Node A sends a transaction, proposal, or validation, it injects its active span's trace/span IDs into the protobuf message. Node B extracts that context on receipt and creates a child span, linking the two nodes into a single distributed trace.

### How It Works

```
Node A (sender)                          Node B (receiver)
+-----------------------------+          +-------------------------------+
| tx.process / consensus.*    |          | PeerImp::onMessage()          |
|          |                  |          |          |                    |
|          v                  |          |          v                    |
| SpanGuard::getTraceBytes()  |          | extract TraceContext from     |
|          |                  |          | protobuf message              |
|          v                  |   send   |          |                    |
| injectSpanContext() --------|--------->|          v                    |
| sets TraceContext fields    |  proto   | txReceiveSpan()               |
| (trace_id, span_id, flags)  |   msg    | proposalReceiveSpan()         |
+-----------------------------+          | validationReceiveSpan()       |
                                         |          |                    |
                                         |          v                    |
                                         | child span with parent link   |
                                         +-------------------------------+
```

### Send-Side Injection

| Message Type | Injection Point | Mechanism |
| --- | --- | --- |
| TMTransaction | `NetworkOPs::apply()` | Injects `tx.process` span into relay msg |
| TMProposeSet | `RCLConsensus::propose()` | Injects active context into proposal msg |
| TMValidation | `RCLConsensus::validate()` | Injects active context into validation msg |

### Receive-Side Extraction

| Message Type | Extraction Point | Helper Function |
| --- | --- | --- |
| TMTransaction | `PeerImp::onMessage(TMTransaction)` | `TxTracing::txReceiveSpan()` |
| TMProposeSet | `PeerImp::onMessage(TMProposeSet)` | `ConsensusReceiveTracing::proposalReceiveSpan()` |
| TMValidation | `PeerImp::onMessage(TMValidation)` | `ConsensusReceiveTracing::validationReceiveSpan()` |

### Key Files

| File | Role |
| --- | --- |
| `src/xrpld/telemetry/PropagationHelpers.h` | `injectSpanContext()`: SpanGuard to protobuf |
| `include/xrpl/telemetry/TraceContextPropagator.h` | OTel context <-> protobuf conversion primitives |
| `src/xrpld/telemetry/ConsensusReceiveTracing.h` | Proposal/validation receive span factories |
| `src/xrpld/telemetry/TxTracing.h` | Transaction receive span factory |

### Backwards Compatibility

Older peers that do not populate `TraceContext` fields in their messages produce empty trace bytes on the receive side. The extraction helpers detect this and create standalone (root) spans instead of child spans. No errors are logged and no data is lost: the receive span is still created with all its normal attributes and only lacks a cross-node parent link.

### Example Tempo Queries

```
# Find cross-node transaction traces (tx.process -> tx.receive across nodes)
{ name = "tx.receive" && status != error }

# Find proposals received with cross-node parent context
{ name = "consensus.proposal.receive" && nestedSetParent > 0 }

# Trace a transaction across the network by its hash
{ name =~ "tx\\..*" && span.xrpl.tx.hash = "" }

# Find all spans in a cross-node consensus trace
{ rootServiceName = "xrpld" && span.xrpl.consensus.round_id = "" }

# Compare latency between sender and receiver for validations
{ name = "consensus.validation.send" || name = "consensus.validation.receive" }
```

## Prometheus Metrics (Spanmetrics)

The OTel Collector's spanmetrics connector automatically derives RED (Rate, Errors, Duration) metrics from every span. No custom metrics code is needed in xrpld.
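
As a rough illustration of where these metrics come from, the fragment below sketches how a spanmetrics connector is typically wired into a collector config. It is a sketch under assumptions, not the contents of the shipped `otel-collector-config.yaml`: the `otlp` receiver and `prometheus` exporter names, the `namespace` value, and the single `xrpl.consensus.mode` dimension are illustrative choices, and the receiver and exporter definitions themselves are omitted.

```yaml
# Hypothetical excerpt; the shipped otel-collector-config.yaml may differ.
connectors:
  spanmetrics:
    # Assumed namespace; with the Prometheus exporter this yields
    # metric names prefixed with traces_span_metrics_.
    namespace: traces.span.metrics
    histogram:
      explicit:
        # Latency bucket boundaries for the duration histogram.
        buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
    dimensions:
      # Extra span attribute promoted to a metric label
      # (dots become underscores on the Prometheus side).
      - name: xrpl.consensus.mode

service:
  pipelines:
    traces:
      receivers: [otlp]
      # The connector acts as an exporter on the traces pipeline;
      # a trace backend exporter would normally be listed here too.
      exporters: [spanmetrics]
    metrics:
      # The same connector acts as a receiver on the metrics pipeline.
      receivers: [spanmetrics]
      exporters: [prometheus]
```

The connector sits between the two pipelines, which is why every span that reaches the collector contributes to the call counter and duration histogram without any change to xrpld itself.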

### Generated Metric Names

| Prometheus Metric | Type | Description |
| --- | --- | --- |
| `traces_span_metrics_calls_total` | Counter | Total span invocations |
| `traces_span_metrics_duration_milliseconds_bucket` | Histogram | Latency distribution buckets |
| `traces_span_metrics_duration_milliseconds_count` | Histogram | Latency observation count |
| `traces_span_metrics_duration_milliseconds_sum` | Histogram | Cumulative latency |

### Metric Labels

Every metric carries these standard labels:

| Label | Source | Example |
| --- | --- | --- |
| `span_name` | Span name | `rpc.command.server_info` |
| `status_code` | Span status | `STATUS_CODE_UNSET`, `STATUS_CODE_ERROR` |
| `service_name` | Resource attribute | `xrpld` |
| `span_kind` | Span kind | `SPAN_KIND_INTERNAL` |

Additionally, span attributes configured as dimensions in the collector become metric labels (dots → underscores):

| Span Attribute | Metric Label | Applies To |
| --- | --- | --- |
| `command` | `xrpl_rpc_command` | `rpc.command.*` spans |
| `rpc_status` | `xrpl_rpc_status` | `rpc.command.*` spans |
| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` spans |
| `local` | `xrpl_tx_local` | `tx.process` spans |

### Histogram Buckets

Configured in `otel-collector-config.yaml`:

```
1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s
```

## Grafana Dashboards

Three dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:

### RPC Performance (`xrpld-rpc-perf`)

| Panel | Type | PromQL | Labels Used |
| --- | --- | --- | --- |
| RPC Request Rate by Command | timeseries | `sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))` | `xrpl_rpc_command` |
| RPC Latency p95 by Command | timeseries | `histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))` | `xrpl_rpc_command` |
| RPC Error Rate | bargauge | Error spans / total spans × 100, grouped by `xrpl_rpc_command` | `xrpl_rpc_command`, `status_code` |
| RPC Latency Heatmap | heatmap | `sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])) by (le)` | `le` (bucket boundaries) |

### Transaction Overview (`xrpld-transactions`)

| Panel | Type | PromQL | Labels Used |
| --- | --- | --- | --- |
| Transaction Processing Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m])` and `tx.receive` | `span_name` |
| Transaction Processing Latency | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="tx.process"})` | -- |
| Transaction Path Distribution | piechart | `sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))` | `xrpl_tx_local` |
| Transaction Receive vs Suppressed | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.receive"}[5m])` | -- |

### Consensus Health (`xrpld-consensus`)

| Panel | Type | PromQL | Labels Used |
| --- | --- | --- | --- |
| Consensus Round Duration | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept"})` | -- |
| Consensus Proposals Sent Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.proposal.send"}[5m])` | -- |
| Ledger Close Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.ledger_close"})` | -- |
| Validation Send Rate | stat | `rate(traces_span_metrics_calls_total{span_name="consensus.validation.send"}[5m])` | -- |
| Ledger Apply Duration | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept.apply"})` | -- |
| Close Time Agreement | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.accept.apply"}[5m])` | -- |

### Span → Metric → Dashboard Summary

| Span Name | Prometheus Metric Filter | Grafana Dashboard |
| --- | --- | --- |
| `rpc.http_request` | `{span_name="rpc.http_request"}` | -- (available but not paneled) |
| `rpc.ws_upgrade` | `{span_name="rpc.ws_upgrade"}` | -- (available but not paneled) |
| `rpc.ws_message` | `{span_name="rpc.ws_message"}` | -- (available but not paneled) |
| `rpc.process` | `{span_name="rpc.process"}` | -- (available but not paneled) |
| `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (all 4 panels) |
| `tx.process` | `{span_name="tx.process"}` | Transaction Overview (3 panels) |
| `tx.receive` | `{span_name="tx.receive"}` | Transaction Overview (2 panels) |
| `txq.enqueue` | `{span_name="txq.enqueue"}` | -- (available but not paneled) |
| `txq.apply_direct` | `{span_name="txq.apply_direct"}` | -- (available but not paneled) |
| `txq.batch_clear` | `{span_name="txq.batch_clear"}` | -- (available but not paneled) |
| `txq.accept` | `{span_name="txq.accept"}` | -- (available but not paneled) |
| `txq.accept_tx` | `{span_name="txq.accept_tx"}` | -- (available but not paneled) |
| `txq.cleanup` | `{span_name="txq.cleanup"}` | -- (available but not paneled) |
| `consensus.round` | `{span_name="consensus.round"}` | -- (available but not paneled) |
| `consensus.phase.open` | `{span_name="consensus.phase.open"}` | -- (available but not paneled) |
| `consensus.establish` | `{span_name="consensus.establish"}` | -- (available but not paneled) |
| `consensus.update_positions` | `{span_name="consensus.update_positions"}` | -- (available but not paneled) |
| `consensus.check` | `{span_name="consensus.check"}` | -- (available but not paneled) |
| `consensus.accept` | `{span_name="consensus.accept"}` | Consensus Health (Round Duration) |
| `consensus.proposal.send` | `{span_name="consensus.proposal.send"}` | Consensus Health (Proposals Rate) |
| `consensus.ledger_close` | `{span_name="consensus.ledger_close"}` | Consensus Health (Close Duration) |
| `consensus.validation.send` | `{span_name="consensus.validation.send"}` | Consensus Health (Validation Rate) |
| `consensus.accept.apply` | `{span_name="consensus.accept.apply"}` | Consensus Health (Apply Duration, Close Time) |
| `consensus.mode_change` | `{span_name="consensus.mode_change"}` | -- (available but not paneled) |
| `consensus.proposal.receive` | `{span_name="consensus.proposal.receive"}` | -- (available but not paneled) |
| `consensus.validation.receive` | `{span_name="consensus.validation.receive"}` | -- (available but not paneled) |

## Troubleshooting

### No traces appearing in Tempo

1. Check xrpld logs for `Telemetry starting` message
2. Verify `enabled=1` in the `[telemetry]` config section
3. Test collector connectivity: `curl -v http://localhost:4318/v1/traces`
4. Check collector logs: `docker compose -f docker/telemetry/docker-compose.yml logs otel-collector`
5. Verify Tempo is receiving data: open Grafana → Explore → select Tempo datasource → search by `service.name = xrpld`
6. Check Tempo logs: `docker compose -f docker/telemetry/docker-compose.yml logs tempo`

### High memory usage

- Reduce `sampling_ratio` (e.g., `0.1` for 10% sampling)
- Reduce `max_queue_size` and `batch_size`
- Disable high-volume trace categories: `trace_peer=0`

### Collector connection failures

- Verify endpoint URL matches collector address
- Check firewall rules for ports 4317/4318
- If using TLS, verify certificate path with `tls_ca_cert`

## Performance Tuning

| Scenario | Recommendation |
| --- | --- |
| Production mainnet | `sampling_ratio=0.01`, `trace_peer=0` |
| Testnet/devnet | `sampling_ratio=1.0` (full tracing) |
| Debugging specific issue | `sampling_ratio=1.0` temporarily |
| High-throughput node | Increase `batch_size=1024`, `max_queue_size=4096` |

## Disabling Telemetry

Set `enabled=0` in config (runtime disable) or build without the flag:

```bash
cmake --preset default -Dtelemetry=OFF
```

When telemetry is compiled out, all trace macros expand to no-ops with zero overhead.