# Observability Data Collection Reference
> **Audience**: Developers and operators. This is the single source of truth for all telemetry data collected by xrpld's observability stack.
>
> **Related docs**: [docs/telemetry-runbook.md](../docs/telemetry-runbook.md) (operator runbook with alerting and troubleshooting) | [03-implementation-strategy.md](./03-implementation-strategy.md) (code structure and performance optimization) | [04-code-samples.md](./04-code-samples.md) (C++ instrumentation examples)
## Data Flow Overview
```mermaid
graph LR
    subgraph xrpldNode["xrpld Node"]
        A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
        B["beast::insight<br/>OTel native metrics<br/>(OTLP/HTTP exporter)"]
        C["MetricsRegistry<br/>OTel SDK metrics<br/>(OTLP/HTTP exporter)"]
    end
    subgraph collector["OTel Collector :4317 / :4318"]
        direction TB
        R1["OTLP Receiver<br/>:4317 gRPC | :4318 HTTP<br/>(traces + metrics)"]
        BP["Batch Processor<br/>timeout 1s, batch 100"]
        SM["SpanMetrics Connector<br/>derives RED metrics<br/>from trace spans"]
        R1 --> BP
        BP --> SM
    end
    subgraph backends["Trace Backend"]
        D["Grafana Tempo :3200<br/>TraceQL search &<br/>S3/GCS long-term storage"]
    end
    subgraph metrics["Metrics Stack"]
        E["Prometheus :9090<br/>scrapes :8889<br/>span-derived + system metrics"]
    end
    subgraph viz["Visualization"]
        F["Grafana :3000<br/>13 dashboards"]
    end
    A -->|"OTLP/HTTP :4318<br/>(traces + attributes)"| R1
    B -->|"OTLP/HTTP :4318<br/>(gauges, counters, histograms)"| R1
    C -->|"OTLP/HTTP :4318<br/>(counters, histograms,<br/>observable gauges)"| R1
    BP -->|"OTLP/gRPC :4317"| D
    SM -->|"traces_span_metrics_calls_total<br/>traces_span_metrics_duration_milliseconds<br/>(dimension labels per §1.3)"| E
    R1 -->|"xrpld_* gauges<br/>xrpld_* counters<br/>xrpld_* histograms"| E
    E -->|"Prometheus<br/>data source"| F
    D -->|"Tempo<br/>data source"| F
style A fill:#4a90d9,color:#fff,stroke:#2a6db5
style B fill:#4a90d9,color:#fff,stroke:#2a6db5
style C fill:#4a90d9,color:#fff,stroke:#2a6db5
style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
style BP fill:#449d44,color:#fff,stroke:#2d6e2d
style SM fill:#449d44,color:#fff,stroke:#2d6e2d
style D fill:#f0ad4e,color:#000,stroke:#c78c2e
style E fill:#f0ad4e,color:#000,stroke:#c78c2e
style F fill:#5bc0de,color:#000,stroke:#3aa8c1
style xrpldNode fill:#1a2633,color:#ccc,stroke:#4a90d9
style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de
```
There are two independent telemetry pipelines entering a single **OTel Collector** via the same OTLP receiver:
1. **OpenTelemetry Traces** — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's **OTLP Receiver**. The **Batch Processor** groups spans (1s timeout, batch size 100) before forwarding to trace backends. The **SpanMetrics Connector** derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline.
2. **beast::insight OTel Metrics** — System-level gauges, counters, and histograms exported natively via OTLP/HTTP (:4318) to the same **OTLP Receiver**. These are batched and exported to Prometheus alongside span-derived metrics. The StatsD UDP transport has been replaced by native OTLP; `server=statsd` remains available as a fallback.
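
The batching rule described above (flush at 100 spans or after 1 second, whichever comes first) can be illustrated with a toy model. This is a sketch for intuition only, not the collector's implementation; the class name `SpanBatcher` and the explicit `now` argument are hypothetical, and the real processor's timer behaviour may differ in detail.

```python
class SpanBatcher:
    """Toy model: flush when max_size spans are buffered or timeout_s has elapsed."""

    def __init__(self, max_size: int = 100, timeout_s: float = 1.0):
        self.max_size = max_size
        self.timeout_s = timeout_s
        self._buffer = []
        self._first_at = None  # arrival time of the oldest buffered span

    def add(self, span, now: float):
        """Buffer one span; return a full batch if the size limit is reached."""
        if not self._buffer:
            self._first_at = now
        self._buffer.append(span)
        if len(self._buffer) >= self.max_size:
            return self._flush()
        return None

    def tick(self, now: float):
        """Return a partial batch once the oldest span has waited timeout_s."""
        if self._buffer and now - self._first_at >= self.timeout_s:
            return self._flush()
        return None

    def _flush(self):
        batch, self._buffer = self._buffer, []
        self._first_at = None
        return batch
```

Under load the size limit dominates and batches leave full; on a quiet node the timeout bounds how long a span waits before it is forwarded.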
**Trace backend** — The collector exports traces via OTLP/gRPC to:
- **Grafana Tempo** — Preferred trace backend. Supports TraceQL queries at `:3200`, S3/GCS object storage for cost-effective long-term trace retention, and integrates natively with Grafana.
> **Further reading**: [00-tracing-fundamentals.md](./00-tracing-fundamentals.md) for core OpenTelemetry concepts (traces, spans, context propagation, sampling). [07-observability-backends.md](./07-observability-backends.md) for production backend selection, collector placement, and sampling strategies.
---
## 1. OpenTelemetry Spans
### 1.1 Complete Span Inventory (~37 spans)
> **See also**: [02-design-decisions.md §2.3](./02-design-decisions.md#23-span-naming-conventions) for naming conventions and the full span catalog with rationale. [04-code-samples.md §4.6](./04-code-samples.md#46-span-flow-visualization) for span flow diagrams.
> **Span names vs. attribute keys**: span names use dotted `subsystem.operation`
> form (e.g. `rpc.http_request`). Span _attribute_ keys use the bare/underscore
> form from the 2026-05-13 naming redesign (e.g. `tx_hash`, not `xrpl.tx.hash`).
> The dotted `xrpl.*` form is reserved for OTel **resource** attributes set once
> at startup. See §1.2 for the full attribute inventory.
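
The key-naming rule can be checked mechanically, for example in a review script. A minimal sketch, assuming the rules are exactly as stated in this document (§1.2 lists one dotted exception, `xrpl.ledger.hash`); the function name and the regular expression are illustrative and not part of xrpld.

```python
import re

# Bare/underscore keys such as "tx_hash" or "rpc_status" (pattern is an assumption).
BARE_KEY_RE = re.compile(r"[a-z][a-z0-9_]*")
# The only dotted span attribute this document lists.
DOTTED_SPAN_EXCEPTIONS = {"xrpl.ledger.hash"}


def is_valid_span_attribute_key(key: str) -> bool:
    """Return True if key follows the span-attribute naming rules above."""
    return key in DOTTED_SPAN_EXCEPTIONS or BARE_KEY_RE.fullmatch(key) is not None


for candidate in ("tx_hash", "xrpl.tx.hash", "xrpl.ledger.hash"):
    print(candidate, is_valid_span_attribute_key(candidate))
```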
#### RPC Spans
Controlled by `trace_rpc=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| -------------------- | ------------------ | ----------------- | ------------------------------------------------------------------------ |
| `rpc.http_request` | — | ServerHandler.cpp | Top-level HTTP JSON-RPC request entry point |
| `rpc.ws_message` | — | ServerHandler.cpp | WebSocket message handling (one per inbound frame) |
| `rpc.ws_upgrade` | — | ServerHandler.cpp | WebSocket upgrade handshake (records handshake failures) |
| `rpc.process` | `rpc.http_request` | ServerHandler.cpp | RPC processing pipeline (single or batch request) |
| `rpc.command.<name>` | `rpc.process` | RPCHandler.cpp | Per-command span (e.g., `rpc.command.server_info`, `rpc.command.ledger`) |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"rpc.http_request|rpc.command.*"}`
**Grafana dashboard**: _RPC Performance_ (`xrpld-rpc-perf`)
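
The TraceQL strings in this section can also be submitted outside Grafana through Tempo's HTTP search endpoint. A minimal sketch, assuming Tempo is reachable at `http://localhost:3200` (the port shown in the data flow diagram) and serves its standard `/api/search` route; the helper name `tempo_search_url` is illustrative, and the code only builds the URL without sending a request.

```python
import urllib.parse


def tempo_search_url(traceql: str,
                     base_url: str = "http://localhost:3200",
                     limit: int = 20) -> str:
    """Build a Tempo search URL for a TraceQL query (no request is sent)."""
    query = urllib.parse.urlencode({"q": traceql, "limit": limit})
    return f"{base_url}/api/search?{query}"


RPC_QUERY = '{resource.service.name="xrpld" && name=~"rpc.http_request|rpc.command.*"}'
url = tempo_search_url(RPC_QUERY)
print(url)
```

URL-encoding matters here because the query contains braces, quotes, `&&`, and `|`, all of which are unsafe in a raw query string.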
#### gRPC Spans
Controlled by `trace_rpc=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| ------------------- | ------ | -------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `grpc.<MethodName>` | — | GRPCServer.cpp | One flat span per gRPC method (e.g., `grpc.GetLedger`, `grpc.GetLedgerData`, `grpc.GetLedgerDiff`, `grpc.GetLedgerEntry`) |

The method name is embedded in the span name (formed at the call site as
`grpc.<MethodName>`), so dashboards can break out per-method latency and error
rates without TraceQL attribute filters.
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"grpc.*"}`
**Grafana dashboard**: _RPC Performance_ (`xrpld-rpc-perf`)
#### Transaction Spans
Controlled by `trace_transactions=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| --------------- | -------------- | --------------- | ----------------------------------------------------------------- |
| `tx.process` | — | NetworkOPs.cpp | Transaction submission entry point (local or peer-relayed) |
| `tx.receive` | — | PeerImp.cpp | Raw transaction received from peer overlay (before deduplication) |
| `tx.apply` | `ledger.build` | BuildLedger.cpp | Transaction set applied to new ledger during consensus |
| `tx.preflight` | — | applySteps.cpp | Stateless checks stage (`stage=preflight`) |
| `tx.preclaim` | — | applySteps.cpp | Ledger-aware checks stage before fee claim (`stage=preclaim`) |
| `tx.transactor` | — | Transactor.cpp | Apply stage — the transactor runs (`stage=apply`) |
The three apply-pipeline spans share a deterministic `trace_id` derived from
`txID[0:16]`, so preflight, preclaim, and transactor for one transaction group
under a single trace even though they run sequentially and often on different
threads. A transaction that hard-fails preflight or preclaim never reaches the
later spans — the `stage` attribute identifies where it stopped.
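
Because the trace ID is derived from the transaction ID, it can be computed offline from a known transaction hash. A minimal sketch, assuming `txID[0:16]` means the first 16 bytes of the 32-byte transaction ID used verbatim as the 128-bit trace ID; the function name is hypothetical and the exact derivation in xrpld may differ.

```python
def trace_id_from_tx_id(tx_id_hex: str) -> str:
    """Return the 32-hex-character trace ID assumed to be txID[0:16]."""
    tx_id = bytes.fromhex(tx_id_hex)
    if len(tx_id) != 32:
        raise ValueError("expected a 32-byte (64 hex character) transaction ID")
    return tx_id[:16].hex()  # first 16 bytes, lowercase hex


example_tx_id = "AB" * 32  # placeholder value, not a real transaction
print(trace_id_from_tx_id(example_tx_id))
```

If the assumption holds, the result can be pasted into Grafana's trace-ID lookup to open the apply-pipeline trace directly, without running a TraceQL search.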
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"tx.process|tx.receive"}`
or, for the apply pipeline: `{resource.service.name="xrpld" && name=~"tx.preflight|tx.preclaim|tx.transactor"}`
**Grafana dashboard**: _Transaction Overview_ (`xrpld-transactions`)
#### Transaction Queue (TxQ) Spans
Controlled by `trace_transactions=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| ------------------ | ------------- | ----------- | --------------------------------------------------- |
| `txq.enqueue` | `tx.process` | TxQ.cpp | Enqueue decision when a tx is submitted |
| `txq.apply_direct` | `txq.enqueue` | TxQ.cpp | Direct apply attempt that bypasses the queue |
| `txq.batch_clear` | `txq.enqueue` | TxQ.cpp | Batch clear of an account's queued txs |
| `txq.accept` | — | TxQ.cpp | Ledger-close accept loop (drains the queue) |
| `txq.accept.tx` | `txq.accept` | TxQ.cpp | Per-queued-transaction apply inside the accept loop |
| `txq.cleanup` | — | TxQ.cpp | Post-close cleanup of expired queue entries |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"txq.*"}`
**Grafana dashboard**: _Transaction Overview_ (`xrpld-transactions`)
#### Consensus Spans
Controlled by `trace_consensus=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| ------------------------------ | ------------------ | ---------------- | ------------------------------------------------------------------- |
| `consensus.round` | — (root) | RCLConsensus.cpp | Root span for one consensus round (deterministic trace per round) |
| `consensus.phase.open` | `consensus.round` | Consensus.h | Open phase — collecting transactions before close |
| `consensus.proposal.send` | `consensus.round` | RCLConsensus.cpp | Node broadcasts its transaction set proposal |
| `consensus.ledger_close` | `consensus.round` | RCLConsensus.cpp | Ledger close event triggered by consensus |
| `consensus.establish` | `consensus.round` | Consensus.h | Establish phase — converging on the transaction set |
| `consensus.update_positions` | `consensus.round` | Consensus.h | Position update with per-dispute vote details |
| `consensus.check` | `consensus.round` | Consensus.h | Consensus threshold check (agree/disagree tally) |
| `consensus.accept` | `consensus.round` | RCLConsensus.cpp | Consensus accepts a ledger (round complete) |
| `consensus.accept.apply` | `consensus.accept` | RCLConsensus.cpp | Ledger application with close-time details (jtACCEPT thread) |
| `consensus.validation.send` | `consensus.round` | RCLConsensus.cpp | Validation message sent after ledger accepted (follows-from link) |
| `consensus.mode_change` | `consensus.round` | RCLConsensus.cpp | Operating-mode transition during the round |
| `consensus.proposal.receive` | (context) | PeerImp.cpp | Proposal received from a peer (context-propagated into the round) |
| `consensus.validation.receive` | (context) | PeerImp.cpp | Validation received from a peer (context-propagated into the round) |
The `.receive` spans are created per-message in the overlay and joined to the
round trace via context propagation rather than direct parenting. The
`consensus.validation.send` span uses a follows-from link off the round.
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"consensus.*"}`
**Grafana dashboard**: _Consensus Health_ (`xrpld-consensus`)
#### Ledger Spans
Controlled by `trace_ledger=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| ----------------- | ------ | ----------------- | ---------------------------------------------- |
| `ledger.build` | — | BuildLedger.cpp | Build new ledger from accepted transaction set |
| `ledger.validate` | — | LedgerMaster.cpp | Ledger promoted to validated status |
| `ledger.store` | — | LedgerMaster.cpp | Ledger stored to database/history |
| `ledger.acquire` | — | InboundLedger.cpp | Fetch a missing ledger from peers |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"ledger.*"}`
**Grafana dashboard**: _Ledger Operations_ (`xrpld-ledger-ops`)
#### Peer Spans
Controlled by `trace_peer=1` in `[telemetry]` config. **Disabled by default** (high volume).
| Span Name | Parent | Source File | Description |
| ------------------------- | ------ | ----------- | ------------------------------------- |
| `peer.proposal.receive` | — | PeerImp.cpp | Consensus proposal received from peer |
| `peer.validation.receive` | — | PeerImp.cpp | Validation message received from peer |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"peer.*"}`
**Grafana dashboard**: _Peer Network_ (`xrpld-peer-net`)
#### PathFind Spans
Controlled by `trace_rpc=1` in `[telemetry]` config.
| Span Name | Parent | Source File | Description |
| --------------------- | ------------------ | --------------- | ---------------------------------------------------------- |
| `pathfind.request` | `rpc.command.*` | PathRequest.cpp | `path_find` / `ripple_path_find` RPC entry |
| `pathfind.compute` | `pathfind.request` | PathRequest.cpp | Path computation for one request (`PathRequest::doUpdate`) |
| `pathfind.discover` | `pathfind.compute` | Pathfinder.cpp | Graph exploration (one per RPC call) |
| `pathfind.update_all` | — | PathRequest.cpp | Async recomputation of all active requests at ledger close |
**Where to find**: Tempo → TraceQL: `{resource.service.name="xrpld" && name=~"pathfind.*"}`
---
### 1.2 Complete Attribute Inventory (bare/underscore keys)
> **See also**: [02-design-decisions.md §2.4.2](./02-design-decisions.md#242-span-attributes-by-category) for attribute design rationale and privacy considerations.
Every span can carry key-value attributes that provide context for filtering and
aggregation. Per the 2026-05-13 naming redesign, span-attribute keys use the
**bare** field name (the span name already carries the domain), or a
domain-prefixed underscore form where a bare name would collide (e.g.
`rpc_status`, `grpc_status`, `tx_status`, `txq_status`).
> **Dotted exceptions** (do not confuse with span attributes):
>
> - `xrpl.ledger.hash` is the **only** dotted span attribute. It is a shared
> constant set on `peer.validation.receive`. Note that `consensus.validation.send`
> uses the **bare** `ledger_hash` instead.
> - `xrpl.network.id` and `xrpl.network.type` are **resource** attributes set
> once at startup on the OTel resource — not span attributes. They appear on
> every span's resource scope, queried as `{resource.xrpl.network.id=...}`.
#### RPC Attributes
| Attribute | Type | Set On | Description |
| ---------------------- | ------- | --------------------------------- | ------------------------------------------------ |
| `command` | string | `rpc.command.*`, `rpc.ws_message` | RPC command name (e.g., `server_info`, `ledger`) |
| `version` | int64 | `rpc.command.*` | API version number |
| `rpc_role` | string | `rpc.command.*` | Caller role: `"admin"` or `"user"` |
| `rpc_status` | string | `rpc.command.*` | Result: `"success"` or `"error"` |
| `request_payload_size` | int64 | `rpc.http_request` | Bytes of inbound request payload |
| `is_batch` | boolean | `rpc.process` | `true` if the request is a JSON-RPC batch |
| `batch_size` | int64 | `rpc.process` | Number of sub-requests in a batch |
| `load_type` | string | `rpc.command.*` | Resource cost category after execution |
**Tempo query**: `{span.command="server_info"}` to find all `server_info` calls.
**Prometheus label**: `command` (used as a SpanMetrics dimension).
#### gRPC Attributes
| Attribute | Type | Set On | Description |
| ------------- | ------ | ------------------- | ------------------------------------ |
| `method` | string | `grpc.<MethodName>` | gRPC method name (e.g., `GetLedger`) |
| `grpc_role` | string | `grpc.<MethodName>` | Caller role: `"admin"` or `"user"` |
| `grpc_status` | string | `grpc.<MethodName>` | Result: `"success"` or `"error"` |
**Tempo query**: `{span.method="GetLedger"}` or `{name="grpc.GetLedger"}`.
**Prometheus labels**: `method`, `grpc_role`, `grpc_status` (SpanMetrics dimensions).
#### Transaction Attributes
| Attribute | Type | Set On | Description |
| -------------- | ------- | ------------------------------------------------------------ | --------------------------------------------------------------------- |
| `tx_hash` | string | `tx.process`, `tx.receive` | Transaction hash (hex-encoded) |
| `local` | boolean | `tx.process` | `true` if locally submitted, `false` if peer-relayed |
| `path` | string | `tx.process` | Submission path: `"sync"` or `"async"` |
| `tx_type` | string | `tx.process`, `tx.preflight`, `tx.preclaim`, `tx.transactor` | Transaction type name (e.g., `Payment`) |
| `fee` | int64 | `tx.process` | Transaction fee in drops |
| `sequence` | int64 | `tx.process` | Transaction sequence number |
| `suppressed` | boolean | `tx.receive` | `true` if transaction was suppressed (duplicate) |
| `tx_status` | string | `tx.receive` | Transaction status (e.g., `"known_bad"`) |
| `peer_id` | int64 | `tx.receive` | Peer identifier (also set on peer spans) |
| `peer_version` | string | `tx.receive` | Peer protocol version string |
| `stage` | string | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Apply-pipeline stage: `preflight`, `preclaim`, or `apply` |
| `ter_result` | string | `tx.preflight`, `tx.preclaim`, `tx.transactor` | Engine result token for that stage (e.g., `tesSUCCESS`, `terPRE_SEQ`) |
| `applied` | boolean | `tx.transactor` | `true` if the transaction was applied to the ledger |
**Tempo query**: `{span.tx_hash="<tx_hash>"}` to trace a specific transaction across nodes.
**Prometheus labels**: `local`, `suppressed`, `tx_type`, `ter_result`, `stage` (SpanMetrics dimensions).
#### Transaction Queue (TxQ) Attributes
| Attribute | Type | Set On | Description |
| -------------------- | ------- | ------------------------------ | ----------------------------------------------------------- |
| `tx_hash` | string | `txq.enqueue`, `txq.accept.tx` | Transaction hash |
| `tx_type` | string | `txq.enqueue` | Transaction type name |
| `txq_status` | string | `txq.enqueue`, `txq.accept.tx` | Queue outcome (e.g. `queued`, `applied_direct`, `rejected`) |
| `fee_level_paid` | int64 | `txq.enqueue` | Fee level paid by the queued tx |
| `required_fee_level` | int64 | `txq.enqueue` | Minimum fee level for inclusion |
| `num_cleared` | int64 | `txq.batch_clear` | Entries cleared in a batch |
| `queue_size` | int64 | `txq.accept` | Current TxQ depth |
| `ledger_changed` | boolean | `txq.accept` | Whether the ledger changed since last attempt |
| `ter_code` | int64 | `txq.accept.tx` | Transaction engine result code |
| `retries_remaining` | int64 | `txq.accept.tx` | Retries left before discard |
| `ledger_seq` | int64 | `txq.cleanup` | Ledger sequence number |
| `expired_count` | int64 | `txq.cleanup` | Number of expired entries cleared |
**Prometheus label**: `txq_status` (SpanMetrics dimension).
#### Consensus Attributes
| Attribute | Type | Set On | Description |
| -------------------------- | ------- | -------------------------------------------------------------------------------------------------- | -------------------------------------------------------- |
| `consensus_ledger_id` | string | `consensus.round` | Previous-ledger id anchoring the round |
| `ledger_seq` | int64 | `consensus.round`, `consensus.ledger_close`, `consensus.accept.apply`, `consensus.validation.send` | Ledger sequence number |
| `consensus_mode` | string | `consensus.round`, `consensus.ledger_close` | Node mode: `"Proposing"`, `"Observing"`, `"Wrong"`, etc. |
| `consensus_round_id` | int64 | `consensus.round` | Round identifier |
| `consensus_phase` | string | `consensus.round` | Current phase name (updated on each transition) |
| `trace_strategy` | string | `consensus.round` | Trace-id strategy (`deterministic` / `random`) |
| `previous_ledger_seq` | int64 | `consensus.round` | Sequence of the previous ledger |
| `previous_proposers` | int64 | `consensus.round` | Proposer count in the previous round |
| `previous_round_time_ms` | int64 | `consensus.round` | Duration of the previous round |
| `consensus_round` | int64 | `consensus.proposal.send` | Proposal sequence number for the broadcast proposal |
| `is_bow_out` | boolean | `consensus.proposal.send` | Whether the proposal is a bow-out (resigning the round) |
| `tx_count_open` | int64 | `consensus.ledger_close` | Transactions in the open ledger at close |
| `close_time_resolution_ms` | int64 | `consensus.ledger_close` | Close-time rounding granularity |
| `converge_percent` | int64 | `consensus.establish`, `consensus.update_positions` | Convergence percentage |
| `establish_count` | int64 | `consensus.establish` | Establish-phase iteration count |
| `proposers` | int64 | `consensus.establish`, `consensus.update_positions`, `consensus.accept` | Number of proposers |
| `disputes_count` | int64 | `consensus.establish`, `consensus.update_positions` | Number of disputed transactions |
| `tx_id` | string | `consensus.update_positions` | Disputed transaction id (per-dispute event) |
| `dispute_our_vote` | boolean | `consensus.update_positions` | Our vote on the disputed tx |
| `dispute_yays` | int64 | `consensus.update_positions` | Yes votes on the disputed tx |
| `dispute_nays` | int64 | `consensus.update_positions` | No votes on the disputed tx |
| `agree_count` | int64 | `consensus.check` | Agreeing proposer count |
| `disagree_count` | int64 | `consensus.check` | Disagreeing proposer count |
| `threshold_percent` | int64 | `consensus.check` | Agreement threshold percentage |
| `consensus_result` | string | `consensus.check` | Check outcome |
| `consensus_stalled` | boolean | `consensus.check` | Whether the check flagged the round as stalled (SpanMetrics dimension, see §1.3) |
| `quorum` | int64 | `consensus.check`, `consensus.accept` | Quorum required |
| `round_time_ms` | int64 | `consensus.accept`, `consensus.accept.apply` | Total consensus round duration in milliseconds |
| `consensus_state` | string | `consensus.accept.apply` | Consensus outcome: `"finished"` or `"moved_on"` |
| `close_time` | int64 | `consensus.accept.apply` | Agreed-upon ledger close time (epoch seconds) |
| `close_time_correct` | boolean | `consensus.accept.apply` | Whether validators agreed on close time |
| `close_resolution_ms` | int64 | `consensus.accept.apply` | Close-time rounding granularity in milliseconds |
| `proposing` | boolean | `consensus.accept.apply`, `consensus.validation.send` | Whether this node was a proposer |
| `parent_close_time` | int64 | `consensus.accept.apply` | Parent ledger close time |
| `close_time_self` | int64 | `consensus.accept.apply` | This node's close-time vote |
| `close_time_vote_bins` | string | `consensus.accept.apply` | Distribution of close-time votes |
| `resolution_direction` | string | `consensus.accept.apply` | Whether close resolution increased/decreased/unchanged |
| `tx_count` | int64 | `consensus.accept.apply` | Transactions in the accepted set |
| `ledger_hash` | string | `consensus.validation.send` | Full hash of the validated ledger (**bare**, not dotted) |
| `full_validation` | boolean | `consensus.validation.send` | Whether this is a full validation |
| `validation_sign_time` | int64 | `consensus.validation.send` | Validation signing time |
| `mode_old` | string | `consensus.mode_change` | Operating mode before the transition |
| `mode_new` | string | `consensus.mode_change` | Operating mode after the transition |
**Tempo query**: `{span.consensus_mode="Proposing"}` to find rounds where the node was proposing.
**Prometheus labels**: `consensus_mode`, `consensus_state`, `consensus_phase`, `consensus_result`, `consensus_stalled`, `mode_new`, `close_time_correct` (SpanMetrics dimensions).
#### Ledger Attributes
| Attribute | Type | Set On | Description |
| --------------------- | ------- | ------------------------------------------------- | ------------------------------------------------ |
| `ledger_seq` | int64 | `ledger.build`, `ledger.validate`, `ledger.store`, `ledger.acquire` | Ledger sequence number |
| `close_time` | int64 | `ledger.build` | Ledger close time (epoch seconds) |
| `close_time_correct` | boolean | `ledger.build` | Whether close time was agreed upon by validators |
| `close_resolution_ms` | int64 | `ledger.build` | Close time rounding granularity in milliseconds |
| `tx_count` | int64 | `tx.apply` | Transactions applied to the ledger |
| `tx_failed` | int64 | `tx.apply` | Failed transactions in the apply set |
| `validations` | int64 | `ledger.validate` | Number of validations received for this ledger |
| `acquire_reason` | string | `ledger.acquire` | Why the ledger fetch was triggered |
| `timeouts` | int64 | `ledger.acquire` | Number of fetch timeouts |
| `peer_count` | int64 | `ledger.acquire` | Peers queried during the fetch |
| `outcome` | string | `ledger.acquire` | Fetch outcome |
The apply-step span `tx.apply` (child of `ledger.build`) carries `tx_count`/`tx_failed`;
the parent `ledger.build` carries `ledger_seq` and the close-time attributes.
`ledger.acquire` (InboundLedger) also sets `ledger_seq`.
**Tempo query**: `{span.ledger_seq=12345}` to find all spans for a specific ledger.
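
Because `ledger_seq` is an int64 attribute, the query above uses an unquoted number, whereas string attributes such as `consensus_mode` are quoted. A small sketch of a helper that picks the TraceQL literal form from the Python value type; the name `traceql_filter` is illustrative and the helper covers only the string, integer, and boolean types used in the attribute tables.

```python
def traceql_filter(key: str, value, scope: str = "span") -> str:
    """Render {scope.key=value} with a literal syntax chosen by value type."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        literal = "true" if value else "false"
    elif isinstance(value, int):
        literal = str(value)         # int64 attributes are compared unquoted
    else:
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        literal = f'"{escaped}"'     # string attributes are double-quoted
    return f"{{{scope}.{key}={literal}}}"


print(traceql_filter("ledger_seq", 12345))
print(traceql_filter("consensus_mode", "Proposing"))
print(traceql_filter("service.name", "xrpld", scope="resource"))
```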
#### Peer Attributes
| Attribute | Type | Set On | Description |
| -------------------- | ------- | ---------------------------------------------------------------- | ---------------------------------------------------- |
| `peer_id` | int64 | `tx.receive`, `peer.proposal.receive`, `peer.validation.receive` | Peer identifier |
| `proposal_trusted` | boolean | `peer.proposal.receive` | Whether the proposal came from a trusted validator |
| `validation_trusted` | boolean | `peer.validation.receive` | Whether the validation came from a trusted validator |
| `validation_full` | boolean | `peer.validation.receive` | Whether the validation is a full validation |
| `xrpl.ledger.hash` | string | `peer.validation.receive` | Validated ledger hash (**dotted** — shared constant) |
**Prometheus labels**: `proposal_trusted`, `validation_trusted` (SpanMetrics dimensions).
#### PathFind Attributes
| Attribute | Type | Set On | Description |
| ------------------------- | ------- | --------------------- | ---------------------------------------- |
| `pathfind_source_account` | string | `pathfind.request` | Originating account for the path search |
| `pathfind_dest_account` | string | `pathfind.request` | Destination account |
| `pathfind_fast` | boolean | `pathfind.compute` | Whether fast pathfinding mode is enabled |
| `pathfind_search_level` | int64 | `pathfind.discover` | Depth of graph exploration |
| `pathfind_num_paths` | int64 | `pathfind.discover` | Total paths produced |
| `pathfind_ledger_index` | int64 | `pathfind.update_all` | Target ledger index |
| `pathfind_num_requests` | int64 | `pathfind.update_all` | Active requests recomputed |
---
### 1.3 SpanMetrics — Derived Prometheus Metrics
> **See also**: [01-architecture-analysis.md](./01-architecture-analysis.md) §1.8.2 for how span-derived metrics map to operational insights.
The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in xrpld is needed.
| Prometheus Metric | Type | Description |
| -------------------------------------------------- | --------- | ------------------------------------------------------------------------------ |
| `traces_span_metrics_calls_total` | Counter | Total span invocations |
| `traces_span_metrics_duration_milliseconds_bucket` | Histogram | Latency distribution (buckets: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000 ms) |
| `traces_span_metrics_duration_milliseconds_count` | Histogram | Observation count |
| `traces_span_metrics_duration_milliseconds_sum` | Histogram | Cumulative latency |
**Standard labels on every metric**: `span_name`, `status_code`, `service_name`, `span_kind`
**Additional dimension labels** (configured in `otel-collector-config.yaml`).
The Prometheus label is the **bare span-attribute key verbatim** — the
SpanMetrics connector does not rewrite or prefix it:
| Prometheus Label / Span Attribute | Type | Applies To |
| --------------------------------- | ------- | ---------------------------------------------- |
| `command` | string | `rpc.command.*` |
| `rpc_status` | string | `rpc.command.*` |
| `consensus_mode` | string | `consensus.round`, `consensus.ledger_close` |
| `close_time_correct` | boolean | `consensus.accept.apply` |
| `local` | boolean | `tx.process` |
| `suppressed` | boolean | `tx.receive` |
| `proposal_trusted` | boolean | `peer.proposal.receive` |
| `validation_trusted` | boolean | `peer.validation.receive` |
| `tx_type` | string | `tx.*`, `txq.enqueue` |
| `ter_result` | string | `tx.preflight`, `tx.preclaim`, `tx.transactor` |
| `stage` | string | `tx.preflight`, `tx.preclaim`, `tx.transactor` |
| `txq_status` | string | `txq.enqueue`, `txq.accept.tx` |
| `consensus_state` | string | `consensus.accept.apply` |
| `load_type` | string | `rpc.command.*` |
| `is_batch` | boolean | `rpc.process` |
| `mode_new` | string | `consensus.mode_change` |
| `consensus_stalled` | boolean | `consensus.check` |
| `consensus_phase` | string | `consensus.round` |
| `consensus_result` | string | `consensus.check` |
| `method` | string | `grpc.<MethodName>` |
| `grpc_role` | string | `grpc.<MethodName>` |
| `grpc_status` | string | `grpc.<MethodName>` |
The `stage` dimension (3 values: `preflight`, `preclaim`, `apply`) turns the
apply-pipeline spans into per-stage RED metrics with no native instruments — the
_Transaction Overview_ dashboard charts rate, p95 latency, and failure rate by stage.
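
To make the p95 figure concrete, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket, which is the approach Prometheus' `histogram_quantile` takes for classic histograms. The bucket bounds are the ones listed in the table above; the sample counts are invented for illustration and the function name is hypothetical.

```python
import math

# Upper bounds (ms) of the finite buckets listed in the metrics table.
BUCKETS_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000]


def quantile_from_buckets(q, cumulative_counts, upper_bounds=BUCKETS_MS):
    """Estimate quantile q; cumulative_counts has one extra entry for +Inf."""
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    lower, prev = 0.0, 0
    for bound, count in zip(upper_bounds, cumulative_counts):
        if count >= rank:
            if count == prev:  # empty bucket, only reachable when rank is 0
                return lower
            # Linear interpolation between the bucket's lower and upper bound.
            return lower + (bound - lower) * (rank - prev) / (count - prev)
        lower, prev = float(bound), count
    # Quantile falls in the +Inf bucket: report the highest finite bound.
    return float(upper_bounds[-1])


# Invented cumulative counts for one stage (last entry is the +Inf bucket).
counts = [0, 10, 40, 70, 90, 96, 99, 100, 100, 100, 100]
print(quantile_from_buckets(0.95, counts))
```

The estimate can never be finer than the bucket layout: with these bounds, any p95 between 50 ms and 100 ms is an interpolation, not a measurement.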
> **Sampling caveat**: span-derived metrics inherit the **tracer head-sampling**
> ratio (`sampling_ratio` in `[telemetry]`, via `TraceIdRatioBasedSampler`). At
> `sampling_ratio < 1.0` the stage RED metrics undercount proportionally because
> they reflect only sampled traces, not the full transaction volume. Native
> metrics (StatsD or OTel meter) are not sampled. Account for this when reading
> absolute stage rates.
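
As a worked example of the caveat, a sampled count can be scaled back to an estimate of the true volume by dividing by the sampling ratio. This is a sketch that assumes sampling is uniform across spans, so the result is an estimate rather than an exact count; the function name is illustrative.

```python
def estimate_total(sampled_count: float, sampling_ratio: float) -> float:
    """Scale a span-derived count back to an estimated unsampled volume."""
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    return sampled_count / sampling_ratio


# 250 sampled tx.preflight spans at sampling_ratio=0.25
print(estimate_total(250, 0.25))
```

Ratios between two span-derived series, such as a failure rate, need no correction in expectation because numerator and denominator are scaled by the same factor.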
**Where to query**: Prometheus → `traces_span_metrics_calls_total{span_name="rpc.command.server_info"}`
---
## 2. System Metrics (beast::insight — OTel native)
> **See also**: [02-design-decisions.md](./02-design-decisions.md) for the beast::insight coexistence design. [06-implementation-phases.md](./06-implementation-phases.md) for the Phase 6/7 metric inventory.
>
> **Migration complete**: Phase 7 replaced the StatsD UDP transport with native OTel Metrics SDK export via OTLP/HTTP. The `beast::insight::Collector` interface and all metric names are preserved — only the wire protocol changed. `[insight] server=statsd` remains as a fallback.
These are system-level metrics emitted by xrpld's `beast::insight` framework via OTel OTLP/HTTP. They cover operational data that doesn't map to individual trace spans.
### Configuration
```ini
# Recommended: native OTel metrics via OTLP/HTTP
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=xrpld
```
Fallback (StatsD):
```ini
[insight]
server=statsd
address=127.0.0.1:8125
prefix=xrpld
```
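
A deployment script can sanity-check the `[insight]` stanza before a restart. The sketch below uses Python's `configparser`, which happens to parse these fragments but is not xrpld's own config parser. The rule that `server=otel` needs `endpoint` and `server=statsd` needs `address` is an assumption drawn from the two fragments above, and the function name is hypothetical.

```python
import configparser


def insight_settings(config_text: str) -> dict:
    """Parse an [insight] stanza and check the key each transport needs."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    settings = dict(parser["insight"])
    # Assumption: otel uses `endpoint`, statsd uses `address`.
    required = {"otel": "endpoint", "statsd": "address"}
    server = settings.get("server")
    if server not in required:
        raise ValueError(f"unsupported server value: {server!r}")
    if required[server] not in settings:
        raise ValueError(f"server={server} needs '{required[server]}'")
    return settings


OTEL = """
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=xrpld
"""
print(insight_settings(OTEL)["endpoint"])
```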
### 2.1 Gauges
| Prometheus Metric | Source File | Description | Typical Range |
| ------------------------------------------------- | --------------------- | ----------------------------------------- | ------------------------------- |
| `xrpld_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h | Seconds since last validated ledger | 0–10 (healthy), >30 (stale) |
| `xrpld_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h | Seconds since last published ledger | 0–10 (healthy) |
| `xrpld_State_Accounting_Disconnected_duration` | NetworkOPs.cpp | Cumulative seconds in Disconnected state | Monotonic |
| `xrpld_State_Accounting_Connected_duration` | NetworkOPs.cpp | Cumulative seconds in Connected state | Monotonic |
| `xrpld_State_Accounting_Syncing_duration` | NetworkOPs.cpp | Cumulative seconds in Syncing state | Monotonic |
| `xrpld_State_Accounting_Tracking_duration` | NetworkOPs.cpp | Cumulative seconds in Tracking state | Monotonic |
| `xrpld_State_Accounting_Full_duration` | NetworkOPs.cpp | Cumulative seconds in Full state | Monotonic (should dominate) |
| `xrpld_State_Accounting_Disconnected_transitions` | NetworkOPs.cpp | Count of transitions to Disconnected | Low |
| `xrpld_State_Accounting_Connected_transitions` | NetworkOPs.cpp | Count of transitions to Connected | Low |
| `xrpld_State_Accounting_Syncing_transitions` | NetworkOPs.cpp | Count of transitions to Syncing | Low |
| `xrpld_State_Accounting_Tracking_transitions` | NetworkOPs.cpp | Count of transitions to Tracking | Low |
| `xrpld_State_Accounting_Full_transitions` | NetworkOPs.cpp | Count of transitions to Full | Low (should be 1 after startup) |
| `xrpld_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp | Active inbound peer connections | 0–85 |
| `xrpld_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp | Active outbound peer connections | 10–21 |
| `xrpld_Overlay_Peer_Disconnects` | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth |
| `xrpld_Overlay_Peer_Disconnects_Charges` | OverlayImpl.cpp | Disconnects due to resource limit charges | Low growth (subset of above) |
| `xrpld_jobq_job_count` | JobQueue.cpp | Current job queue depth (group `jobq`) | 0–100 (healthy) |
**Grafana dashboard**: _Node Health (System Metrics)_ (`xrpld-system-node-health`)
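The "Typical Range" column for `xrpld_LedgerMaster_Validated_Ledger_Age` can be turned into a simple classifier. This is an illustrative Python sketch: the 10 s and 30 s thresholds come from the table, while the `degraded` label for the band in between and the function name are assumptions.

```python
def classify_ledger_age(age_seconds: float) -> str:
    """Map a validated-ledger age in seconds onto a coarse health label."""
    if age_seconds < 0:
        raise ValueError("ledger age cannot be negative")
    if age_seconds <= 10:
        return "healthy"   # 0-10 s per the table
    if age_seconds > 30:
        return "stale"     # >30 s per the table
    return "degraded"      # assumption: the table does not name this band
```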
### 2.2 Counters
| Prometheus Metric | Source File | Description |
| ------------------------------- | ------------------ | --------------------------------------------- |
| `xrpld_rpc_requests` | ServerHandler.cpp | Total RPC requests received |
| `xrpld_ledger_fetches` | InboundLedgers.cpp | Inbound ledger fetch attempts |
| `xrpld_ledger_history_mismatch` | LedgerHistory.cpp | Ledger hash mismatches detected |
| `xrpld_warn` | Logic.h | Resource manager warnings issued |
| `xrpld_drop` | Logic.h | Resource manager drops (connections rejected) |
**Note**: With `server=otel`, `xrpld_warn` and `xrpld_drop` are properly exported as OTel Counter instruments. The previous StatsD `|m` type limitation no longer applies.
**Grafana dashboard**: _RPC & Pathfinding (System Metrics)_ (`xrpld-system-rpc`)
### 2.3 Histograms (Event timers)
| Prometheus Metric | Source File | Unit | Description |
| --------------------- | ----------------- | ----- | ------------------------------ |
| `xrpld_rpc_time` | ServerHandler.cpp | ms | RPC response time distribution |
| `xrpld_rpc_size` | ServerHandler.cpp | bytes | RPC response size distribution |
| `xrpld_ios_latency` | Application.cpp | ms | I/O service loop latency |
| `xrpld_pathfind_fast` | PathRequests.h | ms | Fast pathfinding duration |
| `xrpld_pathfind_full` | PathRequests.h | ms | Full pathfinding duration |
Quantiles collected: 0th, 50th, 90th, 95th, 99th, and 100th percentiles.
**Grafana dashboards**: _Node Health_ (`ios_latency`), _RPC & Pathfinding_ (`rpc_time`, `rpc_size`, `pathfind_*`)
### 2.4 Overlay Traffic Metrics
For each of the 45+ overlay traffic categories (defined in `TrafficCount.h`), four gauges are emitted:
- `xrpld_{category}_Bytes_In`
- `xrpld_{category}_Bytes_Out`
- `xrpld_{category}_Messages_In`
- `xrpld_{category}_Messages_Out`
**Key categories**:
| Category | Description |
| ----------------------------------------------------------------- | -------------------------- |
| `total` | All traffic aggregated |
| `overhead` / `overhead_overlay` | Protocol overhead |
| `transactions` / `transactions_duplicate` | Transaction relay |
| `proposals` / `proposals_untrusted` / `proposals_duplicate` | Consensus proposals |
| `validations` / `validations_untrusted` / `validations_duplicate` | Consensus validations |
| `ledger_data_get` / `ledger_data_share` | Ledger data exchange |
| `ledger_data_Transaction_Node_get/share` | Transaction node data |
| `ledger_data_Account_State_Node_get/share` | Account state node data |
| `ledger_data_Transaction_Set_candidate_get/share` | Transaction set candidates |
| `getObject` / `haveTxSet` / `ledgerData` | Object requests |
| `ping` / `status` | Keepalive and status |
| `set_get` | Set requests |
**Grafana dashboards**: _Network Traffic_ (`xrpld-system-network`), _Overlay Traffic Detail_ (`xrpld-system-overlay-detail`), _Ledger Data & Sync_ (`xrpld-system-ledger-sync`)
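The four-gauges-per-category naming pattern can be expanded mechanically. A small Python sketch, assuming the default `xrpld` prefix from the `[insight]` stanza; `overlay_metric_names` is a hypothetical helper for building queries or expected-metric lists.

```python
# Suffixes taken from the four gauge patterns listed above.
SUFFIXES = ("Bytes_In", "Bytes_Out", "Messages_In", "Messages_Out")


def overlay_metric_names(category: str, prefix: str = "xrpld") -> list:
    """Return the four gauge names emitted for one overlay traffic category."""
    return [f"{prefix}_{category}_{suffix}" for suffix in SUFFIXES]
```

For example, `overlay_metric_names("total")` yields the names for the aggregated traffic category, including `xrpld_total_Bytes_In`.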
---
## 3. Grafana Dashboard Reference
> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8 for Grafana data source provisioning (Tempo, Prometheus) and TraceQL query examples.
### 3.1 Span-Derived Dashboards (5)
| Dashboard | UID | Data Source | Key Panels |
| -------------------- | -------------------- | ------------------------ | ---------------------------------------------------------------------------------- |
| RPC Performance | `xrpld-rpc-perf` | Prometheus (SpanMetrics) | Request rate by command, p95 latency by command, error rate, heatmap, top commands |
| Transaction Overview | `xrpld-transactions` | Prometheus (SpanMetrics) | Processing rate, latency p95/p50, local vs relay split, apply duration, heatmap |
| Consensus Health | `xrpld-consensus` | Prometheus (SpanMetrics) | Round duration p95/p50, proposals rate, close duration, mode timeline, heatmap |
| Ledger Operations | `xrpld-ledger-ops` | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison |
| Peer Network | `xrpld-peer-net` | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown |
### 3.2 System Metrics Dashboards (5)
| Dashboard | UID | Data Source | Key Panels |
| ---------------------- | ----------------------------- | ----------------- | --------------------------------------------------------------------------------- |
| Node Health | `xrpld-system-node-health` | Prometheus (OTLP) | Ledger age, operating mode, I/O latency, job queue, fetch rate |
| Network Traffic | `xrpld-system-network` | Prometheus (OTLP) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category |
| RPC & Pathfinding | `xrpld-system-rpc` | Prometheus (OTLP) | RPC rate, response time/size, pathfinding duration, resource warnings/drops |
| Overlay Traffic Detail | `xrpld-system-overlay-detail` | Prometheus (OTLP) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths |
| Ledger Data & Sync | `xrpld-system-ledger-sync` | Prometheus (OTLP) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap |
### 3.3 Accessing the Dashboards
1. Open Grafana at **http://localhost:3000**
2. Navigate to **Dashboards → xrpld** folder
3. The 10 dashboards listed in §3.1 and §3.2 are auto-provisioned from `docker/telemetry/grafana/dashboards/`
---
## 4. Tempo Trace Search Guide
> **See also**: [08-appendix.md](./08-appendix.md) §8.2 for span hierarchy visualizations. [05-configuration-reference.md](./05-configuration-reference.md) §5.8.5 for TraceQL query examples.
### Finding Traces by Type
| What to Find | Tempo TraceQL Query |
| ------------------------ | ------------------------------------------------------------------------------ |
| All RPC calls | `{resource.service.name="xrpld" && name="rpc.http_request"}` |
| Specific RPC command | `{resource.service.name="xrpld" && name="rpc.command.server_info"}` |
| Slow RPC calls           | `{resource.service.name="xrpld" && name=~"rpc.command.*" && duration > 100ms}` |
| Failed RPC calls | `{span.rpc_status="error"}` |
| gRPC method calls | `{resource.service.name="xrpld" && name="grpc.GetLedger"}` |
| Specific transaction     | `{span.tx_hash="<tx_hash>"}`                                                   |
| Local transactions only | `{span.local=true}` |
| Consensus rounds | `{resource.service.name="xrpld" && name="consensus.round"}` |
| Rounds by mode | `{span.consensus_mode="Proposing"}` |
| Specific ledger | `{span.ledger_seq=12345}` |
| Peer proposals (trusted) | `{span.proposal_trusted=true}` |
### Trace Structure
A typical RPC trace shows the span hierarchy:
```
rpc.http_request (ServerHandler)
└── rpc.process (ServerHandler)
└── rpc.command.server_info (RPCHandler)
```
A consensus round groups its lifecycle spans under a single root
(`consensus.round`); the build/ledger spans run as their own trees:
```
consensus.round (root — one per round)
├── consensus.phase.open (open phase)
├── consensus.proposal.send (broadcast proposal)
├── consensus.ledger_close (close event)
├── consensus.establish (establish phase)
├── consensus.update_positions (position updates)
├── consensus.check (threshold check)
├── consensus.accept (accept result)
│ └── consensus.accept.apply (apply, jtACCEPT thread)
└── consensus.validation.send (send validation, follows-from link)
ledger.build (build new ledger)
└── tx.apply (apply transaction set)
ledger.validate (promote to validated)
ledger.store (persist to DB)
```
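The hierarchies above follow from each span's parent reference. As a rough illustration, the Python sketch below rebuilds an indented tree from a flat span list. The dictionary keys (`id`, `parent`, `name`) are assumptions chosen for the example and do not reflect Tempo's API response format.

```python
from collections import defaultdict


def render_tree(spans: list) -> list:
    """Return indented span names, children nested under their parent span."""
    children = defaultdict(list)
    for span in spans:
        children[span.get("parent")].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + span["name"])
            walk(span["id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines
```

Spans without a parent are treated as roots, which is why `ledger.build`, `ledger.validate`, and `ledger.store` appear as separate trees next to `consensus.round`.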
---
## 5. Prometheus Query Examples
> **See also**: [05-configuration-reference.md](./05-configuration-reference.md) §5.8.7 for correlating Prometheus system metrics with trace-derived metrics.
### Span-Derived Metrics
```promql
# RPC request rate by command (last 5 minutes)
sum by (command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))
# RPC p95 latency by command
histogram_quantile(0.95, sum by (le, command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))
# Consensus round duration p95
histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name="consensus.round"}[5m])))
# Transaction processing rate (local vs relay)
sum by (local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))
# Trusted vs untrusted proposal rate
sum by (proposal_trusted) (rate(traces_span_metrics_calls_total{span_name="peer.proposal.receive"}[5m]))
```
### System Metrics
```promql
# Validated ledger age (should be < 10s)
xrpld_LedgerMaster_Validated_Ledger_Age
# Active peer count
xrpld_Peer_Finder_Active_Inbound_Peers + xrpld_Peer_Finder_Active_Outbound_Peers
# RPC response time p95 (over the last 5 minutes)
histogram_quantile(0.95, sum by (le) (rate(xrpld_rpc_time_bucket[5m])))
# Total network bytes in (rate)
rate(xrpld_total_Bytes_In[5m])
# Cumulative seconds in Full state (should keep increasing once the node is synced)
xrpld_State_Accounting_Full_duration
```
---
## 5a. Log-Trace Correlation (Phase 8)
> **Plan details**: [06-implementation-phases.md §6.8.1](./06-implementation-phases.md) — motivation, architecture, Mermaid diagrams
> **Task breakdown**: [Phase8_taskList.md](./Phase8_taskList.md) — per-task implementation details
Phase 8 injects OTel trace context into xrpld's `Logs::format()` output, enabling log-trace correlation. When a log line is emitted within an active OTel span, the trace and span identifiers are automatically appended after the severity field:
### Log Format
```
<timestamp> <partition>:<severity> trace_id=<32hex> span_id=<16hex> <message>
```
Example:
```
2024-01-15T10:30:45.123Z LedgerMaster:NFO trace_id=abc123def456789012345678abcdef01 span_id=0123456789abcdef Validated ledger 42
```
- **`trace_id=`** — 32-character lowercase hex trace identifier. Links to the distributed trace in Tempo.
- **`span_id=`** — 16-character lowercase hex span identifier. Identifies the specific span within the trace.
- **Only present** when the log is emitted within an active OTel span. Log lines outside of traced code paths have no trace context fields.
### Implementation
The trace context injection is implemented in `Logs::format()` (`src/libxrpl/basics/Log.cpp`), guarded by `#ifdef XRPL_ENABLE_TELEMETRY`. It checks the thread-local runtime context value directly (via `RuntimeContext::GetCurrent().GetValue(kSpanKey)`) to avoid the heap allocation that `GetSpan()` performs on the no-span path. On threads without an active span, the cost is a thread-local read + variant type check (~15-20ns). On the active-span path, total cost is ~50ns per log call.
### Log Ingestion Pipeline
```
xrpld debug.log -> OTel Collector filelog receiver -> regex_parser -> Loki exporter -> Grafana Loki
```
The OTel Collector's `filelog` receiver tails `debug.log` files and uses a `regex_parser` operator to extract structured fields:
| Field | Type | Description |
| ----------- | -------- | -------------------------------------------------------- |
| `timestamp` | datetime | Log timestamp |
| `partition` | string | Log partition (e.g., `LedgerMaster`, `PeerImp`) |
| `severity` | string | Severity code (`TRC`, `DBG`, `NFO`, `WRN`, `ERR`, `FTL`) |
| `trace_id` | string | 32-hex trace identifier (optional) |
| `span_id` | string | 16-hex span identifier (optional) |
| `message` | string | Log message body |
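To make the field extraction concrete, here is a Python sketch of a regular expression that yields the same six fields. It is not the collector's actual `regex_parser` pattern; it assumes the timestamp contains no whitespace (as in the example line in this section) and that the trace fields, when present, directly follow the severity.

```python
import re

# Illustrative pattern only; the real filelog operator config may differ.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<partition>[^:\s]+):(?P<severity>TRC|DBG|NFO|WRN|ERR|FTL)\s+"
    r"(?:trace_id=(?P<trace_id>[0-9a-f]{32})\s+"
    r"span_id=(?P<span_id>[0-9a-f]{16})\s+)?"
    r"(?P<message>.*)$"
)


def parse_log_line(line: str):
    """Return a dict of the structured fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None
```

For lines emitted outside an active span, `trace_id` and `span_id` come back as `None`, matching the "optional" note in the table.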
### Grafana Correlation
Bidirectional linking between logs and traces is configured via Grafana datasource provisioning:
- **Tempo -> Loki** (`tracesToLogs`): Clicking "Logs for this trace" on a Tempo trace view filters Loki logs by `trace_id`, showing all log lines from that trace.
- **Loki -> Tempo** (`derivedFields`): A regex-based derived field on the Loki datasource extracts `trace_id` from log lines and renders it as a clickable link to the corresponding trace in Tempo.
### Loki Backend
Grafana Loki (v2.9.0) serves as the log storage backend. It receives log entries from the OTel Collector's `loki` exporter via the push API at `http://loki:3100/loki/api/v1/push`.
### LogQL Query Examples
```logql
# Find all logs for a specific trace
{job="xrpld"} |= "trace_id=abc123def456789012345678abcdef01"
# Error logs with trace context
{job="xrpld"} |= "ERR" |= "trace_id="
# Logs from a specific partition with trace context
{job="xrpld"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>[a-f0-9]+)` | trace_id != ""
# Count traced log lines over time
count_over_time({job="xrpld"} |= "trace_id=" [5m])
```
---
## 5b. Internal Metric Gap Fill (Phase 9)
> **Status**: Implemented.
> **Plan details**: [06-implementation-phases.md §6.8.2](./06-implementation-phases.md) — motivation, architecture, third-party context
> **Task breakdown**: [Phase9_taskList.md](./Phase9_taskList.md) — per-task implementation details
Phase 9 fills the metrics that exist inside xrpld but previously lacked time-series export. It
uses a hybrid approach: `beast::insight` extensions for NodeStore I/O plus OTel `ObservableGauge`
async callbacks for new categories.
> **Authoritative metric names live in [§ Phase 9: OTel SDK-Exported Metrics](#phase-9-otel-sdk-exported-metrics-metricsregistry) below.**
> Most internal metrics are emitted as **labeled** gauges — one instrument carrying many logical
> values via a `metric` label (e.g. `xrpld_cache_metrics{metric="SLE_hit_rate"}`,
> `xrpld_txq_metrics{metric="txq_count"}`, `xrpld_load_factor_metrics{metric="load_factor"}`,
> `xrpld_nodestore_state{metric="node_reads_total"}`) — not the flat per-name form. Query the
> labeled names; the flat names (`xrpld_cache_SLE_hit_rate`, `xrpld_txq_count`, …) are **not** emitted.
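Since only the labeled form is emitted, tooling that builds queries should go through the `metric` label. A tiny Python sketch of a selector builder follows; `labeled_selector` is a hypothetical helper, not something shipped with xrpld.

```python
def labeled_selector(family: str, metric: str) -> str:
    """Build a PromQL selector for one logical value of a labeled gauge."""
    # Escape backslashes and double quotes so the label value stays valid PromQL.
    escaped = metric.replace("\\", "\\\\").replace('"', '\\"')
    return f'{family}{{metric="{escaped}"}}'
```

`labeled_selector("xrpld_cache_metrics", "SLE_hit_rate")` produces the selector that replaces the flat `xrpld_cache_SLE_hit_rate` name.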
#### Server Info (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------------- | ----- | -------- | -------------------------------------------- |
| `xrpld_server_info{metric="server_state"}` | Gauge | `metric` | Operating mode (0=DISCONNECTED .. 4=FULL) |
| `xrpld_server_info{metric="uptime"}` | Gauge | `metric` | Seconds since server start |
| `xrpld_server_info{metric="peers"}` | Gauge | `metric` | Total connected peers |
| `xrpld_server_info{metric="validated_ledger_seq"}` | Gauge | `metric` | Validated ledger sequence number |
| `xrpld_server_info{metric="ledger_current_index"}` | Gauge | `metric` | Current open ledger sequence |
| `xrpld_server_info{metric="peer_disconnects_resources"}` | Gauge | `metric` | Cumulative resource-related peer disconnects |
| `xrpld_server_info{metric="last_close_proposers"}` | Gauge | `metric` | Proposers in last closed round |
| `xrpld_server_info{metric="last_close_converge_time_ms"}` | Gauge | `metric` | Last close convergence time (milliseconds) |
#### Build Info (via OTel MetricsRegistry)
| Prometheus Metric                       | Type  | Labels    | Description                       |
| --------------------------------------- | ----- | --------- | --------------------------------- |
| `xrpld_build_info{version="<version>"}` | Gauge | `version` | Info-style metric, always value 1 |
#### Complete Ledger Ranges (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------- | ----- | --------------- | --------------------------- |
| `xrpld_complete_ledgers{bound="start",index="<N>"}` | Gauge | `bound`,`index` | Start of contiguous range N |
| `xrpld_complete_ledgers{bound="end",index="<N>"}`   | Gauge | `bound`,`index` | End of contiguous range N   |
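To illustrate how one `complete_ledgers` string fans out into `bound`/`index` series, the Python sketch below splits a comma-separated range string into labeled samples. It assumes the `start-end,start-end` format reported by `server_info` and zero-based `index` values; the actual indexing used by `MetricsRegistry` is not confirmed here.

```python
def ledger_range_samples(complete_ledgers: str) -> list:
    """Turn 'a-b,c-d' into (labels, value) pairs, two per contiguous range."""
    samples = []
    if complete_ledgers.strip() in ("", "empty"):
        return samples
    for index, part in enumerate(complete_ledgers.split(",")):
        start, _, end = part.strip().partition("-")
        end = end or start  # a single-ledger range has no '-'
        samples.append(({"bound": "start", "index": str(index)}, int(start)))
        samples.append(({"bound": "end", "index": str(index)}, int(end)))
    return samples
```

A node with a history gap therefore exposes four series, and a gap-free node exposes two.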
#### Database Metrics (via OTel MetricsRegistry)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------------- | ----- | -------- | --------------------------------- |
| `xrpld_db_metrics{metric="db_kb_total"}` | Gauge | `metric` | Total database size (KB) |
| `xrpld_db_metrics{metric="db_kb_ledger"}` | Gauge | `metric` | Ledger database size (KB) |
| `xrpld_db_metrics{metric="db_kb_transaction"}` | Gauge | `metric` | Transaction database size (KB) |
| `xrpld_db_metrics{metric="historical_perminute"}` | Gauge | `metric` | Historical ledger fetches per min |
#### Extended Cache Metrics (additions to existing xrpld_cache_metrics)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------- | ----- | -------- | ------------------------- |
| `xrpld_cache_metrics{metric="AL_size"}` | Gauge | `metric` | AcceptedLedger cache size |
#### Extended NodeStore Metrics (additions to existing xrpld_nodestore_state)
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------------------------- | ----- | -------- | ----------------------------------- |
| `xrpld_nodestore_state{metric="node_reads_duration_us"}` | Gauge | `metric` | Cumulative read time (microseconds) |
| `xrpld_nodestore_state{metric="read_request_bundle"}` | Gauge | `metric` | Read request bundle count |
| `xrpld_nodestore_state{metric="read_threads_running"}` | Gauge | `metric` | Active read threads |
| `xrpld_nodestore_state{metric="read_threads_total"}` | Gauge | `metric` | Total read threads configured |
### New Grafana Dashboards (Phase 9)
| Dashboard | UID | Data Source | Key Panels |
| ------------------ | ------------------ | ----------- | ----------------------------------------------------------------- |
| Fee Market & TxQ | `xrpld-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown, escalation |
| Job Queue Analysis | `xrpld-job-queue` | Prometheus | Per-job rates, queue wait times, execution times, queue depth |
---
## 5c. Future: Synthetic Workload Generation & Telemetry Validation (Phase 10)
> **Plan details**: [06-implementation-phases.md §6.8.3](./06-implementation-phases.md) — motivation, architecture
> **Task breakdown**: [Phase10_taskList.md](./Phase10_taskList.md) — per-task implementation details
> **Tools**: [docker/telemetry/workload/](../docker/telemetry/workload/) — RPC load generator, transaction submitter, validation suite, benchmarks
Phase 10 builds a 5-node validator docker-compose harness with RPC load generators, transaction submitters, and automated validation scripts that verify that all spans, metrics, dashboards, and log-trace correlation work end-to-end. It also includes a benchmark suite that compares overhead with telemetry on and off.
### Running the Validation Suite
```bash
# Full end-to-end validation (start cluster, generate load, validate):
docker/telemetry/workload/run-full-validation.sh --xrpld .build/xrpld
# Validation only (assumes stack and cluster are already running):
python3 docker/telemetry/workload/validate_telemetry.py --report /tmp/report.json
# Performance benchmark (baseline vs telemetry):
docker/telemetry/workload/benchmark.sh --xrpld .build/xrpld --duration 300
```
### Validated Telemetry Inventory
> **Counting note — families vs series.** A _metric family_ is one distinct Prometheus `__name__`
> (histogram `_bucket`/`_count`/`_sum` collapsed to one). A _series_ is a family × its label
> combinations. The legacy overlay-traffic block is the bulk of the count: ~56 message categories ×
> 4 (`_Bytes_In/_Out`, `_Messages_In/_Out`) ≈ 224 families on its own. The labeled gauges
> (`xrpld_cache_metrics{metric}`, …) are few families but many series. Validate against the figures
> below as **families currently emitting** (idle nodes under-report — workload-gated metrics such as
> per-RPC/error counters appear only once exercised, which is Phase 10's purpose).
| Category | Expected Count | Validation Method | Config File |
| ------------------------- | ------------------------- | -------------------------------- | ----------------------- |
| Trace spans | ~37 (required + optional) | Tempo API query | `expected_spans.json` |
| Span attributes | per-span assertion | Per-span attribute assertion | `expected_spans.json` |
| Legacy `xrpld_*` families | ~270 (≈224 traffic) | Prometheus `__name__` query | `expected_metrics.json` |
| Native MetricsRegistry | 35 instruments | Prometheus query | `expected_metrics.json` |
| SpanMetrics RED | 4 per span | Prometheus query | `expected_metrics.json` |
| Grafana dashboards | 15 | Dashboard API "no data" check | `expected_metrics.json` |
| Log-trace links | Present | Loki query + Tempo reverse check | — |
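The family-versus-series distinction from the counting note can be sketched in Python. This is not the logic of `validate_telemetry.py`; it is an illustrative helper that collapses `_bucket`/`_count`/`_sum` only when a matching `_bucket` name exists, so that gauges whose names happen to end in `_count` (such as `xrpld_jobq_job_count`) remain their own family.

```python
HISTOGRAM_SUFFIXES = ("_bucket", "_count", "_sum")


def metric_families(names) -> set:
    """Collapse histogram component names into one family per histogram."""
    names = set(names)
    # A name is a histogram base only if its _bucket series is present.
    bases = {n[: -len("_bucket")] for n in names if n.endswith("_bucket")}
    families = set()
    for name in names:
        for suffix in HISTOGRAM_SUFFIXES:
            if name.endswith(suffix) and name[: -len(suffix)] in bases:
                families.add(name[: -len(suffix)])
                break
        else:
            families.add(name)  # plain gauge or counter
    return families
```

The overlay estimate in the note is plain arithmetic on top of this: about 56 categories times 4 gauges gives roughly 224 families.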
### Performance Overhead Targets
| Metric | Target | Measurement Method |
| ----------------- | ------------ | ----------------------------------- |
| CPU overhead | < 3% | ps avg CPU% baseline vs telemetry |
| Memory overhead | < 5MB | ps peak RSS baseline vs telemetry |
| RPC p99 latency | < 2ms impact | server_info round-trip timing |
| Throughput impact | < 5% | Ledger close rate comparison |
| Consensus impact | < 1% | Consensus round time p95 comparison |
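The targets above can be checked mechanically once baseline and telemetry-on measurements exist. The Python sketch below is not `benchmark.sh`; the dictionary keys are hypothetical, and it assumes relative change for CPU, throughput, and consensus, and absolute difference for memory (MB) and RPC p99 latency (ms).

```python
def relative_increase_pct(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


def check_overhead(baseline: dict, telemetry: dict) -> dict:
    """Return pass/fail per target from the table above."""
    return {
        "cpu": relative_increase_pct(baseline["cpu_pct"], telemetry["cpu_pct"]) < 3.0,
        "memory": telemetry["rss_mb"] - baseline["rss_mb"] < 5.0,
        "rpc_p99": telemetry["rpc_p99_ms"] - baseline["rpc_p99_ms"] < 2.0,
        # Throughput is a drop, so negate the relative change.
        "throughput": -relative_increase_pct(
            baseline["ledgers_per_s"], telemetry["ledgers_per_s"]) < 5.0,
        "consensus": relative_increase_pct(
            baseline["consensus_p95_ms"], telemetry["consensus_p95_ms"]) < 1.0,
    }
```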
---
## 5d. Future: Third-Party Data Collection Pipelines (Phase 11)
> **Status**: Planned, not yet implemented.
> **Plan details**: [06-implementation-phases.md §6.8.4](./06-implementation-phases.md) — motivation, architecture, consumer gap analysis
> **Task breakdown**: [Phase11_taskList.md](./Phase11_taskList.md) — per-task implementation details
Phase 11 builds a custom OTel Collector receiver (Go) that polls xrpld's admin RPCs and exports `xrpl_*` metrics for external consumers. It requires no xrpld code changes.
### Exported Metrics (via Custom OTel Collector Receiver)
#### Node Health (from server_info)
| Prometheus Metric | Type | Description |
| --------------------------------------- | ----- | ----------------------------------------------- |
| `xrpl_server_state` | Gauge | Operating mode (0=disconnected ... 5=proposing) |
| `xrpl_server_state_duration_seconds` | Gauge | Seconds in current state |
| `xrpl_uptime_seconds` | Gauge | Consecutive seconds running |
| `xrpl_io_latency_ms` | Gauge | I/O subsystem latency |
| `xrpl_amendment_blocked` | Gauge | 1 if amendment-blocked, 0 otherwise |
| `xrpl_peers_count` | Gauge | Connected peers |
| `xrpl_validated_ledger_seq` | Gauge | Latest validated ledger sequence |
| `xrpl_validated_ledger_age_seconds` | Gauge | Seconds since last validated close |
| `xrpl_last_close_proposers` | Gauge | Proposers in last consensus round |
| `xrpl_last_close_converge_time_seconds` | Gauge | Last consensus round duration |
| `xrpl_load_factor` | Gauge | Transaction cost multiplier |
| `xrpl_state_duration_seconds` | Gauge | Per-state duration (`state` label) |
| `xrpl_state_transitions_total` | Gauge | Per-state transition count (`state` label) |
#### Peer Topology (from peers)
| Prometheus Metric | Type | Description |
| --------------------------- | ----- | ----------------------------------- |
| `xrpl_peers_inbound_count` | Gauge | Inbound peer connections |
| `xrpl_peers_outbound_count` | Gauge | Outbound peer connections |
| `xrpl_peer_latency_p50_ms` | Gauge | Median peer latency |
| `xrpl_peer_latency_p95_ms` | Gauge | p95 peer latency |
| `xrpl_peer_version_count` | Gauge | Peers per version (`version` label) |
| `xrpl_peer_diverged_count` | Gauge | Peers with diverged tracking status |
#### Validator & Amendment (from validators, feature)
| Prometheus Metric | Type | Description |
| ------------------------------------- | ----- | --------------------------------------- |
| `xrpl_trusted_validators_count` | Gauge | UNL validator count |
| `xrpl_amendment_enabled_count` | Gauge | Enabled amendments |
| `xrpl_amendment_majority_count` | Gauge | Amendments with majority |
| `xrpl_amendment_unsupported_majority` | Gauge | 1 if unsupported amendment has majority |
| `xrpl_validator_list_active` | Gauge | 1 if validator list is active |
#### Fee Market (from fee)
| Prometheus Metric | Type | Description |
| -------------------------------- | ----- | ------------------------------------- |
| `xrpl_fee_open_ledger_fee_drops` | Gauge | Minimum fee for open ledger inclusion |
| `xrpl_fee_median_fee_drops` | Gauge | Median fee level |
| `xrpl_fee_queue_size` | Gauge | Current transaction queue depth |
| `xrpl_fee_current_ledger_size` | Gauge | Transactions in current open ledger |
#### DEX & AMM (optional, from book_offers, amm_info)
| Prometheus Metric          | Type  | Labels          | Description            |
| -------------------------- | ----- | --------------- | ---------------------- |
| `xrpl_amm_tvl_drops`       | Gauge | `pool="<pool>"` | Total value locked     |
| `xrpl_amm_trading_fee`     | Gauge | `pool="<pool>"` | Pool trading fee (bps) |
| `xrpl_orderbook_bid_depth` | Gauge | `pair="<pair>"` | Total bid volume       |
| `xrpl_orderbook_ask_depth` | Gauge | `pair="<pair>"` | Total ask volume       |
| `xrpl_orderbook_spread`    | Gauge | `pair="<pair>"` | Best bid-ask spread    |
### Phase 9: OTel SDK-Exported Metrics (MetricsRegistry)
Phase 9 introduces the `MetricsRegistry` class (`src/xrpld/telemetry/MetricsRegistry.h/.cpp`)
which registers metrics directly with the OpenTelemetry Metrics SDK. These are exported
via OTLP/HTTP to the OTel Collector and scraped by Prometheus.
#### NodeStore I/O (Observable Gauge — `nodestore_state`)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------- | ----- | -------- | ------------------------------------ |
| `xrpld_nodestore_state{metric="node_reads_total"}` | Gauge | `metric` | Cumulative NodeStore read operations |
| `xrpld_nodestore_state{metric="node_reads_hit"}` | Gauge | `metric` | Reads served from cache |
| `xrpld_nodestore_state{metric="node_writes"}` | Gauge | `metric` | Cumulative write operations |
| `xrpld_nodestore_state{metric="node_written_bytes"}` | Gauge | `metric` | Cumulative bytes written |
| `xrpld_nodestore_state{metric="node_read_bytes"}` | Gauge | `metric` | Cumulative bytes read |
| `xrpld_nodestore_state{metric="write_load"}` | Gauge | `metric` | Current write load score |
| `xrpld_nodestore_state{metric="read_queue"}` | Gauge | `metric` | Items in read prefetch queue |
#### Cache Hit Rates & Sizes (Observable Gauge — `cache_metrics`)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------- | ----- | -------- | ----------------------------- |
| `xrpld_cache_metrics{metric="SLE_hit_rate"}` | Gauge | `metric` | SLE cache hit rate (0.0-1.0) |
| `xrpld_cache_metrics{metric="ledger_hit_rate"}` | Gauge | `metric` | Ledger cache hit rate |
| `xrpld_cache_metrics{metric="AL_hit_rate"}` | Gauge | `metric` | AcceptedLedger cache hit rate |
| `xrpld_cache_metrics{metric="treenode_cache_size"}` | Gauge | `metric` | SHAMap TreeNode cache entries |
| `xrpld_cache_metrics{metric="treenode_track_size"}` | Gauge | `metric` | Tracked tree nodes |
| `xrpld_cache_metrics{metric="fullbelow_size"}` | Gauge | `metric` | FullBelow cache entries |
#### Transaction Queue (Observable Gauge — `txq_metrics`)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------------- | ----- | -------- | -------------------------------- |
| `xrpld_txq_metrics{metric="txq_count"}` | Gauge | `metric` | Transactions currently in queue |
| `xrpld_txq_metrics{metric="txq_max_size"}` | Gauge | `metric` | Maximum queue capacity |
| `xrpld_txq_metrics{metric="txq_in_ledger"}` | Gauge | `metric` | Transactions in open ledger |
| `xrpld_txq_metrics{metric="txq_per_ledger"}` | Gauge | `metric` | Expected transactions per ledger |
| `xrpld_txq_metrics{metric="txq_reference_fee_level"}` | Gauge | `metric` | Reference fee level |
| `xrpld_txq_metrics{metric="txq_min_processing_fee_level"}` | Gauge | `metric` | Minimum fee to get processed |
| `xrpld_txq_metrics{metric="txq_med_fee_level"}` | Gauge | `metric` | Median fee level in queue |
| `xrpld_txq_metrics{metric="txq_open_ledger_fee_level"}` | Gauge | `metric` | Open ledger fee escalation level |
#### Per-RPC Method Metrics (Synchronous Counters/Histogram)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------- | --------- | ----------------- | -------------------------------- |
| `xrpld_rpc_method_started_total`  | Counter   | `method="<name>"` | RPC calls started                |
| `xrpld_rpc_method_finished_total` | Counter   | `method="<name>"` | RPC calls completed successfully |
| `xrpld_rpc_method_errored_total`  | Counter   | `method="<name>"` | RPC calls that errored           |
| `xrpld_rpc_method_duration_us`    | Histogram | `method="<name>"` | Execution time distribution (us) |
#### Per-Job-Type Metrics (Synchronous Counters/Histogram)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------- | --------- | ------------------- | --------------------------------- |
| `xrpld_job_queued_total`        | Counter   | `job_type="<name>"` | Jobs enqueued                     |
| `xrpld_job_started_total`       | Counter   | `job_type="<name>"` | Jobs started                      |
| `xrpld_job_finished_total`      | Counter   | `job_type="<name>"` | Jobs completed                    |
| `xrpld_job_queued_duration_us`  | Histogram | `job_type="<name>"` | Queue wait time distribution (us) |
| `xrpld_job_running_duration_us` | Histogram | `job_type="<name>"` | Execution time distribution (us)  |
#### Counted Object Instances (Observable Gauge — `object_count`)
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------------- | ----- | --------------- | ------------------------------ |
| `xrpld_object_count{type="Transaction"}`     | Gauge | `type="<name>"` | Live Transaction objects       |
| `xrpld_object_count{type="Ledger"}`          | Gauge | `type="<name>"` | Live Ledger objects            |
| `xrpld_object_count{type="NodeObject"}`      | Gauge | `type="<name>"` | Live NodeObject instances      |
| `xrpld_object_count{type="STTx"}`            | Gauge | `type="<name>"` | Serialized transaction objects |
| `xrpld_object_count{type="STLedgerEntry"}`   | Gauge | `type="<name>"` | Serialized ledger entries      |
| `xrpld_object_count{type="InboundLedger"}`   | Gauge | `type="<name>"` | Ledgers being fetched          |
| `xrpld_object_count{type="Pathfinder"}`      | Gauge | `type="<name>"` | Active pathfinding operations  |
| `xrpld_object_count{type="PathRequest"}`     | Gauge | `type="<name>"` | Active path requests           |
| `xrpld_object_count{type="HashRouterEntry"}` | Gauge | `type="<name>"` | Hash router entries            |
#### Load Factor Breakdown (Observable Gauge — `load_factor_metrics`)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------------------- | ----- | -------- | --------------------------------------- |
| `xrpld_load_factor_metrics{metric="load_factor"}` | Gauge | `metric` | Combined transaction cost multiplier |
| `xrpld_load_factor_metrics{metric="load_factor_server"}` | Gauge | `metric` | Server + cluster + network contribution |
| `xrpld_load_factor_metrics{metric="load_factor_local"}` | Gauge | `metric` | Local server load only |
| `xrpld_load_factor_metrics{metric="load_factor_net"}` | Gauge | `metric` | Network-wide load estimate |
| `xrpld_load_factor_metrics{metric="load_factor_cluster"}` | Gauge | `metric` | Cluster peer load |
| `xrpld_load_factor_metrics{metric="load_factor_fee_escalation"}` | Gauge | `metric` | Open ledger fee escalation |
| `xrpld_load_factor_metrics{metric="load_factor_fee_queue"}` | Gauge | `metric` | Queue entry fee level |
#### Prometheus Query Examples (Phase 9)
```promql
# NodeStore cache hit ratio (ignoring(metric) lets the two differently labeled series match)
xrpld_nodestore_state{metric="node_reads_hit"} / ignoring(metric) xrpld_nodestore_state{metric="node_reads_total"}
# RPC error rate for server_info
rate(xrpld_rpc_method_errored_total{method="server_info"}[5m])
# Job queue wait time p95
histogram_quantile(0.95, sum by (le) (rate(xrpld_job_queued_duration_us_bucket[5m])))
# TxQ utilization percentage
100 * xrpld_txq_metrics{metric="txq_count"} / ignoring(metric) xrpld_txq_metrics{metric="txq_max_size"}
# High load factor alert candidate
xrpld_load_factor_metrics{metric="load_factor"} > 5
```
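Because `node_reads_hit` and `node_reads_total` are cumulative, dividing their current values gives a lifetime ratio. A windowed ratio uses the difference between two scrapes instead. The Python sketch below shows that arithmetic under the assumption that neither value reset between the two samples; `windowed_hit_ratio` is a hypothetical helper.

```python
def windowed_hit_ratio(earlier: tuple, later: tuple):
    """Hit ratio between two (reads_total, reads_hit) samples, or None if idle."""
    total_delta = later[0] - earlier[0]
    hit_delta = later[1] - earlier[1]
    if total_delta <= 0:
        return None  # no reads in the window, or a counter reset
    return hit_delta / total_delta
```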
### Phase 7+: External Dashboard Parity Metrics
> **Source**: [External Dashboard Parity Spec](../docs/superpowers/specs/2026-03-30-external-dashboard-parity-design.md) — metrics inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard).
>
> **Task breakdown**: Phase 7 Tasks 7.9-7.16 (implementation), Phase 9 Tasks 9.11-9.13 (dashboards)
These metrics fill gaps identified by comparing xrpld's internal observability with the community external dashboard's 86-metric coverage. All are exported via the OTel Metrics SDK (same `PeriodicMetricReader` as Phase 9 metrics).
#### Validation Agreement (Observable Gauge — `validation_agreement`)
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------------------------- | ------ | -------- | --------------------------------------- |
| `xrpld_validation_agreement{metric="agreement_pct_1h"}` | Double | `metric` | Rolling 1h agreement percentage (0-100) |
| `xrpld_validation_agreement{metric="agreement_pct_24h"}` | Double | `metric` | Rolling 24h agreement percentage |
| `xrpld_validation_agreement{metric="agreements_1h"}` | Int64 | `metric` | Agreed validations in 1h window |
| `xrpld_validation_agreement{metric="missed_1h"}` | Int64 | `metric` | Missed validations in 1h window |
| `xrpld_validation_agreement{metric="agreements_24h"}` | Int64 | `metric` | Agreed validations in 24h window |
| `xrpld_validation_agreement{metric="missed_24h"}` | Int64 | `metric` | Missed validations in 24h window |
Data source: `ValidationTracker` class with 8s grace period and 5m late repair window.
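
This section does not say how the percentage gauges relate to the count gauges. A minimal sketch, assuming the percentage is simply agreements over agreements plus misses within the same window (the function name and the value returned for an empty window are illustrative assumptions, not taken from `ValidationTracker`):

```python
def agreement_pct(agreements: int, missed: int) -> float:
    """Assumed relationship between the count gauges and the pct gauges."""
    total = agreements + missed
    if total == 0:
        # Nothing reconciled in the window yet. What xrpld reports in this
        # case is not documented here; the sketch returns 0.0.
        return 0.0
    return 100.0 * agreements / total


# Example: 90 agreed and 10 missed validations in the 1h window.
print(agreement_pct(90, 10))  # prints 90.0
```

If the assumption holds, `agreements_1h` and `missed_1h` are enough to recompute `agreement_pct_1h` on the dashboard side as a cross-check.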
#### Validator Health (Observable Gauge — `validator_health`)
| Prometheus Metric | Type | Labels | Description |
| ---------------------------------------------------- | ------ | -------- | ------------------------------ |
| `xrpld_validator_health{metric="amendment_blocked"}` | Int64 | `metric` | 1 if amendment-blocked, else 0 |
| `xrpld_validator_health{metric="unl_blocked"}` | Int64 | `metric` | 1 if UNL-blocked, else 0 |
| `xrpld_validator_health{metric="unl_expiry_days"}` | Double | `metric` | Days until UNL list expires |
| `xrpld_validator_health{metric="validation_quorum"}` | Int64 | `metric` | Validation quorum threshold |
#### Peer Quality (Observable Gauge — `peer_quality`)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------------------- | ------ | -------- | ------------------------------------ |
| `xrpld_peer_quality{metric="peer_latency_p90_ms"}` | Double | `metric` | P90 peer latency in milliseconds |
| `xrpld_peer_quality{metric="peers_insane_count"}` | Int64 | `metric` | Peers with diverged tracking status |
| `xrpld_peer_quality{metric="peers_higher_version_pct"}` | Double | `metric` | % of peers on newer xrpld version |
| `xrpld_peer_quality{metric="upgrade_recommended"}` | Int64 | `metric` | 1 if >60% of peers are on a newer version |
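
The last two rows can be read as one derived from the other. A minimal sketch, assuming `upgrade_recommended` is derived directly from `peers_higher_version_pct` with a strict 60% threshold as the description suggests (the function names and the handling of zero peers are illustrative assumptions):

```python
def peers_higher_version_pct(newer_peers: int, total_peers: int) -> float:
    """Percentage of connected peers running a newer version."""
    if total_peers == 0:
        return 0.0  # assumption: no peers means nothing to compare against
    return 100.0 * newer_peers / total_peers


def upgrade_recommended(pct: float, threshold_pct: float = 60.0) -> int:
    """Return 1 when strictly more than the threshold share of peers is newer."""
    return 1 if pct > threshold_pct else 0


print(upgrade_recommended(peers_higher_version_pct(7, 10)))  # prints 1
print(upgrade_recommended(peers_higher_version_pct(6, 10)))  # prints 0
```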
#### Ledger Economy (Observable Gauge — `ledger_economy`)
| Prometheus Metric | Type | Labels | Description |
| --------------------------------------------------- | ------ | -------- | ---------------------------------- |
| `xrpld_ledger_economy{metric="base_fee_xrp"}` | Double | `metric` | Base transaction fee in drops |
| `xrpld_ledger_economy{metric="reserve_base_xrp"}` | Double | `metric` | Account reserve in drops |
| `xrpld_ledger_economy{metric="reserve_inc_xrp"}` | Double | `metric` | Owner reserve increment in drops |
| `xrpld_ledger_economy{metric="ledger_age_seconds"}` | Double | `metric` | Seconds since last validated close |
| `xrpld_ledger_economy{metric="transaction_rate"}` | Double | `metric` | Smoothed transaction rate (tx/s) |
#### State Tracking (Observable Gauge — `state_tracking`)
| Prometheus Metric | Type | Labels | Description |
| -------------------------------------------------------------- | ------ | -------- | -------------------------------------- |
| `xrpld_state_tracking{metric="state_value"}` | Int64 | `metric` | Numeric state 0-6 (see encoding below) |
| `xrpld_state_tracking{metric="time_in_current_state_seconds"}` | Double | `metric` | Duration in current state |
State value encoding: 0=disconnected, 1=connected, 2=syncing, 3=tracking, 4=full, 5=validating (FULL + validating), 6=proposing (FULL + proposing).
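
For tooling that reads `state_value` back from Prometheus, the encoding above can be mirrored as an enum. This is a consumer-side sketch only; the class and function names are hypothetical and do not correspond to identifiers in xrpld:

```python
from enum import IntEnum


class StateValue(IntEnum):
    """Mirror of the documented state_value encoding (0-6)."""
    DISCONNECTED = 0
    CONNECTED = 1
    SYNCING = 2
    TRACKING = 3
    FULL = 4
    VALIDATING = 5  # FULL + validating
    PROPOSING = 6   # FULL + proposing


def describe_state(value: float) -> str:
    """Map a scraped gauge sample to its state name."""
    try:
        return StateValue(int(value)).name.lower()
    except ValueError:
        return f"unknown({value})"


def is_at_least_full(value: float) -> bool:
    """States 4-6 all imply the server is in FULL mode."""
    return int(value) >= StateValue.FULL


print(describe_state(4.0))   # prints full
print(is_at_least_full(3))   # prints False
```

Because 5 and 6 both imply FULL, a threshold comparison such as "at least 4" covers validating and proposing nodes as well.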
#### Storage Detail (Observable Gauge — `storage_detail`)
| Prometheus Metric | Type | Labels | Description |
| ------------------------------------------- | ----- | -------- | ---------------------- |
| `xrpld_storage_detail{metric="nudb_bytes"}` | Int64 | `metric` | NuDB backend file size |
#### Synchronous Counters (Phase 7+)
| Prometheus Metric | Type | Description | Increment Site |
| --------------------------------- | ------- | ------------------------------- | ---------------- |
| `xrpld_ledgers_closed_total` | Counter | Ledgers closed by consensus | RCLConsensus.cpp |
| `xrpld_validations_sent_total` | Counter | Validations sent | RCLConsensus.cpp |
| `xrpld_validations_checked_total` | Counter | Network validations observed | LedgerMaster.cpp |
| `xrpld_state_changes_total` | Counter | Operating mode transitions | NetworkOPs.cpp |
| `xrpld_jq_trans_overflow_total` | Counter | Job queue transaction overflows | JobQueue.cpp |
Lifetime validation agreement/miss tallies are exported as monotonic **ObservableCounters**
(not synchronous counters) observed from `ValidationTracker`'s gross lifetime totals:
| Prometheus Metric | Type | Description | Source |
| ----------------------------------- | ----------------- | ------------------------------------------ | --------------------- |
| `xrpld_validation_agreements_total` | ObservableCounter | Lifetime validations that initially agreed | ValidationTracker.cpp |
| `xrpld_validation_missed_total` | ObservableCounter | Lifetime validations that initially missed | ValidationTracker.cpp |
> **Counting semantics (initial-classification only):** each reconciled ledger increments exactly
> one of these two counters, at first classification. A later late-repair (miss → agreement) does
> **not** move either counter — keeping both strictly monotonic (a Prometheus `_total` must never
> decrease) and additive (`agreements_total + missed_total` = ledgers reconciled). The
> repair-aware, windowed view remains on `xrpld_validation_agreement{metric="…"}`.
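
The difference between the two views can be illustrated with a small model. This is a sketch of the counting semantics described above, not the `ValidationTracker` implementation: the class and method names are hypothetical, and window expiry is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationTallySketch:
    """Hypothetical model of the lifetime counters and the windowed view."""

    agreements_total: int = 0  # lifetime, initial classification only
    missed_total: int = 0      # lifetime, initial classification only
    window: dict = field(default_factory=dict)  # ledger_seq -> agreed?, repair-aware

    def classify(self, ledger_seq: int, agreed: bool) -> None:
        """First classification: increment exactly one lifetime counter."""
        if ledger_seq in self.window:
            return  # already reconciled; never count a ledger twice
        self.window[ledger_seq] = agreed
        if agreed:
            self.agreements_total += 1
        else:
            self.missed_total += 1

    def late_repair(self, ledger_seq: int) -> None:
        """Miss -> agreement: only the windowed view changes."""
        if self.window.get(ledger_seq) is False:
            self.window[ledger_seq] = True


tally = ValidationTallySketch()
tally.classify(100, agreed=True)
tally.classify(101, agreed=False)
tally.late_repair(101)

# Lifetime counters keep the initial classification.
print(tally.agreements_total, tally.missed_total)  # prints 1 1
# The windowed view reflects the repair.
print(sum(tally.window.values()))  # prints 2
```

After the repair the lifetime counters still sum to the two ledgers reconciled, while the windowed view counts both as agreed.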
#### Span Attribute Enrichments (Phases 2-4)
| Span Name | New Attribute | Type | Source |
| --------------------------- | ------------------------------------ | ------ | ------------------------ |
| `rpc.command.*` | `xrpl.node.amendment_blocked` | bool | Phase 2 — RPCHandler.cpp |
| `rpc.command.*` | `xrpl.node.server_state` | string | Phase 2 — RPCHandler.cpp |
| `tx.receive` | `xrpl.peer.version` | string | Phase 3 — PeerImp.cpp |
| `consensus.validation.send` | `xrpl.validation.ledger_hash` | string | Phase 4 — RCLConsensus |
| `consensus.validation.send` | `xrpl.validation.full` | bool | Phase 4 — RCLConsensus |
| `peer.validation.receive` | `xrpl.peer.validation.ledger_hash` | string | Phase 4 — PeerImp.cpp |
| `peer.validation.receive` | `xrpl.peer.validation.full` | bool | Phase 4 — PeerImp.cpp |
| `consensus.accept` | `xrpl.consensus.validation_quorum` | int64 | Phase 4 — RCLConsensus |
| `consensus.accept` | `xrpl.consensus.proposers_validated` | int64 | Phase 4 — RCLConsensus |
### New Grafana Dashboards (Phase 9)
| Dashboard | UID | Data Source | Key Panels |
| ---------------------- | ------------------------ | ----------- | --------------------------------------------------------- |
| Fee Market & TxQ | `xrpld-fee-market` | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown |
| Job Queue Analysis | `xrpld-job-queue` | Prometheus | Per-job rates, queue wait times, execution times |
| RPC Performance (OTel) | `xrpld-rpc-perf` | Prometheus | Per-method call rates, error rates, latency distributions |
| Validator Health | `xrpld-validator-health` | Prometheus | Agreement %, validation rate, amendment/UNL, state |
| Peer Quality | `xrpld-peer-quality` | Prometheus | P90 latency, insane peers, version awareness, disconnects |
### Updated Grafana Dashboards (Phase 9)
| Dashboard | UID | New Panels Added |
| -------------------- | -------------------------- | -------------------------------------------------------------------- |
| Node Health (StatsD) | `xrpld-statsd-node-health` | NodeStore I/O, cache hit rates, object instance counts |
| System Node Health | `xrpld-system-node-health` | Ledger economy row: base fee, reserves, ledger age, transaction rate |
### New Grafana Dashboards (Phase 11)
| Dashboard | UID | Data Source | Key Panels |
| ------------------ | --------------------------- | ----------- | ---------------------------------------------------------------------- |
| Validator Health | `xrpld-validator-health` | Prometheus | Server state timeline, proposer count, converge time, amendment voting |
| Network Topology | `xrpld-network-topology` | Prometheus | Peer count, version distribution, latency distribution, diverged peers |
| Fee Market (Ext) | `xrpld-fee-market-external` | Prometheus | Fee levels, queue depth, load factor breakdown, escalation timeline |
| DEX & AMM Overview | `xrpld-dex-amm` | Prometheus | AMM TVL, order book depth, spread trends, trading fee revenue |
### Prometheus Alerting Rules (Phase 11)
| Alert Name | Severity | Condition | For |
| ---------------------------------- | -------- | ----------------------------------------------------------- | --- |
| `XRPLServerNotFull` | Critical | `xrpl_server_state < 4` | 15m |
| `XRPLAmendmentBlocked` | Critical | `xrpl_amendment_blocked == 1` | 1m |
| `XRPLNoPeers` | Critical | `xrpl_peers_count == 0` | 5m |
| `XRPLLedgerStale` | Critical | `xrpl_validated_ledger_age_seconds > 120` | 2m |
| `XRPLHighIOLatency` | Critical | `xrpl_io_latency_ms > 100` | 5m |
| `XRPLUnsupportedAmendmentMajority` | Critical | `xrpl_amendment_unsupported_majority == 1` | 1m |
| `XRPLLowPeerCount` | Warning | `xrpl_peers_count < 10` | 15m |
| `XRPLHighLoadFactor` | Warning | `xrpl_load_factor > 10` | 10m |
| `XRPLSlowConsensus` | Warning | `xrpl_last_close_converge_time_seconds > 6` | 5m |
| `XRPLValidatorListExpiring` | Warning | `(xrpl_validator_list_expiration_seconds - time()) < 86400` | 1h |
| `XRPLStateFlapping` | Warning | `rate(xrpl_state_transitions_total{state="full"}[1h]) > 2` | 30m |
---
## 6. Known Issues
| Issue | Impact | Status |
| ------------------------------------------------------------------ | ------------------------------------------------ | -------------------------------------------------------------------- |
| `warn` and `drop` metrics use non-standard StatsD `\|m` meter type | Metrics silently dropped by OTel StatsD receiver | Phase 6 Task 6.1 — needs `\|m` → `\|c` change in StatsDCollector.cpp |
| `xrpld_jobq_job_count` may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity |
| `xrpld_rpc_requests` depends on `[insight]` config | Zero series if StatsD not configured | Requires `[insight] server=statsd` in xrpld.cfg |
| Peer tracing disabled by default | No `peer.*` spans unless `trace_peer=1` | Intentional — high volume on mainnet |
---
## 7. Privacy and Data Collection
The telemetry system limits both what it records and where that data is sent:
- **No private keys** are ever included in spans or metrics
- **No account balances** or other financial data are traced
- **Transaction hashes** are included (public on-ledger data) but not transaction contents
- **Peer IDs** are internal identifiers, not IP addresses
- **All telemetry is opt-in** — disabled by default at build time (`-Dtelemetry=OFF`)
- **Sampling** reduces data volume — `sampling_ratio=0.01` recommended for production
- **Data stays local** — the default stack sends data to `localhost` only
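
To make the effect of `sampling_ratio` concrete, the sketch below shows one common way a ratio sampler can decide deterministically from the trace ID, so every span of a trace gets the same decision. It illustrates the idea only; it is not claimed to be the algorithm used by the OTel SDK or by xrpld, and `should_sample` is a hypothetical name.

```python
def should_sample(trace_id: int, sampling_ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be within [0, 1]")
    low64 = trace_id & ((1 << 64) - 1)
    threshold = int(sampling_ratio * (1 << 64))
    return low64 < threshold


# With uniformly distributed trace IDs, about 1 in 100 traces is kept at 0.01.
print(should_sample(0, 0.01))              # prints True
print(should_sample((1 << 64) - 1, 0.01))  # prints False
```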
---
## 8. Configuration Quick Reference
> **Full reference**: [05-configuration-reference.md](./05-configuration-reference.md) §5.1 for all `[telemetry]` options with defaults, the config parser implementation, and collector YAML configurations (dev and production).
### Minimal Setup (development)
```ini
[telemetry]
enabled=1
[insight]
server=statsd
address=127.0.0.1:8125
prefix=xrpld
```
### Production Setup
```ini
[telemetry]
enabled=1
endpoint=http://otel-collector:4318/v1/traces
sampling_ratio=0.01
trace_peer=0
batch_size=1024
max_queue_size=4096
[insight]
server=statsd
address=otel-collector:8125
prefix=xrpld
```
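
A quick sanity check of these values can be scripted before deployment. The sketch below parses only the `[telemetry]` stanza with Python's `configparser`; it assumes `batch_size` and `max_queue_size` map onto a batch span processor's export batch size and queue size (where the batch size should not exceed the queue size), which this document does not confirm. The function name is hypothetical, and a full `xrpld.cfg` may contain stanzas that `configparser` cannot parse.

```python
import configparser

PRODUCTION_CFG = """\
[telemetry]
enabled=1
endpoint=http://otel-collector:4318/v1/traces
sampling_ratio=0.01
trace_peer=0
batch_size=1024
max_queue_size=4096
"""


def check_telemetry(cfg_text: str) -> list[str]:
    """Return a list of problems found in the [telemetry] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("telemetry"):
        return ["missing [telemetry] section"]

    section = parser["telemetry"]
    problems = []

    ratio = section.getfloat("sampling_ratio", fallback=None)
    if ratio is not None and not 0.0 <= ratio <= 1.0:
        problems.append(f"sampling_ratio={ratio} is outside [0, 1]")

    # Assumption: batch_size is the export batch size of a batch processor.
    batch = section.getint("batch_size", fallback=None)
    queue = section.getint("max_queue_size", fallback=None)
    if batch is not None and queue is not None and batch > queue:
        problems.append(f"batch_size={batch} exceeds max_queue_size={queue}")

    return problems


print(check_telemetry(PRODUCTION_CFG))  # prints []
```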
### Trace Category Toggle
| Config Key | Default | Controls |
| -------------------- | ------- | ---------------------------- |
| `trace_rpc` | `1` | `rpc.*` spans |
| `trace_transactions` | `1` | `tx.*` spans |
| `trace_consensus` | `1` | `consensus.*` spans |
| `trace_ledger` | `1` | `ledger.*` spans |
| `trace_peer` | `0` | `peer.*` spans (high volume) |
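
The table can be read as a mapping from span-name prefix to config key. The following sketch shows how a tool might predict whether a given span name would be emitted under a given `[telemetry]` section. It assumes the category is the first dot-separated component of the span name and that spans outside the listed categories are not gated; both are assumptions, as are the names `TRACE_TOGGLES` and `span_enabled`.

```python
# Span-name prefix -> (config key, default), copied from the table above.
TRACE_TOGGLES = {
    "rpc": ("trace_rpc", True),
    "tx": ("trace_transactions", True),
    "consensus": ("trace_consensus", True),
    "ledger": ("trace_ledger", True),
    "peer": ("trace_peer", False),
}


def span_enabled(span_name: str, telemetry_cfg: dict[str, str]) -> bool:
    """Predict whether a span would be emitted for the given config values."""
    category = span_name.split(".", 1)[0]
    if category not in TRACE_TOGGLES:
        return True  # assumption: spans outside the listed categories are not gated
    key, default = TRACE_TOGGLES[category]
    raw = telemetry_cfg.get(key)
    if raw is None:
        return default
    return raw.strip() == "1"


print(span_enabled("peer.validation.receive", {}))                   # prints False
print(span_enabled("peer.validation.receive", {"trace_peer": "1"}))  # prints True
```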