rippled/docs/telemetry-runbook.md
Pratik Mankawde 92d109ce16 docs: add external dashboard parity tasks and metric reference for Phase 9
Add Tasks 9.11-9.13 (Validator Health, Peer Quality, Ledger Economy dashboards),
new metric tables in data-collection-reference, and monitoring sections in runbook
covering validation agreement, validator health, peer quality, and state tracking.

Source: external dashboard parity design spec (2026-03-30).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:40 +01:00

# rippled Telemetry Operator Runbook
## Overview
rippled supports OpenTelemetry distributed tracing to provide visibility into RPC requests, transaction processing, and consensus rounds.
## Quick Start
### 1. Start the observability stack
```bash
docker compose -f docker/telemetry/docker-compose.yml up -d
```
This starts:
- **OTel Collector** on ports 4317 (gRPC) and 4318 (HTTP)
- **Jaeger** UI on http://localhost:16686
- **Prometheus** on http://localhost:9090
- **Loki** on http://localhost:3100 (log aggregation)
- **Grafana** on http://localhost:3000
### 2. Enable telemetry in rippled
Add to your `xrpld.cfg`:
```ini
[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces
```
### 3. Build with telemetry support
```bash
conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default
```
## Configuration Reference
| Option | Default | Description |
| -------------------- | --------------------------------- | ----------------------------------------- |
| `enabled` | `0` | Master switch for telemetry |
| `endpoint` | `http://localhost:4318/v1/traces` | OTLP/HTTP endpoint |
| `exporter` | `otlp_http` | Exporter type |
| `sampling_ratio` | `1.0` | Head-based sampling ratio (0.0 to 1.0) |
| `trace_rpc` | `1` | Enable RPC request tracing |
| `trace_transactions` | `1` | Enable transaction tracing |
| `trace_consensus` | `1` | Enable consensus tracing |
| `trace_peer` | `0` | Enable peer message tracing (high volume) |
| `trace_ledger` | `1` | Enable ledger tracing |
| `batch_size` | `512` | Max spans per batch export |
| `batch_delay_ms` | `5000` | Delay between batch exports |
| `max_queue_size` | `2048` | Max spans queued before dropping |
| `use_tls` | `0` | Use TLS for exporter connection |
| `tls_ca_cert` | (empty) | Path to CA certificate bundle |
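
The defaults in the table can be checked against a `[telemetry]` stanza before restarting a node. The following is a minimal sketch, assuming the stanza is extracted from `xrpld.cfg` and parsed on its own; `load_telemetry` is a hypothetical helper and not part of rippled, and the only validation it performs is the `sampling_ratio` range stated above.

```python
import configparser

# Defaults copied from the Configuration Reference table.
DEFAULTS = {
    "enabled": "0",
    "endpoint": "http://localhost:4318/v1/traces",
    "exporter": "otlp_http",
    "sampling_ratio": "1.0",
    "trace_rpc": "1",
    "trace_transactions": "1",
    "trace_consensus": "1",
    "trace_peer": "0",
    "trace_ledger": "1",
    "batch_size": "512",
    "batch_delay_ms": "5000",
    "max_queue_size": "2048",
    "use_tls": "0",
    "tls_ca_cert": "",
}


def load_telemetry(stanza: str) -> dict:
    """Return effective [telemetry] settings (hypothetical helper, not rippled code)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(stanza)
    overrides = dict(parser["telemetry"]) if parser.has_section("telemetry") else {}

    # Reject misspelled option names instead of silently ignoring them.
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown [telemetry] options: {sorted(unknown)}")

    settings = {**DEFAULTS, **overrides}
    ratio = float(settings["sampling_ratio"])
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("sampling_ratio must be between 0.0 and 1.0")
    return settings


if __name__ == "__main__":
    effective = load_telemetry("[telemetry]\nenabled=1\nsampling_ratio=0.25\n")
    for key, value in effective.items():
        print(f"{key}={value}")
```

Running it prints every option with its effective value, which makes it easy to see which defaults are still in force.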
## Span Reference
All spans instrumented in rippled, grouped by subsystem:
### RPC Spans (Phase 2)
| Span Name | Source File | Attributes | Description |
| -------------------- | --------------------- | ---------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| `rpc.request` | ServerHandler.cpp:271 | — | Top-level HTTP RPC request |
| `rpc.process` | ServerHandler.cpp:573 | — | RPC processing (child of rpc.request) |
| `rpc.ws_message` | ServerHandler.cpp:384 | — | WebSocket RPC message |
| `rpc.command.<name>` | RPCHandler.cpp:161 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.status`, `xrpl.rpc.duration_ms`, `xrpl.rpc.error_message` | Per-command span (e.g., `rpc.command.server_info`) |
### Transaction Spans (Phase 3)
| Span Name | Source File | Attributes | Description |
| ------------ | ------------------- | ---------------------------------------------------------------------- | ------------------------------------- |
| `tx.process` | NetworkOPs.cpp:1227 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Transaction submission and processing |
| `tx.receive` | PeerImp.cpp:1273 | `xrpl.peer.id`, `xrpl.tx.hash`, `xrpl.tx.suppressed`, `xrpl.tx.status` | Transaction received from peer relay |
| `tx.apply` | BuildLedger.cpp:88 | `xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Transaction set applied per ledger |
### Consensus Spans (Phase 4)
| Span Name | Source File | Attributes | Description |
| --------------------------- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------ |
| `consensus.proposal.send` | RCLConsensus.cpp:177 | `xrpl.consensus.round` | Consensus proposal broadcast |
| `consensus.ledger_close` | RCLConsensus.cpp:282 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
| `consensus.accept` | RCLConsensus.cpp:395 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted by consensus |
| `consensus.validation.send` | RCLConsensus.cpp:753 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent after accept |
| `consensus.accept.apply` | RCLConsensus.cpp:453 | `xrpl.consensus.close_time`, `close_time_correct`, `close_resolution_ms`, `state`, `proposing`, `round_time_ms`, `ledger.seq` | Ledger application with close time details |
#### Close Time Queries (Tempo TraceQL)
```
# Find rounds where validators disagreed on close time
{ name = "consensus.accept.apply" && span.xrpl.consensus.close_time_correct = false }
# Find consensus failures (moved_on)
{ name = "consensus.accept.apply" && span.xrpl.consensus.state = "moved_on" }
# Find slow ledger applications (>5s)
{ name = "consensus.accept.apply" && duration > 5s }
# Find specific ledger's consensus details
{ name = "consensus.accept.apply" && span.xrpl.consensus.ledger.seq = 92345678 }
```
### Ledger Spans (Phase 5)
| Span Name | Source File | Attributes | Description |
| ----------------- | -------------------- | ------------------------------------------------------------------ | ----------------------------- |
| `ledger.build` | BuildLedger.cpp:31 | `xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Ledger build during consensus |
| `ledger.validate` | LedgerMaster.cpp:915 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger promoted to validated |
| `ledger.store` | LedgerMaster.cpp:409 | `xrpl.ledger.seq` | Ledger stored in history |
### Peer Spans (Phase 5)
| Span Name | Source File | Attributes | Description |
| ------------------------- | ---------------- | ---------------------------------------------- | ----------------------------- |
| `peer.proposal.receive` | PeerImp.cpp:1667 | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | Proposal received from peer |
| `peer.validation.receive` | PeerImp.cpp:2264 | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | Validation received from peer |
## Prometheus Metrics (Spanmetrics)
The OTel Collector's spanmetrics connector automatically derives RED (Rate, Errors, Duration) metrics from every span. No custom metrics code is needed in rippled.
### Generated Metric Names
| Prometheus Metric | Type | Description |
| -------------------------------------------------- | --------- | ---------------------------- |
| `traces_span_metrics_calls_total` | Counter | Total span invocations |
| `traces_span_metrics_duration_milliseconds_bucket` | Histogram | Latency distribution buckets |
| `traces_span_metrics_duration_milliseconds_count` | Histogram | Latency observation count |
| `traces_span_metrics_duration_milliseconds_sum` | Histogram | Cumulative latency |
### Metric Labels
Every metric carries these standard labels:
| Label | Source | Example |
| -------------- | ------------------ | ---------------------------------------- |
| `span_name` | Span name | `rpc.command.server_info` |
| `status_code` | Span status | `STATUS_CODE_UNSET`, `STATUS_CODE_ERROR` |
| `service_name` | Resource attribute | `rippled` |
| `span_kind` | Span kind | `SPAN_KIND_INTERNAL` |
Additionally, span attributes configured as dimensions in the collector become metric labels (dots → underscores):
| Span Attribute | Metric Label | Applies To |
| ------------------------------ | ------------------------------ | ------------------------------- |
| `xrpl.rpc.command` | `xrpl_rpc_command` | `rpc.command.*` spans |
| `xrpl.rpc.status` | `xrpl_rpc_status` | `rpc.command.*` spans |
| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` spans |
| `xrpl.tx.local` | `xrpl_tx_local` | `tx.process` spans |
| `xrpl.peer.proposal.trusted` | `xrpl_peer_proposal_trusted` | `peer.proposal.receive` spans |
| `xrpl.peer.validation.trusted` | `xrpl_peer_validation_trusted` | `peer.validation.receive` spans |
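
When writing PromQL against a new dimension, the label name can be derived from the attribute name. This is a small sketch of the dots-to-underscores mapping shown in the table; replacing every other character that Prometheus does not allow in label names is an assumption here, and `attribute_to_label` is a hypothetical helper.

```python
import re


def attribute_to_label(attribute: str) -> str:
    """Map a span attribute name to the label name seen in Prometheus."""
    # Prometheus label names match [a-zA-Z_][a-zA-Z0-9_]*, so dots (and, by
    # assumption, any other disallowed character) become underscores.
    label = re.sub(r"[^a-zA-Z0-9_]", "_", attribute)
    if label and label[0].isdigit():
        label = "_" + label
    return label


if __name__ == "__main__":
    for name in ("xrpl.rpc.command", "xrpl.peer.proposal.trusted"):
        print(f"{name} -> {attribute_to_label(name)}")
```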
### Histogram Buckets
Configured in `otel-collector-config.yaml`:
```
1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s
```
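
These bucket boundaries determine how precise the dashboard percentiles can be. The sketch below approximates how Prometheus `histogram_quantile` interpolates linearly inside a bucket, using the boundaries listed above; it is a simplified illustration and not the Prometheus implementation, and `estimate_quantile` is a hypothetical name.

```python
import math

# Upper bounds (ms) from the collector configuration, followed by +Inf.
BUCKETS_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000]


def estimate_quantile(q: float, cumulative_counts: list) -> float:
    """Estimate the q-quantile from cumulative bucket counts (one per bound plus +Inf)."""
    bounds = BUCKETS_MS + [math.inf]
    if len(cumulative_counts) != len(bounds):
        raise ValueError("expected one cumulative count per bucket, including +Inf")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")

    total = cumulative_counts[-1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, c in enumerate(cumulative_counts) if c >= rank)

    # Observations beyond the last finite bound cannot be resolved, so the
    # estimate is capped at that bound (5000 ms here).
    if math.isinf(bounds[index]):
        return float(bounds[index - 1])

    lower = bounds[index - 1] if index > 0 else 0.0
    below = cumulative_counts[index - 1] if index > 0 else 0
    in_bucket = cumulative_counts[index] - below
    return lower + (bounds[index] - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    counts = [0, 10, 30, 60, 80, 90, 96, 99, 100, 100, 100]
    print("p50 ms:", estimate_quantile(0.50, counts))
    print("p95 ms:", estimate_quantile(0.95, counts))
```

One consequence worth keeping in mind: with a top finite bucket of 5s, a span-derived percentile never reports more than 5000 ms, however slow the underlying spans are.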
## System Metrics (beast::insight via OTel native)
rippled has a built-in metrics framework (`beast::insight`) that exports metrics natively via OTLP/HTTP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans.
### Configuration
Add to `xrpld.cfg`:
```ini
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled
```
The OTel Collector receives these via the OTLP receiver (the same OTLP/HTTP port as traces, 4318, under the `/v1/metrics` path) and exports them to Prometheus alongside spanmetrics.
#### StatsD fallback (backward compatibility)
The legacy StatsD backend is still available:
```ini
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
```
When using StatsD, uncomment the `statsd` receiver in `otel-collector-config.yaml` and add port `8125:8125/udp` to the docker-compose otel-collector service.
### Metric Reference
#### Gauges
| Prometheus Metric | Source | Description |
| --------------------------------------------- | ------------------------- | -------------------------------------------------------------------------- |
| `rippled_LedgerMaster_Validated_Ledger_Age` | LedgerMaster.h:373 | Age of validated ledger (seconds) |
| `rippled_LedgerMaster_Published_Ledger_Age` | LedgerMaster.h:374 | Age of published ledger (seconds) |
| `rippled_State_Accounting_{Mode}_duration` | NetworkOPs.cpp:774 | Time in each operating mode (Disconnected/Connected/Syncing/Tracking/Full) |
| `rippled_State_Accounting_{Mode}_transitions` | NetworkOPs.cpp:780 | Transition count per mode |
| `rippled_Peer_Finder_Active_Inbound_Peers` | PeerfinderManager.cpp:214 | Active inbound peer connections |
| `rippled_Peer_Finder_Active_Outbound_Peers` | PeerfinderManager.cpp:215 | Active outbound peer connections |
| `rippled_Overlay_Peer_Disconnects` | OverlayImpl.h:557 | Peer disconnect count |
| `rippled_job_count` | JobQueue.cpp:26 | Current job queue depth |
| `rippled_{category}_Bytes_In/Out` | OverlayImpl.h:535 | Overlay traffic bytes per category (57 categories) |
| `rippled_{category}_Messages_In/Out` | OverlayImpl.h:535 | Overlay traffic messages per category |
#### OTel MetricsRegistry Gauges (Phase 9)
These gauges are exported via the OTel Metrics SDK `PeriodicMetricReader` (10s interval), NOT through beast::insight.
| Prometheus Metric | Source | Description |
| ----------------------------------------------------------- | ------------------- | -------------------------------------------- |
| `rippled_server_info{metric="server_state"}` | MetricsRegistry.cpp | Operating mode (0=DISCONNECTED .. 4=FULL) |
| `rippled_server_info{metric="uptime"}` | MetricsRegistry.cpp | Seconds since server start |
| `rippled_server_info{metric="peers"}` | MetricsRegistry.cpp | Total connected peers |
| `rippled_server_info{metric="validated_ledger_seq"}` | MetricsRegistry.cpp | Validated ledger sequence number |
| `rippled_server_info{metric="ledger_current_index"}` | MetricsRegistry.cpp | Current open ledger sequence |
| `rippled_server_info{metric="peer_disconnects_resources"}` | MetricsRegistry.cpp | Cumulative resource-related peer disconnects |
| `rippled_server_info{metric="last_close_proposers"}` | MetricsRegistry.cpp | Proposers in last closed round |
| `rippled_server_info{metric="last_close_converge_time_ms"}` | MetricsRegistry.cpp | Last close convergence time (ms) |
| `rippled_build_info{version="<ver>"}` | MetricsRegistry.cpp | Info-style metric (always 1) |
| `rippled_complete_ledgers{bound="start\|end",index="<N>"}` | MetricsRegistry.cpp | Complete ledger range start/end pairs |
| `rippled_db_metrics{metric="db_kb_total"}` | MetricsRegistry.cpp | Total database size (KB) |
| `rippled_db_metrics{metric="db_kb_ledger"}` | MetricsRegistry.cpp | Ledger database size (KB) |
| `rippled_db_metrics{metric="db_kb_transaction"}` | MetricsRegistry.cpp | Transaction database size (KB) |
| `rippled_db_metrics{metric="historical_perminute"}` | MetricsRegistry.cpp | Historical ledger fetches per minute |
| `rippled_cache_metrics{metric="AL_size"}` | MetricsRegistry.cpp | AcceptedLedger cache size |
| `rippled_nodestore_state{metric="node_reads_duration_us"}` | MetricsRegistry.cpp | Cumulative read time (microseconds) |
| `rippled_nodestore_state{metric="read_request_bundle"}` | MetricsRegistry.cpp | Read request bundle count |
| `rippled_nodestore_state{metric="read_threads_running"}` | MetricsRegistry.cpp | Active read threads |
| `rippled_nodestore_state{metric="read_threads_total"}` | MetricsRegistry.cpp | Total read threads configured |
#### Counters
| Prometheus Metric | Source | Description |
| --------------------------------- | --------------------- | ------------------------------ |
| `rippled_rpc_requests` | ServerHandler.cpp:108 | Total RPC request count |
| `rippled_ledger_fetches` | InboundLedgers.cpp:44 | Ledger fetch request count |
| `rippled_ledger_history_mismatch` | LedgerHistory.cpp:16 | Ledger hash mismatch count |
| `rippled_warn` | Logic.h:33 | Resource manager warning count |
| `rippled_drop` | Logic.h:34 | Resource manager drop count |
#### Histograms (from StatsD timers)
| Prometheus Metric | Source | Description |
| ----------------------- | --------------------- | ------------------------------ |
| `rippled_rpc_time` | ServerHandler.cpp:110 | RPC response time (ms) |
| `rippled_rpc_size` | ServerHandler.cpp:109 | RPC response size (bytes) |
| `rippled_ios_latency` | Application.cpp:438 | I/O service loop latency (ms) |
| `rippled_pathfind_fast` | PathRequests.h:23 | Fast pathfinding duration (ms) |
| `rippled_pathfind_full` | PathRequests.h:24 | Full pathfinding duration (ms) |
## Grafana Dashboards
Fifteen dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
### RPC Performance (`rippled-rpc-perf`)
| Panel | Type | PromQL | Labels Used |
| --------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
| RPC Request Rate by Command | timeseries | `sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))` | `xrpl_rpc_command` |
| RPC Latency p95 by Command | timeseries | `histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))` | `xrpl_rpc_command` |
| RPC Error Rate | bargauge | Error spans / total spans × 100, grouped by `xrpl_rpc_command` | `xrpl_rpc_command`, `status_code` |
| RPC Latency Heatmap | heatmap | `sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])) by (le)` | `le` (bucket boundaries) |
| Overall RPC Throughput | timeseries | `rpc.request` + `rpc.process` rate | — |
| RPC Success vs Error | timeseries | by `status_code` (UNSET vs ERROR) | `status_code` |
| Top Commands by Volume | bargauge | `topk(10, ...)` by `xrpl_rpc_command` | `xrpl_rpc_command` |
| WebSocket Message Rate | stat | `rpc.ws_message` rate | — |
### Transaction Overview (`rippled-transactions`)
| Panel | Type | PromQL | Labels Used |
| --------------------------------- | ---------- | -------------------------------------------------------------------------------------------- | --------------- |
| Transaction Processing Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m])` and `tx.receive` | `span_name` |
| Transaction Processing Latency | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="tx.process"})` | — |
| Transaction Path Distribution | piechart | `sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))` | `xrpl_tx_local` |
| Transaction Receive vs Suppressed | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.receive"}[5m])` | — |
| TX Processing Duration Heatmap | heatmap | `tx.process` histogram buckets | `le` |
| TX Apply Duration per Ledger | timeseries | p95/p50 of `tx.apply` | — |
| Peer TX Receive Rate | timeseries | `tx.receive` rate | — |
| TX Apply Failed Rate | stat | `tx.apply` with `STATUS_CODE_ERROR` | `status_code` |
### Consensus Health (`rippled-consensus`)
| Panel | Type | PromQL | Labels Used |
| ----------------------------- | ---------- | ---------------------------------------------------------------------------------- | --------------------- |
| Consensus Round Duration | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept"})` | — |
| Consensus Proposals Sent Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.proposal.send"}[5m])` | — |
| Ledger Close Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.ledger_close"})` | — |
| Validation Send Rate | stat | `rate(traces_span_metrics_calls_total{span_name="consensus.validation.send"}[5m])` | — |
| Ledger Apply Duration | timeseries | `histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept.apply"})` | — |
| Close Time Agreement | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.accept.apply"}[5m])` | — |
| Consensus Mode Over Time | timeseries | `consensus.ledger_close` by `xrpl_consensus_mode` | `xrpl_consensus_mode` |
| Accept vs Close Rate | timeseries | `consensus.accept` vs `consensus.ledger_close` rate | — |
| Validation vs Close Rate | timeseries | `consensus.validation.send` vs `consensus.ledger_close` | — |
| Accept Duration Heatmap | heatmap | `consensus.accept` histogram buckets | `le` |
### Ledger Operations (`rippled-ledger-ops`)
| Panel | Type | PromQL | Labels Used |
| ----------------------- | ---------- | ---------------------------------------------- | ----------- |
| Ledger Build Rate | stat | `ledger.build` call rate | — |
| Ledger Build Duration | timeseries | p95/p50 of `ledger.build` | — |
| Ledger Validation Rate | stat | `ledger.validate` call rate | — |
| Build Duration Heatmap | heatmap | `ledger.build` histogram buckets | `le` |
| TX Apply Duration | timeseries | p95/p50 of `tx.apply` | — |
| TX Apply Rate | timeseries | `tx.apply` call rate | — |
| Ledger Store Rate | stat | `ledger.store` call rate | — |
| Build vs Close Duration | timeseries | p95 `ledger.build` vs `consensus.ledger_close` | — |
### Peer Network (`rippled-peer-net`)
Requires `trace_peer=1` in the `[telemetry]` config section.
| Panel | Type | PromQL | Labels Used |
| -------------------------------- | ---------- | --------------------------------- | ------------------------------ |
| Proposal Receive Rate | timeseries | `peer.proposal.receive` rate | — |
| Validation Receive Rate | timeseries | `peer.validation.receive` rate | — |
| Proposals Trusted vs Untrusted | piechart | by `xrpl_peer_proposal_trusted` | `xrpl_peer_proposal_trusted` |
| Validations Trusted vs Untrusted | piechart | by `xrpl_peer_validation_trusted` | `xrpl_peer_validation_trusted` |
### Node Health — System Metrics (`rippled-system-node-health`)
| Panel | Type | PromQL | Labels Used |
| -------------------------- | ---------- | ------------------------------------------------------ | ---------------- |
| Validated Ledger Age | stat | `rippled_LedgerMaster_Validated_Ledger_Age` | — |
| Published Ledger Age | stat | `rippled_LedgerMaster_Published_Ledger_Age` | — |
| Operating Mode Duration | timeseries | `rippled_State_Accounting_*_duration` | — |
| Operating Mode Transitions | timeseries | `rippled_State_Accounting_*_transitions` | — |
| I/O Latency | timeseries | `histogram_quantile(0.95, rippled_ios_latency_bucket)` | — |
| Job Queue Depth | timeseries | `rippled_job_count` | — |
| Ledger Fetch Rate | stat | `rate(rippled_ledger_fetches[5m])` | — |
| Ledger History Mismatches | stat | `rate(rippled_ledger_history_mismatch[5m])` | — |
| Server State | stat | `rippled_server_info{metric="server_state"}` | `metric` |
| Uptime | stat | `rippled_server_info{metric="uptime"}` | `metric` |
| Peer Count | stat | `rippled_server_info{metric="peers"}` | `metric` |
| Validated Ledger Seq | stat | `rippled_server_info{metric="validated_ledger_seq"}` | `metric` |
| Build Version | stat | `rippled_build_info` | `version` |
| Complete Ledger Ranges | table | `rippled_complete_ledgers` | `bound`, `index` |
| Database Sizes | timeseries | `rippled_db_metrics{metric=~"db_kb_.*"}` | `metric` |
| Historical Fetch Rate | stat | `rippled_db_metrics{metric="historical_perminute"}` | `metric` |
### Network Traffic — System Metrics (`rippled-system-network`)
| Panel | Type | PromQL | Labels Used |
| ---------------------- | ---------- | -------------------------------------- | ----------- |
| Active Peers | timeseries | `rippled_Peer_Finder_Active_*_Peers` | — |
| Peer Disconnects | timeseries | `rippled_Overlay_Peer_Disconnects` | — |
| Total Network Bytes | timeseries | `rippled_total_Bytes_In/Out` | — |
| Total Network Messages | timeseries | `rippled_total_Messages_In/Out` | — |
| Transaction Traffic | timeseries | `rippled_transactions_Messages_In/Out` | — |
| Proposal Traffic | timeseries | `rippled_proposals_Messages_In/Out` | — |
| Validation Traffic | timeseries | `rippled_validations_Messages_In/Out` | — |
| Traffic by Category | bargauge | `topk(10, rippled_*_Bytes_In)` | — |
### RPC & Pathfinding — System Metrics (`rippled-system-rpc`)
| Panel | Type | PromQL | Labels Used |
| ------------------------- | ---------- | -------------------------------------------------------- | ----------- |
| RPC Request Rate | stat | `rate(rippled_rpc_requests[5m])` | — |
| RPC Response Time | timeseries | `histogram_quantile(0.95, rippled_rpc_time_bucket)` | — |
| RPC Response Size | timeseries | `histogram_quantile(0.95, rippled_rpc_size_bucket)` | — |
| RPC Response Time Heatmap | heatmap | `rippled_rpc_time_bucket` | — |
| Pathfinding Fast Duration | timeseries | `histogram_quantile(0.95, rippled_pathfind_fast_bucket)` | — |
| Pathfinding Full Duration | timeseries | `histogram_quantile(0.95, rippled_pathfind_full_bucket)` | — |
| Resource Warnings Rate | stat | `rate(rippled_warn[5m])` | — |
| Resource Drops Rate | stat | `rate(rippled_drop[5m])` | — |
### Span → Metric → Dashboard Summary
| Span Name | Prometheus Metric Filter | Grafana Dashboard |
| --------------------------- | ----------------------------------------- | --------------------------------------------- |
| `rpc.request` | `{span_name="rpc.request"}` | RPC Performance (Overall Throughput) |
| `rpc.process` | `{span_name="rpc.process"}` | RPC Performance (Overall Throughput) |
| `rpc.ws_message` | `{span_name="rpc.ws_message"}` | RPC Performance (WebSocket Rate) |
| `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (Rate, Latency, Error, Top) |
| `tx.process` | `{span_name="tx.process"}` | Transaction Overview (Rate, Latency, Heatmap) |
| `tx.receive` | `{span_name="tx.receive"}` | Transaction Overview (Rate, Receive) |
| `tx.apply` | `{span_name="tx.apply"}` | Transaction Overview + Ledger Ops (Apply) |
| `consensus.accept` | `{span_name="consensus.accept"}` | Consensus Health (Duration, Rate, Heatmap) |
| `consensus.proposal.send` | `{span_name="consensus.proposal.send"}` | Consensus Health (Proposals Rate) |
| `consensus.ledger_close` | `{span_name="consensus.ledger_close"}` | Consensus Health (Close, Mode) |
| `consensus.validation.send` | `{span_name="consensus.validation.send"}` | Consensus Health (Validation Rate) |
| `consensus.accept.apply` | `{span_name="consensus.accept.apply"}` | Consensus Health (Apply Duration, Close Time) |
| `ledger.build` | `{span_name="ledger.build"}` | Ledger Ops (Build Rate, Duration, Heatmap) |
| `ledger.validate` | `{span_name="ledger.validate"}` | Ledger Ops (Validation Rate) |
| `ledger.store` | `{span_name="ledger.store"}` | Ledger Ops (Store Rate) |
| `peer.proposal.receive` | `{span_name="peer.proposal.receive"}` | Peer Network (Rate, Trusted/Untrusted) |
| `peer.validation.receive` | `{span_name="peer.validation.receive"}` | Peer Network (Rate, Trusted/Untrusted) |
## Log-Trace Correlation (Phase 8)
When rippled is built with `telemetry=ON`, log lines emitted within an active OpenTelemetry span automatically include `trace_id` and `span_id` fields:
```
2024-01-15T10:30:45.123Z LedgerMaster:NFO trace_id=abc123def456789012345678abcdef01 span_id=0123456789abcdef Validated ledger 42
```
This enables bidirectional navigation between logs and traces in Grafana:
- **Tempo -> Loki**: Click "Logs for this trace" on any trace in Grafana Tempo to see all log lines from that trace.
- **Loki -> Tempo**: Click the `TraceID` derived field link on any log line containing `trace_id=` to jump to the full trace in Tempo.
### Log Ingestion Pipeline
Log files are ingested by the OTel Collector's `filelog` receiver, which tails `debug.log` files and parses them with a regex that extracts `timestamp`, `partition`, `severity`, `trace_id`, `span_id`, and `message` fields. Parsed entries are exported to Grafana Loki.
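
To illustrate the kind of parsing involved, here is a sketch of a regex that extracts the same six fields from the sample log line shown earlier. The pattern is an assumption written for this example, not the expression used in `otel-collector-config.yaml`, and it follows the timestamp format of the sample rather than any other format rippled may emit.

```python
import re

# Hypothetical pattern; the collector's actual filelog regex may differ.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<partition>[^:\s]+):(?P<severity>\S+)\s+"
    # trace_id and span_id are present only for lines emitted inside a span.
    r"(?:trace_id=(?P<trace_id>[0-9a-f]{32})\s+span_id=(?P<span_id>[0-9a-f]{16})\s+)?"
    r"(?P<message>.*)$"
)


def parse_line(line: str):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "2024-01-15T10:30:45.123Z LedgerMaster:NFO "
        "trace_id=abc123def456789012345678abcdef01 "
        "span_id=0123456789abcdef Validated ledger 42"
    )
    print(parse_line(sample))
```

Lines without trace context still parse; their `trace_id` and `span_id` fields are simply `None`.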
### LogQL Query Examples
```logql
# Find all logs for a specific trace
{job="rippled"} |= "trace_id=abc123def456789012345678abcdef01"
# Error logs with trace context (log lines with ERR severity that have a trace_id)
{job="rippled"} |= "ERR" |= "trace_id="
# All logs from a specific partition that were emitted during a span
{job="rippled"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>[a-f0-9]+)` | trace_id != ""
# Extract partition, severity, and trace ID from traced log lines (choose the time range in Grafana)
{job="rippled"} |= "trace_id=" | regexp `(?P<partition>\S+):(?P<sev>\S+)\s+trace_id=(?P<tid>[a-f0-9]+)`
# Count of traced log lines per 5-minute window
count_over_time({job="rippled"} |= "trace_id=" [5m])
```
### Verifying Log Correlation
1. Start the observability stack and rippled with telemetry enabled.
2. Send an RPC request: `curl http://localhost:5005 -d '{"method":"server_info"}'`
3. Check the debug.log for `trace_id=` entries: `grep trace_id= /path/to/debug.log`
4. Open Grafana at http://localhost:3000 -> Explore -> Loki and search for `{job="rippled"} |= "trace_id="`.
5. Click the TraceID link to navigate to the corresponding trace in Tempo.
## Phase 9: OTel Metrics Alerting Rules
The following alerting rules are recommended for the Phase 9 OTel SDK metrics.
Add them to your Prometheus alerting rules configuration.
### NodeStore
| Alert Name | Severity | Condition | For | Description |
| --------------------------- | -------- | ---------------------------------------------------- | --- | ------------------------------------------------------- |
| `NodeStoreHighWriteLoad` | Warning | `rippled_nodestore_state{metric="write_load"} > 100` | 5m | NodeStore backend is under sustained write pressure |
| `NodeStoreReadQueueBacklog` | Warning | `rippled_nodestore_state{metric="read_queue"} > 500` | 5m | Prefetch thread pool is saturated; reads are backing up |
### Cache
| Alert Name | Severity | Condition | For | Description |
| ----------------------- | -------- | ------------------------------------------------------- | --- | ------------------------------------------------------ |
| `SLECacheHitRateLow` | Warning | `rippled_cache_metrics{metric="SLE_hit_rate"} < 0.5` | 10m | SLE cache is thrashing; consider increasing cache size |
| `LedgerCacheHitRateLow` | Warning | `rippled_cache_metrics{metric="ledger_hit_rate"} < 0.5` | 10m | Ledger cache hit rate is degraded |
### Transaction Queue
| Alert Name | Severity | Condition | For | Description |
| ---------------------- | -------- | ---------------------------------------------------------------------------------------------------------------------- | --- | -------------------------------------------------- |
| `TxQNearCapacity` | Warning | `rippled_txq_metrics{metric="txq_count"} / rippled_txq_metrics{metric="txq_max_size"} > 0.8` | 5m | TxQ is >80% full; transactions may be rejected |
| `TxQHighFeeEscalation` | Warning | `rippled_txq_metrics{metric="txq_open_ledger_fee_level"} / rippled_txq_metrics{metric="txq_reference_fee_level"} > 10` | 5m | Fee escalation is 10x above reference; high demand |
### Load Factor
| Alert Name | Severity | Condition | For | Description |
| --------------------- | -------- | -------------------------------------------------------------- | --- | -------------------------------------------------------------- |
| `HighLoadFactor` | Warning | `rippled_load_factor_metrics{metric="load_factor"} > 5` | 10m | Combined load factor is elevated; transactions cost 5x+ normal |
| `HighLocalLoadFactor` | Critical | `rippled_load_factor_metrics{metric="load_factor_local"} > 10` | 5m | Local server load is critically elevated |
### RPC Performance
| Alert Name | Severity | Condition | For | Description |
| ------------------ | -------- | ---------------------------------------------------------------------------------------------------------- | --- | --------------------------------- |
| `HighRPCErrorRate` | Warning | `sum(rate(rippled_rpc_method_errored_total[5m])) / sum(rate(rippled_rpc_method_started_total[5m])) > 0.05` | 5m | >5% of RPC calls are erroring |
| `SlowRPCLatency` | Warning | `histogram_quantile(0.95, sum by (le) (rate(rippled_rpc_method_duration_us_bucket[5m]))) > 5000000` | 5m | RPC p95 latency exceeds 5 seconds |
### Job Queue
| Alert Name | Severity | Condition | For | Description |
| ------------------ | -------- | ----------------------------------------------------------------------------------------------------- | --- | ---------------------------------------------------- |
| `JobQueueBacklog` | Warning | `sum(rate(rippled_job_queued_total[5m])) - sum(rate(rippled_job_finished_total[5m])) > 100` | 5m | Jobs are being queued faster than they're completing |
| `SlowJobExecution` | Warning | `histogram_quantile(0.95, sum by (le) (rate(rippled_job_running_duration_us_bucket[5m]))) > 10000000` | 5m | Job execution p95 exceeds 10 seconds |
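
The table rows above map directly onto entries in a Prometheus rule file. The following sketch renders rows as YAML in the standard `groups` / `rules` layout; the group name `rippled-phase9`, the `summary` annotation key, and the helper names are assumptions chosen for this example.

```python
import textwrap


def yaml_quote(text: str) -> str:
    """Single-quote a YAML scalar; embedded single quotes are doubled."""
    return "'" + text.replace("'", "''") + "'"


def render_rule(name: str, severity: str, expr: str, hold: str, summary: str) -> str:
    """Render one table row (alert name, severity, condition, for, description)."""
    return "\n".join([
        f"- alert: {name}",
        f"  expr: {yaml_quote(expr)}",
        f"  for: {hold}",
        "  labels:",
        f"    severity: {severity.lower()}",
        "  annotations:",
        f"    summary: {yaml_quote(summary)}",
    ])


def render_group(group_name: str, rules: list) -> str:
    """Wrap rendered rules in a single rule group."""
    body = "\n".join(textwrap.indent(rule, "    ") for rule in rules)
    return f"groups:\n- name: {group_name}\n  rules:\n{body}\n"


if __name__ == "__main__":
    rules = [
        render_rule(
            "NodeStoreHighWriteLoad", "Warning",
            'rippled_nodestore_state{metric="write_load"} > 100', "5m",
            "NodeStore backend is under sustained write pressure",
        ),
        render_rule(
            "HighLocalLoadFactor", "Critical",
            'rippled_load_factor_metrics{metric="load_factor_local"} > 10', "5m",
            "Local server load is critically elevated",
        ),
    ]
    print(render_group("rippled-phase9", rules))
```

It is worth validating the generated file with your usual rule-checking step before loading it into Prometheus.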
## Validator Health Monitoring (Phase 7+)
Phase 7 introduces native metrics for validator health, validation agreement, peer quality, ledger economy, and state tracking — inspired by the community [xrpl-validator-dashboard](https://github.com/realgrapedrop/xrpl-validator-dashboard). These metrics are exported via the OTel Metrics SDK `PeriodicMetricReader` (10s interval).
### Validation Agreement
The `ValidationTracker` class computes rolling validation agreement between this node and network consensus. It maintains 1h and 24h sliding windows, with an 8-second grace period before reconciliation and a 5-minute late-repair window.
| Prometheus Metric | Description |
| ---------------------------------------------------------- | ------------------------------ |
| `rippled_validation_agreement{metric="agreement_pct_1h"}` | Agreement % over last 1 hour |
| `rippled_validation_agreement{metric="agreement_pct_24h"}` | Agreement % over last 24 hours |
| `rippled_validation_agreement{metric="agreements_1h"}` | Agreed validations in 1h |
| `rippled_validation_agreement{metric="missed_1h"}` | Missed validations in 1h |
| `rippled_validation_agreement{metric="agreements_24h"}` | Agreed validations in 24h |
| `rippled_validation_agreement{metric="missed_24h"}` | Missed validations in 24h |
| `rippled_validations_sent_total` | Total validations sent |
| `rippled_validations_checked_total` | Total network validations seen |
| `rippled_validation_agreements_total` | Cumulative agreements |
| `rippled_validation_missed_total` | Cumulative misses |
**How reconciliation works**:
1. When the node sends a validation for ledger X, the tracker records `weValidated=true`
2. When the network validates a ledger, the tracker records `networkValidated=true`
3. After an 8-second grace period, the tracker reconciles: if both are true for the same ledger hash, it's an agreement; otherwise, a miss
4. If a late validation arrives within 5 minutes, a previous miss can be corrected (late repair)
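
The reconciliation steps above can be sketched in Python. This is a minimal illustration, not the `ValidationTracker` implementation: the names `LedgerRecord` and `ReconcileSketch`, the method signatures, and the choice to measure both windows from the first time a ledger hash is seen are assumptions made for the example. Only the 8-second grace period and the 5-minute late-repair window come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

GRACE_SECONDS = 8          # grace period before a ledger is reconciled
LATE_REPAIR_SECONDS = 300  # window in which a recorded miss can be corrected


@dataclass
class LedgerRecord:
    """Hypothetical per-ledger-hash record."""
    first_seen: float
    we_validated: bool = False
    network_validated: bool = False
    outcome: Optional[str] = None  # None until reconciled, then "agreement" or "miss"


class ReconcileSketch:
    """Illustrative model of the four steps above (not the real class)."""

    def __init__(self) -> None:
        self.records: Dict[str, LedgerRecord] = {}
        self.agreements = 0
        self.missed = 0

    def on_validation(self, ledger_hash: str, now: float, *, ours: bool) -> None:
        # Steps 1 and 2: note who validated this ledger hash.
        rec = self.records.setdefault(ledger_hash, LedgerRecord(first_seen=now))
        if ours:
            rec.we_validated = True
        else:
            rec.network_validated = True
        # Step 4: late repair. Assumption: the window starts at first_seen.
        if (rec.outcome == "miss"
                and rec.we_validated and rec.network_validated
                and now - rec.first_seen <= LATE_REPAIR_SECONDS):
            rec.outcome = "agreement"
            self.missed -= 1
            self.agreements += 1

    def reconcile(self, now: float) -> None:
        # Step 3: classify every ledger whose grace period has elapsed.
        for rec in self.records.values():
            if rec.outcome is None and now - rec.first_seen >= GRACE_SECONDS:
                if rec.we_validated and rec.network_validated:
                    rec.outcome = "agreement"
                    self.agreements += 1
                else:
                    rec.outcome = "miss"
                    self.missed += 1
```

In this sketch a ledger validated only by the network is counted as a miss once `reconcile` runs, and flips to an agreement if the node's own validation for the same hash arrives inside the repair window.
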
**When to worry**: Agreement below 90% over 24h indicates the node is missing network consensus — check connectivity, clock sync, and whether the node is in `Full` mode.
```promql
# Agreement percentage over 24 hours
rippled_validation_agreement{metric="agreement_pct_24h"}
# Validations sent per minute (expect roughly 12-20, i.e. one every 3-5s, during normal operation)
rate(rippled_validations_sent_total[5m]) * 60
# Ratio of agreements to total reconciled
rippled_validation_agreements_total / (rippled_validation_agreements_total + rippled_validation_missed_total)
```
### Validator Health Gauges
| Prometheus Metric | Description | Healthy Value |
| ------------------------------------------------------ | ----------------------------------- | ----------------------- |
| `rippled_validator_health{metric="amendment_blocked"}` | 1 if amendment-blocked, 0 if not | 0 |
| `rippled_validator_health{metric="unl_blocked"}` | 1 if UNL-blocked, 0 if not | 0 |
| `rippled_validator_health{metric="unl_expiry_days"}` | Days until UNL list expires | > 14 |
| `rippled_validator_health{metric="validation_quorum"}` | Current validation quorum threshold | Network-dependent (~28) |
```promql
# Alert if amendment blocked
rippled_validator_health{metric="amendment_blocked"} == 1
# Alert if UNL expiring within 14 days
rippled_validator_health{metric="unl_expiry_days"} < 14
```
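
For ad hoc checks outside Prometheus, the healthy values in the table can be encoded as a small helper. This is a sketch under assumptions: the function name `validator_health_problems` and the plain-dict input (keyed by the `metric` label value) are hypothetical, and the thresholds are taken directly from the "Healthy Value" column.

```python
from typing import Dict, List


def validator_health_problems(gauges: Dict[str, float]) -> List[str]:
    """Return the names of health gauges that are outside their healthy value.

    `gauges` maps the `metric` label value to its current reading.
    Gauges missing from the dict are treated as healthy (an assumption).
    """
    problems = []
    if gauges.get("amendment_blocked", 0) == 1:
        problems.append("amendment_blocked")
    if gauges.get("unl_blocked", 0) == 1:
        problems.append("unl_blocked")
    # The table lists "> 14" days as healthy, so 14 or fewer is flagged.
    if gauges.get("unl_expiry_days", float("inf")) <= 14:
        problems.append("unl_expiry_days")
    return problems
```

`validation_quorum` is left out on purpose because its healthy value is network-dependent.
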
### Peer Quality Monitoring
| Prometheus Metric | Description |
| --------------------------------------------------------- | --------------------------------------- |
| `rippled_peer_quality{metric="peer_latency_p90_ms"}` | P90 peer latency in milliseconds |
| `rippled_peer_quality{metric="peers_insane_count"}` | Peers with diverged/insane tracking |
| `rippled_peer_quality{metric="peers_higher_version_pct"}` | % of peers running a newer version |
| `rippled_peer_quality{metric="upgrade_recommended"}` | 1 if >60% of peers are on newer version |
| `rippled_Overlay_Peer_Disconnects_Charges` | Disconnects due to resource charges |
**Key insight**: If `upgrade_recommended` is 1, the node is running an older version than the majority of the network. This doesn't affect functionality immediately but may cause issues when amendments activate.
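
To make the relationship between `peers_higher_version_pct` and `upgrade_recommended` concrete, here is a minimal Python sketch. The function names, the simple `major.minor.patch` parsing, and the list-of-strings input are assumptions for illustration; rippled's actual version comparison may differ. Only the "more than 60% of peers" threshold comes from the table above.

```python
from typing import List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse '2.4.0' or '2.4.0-rc1' into (2, 4, 0).

    Hypothetical helper: pre-release suffixes are ignored.
    """
    core = text.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def peers_higher_version_pct(own: str, peers: List[str]) -> float:
    """Percentage of peers whose version is strictly newer than ours."""
    if not peers:
        return 0.0
    mine = parse_version(own)
    higher = sum(1 for peer in peers if parse_version(peer) > mine)
    return 100.0 * higher / len(peers)


def upgrade_recommended(own: str, peers: List[str], threshold_pct: float = 60.0) -> int:
    """1 if more than `threshold_pct` percent of peers are newer, else 0."""
    return 1 if peers_higher_version_pct(own, peers) > threshold_pct else 0
```

With 7 of 10 peers on a newer release the percentage is 70 and the flag is 1; with exactly half of the peers newer the flag stays 0.
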
```promql
# P90 peer latency trend
rippled_peer_quality{metric="peer_latency_p90_ms"}
# Correlate high latency with disconnects; ignoring(metric) lets the series match despite the extra label
rippled_peer_quality{metric="peer_latency_p90_ms"} > 500
  and ignoring(metric) rate(rippled_Overlay_Peer_Disconnects_Charges[5m]) > 0
```
### Ledger Economy Monitoring
| Prometheus Metric | Description |
| ----------------------------------------------------- | ---------------------------------- |
| `rippled_ledger_economy{metric="base_fee_xrp"}` | Base fee in drops |
| `rippled_ledger_economy{metric="reserve_base_xrp"}` | Account reserve in drops |
| `rippled_ledger_economy{metric="reserve_inc_xrp"}` | Owner reserve increment in drops |
| `rippled_ledger_economy{metric="ledger_age_seconds"}` | Seconds since last validated close |
| `rippled_ledger_economy{metric="transaction_rate"}` | Smoothed transaction rate |
| `rippled_ledgers_closed_total` | Total ledgers closed |
```promql
# Fee values (should match server_info output)
rippled_ledger_economy{metric="base_fee_xrp"}
# Ledger age — should reset to ~0 every 3-5s
rippled_ledger_economy{metric="ledger_age_seconds"}
# Ledger close rate (should be ~12-20 per minute)
rate(rippled_ledgers_closed_total[5m]) * 60
```
### State Tracking
| Prometheus Metric | Description |
| ---------------------------------------------------------------- | ------------------------------ |
| `rippled_state_tracking{metric="state_value"}` | Numeric state (0-6, see table) |
| `rippled_state_tracking{metric="time_in_current_state_seconds"}` | Duration in current state |
| `rippled_state_changes_total` | Total state transitions |
**State value encoding**:
| Value | State | Meaning |
| ----- | ------------ | ---------------------------------------------------- |
| 0 | disconnected | No network connectivity |
| 1 | connected | Connected but not syncing |
| 2 | syncing | Fetching ledger history |
| 3 | tracking | Following network but not fully validated |
| 4 | full | Fully synced, not validating |
| 5 | validating | Fully synced and validating |
| 6 | proposing | Fully synced, validating, and proposing in consensus |
Values 5-6 combine `OperatingMode` (0-4) with `ConsensusMode` (validating/proposing) to give a richer picture of node participation.
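
The encoding in the table can be sketched as a small Python function. The function name, its boolean parameters, and the rule that values 5 and 6 are only reported once the operating mode is `full` are assumptions drawn from the table wording ("Fully synced and validating"), not from the rippled source.

```python
STATE_NAMES = [
    "disconnected",  # 0
    "connected",     # 1
    "syncing",       # 2
    "tracking",      # 3
    "full",          # 4
    "validating",    # 5
    "proposing",     # 6
]


def state_value(operating_mode: int, validating: bool = False, proposing: bool = False) -> int:
    """Combine an operating mode (0-4) with consensus participation into 0-6.

    Assumption: consensus participation only raises the value when the
    node is already in full mode (operating_mode == 4).
    """
    if not 0 <= operating_mode <= 4:
        raise ValueError("operating_mode must be in 0..4")
    if operating_mode < 4:
        return operating_mode
    if proposing:
        return 6
    if validating:
        return 5
    return 4
```

For example, `STATE_NAMES[state_value(4, validating=True)]` gives `"validating"`, while a node that is still syncing reports 2 regardless of the consensus flags.
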
```promql
# State timeline (should stay at 5 or 6 for validators)
rippled_state_tracking{metric="state_value"}
# Alert on frequent state changes (flapping): more than 2 transitions in the past hour
increase(rippled_state_changes_total[1h]) > 2
```
### Grafana Dashboards (Phase 9)
| Dashboard | UID | Panels | Key Metrics |
| ------------------ | -------------------------- | ------ | --------------------------------------------------------- |
| Validator Health | `rippled-validator-health` | 13 | Agreement %, validation rate, amendment/UNL health, state |
| Peer Quality | `rippled-peer-quality` | 6 | P90 latency, insane peers, version awareness |
| System Node Health | (updated) | +5 | Ledger economy row: fee, reserves, age, tx rate |
---
## Troubleshooting
### No OTel SDK metrics in Prometheus
1. Verify `enabled=1` in the `[telemetry]` config section
2. Check that `metrics_endpoint` points to the OTel Collector's HTTP receiver
(default: `http://localhost:4318/v1/metrics`)
3. Check rippled logs for `MetricsRegistry: started successfully` message
4. Verify the OTel Collector is configured with an OTLP receiver and Prometheus exporter
5. Check Prometheus targets page for the collector scrape target
### Cache hit rates are zero
Cache hit rates may be zero during startup before caches are warmed. Wait for the
node to reach `Full` operating mode and process several ledgers before investigating.
### NodeStore I/O counters not incrementing
NodeStore counters are cumulative and may appear flat if the node is idle. Submit
some transactions or RPC requests to generate I/O activity.
### No traces appearing in Jaeger
1. Check rippled logs for `Telemetry starting` message
2. Verify `enabled=1` in the `[telemetry]` config section
3. Test collector connectivity: `curl -v http://localhost:4318/v1/traces` (any HTTP response, even an error status for this bare GET, shows the port is reachable)
4. Check collector logs: `docker compose logs otel-collector`
### No system metrics in Prometheus
1. Check rippled logs for `OTelCollector starting` message
2. Verify `server=otel` in the `[insight]` config section
3. Verify the endpoint in `[insight]` points to the OTLP/HTTP port (default: `http://localhost:4318/v1/metrics`)
4. Check that the `otlp` receiver is in the metrics pipeline receivers in `otel-collector-config.yaml`
5. Query Prometheus directly: `curl 'http://localhost:9090/api/v1/query?query=rippled_job_count'`
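
The direct query in the last step can also be scripted. The sketch below uses only the Python standard library and the Prometheus instant-query endpoint (`/api/v1/query`); the helper names are hypothetical, and the base URL is the default from the Quick Start stack.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-query JSON response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query(base_url: str, promql: str, timeout: float = 5.0):
    """Run the query against a live Prometheus (requires network access)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(resp.read().decode("utf-8"))


if __name__ == "__main__":
    for labels, value in query("http://localhost:9090", "rippled_job_count"):
        print(labels, value)
```

An empty list means the query succeeded but Prometheus has no series with that name, which points at the rippled or collector side rather than at Prometheus itself.
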
### Server info gauge shows server_state=0
This is normal during startup. The server starts in DISCONNECTED mode (0) and
progresses through CONNECTED (1), SYNCING (2), TRACKING (3), to FULL (4).
Wait for the node to sync with the network.
### Database metrics showing zero
The `getKBUsed*()` methods require SQLite databases to exist. If running with
`--standalone` or before the first ledger is stored, these will be zero.
### High memory usage
- Reduce `sampling_ratio` (e.g., `0.1` for 10% sampling)
- Reduce `max_queue_size` and `batch_size`
- Disable high-volume trace categories: `trace_peer=0`
### Collector connection failures
- Verify endpoint URL matches collector address
- Check firewall rules for ports 4317/4318
- If using TLS, verify certificate path with `tls_ca_cert`
### No trace_id in log output
- Verify rippled was built with `telemetry=ON` (the `XRPL_ENABLE_TELEMETRY` preprocessor flag)
- Verify `enabled=1` in the `[telemetry]` config section
- Log lines only contain `trace_id`/`span_id` when emitted inside an active span — background logs outside of RPC/consensus/transaction processing will not have trace context
- Check that the specific trace category is enabled (e.g., `trace_rpc=1`)
### No logs in Loki
- Verify the log file mount in docker-compose.yml points to the correct rippled log directory
- Check OTel Collector logs for filelog receiver errors: `docker compose logs otel-collector`
- Verify Loki is running: `curl http://localhost:3100/ready`
- Check the filelog receiver glob pattern matches your log file paths
## Performance Tuning
| Scenario | Recommendation |
| ------------------------ | ------------------------------------------------- |
| Production mainnet | `sampling_ratio=0.01`, `trace_peer=0` |
| Testnet/devnet | `sampling_ratio=1.0` (full tracing) |
| Debugging specific issue | `sampling_ratio=1.0` temporarily |
| High-throughput node     | Raise `batch_size` to `1024` and `max_queue_size` to `4096` |
## Disabling Telemetry
Set `enabled=0` in config (runtime disable) or build without the flag:
```bash
cmake --preset default -Dtelemetry=OFF
```
When telemetry is compiled out, all trace macros expand to no-ops with zero overhead.