mirror of https://github.com/XRPLF/rippled.git synced 2026-03-19 11:12:29 +00:00

Pratik Mankawde 598ff8b108 Phase 7: Native OTel metrics migration (Tasks 7.1-7.7)

Replace StatsD UDP metric transport with native OpenTelemetry Metrics SDK
export via OTLP/HTTP behind the existing beast::insight::Collector interface.

- Task 7.1: Link opentelemetry-cpp to beast module in CMake when telemetry=ON
- Task 7.2: New OTelCollector class mapping beast::insight instruments to OTel
  SDK (Counter, ObservableGauge, Histogram, Counter<uint64>) with OTLP/HTTP
  export via PeriodicMetricReader at 1s intervals
- Task 7.3: Add server=otel branch to CollectorManager with endpoint config
- Task 7.4: Update otel-collector-config.yaml to use OTLP receiver for metrics
  pipeline (StatsD receiver commented out for backward compat)
- Task 7.5: Metric names preserved via dot-to-underscore formatting matching
  StatsD->Prometheus conventions
- Task 7.6: Rename Grafana dashboards from statsd-* to system-*, update titles
  and UIDs from "StatsD" to "System Metrics"
- Task 7.7: Update integration test to use server=otel, verify OTLP metrics
- Task 7.8: Update runbook, TESTING.md, config reference, and data collection
  reference docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-11 15:02:10 +00:00

rippled Telemetry Operator Runbook

Overview

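rippled supports OpenTelemetry distributed tracing to provide visibility into RPC requests, transaction processing, consensus rounds, ledger operations, and peer messages. It can also export its built-in beast::insight system metrics over OTLP/HTTP. This runbook covers both.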

Quick Start

1. Start the observability stack

docker compose -f docker/telemetry/docker-compose.yml up -d

This starts:

OTel Collector on ports 4317 (gRPC) and 4318 (HTTP)
Jaeger UI on http://localhost:16686
Prometheus on http://localhost:9090
Grafana on http://localhost:3000

2. Enable telemetry in rippled

Add to your xrpld.cfg:

[telemetry]
enabled=1
endpoint=http://localhost:4318/v1/traces

3. Build with telemetry support

conan install . --build=missing -o telemetry=True
cmake --preset default -Dtelemetry=ON
cmake --build --preset default

Configuration Reference

Option	Default	Description
`enabled`	`0`	Master switch for telemetry
`endpoint`	`http://localhost:4318/v1/traces`	OTLP/HTTP endpoint
`exporter`	`otlp_http`	Exporter type
`sampling_ratio`	`1.0`	Head-based sampling ratio (0.0–1.0)
`trace_rpc`	`1`	Enable RPC request tracing
`trace_transactions`	`1`	Enable transaction tracing
`trace_consensus`	`1`	Enable consensus tracing
`trace_peer`	`0`	Enable peer message tracing (high volume)
`trace_ledger`	`1`	Enable ledger tracing
`batch_size`	`512`	Max spans per batch export
`batch_delay_ms`	`5000`	Delay between batch exports
`max_queue_size`	`2048`	Max spans queued before dropping
`use_tls`	`0`	Use TLS for exporter connection
`tls_ca_cert`	(empty)	Path to CA certificate bundle
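As a rough illustration of how the defaults in the table combine with values in the config file, the sketch below overlays a [telemetry] section on the documented defaults and runs two sanity checks. It is not rippled's parser: it assumes the section contains only key=value lines, the function names are hypothetical, and the batch_size versus max_queue_size check is an assumption borrowed from how batching exporters generally behave, not a rule confirmed by this runbook.

```python
import configparser

# Defaults copied from the Configuration Reference table.
TELEMETRY_DEFAULTS = {
    "enabled": "0",
    "endpoint": "http://localhost:4318/v1/traces",
    "exporter": "otlp_http",
    "sampling_ratio": "1.0",
    "trace_rpc": "1",
    "trace_transactions": "1",
    "trace_consensus": "1",
    "trace_peer": "0",
    "trace_ledger": "1",
    "batch_size": "512",
    "batch_delay_ms": "5000",
    "max_queue_size": "2048",
    "use_tls": "0",
    "tls_ca_cert": "",
}


def load_telemetry(cfg_text):
    """Return the effective [telemetry] options: defaults overlaid with file values."""
    parser = configparser.ConfigParser(
        delimiters=("=",), strict=False, interpolation=None
    )
    parser.read_string(cfg_text)
    effective = dict(TELEMETRY_DEFAULTS)
    if parser.has_section("telemetry"):
        effective.update(parser.items("telemetry"))
    return effective


def check_telemetry(opts):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    ratio = float(opts["sampling_ratio"])
    if not 0.0 <= ratio <= 1.0:
        problems.append("sampling_ratio must be within 0.0-1.0")
    # Assumption: a batch larger than the queue can never be filled.
    if int(opts["batch_size"]) > int(opts["max_queue_size"]):
        problems.append("batch_size exceeds max_queue_size")
    return problems


if __name__ == "__main__":
    opts = load_telemetry("[telemetry]\nenabled=1\nsampling_ratio=0.1\n")
    print(opts["endpoint"], check_telemetry(opts))
```

Options left out of the file keep their table defaults, so a two-line [telemetry] section such as the one in the Quick Start is enough to enable tracing.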

Span Reference

All spans instrumented in rippled, grouped by subsystem:

RPC Spans (Phase 2)

Span Name	Source File	Attributes	Description
`rpc.request`	ServerHandler.cpp:271	—	Top-level HTTP RPC request
`rpc.process`	ServerHandler.cpp:573	—	RPC processing (child of rpc.request)
`rpc.ws_message`	ServerHandler.cpp:384	—	WebSocket RPC message
`rpc.command.<name>`	RPCHandler.cpp:161	`xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role`, `xrpl.rpc.status`, `xrpl.rpc.duration_ms`, `xrpl.rpc.error_message`	Per-command span (e.g., `rpc.command.server_info`)

Transaction Spans (Phase 3)

Span Name	Source File	Attributes	Description
`tx.process`	NetworkOPs.cpp:1227	`xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path`	Transaction submission and processing
`tx.receive`	PeerImp.cpp:1273	`xrpl.peer.id`, `xrpl.tx.hash`, `xrpl.tx.suppressed`, `xrpl.tx.status`	Transaction received from peer relay
`tx.apply`	BuildLedger.cpp:88	`xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed`	Transaction set applied per ledger

Consensus Spans (Phase 4)

Span Name	Source File	Attributes	Description
`consensus.proposal.send`	RCLConsensus.cpp:177	`xrpl.consensus.round`	Consensus proposal broadcast
`consensus.ledger_close`	RCLConsensus.cpp:282	`xrpl.consensus.ledger.seq`, `xrpl.consensus.mode`	Ledger close event
`consensus.accept`	RCLConsensus.cpp:395	`xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms`	Ledger accepted by consensus
`consensus.validation.send`	RCLConsensus.cpp:753	`xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing`	Validation sent after accept
`consensus.accept.apply`	RCLConsensus.cpp:453	`xrpl.consensus.close_time`, `xrpl.consensus.close_time_correct`, `xrpl.consensus.close_resolution_ms`, `xrpl.consensus.state`, `xrpl.consensus.proposing`, `xrpl.consensus.round_time_ms`, `xrpl.consensus.ledger.seq`	Ledger application with close time details

Close Time Queries (Tempo TraceQL)

# Find rounds where validators disagreed on close time
{ name = "consensus.accept.apply" && span.xrpl.consensus.close_time_correct = false }

# Find consensus failures (moved_on)
{ name = "consensus.accept.apply" && span.xrpl.consensus.state = "moved_on" }

# Find slow ledger applications (>5s)
{ name = "consensus.accept.apply" && duration > 5s }

# Find specific ledger's consensus details
{ name = "consensus.accept.apply" && span.xrpl.consensus.ledger.seq = 92345678 }

Ledger Spans (Phase 5)

Span Name	Source File	Attributes	Description
`ledger.build`	BuildLedger.cpp:31	`xrpl.ledger.seq`, `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed`	Ledger build during consensus
`ledger.validate`	LedgerMaster.cpp:915	`xrpl.ledger.seq`, `xrpl.ledger.validations`	Ledger promoted to validated
`ledger.store`	LedgerMaster.cpp:409	`xrpl.ledger.seq`	Ledger stored in history

Peer Spans (Phase 5)

Span Name	Source File	Attributes	Description
`peer.proposal.receive`	PeerImp.cpp:1667	`xrpl.peer.id`, `xrpl.peer.proposal.trusted`	Proposal received from peer
`peer.validation.receive`	PeerImp.cpp:2264	`xrpl.peer.id`, `xrpl.peer.validation.trusted`	Validation received from peer

Prometheus Metrics (Spanmetrics)

The OTel Collector's spanmetrics connector automatically derives RED (Rate, Errors, Duration) metrics from every span. No custom metrics code is needed in rippled.

Generated Metric Names

Prometheus Metric	Type	Description
`traces_span_metrics_calls_total`	Counter	Total span invocations
`traces_span_metrics_duration_milliseconds_bucket`	Histogram	Latency distribution buckets
`traces_span_metrics_duration_milliseconds_count`	Histogram	Latency observation count
`traces_span_metrics_duration_milliseconds_sum`	Histogram	Cumulative latency

Metric Labels

Every metric carries these standard labels:

Label	Source	Example
`span_name`	Span name	`rpc.command.server_info`
`status_code`	Span status	`STATUS_CODE_UNSET`, `STATUS_CODE_ERROR`
`service_name`	Resource attribute	`rippled`
`span_kind`	Span kind	`SPAN_KIND_INTERNAL`

Additionally, span attributes configured as dimensions in the collector become metric labels (dots → underscores):

Span Attribute	Metric Label	Applies To
`xrpl.rpc.command`	`xrpl_rpc_command`	`rpc.command.*` spans
`xrpl.rpc.status`	`xrpl_rpc_status`	`rpc.command.*` spans
`xrpl.consensus.mode`	`xrpl_consensus_mode`	`consensus.ledger_close` spans
`xrpl.tx.local`	`xrpl_tx_local`	`tx.process` spans
`xrpl.peer.proposal.trusted`	`xrpl_peer_proposal_trusted`	`peer.proposal.receive` spans
`xrpl.peer.validation.trusted`	`xrpl_peer_validation_trusted`	`peer.validation.receive` spans
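The dots-to-underscores rule can be sketched in a few lines. The helper name is hypothetical, and the collector performs this renaming itself; the snippet is only a convenience for working out which label to use in a PromQL query.

```python
def attribute_to_label(attribute):
    """Map a span attribute name to the Prometheus label it becomes."""
    return attribute.replace(".", "_")


if __name__ == "__main__":
    for attr in ("xrpl.rpc.command", "xrpl.peer.validation.trusted"):
        print(attr, "->", attribute_to_label(attr))
```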

Histogram Buckets

Configured in otel-collector-config.yaml:

1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s
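These boundaries limit how precisely a percentile can be reported. The sketch below approximates how a quantile is estimated from cumulative bucket counts by linear interpolation within a bucket, in the spirit of PromQL's histogram_quantile for classic histograms. The function name and the sample counts are made up for illustration.

```python
import math

# Upper bounds in milliseconds, as listed above; +Inf is appended below.
BUCKET_BOUNDS_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000]


def estimate_quantile(q, cumulative_counts):
    """Estimate the q-quantile (0..1) from cumulative counts per bucket.

    cumulative_counts needs one entry per bound plus a final +Inf entry.
    """
    bounds = BUCKET_BOUNDS_MS + [math.inf]
    if len(cumulative_counts) != len(bounds):
        raise ValueError("expected one count per bucket, including +Inf")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within 0..1")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if idx == len(bounds) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return float(bounds[-2])
    lower = 0.0 if idx == 0 else float(bounds[idx - 1])
    below = 0 if idx == 0 else cumulative_counts[idx - 1]
    in_bucket = cumulative_counts[idx] - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (bounds[idx] - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    counts = [0, 10, 40, 80, 95, 99, 100, 100, 100, 100, 100]  # made-up data
    print(estimate_quantile(0.50, counts), estimate_quantile(0.95, counts))
```

Because the estimate can never exceed the highest finite bound, any span slower than 5s is reported as 5s in percentile panels; the trace itself still records the true duration.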

System Metrics (beast::insight via OTel native)

rippled has a built-in metrics framework (beast::insight) that exports metrics natively via OTLP/HTTP. These complement the span-derived RED metrics by providing system-level gauges, counters, and timers that don't map to individual trace spans.

Configuration

Add to xrpld.cfg:

[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled

The OTel Collector receives these via the OTLP receiver (the same OTLP/HTTP port as traces, 4318, but under the /v1/metrics path) and exports them to Prometheus alongside the spanmetrics-derived metrics.

StatsD fallback (backward compatibility)

The legacy StatsD backend is still available:

[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled

When using StatsD, uncomment the statsd receiver in otel-collector-config.yaml and add port 8125:8125/udp to the docker-compose otel-collector service.
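To confirm that the re-enabled statsd receiver is listening, a test datagram can be sent by hand. The sketch below uses the common StatsD line format (name:value|type); the metric name is illustrative only and is not taken from rippled's actual output.

```python
import socket

# Common StatsD type suffixes: counter, gauge, timer (milliseconds).
STATSD_TYPES = {"counter": "c", "gauge": "g", "timer": "ms"}


def statsd_line(prefix, name, value, kind):
    """Format one StatsD datagram payload, e.g. 'rippled.job_count:5|g'."""
    return f"{prefix}.{name}:{value}|{STATSD_TYPES[kind]}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send a single datagram; UDP gives no delivery confirmation."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    send_statsd(statsd_line("rippled", "job_count", 5, "gauge"))
```

Since UDP is fire-and-forget, a successful send proves nothing on its own; check the collector logs or Prometheus for the test metric afterwards.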

Metric Reference

Gauges

Prometheus Metric	Source	Description
`rippled_LedgerMaster_Validated_Ledger_Age`	LedgerMaster.h:373	Age of validated ledger (seconds)
`rippled_LedgerMaster_Published_Ledger_Age`	LedgerMaster.h:374	Age of published ledger (seconds)
`rippled_State_Accounting_{Mode}_duration`	NetworkOPs.cpp:774	Time in each operating mode (Disconnected/Connected/Syncing/Tracking/Full)
`rippled_State_Accounting_{Mode}_transitions`	NetworkOPs.cpp:780	Transition count per mode
`rippled_Peer_Finder_Active_Inbound_Peers`	PeerfinderManager.cpp:214	Active inbound peer connections
`rippled_Peer_Finder_Active_Outbound_Peers`	PeerfinderManager.cpp:215	Active outbound peer connections
`rippled_Overlay_Peer_Disconnects`	OverlayImpl.h:557	Peer disconnect count
`rippled_job_count`	JobQueue.cpp:26	Current job queue depth
`rippled_{category}_Bytes_In/Out`	OverlayImpl.h:535	Overlay traffic bytes per category (57 categories)
`rippled_{category}_Messages_In/Out`	OverlayImpl.h:535	Overlay traffic messages per category

Counters

Prometheus Metric	Source	Description
`rippled_rpc_requests`	ServerHandler.cpp:108	Total RPC request count
`rippled_ledger_fetches`	InboundLedgers.cpp:44	Ledger fetch request count
`rippled_ledger_history_mismatch`	LedgerHistory.cpp:16	Ledger hash mismatch count
`rippled_warn`	Logic.h:33	Resource manager warning count
`rippled_drop`	Logic.h:34	Resource manager drop count

Histograms (from StatsD timers)

Prometheus Metric	Source	Description
`rippled_rpc_time`	ServerHandler.cpp:110	RPC response time (ms)
`rippled_rpc_size`	ServerHandler.cpp:109	RPC response size (bytes)
`rippled_ios_latency`	Application.cpp:438	I/O service loop latency (ms)
`rippled_pathfind_fast`	PathRequests.h:23	Fast pathfinding duration (ms)
`rippled_pathfind_full`	PathRequests.h:24	Full pathfinding duration (ms)

Grafana Dashboards

Eight dashboards are pre-provisioned in docker/telemetry/grafana/dashboards/:

RPC Performance (`rippled-rpc-perf`)

Panel	Type	PromQL	Labels Used
RPC Request Rate by Command	timeseries	`sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))`	`xrpl_rpc_command`
RPC Latency p95 by Command	timeseries	`histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))`	`xrpl_rpc_command`
RPC Error Rate	bargauge	Error spans / total spans × 100, grouped by `xrpl_rpc_command`	`xrpl_rpc_command`, `status_code`
RPC Latency Heatmap	heatmap	`sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])) by (le)`	`le` (bucket boundaries)
Overall RPC Throughput	timeseries	`rpc.request` + `rpc.process` rate	—
RPC Success vs Error	timeseries	by `status_code` (UNSET vs ERROR)	`status_code`
Top Commands by Volume	bargauge	`topk(10, ...)` by `xrpl_rpc_command`	`xrpl_rpc_command`
WebSocket Message Rate	stat	`rpc.ws_message` rate	—
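The RPC Error Rate panel is described above only in words. One plausible PromQL expression for it is assembled below. This is a guess based on the metric and label names in this runbook, not the query stored in the dashboard JSON, and the helper name is hypothetical.

```python
def rpc_error_rate_promql(window="5m"):
    """Build a PromQL expression: error spans / total spans * 100 per command."""
    metric = "traces_span_metrics_calls_total"
    selector = 'span_name=~"rpc.command.*"'
    errors = (
        "sum by (xrpl_rpc_command) (rate("
        + metric + "{" + selector + ', status_code="STATUS_CODE_ERROR"}'
        + "[" + window + "]))"
    )
    total = (
        "sum by (xrpl_rpc_command) (rate("
        + metric + "{" + selector + "}"
        + "[" + window + "]))"
    )
    return f"100 * {errors} / {total}"


if __name__ == "__main__":
    print(rpc_error_rate_promql())
```

With this form, a command that has produced no error spans yields no series at all rather than 0, because the division only keeps label sets present on both sides.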

Transaction Overview (`rippled-transactions`)

Panel	Type	PromQL	Labels Used
Transaction Processing Rate	timeseries	`rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m])` and `tx.receive`	`span_name`
Transaction Processing Latency	timeseries	`histogram_quantile(0.95 / 0.50, ... {span_name="tx.process"})`	—
Transaction Path Distribution	piechart	`sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))`	`xrpl_tx_local`
Transaction Receive vs Suppressed	timeseries	`rate(traces_span_metrics_calls_total{span_name="tx.receive"}[5m])`	—
TX Processing Duration Heatmap	heatmap	`tx.process` histogram buckets	`le`
TX Apply Duration per Ledger	timeseries	p95/p50 of `tx.apply`	—
Peer TX Receive Rate	timeseries	`tx.receive` rate	—
TX Apply Failed Rate	stat	`tx.apply` with `STATUS_CODE_ERROR`	`status_code`

Consensus Health (`rippled-consensus`)

Panel	Type	PromQL	Labels Used
Consensus Round Duration	timeseries	`histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept"})`	—
Consensus Proposals Sent Rate	timeseries	`rate(traces_span_metrics_calls_total{span_name="consensus.proposal.send"}[5m])`	—
Ledger Close Duration	timeseries	`histogram_quantile(0.95, ... {span_name="consensus.ledger_close"})`	—
Validation Send Rate	stat	`rate(traces_span_metrics_calls_total{span_name="consensus.validation.send"}[5m])`	—
Ledger Apply Duration	timeseries	`histogram_quantile(0.95 / 0.50, ... {span_name="consensus.accept.apply"})`	—
Close Time Agreement	timeseries	`rate(traces_span_metrics_calls_total{span_name="consensus.accept.apply"}[5m])`	—
Consensus Mode Over Time	timeseries	`consensus.ledger_close` by `xrpl_consensus_mode`	`xrpl_consensus_mode`
Accept vs Close Rate	timeseries	`consensus.accept` vs `consensus.ledger_close` rate	—
Validation vs Close Rate	timeseries	`consensus.validation.send` vs `consensus.ledger_close`	—
Accept Duration Heatmap	heatmap	`consensus.accept` histogram buckets	`le`

Ledger Operations (`rippled-ledger-ops`)

Panel	Type	PromQL	Labels Used
Ledger Build Rate	stat	`ledger.build` call rate	—
Ledger Build Duration	timeseries	p95/p50 of `ledger.build`	—
Ledger Validation Rate	stat	`ledger.validate` call rate	—
Build Duration Heatmap	heatmap	`ledger.build` histogram buckets	`le`
TX Apply Duration	timeseries	p95/p50 of `tx.apply`	—
TX Apply Rate	timeseries	`tx.apply` call rate	—
Ledger Store Rate	stat	`ledger.store` call rate	—
Build vs Close Duration	timeseries	p95 `ledger.build` vs `consensus.ledger_close`	—

Peer Network (`rippled-peer-net`)

Requires trace_peer=1 in the [telemetry] config section.

Panel	Type	PromQL	Labels Used
Proposal Receive Rate	timeseries	`peer.proposal.receive` rate	—
Validation Receive Rate	timeseries	`peer.validation.receive` rate	—
Proposals Trusted vs Untrusted	piechart	by `xrpl_peer_proposal_trusted`	`xrpl_peer_proposal_trusted`
Validations Trusted vs Untrusted	piechart	by `xrpl_peer_validation_trusted`	`xrpl_peer_validation_trusted`

Node Health — System Metrics (`rippled-system-node-health`)

Panel	Type	PromQL	Labels Used
Validated Ledger Age	stat	`rippled_LedgerMaster_Validated_Ledger_Age`	—
Published Ledger Age	stat	`rippled_LedgerMaster_Published_Ledger_Age`	—
Operating Mode Duration	timeseries	`rippled_State_Accounting_*_duration`	—
Operating Mode Transitions	timeseries	`rippled_State_Accounting_*_transitions`	—
I/O Latency	timeseries	`histogram_quantile(0.95, rippled_ios_latency_bucket)`	—
Job Queue Depth	timeseries	`rippled_job_count`	—
Ledger Fetch Rate	stat	`rate(rippled_ledger_fetches[5m])`	—
Ledger History Mismatches	stat	`rate(rippled_ledger_history_mismatch[5m])`	—

Network Traffic — System Metrics (`rippled-system-network`)

Panel	Type	PromQL	Labels Used
Active Peers	timeseries	`rippled_Peer_Finder_Active_*_Peers`	—
Peer Disconnects	timeseries	`rippled_Overlay_Peer_Disconnects`	—
Total Network Bytes	timeseries	`rippled_total_Bytes_In/Out`	—
Total Network Messages	timeseries	`rippled_total_Messages_In/Out`	—
Transaction Traffic	timeseries	`rippled_transactions_Messages_In/Out`	—
Proposal Traffic	timeseries	`rippled_proposals_Messages_In/Out`	—
Validation Traffic	timeseries	`rippled_validations_Messages_In/Out`	—
Traffic by Category	bargauge	`topk(10, rippled_*_Bytes_In)`	—

RPC & Pathfinding — System Metrics (`rippled-system-rpc`)

Panel	Type	PromQL	Labels Used
RPC Request Rate	stat	`rate(rippled_rpc_requests[5m])`	—
RPC Response Time	timeseries	`histogram_quantile(0.95, rippled_rpc_time_bucket)`	—
RPC Response Size	timeseries	`histogram_quantile(0.95, rippled_rpc_size_bucket)`	—
RPC Response Time Heatmap	heatmap	`rippled_rpc_time_bucket`	—
Pathfinding Fast Duration	timeseries	`histogram_quantile(0.95, rippled_pathfind_fast_bucket)`	—
Pathfinding Full Duration	timeseries	`histogram_quantile(0.95, rippled_pathfind_full_bucket)`	—
Resource Warnings Rate	stat	`rate(rippled_warn[5m])`	—
Resource Drops Rate	stat	`rate(rippled_drop[5m])`	—

Span → Metric → Dashboard Summary

Span Name	Prometheus Metric Filter	Grafana Dashboard
`rpc.request`	`{span_name="rpc.request"}`	RPC Performance (Overall Throughput)
`rpc.process`	`{span_name="rpc.process"}`	RPC Performance (Overall Throughput)
`rpc.ws_message`	`{span_name="rpc.ws_message"}`	RPC Performance (WebSocket Rate)
`rpc.command.*`	`{span_name=~"rpc.command.*"}`	RPC Performance (Rate, Latency, Error, Top)
`tx.process`	`{span_name="tx.process"}`	Transaction Overview (Rate, Latency, Heatmap)
`tx.receive`	`{span_name="tx.receive"}`	Transaction Overview (Rate, Receive)
`tx.apply`	`{span_name="tx.apply"}`	Transaction Overview + Ledger Ops (Apply)
`consensus.accept`	`{span_name="consensus.accept"}`	Consensus Health (Duration, Rate, Heatmap)
`consensus.proposal.send`	`{span_name="consensus.proposal.send"}`	Consensus Health (Proposals Rate)
`consensus.ledger_close`	`{span_name="consensus.ledger_close"}`	Consensus Health (Close, Mode)
`consensus.validation.send`	`{span_name="consensus.validation.send"}`	Consensus Health (Validation Rate)
`consensus.accept.apply`	`{span_name="consensus.accept.apply"}`	Consensus Health (Apply Duration, Close Time)
`ledger.build`	`{span_name="ledger.build"}`	Ledger Ops (Build Rate, Duration, Heatmap)
`ledger.validate`	`{span_name="ledger.validate"}`	Ledger Ops (Validation Rate)
`ledger.store`	`{span_name="ledger.store"}`	Ledger Ops (Store Rate)
`peer.proposal.receive`	`{span_name="peer.proposal.receive"}`	Peer Network (Rate, Trusted/Untrusted)
`peer.validation.receive`	`{span_name="peer.validation.receive"}`	Peer Network (Rate, Trusted/Untrusted)

Troubleshooting

No traces appearing in Jaeger

Check rippled logs for Telemetry starting message
Verify enabled=1 in the [telemetry] config section
Test collector connectivity: curl -v http://localhost:4318/v1/traces (any HTTP response, even an error status, confirms the port is reachable)
Check collector logs: docker compose logs otel-collector

No system metrics in Prometheus

Check rippled logs for OTelCollector starting message
Verify server=otel in the [insight] config section
Verify the endpoint in [insight] points to the OTLP/HTTP port (default: http://localhost:4318/v1/metrics)
Check that the otlp receiver is in the metrics pipeline receivers in otel-collector-config.yaml
Query Prometheus directly: curl 'http://localhost:9090/api/v1/query?query=rippled_job_count'
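The same check can be scripted. The sketch below builds the instant-query URL used in the curl command and pulls values out of the JSON that the Prometheus HTTP API returns for an instant vector. The function names are hypothetical, and the base URL assumes the docker-compose stack from the Quick Start.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: Quick Start stack


def build_query_url(expr, base=PROMETHEUS_URL):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def extract_samples(payload):
    """Return (labels, value) pairs from an instant-vector API response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return [
        (sample["metric"], float(sample["value"][1]))
        for sample in payload["data"]["result"]
    ]


def query(expr):
    """Run the query against a live Prometheus (requires network access)."""
    with urllib.request.urlopen(build_query_url(expr), timeout=5) as resp:
        return extract_samples(json.load(resp))


if __name__ == "__main__":
    print(query("rippled_job_count"))
```

An empty result list means Prometheus is reachable but has no such series, which points back at the [insight] config or the collector's metrics pipeline rather than at Prometheus itself.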

High memory usage

Reduce sampling_ratio (e.g., 0.1 for 10% sampling)
Reduce max_queue_size and batch_size
Disable high-volume trace categories: trace_peer=0

Collector connection failures

Verify endpoint URL matches collector address
Check firewall rules for ports 4317/4318
If using TLS, verify certificate path with tls_ca_cert

Performance Tuning

Scenario	Recommendation
Production mainnet	`sampling_ratio=0.01`, `trace_peer=0`
Testnet/devnet	`sampling_ratio=1.0` (full tracing)
Debugging specific issue	`sampling_ratio=1.0` temporarily
High-throughput node	Increase `batch_size=1024`, `max_queue_size=4096`
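A back-of-the-envelope calculation helps when choosing these values. The sketch below estimates how many spans per second survive head sampling and how long the queue could absorb them if exports stalled. The span rate of 2000/s is a made-up figure, and the model ignores batching details.

```python
def sampled_rate(spans_per_second, sampling_ratio):
    """Spans per second that remain after head-based sampling."""
    if not 0.0 <= sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be within 0.0-1.0")
    return spans_per_second * sampling_ratio


def seconds_until_queue_full(rate, max_queue_size):
    """Time for an empty queue to fill if nothing is exported meanwhile."""
    if rate <= 0:
        return float("inf")
    return max_queue_size / rate


if __name__ == "__main__":
    generated = 2000.0  # hypothetical spans/s produced by a busy node
    for ratio, queue in ((1.0, 2048), (0.01, 2048), (1.0, 4096)):
        rate = sampled_rate(generated, ratio)
        print(ratio, queue, rate, seconds_until_queue_full(rate, queue))
```

Under these assumptions, full sampling fills the default queue in about a second of exporter stall, while sampling_ratio=0.01 stretches that to over a minute and a half.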

Disabling Telemetry

Set enabled=0 in the [telemetry] config section (runtime disable), or compile telemetry out at build time:

cmake --preset default -Dtelemetry=OFF

When telemetry is compiled out, all trace macros expand to no-ops with zero overhead.
