rippled/OpenTelemetryPlan/09-data-collection-reference.md
Pratik Mankawde 4db67bc191 Phase 10: Synthetic workload generation and telemetry validation tools
Add comprehensive workload harness for end-to-end validation of the
Phases 1-9 telemetry stack:

Task 10.1 — Multi-node test harness:
  - docker-compose.workload.yaml with full OTel stack (Collector, Jaeger,
    Tempo, Prometheus, Loki, Grafana)
  - generate-validator-keys.sh for automated key generation
  - xrpld-validator.cfg.template for node configuration

Task 10.2 — RPC load generator:
  - rpc_load_generator.py with WebSocket client, configurable rates,
    realistic command distribution (40% health, 30% wallet, 15% explorer,
    10% tx lookups, 5% DEX), W3C traceparent injection

Task 10.3 — Transaction submitter:
  - tx_submitter.py with 10 transaction types (Payment, OfferCreate,
    OfferCancel, TrustSet, NFTokenMint, NFTokenCreateOffer, EscrowCreate,
    EscrowFinish, AMMCreate, AMMDeposit), auto-funded test accounts

Task 10.4 — Telemetry validation suite:
  - validate_telemetry.py checking spans (Jaeger), metrics (Prometheus),
    log-trace correlation (Loki), dashboards (Grafana)
  - expected_spans.json (17 span types, 22 attributes, 3 hierarchies)
  - expected_metrics.json (SpanMetrics, StatsD, Phase 9, dashboards)

Task 10.5 — Performance benchmark suite:
  - benchmark.sh for baseline vs telemetry comparison
  - collect_system_metrics.sh for CPU/memory/latency sampling
  - Thresholds: <3% CPU, <5MB memory, <2ms RPC p99, <5% TPS, <1% consensus

Task 10.6 — CI integration:
  - telemetry-validation.yml GitHub Actions workflow
  - run-full-validation.sh orchestrator script
  - Manual trigger + telemetry branch auto-trigger

Task 10.7 — Documentation:
  - workload/README.md with quick start and tool reference
  - Updated telemetry-runbook.md with validation and benchmark sections
  - Updated 09-data-collection-reference.md with validation inventory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:57 +00:00


Observability Data Collection Reference

Audience: Developers and operators. This is the single source of truth for all telemetry data collected by rippled's observability stack.

Related docs: docs/telemetry-runbook.md (operator runbook with alerting and troubleshooting) | 03-implementation-strategy.md (code structure and performance optimization) | 04-code-samples.md (C++ instrumentation examples)

Data Flow Overview

graph LR
    subgraph rippledNode["rippled Node"]
        A["Trace Macros<br/>XRPL_TRACE_SPAN<br/>(OTLP/HTTP exporter)"]
        B["beast::insight<br/>OTel native metrics<br/>(OTLP/HTTP exporter)"]
        C["MetricsRegistry<br/>OTel SDK metrics<br/>(OTLP/HTTP exporter)"]
    end

    subgraph collector["OTel Collector  :4317 / :4318"]
        direction TB
        R1["OTLP Receiver<br/>:4317 gRPC  |  :4318 HTTP<br/>(traces + metrics)"]
        BP["Batch Processor<br/>timeout 1s, batch 100"]
        SM["SpanMetrics Connector<br/>derives RED metrics<br/>from trace spans"]

        R1 --> BP
        BP --> SM
    end

    subgraph backends["Trace Backends  (choose one or both)"]
        D["Jaeger  :16686<br/>Trace search &<br/>visualization"]
        T["Grafana Tempo<br/>(preferred for production)<br/>S3/GCS long-term storage"]
    end

    subgraph metrics["Metrics Stack"]
        E["Prometheus  :9090<br/>scrapes :8889<br/>span-derived + system metrics"]
    end

    subgraph viz["Visualization"]
        F["Grafana  :3000<br/>13 dashboards"]
    end

    A -->|"OTLP/HTTP :4318<br/>(traces + attributes)"| R1
    B -->|"OTLP/HTTP :4318<br/>(gauges, counters, histograms)"| R1
    C -->|"OTLP/HTTP :4318<br/>(counters, histograms,<br/>observable gauges)"| R1

    BP -->|"OTLP/gRPC :4317"| D
    BP -->|"OTLP/gRPC"| T

    SM -->|"span_calls_total<br/>span_duration_ms<br/>(6 dimension labels)"| E
    R1 -->|"rippled_* gauges<br/>rippled_* counters<br/>rippled_* histograms"| E

    E -->|"Prometheus<br/>data source"| F
    D -->|"Jaeger<br/>data source"| F
    T -->|"Tempo<br/>data source"| F

    style A fill:#4a90d9,color:#fff,stroke:#2a6db5
    style B fill:#4a90d9,color:#fff,stroke:#2a6db5
    style R1 fill:#5cb85c,color:#fff,stroke:#3d8b3d
    style BP fill:#449d44,color:#fff,stroke:#2d6e2d
    style SM fill:#449d44,color:#fff,stroke:#2d6e2d
    style D fill:#f0ad4e,color:#000,stroke:#c78c2e
    style T fill:#e8953a,color:#000,stroke:#b5732a
    style E fill:#f0ad4e,color:#000,stroke:#c78c2e
    style F fill:#5bc0de,color:#000,stroke:#3aa8c1
    style rippledNode fill:#1a2633,color:#ccc,stroke:#4a90d9
    style collector fill:#1a3320,color:#ccc,stroke:#5cb85c
    style backends fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style metrics fill:#332a1a,color:#ccc,stroke:#f0ad4e
    style viz fill:#1a2d33,color:#ccc,stroke:#5bc0de

There are two independent telemetry pipelines entering a single OTel Collector via the same OTLP receiver:

  1. OpenTelemetry Traces — Distributed spans with attributes, exported via OTLP/HTTP (:4318) to the collector's OTLP Receiver. The Batch Processor groups spans (1s timeout, batch size 100) before forwarding to trace backends. The SpanMetrics Connector derives RED metrics (rate, errors, duration) from every span and feeds them into the metrics pipeline.
  2. beast::insight OTel Metrics — System-level gauges, counters, and histograms exported natively via OTLP/HTTP (:4318) to the same OTLP Receiver. These are batched and exported to Prometheus alongside span-derived metrics. The StatsD UDP transport has been replaced by native OTLP; server=statsd remains available as a fallback.

Trace backends — The collector exports traces via OTLP/gRPC to one or both:

  • Jaeger (development) — Provides trace search UI at :16686. Easy single-binary setup.
  • Grafana Tempo (production) — Preferred for production. Supports S3/GCS object storage for cost-effective long-term trace retention and integrates natively with Grafana.

Further reading: 00-tracing-fundamentals.md for core OpenTelemetry concepts (traces, spans, context propagation, sampling). 07-observability-backends.md for production backend selection, collector placement, and sampling strategies.


1. OpenTelemetry Spans

1.1 Complete Span Inventory (17 spans)

See also: 02-design-decisions.md §2.3 for naming conventions and the full span catalog with rationale. 04-code-samples.md §4.6 for span flow diagrams.

RPC Spans

Controlled by trace_rpc=1 in [telemetry] config.

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| rpc.request | (none) | ServerHandler.cpp | Top-level HTTP RPC request entry point |
| rpc.process | rpc.request | ServerHandler.cpp | RPC processing pipeline |
| rpc.ws_message | (none) | ServerHandler.cpp | WebSocket message handling |
| rpc.command.<name> | rpc.process | RPCHandler.cpp | Per-command span (e.g., rpc.command.server_info, rpc.command.ledger) |

Where to find: Jaeger → Service: rippled → Operation: rpc.request or rpc.command.*

Grafana dashboard: RPC Performance (rippled-rpc-perf)

Transaction Spans

Controlled by trace_transactions=1 in [telemetry] config.

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| tx.process | (none) | NetworkOPs.cpp | Transaction submission entry point (local or peer-relayed) |
| tx.receive | (none) | PeerImp.cpp | Raw transaction received from peer overlay (before deduplication) |
| tx.apply | ledger.build | BuildLedger.cpp | Transaction set applied to new ledger during consensus |

Where to find: Jaeger → Operation: tx.process or tx.receive

Grafana dashboard: Transaction Overview (rippled-transactions)

Consensus Spans

Controlled by trace_consensus=1 in [telemetry] config.

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| consensus.proposal.send | (none) | RCLConsensus.cpp | Node broadcasts its transaction set proposal |
| consensus.ledger_close | (none) | RCLConsensus.cpp | Ledger close event triggered by consensus |
| consensus.accept | (none) | RCLConsensus.cpp | Consensus accepts a ledger (round complete) |
| consensus.validation.send | (none) | RCLConsensus.cpp | Validation message sent after ledger accepted |
| consensus.accept.apply | (none) | RCLConsensus.cpp | Ledger application with close time details |

Where to find: Jaeger → Operation: consensus.*

Grafana dashboard: Consensus Health (rippled-consensus)

Ledger Spans

Controlled by trace_ledger=1 in [telemetry] config.

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| ledger.build | (none) | BuildLedger.cpp | Build new ledger from accepted transaction set |
| ledger.validate | (none) | LedgerMaster.cpp | Ledger promoted to validated status |
| ledger.store | (none) | LedgerMaster.cpp | Ledger stored to database/history |

Where to find: Jaeger → Operation: ledger.*

Grafana dashboard: Ledger Operations (rippled-ledger-ops)

Peer Spans

Controlled by trace_peer=1 in [telemetry] config. Disabled by default (high volume).

| Span Name | Parent | Source File | Description |
|---|---|---|---|
| peer.proposal.receive | (none) | PeerImp.cpp | Consensus proposal received from peer |
| peer.validation.receive | (none) | PeerImp.cpp | Validation message received from peer |

Where to find: Jaeger → Operation: peer.*

Grafana dashboard: Peer Network (rippled-peer-net)


1.2 Complete Attribute Inventory (28 attributes)

See also: 02-design-decisions.md §2.4.2 for attribute design rationale and privacy considerations.

Every span can carry key-value attributes that provide context for filtering and aggregation.

RPC Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.rpc.command | string | rpc.command.* | RPC command name (e.g., server_info, ledger) |
| xrpl.rpc.version | int64 | rpc.command.* | API version number |
| xrpl.rpc.role | string | rpc.command.* | Caller role: "admin" or "user" |
| xrpl.rpc.status | string | rpc.command.* | Result: "success" or "error" |
| xrpl.rpc.duration_ms | int64 | rpc.command.* | Command execution time in milliseconds |
| xrpl.rpc.error_message | string | rpc.command.* | Error details (only set on failure) |

Jaeger query: Tag xrpl.rpc.command=server_info to find all server_info calls.

Prometheus label: xrpl_rpc_command (dots converted to underscores by SpanMetrics).
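The dots-to-underscores conversion follows the Prometheus label-name rules. A minimal sketch of that rule (illustrative only, not the collector's actual implementation):

```python
import re

def to_prometheus_label(attr_name: str) -> str:
    """Map an OTel attribute name to a Prometheus-safe label name.

    Prometheus label names must match [a-zA-Z_][a-zA-Z0-9_]*, so dots
    (and any other disallowed character) become underscores.
    """
    label = re.sub(r"[^a-zA-Z0-9_]", "_", attr_name)
    if not re.match(r"[a-zA-Z_]", label):
        label = "_" + label  # label must not start with a digit
    return label
```

For example, `xrpl.rpc.command` becomes `xrpl_rpc_command`, matching the labels shown throughout this document.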

Transaction Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.tx.hash | string | tx.process, tx.receive | Transaction hash (hex-encoded) |
| xrpl.tx.local | boolean | tx.process | true if locally submitted, false if peer-relayed |
| xrpl.tx.path | string | tx.process | Submission path: "sync" or "async" |
| xrpl.tx.suppressed | boolean | tx.receive | true if transaction was suppressed (duplicate) |
| xrpl.tx.status | string | tx.receive | Transaction status (e.g., "known_bad") |

Jaeger query: Tag xrpl.tx.hash=<hash> to trace a specific transaction across nodes.

Prometheus label: xrpl_tx_local (used as SpanMetrics dimension).

Consensus Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.consensus.round | int64 | consensus.proposal.send | Consensus round number |
| xrpl.consensus.mode | string | consensus.proposal.send, consensus.ledger_close | Node mode: "syncing", "tracking", "full", "proposing" |
| xrpl.consensus.proposers | int64 | consensus.proposal.send, consensus.accept | Number of proposers in the round |
| xrpl.consensus.proposing | boolean | consensus.validation.send | Whether this node was a proposer |
| xrpl.consensus.ledger.seq | int64 | consensus.ledger_close, consensus.accept, consensus.validation.send, consensus.accept.apply | Ledger sequence number |
| xrpl.consensus.close_time | int64 | consensus.accept.apply | Agreed-upon ledger close time (epoch seconds) |
| xrpl.consensus.close_time_correct | boolean | consensus.accept.apply | Whether validators reached agreement on close time |
| xrpl.consensus.close_resolution_ms | int64 | consensus.accept.apply | Close time rounding granularity in milliseconds |
| xrpl.consensus.state | string | consensus.accept.apply | Consensus outcome: "finished" or "moved_on" |
| xrpl.consensus.round_time_ms | int64 | consensus.accept.apply | Total consensus round duration in milliseconds |

Jaeger query: Tag xrpl.consensus.mode=proposing to find rounds where node was proposing.

Prometheus label: xrpl_consensus_mode (used as SpanMetrics dimension).

Ledger Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.ledger.seq | int64 | ledger.build, ledger.validate, ledger.store, tx.apply | Ledger sequence number |
| xrpl.ledger.validations | int64 | ledger.validate | Number of validations received for this ledger |
| xrpl.ledger.tx_count | int64 | ledger.build, tx.apply | Transactions in the ledger |
| xrpl.ledger.tx_failed | int64 | ledger.build, tx.apply | Failed transactions in the ledger |

Jaeger query: Tag xrpl.ledger.seq=12345 to find all spans for a specific ledger.

Peer Attributes

| Attribute | Type | Set On | Description |
|---|---|---|---|
| xrpl.peer.id | int64 | tx.receive, peer.proposal.receive, peer.validation.receive | Peer identifier |
| xrpl.peer.proposal.trusted | boolean | peer.proposal.receive | Whether the proposal came from a trusted validator |
| xrpl.peer.validation.trusted | boolean | peer.validation.receive | Whether the validation came from a trusted validator |

Prometheus labels: xrpl_peer_proposal_trusted, xrpl_peer_validation_trusted (SpanMetrics dimensions).


1.3 SpanMetrics — Derived Prometheus Metrics

See also: 01-architecture-analysis.md §1.8.2 for how span-derived metrics map to operational insights.

The OTel Collector's SpanMetrics connector automatically generates RED (Rate, Errors, Duration) metrics from every span. No custom metrics code in rippled is needed.

| Prometheus Metric | Type | Description |
|---|---|---|
| traces_span_metrics_calls_total | Counter | Total span invocations |
| traces_span_metrics_duration_milliseconds_bucket | Histogram | Latency distribution (buckets: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000 ms) |
| traces_span_metrics_duration_milliseconds_count | Histogram | Observation count |
| traces_span_metrics_duration_milliseconds_sum | Histogram | Cumulative latency |

Standard labels on every metric: span_name, status_code, service_name, span_kind

Additional dimension labels (configured in otel-collector-config.yaml):

| Span Attribute | Prometheus Label | Applies To |
|---|---|---|
| xrpl.rpc.command | xrpl_rpc_command | rpc.command.* |
| xrpl.rpc.status | xrpl_rpc_status | rpc.command.* |
| xrpl.consensus.mode | xrpl_consensus_mode | consensus.ledger_close |
| xrpl.tx.local | xrpl_tx_local | tx.process |
| xrpl.peer.proposal.trusted | xrpl_peer_proposal_trusted | peer.proposal.receive |
| xrpl.peer.validation.trusted | xrpl_peer_validation_trusted | peer.validation.receive |
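For orientation, the dimensions above would appear in the collector config roughly as follows. This is a hedged sketch, not the project's actual otel-collector-config.yaml; exact keys vary by collector version, so verify against the spanmetrics connector documentation.

```yaml
# Hypothetical fragment of otel-collector-config.yaml (illustrative only).
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 5000ms]
    dimensions:
      - name: xrpl.rpc.command
      - name: xrpl.rpc.status
      - name: xrpl.consensus.mode
      - name: xrpl.tx.local
      - name: xrpl.peer.proposal.trusted
      - name: xrpl.peer.validation.trusted
```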

Where to query: Prometheus → traces_span_metrics_calls_total{span_name="rpc.command.server_info"}


2. System Metrics (beast::insight — OTel native)

See also: 02-design-decisions.md for the beast::insight coexistence design. 06-implementation-phases.md for the Phase 6/7 metric inventory.

Migration complete: Phase 7 replaced the StatsD UDP transport with native OTel Metrics SDK export via OTLP/HTTP. The beast::insight::Collector interface and all metric names are preserved — only the wire protocol changed. [insight] server=statsd remains as a fallback.

These are system-level metrics emitted by rippled's beast::insight framework via OTel OTLP/HTTP. They cover operational data that doesn't map to individual trace spans.

Configuration

# Recommended: native OTel metrics via OTLP/HTTP
[insight]
server=otel
endpoint=http://localhost:4318/v1/metrics
prefix=rippled

Fallback (StatsD):

[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled

2.1 Gauges

| Prometheus Metric | Source File | Description | Typical Range |
|---|---|---|---|
| rippled_LedgerMaster_Validated_Ledger_Age | LedgerMaster.h | Seconds since last validated ledger | 0–10 (healthy), >30 (stale) |
| rippled_LedgerMaster_Published_Ledger_Age | LedgerMaster.h | Seconds since last published ledger | 0–10 (healthy) |
| rippled_State_Accounting_Disconnected_duration | NetworkOPs.cpp | Cumulative seconds in Disconnected state | Monotonic |
| rippled_State_Accounting_Connected_duration | NetworkOPs.cpp | Cumulative seconds in Connected state | Monotonic |
| rippled_State_Accounting_Syncing_duration | NetworkOPs.cpp | Cumulative seconds in Syncing state | Monotonic |
| rippled_State_Accounting_Tracking_duration | NetworkOPs.cpp | Cumulative seconds in Tracking state | Monotonic |
| rippled_State_Accounting_Full_duration | NetworkOPs.cpp | Cumulative seconds in Full state | Monotonic (should dominate) |
| rippled_State_Accounting_Disconnected_transitions | NetworkOPs.cpp | Count of transitions to Disconnected | Low |
| rippled_State_Accounting_Connected_transitions | NetworkOPs.cpp | Count of transitions to Connected | Low |
| rippled_State_Accounting_Syncing_transitions | NetworkOPs.cpp | Count of transitions to Syncing | Low |
| rippled_State_Accounting_Tracking_transitions | NetworkOPs.cpp | Count of transitions to Tracking | Low |
| rippled_State_Accounting_Full_transitions | NetworkOPs.cpp | Count of transitions to Full | Low (should be 1 after startup) |
| rippled_Peer_Finder_Active_Inbound_Peers | PeerfinderManager.cpp | Active inbound peer connections | 0–85 |
| rippled_Peer_Finder_Active_Outbound_Peers | PeerfinderManager.cpp | Active outbound peer connections | 10–21 |
| rippled_Overlay_Peer_Disconnects | OverlayImpl.cpp | Cumulative peer disconnection count | Low growth |
| rippled_job_count | JobQueue.cpp | Current job queue depth | 0–100 (healthy) |

Grafana dashboard: Node Health (System Metrics) (rippled-system-node-health)

2.2 Counters

| Prometheus Metric | Source File | Description |
|---|---|---|
| rippled_rpc_requests | ServerHandler.cpp | Total RPC requests received |
| rippled_ledger_fetches | InboundLedgers.cpp | Inbound ledger fetch attempts |
| rippled_ledger_history_mismatch | LedgerHistory.cpp | Ledger hash mismatches detected |
| rippled_warn | Logic.h | Resource manager warnings issued |
| rippled_drop | Logic.h | Resource manager drops (connections rejected) |

Note: With server=otel, rippled_warn and rippled_drop are properly exported as OTel Counter instruments. The previous StatsD |m type limitation no longer applies.

Grafana dashboard: RPC & Pathfinding (System Metrics) (rippled-system-rpc)

2.3 Histograms (Event timers)

| Prometheus Metric | Source File | Unit | Description |
|---|---|---|---|
| rippled_rpc_time | ServerHandler.cpp | ms | RPC response time distribution |
| rippled_rpc_size | ServerHandler.cpp | bytes | RPC response size distribution |
| rippled_ios_latency | Application.cpp | ms | I/O service loop latency |
| rippled_pathfind_fast | PathRequests.h | ms | Fast pathfinding duration |
| rippled_pathfind_full | PathRequests.h | ms | Full pathfinding duration |

Quantiles collected: 0th, 50th, 90th, 95th, 99th, and 100th percentiles.

Grafana dashboards: Node Health (ios_latency), RPC & Pathfinding (rpc_time, rpc_size, pathfind_*)

2.4 Overlay Traffic Metrics

For each of the 45+ overlay traffic categories (defined in TrafficCount.h), four gauges are emitted:

  • rippled_{category}_Bytes_In
  • rippled_{category}_Bytes_Out
  • rippled_{category}_Messages_In
  • rippled_{category}_Messages_Out

Key categories:

| Category | Description |
|---|---|
| total | All traffic aggregated |
| overhead / overhead_overlay | Protocol overhead |
| transactions / transactions_duplicate | Transaction relay |
| proposals / proposals_untrusted / proposals_duplicate | Consensus proposals |
| validations / validations_untrusted / validations_duplicate | Consensus validations |
| ledger_data_get / ledger_data_share | Ledger data exchange |
| ledger_data_Transaction_Node_get/share | Transaction node data |
| ledger_data_Account_State_Node_get/share | Account state node data |
| ledger_data_Transaction_Set_candidate_get/share | Transaction set candidates |
| getObject / haveTxSet / ledgerData | Object requests |
| ping / status | Keepalive and status |
| set_get | Set requests |
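Each category expands to its four gauge names mechanically, following the pattern listed above. A small sketch of that expansion:

```python
def overlay_metric_names(category: str, prefix: str = "rippled") -> list[str]:
    """Expand one overlay traffic category into its four gauge names,
    per the rippled_{category}_{direction} pattern."""
    return [
        f"{prefix}_{category}_{direction}"
        for direction in ("Bytes_In", "Bytes_Out", "Messages_In", "Messages_Out")
    ]
```

For instance, the `total` category yields rippled_total_Bytes_In, rippled_total_Bytes_Out, rippled_total_Messages_In, and rippled_total_Messages_Out.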

Grafana dashboards: Network Traffic (rippled-system-network), Overlay Traffic Detail (rippled-system-overlay-detail), Ledger Data & Sync (rippled-system-ledger-sync)


3. Grafana Dashboard Reference

See also: 05-configuration-reference.md §5.8 for Grafana data source provisioning (Tempo, Jaeger, Prometheus) and TraceQL query examples.

3.1 Span-Derived Dashboards (5)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| RPC Performance | rippled-rpc-perf | Prometheus (SpanMetrics) | Request rate by command, p95 latency by command, error rate, heatmap, top commands |
| Transaction Overview | rippled-transactions | Prometheus (SpanMetrics) | Processing rate, latency p95/p50, local vs relay split, apply duration, heatmap |
| Consensus Health | rippled-consensus | Prometheus (SpanMetrics) | Round duration p95/p50, proposals rate, close duration, mode timeline, heatmap |
| Ledger Operations | rippled-ledger-ops | Prometheus (SpanMetrics) | Build rate, build duration, validation rate, store rate, build vs close comparison |
| Peer Network | rippled-peer-net | Prometheus (SpanMetrics) | Proposal receive rate, validation receive rate, trusted vs untrusted breakdown |

3.2 System Metrics Dashboards (5)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Node Health | rippled-system-node-health | Prometheus (OTLP) | Ledger age, operating mode, I/O latency, job queue, fetch rate |
| Network Traffic | rippled-system-network | Prometheus (OTLP) | Active peers, disconnects, bytes in/out, messages in/out, traffic by category |
| RPC & Pathfinding | rippled-system-rpc | Prometheus (OTLP) | RPC rate, response time/size, pathfinding duration, resource warnings/drops |
| Overlay Traffic Detail | rippled-system-overlay-detail | Prometheus (OTLP) | Squelch, overhead, validator lists, set get/share, have/requested tx, proof paths |
| Ledger Data & Sync | rippled-system-ledger-sync | Prometheus (OTLP) | Ledger data exchange, legacy ledger share/get, getobject by type, traffic heatmap |

3.3 Accessing the Dashboards

  1. Open Grafana at http://localhost:3000
  2. Navigate to Dashboards → rippled folder
  3. All 10 dashboards are auto-provisioned from docker/telemetry/grafana/dashboards/

4. Jaeger Trace Search Guide

See also: 08-appendix.md §8.2 for span hierarchy visualizations. 05-configuration-reference.md §5.8.5 for TraceQL examples when using Grafana Tempo instead of Jaeger.

Finding Traces by Type

| What to Find | Jaeger Search Parameters |
|---|---|
| All RPC calls | Service: rippled, Operation: rpc.request |
| Specific RPC command | Operation: rpc.command.server_info (or any command name) |
| Slow RPC calls | Operation: rpc.command.*, Min Duration: 100ms |
| Failed RPC calls | Tag: xrpl.rpc.status=error |
| Specific transaction | Tag: xrpl.tx.hash=<hex_hash> |
| Local transactions only | Tag: xrpl.tx.local=true |
| Consensus rounds | Operation: consensus.accept |
| Rounds by mode | Tag: xrpl.consensus.mode=proposing |
| Specific ledger | Tag: xrpl.ledger.seq=12345 |
| Peer proposals (trusted) | Tag: xrpl.peer.proposal.trusted=true |

Trace Structure

A typical RPC trace shows the span hierarchy:

rpc.request (ServerHandler)
  └── rpc.process (ServerHandler)
       └── rpc.command.server_info (RPCHandler)

A consensus round produces independent spans (not parent-child):

consensus.ledger_close        (close event)
consensus.proposal.send       (broadcast proposal)
ledger.build                  (build new ledger)
  └── tx.apply                (apply transaction set)
consensus.accept              (accept result)
consensus.validation.send     (send validation)
ledger.validate               (promote to validated)
ledger.store                  (persist to DB)
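These hierarchies can be reconstructed programmatically from (span_id, parent_id) pairs, the shape most trace API responses reduce to. A minimal sketch (not the shape of any specific Jaeger/Tempo client):

```python
from collections import defaultdict

def build_span_tree(spans):
    """Render an indented span hierarchy from (span_id, parent_id, name)
    tuples; root spans have parent_id None."""
    children = defaultdict(list)
    names = {}
    for span_id, parent_id, name in spans:
        names[span_id] = name
        children[parent_id].append(span_id)

    def render(span_id, depth=0):
        lines = ["  " * depth + names[span_id]]
        for child in children.get(span_id, []):
            lines.extend(render(child, depth + 1))
        return lines

    return [line for root in children.get(None, []) for line in render(root)]
```

Applied to the RPC example above, this reproduces the rpc.request → rpc.process → rpc.command.server_info nesting.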

5. Prometheus Query Examples

See also: 05-configuration-reference.md §5.8.7 for correlating Prometheus system metrics with trace-derived metrics.

Span-Derived Metrics

# RPC request rate by command (last 5 minutes)
sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))

# RPC p95 latency by command
histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))

# Consensus round duration p95
histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name="consensus.accept"}[5m])))

# Transaction processing rate (local vs relay)
sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))

# Trusted vs untrusted proposal rate
sum by (xrpl_peer_proposal_trusted) (rate(traces_span_metrics_calls_total{span_name="peer.proposal.receive"}[5m]))
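To make the p95 queries above less opaque: histogram_quantile estimates a quantile by finding the cumulative bucket containing the target rank and interpolating linearly within it. A simplified sketch of that computation (Prometheus's actual implementation handles more edge cases):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets.

    `buckets` is a sorted list of (upper_bound, cumulative_count) pairs
    ending with (float('inf'), total), as Prometheus histograms expose.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Rank falls in the +Inf bucket: return last finite bound
                return prev_bound
            # Linear interpolation inside the bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound
```

With the SpanMetrics bucket bounds (1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000 ms), resolution is limited by bucket width, which is why wide buckets flatten p95 estimates.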

System Metrics (beast::insight)

# Validated ledger age (should be < 10s)
rippled_LedgerMaster_Validated_Ledger_Age

# Active peer count
rippled_Peer_Finder_Active_Inbound_Peers + rippled_Peer_Finder_Active_Outbound_Peers

# RPC response time p95
histogram_quantile(0.95, rippled_rpc_time_bucket)

# Total network bytes in (rate)
rate(rippled_total_Bytes_In[5m])

# Operating mode (should be "Full" after startup)
rippled_State_Accounting_Full_duration
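Any of these expressions can also be run programmatically against the Prometheus HTTP API (port :9090 per the stack diagram). A standard-library-only sketch; the endpoint path is the stable `/api/v1/query` instant-query API:

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # per the stack diagram above

def query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})

def instant_query(expr: str, base_url: str = PROMETHEUS_URL):
    """Run an instant query and return the result vector (list of samples)."""
    with urllib.request.urlopen(query_url(expr, base_url), timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# Example: flag a stale validated ledger (> 10s, per the gauge table above):
# for sample in instant_query("rippled_LedgerMaster_Validated_Ledger_Age"):
#     if float(sample["value"][1]) > 10:
#         print("stale validated ledger")
```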

5a. Log-Trace Correlation (Phase 8)

Plan details: 06-implementation-phases.md §6.8.1 (motivation, architecture, Mermaid diagrams) | Task breakdown: Phase8_taskList.md (per-task implementation details)

Phase 8 injects OTel trace context into rippled's Logs::format() output, enabling log-trace correlation. When a log line is emitted within an active OTel span, the trace and span identifiers are automatically appended after the severity field:

Log Format

<timestamp> <partition>:<severity> trace_id=<32hex> span_id=<16hex> <message>

Example:

2024-01-15T10:30:45.123Z LedgerMaster:NFO trace_id=abc123def456789012345678abcdef01 span_id=0123456789abcdef Validated ledger 42

  • trace_id=<hex32> — 32-character lowercase hex trace identifier. Links to the distributed trace in Tempo/Jaeger.
  • span_id=<hex16> — 16-character lowercase hex span identifier. Identifies the specific span within the trace.
  • Only present when the log is emitted within an active OTel span. Log lines outside of traced code paths have no trace context fields.
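This format is regular enough to parse with a single regex. The sketch below is an illustrative equivalent of the collector's regex_parser pattern, not the actual operator config (which lives in the collector configuration); the trace fields are optional so untraced lines still match:

```python
import re

# Illustrative pattern for the log format documented above.
LOG_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<partition>\w+):(?P<severity>TRC|DBG|NFO|WRN|ERR|FTL)\s+"
    r"(?:trace_id=(?P<trace_id>[0-9a-f]{32})\s+span_id=(?P<span_id>[0-9a-f]{16})\s+)?"
    r"(?P<message>.*)$"
)

def parse_log_line(line):
    """Return the structured fields of one debug.log line, or None."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None
```

Untraced lines simply yield trace_id and span_id of None, mirroring the "optional" fields in the filelog table below.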

Implementation

The trace context injection is implemented in Logs::format() (src/libxrpl/basics/Log.cpp), guarded by #ifdef XRPL_ENABLE_TELEMETRY. It reads the current span from OTel's thread-local runtime context via opentelemetry::trace::GetSpan() and opentelemetry::context::RuntimeContext::GetCurrent(). Both calls are lock-free thread-local reads measured at <10ns per call.

Log Ingestion Pipeline

rippled debug.log -> OTel Collector filelog receiver -> regex_parser -> Loki exporter -> Grafana Loki

The OTel Collector's filelog receiver tails debug.log files and uses a regex_parser operator to extract structured fields:

| Field | Type | Description |
|---|---|---|
| timestamp | datetime | Log timestamp |
| partition | string | Log partition (e.g., LedgerMaster, PeerImp) |
| severity | string | Severity code (TRC, DBG, NFO, WRN, ERR, FTL) |
| trace_id | string | 32-hex trace identifier (optional) |
| span_id | string | 16-hex span identifier (optional) |
| message | string | Log message body |

Grafana Correlation

Bidirectional linking between logs and traces is configured via Grafana datasource provisioning:

  • Tempo -> Loki (tracesToLogs): Clicking "Logs for this trace" on a Tempo trace view filters Loki logs by trace_id, showing all log lines from that trace.
  • Loki -> Tempo (derivedFields): A regex-based derived field on the Loki datasource extracts trace_id from log lines and renders it as a clickable link to the corresponding trace in Tempo.

Loki Backend

Grafana Loki (v2.9.0) serves as the log storage backend. It receives log entries from the OTel Collector's loki exporter via the push API at http://loki:3100/loki/api/v1/push.

LogQL Query Examples

# Find all logs for a specific trace
{job="rippled"} |= "trace_id=abc123def456789012345678abcdef01"

# Error logs with trace context
{job="rippled"} |= "ERR" |= "trace_id="

# Logs from a specific partition with trace context
{job="rippled"} |= "LedgerMaster" | regexp `trace_id=(?P<trace_id>[a-f0-9]+)` | trace_id != ""

# Count traced log lines over time
count_over_time({job="rippled"} |= "trace_id=" [5m])

5b. Future: Internal Metric Gap Fill (Phase 9)

Status: Planned, not yet implemented. Plan details: 06-implementation-phases.md §6.8.2 (motivation, architecture, third-party context) | Task breakdown: Phase9_taskList.md (per-task implementation details)

Phase 9 adds time-series export for 50+ metrics that already exist inside rippled but are not yet exported. It uses a hybrid approach: beast::insight extensions for NodeStore I/O, and OTel ObservableGauge async callbacks for the new categories.

New Metric Categories

NodeStore I/O (via beast::insight)

| Prometheus Metric | Type | Description |
|---|---|---|
| rippled_nodestore_reads_total | Gauge | Cumulative read operations |
| rippled_nodestore_reads_hit | Gauge | Cache-served reads |
| rippled_nodestore_writes | Gauge | Cumulative write operations |
| rippled_nodestore_written_bytes | Gauge | Cumulative bytes written |
| rippled_nodestore_read_bytes | Gauge | Cumulative bytes read |
| rippled_nodestore_read_duration_us | Gauge | Cumulative read time (microseconds) |
| rippled_nodestore_write_load | Gauge | Current write load score |
| rippled_nodestore_read_queue | Gauge | Items in read queue |

Cache Hit Rates (via OTel MetricsRegistry)

| Prometheus Metric | Type | Description |
|---|---|---|
| rippled_cache_SLE_hit_rate | Gauge | SLE cache hit rate (0.0-1.0) |
| rippled_cache_ledger_hit_rate | Gauge | Ledger object cache hit rate |
| rippled_cache_AL_hit_rate | Gauge | AcceptedLedger cache hit rate |
| rippled_cache_treenode_size | Gauge | SHAMap TreeNode cache size (entries) |
| rippled_cache_fullbelow_size | Gauge | FullBelow cache size |

Transaction Queue (via OTel MetricsRegistry)

| Prometheus Metric | Type | Description |
|---|---|---|
| rippled_txq_count | Gauge | Current transactions in queue |
| rippled_txq_max_size | Gauge | Maximum queue capacity |
| rippled_txq_in_ledger | Gauge | Transactions in open ledger |
| rippled_txq_per_ledger | Gauge | Expected transactions per ledger |
| rippled_txq_open_ledger_fee_level | Gauge | Open ledger fee escalation level |
| rippled_txq_med_fee_level | Gauge | Median fee level in queue |
| rippled_txq_reference_fee_level | Gauge | Reference fee level |
| rippled_txq_min_processing_fee_level | Gauge | Minimum fee to get processed |

PerfLog Per-RPC Method (via OTel Metrics SDK)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_rpc_method_started_total | Counter | method="<name>" | RPC calls started |
| rippled_rpc_method_finished_total | Counter | method="<name>" | RPC calls completed |
| rippled_rpc_method_errored_total | Counter | method="<name>" | RPC calls errored |
| rippled_rpc_method_duration_us_bucket | Histogram | method="<name>" | Execution time distribution |

PerfLog Per-Job Type (via OTel Metrics SDK)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_job_queued_total | Counter | job_type="<name>" | Jobs queued |
| rippled_job_started_total | Counter | job_type="<name>" | Jobs started |
| rippled_job_finished_total | Counter | job_type="<name>" | Jobs completed |
| rippled_job_queued_duration_us_bucket | Histogram | job_type="<name>" | Queue wait time |
| rippled_job_running_duration_us_bucket | Histogram | job_type="<name>" | Execution time |

Counted Object Instances (via OTel MetricsRegistry)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_object_count | Gauge | type="<name>" | Live instances of internal type |

Tracked types: Transaction, Ledger, NodeObject, STTx, STLedgerEntry, InboundLedger, Pathfinder, PathRequest, HashRouterEntry

Fee Escalation & Load Factors (via OTel MetricsRegistry)

| Prometheus Metric | Type | Description |
|---|---|---|
| rippled_load_factor | Gauge | Combined transaction cost multiplier |
| rippled_load_factor_server | Gauge | Server + cluster + network load |
| rippled_load_factor_local | Gauge | Local server load only |
| rippled_load_factor_net | Gauge | Network-wide load estimate |
| rippled_load_factor_cluster | Gauge | Cluster peer load |
| rippled_load_factor_fee_escalation | Gauge | Open ledger fee escalation |
| rippled_load_factor_fee_queue | Gauge | Queue entry fee level |

New Grafana Dashboards (Phase 9)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Fee Market & TxQ | rippled-fee-market | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown, escalation |
| Job Queue Analysis | rippled-job-queue | Prometheus | Per-job rates, queue wait times, execution times, queue depth |

5c. Synthetic Workload Generation & Telemetry Validation (Phase 10)

Plan details: 06-implementation-phases.md §6.8.3 (motivation, architecture) | Task breakdown: Phase10_taskList.md (per-task implementation details) | Tools: docker/telemetry/workload/ (RPC load generator, transaction submitter, validation suite, benchmarks)

Phase 10 builds a 5-node validator docker-compose harness with RPC load generators, transaction submitters, and automated validation scripts that verify all spans, metrics, dashboards, and log-trace correlation work end-to-end. Includes a benchmark suite comparing telemetry-ON vs telemetry-OFF overhead.
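The RPC load generator injects W3C traceparent headers so that client-originated requests join a distributed trace. Generating a valid header value is straightforward; a sketch per the W3C Trace Context format (version-trace_id-span_id-flags), not the generator's actual code:

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header value: version-trace_id-span_id-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex chars; spec forbids all-zeros
    span_id = secrets.token_hex(8)    # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"
```

The value is sent as a `traceparent` header (or the equivalent field on a WebSocket request) so rippled's server-side spans attach to the client's trace.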

Running the Validation Suite

# Full end-to-end validation (start cluster, generate load, validate):
docker/telemetry/workload/run-full-validation.sh --xrpld .build/xrpld

# Validation only (assumes stack and cluster are already running):
python3 docker/telemetry/workload/validate_telemetry.py --report /tmp/report.json

# Performance benchmark (baseline vs telemetry):
docker/telemetry/workload/benchmark.sh --xrpld .build/xrpld --duration 300
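At its core, the span check reduces to a set comparison between expected_spans.json and the span names a backend reports. A hypothetical sketch (function names are illustrative, not validate_telemetry.py's actual API); dynamic names like rpc.command.<name> need a prefix match rather than equality:

```python
def missing_spans(expected, found):
    """Span names listed in the expected set but absent from the backend."""
    return set(expected) - set(found)

def missing_dynamic(expected_prefixes, found):
    """For dynamic span names (e.g. 'rpc.command.'), require at least
    one reported span starting with each prefix."""
    return {
        prefix
        for prefix in expected_prefixes
        if not any(name.startswith(prefix) for name in found)
    }
```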

Validated Telemetry Inventory

| Category | Expected Count | Validation Method | Config File |
|---|---|---|---|
| Trace spans | 17 | Jaeger/Tempo API query | expected_spans.json |
| Span attributes | 22 | Per-span attribute assertion | expected_spans.json |
| StatsD metrics | 255+ | Prometheus query | expected_metrics.json |
| Phase 9 metrics | 50+ | Prometheus query | expected_metrics.json |
| SpanMetrics RED | 4 per span | Prometheus query | expected_metrics.json |
| Grafana dashboards | 10 | Dashboard API "no data" check | expected_metrics.json |
| Log-trace links | Present | Loki query + Tempo reverse check | — |
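
The span-presence check can be sketched as below; `validate_telemetry.py` is the authoritative implementation, and the endpoint, port, and span names here are assumptions based on Jaeger's standard HTTP query API:

```python
import json
import urllib.request

JAEGER_URL = "http://localhost:16686"  # assumed default Jaeger query port


def fetch_operations(service: str) -> set[str]:
    """Return the span/operation names Jaeger has recorded for a service."""
    url = f"{JAEGER_URL}/api/services/{service}/operations"
    with urllib.request.urlopen(url) as resp:
        return set(json.load(resp)["data"])


def missing_spans(expected: set[str], found: set[str]) -> set[str]:
    """Span names listed in expected_spans.json but absent from the backend."""
    return expected - found


# Example with canned data (a real run would call fetch_operations("rippled")):
expected = {"rpc.server_info", "tx.apply", "consensus.round"}
found = {"rpc.server_info", "tx.apply"}
print(sorted(missing_spans(expected, found)))  # ['consensus.round']
```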

Performance Overhead Targets

| Metric | Target | Measurement Method |
|---|---|---|
| CPU overhead | < 3% | ps avg CPU% baseline vs telemetry |
| Memory overhead | < 5 MB | ps peak RSS baseline vs telemetry |
| RPC p99 latency | < 2 ms impact | server_info round-trip timing |
| Throughput impact | < 5% | Ledger close rate comparison |
| Consensus impact | < 1% | Consensus round time p95 comparison |
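
The pass/fail arithmetic behind these targets is simple subtraction against fixed budgets. A minimal sketch covering three of the five targets (function names are hypothetical; `benchmark.sh` is the actual implementation):

```python
# Thresholds from the table above; a run fails if any delta exceeds its budget.
THRESHOLDS = {
    "cpu_pct": 3.0,     # < 3% average CPU overhead
    "memory_mb": 5.0,   # < 5 MB peak RSS overhead
    "rpc_p99_ms": 2.0,  # < 2 ms added server_info round-trip p99
}


def overhead(baseline: dict, telemetry: dict) -> dict:
    """Absolute deltas between the telemetry-ON and telemetry-OFF runs."""
    return {k: telemetry[k] - baseline[k] for k in baseline}


def within_budget(deltas: dict) -> bool:
    return all(deltas[k] < THRESHOLDS[k] for k in THRESHOLDS)


baseline = {"cpu_pct": 41.0, "memory_mb": 812.0, "rpc_p99_ms": 9.5}
telemetry = {"cpu_pct": 43.1, "memory_mb": 815.2, "rpc_p99_ms": 10.8}
deltas = overhead(baseline, telemetry)
print(deltas, within_budget(deltas))
```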

5d. Future: Third-Party Data Collection Pipelines (Phase 11)

Status: Planned, not yet implemented.

- Plan details: 06-implementation-phases.md §6.8.4 — motivation, architecture, consumer gap analysis
- Task breakdown: Phase11_taskList.md — per-task implementation details

Phase 11 builds a custom OTel Collector receiver (Go) that polls rippled's admin RPCs and exports xrpl_* metrics for external consumers. No rippled code changes.
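
The receiver's core loop is poll-and-map: call an admin RPC, then flatten the JSON response into gauge samples. A Python sketch of the mapping step (the real receiver is Go; the state ordering follows the table below, and the field paths are assumptions based on typical `server_info` output):

```python
# Assumed state ordering from the node-health table: 0=disconnected ... 5=proposing.
SERVER_STATES = ["disconnected", "connected", "syncing", "tracking", "full", "proposing"]


def map_server_info(info: dict) -> dict:
    """Flatten a server_info 'info' object into xrpl_* gauge samples."""
    return {
        "xrpl_server_state": SERVER_STATES.index(info["server_state"]),
        "xrpl_uptime_seconds": info["uptime"],
        "xrpl_io_latency_ms": info["io_latency_ms"],
        "xrpl_peers_count": info["peers"],
        "xrpl_validated_ledger_seq": info["validated_ledger"]["seq"],
        "xrpl_load_factor": info["load_factor"],
        # amendment_blocked is only present in the response when the node is blocked
        "xrpl_amendment_blocked": int(info.get("amendment_blocked", False)),
    }


sample = {
    "server_state": "full", "uptime": 86400, "io_latency_ms": 1,
    "peers": 21, "validated_ledger": {"seq": 92000000}, "load_factor": 1,
}
print(map_server_info(sample)["xrpl_server_state"])  # 4
```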

Exported Metrics (via Custom OTel Collector Receiver)

Node Health (from server_info)

| Prometheus Metric | Type | Description |
|---|---|---|
| xrpl_server_state | Gauge | Operating mode (0=disconnected ... 5=proposing) |
| xrpl_server_state_duration_seconds | Gauge | Seconds in current state |
| xrpl_uptime_seconds | Gauge | Consecutive seconds running |
| xrpl_io_latency_ms | Gauge | I/O subsystem latency |
| xrpl_amendment_blocked | Gauge | 1 if amendment-blocked, 0 otherwise |
| xrpl_peers_count | Gauge | Connected peers |
| xrpl_validated_ledger_seq | Gauge | Latest validated ledger sequence |
| xrpl_validated_ledger_age_seconds | Gauge | Seconds since last validated close |
| xrpl_last_close_proposers | Gauge | Proposers in last consensus round |
| xrpl_last_close_converge_time_seconds | Gauge | Last consensus round duration |
| xrpl_load_factor | Gauge | Transaction cost multiplier |
| xrpl_state_duration_seconds | Gauge | Per-state duration (state label) |
| xrpl_state_transitions_total | Gauge | Per-state transition count (state label) |

Peer Topology (from peers)

| Prometheus Metric | Type | Description |
|---|---|---|
| xrpl_peers_inbound_count | Gauge | Inbound peer connections |
| xrpl_peers_outbound_count | Gauge | Outbound peer connections |
| xrpl_peer_latency_p50_ms | Gauge | Median peer latency |
| xrpl_peer_latency_p95_ms | Gauge | p95 peer latency |
| xrpl_peer_version_count | Gauge | Peers per version (version label) |
| xrpl_peer_diverged_count | Gauge | Peers with diverged tracking status |

Validator & Amendment (from validators, feature)

| Prometheus Metric | Type | Description |
|---|---|---|
| xrpl_trusted_validators_count | Gauge | UNL validator count |
| xrpl_amendment_enabled_count | Gauge | Enabled amendments |
| xrpl_amendment_majority_count | Gauge | Amendments with majority |
| xrpl_amendment_unsupported_majority | Gauge | 1 if an unsupported amendment has majority |
| xrpl_validator_list_active | Gauge | 1 if validator list is active |

Fee Market (from fee)

| Prometheus Metric | Type | Description |
|---|---|---|
| xrpl_fee_open_ledger_fee_drops | Gauge | Minimum fee for open ledger inclusion |
| xrpl_fee_median_fee_drops | Gauge | Median fee level |
| xrpl_fee_queue_size | Gauge | Current transaction queue depth |
| xrpl_fee_current_ledger_size | Gauge | Transactions in current open ledger |

DEX & AMM (optional, from book_offers, amm_info)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| xrpl_amm_tvl_drops | Gauge | `pool="<id>"` | Total value locked |
| xrpl_amm_trading_fee | Gauge | `pool="<id>"` | Pool trading fee (bps) |
| xrpl_orderbook_bid_depth | Gauge | `pair="<base/quote>"` | Total bid volume |
| xrpl_orderbook_ask_depth | Gauge | `pair="<base/quote>"` | Total ask volume |
| xrpl_orderbook_spread | Gauge | `pair="<base/quote>"` | Best bid-ask spread |

Phase 9: OTel SDK-Exported Metrics (MetricsRegistry)

Phase 9 introduces the MetricsRegistry class (src/xrpld/telemetry/MetricsRegistry.h/.cpp), which registers metrics directly with the OpenTelemetry Metrics SDK. These are exported via OTLP/HTTP to the OTel Collector and scraped by Prometheus.
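
Observable gauges are pull-based: the SDK invokes a registered callback at each export interval rather than having instrumented code push values. The pattern can be sketched as below (purely illustrative Python; MetricsRegistry itself uses the C++ SDK, and all names here are made up):

```python
# Minimal illustration of the observable-gauge pattern: callbacks are
# registered once, then polled at collection time.
class MiniRegistry:
    def __init__(self):
        self._callbacks = {}  # instrument name -> zero-arg callable

    def register_observable_gauge(self, name, callback):
        self._callbacks[name] = callback

    def collect(self):
        """What the SDK does each export interval: invoke every callback."""
        return {name: cb() for name, cb in self._callbacks.items()}


# A gauge backed by live application state (here, a plain dict):
node_reads = {"total": 0}
registry = MiniRegistry()
registry.register_observable_gauge(
    "rippled_nodestore_state.node_reads_total",
    lambda: node_reads["total"],
)

node_reads["total"] += 7
print(registry.collect())  # values reflect state at collection time
```

Because collection reads live state, the exported value is always current at scrape time, with no per-event instrumentation cost on the hot path.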

NodeStore I/O (Observable Gauge — nodestore_state)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_nodestore_state{metric="node_reads_total"}` | Gauge | metric | Cumulative NodeStore read operations |
| `rippled_nodestore_state{metric="node_reads_hit"}` | Gauge | metric | Reads served from cache |
| `rippled_nodestore_state{metric="node_writes"}` | Gauge | metric | Cumulative write operations |
| `rippled_nodestore_state{metric="node_written_bytes"}` | Gauge | metric | Cumulative bytes written |
| `rippled_nodestore_state{metric="node_read_bytes"}` | Gauge | metric | Cumulative bytes read |
| `rippled_nodestore_state{metric="write_load"}` | Gauge | metric | Current write load score |
| `rippled_nodestore_state{metric="read_queue"}` | Gauge | metric | Items in read prefetch queue |

Cache Hit Rates & Sizes (Observable Gauge — cache_metrics)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_cache_metrics{metric="SLE_hit_rate"}` | Gauge | metric | SLE cache hit rate (0.0-1.0) |
| `rippled_cache_metrics{metric="ledger_hit_rate"}` | Gauge | metric | Ledger cache hit rate |
| `rippled_cache_metrics{metric="AL_hit_rate"}` | Gauge | metric | AcceptedLedger cache hit rate |
| `rippled_cache_metrics{metric="treenode_cache_size"}` | Gauge | metric | SHAMap TreeNode cache entries |
| `rippled_cache_metrics{metric="treenode_track_size"}` | Gauge | metric | Tracked tree nodes |
| `rippled_cache_metrics{metric="fullbelow_size"}` | Gauge | metric | FullBelow cache entries |

Transaction Queue (Observable Gauge — txq_metrics)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_txq_metrics{metric="txq_count"}` | Gauge | metric | Transactions currently in queue |
| `rippled_txq_metrics{metric="txq_max_size"}` | Gauge | metric | Maximum queue capacity |
| `rippled_txq_metrics{metric="txq_in_ledger"}` | Gauge | metric | Transactions in open ledger |
| `rippled_txq_metrics{metric="txq_per_ledger"}` | Gauge | metric | Expected transactions per ledger |
| `rippled_txq_metrics{metric="txq_reference_fee_level"}` | Gauge | metric | Reference fee level |
| `rippled_txq_metrics{metric="txq_min_processing_fee_level"}` | Gauge | metric | Minimum fee to get processed |
| `rippled_txq_metrics{metric="txq_med_fee_level"}` | Gauge | metric | Median fee level in queue |
| `rippled_txq_metrics{metric="txq_open_ledger_fee_level"}` | Gauge | metric | Open ledger fee escalation level |

Per-RPC Method Metrics (Synchronous Counters/Histogram)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_rpc_method_started_total | Counter | `method="<name>"` | RPC calls started |
| rippled_rpc_method_finished_total | Counter | `method="<name>"` | RPC calls completed successfully |
| rippled_rpc_method_errored_total | Counter | `method="<name>"` | RPC calls that errored |
| rippled_rpc_method_duration_us | Histogram | `method="<name>"` | Execution time distribution (us) |

Per-Job-Type Metrics (Synchronous Counters/Histogram)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| rippled_job_queued_total | Counter | `job_type="<name>"` | Jobs enqueued |
| rippled_job_started_total | Counter | `job_type="<name>"` | Jobs started |
| rippled_job_finished_total | Counter | `job_type="<name>"` | Jobs completed |
| rippled_job_queued_duration_us | Histogram | `job_type="<name>"` | Queue wait time distribution (us) |
| rippled_job_running_duration_us | Histogram | `job_type="<name>"` | Execution time distribution (us) |

Counted Object Instances (Observable Gauge — object_count)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_object_count{type="Transaction"}` | Gauge | `type="<name>"` | Live Transaction objects |
| `rippled_object_count{type="Ledger"}` | Gauge | `type="<name>"` | Live Ledger objects |
| `rippled_object_count{type="NodeObject"}` | Gauge | `type="<name>"` | Live NodeObject instances |
| `rippled_object_count{type="STTx"}` | Gauge | `type="<name>"` | Serialized transaction objects |
| `rippled_object_count{type="STLedgerEntry"}` | Gauge | `type="<name>"` | Serialized ledger entries |
| `rippled_object_count{type="InboundLedger"}` | Gauge | `type="<name>"` | Ledgers being fetched |
| `rippled_object_count{type="Pathfinder"}` | Gauge | `type="<name>"` | Active pathfinding operations |
| `rippled_object_count{type="PathRequest"}` | Gauge | `type="<name>"` | Active path requests |
| `rippled_object_count{type="HashRouterEntry"}` | Gauge | `type="<name>"` | Hash router entries |

Load Factor Breakdown (Observable Gauge — load_factor_metrics)

| Prometheus Metric | Type | Labels | Description |
|---|---|---|---|
| `rippled_load_factor_metrics{metric="load_factor"}` | Gauge | metric | Combined transaction cost multiplier |
| `rippled_load_factor_metrics{metric="load_factor_server"}` | Gauge | metric | Server + cluster + network contribution |
| `rippled_load_factor_metrics{metric="load_factor_local"}` | Gauge | metric | Local server load only |
| `rippled_load_factor_metrics{metric="load_factor_net"}` | Gauge | metric | Network-wide load estimate |
| `rippled_load_factor_metrics{metric="load_factor_cluster"}` | Gauge | metric | Cluster peer load |
| `rippled_load_factor_metrics{metric="load_factor_fee_escalation"}` | Gauge | metric | Open ledger fee escalation |
| `rippled_load_factor_metrics{metric="load_factor_fee_queue"}` | Gauge | metric | Queue entry fee level |

Prometheus Query Examples (Phase 9)

```promql
# NodeStore cache hit ratio
rippled_nodestore_state{metric="node_reads_hit"} / rippled_nodestore_state{metric="node_reads_total"}

# RPC error rate for server_info
rate(rippled_rpc_method_errored_total{method="server_info"}[5m])

# Job queue wait time p95
histogram_quantile(0.95, sum by (le) (rate(rippled_job_queued_duration_us_bucket[5m])))

# TxQ utilization percentage
rippled_txq_metrics{metric="txq_count"} / rippled_txq_metrics{metric="txq_max_size"}

# High load factor alert candidate
rippled_load_factor_metrics{metric="load_factor"} > 5
```
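
These expressions can also be run programmatically against Prometheus's HTTP API, which is how the Phase 10 validation suite checks metric presence. A standard-library sketch (the URL and port are the dev stack's assumed defaults):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed Prometheus port in the dev stack


def instant_query(expr: str) -> list:
    """Run a PromQL instant query and return the result vector."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    if body["status"] != "success":
        raise RuntimeError(body.get("error", "query failed"))
    return body["data"]["result"]


def first_value(result: list) -> float:
    """Extract the sample value from the first series, if any."""
    return float(result[0]["value"][1]) if result else float("nan")


# e.g.:
# first_value(instant_query('rippled_load_factor_metrics{metric="load_factor"}'))
```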

New Grafana Dashboards (Phase 9)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Fee Market & TxQ | rippled-fee-market | Prometheus | TxQ depth/capacity, fee levels, load factor breakdown |
| Job Queue Analysis | rippled-job-queue | Prometheus | Per-job rates, queue wait times, execution times |
| RPC Performance (OTel) | rippled-rpc-perf | Prometheus | Per-method call rates, error rates, latency distributions |

Updated Grafana Dashboards (Phase 9)

| Dashboard | UID | New Panels Added |
|---|---|---|
| Node Health (StatsD) | rippled-statsd-node-health | NodeStore I/O, cache hit rates, object instance counts |

New Grafana Dashboards (Phase 11)

| Dashboard | UID | Data Source | Key Panels |
|---|---|---|---|
| Validator Health | rippled-validator-health | Prometheus | Server state timeline, proposer count, converge time, amendment voting |
| Network Topology | rippled-network-topology | Prometheus | Peer count, version distribution, latency distribution, diverged peers |
| Fee Market (Ext) | rippled-fee-market-external | Prometheus | Fee levels, queue depth, load factor breakdown, escalation timeline |
| DEX & AMM Overview | rippled-dex-amm | Prometheus | AMM TVL, order book depth, spread trends, trading fee revenue |

Prometheus Alerting Rules (Phase 11)

| Alert Name | Severity | Condition | For |
|---|---|---|---|
| XRPLServerNotFull | Critical | xrpl_server_state < 4 | 15m |
| XRPLAmendmentBlocked | Critical | xrpl_amendment_blocked == 1 | 1m |
| XRPLNoPeers | Critical | xrpl_peers_count == 0 | 5m |
| XRPLLedgerStale | Critical | xrpl_validated_ledger_age_seconds > 120 | 2m |
| XRPLHighIOLatency | Critical | xrpl_io_latency_ms > 100 | 5m |
| XRPLUnsupportedAmendmentMajority | Critical | xrpl_amendment_unsupported_majority == 1 | 1m |
| XRPLLowPeerCount | Warning | xrpl_peers_count < 10 | 15m |
| XRPLHighLoadFactor | Warning | xrpl_load_factor > 10 | 10m |
| XRPLSlowConsensus | Warning | xrpl_last_close_converge_time_seconds > 6 | 5m |
| XRPLValidatorListExpiring | Warning | (xrpl_validator_list_expiration_seconds - time()) < 86400 | 1h |
| XRPLStateFlapping | Warning | rate(xrpl_state_transitions_total{state="full"}[1h]) > 2 | 30m |
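
As a concrete sketch, the XRPLLedgerStale row above expressed in Prometheus rule-file syntax (group name, labels, and annotation wording are illustrative, not the shipped rule file):

```yaml
groups:
  - name: xrpl-node-health
    rules:
      - alert: XRPLLedgerStale
        expr: xrpl_validated_ledger_age_seconds > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Validated ledger is stale on {{ $labels.instance }}"
          description: "No validated ledger close for over 120 seconds."
```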

6. Known Issues

| Issue | Impact | Status |
|---|---|---|
| warn and drop metrics use the non-standard StatsD `\|m` meter type | Metrics silently dropped by the OTel StatsD receiver | Phase 6 Task 6.1 — needs a `\|m` → `\|c` change in StatsDCollector.cpp |
| rippled_job_count may not emit in standalone mode | Missing from Prometheus in some test configs | Requires active job queue activity |
| rippled_rpc_requests depends on [insight] config | Zero series if StatsD is not configured | Requires [insight] server=statsd in xrpld.cfg |
| Peer tracing disabled by default | No peer.* spans unless trace_peer=1 | Intentional — high volume on mainnet |

7. Privacy and Data Collection

The telemetry system is designed with privacy in mind:

- No private keys are ever included in spans or metrics
- No account balances or financial data are traced
- Transaction hashes are included (public on-ledger data), but not transaction contents
- Peer IDs are internal identifiers, not IP addresses
- All telemetry is opt-in — disabled by default at build time (-Dtelemetry=OFF)
- Sampling reduces data volume — sampling_ratio=0.01 is recommended for production
- Data stays local — the default stack sends data to localhost only

8. Configuration Quick Reference

Full reference: 05-configuration-reference.md §5.1 for all [telemetry] options with defaults, the config parser implementation, and collector YAML configurations (dev and production).

Minimal Setup (development)

```ini
[telemetry]
enabled=1

[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
```

Production Setup

```ini
[telemetry]
enabled=1
endpoint=http://otel-collector:4318/v1/traces
sampling_ratio=0.01
trace_peer=0
batch_size=1024
max_queue_size=4096

[insight]
server=statsd
address=otel-collector:8125
prefix=rippled
```

Trace Category Toggle

| Config Key | Default | Controls |
|---|---|---|
| trace_rpc | 1 | rpc.* spans |
| trace_transactions | 1 | tx.* spans |
| trace_consensus | 1 | consensus.* spans |
| trace_ledger | 1 | ledger.* spans |
| trace_peer | 0 | peer.* spans (high volume) |